STAT 1301: Statistical Packages

R Lecture 11: Regression

Junshu Bao

Regression

Correlation tells us about strength and direction of the linear relationship between two quantitative variables.

In addition, we would like to have a numerical description of how both variables vary together.

A regression line is a straight line \(\hat y=b_0+b_1 x\) that describes how a response variable \(y\) changes as an explanatory variable \(x\) changes.

The least-squares regression line of \(y\) on \(x\) is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible, i.e. it is the line that minimizes \[\sum_{i=1}^n \left(y_i-\hat y_i \right)^2\] where \(y_i-\hat y_i=y_i-(b_0+b_1 x_i)\) is called the residual or prediction error for observation \(i\).

Least-Squares Regression Line

The least-squares regression line is unique.

The intercept and slope are functions of the \(x\) and \(y\) values:

\[\begin{eqnarray*} \hat \beta_1&=&\frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n(x_i-\overline x)^2}\\ \hat \beta_0&=&\overline y-\hat \beta_1\overline x \end{eqnarray*}\] Note that \[\hat \beta_1=\frac{(n-1)\,\text{Cov}(x,y)}{(n-1)\,\text{Var}(x)}=\frac{s_{xy}}{s_x^2}=\frac{r\,s_x s_y}{s_x^2}=r\cdot\frac{s_{y}}{s_{x}},\]

where \(s_{xy}\) is the sample covariance, \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\), respectively, and \(r\) is the sample correlation coefficient.
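
As a quick numerical check of these identities, here is a small sketch using two short made-up vectors (xx and yy are illustrative, not from any data set in this lecture):

xx <- c(1, 3, 4, 6, 8)
yy <- c(2, 3, 7, 8, 12)
cov(xx, yy)/var(xx)          # s_xy / s_x^2
cor(xx, yy)*sd(yy)/sd(xx)    # r * s_y / s_x, the same value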

Example: Does Fidgeting Keep You Slim?

Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA) explains why. Some people may spontaneously increase NEA when fed more.

Researchers deliberately overfed 16 healthy young adults for eight weeks. They measured fat gain (in kg) and change in NEA (in calories).

The data set is as follows:

fatgain <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/fatgain.txt", header=TRUE)
head(fatgain)
##   NEA Fat
## 1 -94 4.2
## 2 -57 3.0
## 3 -29 3.7
## 4 135 2.7
## 5 143 3.2
## 6 151 3.6

The scatterplot shows a moderately strong, negative, linear relationship between Fat and NEA.

The red straight line is the least-squares regression line.

Find the Linear Regression Line in R

We can use R as a calculator to find the estimates of intercept and slope manually according to the formulas:

attach(fatgain)
b1 <-cov(Fat,NEA)/var(NEA); b1
## [1] -0.003441487
b0 <-mean(Fat)-b1*mean(NEA);b0
## [1] 3.505123

Alternatively, we can use the function lm (linear model). The basic form is lm(formula = y ~ x, data = , ...).

fit<- lm(Fat~NEA)
fit
## 
## Call:
## lm(formula = Fat ~ NEA)
## 
## Coefficients:
## (Intercept)          NEA  
##    3.505123    -0.003441
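
The same model can also be fitted without attach() by supplying the data frame through the data argument; a small sketch (fit2 is just an illustrative name):

fit2 <- lm(Fat ~ NEA, data = fatgain)
coef(fit2)   # same coefficients as above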


In-Class Exercise: I Feel Your Pain

“Empathy” means being able to understand what others feel. To see how the brain expresses empathy, researchers recruited 16 couples, zapped each man’s hand with an electrode while the woman watched, and measured the activity in several parts of the woman’s brain that would respond to her own pain. The women also completed a psychological test that measures empathy.

In-Class Exercise: Fit a Simple Linear Regression Model

Regress brain activity (Brain) on empathy score (Emp) using the lm function.

empathy <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/empathy.txt")
attach(empathy)
empathy.fit <- lm(Brain~Emp)
empathy.fit
## 
## Call:
## lm(formula = Brain ~ Emp)
## 
## Coefficients:
## (Intercept)          Emp  
##   -0.057750     0.007613

Fitted Values and Residuals

The function fitted returns fitted values, \(\hat y\), the \(y\)-values that you would expect for the given \(x\)-values according to the best-fitting straight line.

fit<- lm(Fat~NEA)
fitted(fit)
##        1        2        3        4        5        6        7        8 
## 3.828623 3.701288 3.604926 3.040522 3.012990 2.985458 2.661959 2.283395 
##        9       10       11       12       13       14       15       16 
## 2.156060 1.877300 1.832560 1.663927 1.540034 1.509060 1.371401 1.130497
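
As a quick check, the fitted values are just \(b_0+b_1x\) evaluated at each observed NEA value, using the coefficients computed manually earlier:

head(b0 + b1*NEA)
head(fitted(fit))   # the two agree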

Superimpose Linear Regression Line

To put the fitted line (regression line) on the scatterplot, we can use the lines or abline function:

par(pty="s",mfrow=c(1,2))
plot(fatgain,pch=16)
lines(NEA, fitted(fit),col=2,lwd=2)
plot(fatgain,pch=16)
abline(fit,col=5,lwd=2)

Notice that the arguments of the two functions are different: lines takes the x-values and the corresponding fitted y-values, while abline can take the fitted model object (or an intercept and a slope) directly.

In-Class Exercise: Superimpose Linear Regression Line

Now, superimpose the linear regression line you found for the empathy data to the scatterplot.

plot(Emp,Brain,pch=16,col=4,cex=1.2,cex.lab=1.3,xlab="Empathy score", ylab="Brain activity")
abline(empathy.fit,col=2,lwd=2)

Display Residuals in a Scatterplot

To create a plot where residuals are displayed by connecting observations to corresponding points on the fitted line, you can do the following.

The segments function draws line segments; its first four arguments are the endpoint coordinates, in the order \(x_0, y_0, x_1, y_1\).

plot(fatgain,pch=16); abline(fit,col=2,lwd=2)
segments(NEA,fitted(fit),NEA,Fat,col=4,lwd=2)

Predict \(y\) given a New Value of \(x\)

Suppose a person’s change in NEA is 650 calories. What is her expected fat gain after eight weeks?

predict(fit,data.frame(NEA=650))
##        1 
## 1.268156
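
Equivalently, we can plug \(x^*=650\) into the fitted equation using the coefficients computed earlier:

b0 + b1*650   # same value as predict(), about 1.268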

In-Class Exercise: Prediction and Residual

  1. Subject 1 has empathy score 38. What is her predicted brain activity based on the linear regression model?
y_hat_1 <- predict(empathy.fit,data.frame(Emp=38))
y_hat_1
##         1 
## 0.2315374
  2. The subject’s actual brain activity level was \(-0.120\). What is the residual (prediction error) for her?
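
One way to compute the residual, continuing from y_hat_1 above (a sketch of the intended calculation; resid_1 is an illustrative name):

resid_1 <- -0.120 - y_hat_1   # residual = observed - predicted
resid_1                       # about -0.352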

Facts about Least-Squares Regression

  1. The distinction between explanatory and response variables is essential in regression.

    Least-squares regression makes the distances of the data points from the line small only in the \(y\)-direction. If we reverse the roles of the two variables, we get a different least-squares regression line.

  2. There is a close connection between correlation and the slope of the least-squares line.

    The slope is \(\hat \beta_1 = r\cdot s_y/s_x.\) Notice that the slope and the correlation always have the same sign.

  3. The least-squares regression line always passes through the point \((\bar x, \bar y)\).

    This is because \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\).

  4. The correlation \(r\) describes the strength of a straight-line relationship.

    In the regression setting, this description takes a specific form: The square of the correlation, \(r^2\), is the fraction of the variation in the values of \(y\) that is explained by the least-squares regression of \(y\) on \(x\).
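
A quick illustration of facts 1 and 4 with the fat-gain data (a sketch; rev.fit is just an illustrative name):

rev.fit <- lm(NEA ~ Fat)     # reverse the roles of the two variables
coef(rev.fit)[2]             # slope of NEA on Fat...
1/coef(fit)[2]               # ...is not simply 1 over the slope of Fat on NEA
cor(Fat, NEA)^2              # r^2...
summary(fit)$r.squared       # ...equals the R^2 of the regression of Fat on NEA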

Introduction to Inference for Regression

When a scatterplot shows a linear relationship between two quantitative variables, \(x\) and \(y\),  we can use the least-squares regression line fitted to the data to predict \(y\) for a given value of \(x\).

When the data are a simple random sample from a larger population, we need statistical inference to answer questions about the population, such as: Is there really a linear relationship between \(x\) and \(y\) in the population, or could the pattern seen in the sample be due to chance? What are the slope and intercept of the population regression line, and how precisely can we estimate them?

Conditions for Regression Inference

The slope \(\hat \beta_1\) and the intercept \(\hat \beta_0\) of the least-squares line are statistics because we calculated them from the sample data.  To do inference, think of \(\hat \beta_0\) and \(\hat \beta_1\) as estimates of unknown parameters \(\beta_0\) and \(\beta_1\) that describe the whole population.

We have \(n\) observations on an explanatory variable \(x\) and a response variable \(y\). For inference we assume that, for each fixed value of \(x\), the responses are independent and normally distributed, that their mean \(\mu_y=\beta_0+\beta_1 x\) is a straight-line function of \(x\), and that their standard deviation \(\sigma\) is the same for every value of \(x\).

Conditions for Regression Inference (cont.)

The following linear regression model summarizes the conditions:

\[y_i = \beta_0 + \beta_1 x_i +\epsilon_i\]

where the \(\epsilon_i\) are independent \(N(0,\sigma^2)\) random errors.

Three Parameters: \(\beta_0\), \(\beta_1\), and \(\sigma^2\)

We know we can use the least-squares method to estimate \(\beta_0\) and \(\beta_1\).

The residuals \(e_i=y_i-\hat y_i\) are used to obtain an estimator of \(\sigma^2\).

The sum of squares of the residuals, often called the error sum of squares, is \[SSE=\sum_{i=1}^ne^2_i=\sum_{i=1}^n (y_i-\hat y_i)^2.\]

Three Parameters: \(\beta_0\), \(\beta_1\), and \(\sigma^2\) (cont.)

It can be shown that the expected value of the error sum of squares is \(E(SSE)=(n-2)\sigma^2.\) Consequently, \[E\left(\frac{SSE}{n-2}\right)=\sigma^2\]

Therefore, an unbiased estimator of \(\sigma^2\) is \[s^2=\frac{SSE}{n-2}=\frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n-2}.\] The quantity \(s^2\) is also called the mean squared error (MSE).
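
As a quick sanity check of this unbiasedness claim, here is a small simulation sketch (not part of the lecture's examples; the model, sample size, and number of replications are arbitrary choices):

set.seed(1)
sigma2 <- 9                       # true error variance
sim.x <- 1:20
sim.mse <- replicate(2000, {
  y <- 2 + 0.5*sim.x + rnorm(length(sim.x), sd = sqrt(sigma2))
  sum(resid(lm(y ~ sim.x))^2)/(length(sim.x) - 2)   # MSE = SSE/(n-2)
})
mean(sim.mse)                     # should be close to sigma2 = 9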

It can be shown that \(\sum_{i=1}^ne_i=\sum_{i=1}^n(Y_i-\hat Y_i)=0.\)
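
We can verify this numerically with the fat-gain fit:

sum(resid(fit))   # essentially zero, apart from floating-point rounding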

Example: Crying and IQ

Infants who cry easily may be more easily stimulated than others. This may be a sign of high IQ.

##    Crycount  IQ
## 1        10  87
## 2        20  90
## 3        17  94
## 4        12  94
## 5        12  97
## 6        16 100
## 7        19 103
## 8        12 103
## 9         9 103
## 10       23 103

Example: Crying and IQ (cont.)

Let us find the least-squares regression line and add it to the scatterplot.

attach(crying)
crying_lm <- lm(IQ~Crycount); crying_lm
## 
## Call:
## lm(formula = IQ ~ Crycount)
## 
## Coefficients:
## (Intercept)     Crycount  
##      91.268        1.493
par(pty="s")
plot(crying,col=2,pch=16)
abline(crying_lm,col=4,lwd=2)

Thus, the regression line is \[\hat y = 91.268+1.493 x\]

Example: Crying and IQ (Estimating \(\sigma\))

Let us calculate the MSE in the crying and IQ example.

The resid() function can be used to extract the residuals from a fitted regression model.

SSE <- sum(resid(crying_lm)^2)
n <- dim(crying)[1]
MSE <- SSE/(n-2); MSE
## [1] 306.2052
s <- sqrt(MSE); s
## [1] 17.49872

Example: Crying and IQ (Estimating \(\sigma\)) (cont.)

Actually, \(s=\sqrt{MSE}\) is in the output of the summary function. It is called “Residual standard error”.

summary(crying_lm)
## 
## Call:
## lm(formula = IQ ~ Crycount)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.126 -11.426  -2.126  10.860  51.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.268      8.934  10.216  3.5e-12 ***
## Crycount       1.493      0.487   3.065  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared:  0.207,  Adjusted R-squared:  0.185 
## F-statistic: 9.397 on 1 and 36 DF,  p-value: 0.004105

You may also extract it directly:

summary(crying_lm)$sigma
## [1] 17.49872

In-Class Exercise: Estimating \(\sigma^2\) with MSE

Earlier in the lecture, we fitted a regression model for the empathy data. Now, calculate \(\sqrt{MSE}\) for that model: first do it manually, then extract \(\sqrt{MSE}\) from the summary output.

attach(empathy)
empathy.fit<-lm(Brain~Emp)
n2<-length(Brain)
MSE2<-sum(resid(empathy.fit)^2)/(n2-2)
s2 <- sqrt(MSE2); s2
## [1] 0.2442106
summary(empathy.fit)$sigma
## [1] 0.2442106

Distributions of \(\hat \beta_0\) and \(\hat \beta_1\)

Writing the responses as random variables \(Y_i\) (with the \(x_i\) treated as fixed), the least-squares estimator of the slope is

\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\equiv\frac{S_{xy}}{S_{xx}}\]

So \(\hat \beta_1\) is a linear combination of the \(Y_i\)'s. Since the \(Y_i\) are normally distributed under the model, \(\hat \beta_1\) also follows a normal distribution.

Since \(\hat \beta_0=\overline Y-\hat \beta_1\overline x\) is likewise a linear combination of the \(Y_i\)'s, \(\hat \beta_0\) is also normal.

It can be shown that \[ \begin{align*} \mu_{\hat\beta_0}&=\beta_0 \hspace{10pt}\mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_0}=\left(\frac1n+\frac{\overline{x}^2}{S_{xx}}\right)\sigma^2\\ \mu_{\hat\beta_1}&=\beta_1\hspace{10pt} \mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_1}=\frac{\sigma^2}{S_{xx}} \end{align*} \]
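
As a quick numerical check, we can plug the MSE \(s^2\) from the Crying and IQ example in place of the unknown \(\sigma^2\) and compare the resulting standard errors with the values 8.934 and 0.487 reported by summary(crying_lm):

Sxx <- (n - 1)*var(Crycount)                    # S_xx for the crying data
sqrt(s^2*(1/n + mean(Crycount)^2/Sxx))          # estimated sd of the intercept
sqrt(s^2/Sxx)                                   # estimated sd of the slope
summary(crying_lm)$coefficients[, "Std. Error"]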

Standardized \(\hat\beta_1\)

Since \(\sigma^2\) is unknown, we have to use MSE (\(s^2\)) to replace it. The standard error of \(\hat\beta_1\) is \[se(\hat\beta_1)=\sqrt{\frac{s^2}{S_{xx}}}\]

So, we have the following standardized version of \(\hat\beta_1\):

\[t=\frac{\hat \beta_1 -\beta_1}{se({\hat\beta_1})}\]

and it follows a \(t\) distribution with \(n-2\) degrees of freedom.

Testing the hypothesis of no linear relationship

In the Crying and IQ example, data analysis supports the conjecture that children with higher crying counts tend to have higher IQs (since the slope is 1.49).

But is the positive association statistically significant? To answer this question, test hypotheses about the slope \(\beta_1\) of the population regression line: \[H_0: \beta_1=0\hspace{5pt}\text{ versus} \hspace{5pt}H_a: \beta_1 \ne 0\] Under the null hypothesis, the test statistic is \[t_0=\frac{\hat \beta_1 }{se({\hat\beta_1})}\] Since this is a two-sided test, the p-value is \[2P(t_{n-2}<-|t_0|)\]

Example: Crying and IQ (H.T.)

Let us first do the test manually.  Let \(\alpha=0.05\).

Sxx <- (n-1)*var(Crycount)
se_b1 <- s/sqrt(Sxx); se_b1
## [1] 0.4870011
b1 <-cov(IQ,Crycount)/var(Crycount)
t0 <- b1/se_b1; t0
## [1] 3.065489
p_value <- 2*pt(-abs(t0),n-2); p_value
## [1] 0.0041053

Since the p-value (0.0041) is less than 0.05, we reject the null hypothesis and conclude that there is a significant linear association between crying count and IQ; because the estimated slope is positive, the association is positive.

Example: Crying and IQ (R output)

summary(crying_lm)
## 
## Call:
## lm(formula = IQ ~ Crycount)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.126 -11.426  -2.126  10.860  51.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.268      8.934  10.216  3.5e-12 ***
## Crycount       1.493      0.487   3.065  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared:  0.207,  Adjusted R-squared:  0.185 
## F-statistic: 9.397 on 1 and 36 DF,  p-value: 0.004105

In-Class Exercise: H.T. of no linear relationship

Now, conduct the test of no linear relationship for the empathy data.

summary(empathy.fit)
## 
## Call:
## lm(formula = Brain ~ Emp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35257 -0.14980  0.02289  0.14572  0.44495 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.057750   0.188326  -0.307   0.7636  
## Emp          0.007613   0.003385   2.249   0.0412 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.2129 
## F-statistic: 5.057 on 1 and 14 DF,  p-value: 0.04115

Extract Coefficient Estimates, Standard Errors, etc.

b1.2<-summary(empathy.fit)$coefficients[2,1];b1.2
## [1] 0.007612827
se_b1.2<-summary(empathy.fit)$coefficients[2,2];se_b1.2
## [1] 0.003385372
t0.2<- summary(empathy.fit)$coefficients[2,3];t0.2
## [1] 2.248742
p_value2<-summary(empathy.fit)$coefficients[2,4];p_value2
## [1] 0.0411514

One-Sided Test about the Slope

If we want to test if the slope is positive, we need to do a one-sided test.

\(H_0: \beta_1=0\) vs. \(H_a: \beta_1>0\).

The test statistic stays the same, but the p-value is \(P(T_{n-2}>t_0):\)

1-pt(t0.2,n2-2)
## [1] 0.0205757

Note that this p-value is half of the two-sided p-value, because the observed \(t\) statistic is positive (it falls on the side specified by \(H_a\)).

Confidence interval for the regression slope

The slope \(\beta_1\) of the population regression line is the most important parameter in a regression problem.

We already have a point estimator of \(\beta_1\) using the least-squares method

\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\]

A confidence interval is more useful for \(\beta_1\) because it shows how accurate the estimate \(\hat\beta_1\) is likely to be.

A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) of the population regression line is \[\hat\beta_1 \pm t^* \cdot se(\hat\beta_1)\] where \(t^*\) is the critical value for the \(t_{n-2}\) distribution with area \(1-\alpha\) between \(-t^*\) and \(t^*\).

Example: Crying and IQ (C.I. for \(\beta_1\))

You may extract \(\hat\beta_1\) as well as the standard error of it from the output and calculate the C.I. “manually”:

t <- qt(0.975,n-2)
b1<-summary(crying_lm)$coefficients[2,1]
se_b1 <- summary(crying_lm)$coefficients[2,2]
LL<-b1-t*se_b1; UL<-b1+t*se_b1; c(LL,UL)
## [1] 0.5052127 2.4805805

Alternatively, you may use the confint function:

confint(crying_lm,'Crycount',level=0.95)
##              2.5 %   97.5 %
## Crycount 0.5052127 2.480581

Confidence Interval on the Mean Response

The estimated mean response at a specific value of \(x,\) say, \(x^*\) is \[\hat\mu_{y|x^*}=\hat\beta_0+\hat\beta_1x^*,\] where \(\hat\beta_0\) and \(\hat\beta_1\) are the least squares estimators.

\(\hat\mu_{y|x^*}\) is an unbiased estimator of \(\mu_{y|x^*}\) and it is normally distributed (since both \(\hat\beta_0\) and \(\hat\beta_1\) are normal).

It can be shown that \[\text{Var}(\hat\mu_{y|x^*})=\sigma^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]\]

A confidence interval for the mean response \(\mu_{y|x^*}\) may be constructed based on its sampling distribution.

Confidence Interval of \(\mu_{y|x^*}\)

The population variance of \(Y\), \(\sigma^2\), is unknown and has to be estimated by \(s^2\),  the mean squared error (MSE).  So the standard error of \(\hat \mu_{Y|x^*}\) is \[\text{se}(\hat\mu_{y|x^*})=\sqrt{s^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}\] where \(s^2=\sum_{i=1}^{n}(y_i-\hat y_i)^2/(n-2)\).

Thus, a \(100(1-\alpha)\%\) confidence interval for the mean response \(\mu_{y|x^*}\) is \[\hat y \pm t^* \cdot se(\hat\mu_{y|x^*})\]

Example: Crying and IQ (Confidence Interval for \(\mu_{y|x^*}\))

A \(100(1-\alpha)\%\) confidence interval for the mean response \(\mu_{y|x^*}\) is \[\hat y \pm t^* \cdot se(\hat\mu_{y|x^*}) \hspace{10pt} \text{where}\hspace{10pt} \text{se}(\hat\mu_{y|x^*})=\sqrt{s^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}\]

Let us calculate a 95% confidence interval for \(\mu_{y|x}\) when \(x=19\).

b0<-summary(crying_lm)$coefficients[1,1]
b1<-summary(crying_lm)$coefficients[2,1]
x.star <- 19; y.star <- b0+b1*x.star
se.mu.y <- s*sqrt(1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.star +c(-1,1)*t.star*se.mu.y
## [1] 113.6619 125.6047

We may also use the R function predict.

new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95,
        interval="confidence")
##        fit      lwr      upr
## 1 119.6333 113.6619 125.6047

Prediction Interval for a single observation \(y\)

An important application of a regression model is predicting new or future observations \(Y\) corresponding to a specified value of regressor \(x\).

If \(x^*\) is the value of the regressor variable of interest, then the point estimator of the new or future value of the response \(Y^*\) is \[\hat Y=\hat\beta_0+\hat\beta_1x^*\] Now consider obtaining an interval estimate for this future observation \(Y^*.\) It can be shown that \[Var(Y^*-\hat Y)=\sigma^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{S_{xx}}\right]\]

Prediction Interval for a single observation \(y\) (cont.)

If we use the MSE, \(s^2\), to estimate \(\sigma^2\), we can show that \[\frac{Y^*-\hat Y}{\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}}\sim t_{n-2}.\]

From this we can develop the following prediction interval definition.

A \(100(1-\alpha)\%\) prediction interval on a future observation \(Y^*\) at the value of \(x=x^*\) is given by \[\begin{eqnarray*} \hat Y \pm t^*\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}. \end{eqnarray*}\]

Note that the prediction interval is wider than the confidence interval for the mean response \(\mu_{y|x^*}\), because it must also account for the variability of a single new observation about its mean (the extra "1" inside the brackets).

Example: Crying and IQ (Prediction Interval for \(Y^*\))

A prediction interval on a future observation \(Y^*\) at the value of \(x=x^*\) is given by \[\begin{eqnarray*} \hat Y \pm t^*\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}. \end{eqnarray*}\]

Let us calculate a 95% prediction interval for \(Y^*\) when \(x=19\).

x.star <- 19; y.hat <- b0+b1*x.star
se.y.hat <- s*sqrt(1+1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.hat + c(-1,1)*t.star*se.y.hat
## [1]  83.64541 155.62126

We may also use the R function predict.

new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95, interval="prediction")
##        fit      lwr      upr
## 1 119.6333 83.64541 155.6213
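
Comparing the two intervals at \(x^*=19\) makes the difference concrete: the confidence interval is roughly 12 IQ points wide, while the prediction interval is roughly 72 points wide. A short sketch (ci19 and pi19 are just illustrative names):

ci19 <- predict(crying_lm, new.x, interval = "confidence")
pi19 <- predict(crying_lm, new.x, interval = "prediction")
ci19[, "upr"] - ci19[, "lwr"]    # width of the confidence interval
pi19[, "upr"] - pi19[, "lwr"]    # width of the much wider prediction interval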

Confidence Band and Prediction Band

We may visualize the difference between the confidence interval of \(\mu_{y|x^*}\) and the prediction interval of \(Y^*\) by plotting confidence and prediction bands over the observed range of \(x\).

all <- seq(min(Crycount), max(Crycount))
new.data <- data.frame(Crycount = all)
CI.all <- predict(crying_lm, new.data, interval = "confidence")
PI.all <- predict(crying_lm, new.data, interval = "prediction")
plot(crying, pch = 16, cex = 1)
lines(all, CI.all[, 2], lwd = 2, lty = 3, col = 2)
lines(all, CI.all[, 3], lwd = 2, lty = 3, col = 2)
lines(all, PI.all[, 2], lwd = 2, lty = 2, col = 4)
lines(all, PI.all[, 3], lwd = 2, lty = 2, col = 4)
legend("topleft", legend = c("Confidence Interval", "Prediction Interval"),
       lty = c(3, 2), col = c(2, 4), cex = 0.8)

Adequacy of the Regression Model

Fitting a regression model, and basing inference on it, requires several assumptions: the mean of \(y\) is a linear function of \(x\), and the errors are independent, normally distributed, and have constant variance \(\sigma^2\). Residual analysis helps us check these assumptions.

Residual Analysis

The residuals from a regression model are \(e_i=y_i-\hat y_i, i=1,2,\dots,n,\)  where \(y_i\) is an actual observation and \(\hat y_i\) is the corresponding fitted value from the regression model.

Residual Plots

A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.

Create a Residual Plot

Let us create a residual plot for the Crying and IQ example.

crying_rs <- resid(crying_lm)
plot(Crycount,crying_rs,pch=16,col=4, ylab="Residual")
abline(h=0,lwd=2,col=2)

The residual plot looks like an unstructured horizontal band: the points are roughly symmetric about 0, and no curved pattern is visible, so the straight-line model appears adequate.

Histogram and QQ plot of residuals

par(mfrow = c(1, 2), pty = "s")
hist(crying_rs, freq = FALSE)
lines(seq(-40, 60, 0.01), dnorm(seq(-40, 60, 0.01), mean(crying_rs), sd(crying_rs)),
      col = 2, lwd = 2)
qqnorm(crying_rs); qqline(crying_rs, col = 2)

In-Class Exercise: Plots for the Empathy Example

Recall that we fitted a linear regression model and called it empathy.fit. The explanatory variable is the empathy level, Emp. Now, pull out the residuals and create a residual plot, a histogram and QQ plot for the residuals.

par(mfrow = c(1, 3), pty = "s")
empathy_rs <- resid(empathy.fit)
plot(Emp, empathy_rs, pch = 16, col = 4, ylab = "Residual")
abline(h = 0, lwd = 2, col = 2)
hist(empathy_rs, freq = FALSE)
lines(seq(-0.5, 0.7, 0.01), dnorm(seq(-0.5, 0.7, 0.01), mean(empathy_rs), sd(empathy_rs)),
      col = 4, lwd = 2)
qqnorm(empathy_rs); qqline(empathy_rs, col = 4)

Normality Test about Residuals

We can also use the Shapiro-Wilk test to check the normality assumption:

shapiro.test(crying_rs)
## 
##  Shapiro-Wilk normality test
## 
## data:  crying_rs
## W = 0.96496, p-value = 0.2741

Since the p-value (0.2741) is larger than 0.05, we do not reject the hypothesis that the residuals are normally distributed.

ANOVA table in regression analysis

We have seen the use of analysis of variance tables in grouped designs.  However, their use is not restricted to these designs but applies to the whole class of linear models.

The between-group and within-group variation in a one-way analysis of variance generalize to model variation and residual variation:

\[ \begin{align*} SSR=&\sum_{i=1}^n (\hat y_i-\bar y)^2\\ SSE=&\sum_{i=1}^n (y_i-\hat y_i)^2 \end{align*} \]

The role of the group means \(\bar y_i\) in the one-way analysis of variance is taken over by the fitted values \(\hat y_i\) in the more general linear model.
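
As a check, we can compute these two sums of squares directly from the fitted crying-and-IQ model; the values should match the Sum Sq column of the ANOVA table shown in the example below (SSR.check and SSE.check are just illustrative names):

SSR.check <- sum((fitted(crying_lm) - mean(IQ))^2)
SSE.check <- sum(resid(crying_lm)^2)
c(SSR = SSR.check, SSE = SSE.check)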

Testing utility of model

An F test for the significance of the model is available, in direct analogy with the F test in one-way ANOVA.

In multiple linear regression, the null hypothesis is “all the slopes are zero”, i.e. \(\beta_1=\beta_2=\cdots=\beta_k=0\).

In simple linear regression, this test is equivalent to testing that the regression coefficient \(\beta_1\) is zero.

Example: Crying and IQ

anova(crying_lm)
## Analysis of Variance Table
## 
## Response: IQ
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Crycount   1  2877.5 2877.48  9.3972 0.004105 **
## Residuals 36 11023.4  306.21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Crying and IQ (cont.)

Hypotheses:  \(H_0: \beta_1=0\) versus \(H_a: \beta_1\ne 0\)

Test statistic:
\[ \begin{align*} F=&\frac{SSR/DFR}{SSE/DFE}=\frac{MSR}{MSE}\\ =&\frac{2877.5/1}{11023.4/36}=\frac{2877.48}{306.21}=9.3972 \end{align*} \]

The p-value is \(P(F_{DFR,DFE}>F)\).

p_val <- 1 - pf(9.3972,1,36); p_val
## [1] 0.004105345

Conclusion: Since the p-value (0.004) is less than 0.05, we reject \(H_0\). The model is significant (useful) at the 0.05 level.

Test of utility versus t test of \(\beta_1\)

Since we know that the test of the utility of the model is equivalent to testing that \(\beta_1=0\), we can pull out the t test result and compare the two tests.

summary(crying_lm)
## 
## Call:
## lm(formula = IQ ~ Crycount)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.126 -11.426  -2.126  10.860  51.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.268      8.934  10.216  3.5e-12 ***
## Crycount       1.493      0.487   3.065  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared:  0.207,  Adjusted R-squared:  0.185 
## F-statistic: 9.397 on 1 and 36 DF,  p-value: 0.004105

Note that the p-value for the t statistic for \(\beta_1\) is the same as the p-value of the F statistic in the utility test; in simple linear regression, \(F=t^2\) (here \(3.065^2\approx 9.397\)).
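
A quick numerical check of this relationship:

summary(crying_lm)$coefficients["Crycount", "t value"]^2   # t0 squared, about 9.397
anova(crying_lm)["Crycount", "F value"]                    # the F statistic, the same value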

In-Class Exercise: ANOVA table and utility test

Now, conduct a utility F test for the Empathy data set. The model object is empathy.fit.

What is the F statistic?  How about the p-value?  Do you think the model is “useful” in predicting the response?

anova(empathy.fit)
## Analysis of Variance Table
## 
## Response: Brain
##           Df  Sum Sq  Mean Sq F value  Pr(>F)  
## Emp        1 0.30158 0.301584  5.0568 0.04115 *
## Residuals 14 0.83494 0.059639                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Any alternative way to test for utility?

summary(empathy.fit)
## 
## Call:
## lm(formula = Brain ~ Emp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35257 -0.14980  0.02289  0.14572  0.44495 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.057750   0.188326  -0.307   0.7636  
## Emp          0.007613   0.003385   2.249   0.0412 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.2129 
## F-statistic: 5.057 on 1 and 14 DF,  p-value: 0.04115

Coefficient of Determination (\(R^2\))

The analysis of variance (ANOVA) identity is as follows: \[\underbrace{\sum_{i=1}^n(Y_i-\overline Y)^2}_{SST}=\underbrace{\sum_{i=1}^n(\hat Y_i-\overline Y)^2}_{SSR}+\underbrace{\sum_{i=1}^n(Y_i-\hat Y_i)^2}_{SSE}\]

The coefficient of determination, denoted by \(R^2\), is often used to judge the adequacy of a regression model: \[R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\]

Example: Crying and IQ

anova(crying_lm)
## Analysis of Variance Table
## 
## Response: IQ
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Crycount   1  2877.5 2877.48  9.3972 0.004105 **
## Residuals 36 11023.4  306.21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we can extract information and calculate \(R^2\) manually.

SSR=2877.5; SSE=11023.4
SST=SSR+SSE
R_sq=SSR/SST; R_sq
## [1] 0.207001
summary(crying_lm)
## 
## Call:
## lm(formula = IQ ~ Crycount)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.126 -11.426  -2.126  10.860  51.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.268      8.934  10.216  3.5e-12 ***
## Crycount       1.493      0.487   3.065  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared:  0.207,  Adjusted R-squared:  0.185 
## F-statistic: 9.397 on 1 and 36 DF,  p-value: 0.004105

Example: Crying and IQ (cont.)

Let us also verify the relationship between the coefficient of determination (\(R^2\)) and the correlation \(r\): in simple linear regression, \(R^2=r^2\).

r=cor(Crycount,IQ);r
## [1] 0.4549725
r^2; R_sq
## [1] 0.207
## [1] 0.207001

In-Class Exercise: \(R^2\)

What is the coefficient of determination for the linear model for Empathy? Verify that \(R^2=r^2.\)

r<-cor(Emp,Brain);r;r^2
## [1] 0.5151268
## [1] 0.2653556
summary(empathy.fit)$r.squared
## [1] 0.2653556
detach(empathy); detach(crying); detach(fatgain)   # detach() takes one data frame at a time