Junshu Bao
Correlation tells us about strength and direction of the linear relationship between two quantitative variables.
In addition, we would like to have a numerical description of how both variables vary together.
A regression line is a straight line \(\hat y=b_0+b_1 x\) that describes how a response variable \(y\) changes as an explanatory variable \(x\) changes.
The least-squares regression line of \(y\) on \(x\) is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible, i.e., it is the line that minimizes \[\sum_{i=1}^n \left(y_i-\hat y_i \right)^2\] where \(y_i-\hat y_i=y_i-(b_0+b_1 x_i)\) is called the residual or prediction error for observation \(i\).
The least-squares regression line is unique.
The intercept and slope are functions of the \(x\) and \(y\) values:
\[\begin{eqnarray*} \hat \beta_1&=&\frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n(x_i-\overline x)^2}\\ \hat \beta_0&=&\overline y-\hat \beta_1\overline x \end{eqnarray*}\] Note that \[\hat \beta_1=\frac{(n-1)\,\text{Cov}(x,y)}{(n-1)\,\text{Var}(x)}=\frac{s_{xy}}{s_x^2}=\frac{r\,s_x s_y}{s_x^2}= r\cdot\frac{s_{y}}{s_{x}},\] where \(s_{xy}\) is the sample covariance, \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\), respectively, and \(r\) is the sample correlation coefficient.
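As a quick numerical illustration (with made-up vectors, not the lecture data), we can minimize the sum of squared residuals directly with optim() and check that the result agrees with the closed-form formulas above:
set.seed(1)                                        # made-up data, for illustration only
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 2)
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)  # sum of squared residuals
optim(c(0, 0), sse)$par                            # numerical minimizer: (b0, b1)
c(mean(y) - cor(x, y) * sd(y) / sd(x) * mean(x),   # closed-form b0 = ybar - b1*xbar
  cor(x, y) * sd(y) / sd(x))                       # closed-form b1 = r * sy / sx
The two approaches agree up to the tolerance of the numerical optimizer.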
Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA) explains why. Some people may spontaneously increase NEA when fed more.
Researchers deliberately overfed 16 healthy young adults for eight weeks. They measured fat gain (in kg) and change in NEA (in calories).
The data set is as follows:
fatgain <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/fatgain.txt", header=TRUE)
head(fatgain)
## NEA Fat
## 1 -94 4.2
## 2 -57 3.0
## 3 -29 3.7
## 4 135 2.7
## 5 143 3.2
## 6 151 3.6
The scatterplot shows that the relationship between Fat and NEA is a moderately strong, negative, linear relationship. The red straight line is the least-squares regression line.
We can use R as a calculator to find the estimates of the intercept and slope manually according to the formulas:
attach(fatgain)
b1 <-cov(Fat,NEA)/var(NEA); b1
## [1] -0.003441487
b0 <-mean(Fat)-b1*mean(NEA);b0
## [1] 3.505123
Alternatively, we can use the function lm (linear model). The basic form is lm(formula = y ~ x, data = ...).
fit<- lm(Fat~NEA)
fit
##
## Call:
## lm(formula = Fat ~ NEA)
##
## Coefficients:
## (Intercept) NEA
## 3.505123 -0.003441
The argument to lm is a model formula, in which the tilde symbol (~) should be read as “described by”. The output from lm is very brief: the estimated intercept (\(\hat\beta_0\)) and the estimated slope (\(\hat\beta_1\)). The lm function handles much more complicated models than simple linear regression, for example multiple regression models (e.g. \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2+\epsilon\)); a small sketch follows below.
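For instance, a two-predictor model is fit with the same function. The data frame and variable names below (dat, x1, x2, y) are hypothetical, just to illustrate the form of the call:
set.seed(4)                                  # hypothetical data, for illustration only
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
lm(y ~ x1 + x2, data = dat)                  # y described by x1 and x2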
“Empathy” means being able to understand what others feel. To see how the brain expresses empathy, researchers recruited 16 couples, zapped the man’s hand with an electrode while the woman watched, and measured the activity in several parts of the woman’s brain that would respond to her own pain. The women also completed a psychological test that measures empathy.
Regress brain activity (Brain) on empathy score (Emp) using the lm function.
empathy <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/empathy.txt")
attach(empathy)
empathy.fit <- lm(Brain~Emp)
empathy.fit
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Coefficients:
## (Intercept) Emp
## -0.057750 0.007613
The function fitted returns fitted values, \(\hat y\), the \(y\)-values that you would expect for the given \(x\)-values according to the best-fitting straight line.
fit<- lm(Fat~NEA)
fitted(fit)
## 1 2 3 4 5 6 7 8
## 3.828623 3.701288 3.604926 3.040522 3.012990 2.985458 2.661959 2.283395
## 9 10 11 12 13 14 15 16
## 2.156060 1.877300 1.832560 1.663927 1.540034 1.509060 1.371401 1.130497
To put the fitted line (the regression line) on the scatterplot, we can use the lines or abline function:
par(pty="s",mfrow=c(1,2))
plot(fatgain,pch=16)
lines(NEA, fitted(fit),col=2,lwd=2)
plot(fatgain,pch=16)
abline(fit,col=5,lwd=2)
Notice that the arguments of the two functions are different.
Now, superimpose the linear regression line you found for the empathy data on the scatterplot.
plot(Emp,Brain,pch=16,col=4,cex=1.2,cex.lab=1.3,xlab="Empathy score", ylab="Brain activity")
abline(empathy.fit,col=2,lwd=2)
To create a plot where residuals are displayed by connecting each observation to the corresponding point on the fitted line, you can do the following. The segments function draws line segments; its arguments are the endpoint coordinates, in the order \(x_0, y_0, x_1, y_1\).
plot(fatgain,pch=16); abline(fit,col=2,lwd=2)
segments(NEA,fitted(fit),NEA,Fat,col=4,lwd=2)
Suppose a person’s NEA is 650 calories. What is her expected fat gain after eight weeks?
predict(fit,data.frame(NEA=650))
## 1
## 1.268156
Similarly, for the empathy data, predict the brain activity for an empathy score of 38:
y_hat_1 <- predict(empathy.fit,data.frame(Emp=38))
y_hat_1
## 1
## 0.2315374
The distinction between explanatory and response variables is essential in regression.
Least-squares regression makes the distances of the data points from the line small only in the \(y\)-direction. If we reverse the roles of the two variables, we get a different least-squares regression line.
There is a close connection between correlation and the slope of the least-squares line.
The slope is \(\hat \beta_1 = r\cdot s_y/s_x.\) Notice that the slope and the correlation always have the same sign.
The least-squares regression line always passes through the point \((\bar x, \bar y)\).
This is because \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\).
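A quick numerical check with the fat-gain fit from above (the predicted value at \(x=\overline x\) should equal \(\overline y\)):
predict(fit, data.frame(NEA = mean(NEA)))  # prediction at x = mean(NEA)
mean(Fat)                                  # same value: the line passes through the means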
The correlation \(r\) describes the strength of a straight-line relationship.
In the regression setting, this description takes a specific form: The square of the correlation, \(r^2\), is the fraction of the variation in the values of \(y\) that is explained by the least-squares regression of \(y\) on \(x\).
When a scatterplot shows a linear relationship between two quantitative variables, \(x\) and \(y\), we can use the least-squares regression line fitted to the data to predict \(y\) for a given value of \(x\).
When the data are a simple random sample from a larger population, we need statistical inference to answer questions like these about the population:
Is there really a linear relationship between \(x\) and \(y\)?
What is the slope that relates \(y\) and \(x\) in the population, including a margin of error for our estimate of the slope?
If we use the least-squares line to predict \(y\) from a given value of \(x\), how accurate is our prediction?
The slope \(\hat \beta_1\) and the intercept \(\hat \beta_0\) of the least-squares line are statistics because we calculated them from the sample data. To do inference, think of \(\hat \beta_0\) and \(\hat \beta_1\) as estimates of unknown parameters \(\beta_0\) and \(\beta_1\) that describe the whole population.
We have \(n\) observations on an explanatory variable \(x\) and a response variable \(y\).
For any fixed value of \(x\), the response \(y\) varies according to a normal distribution. Repeated responses \(y\) are independent of each other.
The mean response, \(\mu_y\), has a straight-line relationship with \(x\) given by a population regression line \[\mu_y=\beta_0+\beta_1x.\]
The standard deviation of \(y\) (call it \(\sigma\)) is the same for all values of \(x\).
The following linear regression model summarizes the conditions:
\[y_i = \beta_0 + \beta_1 x_i +\epsilon_i\]
where \(\epsilon_i \sim N(0,\sigma^2)\).
We can use the least-squares method to estimate \(\beta_0\) and \(\beta_1\).
The residuals \(e_i=y_i-\hat y_i\) are used to obtain an estimator of \(\sigma^2\).
The sum of squares of the residuals, often called the error sum of squares, is \[SSE=\sum_{i=1}^ne^2_i=\sum_{i=1}^n (y_i-\hat y_i)^2.\]
It can be shown that the expected value of the error sum of squares is \(E(SSE)=(n-2)\sigma^2.\) Consequently, \[E\left(\frac{SSE}{n-2}\right)=\sigma^2\]
Therefore an unbiased estimator of \(\sigma^2\) is \[s^2=\frac{SSE}{n-2}=\frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n-2}\] \(s^2\) is also called mean squared error (MSE).
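As a small Monte Carlo check (simulated data, not the lecture data) that \(SSE/(n-2)\) is approximately unbiased for \(\sigma^2\):
set.seed(2)                                   # simulated example, for illustration only
sigma2 <- 4; x_sim <- 1:15
mse_vals <- replicate(2000, {
  y_sim <- 1 + 2 * x_sim + rnorm(length(x_sim), sd = sqrt(sigma2))
  m <- lm(y_sim ~ x_sim)
  sum(resid(m)^2) / (length(x_sim) - 2)       # SSE/(n-2) for this simulated sample
})
mean(mse_vals)                                # close to sigma2 = 4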
It can be shown that \(\sum_{i=1}^ne_i=\sum_{i=1}^n(Y_i-\hat Y_i)=0.\)
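We can verify this numerically with the fat-gain fit from earlier:
sum(resid(fit))   # numerically zero, up to rounding error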
Infants who cry easily may be more easily stimulated than others. This may be a sign of high IQ. The crying data set records a crying count (Crycount) and an IQ score (IQ) for each infant; the first ten observations are shown below.
## Crycount IQ
## 1 10 87
## 2 20 90
## 3 17 94
## 4 12 94
## 5 12 97
## 6 16 100
## 7 19 103
## 8 12 103
## 9 9 103
## 10 23 103
attach(crying)
crying_lm <- lm(IQ~Crycount); crying_lm
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Coefficients:
## (Intercept) Crycount
## 91.268 1.493
par(pty="s")
plot(crying,col=2,pch=16)
abline(crying_lm,col=4,lwd=2)
Thus, the regression line is \[\hat y = 91.268+1.493 x\]
Let us calculate the MSE in the Crying and IQ example. The resid() function can be used to extract residuals from the regression output.
SSE <- sum(resid(crying_lm)^2)
n <- dim(crying)[1]
MSE <- SSE/(n-2); MSE
## [1] 306.2052
s <- sqrt(MSE); s
## [1] 17.49872
Actually, \(s=\sqrt{MSE}\) is in the output of the summary function; it is called the “Residual standard error”.
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
You may also extract it directly:
summary(crying_lm)$sigma
## [1] 17.49872
Earlier in the lecture, we fitted a regression model for the empathy data. Now, calculate \(\sqrt{MSE}\) in that model: first do it manually, then extract \(\sqrt{MSE}\) from the output.
attach(empathy)
empathy.fit<-lm(Brain~Emp)
n2<-length(Brain)
MSE2<-sum(resid(empathy.fit)^2)/(n2-2)
s2 <- sqrt(MSE2); s2
## [1] 0.2442106
summary(empathy.fit)$sigma
## [1] 0.2442106
Recall the basic expression of \(\hat \beta_1\):
\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\equiv\frac{S_{xy}}{S_{xx}}\]
So \(\hat \beta_1\) is actually a linear combination of \(Y_i\)’s. Hence, it follows a normal distribution.
Since \(\hat \beta_0\) is a linear function of \(\hat \beta_1\), \(\hat \beta_0\) is also normal.
It can be shown that \[ \begin{align*} \mu_{\hat\beta_0}&=\beta_0 \hspace{10pt}\mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_0}=\left(\frac1n+\frac{\overline{x}^2}{S_{xx}}\right)\sigma^2\\ \mu_{\hat\beta_1}&=\beta_1\hspace{10pt} \mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_1}=\frac{\sigma^2}{S_{xx}} \end{align*} \]
Since \(\sigma^2\) is unknown, we have to use MSE (\(s^2\)) to replace it. The standard error of \(\hat\beta_1\) is \[se(\hat\beta_1)=\sqrt{\frac{s^2}{S_{xx}}}\]
So, we have the following standardized version of \(\hat\beta_1\):
\[t=\frac{\hat \beta_1 -\beta_1}{se({\hat\beta_1})}\]
and it follows a \(t\) distribution with \(n-2\) degrees of freedom.
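As a quick check of the intercept formula above (reusing s, n, Crycount, and crying_lm from the Crying and IQ example), we can compute \(se(\hat\beta_0)\) manually and compare it with the “Std. Error” of the intercept reported by summary:
se_b0 <- s*sqrt(1/n + mean(Crycount)^2/((n-1)*var(Crycount)))
se_b0                                    # manual standard error of the intercept
summary(crying_lm)$coefficients[1,2]     # should agree (8.934 in the output above)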
In the Crying and IQ example, data analysis supports the conjecture that children with higher crying counts tend to have higher IQs (since the slope is 1.49).
But is the positive association statistically significant? To answer this question, test hypotheses about the slope \(\beta_1\) of the population regression line: \[H_0: \beta_1=0\hspace{5pt}\text{ versus} \hspace{5pt}H_a: \beta_1 \ne 0\] Under the null hypothesis, the test statistic is \[t_0=\frac{\hat \beta_1 }{se({\hat\beta_1})}\] Since this is a two-sided test, the p-value is \[2P(t_{n-2}<-|t_0|)\]
Let us first do the test manually. Let \(\alpha=0.05\).
Sxx <- (n-1)*var(Crycount)
se_b1 <- s/sqrt(Sxx); se_b1
## [1] 0.4870011
b1 <-cov(IQ,Crycount)/var(Crycount)
t0 <- b1/se_b1; t0
## [1] 3.065489
p_value <- 2*pt(-abs(t0),n-2); p_value
## [1] 0.0041053
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant linear association between crying count and IQ; because \(\hat\beta_1>0\), the association is positive.
We can confirm the test statistic and p-value in the R output:
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Now, conduct the test of no linear relationship for the empathy data.
summary(empathy.fit)
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35257 -0.14980 0.02289 0.14572 0.44495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.057750 0.188326 -0.307 0.7636
## Emp 0.007613 0.003385 2.249 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared: 0.2654, Adjusted R-squared: 0.2129
## F-statistic: 5.057 on 1 and 14 DF, p-value: 0.04115
b1.2<-summary(empathy.fit)$coefficients[2,1];b1.2
## [1] 0.007612827
se_b1.2<-summary(empathy.fit)$coefficients[2,2];se_b1.2
## [1] 0.003385372
t0.2<- summary(empathy.fit)$coefficients[2,3];t0.2
## [1] 2.248742
p_value2<-summary(empathy.fit)$coefficients[2,4];p_value2
## [1] 0.0411514
If we want to test if the slope is positive, we need to do a one-sided test.
\(H_0: \beta_1=0\) vs. \(H_a: \beta_1>0\).
The test statistic stays the same, but the p-value is \(P(T_{n-2}>t_0):\)
1-pt(t0.2,n2-2)
## [1] 0.0205757
Note that the p-value is half of that of the two-sided test.
The slope \(\beta_1\) of the population regression line is the most important parameter in a regression problem.
We already have a point estimator of \(\beta_1\) using the least-squares method
\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\]
A confidence interval is more useful for \(\beta_1\) because it shows how accurate the estimate \(\hat\beta_1\) is likely to be.
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) of the population regression line is \[\hat\beta_1 \pm t^* \cdot se(\hat\beta_1)\] where \(t^*\) is the critical value for the \(t_{n-2}\) distribution with area \(1-\alpha\) between \(-t^*\) and \(t^*\).
You may extract \(\hat\beta_1\) and its standard error from the output and calculate the C.I. “manually”:
t <- qt(0.975,n-2)
b1<-summary(crying_lm)$coefficients[2,1]
se_b1 <- summary(crying_lm)$coefficients[2,2]
LL<-b1-t*se_b1; UL<-b1+t*se_b1; c(LL,UL)
## [1] 0.5052127 2.4805805
Alternatively, you may use the confint() function:
confint(crying_lm,'Crycount',level=0.95)
## 2.5 % 97.5 %
## Crycount 0.5052127 2.480581
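The same one-line call gives a 95% confidence interval for the slope in the empathy model (an extra illustration, not part of the original exercise):
confint(empathy.fit, 'Emp', level=0.95)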
The estimated mean response at a specific value of \(x,\) say, \(x^*\) is \[\hat\mu_{y|x^*}=\hat\beta_0+\hat\beta_1x^*,\] where \(\hat\beta_0\) and \(\hat\beta_1\) are the least squares estimators.
\(\hat\mu_{y|x^*}\) is an unbiased estimator of \(\mu_{y|x^*}\) and it is normally distributed (since both \(\hat\beta_0\) and \(\hat\beta_1\) are normal).
It can be shown that \[\text{Var}(\hat\mu_{y|x^*})=\sigma^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]\]
Confidence intervals may be constructed on the mean response \(\mu_{y|x^*}\) based on the sampling distribution of it.
The population variance of \(Y\), \(\sigma^2\), is unknown and has to be estimated by \(s^2\), the mean squared error (MSE). So the standard error of \(\hat \mu_{Y|x^*}\) is \[\text{se}(\hat\mu_{y|x^*})=\sqrt{s^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}\] where \(s^2=\sum_{i=1}^{n}(y_i-\hat y_i)^2/(n-2)\).
Thus, a \(100(1-\alpha)\%\) confidence interval for the mean response \(\mu_{y|x^*}\) is \[\hat y \pm t^* \cdot se(\hat\mu_{y|x^*})\]
Let us calculate a 95% confidence interval for \(\mu_{y|x}\) when \(x=19\).
b0<-summary(crying_lm)$coefficients[1,1]
b1<-summary(crying_lm)$coefficients[2,1]
x.star <- 19; y.star <- b0+b1*x.star
se.mu.y <- s*sqrt(1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.star +c(-1,1)*t.star*se.mu.y
## [1] 113.6619 125.6047
We may also use the R function predict:
new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95, interval="confidence")
## fit lwr upr
## 1 119.6333 113.6619 125.6047
An important application of a regression model is predicting new or future observations \(Y\) corresponding to a specified value of regressor \(x\).
If \(x^*\) is the value of the regressor variable of interest, then the point estimator of the new or future value of the response \(Y^*\) is \[\hat Y=\hat\beta_0+\hat\beta_1x^*.\] Now consider obtaining an interval estimate for this future observation \(Y^*.\) It can be shown that \[Var(Y^*-\hat Y)=\sigma^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{S_{xx}}\right]\]
If we use the MSE, \(s^2\), to estimate \(\sigma^2\), we can show that \[\frac{Y^*-\hat Y}{\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}}\sim t_{n-2}.\]
From this we can develop the following prediction interval definition.
A \(100(1-\alpha)\%\) prediction interval on a future observation \(Y^*\) at the value \(x=x^*\) is given by \[\begin{eqnarray*} \hat Y \pm t^*\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}. \end{eqnarray*}\] Note that the prediction interval is wider than the confidence interval for the mean response \(\mu_{y|x^*}\).
Let us calculate a 95% prediction interval for \(Y^*\) when \(x=19\).
x.star <- 19; y.hat <- b0+b1*x.star
se.y.hat <- s*sqrt(1+1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.hat + c(-1,1)*t.star*se.y.hat
## [1] 83.64541 155.62126
We may also use the R function predict:
new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95, interval="prediction")
## fit lwr upr
## 1 119.6333 83.64541 155.6213
We may visualize the difference between the confidence interval of \(\mu_{y|x^*}\) and the prediction interval of \(Y^*\).
all<- seq(min(Crycount),max(Crycount)); new.data <- data.frame(Crycount=all)
CI.all <- predict(crying_lm,new.data,interval="confidence");PI.all <- predict(crying_lm,new.data,interval="prediction")
plot(crying,pch=16,cex=1)
lines(all,CI.all[,2],lwd=2,lty=3,col=2); lines(all,CI.all[,3],lwd=2,lty=3,col=2)
lines(all,PI.all[,2],lwd=2,lty=2,col=4); lines(all,PI.all[,3],lwd=2,lty=2,col=4)
legend("topleft", legend=c("Confidence Interval", "Prediction Interval"), lty=c(3,2),col=c(2,4), cex=0.8)
Fitting a regression model requires several assumptions.
The residuals from a regression model are \(e_i=y_i-\hat y_i, i=1,2,\dots,n,\) where \(y_i\) is an actual observation and \(\hat y_i\) is the corresponding fitted value from the regression model.
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
If all assumptions of a linear regression hold, the pattern of the residual plot will be an unstructured horizontal band centered at 0 and symmetric about 0, as in Figure (a).
A curved pattern, like the one in Figure (b), indicates that the relationship between the response and the explanatory variable is curved rather than linear.
A fan-shaped pattern, like the one in Figure (c), shows that the variation of the response about the least-squares line increases as the explanatory variable increases; a simulated illustration of the three patterns follows below.
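Since the referenced figure is not reproduced here, the following sketch generates the three patterns from simulated (made-up) data:
set.seed(3)                                   # simulated data, for illustration only
x_r <- runif(100, 0, 10)
r_a <- rnorm(100)                             # (a) unstructured horizontal band
y_curve <- x_r^2 + rnorm(100, sd = 2)
r_b <- resid(lm(y_curve ~ x_r))               # (b) curved: a line fit to a quadratic trend
r_c <- rnorm(100, sd = 0.3 * x_r)             # (c) fan shape: spread grows with x
par(mfrow = c(1, 3), pty = "s")
plot(x_r, r_a, main = "(a)"); abline(h = 0, col = 2)
plot(x_r, r_b, main = "(b)"); abline(h = 0, col = 2)
plot(x_r, r_c, main = "(c)"); abline(h = 0, col = 2)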
Let us create a residual plot for the Crying and IQ example.
crying_rs <- resid(crying_lm)
plot(Crycount,crying_rs,pch=16,col=4, ylab="Residual")
abline(h=0,lwd=2,col=2)
The residuals form an unstructured horizontal band, roughly symmetric about 0, and no curved pattern is observed.
par(mfrow=c(1,2),pty="s")
hist(crying_rs,freq=F); lines(seq(-40,60,0.01),dnorm(seq(-40,60,0.01),mean(crying_rs),sd(crying_rs)),col=2,lwd=2)
qqnorm(crying_rs);qqline(crying_rs,col=2)
Recall that we fitted a linear regression model and called it empathy.fit. The explanatory variable is the empathy score, Emp. Now, pull out the residuals and create a residual plot, a histogram, and a QQ plot of the residuals.
par(mfrow=c(1,3),pty="s")
empathy_rs <- resid(empathy.fit)
plot(Emp,empathy_rs,pch=16,col=4, ylab="Residual"); abline(h=0,lwd=2,col=2)
hist(empathy_rs,freq=F); lines(seq(-0.5,0.7,0.01),dnorm(seq(-0.5,0.7,0.01),mean(empathy_rs),sd(empathy_rs)),col=4,lwd=2)
qqnorm(empathy_rs);qqline(empathy_rs,col=4)
We can also use the Shapiro-Wilk test to check the normality assumption:
shapiro.test(crying_rs)
##
## Shapiro-Wilk normality test
##
## data: crying_rs
## W = 0.96496, p-value = 0.2741
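For completeness, the same test can be applied to the residuals of the empathy model extracted above:
shapiro.test(empathy_rs)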
We have seen the use of analysis of variance tables in grouped designs. However, their use is not restricted to these designs but applies to the whole class of linear models.
The variation between and within groups in a one-way analysis of variance generalizes to model variation and residual variation:
\[ \begin{align*} SSR&=\sum_{i=1}^n (\hat y_i-\bar y)^2\\ SSE&=\sum_{i=1}^n (y_i-\hat y_i)^2 \end{align*} \]
The role of the group means in a one-way analysis of variance is taken over by the fitted values \(\hat y_i\) in the more general linear model.
An F test for the significance of the model is available in direct analogy with the one-way ANOVA F test.
In multiple linear regression, the null hypothesis is “all the slopes are zero”, i.e. \(\beta_1=\beta_2=\cdots=\beta_k=0\).
In simple linear regression, this test is equivalent to testing that the regression coefficient \(\beta_1\) is zero.
anova(crying_lm)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## Crycount 1 2877.5 2877.48 9.3972 0.004105 **
## Residuals 36 11023.4 306.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypotheses: \(H_0: \beta_1=0\) versus \(H_a: \beta_1\ne 0\)
Test statistic:
\[
\begin{align*}
F=&\frac{SSR/DFR}{SSE/DFE}=\frac{MSR}{MSE}\\
=&\frac{2877.5/1}{11023.4/36}=\frac{2877.48}{306.21}=9.3972
\end{align*}
\]
The p-value is \(P(F_{DFR,DFE}>F)\).
p_val <- 1 - pf(9.3972,1,36); p_val
## [1] 0.004105345
Conclusion: Since the p-value is 0.004 < 0.05, reject \(H_0\). The model is significant (useful) at 0.05 level.
Since the model utility test is equivalent to testing that \(\beta_1=0\) in simple linear regression, we may pull out the t test result and compare the two.
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Note that the p-value for the \(t\) statistic for \(\beta_1\) is the same as the p-value of the \(F\) statistic from the utility test (in fact, in simple linear regression \(F=t^2\)).
Now, conduct a utility F test for the Empathy data set. The model object is empathy.fit.
What is the F statistic? How about the p-value? Do you think the model is “useful” in predicting the response?
anova(empathy.fit)
## Analysis of Variance Table
##
## Response: Brain
## Df Sum Sq Mean Sq F value Pr(>F)
## Emp 1 0.30158 0.301584 5.0568 0.04115 *
## Residuals 14 0.83494 0.059639
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Any alternative way to test for utility?
summary(empathy.fit)
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35257 -0.14980 0.02289 0.14572 0.44495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.057750 0.188326 -0.307 0.7636
## Emp 0.007613 0.003385 2.249 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared: 0.2654, Adjusted R-squared: 0.2129
## F-statistic: 5.057 on 1 and 14 DF, p-value: 0.04115
The analysis of variance (ANOVA) identity is as follows: \[\underbrace{\sum_{i=1}^n(Y_i-\overline Y)^2}_{SST}=\underbrace{\sum_{i=1}^n(\hat Y_i-\overline Y)^2}_{SSR}+\underbrace{\sum_{i=1}^n(Y_i-\hat Y_i)^2}_{SSE}\]
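A direct numerical check of the identity for the Crying and IQ data:
ss_t <- sum((IQ - mean(IQ))^2)                  # total sum of squares (SST)
ss_r <- sum((fitted(crying_lm) - mean(IQ))^2)   # regression sum of squares (SSR)
ss_e <- sum(resid(crying_lm)^2)                 # error sum of squares (SSE)
c(ss_t, ss_r + ss_e)                            # the two values agree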
Coefficient of Determination, denoted by \(R^2\), is often used to judge the adequacy of a regression model. \[R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\]
anova(crying_lm)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## Crycount 1 2877.5 2877.48 9.3972 0.004105 **
## Residuals 36 11023.4 306.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table, we can extract information and calculate \(R^2\) manually.
SSR=2877.5; SSE=11023.4
SST=SSR+SSE
R_sq=SSR/SST; R_sq
## [1] 0.207001
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Let us also verify the relationship between the coefficient of determination (\(R^2\)) and the correlation \(r\).
r=cor(Crycount,IQ);r
## [1] 0.4549725
r^2; R_sq
## [1] 0.207
## [1] 0.207001
What is the coefficient of determination for the linear model for Empathy? Verify that \(R^2=r^2.\)
r<-cor(Emp,Brain);r;r^2
## [1] 0.5151268
## [1] 0.2653556
summary(empathy.fit)$r.squared
## [1] 0.2653556
detach(empathy); detach(crying); detach(fatgain)