Junshu Bao
Correlation tells us about strength and direction of the linear relationship between two quantitative variables.
In addition, we would like to have a numerical description of how both variables vary together.
A regression line is a straight line \(\hat y=b_0+b_1 x\) that describes how a response variable \(y\) changes as an explanatory variable \(x\) changes.
The least-squares regression line of \(y\) on \(x\) is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible, i.e., it is the line that minimizes \[\sum_{i=1}^n \left(y_i-\hat y_i \right)^2\] where \(y_i-\hat y_i=y_i-(b_0+b_1 x_i)\) is called the residual or prediction error for observation \(i\).
The least-squares regression line is unique.
The intercept and slope are functions of the \(x\) and \(y\) values:
\[\begin{eqnarray*} \hat \beta_1&=&\frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n(x_i-\overline x)^2}\\ \hat \beta_0&=&\overline y-\hat \beta_1\overline x \end{eqnarray*}\] Note that \[\hat \beta_1=\frac{(n-1)\,\text{Cov}(x,y)}{(n-1)\,\text{Var}(x)}=\frac{s_{xy}}{s_x^2}=\frac{r\,s_x s_y}{s_x^2}= r\cdot\frac{s_{y}}{s_{x}},\] where \(s_{xy}\) is the sample covariance, \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\), respectively, and \(r\) is the sample correlation coefficient.
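As a quick numerical illustration (with made-up vectors, not the lecture data), we can minimize the sum of squared residuals directly with optim() and check that the result agrees with the closed-form formulas above:
set.seed(1)                                        # made-up data, for illustration only
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 2)
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)  # sum of squared residuals
optim(c(0, 0), sse)$par                            # numerical minimizer: (b0, b1)
c(mean(y) - cor(x, y) * sd(y) / sd(x) * mean(x),   # closed-form b0 = ybar - b1*xbar
  cor(x, y) * sd(y) / sd(x))                       # closed-form b1 = r * sy / sx
The two approaches agree up to the tolerance of the numerical optimizer.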
Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA) explains why. Some people may spontaneously increase NEA when fed more.
Researchers deliberately overfed 16 healthy young adults for eight weeks. They measured fat gain (in kg) and change in NEA (in calories).
The data set is as follows:
fatgain <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/fatgain.txt", header=TRUE)
head(fatgain)
## NEA Fat
## 1 -94 4.2
## 2 -57 3.0
## 3 -29 3.7
## 4 135 2.7
## 5 143 3.2
## 6 151 3.6
The scatterplot shows that the relationship between Fat and NEA is a moderately strong, negative, linear relationship. The red straight line is the least-squares regression line.
We can use R as a calculator to find the estimates of the intercept and slope manually according to the formulas:
attach(fatgain)
b1 <-cov(Fat,NEA)/var(NEA); b1
## [1] -0.003441487
b0 <-mean(Fat)-b1*mean(NEA);b0
## [1] 3.505123
Alternatively, we can use the function lm (linear model). The basic form is lm(formula = y ~ x, data = ...).
fit<- lm(Fat~NEA)
fit
##
## Call:
## lm(formula = Fat ~ NEA)
##
## Coefficients:
## (Intercept) NEA
## 3.505123 -0.003441
The argument to lm is a model formula, in which the tilde symbol (~) should be read as “described by”. The output from lm is very brief: the estimated intercept (\(\hat\beta_0\)) and the estimated slope (\(\hat\beta_1\)). The lm function handles much more complicated models than simple linear regression, for example multiple regression models (e.g. \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2+\epsilon\)); a small sketch follows below.
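For instance, a two-predictor model is fit with the same function. The data frame and variable names below (dat, x1, x2, y) are hypothetical, just to illustrate the form of the call:
set.seed(4)                                  # hypothetical data, for illustration only
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
lm(y ~ x1 + x2, data = dat)                  # y described by x1 and x2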
“Empathy” means being able to understand what others feel. To see how the brain expresses empathy, researchers recruited 16 couples, zapped the man’s hand with an electrode while the woman watched, and measured the activity in several parts of the woman’s brain that would respond to her own pain. The women also completed a psychological test that measures empathy.
Regress brain activity (Brain) on empathy score (Emp) using the lm function.
empathy <- read.table("C:/Users/jub69/Dropbox/A-Pitt/STAT 1293/Lectures/Lecture9_RegAndCorr/empathy.txt")
attach(empathy)
empathy.fit <- lm(Brain~Emp)
empathy.fit
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Coefficients:
## (Intercept) Emp
## -0.057750 0.007613
The function fitted returns fitted values, \(\hat y\), the \(y\)-values that you would expect for the given \(x\)-values according to the best-fitting straight line.
fit<- lm(Fat~NEA)
fitted(fit)
## 1 2 3 4 5 6 7 8
## 3.828623 3.701288 3.604926 3.040522 3.012990 2.985458 2.661959 2.283395
## 9 10 11 12 13 14 15 16
## 2.156060 1.877300 1.832560 1.663927 1.540034 1.509060 1.371401 1.130497
To put the fitted line (the regression line) on the scatterplot, we can use the lines or abline function:
par(pty="s",mfrow=c(1,2))
plot(fatgain,pch=16)
lines(NEA, fitted(fit),col=2,lwd=2)
plot(fatgain,pch=16)
abline(fit,col=5,lwd=2)
Notice that the arguments of the two functions are different.
Now, superimpose the linear regression line you found for the empathy data on the scatterplot.
plot(Emp,Brain,pch=16,col=4,cex=1.2,cex.lab=1.3,xlab="Empathy score", ylab="Brain activity")
abline(empathy.fit,col=2,lwd=2)
To create a plot where residuals are displayed by connecting each observation to the corresponding point on the fitted line, you can do the following. The segments function draws line segments; its arguments are the endpoint coordinates, in the order \(x_0, y_0, x_1, y_1\).
plot(fatgain,pch=16); abline(fit,col=2,lwd=2)
segments(NEA,fitted(fit),NEA,Fat,col=4,lwd=2)
Suppose a person’s NEA is 650 calories. What is her expected fat gain after eight weeks?
predict(fit,data.frame(NEA=650))
## 1
## 1.268156
Similarly, for the empathy data, predict the brain activity for an empathy score of 38:
y_hat_1 <- predict(empathy.fit,data.frame(Emp=38))
y_hat_1
## 1
## 0.2315374
The distinction between explanatory and response variables is essential in regression.
Least-squares regression makes the distances of the data points from the line small only in the \(y\)-direction. If we reverse the roles of the two variables, we get a different least-squares regression line.
There is a close connection between correlation and the slope of the least-squares line.
The slope is \(\hat \beta_1 = r\cdot s_y/s_x.\) Notice that the slope and the correlation always have the same sign.
The least-squares regression line always passes through the point \((\bar x, \bar y)\).
This is because \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\).
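A quick numerical check with the fat-gain fit from above (the predicted value at \(x=\overline x\) should equal \(\overline y\)):
predict(fit, data.frame(NEA = mean(NEA)))  # prediction at x = mean(NEA)
mean(Fat)                                  # same value: the line passes through the means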
The correlation \(r\) describes the strength of a straight-line relationship.
In the regression setting, this description takes a specific form: The square of the correlation, \(r^2\), is the fraction of the variation in the values of \(y\) that is explained by the least-squares regression of \(y\) on \(x\).
When a scatterplot shows a linear relationship between two quantitative variables, \(x\) and \(y\), we can use the least-squares regression line fitted to the data to predict \(y\) for a given value of \(x\).
When the data are a simple random sample from a larger population, we need statistical inference to answer questions like these about the population:
Is there really a linear relationship between \(x\) and \(y\)?
What is the slope that relates \(y\) and \(x\) in the population, including a margin of error for our estimate of the slope?
If we use the least-squares line to predict \(y\) from a given value of \(x\), how accurate is our prediction?
The slope \(\hat \beta_1\) and the intercept \(\hat \beta_0\) of the least-squares line are statistics because we calculated them from the sample data. To do inference, think of \(\hat \beta_0\) and \(\hat \beta_1\) as estimates of unknown parameters \(\beta_0\) and \(\beta_1\) that describe the whole population.
We have \(n\) observations on an explanatory variable \(x\) and a response variable \(y\).
For any fixed value of \(x\), the response \(y\) varies according to a normal distribution. Repeated responses \(y\) are independent of each other.
The mean response, \(\mu_y\), has a straight-line relationship with \(x\) given by a population regression line \[\mu_y=\beta_0+\beta_1x.\]
The standard deviation of \(y\) (call it \(\sigma\)) is the same for all values of \(x\).
The following linear regression model summarizes the conditions:
\[y_i = \beta_0 + \beta_1 x_i +\epsilon_i\]
where \(\epsilon_i \sim N(0,\sigma^2)\).
We can use the least-squares method to estimate \(\beta_0\) and \(\beta_1\).
The residuals \(e_i=y_i-\hat y_i\) are used to obtain an estimator of \(\sigma^2\).
The sum of squares of the residuals, often called the error sum of squares, is \[SSE=\sum_{i=1}^ne^2_i=\sum_{i=1}^n (y_i-\hat y_i)^2.\]
It can be shown that the expected value of the error sum of squares is \(E(SSE)=(n-2)\sigma^2.\) Consequently, \[E\left(\frac{SSE}{n-2}\right)=\sigma^2\]
Therefore an unbiased estimator of \(\sigma^2\) is \[s^2=\frac{SSE}{n-2}=\frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n-2}\] \(s^2\) is also called mean squared error (MSE).
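As a small Monte Carlo check (simulated data, not the lecture data) that \(SSE/(n-2)\) is approximately unbiased for \(\sigma^2\):
set.seed(2)                                   # simulated example, for illustration only
sigma2 <- 4; x_sim <- 1:15
mse_vals <- replicate(2000, {
  y_sim <- 1 + 2 * x_sim + rnorm(length(x_sim), sd = sqrt(sigma2))
  m <- lm(y_sim ~ x_sim)
  sum(resid(m)^2) / (length(x_sim) - 2)       # SSE/(n-2) for this simulated sample
})
mean(mse_vals)                                # close to sigma2 = 4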
It can be shown that \(\sum_{i=1}^ne_i=\sum_{i=1}^n(Y_i-\hat Y_i)=0.\)
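We can verify this numerically with the fat-gain fit from earlier:
sum(resid(fit))   # numerically zero, up to rounding error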
Infants who cry easily may be more easily stimulated than others. This may be a sign of high IQ. The crying data set records a crying count (Crycount) and an IQ score (IQ) for each infant; the first ten observations are shown below.
## Crycount IQ
## 1 10 87
## 2 20 90
## 3 17 94
## 4 12 94
## 5 12 97
## 6 16 100
## 7 19 103
## 8 12 103
## 9 9 103
## 10 23 103
attach(crying)
crying_lm <- lm(IQ~Crycount); crying_lm
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Coefficients:
## (Intercept) Crycount
## 91.268 1.493
par(pty="s")
plot(crying,col=2,pch=16)
abline(crying_lm,col=4,lwd=2)
Thus, the regression line is \[\hat y = 91.268+1.493 x\]
Let us calculate the MSE in the Crying and IQ example. The resid() function can be used to extract residuals from the regression output.
SSE <- sum(resid(crying_lm)^2)
n <- dim(crying)[1]
MSE <- SSE/(n-2); MSE
## [1] 306.2052
s <- sqrt(MSE); s
## [1] 17.49872
Actually, \(s=\sqrt{MSE}\) is in the output of the summary function; it is called the “Residual standard error”.
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
You may also extract it directly:
summary(crying_lm)$sigma
## [1] 17.49872
Earlier in the lecture, we fitted a regression model for the empathy data. Now, calculate \(\sqrt{MSE}\) in that model: first do it manually, then extract \(\sqrt{MSE}\) from the output.
attach(empathy)
empathy.fit<-lm(Brain~Emp)
n2<-length(Brain)
MSE2<-sum(resid(empathy.fit)^2)/(n2-2)
s2 <- sqrt(MSE2); s2
## [1] 0.2442106
summary(empathy.fit)$sigma
## [1] 0.2442106
Recall the basic expression of \(\hat \beta_1\):
\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\equiv\frac{S_{xy}}{S_{xx}}\]
So \(\hat \beta_1\) is actually a linear combination of \(Y_i\)’s. Hence, it follows a normal distribution.
Since \(\hat \beta_0\) is a linear function of \(\hat \beta_1\), \(\hat \beta_0\) is also normal.
It can be shown that \[ \begin{align*} \mu_{\hat\beta_0}&=\beta_0 \hspace{10pt}\mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_0}=\left(\frac1n+\frac{\overline{x}^2}{S_{xx}}\right)\sigma^2\\ \mu_{\hat\beta_1}&=\beta_1\hspace{10pt} \mbox{ and }\hspace{10pt} \sigma^2_{\hat\beta_1}=\frac{\sigma^2}{S_{xx}} \end{align*} \]
Since \(\sigma^2\) is unknown, we have to use MSE (\(s^2\)) to replace it. The standard error of \(\hat\beta_1\) is \[se(\hat\beta_1)=\sqrt{\frac{s^2}{S_{xx}}}\]
So, we have the following standardized version of \(\hat\beta_1\):
\[t=\frac{\hat \beta_1 -\beta_1}{se({\hat\beta_1})}\]
and it follows a \(t\) distribution with \(n-2\) degrees of freedom.
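As a quick check of the intercept formula above (reusing s, n, Crycount, and crying_lm from the Crying and IQ example), we can compute \(se(\hat\beta_0)\) manually and compare it with the “Std. Error” of the intercept reported by summary:
se_b0 <- s*sqrt(1/n + mean(Crycount)^2/((n-1)*var(Crycount)))
se_b0                                    # manual standard error of the intercept
summary(crying_lm)$coefficients[1,2]     # should agree (8.934 in the output above)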
In the Crying and IQ example, data analysis supports the conjecture that children with higher crying counts tend to have higher IQs (since the slope is 1.49).
But is the positive association statistically significant? To answer this question, test hypotheses about the slope \(\beta_1\) of the population regression line: \[H_0: \beta_1=0\hspace{5pt}\text{ versus} \hspace{5pt}H_a: \beta_1 \ne 0\] Under the null hypothesis, the test statistic is \[t_0=\frac{\hat \beta_1 }{se({\hat\beta_1})}\] Since this is a two-sided test, the p-value is \[2P(t_{n-2}<-|t_0|)\]
Let us first do the test manually. Let \(\alpha=0.05\).
Sxx <- (n-1)*var(Crycount)
se_b1 <- s/sqrt(Sxx); se_b1
## [1] 0.4870011
b1 <-cov(IQ,Crycount)/var(Crycount)
t0 <- b1/se_b1; t0
## [1] 3.065489
p_value <- 2*pt(-abs(t0),n-2); p_value
## [1] 0.0041053
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant linear association between crying count and IQ; because \(\hat\beta_1>0\), the association is positive.
We can confirm the test statistic and p-value in the R output:
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Now, conduct the test of no linear relationship for the empathy data.
summary(empathy.fit)
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35257 -0.14980 0.02289 0.14572 0.44495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.057750 0.188326 -0.307 0.7636
## Emp 0.007613 0.003385 2.249 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared: 0.2654, Adjusted R-squared: 0.2129
## F-statistic: 5.057 on 1 and 14 DF, p-value: 0.04115
b1.2<-summary(empathy.fit)$coefficients[2,1];b1.2
## [1] 0.007612827
se_b1.2<-summary(empathy.fit)$coefficients[2,2];se_b1.2
## [1] 0.003385372
t0.2<- summary(empathy.fit)$coefficients[2,3];t0.2
## [1] 2.248742
p_value2<-summary(empathy.fit)$coefficients[2,4];p_value2
## [1] 0.0411514
If we want to test if the slope is positive, we need to do a one-sided test.
\(H_0: \beta_1=0\) vs. \(H_a: \beta_1>0\).
The test statistic stays the same, but the p-value is \(P(T_{n-2}>t_0):\)
1-pt(t0.2,n2-2)
## [1] 0.0205757
Note that the p-value is half of that of the two-sided test.
The slope \(\beta_1\) of the population regression line is the most important parameter in a regression problem.
We already have a point estimator of \(\beta_1\) using the least-squares method
\[\hat \beta_1=\frac{\sum_{i=1}^n(x_i-\overline x)(Y_i-\overline Y)}{\sum_{i=1}^n(x_i-\overline x)^2}\]
A confidence interval is more useful for \(\beta_1\) because it shows how accurate the estimate \(\hat\beta_1\) is likely to be.
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) of the population regression line is \[\hat\beta_1 \pm t^* \cdot se(\hat\beta_1)\] where \(t^*\) is the critical value for the \(t_{n-2}\) distribution with area \(1-\alpha\) between \(-t^*\) and \(t^*\).
You may extract \(\hat\beta_1\) and its standard error from the output and calculate the C.I. “manually”:
t <- qt(0.975,n-2)
b1<-summary(crying_lm)$coefficients[2,1]
se_b1 <- summary(crying_lm)$coefficients[2,2]
LL<-b1-t*se_b1; UL<-b1+t*se_b1; c(LL,UL)
## [1] 0.5052127 2.4805805
Alternatively, you may use the confint() function:
confint(crying_lm,'Crycount',level=0.95)
## 2.5 % 97.5 %
## Crycount 0.5052127 2.480581
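The same one-line call gives a 95% confidence interval for the slope in the empathy model (an extra illustration, not part of the original exercise):
confint(empathy.fit, 'Emp', level=0.95)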
The estimated mean response at a specific value of \(x,\) say, \(x^*\) is \[\hat\mu_{y|x^*}=\hat\beta_0+\hat\beta_1x^*,\] where \(\hat\beta_0\) and \(\hat\beta_1\) are the least squares estimators.
\(\hat\mu_{y|x^*}\) is an unbiased estimator of \(\mu_{y|x^*}\) and it is normally distributed (since both \(\hat\beta_0\) and \(\hat\beta_1\) are normal).
It can be shown that \[\text{Var}(\hat\mu_{y|x^*})=\sigma^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]\]
Confidence intervals may be constructed on the mean response \(\mu_{y|x^*}\) based on the sampling distribution of it.
The population variance of \(Y\), \(\sigma^2\), is unknown and has to be estimated by \(s^2\), the mean squared error (MSE). So the standard error of \(\hat \mu_{Y|x^*}\) is \[\text{se}(\hat\mu_{y|x^*})=\sqrt{s^2\left[\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}\] where \(s^2=\sum_{i=1}^{n}(y_i-\hat y_i)^2/(n-2)\).
Thus, a \(100(1-\alpha)\%\) confidence interval for the mean response \(\mu_{y|x^*}\) is \[\hat y \pm t^* \cdot se(\hat\mu_{y|x^*})\]
Let us calculate a 95% confidence interval for \(\mu_{y|x}\) when \(x=19\).
b0<-summary(crying_lm)$coefficients[1,1]
b1<-summary(crying_lm)$coefficients[2,1]
x.star <- 19; y.star <- b0+b1*x.star
se.mu.y <- s*sqrt(1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.star +c(-1,1)*t.star*se.mu.y
## [1] 113.6619 125.6047
We may also use the R function predict:
new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95, interval="confidence")
## fit lwr upr
## 1 119.6333 113.6619 125.6047
An important application of a regression model is predicting new or future observations \(Y\) corresponding to a specified value of regressor \(x\).
If \(x^*\) is the value of the regressor variable of interest, then the point estimator of the new or future value of the response \(Y^*\) is \[\hat Y=\hat\beta_0+\hat\beta_1x^*.\] Now consider obtaining an interval estimate for this future observation \(Y^*.\) It can be shown that \[Var(Y^*-\hat Y)=\sigma^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{S_{xx}}\right]\]
If we use the MSE, \(s^2\), to estimate \(\sigma^2\), we can show that \[\frac{Y^*-\hat Y}{\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}}\sim t_{n-2}.\]
From this we can develop the following prediction interval definition.
A \(100(1-\alpha)\%\) prediction interval on a future observation \(Y^*\) at the value \(x=x^*\) is given by \[\begin{eqnarray*} \hat Y \pm t^*\sqrt{s^2\left[1+\frac1n+\frac{(x^*-\overline x)^2}{\sum(x-\bar x)^2}\right]}. \end{eqnarray*}\] Note that the prediction interval is wider than the confidence interval for the mean response \(\mu_{y|x^*}\).
Let us calculate a 95% prediction interval for \(Y^*\) when \(x=19\).
x.star <- 19; y.hat <- b0+b1*x.star
se.y.hat <- s*sqrt(1+1/n+(x.star-mean(Crycount))^2/((n-1)*var(Crycount)))
t.star <- qt(0.975,n-2)
y.hat + c(-1,1)*t.star*se.y.hat
## [1] 83.64541 155.62126
We may also use the R function predict:
new.x <- data.frame(Crycount=19)
predict(crying_lm, new.x, level=0.95, interval="prediction")
## fit lwr upr
## 1 119.6333 83.64541 155.6213
We may visualize the difference between the confidence interval of \(\mu_{y|x^*}\) and the prediction interval of \(Y^*\).
all<- seq(min(Crycount),max(Crycount)); new.data <- data.frame(Crycount=all)
CI.all <- predict(crying_lm,new.data,interval="confidence");PI.all <- predict(crying_lm,new.data,interval="prediction")
plot(crying,pch=16,cex=1)
lines(all,CI.all[,2],lwd=2,lty=3,col=2); lines(all,CI.all[,3],lwd=2,lty=3,col=2)
lines(all,PI.all[,2],lwd=2,lty=2,col=4); lines(all,PI.all[,3],lwd=2,lty=2,col=4)
legend("topleft", legend=c("Confidence Interval", "Prediction Interval"), lty=c(3,2),col=c(2,4), cex=0.8)
Fitting a regression model requires several assumptions.
The residuals from a regression model are \(e_i=y_i-\hat y_i, i=1,2,\dots,n,\) where \(y_i\) is an actual observation and \(\hat y_i\) is the corresponding fitted value from the regression model.
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
If all assumptions of a linear regression hold, the pattern of the residual plot will be an unstructured horizontal band centered at 0 and symmetric about 0, as in Figure (a).
A curved pattern, like the one in Figure (b), indicates that the relationship between the response and the explanatory variable is curved rather than linear.
A fan-shaped pattern, like the one in Figure (c), shows that the variation of the response about the least-squares line increases as the explanatory variable increases; a simulated illustration of the three patterns follows below.
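Since the referenced figure is not reproduced here, the following sketch generates the three patterns from simulated (made-up) data:
set.seed(3)                                   # simulated data, for illustration only
x_r <- runif(100, 0, 10)
r_a <- rnorm(100)                             # (a) unstructured horizontal band
y_curve <- x_r^2 + rnorm(100, sd = 2)
r_b <- resid(lm(y_curve ~ x_r))               # (b) curved: a line fit to a quadratic trend
r_c <- rnorm(100, sd = 0.3 * x_r)             # (c) fan shape: spread grows with x
par(mfrow = c(1, 3), pty = "s")
plot(x_r, r_a, main = "(a)"); abline(h = 0, col = 2)
plot(x_r, r_b, main = "(b)"); abline(h = 0, col = 2)
plot(x_r, r_c, main = "(c)"); abline(h = 0, col = 2)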
Let us create a residual plot for the Crying and IQ example.
crying_rs <- resid(crying_lm)
plot(Crycount,crying_rs,pch=16,col=4, ylab="Residual")
abline(h=0,lwd=2,col=2)
The residuals form an unstructured horizontal band, roughly symmetric about 0, and no curved pattern is observed.
par(mfrow=c(1,2),pty="s")
hist(crying_rs,freq=F); lines(seq(-40,60,0.01),dnorm(seq(-40,60,0.01),mean(crying_rs),sd(crying_rs)),col=2,lwd=2)
qqnorm(crying_rs);qqline(crying_rs,col=2)
Recall that we fitted a linear regression model and called it empathy.fit. The explanatory variable is the empathy score, Emp. Now, pull out the residuals and create a residual plot, a histogram, and a QQ plot of the residuals.
par(mfrow=c(1,3),pty="s")
empathy_rs <- resid(empathy.fit)
plot(Emp,empathy_rs,pch=16,col=4, ylab="Residual"); abline(h=0,lwd=2,col=2)
hist(empathy_rs,freq=F); lines(seq(-0.5,0.7,0.01),dnorm(seq(-0.5,0.7,0.01),mean(empathy_rs),sd(empathy_rs)),col=4,lwd=2)
qqnorm(empathy_rs);qqline(empathy_rs,col=4)
We can also use the Shapiro-Wilk test to check the normality assumption:
shapiro.test(crying_rs)
##
## Shapiro-Wilk normality test
##
## data: crying_rs
## W = 0.96496, p-value = 0.2741
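For completeness, the same test can be applied to the residuals of the empathy model extracted above:
shapiro.test(empathy_rs)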
We have seen the use of analysis of variance tables in grouped designs. However, their use is not restricted to these designs but applies to the whole class of linear models.
The variation between and within groups in a one-way analysis of variance generalizes to model variation and residual variation:
\[ \begin{align*} SSR&=\sum_{i=1}^n (\hat y_i-\bar y)^2\\ SSE&=\sum_{i=1}^n (y_i-\hat y_i)^2 \end{align*} \]
The role of the group means in a one-way analysis of variance is taken over by the fitted values \(\hat y_i\) in the more general linear model.
An F test for the significance of the model is available in direct analogy with the one-way ANOVA F test.
In multiple linear regression, the null hypothesis is “all the slopes are zero”, i.e. \(\beta_1=\beta_2=\cdots=\beta_k=0\).
In simple linear regression, this test is equivalent to testing that the regression coefficient \(\beta_1\) is zero.
anova(crying_lm)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## Crycount 1 2877.5 2877.48 9.3972 0.004105 **
## Residuals 36 11023.4 306.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypotheses: \(H_0: \beta_1=0\) versus \(H_a: \beta_1\ne 0\)
Test statistic:
\[
\begin{align*}
F=&\frac{SSR/DFR}{SSE/DFE}=\frac{MSR}{MSE}\\
=&\frac{2877.5/1}{11023.4/36}=\frac{2877.48}{306.21}=9.3972
\end{align*}
\]
The p-value is \(P(F_{DFR,DFE}>F)\).
p_val <- 1 - pf(9.3972,1,36); p_val
## [1] 0.004105345
Conclusion: Since the p-value is 0.004 < 0.05, reject \(H_0\). The model is significant (useful) at 0.05 level.
Since the model utility test is equivalent to testing that \(\beta_1=0\) in simple linear regression, we may pull out the t test result and compare the two.
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Note that the p-value for the \(t\) statistic for \(\beta_1\) is the same as the p-value of the \(F\) statistic from the utility test (in fact, in simple linear regression \(F=t^2\)).
Now, conduct a utility F test for the Empathy data set. The model object is empathy.fit.
What is the F statistic? How about the p-value? Do you think the model is “useful” in predicting the response?
anova(empathy.fit)
## Analysis of Variance Table
##
## Response: Brain
## Df Sum Sq Mean Sq F value Pr(>F)
## Emp 1 0.30158 0.301584 5.0568 0.04115 *
## Residuals 14 0.83494 0.059639
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Any alternative way to test for utility?
summary(empathy.fit)
##
## Call:
## lm(formula = Brain ~ Emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35257 -0.14980 0.02289 0.14572 0.44495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.057750 0.188326 -0.307 0.7636
## Emp 0.007613 0.003385 2.249 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2442 on 14 degrees of freedom
## Multiple R-squared: 0.2654, Adjusted R-squared: 0.2129
## F-statistic: 5.057 on 1 and 14 DF, p-value: 0.04115
The analysis of variance (ANOVA) identity is as follows: \[\underbrace{\sum_{i=1}^n(Y_i-\overline Y)^2}_{SST}=\underbrace{\sum_{i=1}^n(\hat Y_i-\overline Y)^2}_{SSR}+\underbrace{\sum_{i=1}^n(Y_i-\hat Y_i)^2}_{SSE}\]
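A direct numerical check of the identity for the Crying and IQ data:
ss_t <- sum((IQ - mean(IQ))^2)                  # total sum of squares (SST)
ss_r <- sum((fitted(crying_lm) - mean(IQ))^2)   # regression sum of squares (SSR)
ss_e <- sum(resid(crying_lm)^2)                 # error sum of squares (SSE)
c(ss_t, ss_r + ss_e)                            # the two values agree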
Coefficient of Determination, denoted by \(R^2\), is often used to judge the adequacy of a regression model. \[R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\]
anova(crying_lm)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## Crycount 1 2877.5 2877.48 9.3972 0.004105 **
## Residuals 36 11023.4 306.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table, we can extract information and calculate \(R^2\) manually.
SSR=2877.5; SSE=11023.4
SST=SSR+SSE
R_sq=SSR/SST; R_sq
## [1] 0.207001
summary(crying_lm)
##
## Call:
## lm(formula = IQ ~ Crycount)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.126 -11.426 -2.126 10.860 51.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.268 8.934 10.216 3.5e-12 ***
## Crycount 1.493 0.487 3.065 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 36 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.185
## F-statistic: 9.397 on 1 and 36 DF, p-value: 0.004105
Let us also verify the relationship between the coefficient of determination (\(R^2\)) and the correlation \(r\).
r=cor(Crycount,IQ);r
## [1] 0.4549725
r^2; R_sq
## [1] 0.207
## [1] 0.207001
What is the coefficient of determination for the linear model for Empathy? Verify that \(R^2=r^2.\)
r<-cor(Emp,Brain);r;r^2
## [1] 0.5151268
## [1] 0.2653556
summary(empathy.fit)$r.squared
## [1] 0.2653556
detach(empathy); detach(crying); detach(fatgain)