Linear Regression

Motivation

Mathematical model

Statistical model

Simple linear regression

Multiple linear regression

Collinearity

Interactions

Regression diagnostic

Qualitative predictors

Model selection

Motivation

Consider the advertising data set, which consists of the sales of some product (in thousands of units) in 200 different markets, along with advertising budgets (in thousands of dollars) for the product in each of those markets for three different media: TV, radio, and newspaper.

Our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

Motivation

attach(read.csv('Advertising.csv'))
par(mfrow=c(1,3))
plot(TV,Sales,cex.lab=2,cex.axis=1.2)
plot(Radio,Sales,cex.lab=2,cex.axis=1.2)
plot(Newspaper,Sales,cex.lab=2,cex.axis=1.2)

Motivation

Questions we might ask:

  1. Is there a relationship between advertising budget and sales?

  2. How strong is the relationship between advertising budget and sales?

  3. Which media contribute to sales?

  4. How accurately can we estimate the effect of each medium on sales?

  5. How accurately can we predict future sales?

  6. Is the relationship linear?

  7. Is there synergy (interaction) among the advertising media?

Mathematical model

We make a (sometimes huge) simplification of reality and assume that the response variable is a linear function of the covariates, for example:

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)Radio

or

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)TV

or

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)Newspaper

or even

‘true’ Sales=\(\beta_0+\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\beta_3\times\)Newspaper

Mathematical model

RADIO=seq(1,10)
beta0=2; beta1=1
true_sales=beta0+beta1*RADIO
plot(RADIO,true_sales);abline(beta0,beta1)

Statistical model

However, being statisticians, we don’t really believe in a deterministic relation. Therefore, we have in mind a stochastic relation in which we can only observe a noisy version of the ‘true’ sales, that is

Sales= ‘true’ Sales + error

and we denote this unobserved error (noise) by \(\epsilon\) to get

\[ Sales=\beta_0+\beta_1\times Radio+\epsilon. \]

In general, we use \(Y\) for the response (label) and \(X_1, X_2, X_3,...,X_p\) for the \(p\) covariates (features) and then the statistical model takes the form

\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_p X_p+\epsilon\]

Statistical model

set.seed(123)
epsilon=rnorm(10)
SALES=true_sales+epsilon
plot(RADIO,SALES);abline(beta0,beta1)

Simple linear regression

In simple linear regression we consider only one covariate, say \(X\).

We assume that we have at hand \(n\) sample points of the response \(Y\) and the covariate \(X\) and write

\[Y_i=\beta_0+\beta_1 X_i+\epsilon_i,\,\,i=1,\dots,n.\]

Our goal is to estimate the unknown constants \(\beta_0, \beta_1\).

Using the \(n\) observations, we will construct some estimators, denoted by \(\hat\beta_0,\hat\beta_1\). Then we can predict a future response (e.g. sales) by

\[\hat y=\hat\beta_0+\hat\beta_1 x,\]

where \(\hat y\) indicates a prediction of \(Y\) on the basis of \(X = x\).

The hat symbol denotes an estimated value.

Simple linear regression

So how to construct estimators \(\hat\beta_0,\hat\beta_1\) ?

We follow the least squares approach.

Let \(\hat y_i=\hat\beta_0+\hat\beta_1 x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\) and define

\[e_i = y_i-\hat y_i.\]

The term \(e_i\) represents the \(i\)th residual, i.e., the difference between an observation and its estimated value. Actually, it is an empirical version of the unobserved noise \(\epsilon_i\).

Define the residual sum of squares (RSS) as \(RSS = e_1^2+ e_2^2+\cdots+e_n^2\), or equivalently as

\[RSS = (y_1-\hat\beta_0-\hat\beta_1 x_1)^2+ (y_2-\hat\beta_0-\hat\beta_1 x_2)^2+\cdots+(y_n-\hat\beta_0-\hat\beta_1 x_n)^2.\]

The least squares approach chooses \(\hat\beta_0,\hat\beta_1\) to minimize the RSS.

Simple linear regression

It can be shown that in the case of simple linear regression we have

\[ \begin{aligned} \hat\beta_1&=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2},\\ \hat\beta_0&=\bar y-\hat\beta_1\bar x, \end{aligned} \]

where \(\bar y=\frac{1}{n}\sum_{i=1}^ny_i\) and \(\bar x=\frac{1}{n}\sum_{i=1}^nx_i\) are the sample means.
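
As a quick check, these closed-form estimates can be computed directly; a minimal sketch using the attached Advertising data with TV as the single covariate:

# Least squares estimates from the formulas above; should match coef(lm(Sales~TV))
b1=sum((TV-mean(TV))*(Sales-mean(Sales)))/sum((TV-mean(TV))^2)
b0=mean(Sales)-b1*mean(TV)
c(b0,b1)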

We execute the three linear regressions corresponding to the three advertising media, as follows:

lm.tv=lm(Sales~TV)
lm.radio=lm(Sales~Radio)
lm.newspaper=lm(Sales~Newspaper)

Simple linear regression

par(mfrow=c(1,3))
plot(TV,Sales,cex.lab=2,cex.axis=1.2);abline(lm.tv, col="blue", lty=1, lwd=2)
plot(Radio,Sales,cex.lab=2,cex.axis=1.2);abline(lm.radio, col="blue", lty=1, lwd=2)
plot(Newspaper,Sales,cex.lab=2,cex.axis=1.2);abline(lm.newspaper, col="blue", lty=1, lwd=2)

Simple linear regression

To see the estimates, we execute

coef(lm.tv)
## (Intercept)          TV 
##  7.03259355  0.04753664
coef(lm.radio)
## (Intercept)       Radio 
##   9.3116381   0.2024958
coef(lm.newspaper)
## (Intercept)   Newspaper 
##  12.3514071   0.0546931

Simple linear regression

So, we have fitted a straight line to the data.

Actually, thanks to our model, we have a nice interpretation:

“a unit change in the covariate \(X\) is associated with a \(\beta_1\) change in the mean of the response \(Y\)”.

This statement reminds us that the model we have in mind is

\[Y=E(Y|X)+\epsilon.\]

In simple linear regression, we assumed that the conditional expectation of \(Y\) given \(X\) has a very simple linear form:

\[E(Y|X)=\beta_0+\beta_1 X.\]

Simple linear regression

So we can estimate parameters and can even predict future observations.

The next natural step would be to obtain some quantitative understanding, such as:

  1. How accurate are the estimators obtained for the intercept and slope ?

  2. Is there a relationship between \(X\) and \(Y\) ?

NOTE: Claims of causality should be avoided for observational data.

  3. How accurate is our prediction for the regression line \(E(Y|X)\) ?

  4. How accurately can we predict a future observation \(Y\) ?

Simple linear regression

How accurate are the estimators obtained for the intercept and slope ?

Thanks to the statistical framework we adopted, we can assess the accuracy of our method.

For instance, if we assume that \(E[\epsilon_i]=0\) for all \(i=1,\dots,n\), then we have \(E[\hat\beta_0]=\beta_0\) and \(E[\hat\beta_1]=\beta_1\).

Furthermore, assuming that \(Var[\epsilon_i]=\sigma^2\) for all \(i=1,\dots,n\), and \(cov(\epsilon_i,\epsilon_j)=0\) for all \(i\neq j\), we can show that

\[ \begin{aligned} Var[\hat\beta_1]&=\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar x)^2},\\ Var[\hat\beta_0]&=\sigma^2\Big[\frac{1}{n}+\frac{\bar x^2}{\sum_{i=1}^n(x_i-\bar x)^2}\Big]. \end{aligned} \]

And if we are willing to assume that the distribution of the errors has some specific form (e.g. normal), then we can use the variances to compute confidence intervals.
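
To illustrate these formulas, here is a minimal simulation sketch using the toy RADIO grid defined earlier (with \(\beta_0=2\), \(\beta_1=1\) and \(\sigma=1\)):

# Monte Carlo check of unbiasedness and of the variance formula for beta1-hat
set.seed(1)
sigma=1
beta1.hat=replicate(10000, {
  y=2+1*RADIO+rnorm(length(RADIO),sd=sigma)   # simulate from the model
  coef(lm(y~RADIO))[2]
})
mean(beta1.hat)                               # close to the true beta1 = 1
var(beta1.hat)                                # close to the theoretical value below
sigma^2/sum((RADIO-mean(RADIO))^2)            # Var[beta1-hat] from the formula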

Simple linear regression

A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.

Loosely speaking, there is approximately a 95% chance that the interval

\[ \Big[\hat\beta_1-2\sqrt{Var[\hat\beta_1]},\hat\beta_1+2\sqrt{Var[\hat\beta_1]}\Big] \]

will contain the true value of \(\beta_1\) (under a scenario where we got repeated samples like the present sample).
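
A minimal sketch of this rough interval for the TV slope, using the estimate and standard error reported by summary() (compare with the exact confint() output on the next slide):

# Approximate 95% interval: estimate +/- 2 standard errors
est=summary(lm.tv)$coefficients["TV",]
est["Estimate"]+c(-2,2)*est["Std. Error"]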

Simple linear regression

confint(lm.tv, level=0.95)
##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256
confint(lm.radio, level=0.95)
##                 2.5 %     97.5 %
## (Intercept) 8.2015885 10.4216877
## Radio       0.1622443  0.2427472
confint(lm.newspaper, level=0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.12595560 13.57685854
## Newspaper    0.02200549  0.08738071

Simple linear regression

Now we address the question: Is there a relationship between \(X\) and \(Y\) ?

We answer this question using the notion of statistical testing:

\(H_0\): There is no relationship between \(X\) and \(Y\)

versus the alternative hypothesis

\(H_1\): There is some relationship between \(X\) and \(Y\)

Mathematically, this corresponds to testing

\(H_0: \beta_1 = 0\)

\(H_1: \beta_1 \neq 0\)

This makes sense since if \(\beta_1 = 0\) then the model reduces to \(Y =\beta_0 + \epsilon\), and consequently, \(X\) is not associated with \(Y\).

Simple linear regression

beta1=0
true_sales=beta0+beta1*RADIO
SALES=true_sales+epsilon
plot(RADIO,SALES);abline(beta0,beta1)

Simple linear regression

In order to execute a statistical test, we need a test statistic.

Assuming that the distribution of the errors is normal, the statistic \(\frac{\hat\beta_1-0}{\sqrt{Var[\hat\beta_1]}}\) will have a normal distribution as well.

However, recall that \(Var[\hat\beta_1]\) depends on \(\sigma^2\) which is usually not known. Thus, we estimate it by \[ \hat\sigma^2=\frac{RSS}{n-2}=\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{n-2}.\]

This is an unbiased estimator of \(\sigma^2\) and we use it to obtain

\[ \widehat{Var}[\hat\beta_1]=\frac{\hat\sigma^2}{\sum_{i=1}^n(x_i-\bar x)^2}. \]

We define the Residual Standard Error: \(RSE=\sqrt{\hat\sigma^2}\).
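
A quick sketch of this estimator for the TV model:

# RSE computed from the residuals; should match summary(lm.tv)$sigma (about 3.26)
n=length(Sales)
sqrt(sum(residuals(lm.tv)^2)/(n-2))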

Simple linear regression

To test the null hypothesis, we compute the \(t\)-statistic, given by

\[t =\frac{\hat\beta_1-0}{\sqrt{\widehat{Var}[\hat\beta_1]}}.\]

This statistic will have a \(t\)-distribution with \(n-2\) degrees of freedom, assuming \(\beta_1 = 0\).

We can calculate the probability of observing any value equal to \(|t|\) or larger, a quantity that we call p-value.

Another quantity that enables us to test the hypothesis is the \(F\) statistic given by

\[ F=\frac{SSreg}{\hat\sigma^2}=\frac{SSreg}{RSS/(n-2)}, \]

where \(SSreg=TSS-RSS=\sum_{i=1}^n(\hat y_i-\bar y)^2\). If \(\beta_1=0\) then \(F\sim F_{1,n-2}\), namely, the \(F\) statistic will have \(F\) distribution with \(1\) and \(n-2\) degrees of freedom.
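
As a sketch, both statistics can be reproduced from these formulas for the TV model:

# t and F statistics for Sales ~ TV, computed from the formulas above
n=length(Sales)
sigma2.hat=sum(residuals(lm.tv)^2)/(n-2)
se.b1=sqrt(sigma2.hat/sum((TV-mean(TV))^2))
t.stat=coef(lm.tv)["TV"]/se.b1
t.stat                                  # about 17.7
2*pt(-abs(t.stat),df=n-2)               # two-sided p-value
SSreg=sum((fitted(lm.tv)-mean(Sales))^2)
SSreg/sigma2.hat                        # F statistic, about 312 (equal to t.stat^2 here)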

Simple linear regression

Another useful quantity is the R-squared or fraction of variance explained by the model,

\[ R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}, \]

where \(TSS=\sum_{i=1}^n(y_i-\bar y)^2\) is the Total Sum of Squares.

It can be shown that in simple linear regression setting we have \(R^2=r^2\), where \(r\) is the correlation between \(X\) and \(Y\) :

\[ r=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)} {\sqrt{\sum_{i=1}^n(x_i-\bar x)^2}\sqrt{\sum_{i=1}^n(y_i-\bar y)^2}}. \]
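
A quick sketch verifying this identity for the TV regression:

# R-squared from its definition and as the squared correlation
TSS=sum((Sales-mean(Sales))^2)
RSS=sum(residuals(lm.tv)^2)
1-RSS/TSS                 # about 0.61
cor(TV,Sales)^2           # same value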

In the next slide we see how to extract all the information above using R.

Simple linear regression

summary(lm.tv)
## 
## Call:
## lm(formula = Sales ~ TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Simple linear regression

A classical notion is that of “Analysis of Variance Table”:

anova(lm.tv)
## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## TV          1 3314.6  3314.6  312.14 < 2.2e-16 ***
## Residuals 198 2102.5    10.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In words, we can say that the actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.

In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3260/14000 = 23%.
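
A one-line sketch of this percentage error:

summary(lm.tv)$sigma/mean(Sales)    # RSE relative to mean sales, about 0.23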

Simple linear regression

How accurate is our prediction for the regression line \(E(Y|X)\) ?

Here also we use confidence intervals. For example,

predict(lm.tv, data.frame(TV=100), interval="confidence")
##        fit      lwr     upr
## 1 11.78626 11.26782 12.3047

In words:

If we collect a large number of data sets like the Advertising data set, and we construct a confidence interval for the average sales on the basis of each data set (given $100,000 in TV), then 95% of these confidence intervals will contain the true value of average sales.

So, 95% of intervals of this form will contain the true value of \(E(Y|X=100)\), the average sales given a TV budget of $100,000.

Simple linear regression

How accurately can we predict a future observation \(Y\) ?

We will use prediction intervals.

predict(lm.tv, data.frame(TV=100), interval="prediction")
##        fit      lwr      upr
## 1 11.78626 5.339251 18.23326

A prediction interval can be used to quantify the uncertainty surrounding sales for a particular city. Given that $100,000 is spent on TV advertising in that city, the 95% prediction interval is [5.34, 18.23].

So, 95% of intervals of this form will contain the true value of \(Y\) for this city.

Note: both intervals are centered at 11,786 products, but the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

Multiple linear regression

Here our model takes the form

\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_p X_p+\epsilon.\]

We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\) , holding all other predictors fixed.

Here also, the estimates for the parameters \(\beta_0,\beta_1,\dots,\beta_p\) will be the values that minimize the sum of squared residuals

\[ \begin{aligned} RSS&=\sum_{i=1}^n(y_i-\hat y_i)^2 =\sum_{i=1}^n(y_i-\hat\beta_0-\hat\beta_1 x_{i1}-\cdots-\hat\beta_p x_{ip})^2. \end{aligned} \]

Given the multiple least squares regression coefficient estimates \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) we can make predictions using the formula

\[ \hat y=\hat\beta_0+\hat\beta_1 x_1+\cdots+\hat\beta_p x_p. \]
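
The minimizer can also be written in matrix form as \(\hat\beta=(X^TX)^{-1}X^Ty\); a minimal sketch for the full advertising model, which should agree with coef(lm(Sales~TV+Radio+Newspaper)):

# Normal equations solution for the full model
X=cbind(1,TV,Radio,Newspaper)        # design matrix with an intercept column
solve(t(X)%*%X, t(X)%*%Sales)        # least squares estimates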

Multiple linear regression

In the advertising example, the model becomes

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\beta_3\times\)Newspaper+\(\epsilon\),

or

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Newspaper+\(\epsilon\),

or

Sales=\(\beta_0\)+\(\beta_1\times\)Radio+\(\beta_2\times\)Newspaper+\(\epsilon\).

or

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\epsilon\).

Consider the last model. We see the data and an estimated surface fit in the next two figures.

Multiple linear regression

(Interactive 3D figure: the Sales data plotted against the TV and Radio budgets.)

Multiple linear regression

(Interactive 3D figure: the least squares regression surface fitted to Sales as a function of TV and Radio.)

Multiple linear regression

Now consider the ‘full’ model:

\[ Sales=\beta_0+\beta_1\times TV+\beta_2\times Radio+\beta_3\times Newspaper+\epsilon \]

for which we obtain,

coef(lm(Sales~TV+Radio+Newspaper))
##  (Intercept)           TV        Radio    Newspaper 
##  2.938889369  0.045764645  0.188530017 -0.001037493

For a given amount of TV and newspaper advertising, spending an additional $1,000 on radio advertising is associated with an increase in sales of approximately 189 units.

Is at least one of the predictors \(X_1,X_2,\dots,X_p\) useful in predicting the response ?

Here also we can use the F-statistic to answer this question.

Multiple linear regression

The statistic

\[ F=\frac{SSreg/p}{RSS/(n-p-1)}\sim F_{p,n-p-1}, \]

is used to test

\[ \begin{aligned} H_0&: \beta_1 =\beta_2=\cdots=\beta_p= 0, \\ H_1&: \text{at least one } \beta_j\neq 0. \end{aligned} \]

In the next slide we see the resulting F-statistic for this model.

We will also see that Newspaper is not significant in the full model; however, it is significant in a model with Newspaper as the only predictor. A possible explanation is based on analysing the collinearity between the covariates.
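
As a sketch, the overall F statistic for the full model can be computed directly from these sums of squares (compare with the summary() output on the next slide):

# Overall F statistic for Sales ~ TV + Radio + Newspaper
fit=lm(Sales~TV+Radio+Newspaper)
n=length(Sales); p=3
TSS=sum((Sales-mean(Sales))^2)
RSS=sum(residuals(fit)^2)
((TSS-RSS)/p)/(RSS/(n-p-1))          # about 570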

Multiple linear regression

summary(lm(Sales~TV+Radio+Newspaper))
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Multiple linear regression

## 
## Call:
## lm(formula = Sales ~ Newspaper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## Newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

Collinearity

We look at the correlation matrix

cor(read.csv('Advertising.csv'))
##                   TV      Radio  Newspaper     Sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## Radio     0.05480866 1.00000000 0.35410375 0.5762226
## Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## Sales     0.78222442 0.57622257 0.22829903 1.0000000

and conclude that newspaper advertising is a surrogate for radio advertising; newspaper gets ‘credit’ for the effect of radio advertising on sales.

Collinearity

Note: it is possible for collinearity to exist between three (or more) covariates even if no pair of covariates has a particularly high correlation. This is a multicollinearity situation.

Checking for multicollinearity can be done using the variance inflation factor (VIF):

suppressWarnings(library(car))
vif (lm(Sales~TV+Radio+Newspaper) )
##        TV     Radio Newspaper 
##  1.004611  1.144952  1.145187

For a given covariate, this indicator is based on a regression of this specific covariate onto all of the other predictors. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

Two simple solutions: drop one of the problematic variables or combine the collinear variables together into a single predictor.
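
A sketch of the VIF for Newspaper computed via the standard formula \(VIF_j=1/(1-R_j^2)\), where \(R_j^2\) comes from regressing Newspaper on the other predictors:

# VIF for Newspaper without the car package; should match vif() above (about 1.145)
r2.news=summary(lm(Newspaper~TV+Radio))$r.squared
1/(1-r2.news)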

Collinearity

The ideal scenario is when the predictors are uncorrelated, namely, a balanced design.

Correlations amongst predictors cause problems, as we saw with Newspaper above.

Claims of causality should be avoided for observational data.

Interactions

So far our model assumes that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.

Now we also check for interactions (non-additive relations), for example:

\[ Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times (Radio\times TV) + \epsilon \]

Which is the same as: \[ Sales= \beta_0 + (\beta_1 + \beta_3 \times Radio) \times TV + \beta_2 \times Radio + \epsilon. \]

and as:

\[ Sales= \beta_0 + \beta_1 \times TV +(\beta_2 + \beta_3 \times TV) \times Radio + \epsilon. \]

Interactions

\[ Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times (Radio\times TV) + \epsilon \]

coef(lm(Sales~TV*Radio))
## (Intercept)          TV       Radio    TV:Radio 
## 6.750220203 0.019101074 0.028860340 0.001086495

One interpretation: \(Sales= \beta_0 + (\beta_1 + \beta_3 \times Radio) \times TV + \beta_2 \times Radio + \epsilon.\)

In words: an increase in \(TV\) advertising of $1,000 is associated with increased \(Sales\) of \(19+1.1 \times Radio\) units.

Another interpretation: \(Sales= \beta_0 + \beta_1 \times TV +(\beta_2 + \beta_3 \times TV) \times Radio + \epsilon.\)

In words: an increase in \(Radio\) advertising of $1,000 will be associated with an increase in \(Sales\) of \(29 + 1.1\times TV\) units.

Interactions

summary(lm(Sales~TV*Radio))
## 
## Call:
## lm(formula = Sales ~ TV * Radio)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

Interactions

The results suggest that interactions are important.

The p-value for the interaction term \(TV\times Radio\) is extremely low, indicating that there is strong evidence for \(\beta_3\ne 0\).

The \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts \(Sales\) using \(TV\) and \(Radio\) without an interaction term:

1-sum(lm(Sales~TV+Radio)$residuals^2)/sum((Sales-mean(Sales))^2)
## [1] 0.8971943

We used the formula \(R^2=1-RSS/TSS\), where \(RSS=\sum_{i=1}^n e_i^2\) and \(TSS=\sum_{i=1}^n(y_i-\bar y)^2\).

Note: the hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

Regression diagnostic

We will use residual analysis in order to check for: normality of the errors, linearity and constant variance, independence of the errors, and outliers or high-leverage points.

Anscombe’s 4 regression data sets

Anscombe’s 4 regression data sets

## $lm1
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
## 
## $lm2
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
## 
## $lm3
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
## 
## $lm4
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602

Anscombe’s 4 regression data sets

## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anscombe’s 4 regression data sets

## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression diagnostic

Errors are normally distributed.

We will check this assumption using q-q plots.

Under the normality assumption, the residuals are normally distributed. We will plot the residuals versus the theoretical quantiles of the normal distribution.

We can also use studentized residuals, which are obtained by scaling the original residuals. Externally studentized residuals are sometimes called jackknifed residuals; if the errors are normally distributed, they follow a t-distribution with \(n-p-2\) degrees of freedom. We will plot the studentized residuals versus the theoretical quantiles of the t-distribution.

If the errors are not normal, the least squares estimates may not be optimal. They will still be best linear unbiased estimates, but other robust estimators may be more effective. Tests and confidence intervals may not be exact. Long-tailed distributions, in particular, cause large inaccuracies.

Regression diagnostic

qqnorm(residuals(lm(Sales~TV+Radio)), ylab="Residuals") 
qqline(residuals(lm(Sales~TV+Radio)))

Regression diagnostic

suppressWarnings(library(car))
qqPlot(lm(Sales~TV+Radio))

Regression diagnostic

Several distributions vs. theoretical standard normal quantiles.

Regression diagnostic

Several distributions vs. theoretical standard normal quantiles.

Regression diagnostic

The response \(Y\) is linear in the parameters.

The errors have constant variance and are symmetric around zero.

When the model includes an intercept, the least squares residuals satisfy \(\sum_{i=1}^ne_i=0\), and under the model assumptions, the residuals \(e_i=y_i-\hat y_i\) are normally distributed (but not independent).

Thus, if the model is correct, then we would expect to see the residuals scattered randomly around their mean, zero.

This can be checked by plotting the residuals as a function of the fitted values \(\hat y_i\).

The same plot enables us to inspect whether the variance of the residuals is roughly constant across observations.

Regression diagnostic

plot(lm(Sales~TV+Radio), which=1)

Regression diagnostic

Transformation of variables can sometimes solve problems of nonlinearity. We discuss here one example. The problem of heterogeneous variance can be solved by using transformations or weighted least squares.

In order to assess the quality of manual printing, a company collected data on the average cost of printing a document (in cents) over 45 days. Data taken from here.

Regression diagnostic

Several transformations can be considered.

Regression diagnostic

summary(lm(Cost~Days))$r.squared
## [1] 0.7375462
summary(lm(log(Cost)~Days))$r.squared
## [1] 0.8174798
summary(lm(log(Cost)~sqrt(Days)))$r.squared
## [1] 0.8588231
summary(lm(log(Cost)~log(Days)))$r.squared
## [1] 0.8434075

Regression diagnostic

For simplicity, we choose the transformation with the highest \(R^2\) (we could also check model assumptions, etc.).

The theoretical model in that case is \[log(Cost)=\beta_0+\beta_1\sqrt{Days}+\epsilon.\]

If we want to predict the cost for day 40, then we should use \(\sqrt{40}=6.324555\) for the covariate.

logCost=log(Cost)
sqrDays=sqrt(Days)
predict(lm(logCost~sqrDays),data.frame(sqrDays=sqrt(40)))
##        1 
## 2.173689

To obtain the predicted cost for day \(40\) we transform back and get \(\exp(2.17)=8.79\).
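
The back-transformation can also be done in one step (a sketch, using the same logCost and sqrDays variables as above):

exp(predict(lm(logCost~sqrDays),data.frame(sqrDays=sqrt(40))))   # about 8.79 cents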

Note: we can also use polynomial regression in order to deal with non-linear effects of predictors.

Regression diagnostic

The errors are independent.

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms.

If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be.

For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be.

This could cause us to erroneously conclude that a parameter is statistically significant.

If the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Regression diagnostic

In general, we use the runs test (if we deal with time series data, then we can also plot successive residuals to inspect for correlations). A “run” of a sequence is a maximal non-empty segment of the sequence consisting of adjacent equal elements. For example, the 16-element sequence “+ + - - - + + + - - - - - - + +” consists of 5 runs, 3 of which consist of plus signs and 2 of minus signs. If the number of runs is significantly higher or lower than expected, the hypothesis of statistical independence of the elements may be rejected. Exact p-values can be calculated; however, when the numbers of positive and negative residuals are each larger than 10, a normal approximation is used:

suppressWarnings(library(lawstat))
runs.test(residuals(lm(Sales~TV+Radio)),alternative ="two.sided")
## 
##  Runs Test - Two sided
## 
## data:  residuals(lm(Sales ~ TV + Radio))
## Standardized Runs Statistic = -0.4253, p-value = 0.6706

Under the null hypothesis, the errors are neither positively nor negatively correlated. Given the large p-value (0.67), we do not reject the null hypothesis.

Regression diagnostic

High leverage points and outliers.

Let \(\hat y\) and \(y\) denote the vectors of fitted values and observations, respectively. It can be shown that

\[ \hat y= Hy \] where \(H\) is called the hat matrix. The leverage of observation \(i\) is the value of the \(i\)th diagonal term, \(h_{ii}\), of the hat matrix, \(H\). The diagonal terms satisfy \(0\leq h_{ii}\leq 1\), \(\sum_{i=1}^nh_{ii}=p+1\), where \(p\) is the number of covariates in the regression model, and \(n\) is the number of observations.

Leverage is a measure of the effect of a particular observation on the regression predictions. A large value of \(h_{ii}\) indicates that the \(i\)th case is distant from the center of all \(X\) values and has more leverage. Because the sum of the leverage values is \(p+1\), an observation \(i\) can be considered a high-leverage point if its leverage substantially exceeds the mean leverage value, \((p+1)/n\); for example, values of \(h_{ii} > 2(p+1)/n\).
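
A sketch computing the hat matrix explicitly for the TV + Radio model and checking the properties above:

# Hat matrix H = X (X'X)^{-1} X' and its diagonal (the leverages)
fit=lm(Sales~TV+Radio)
X=model.matrix(fit)                               # n x (p+1) design matrix
H=X%*%solve(t(X)%*%X)%*%t(X)
all.equal(as.numeric(diag(H)),as.numeric(hatvalues(fit)))   # TRUE
sum(diag(H))                                      # equals p+1 = 3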

Regression diagnostic

In the advertising data, in a model with \(TV\) and \(Radio\) we have \(p=2\) and \(n=200\):

 which(hatvalues(lm(Sales~TV+Radio)) > 1.8*(2+1)/200)
##   3   6   9  36  76 109 176 179 
##   3   6   9  36  76 109 176 179
 which(hatvalues(lm(Sales~TV+Radio)) > 2*(2+1)/200)
##   6 176 
##   6 176
 which(hatvalues(lm(Sales~TV+Radio)) > 3*(2+1)/200)
## named integer(0)

Regression diagnostic

plot(lm(Sales~TV+Radio), which=5)

Regression diagnostic

Cook’s distance measure is a combination of a residual effect and leverage. It shows the influence of each observation on the fitted response values (by calculating the distance, in some sense, between the coefficients of a regression with and without a particular observation):

\[ D_i=\frac{\sum_{j=1}^n(\hat y_j-\hat y_{j(-i)})^2}{(p+1)\hat\sigma^2}, \]

where \(\hat y_{j(-i)}\) is the \(j\)th fitted value computed with observation \(i\) removed and \(\hat\sigma^2=RSS/(n-p-1)\).

Several guidelines exist for an observation to be considered as an outlier, e.g., an observation with Cook’s distance larger than one, or “three times the mean Cook’s distance”.

Note: observation #131 was detected by Cook’s distance but not by leverage only.
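
A sketch of the “three times the mean Cook’s distance” rule for the TV + Radio model:

# Observations whose Cook's distance exceeds three times the mean Cook's distance
d=cooks.distance(lm(Sales~TV+Radio))
which(d>3*mean(d))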

Regression diagnostic

plot(lm(Sales~TV+Radio), which=4)

Qualitative predictors

Some predictors are not quantitative but are qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables.

See for example the scatterplot matrix of the credit card data in the next slide.

In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

The R code used to generate the scatterplot matrix is:

Credit=read.csv('Credit.csv')
attach(Credit)
pairs (~Balance+Age+Cards+Education+Income+Limit+Rating, cex.labels = 1,gap=0,cex.axis=0.5)

Qualitative predictors

Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable:

\[ \begin{equation*} z_i= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is female} \\ 0,\,\text{if i}{\it th}\text{ person is male} \end{array} \end{equation*} \]

The model in that case is:

\[ \begin{equation*} y_i=\beta_0+\beta_1 z_i+\epsilon_i= \bigg\{ \begin{array}{l} \beta_0+\beta_1+\epsilon_i,\,\text{if i}{\it th}\text{ person is female} \\ \beta_0+\epsilon_i,\,\text{if i}{\it th}\text{ person is male} \end{array} \end{equation*} \]

Qualitative predictors

summary(lm(Balance~Gender))
## 
## Call:
## lm(formula = Balance ~ Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -529.54 -455.35  -60.17  334.71 1489.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    509.80      33.13  15.389   <2e-16 ***
## GenderFemale    19.73      46.05   0.429    0.669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared:  0.0004611,  Adjusted R-squared:  -0.00205 
## F-statistic: 0.1836 on 1 and 398 DF,  p-value: 0.6685

Qualitative predictors

With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables:

\[ \begin{equation*} z_{i1}= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is Asian} \\ 0,\,\text{if i}{\it th}\text{ person is not Asian} \end{array} \end{equation*} \] \[ \begin{equation*} z_{i2}= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is Caucasian} \\ 0,\,\text{if i}{\it th}\text{ person is not Caucasian} \end{array} \end{equation*} \]

Then we obtain the model:

\[ \begin{equation*} y_i=\beta_0+\beta_1 z_{i1}+\beta_2 z_{i2}+\epsilon_i= \bigg\{ \begin{array}{l} \beta_0+\beta_1+\epsilon_i,\,\text{if i}{\it th}\text{ person is Asian} \\ \beta_0+\beta_2+\epsilon_i,\,\text{if i}{\it th}\text{ person is Caucasian} \\ \beta_0+\epsilon_i,\,\text{if i}{\it th}\text{ person is AA} \end{array} \end{equation*} \]

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable, African American in this example, is known as the baseline.
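
A quick sketch showing how R encodes such a factor into dummy variables (using the attached Credit data; the first level acts as the baseline):

contrasts(factor(Ethnicity))       # dummy coding used for the ethnicity factor
head(model.matrix(~Ethnicity))     # intercept plus the two dummy variables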

Qualitative predictors

summary(lm(Balance~Ethnicity))
## 
## Call:
## lm(formula = Balance ~ Ethnicity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531.00 -457.08  -63.25  339.25 1480.50 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          531.00      46.32  11.464   <2e-16 ***
## EthnicityAsian       -18.69      65.02  -0.287    0.774    
## EthnicityCaucasian   -12.50      56.68  -0.221    0.826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared:  0.0002188,  Adjusted R-squared:  -0.004818 
## F-statistic: 0.04344 on 2 and 397 DF,  p-value: 0.9575

Qualitative predictors

Interactions between qualitative and quantitative variables.

Suppose that we wish to predict balance using income (quantitative) and student (qualitative).

Without an interaction term, the model takes the form

\[ \begin{equation*} \text{balance}_i\approx \beta_1 \text{income}_i+ \bigg\{ \begin{array}{l} \beta_0+\beta_2,\,\text{if i}{\it th}\text{ person is a student} \\ \beta_0,\,\text{if i}{\it th}\text{ person is not a student} \end{array} \end{equation*} \]

With interactions, the model takes the form

\[ \begin{equation*} \text{balance}_i\approx \bigg\{ \begin{array}{l} (\beta_0+\beta_2)+(\beta_1+\beta_3)\text{income}_i,\,\text{if i}{\it th}\text{ person is a student} \\ \beta_0+\beta_1 \text{income}_i,\,\text{if i}{\it th}\text{ person is not a student} \end{array} \end{equation*} \]

Qualitative predictors

Estimates without interactions:

lm(Balance~Income+Student)$coef
## (Intercept)      Income  StudentYes 
##  211.142964    5.984336  382.670539

Estimates with an interaction term between income and student:

lm(Balance~Income*Student)$coef
##       (Intercept)            Income        StudentYes Income:StudentYes 
##        200.623153          6.218169        476.675843         -1.999151

Qualitative predictors