Consider the advertising data set, which consists of the sales of some product (in thousands of units) in 200 different markets, along with advertising budgets (in thousands of dollars) for the product in each of those markets, for three different media: TV, radio, and newspaper.

Our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

```
attach(read.csv('Advertising.csv'))
par(mfrow=c(1,3))
plot(TV,Sales,cex.lab=2,cex.axis=1.2)
plot(Radio,Sales,cex.lab=2,cex.axis=1.2)
plot(Newspaper,Sales,cex.lab=2,cex.axis=1.2)
```

Questions we might ask:

Is there a relationship between advertising budget and sales?

How strong is the relationship between advertising budget and sales?

Which media contribute to sales?

How accurately can we estimate the effect of each medium on sales?

How accurately can we predict future sales?

Is the relationship linear?

Is there synergy (interaction) among the advertising media?

We make a (sometimes huge) simplification of reality and assume that the response variable is a linear function of the covariates, for example:

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)Radio

or

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)TV

or

‘true’ Sales=\(\beta_0\)+\(\beta_1\times\)Newspaper

or even

‘true’ Sales=\(\beta_0+\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\beta_3\times\)Newspaper

```
RADIO=seq(1,10)
beta0=2; beta1=1
true_sales=beta0+beta1*RADIO
plot(RADIO,true_sales);abline(beta0,beta1)
```

However, being statisticians, we don’t really believe in a deterministic relation. Therefore, we have in mind a stochastic relation in which we can only observe a noisy version of ‘true’ sales.

We denote this **unobserved** error (noise) by \(\epsilon\) to get

\[ Sales=\beta_0+\beta_1\times Radio+\epsilon. \]

In general, we use \(Y\) for the response (label) and \(X_1, X_2, X_3,...,X_p\) for the \(p\) covariates (features) and then the statistical model takes the form

\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_p X_p+\epsilon\]

```
set.seed(123)
epsilon=rnorm(10)
SALES=true_sales+epsilon
plot(RADIO,SALES);abline(beta0,beta1)
```

In simple linear regression we consider only one covariate, say \(X\).

We assume that we have at hand \(n\) sample points of the response \(Y\) and the covariate \(X\) and write

\[ y_i=\beta_0+\beta_1 x_i+\epsilon_i,\qquad i=1,\dots,n. \]

Our goal is to estimate the unknown constants \(\beta_0, \beta_1\).

Using the \(n\) observations, we will construct some estimators, denoted by \(\hat\beta_0,\hat\beta_1\). Then we can predict a future response (e.g. sales) by

\[ \hat y=\hat\beta_0+\hat\beta_1 x, \]

where \(\hat y\) indicates a prediction of \(Y\) on the basis of \(X = x\).

The hat symbol denotes an estimated value.

So how do we construct the estimators \(\hat\beta_0,\hat\beta_1\)?

We follow the least squares approach.

Let \(\hat y_i=\hat\beta_0+\hat\beta_1 x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\) and define

\[e_i = y_i-\hat y_i.\]

The term \(e_i\) represents the \(i\)th residual, i.e., the difference between an observation and its estimated value. Actually, it is an empirical version of the unobserved noise \(\epsilon_i\).

Define the residual sum of squares (RSS) as \(RSS = e_1^2+ e_2^2+\cdots+e_n^2\), or equivalently as

\[RSS = (y_1-\hat\beta_0-\hat\beta_1 x_1)^2+ (y_2-\hat\beta_0-\hat\beta_1 x_2)^2+\cdots+(y_n-\hat\beta_0-\hat\beta_1 x_n)^2.\]

The least squares approach chooses \(\hat\beta_0,\hat\beta_1\) to minimize the RSS.

It can be shown that in the case of simple linear regression we have

\[ \begin{aligned} \hat\beta_1&=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2},\\ \hat\beta_0&=\bar y-\hat\beta_1\bar x, \end{aligned} \]

where \(\bar y=\frac{1}{n}\sum_{i=1}^ny_i\) and \(\bar x=\frac{1}{n}\sum_{i=1}^nx_i\) are the sample means.
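These closed-form estimates are easy to verify numerically. The following sketch (simulated data; all names and coefficients are illustrative) computes them by hand and compares with `lm()`:

```r
# Least squares estimates from the closed-form expressions,
# on simulated data (illustrative, not the Advertising data).
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# The same estimates via lm():
fit <- lm(y ~ x)
rbind(by_hand = c(beta0_hat, beta1_hat), lm = unname(coef(fit)))
```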

We execute the three linear regressions corresponding to the three advertising media, as follows:

```
lm.tv=lm(Sales~TV)
lm.radio=lm(Sales~Radio)
lm.newspaper=lm(Sales~Newspaper)
```

```
par(mfrow=c(1,3))
plot(TV,Sales,cex.lab=2,cex.axis=1.2);abline(lm.tv, col="blue", lty=1, lwd=2)
plot(Radio,Sales,cex.lab=2,cex.axis=1.2);abline(lm.radio, col="blue", lty=1, lwd=2)
plot(Newspaper,Sales,cex.lab=2,cex.axis=1.2);abline(lm.newspaper, col="blue", lty=1, lwd=2)
```

To see the estimates, we execute

`coef(lm.tv)`

```
## (Intercept) TV
## 7.03259355 0.04753664
```

`coef(lm.radio)`

```
## (Intercept) Radio
## 9.3116381 0.2024958
```

`coef(lm.newspaper)`

```
## (Intercept) Newspaper
## 12.3514071 0.0546931
```

So, we fitted a straight line to the data.

Actually, thanks to our model, we have a nice interpretation, such as: an additional $1,000 spent on TV advertising is associated with an increase in sales of approximately 47.5 units.

This statement reminds us that the model we have in mind is

\[Y=E(Y|X)+\epsilon.\]

In simple linear regression, we **assumed** that the conditional expectation of \(Y\) given \(X\) has a very simple linear form:

\[E(Y|X)=\beta_0+\beta_1 X.\]

So we can estimate parameters and can even predict future observations.

The next natural step would be to obtain some quantitative understanding, such as:

How accurate are the estimators obtained for the intercept and slope?

Is there a relationship between \(X\) and \(Y\)?

**NOTE: Claims of causality should be avoided for observational data.**

How accurate is our prediction for the regression line \(E(Y|X)\)?

How accurately can we predict a future observation \(Y\)?

**How accurate are the estimators obtained for the intercept and slope?**

Thanks to the statistical framework we adopted, we can assess the accuracy of our method.

For instance, if we assume that \(E[\epsilon_i]=0\) for all \(i=1,\dots,n\), then we have \(E[\hat\beta_0]=\beta_0\) and \(E[\hat\beta_1]=\beta_1\).

Furthermore, assuming that \(Var[\epsilon_i]=\sigma^2\) for all \(i=1,\dots,n\), and \(cov(\epsilon_i,\epsilon_j)=0\) for all \(i\neq j\), we can show that

\[ \begin{aligned} Var[\hat\beta_1]&=\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar x)^2},\\ Var[\hat\beta_0]&=\sigma^2\Big[\frac{1}{n}+\frac{\bar x^2}{\sum_{i=1}^n(x_i-\bar x)^2}\Big]. \end{aligned} \]
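The variance formula for \(\hat\beta_1\) can be checked by simulation. The sketch below keeps the design fixed and regenerates the noise with \(\sigma=1\) (simulated data; all names are illustrative):

```r
# Monte Carlo check of Var[beta1-hat] = sigma^2 / sum((x - xbar)^2),
# with a fixed design and sigma = 1 (simulated, illustrative).
set.seed(11)
x <- seq(0, 10, length.out = 20)
theoretical_var <- 1 / sum((x - mean(x))^2)   # sigma^2 = 1

beta1_hats <- replicate(5000, {
  y <- 2 + 0.5 * x + rnorm(20)                # new noise each replication
  coef(lm(y ~ x))[2]
})
c(empirical = var(beta1_hats), theoretical = theoretical_var)  # close
```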

And if we are willing to assume that the distribution of the errors has some specific form (e.g. normal), then we can use the variances to compute **confidence intervals**.

**A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter**.

Loosely speaking, there is approximately a 95% chance that the interval

\[ \Big[\hat\beta_1-2\sqrt{Var[\hat\beta_1]},\hat\beta_1+2\sqrt{Var[\hat\beta_1]}\Big] \]

will contain the true value of \(\beta_1\) (under a scenario where we got repeated samples like the present sample).
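This rough “estimate \(\pm 2\) standard errors” recipe can be compared with `confint()`, which uses the exact \(t\) quantile instead of 2. A sketch on simulated data (names illustrative):

```r
# The rough 'estimate +/- 2 standard errors' interval for beta1, compared
# with confint(), which uses the exact t quantile (simulated data).
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 0.3 * x + rnorm(100)
fit <- lm(y ~ x)

se_beta1 <- summary(fit)$coefficients["x", "Std. Error"]
rough_ci <- coef(fit)["x"] + c(-2, 2) * se_beta1
rbind(rough = rough_ci, exact = confint(fit)["x", ])  # nearly identical
```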

`confint(lm.tv, level=0.95)`

```
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
```

`confint(lm.radio, level=0.95)`

```
## 2.5 % 97.5 %
## (Intercept) 8.2015885 10.4216877
## Radio 0.1622443 0.2427472
```

`confint(lm.newspaper, level=0.95)`

```
## 2.5 % 97.5 %
## (Intercept) 11.12595560 13.57685854
## Newspaper 0.02200549 0.08738071
```

Now we address the question: **Is there a relationship between \(X\) and \(Y\)?**

We answer this question using the notion of statistical testing:

\(H_0\): There is no relationship between \(X\) and \(Y\)

versus the alternative hypothesis

\(H_1\): There is some relationship between \(X\) and \(Y\)

Mathematically, this corresponds to testing

\(H_0: \beta_1 = 0\)

\(H_1: \beta_1 \neq 0\)

This makes sense since if \(\beta_1 = 0\) then the model reduces to \(Y =\beta_0 + \epsilon\), and consequently, \(X\) is not associated with \(Y\).

```
beta1=0
true_sales=beta0+beta1*RADIO
SALES=true_sales+epsilon
plot(RADIO,SALES);abline(beta0,beta1)
```

In order to execute a statistical test, we need a test statistic.

Assuming that the distribution of the errors is normal, the statistic \(\frac{\hat\beta_1-0}{\sqrt{Var[\hat\beta_1]}}\) will have a normal distribution as well.

However, recall that \(Var[\hat\beta_1]\) depends on \(\sigma^2\) which is usually not known. Thus, we estimate it by \[ \hat\sigma^2=\frac{RSS}{n-2}=\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{n-2}.\]

This is an unbiased estimator of \(\sigma^2\) and we use it to obtain

\[ \widehat{Var}[\hat\beta_1]=\frac{\hat\sigma^2}{\sum_{i=1}^n(x_i-\bar x)^2}. \]

We define the **Residual Standard Error**: \(RSE=\sqrt{\hat\sigma^2}\).

To test the null hypothesis, we compute the **\(t\)-statistic**, given by

\[t =\frac{\hat\beta_1-0}{\sqrt{\widehat{Var}[\hat\beta_1]}}.\]

This statistic will have a \(t\)-distribution with \(n-2\) degrees of freedom, assuming \(\beta_1 = 0\).

We can then calculate the probability of observing a value of the statistic equal to \(|t|\) or larger in absolute value; this quantity is called the **p-value**.
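The whole chain, \(\hat\sigma^2\), \(\widehat{Var}[\hat\beta_1]\), the \(t\)-statistic and its p-value, can be computed by hand and compared with `summary()`. A sketch on simulated data (names illustrative):

```r
# sigma^2-hat, the estimated variance of beta1-hat, the t-statistic and
# its p-value, by hand vs. summary(lm(...)) (simulated data).
set.seed(3)
n <- 40
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

sigma2_hat <- sum(residuals(fit)^2) / (n - 2)      # RSS / (n - 2)
var_beta1  <- sigma2_hat / sum((x - mean(x))^2)    # estimated Var[beta1-hat]
t_stat     <- coef(fit)["x"] / sqrt(var_beta1)
p_value    <- 2 * pt(-abs(t_stat), df = n - 2)     # two-sided p-value

c(t = t_stat, p = p_value)
summary(fit)$coefficients["x", c("t value", "Pr(>|t|)")]  # same values
```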

Another quantity that enables us to test the hypothesis is the \(F\) statistic given by

\[ F=\frac{SSreg}{\hat\sigma^2}=\frac{SSreg}{RSE^2}, \]

where \(SSreg=TSS-RSS=\sum_{i=1}^n(\hat y_i-\bar y)^2\). If \(\beta_1=0\) then \(F\sim F_{1,n-2}\), namely, the \(F\) statistic will have \(F\) distribution with \(1\) and \(n-2\) degrees of freedom.

Another useful quantity is the **R-squared** or fraction of variance explained by the model,

\[ R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}, \]

where \(TSS=\sum_{i=1}^n(y_i-\bar y)^2\) is the **Total Sum of Squares**.

It can be shown that in the simple linear regression setting we have \(R^2=r^2\), where \(r\) is the correlation between \(X\) and \(Y\):

\[ r=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)} {\sqrt{\sum_{i=1}^n(x_i-\bar x)^2}\sqrt{\sum_{i=1}^n(y_i-\bar y)^2}}. \]
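Both identities are easy to verify numerically: \(R^2\) equals the squared correlation, and in simple regression the \(F\) statistic equals the squared \(t\) statistic. A sketch on simulated data (names illustrative):

```r
# In simple regression, R^2 equals the squared correlation and the
# F statistic equals the squared t statistic (simulated data).
set.seed(4)
n <- 60
x <- runif(n, 0, 10)
y <- 1 + 0.2 * x + rnorm(n)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)
rss <- sum(residuals(fit)^2)
c(R2 = 1 - rss / tss, r2 = cor(x, y)^2)          # identical

f_stat <- (tss - rss) / (rss / (n - 2))          # SSreg / sigma^2-hat
t_stat <- summary(fit)$coefficients["x", "t value"]
c(F = f_stat, t_squared = t_stat^2)              # identical
```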

In the next slide we see how to extract all the information above using R.

`summary(lm.tv)`

```
##
## Call:
## lm(formula = Sales ~ TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
```

A classical notion is that of “Analysis of Variance Table”:

`anova(lm.tv)`

```
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## TV 1 3314.6 3314.6 312.14 < 2.2e-16 ***
## Residuals 198 2102.5 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In words, we can say that the actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.

In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3260/14000 = 23%.

**How accurate is our prediction for the regression line \(E(Y|X)\)?**

Here also we use confidence intervals. For example,

`predict(lm.tv, data.frame(TV=100), interval="confidence")`

```
## fit lwr upr
## 1 11.78626 11.26782 12.3047
```

In words:

If we collect a large number of data sets like the Advertising data set, and we construct a confidence interval for the average sales on the basis of each data set (given $100,000 in TV), then 95% of these confidence intervals will contain the true value of average sales.

So, 95% of intervals of this form will contain the true value of \(E[Y|X=100]\).

**How accurately can we predict a future observation \(Y\)?**

We will use **prediction intervals**.

`predict(lm.tv, data.frame(TV=100), interval="prediction")`

```
## fit lwr upr
## 1 11.78626 5.339251 18.23326
```

A prediction interval can be used to quantify the uncertainty surrounding sales **for a particular city**. Given that $100,000 is spent on TV advertising in that city, the 95% prediction interval is [5.34, 18.23].

So, 95% of intervals of this form will contain the true value of \(Y\) for this city.

Note: both intervals are centered at 11,786 units, but the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

In multiple linear regression, our model takes the form

\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_p X_p+\epsilon.\]

We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\) , **holding all other predictors fixed**.

Here also, the estimates for the parameters \(\beta_0,\beta_1,\dots,\beta_p\) will be the values that minimize the sum of squared residuals

\[ \begin{aligned} RSS&=\sum_{i=1}^n(y_i-\hat y_i)^2 =\sum_{i=1}^n(y_i-\hat\beta_0-\hat\beta_1 x_{i1}-\cdots-\hat\beta_p x_{ip})^2. \end{aligned} \]

Given the **multiple least squares regression coefficient estimates** \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) we can make predictions using the formula

\[ \hat y=\hat\beta_0+\hat\beta_1 x_1+\cdots+\hat\beta_p x_p. \]
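The multiple least squares estimates are the solution of the normal equations \((X'X)\hat\beta=X'y\). The sketch below verifies this on simulated data with \(p=3\) (all names illustrative):

```r
# Multiple least squares via the normal equations (X'X) beta = X'y,
# checked against lm() on simulated data with p = 3 covariates.
set.seed(5)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 2 + 1 * x1 - 1 * x2 + 0.5 * x3 + rnorm(n)

X <- cbind(1, x1, x2, x3)                  # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solves the normal equations
cbind(beta_hat, coef(lm(y ~ x1 + x2 + x3)))  # the two columns agree
```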

In the advertising example, the model becomes

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\beta_3\times\)Newspaper+\(\epsilon\),

or

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Newspaper+\(\epsilon\),

or

Sales=\(\beta_0\)+\(\beta_1\times\)Radio+\(\beta_2\times\)Newspaper+\(\epsilon\),

or

Sales=\(\beta_0\)+\(\beta_1\times\)TV+\(\beta_2\times\)Radio+\(\epsilon\).

Consider the last model. We see the data and an estimated surface fit in the next two figures.

(Two interactive 3D figures: the data points and the fitted regression surface over the TV and Radio axes.)

Now consider the ‘full’ model:

\[ Sales=\beta_0+\beta_1\times TV+\beta_2\times Radio+\beta_3\times Newspaper+\epsilon \]

for which we obtain,

`coef(lm(Sales~TV+Radio+Newspaper))`

```
## (Intercept) TV Radio Newspaper
## 2.938889369 0.045764645 0.188530017 -0.001037493
```

For a given amount of TV and newspaper advertising, spending an additional $1,000 on radio advertising is associated with an increase in sales of approximately 189 units.

Is at least one of the predictors \(X_1,X_2,\dots,X_p\) useful in predicting the response?

Here also we can use the F-statistic to answer this question.

The statistic

\[ F=\frac{SSreg/p}{RSS/(n-p-1)}\sim F_{p,n-p-1}\ \ (\text{under } H_0), \]

is used to test

\[ \begin{aligned} H_0&: \beta_1 =\beta_2=\cdots=\beta_p= 0, \\ H_1&: \text{at least one } \beta_j\neq 0. \end{aligned} \]

In the next slide we see the resulting F-statistic for this model.

We will also see that in the full model, Newspaper is not significant; however, it is significant in a model with Newspaper as the only predictor. A possible explanation is based on analyzing the collinearity between the covariates.

`summary(lm(Sales~TV+Radio+Newspaper))`

```
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
```

```
##
## Call:
## lm(formula = Sales ~ Newspaper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2272 -3.3873 -0.8392 3.5059 12.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
## Newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
```

We look at the correlation matrix

`cor(read.csv('Advertising.csv'))`

```
## TV Radio Newspaper Sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## Radio 0.05480866 1.00000000 0.35410375 0.5762226
## Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## Sales 0.78222442 0.57622257 0.22829903 1.0000000
```

and conclude that newspaper advertising is a surrogate for radio advertising: newspaper gets ‘credit’ for the effect of radio on sales.

Note: it is possible for collinearity to exist between three (or more) covariates even if no pair of covariates has a particularly high correlation. This is a **multicollinearity** situation.

Checking for multicollinearity can be done using the variance inflation factor (VIF):

```
suppressWarnings(library(car))
vif(lm(Sales~TV+Radio+Newspaper))
```

```
## TV Radio Newspaper
## 1.004611 1.144952 1.145187
```

For a given covariate, this indicator is based on a regression of this specific covariate onto all of the other predictors. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
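The definition can be applied directly: regress one predictor on the others and take \(1/(1-R_j^2)\), which is the quantity `vif()` reports. A sketch on simulated data with deliberate collinearity (names illustrative):

```r
# VIF from its definition, 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors (simulated data).
set.seed(6)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)                  # roughly independent of the others

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1                          # well above 1: x1 and x2 are collinear
```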

Two simple solutions: drop one of the problematic variables or combine the collinear variables together into a single predictor.

The ideal scenario is when the predictors are uncorrelated, namely, a balanced design:

Each coefficient can be estimated and tested separately.

Interpretations such as “a unit change in \(X_j\) is associated with a \(\beta_j\) change in \(Y\) , while all the other variables stay fixed”, are possible.

Correlations amongst predictors cause problems:

The variance of all coefficients tends to increase, sometimes dramatically.

Interpretations become hazardous: when \(X_j\) changes, everything else changes.

**Claims of causality should be avoided for observational data**.

So far our model assumes that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.

Now we also check for interactions (non-additive relations), for example:

\[ Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times (Radio\times TV) + \epsilon \]

which is the same as \[ Sales= \beta_0 + (\beta_1 + \beta_3 \times Radio) \times TV + \beta_2 \times Radio + \epsilon, \]

and as

\[ Sales= \beta_0 + \beta_1 \times TV +(\beta_2 + \beta_3 \times TV) \times Radio + \epsilon. \]

`coef(lm(Sales~TV*Radio))`

```
## (Intercept) TV Radio TV:Radio
## 6.750220203 0.019101074 0.028860340 0.001086495
```

**One interpretation**: \(Sales= \beta_0 + (\beta_1 + \beta_3 \times Radio) \times TV + \beta_2 \times Radio + \epsilon.\)

In words: an increase in \(TV\) advertising of $1,000 is associated with increased \(Sales\) of \(19+1.1 \times Radio\) units.

**Another interpretation**: \(Sales= \beta_0 + \beta_1 \times TV +(\beta_2 + \beta_3 \times TV) \times Radio + \epsilon.\)

In words: an increase in \(Radio\) advertising of $1,000 will be associated with an increase in \(Sales\) of \(29 + 1.1\times TV\) units.

`summary(lm(Sales~TV*Radio))`

```
##
## Call:
## lm(formula = Sales ~ TV * Radio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## Radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:Radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
```

The results suggest that interactions are important.

The p-value for the interaction term \(TV\times Radio\) is extremely low, indicating that there is strong evidence for \(\beta_3\ne 0\).

The \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts \(Sales\) using \(TV\) and \(Radio\) without an interaction term:

`1-sum(lm(Sales~TV+Radio)$residuals^2)/sum((Sales-mean(Sales))^2)`

`## [1] 0.8971943`

We used the formula \(R^2=1-RSS/TSS\), where \(RSS=\sum_{i=1}^n e_i^2\) and \(TSS=\sum_{i=1}^n(y_i-\bar y)^2\).

Note: the **hierarchical principle** states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

We will use **residual analysis** in order to check the model assumptions and to detect unusual observations:

Errors are normally distributed.

The response \(Y\) is linear in the parameters.

Errors have zero expectation and constant variance.

Errors are independent random variables.

High leverage points and outliers.

The regression output below comes from Anscombe’s classical quartet: four small data sets (\(x_1,y_1\) through \(x_4,y_4\)) that produce nearly identical estimates and ANOVA tables, yet look completely different when plotted, which is exactly why we examine residuals graphically:
```
## $lm1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0000909 1.1247468 2.667348 0.025734051
## x1 0.5000909 0.1179055 4.241455 0.002169629
##
## $lm2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000909 1.1253024 2.666758 0.025758941
## x2 0.500000 0.1179637 4.238590 0.002178816
##
## $lm3
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0024545 1.1244812 2.670080 0.025619109
## x3 0.4997273 0.1178777 4.239372 0.002176305
##
## $lm4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017273 1.1239211 2.670763 0.025590425
## x4 0.4999091 0.1178189 4.243028 0.002164602
```

```
## Analysis of Variance Table
##
## Response: y1
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y2
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
## Analysis of Variance Table
##
## Response: y3
## Df Sum Sq Mean Sq F value Pr(>F)
## x3 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y4
## Df Sum Sq Mean Sq F value Pr(>F)
## x4 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Errors are normally distributed**.

We will check this assumption using q-q plots.

Under the normality assumption, the residuals are normally distributed. We plot the sample quantiles of the residuals against the theoretical quantiles of the normal distribution.

We can also use **studentized residuals**, which are based on the original residuals. Studentized residuals are sometimes called **jackknifed residuals**. If the errors are normally distributed, then the studentized residuals will have a \(t\)-distribution with \(n-p-2\) degrees of freedom. We plot the studentized residuals versus the theoretical quantiles of the \(t\)-distribution.
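In R, `rstudent()` returns the externally studentized (jackknifed) residuals, which can be plotted against \(t\) quantiles directly. A sketch on simulated data (names illustrative):

```r
# Q-Q plot of studentized (jackknifed) residuals against t quantiles
# (simulated data; rstudent() gives externally studentized residuals).
set.seed(7)
x <- runif(80, 0, 10)
y <- 1 + 0.5 * x + rnorm(80)
fit <- lm(y ~ x)

r_stud <- rstudent(fit)
df_t   <- fit$df.residual - 1    # n - p - 2 for jackknifed residuals
qqplot(qt(ppoints(length(r_stud)), df = df_t), r_stud,
       xlab = "Theoretical t quantiles", ylab = "Studentized residuals")
abline(0, 1)
```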

If the errors are not normal, the least squares estimates may not be optimal. They will still be the best linear unbiased estimates, but other, more robust estimators may be more effective. Tests and confidence intervals may not be exact. Long-tailed distributions, in particular, cause large inaccuracies.

```
qqnorm(residuals(lm(Sales~TV+Radio)), ylab="Residuals")
qqline(residuals(lm(Sales~TV+Radio)))
```

```
suppressWarnings(library(car))
qqPlot(lm(Sales~TV+Radio))
```

Several distributions vs. theoretical standard normal quantiles.


**The response \(Y\) is linear in the parameters**.

**The errors have constant variance and are symmetric around zero.**

When the model includes an intercept, the least squares residuals satisfy \(\sum_{i=1}^ne_i=0\), and under the model assumptions, the residuals \(e_i=y_i-\hat y_i\) are normally distributed (but not independent).

Thus, if the model is correct, then we would expect to see the residuals scattered randomly around their mean, zero.

This can be checked by plotting the residuals as a function of the fitted values \(\hat y_i\).

The same plot enables us to inspect whether the variance of the residuals is roughly constant across observations.

`plot(lm(Sales~TV+Radio), which=1)`

Transformation of variables can sometimes solve problems of nonlinearity; we discuss one example here. The problem of heterogeneous variance can be solved by using transformations or weighted least squares.

In order to assess the quality of manual printing, a company collected data on the average cost of printing a document (in cents) over 45 days.

Several transformations can be considered.

`summary(lm(Cost~Days))$r.squared`

`## [1] 0.7375462`

`summary(lm(log(Cost)~Days))$r.squared`

`## [1] 0.8174798`

`summary(lm(log(Cost)~sqrt(Days)))$r.squared`

`## [1] 0.8588231`

`summary(lm(log(Cost)~log(Days)))$r.squared`

`## [1] 0.8434075`

For simplicity, we choose the transformation with the highest \(R^2\) (we could also check model assumptions, etc.).

The theoretical model in that case is \[\log(Cost)=\beta_0+\beta_1\sqrt{Days}+\epsilon.\]

If we want to predict the cost for day 40, then we should use \(\sqrt{40}=6.324555\) for the covariate.

```
logCost=log(Cost)
sqrDays=sqrt(Days)
predict(lm(logCost~sqrDays),data.frame(sqrDays=sqrt(40)))
```

```
## 1
## 2.173689
```

To obtain the predicted cost for day \(40\) we transform back and get \(\exp(2.17)=8.79\).

Note: we can also use polynomial regression in order to deal with non-linear effects of predictors.

**The errors are independent**.

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms.

If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be.

For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be.

This could cause us to erroneously conclude that a parameter is statistically significant.

**If the error terms are correlated, we may have an unwarranted sense of confidence in our model**.

In general, we use the **runs test** (if we deal with time series data, then we can also plot successive residuals to inspect for correlations). A “run” of a sequence is a maximal non-empty segment of the sequence consisting of adjacent equal elements. For example, the 16-element sequence “+ + - - - + + + - - - - - - + +” consists of 5 runs, 3 of which consist of plus signs and 2 of minus signs. If the number of runs is significantly higher or lower than expected, the hypothesis of statistical independence of the elements may be rejected. Exact p-values can be calculated; however, when the numbers of positive and negative residuals each exceed 10, a normal approximation is used:

```
suppressWarnings(library(lawstat))
runs.test(residuals(lm(Sales~TV+Radio)),alternative ="two.sided")
```

```
##
## Runs Test - Two sided
##
## data: residuals(lm(Sales ~ TV + Radio))
## Standardized Runs Statistic = -0.4253, p-value = 0.6706
```

The null hypothesis of the two-sided test is that the errors are neither positively nor negatively correlated; given the large p-value, we do not reject it.
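Counting runs by hand is straightforward with `rle()`; we can also compare the observed count with its expectation under independence, \(2n_+n_-/(n_++n_-)+1\). A sketch on simulated data (names illustrative):

```r
# Counting runs in the residual signs by hand with rle(), plus the
# expected number of runs under independence (simulated data).
set.seed(8)
x <- runif(50)
e <- residuals(lm(rnorm(50) ~ x))

signs  <- ifelse(e > 0, "+", "-")
n_runs <- length(rle(signs)$lengths)   # number of maximal equal segments

n_pos <- sum(signs == "+"); n_neg <- sum(signs == "-")
expected_runs <- 2 * n_pos * n_neg / (n_pos + n_neg) + 1
c(observed = n_runs, expected = expected_runs)
```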

**High leverage points and outliers**.

Let \(\hat y\) and \(y\) denote the vectors of fitted values and observations, respectively. It can be shown that

\[
\hat y= Hy
\] where \(H\) is called the **hat** matrix. The leverage of observation \(i\) is the value of the \(i\)th diagonal term, \(h_{ii}\), of the hat matrix, \(H\). The diagonal terms satisfy \(0\leq h_{ii}\leq 1\), \(\sum_{i=1}^nh_{ii}=p+1\), where \(p\) is the number of covariates in the regression model, and \(n\) is the number of observations.

Leverage is a measure of the effect of a particular observation on the regression predictions. A large value of \(h_{ii}\) indicates that the \(i\)th case is distant from the center of all \(X\) values and has more leverage. Because the sum of the leverage values is \(p+1\), an observation \(i\) can be considered a high-leverage point if its leverage substantially exceeds the mean leverage value, \((p+1)/n\); for example, values of \(h_{ii} > 2(p+1)/n\).
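The hat matrix and its properties can be checked explicitly; `hatvalues()` returns the diagonal of \(H\) without forming the full matrix. A sketch on simulated data with \(p=2\) covariates (names illustrative):

```r
# The hat matrix H = X (X'X)^{-1} X' computed explicitly; its diagonal
# matches hatvalues() and sums to p + 1 (simulated data, p = 2).
set.seed(9)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)
H <- X %*% solve(t(X) %*% X) %*% t(X)

fit <- lm(y ~ x1 + x2)
all.equal(diag(H), unname(hatvalues(fit)))   # TRUE
sum(diag(H))                                 # p + 1 = 3
```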

In the advertising data, in a model with \(TV\) and \(Radio\) we have \(p=2\) and \(n=200\):

`which(hatvalues(lm(Sales~TV+Radio)) > 1.8*(2+1)/200)`

```
## 3 6 9 36 76 109 176 179
## 3 6 9 36 76 109 176 179
```

`which(hatvalues(lm(Sales~TV+Radio)) > 2*(2+1)/200)`

```
## 6 176
## 6 176
```

`which(hatvalues(lm(Sales~TV+Radio)) > 3*(2+1)/200)`

`## named integer(0)`

`plot(lm(Sales~TV+Radio), which=5)`

**Cook’s distance** measure is a combination of a residual effect and leverage. It shows the influence of each observation on the fitted response values (by calculating the distance, in some sense, between the coefficients of a regression with and without a particular observation):

\[ D_i=\frac{\sum_{j=1}^n(\hat y_j-\hat y_{j(-i)})^2}{(p+1)\hat\sigma^2}, \] where \(\hat y_{j(-i)}\) is the fitted value for observation \(j\) when observation \(i\) is excluded from the fit.

Several guidelines exist for an observation to be considered as an outlier, e.g., an observation with Cook’s distance larger than one, or “three times the mean Cook’s distance”.
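The leave-one-out definition can be verified directly by refitting without an observation and comparing with `cooks.distance()`. A sketch on simulated data (names illustrative):

```r
# Cook's distance for observation 1 computed by actually refitting the
# model without it, compared with cooks.distance() (simulated data).
set.seed(10)
d <- data.frame(x = runif(25, 0, 10))
d$y <- 1 + 0.5 * d$x + rnorm(25)

fit <- lm(y ~ x, data = d)
p <- 1                                             # one covariate
sigma2_hat <- sum(residuals(fit)^2) / (25 - p - 1)

fit_m1 <- lm(y ~ x, data = d[-1, ])                # drop observation 1
D1 <- sum((fitted(fit) - predict(fit_m1, d))^2) / ((p + 1) * sigma2_hat)
c(by_hand = D1, builtin = unname(cooks.distance(fit)[1]))   # agree
```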

Note: observation #131 was detected by Cook’s distance but not by leverage alone.

`plot(lm(Sales~TV+Radio), which=4)`

Some predictors are not quantitative but are qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables.

See for example the scatterplot matrix of the credit card data in the next slide.

In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

The R code used to generate the scatterplot matrix is:

```
Credit=read.csv('Credit.csv')
attach(Credit)
pairs(~Balance+Age+Cards+Education+Income+Limit+Rating, cex.labels=1, gap=0, cex.axis=0.5)
```

Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable:

\[ \begin{equation*} z_i= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is female} \\ 0,\,\text{if i}{\it th}\text{ person is male} \end{array} \end{equation*} \]

The model in that case is:

\[ \begin{equation*} y_i=\beta_0+\beta_1 z_i+\epsilon_i= \bigg\{ \begin{array}{l} \beta_0+\beta_1+\epsilon_i,\,\text{if i}{\it th}\text{ person is female} \\ \beta_0+\epsilon_i,\,\text{if i}{\it th}\text{ person is male} \end{array} \end{equation*} \]

`summary(lm(Balance~Gender))`

```
##
## Call:
## lm(formula = Balance ~ Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## GenderFemale 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
```

With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables:

\[ \begin{equation*} z_{i1}= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is Asian} \\ 0,\,\text{if i}{\it th}\text{ person is not Asian} \end{array} \end{equation*} \] \[ \begin{equation*} z_{i2}= \bigg\{ \begin{array}{l} 1,\,\text{if i}{\it th}\text{ person is Caucasian} \\ 0,\,\text{if i}{\it th}\text{ person is not Caucasian} \end{array} \end{equation*} \]

Then we obtain the model:

\[ \begin{equation*} y_i=\beta_0+\beta_1 z_{i1}+\beta_2 z_{i2}+\epsilon_i= \bigg\{ \begin{array}{l} \beta_0+\beta_1+\epsilon_i,\,\text{if i}{\it th}\text{ person is Asian} \\ \beta_0+\beta_2+\epsilon_i,\,\text{if i}{\it th}\text{ person is Caucasian} \\ \beta_0+\epsilon_i,\,\text{if i}{\it th}\text{ person is AA} \end{array} \end{equation*} \]

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable, African American in this example, is known as the **baseline**.
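R builds this dummy coding automatically from a factor; `model.matrix()` reveals it, with the first factor level as the baseline. A small sketch with illustrative data:

```r
# Dummy coding of a factor as produced by R; the first level in
# alphabetical order ("AA" here) serves as the baseline.
eth <- factor(c("AA", "Asian", "Caucasian", "AA", "Asian"))
model.matrix(~ eth)
# Columns: (Intercept), ethAsian, ethCaucasian; rows for "AA" have zeros
# in both dummy columns, so "AA" plays the role of the baseline.
```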

`summary(lm(Balance~Ethnicity))`

```
##
## Call:
## lm(formula = Balance ~ Ethnicity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## EthnicityAsian -18.69 65.02 -0.287 0.774
## EthnicityCaucasian -12.50 56.68 -0.221 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
```

Interactions between qualitative and quantitative variables.

Suppose that we wish to predict balance using income (quantitative) and student (qualitative).

Without an interaction term, the model takes the form

\[ \begin{equation*} \text{balance}_i\approx \beta_1 \text{income}_i+ \bigg\{ \begin{array}{l} \beta_0+\beta_2,\,\text{if i}{\it th}\text{ person is a student} \\ \beta_0,\,\text{if i}{\it th}\text{ person is not a student} \end{array} \end{equation*} \]

With interactions, the model takes the form

\[ \begin{equation*} \text{balance}_i\approx \bigg\{ \begin{array}{l} (\beta_0+\beta_2)+(\beta_1+\beta_3)\text{income}_i,\,\text{if i}{\it th}\text{ person is a student} \\ \beta_0+\beta_1 \text{income}_i,\,\text{if i}{\it th}\text{ person is not a student} \end{array} \end{equation*} \]

Estimates without interactions:

`lm(Balance~Income+Student)$coef`

```
## (Intercept) Income StudentYes
## 211.142964 5.984336 382.670539
```

Estimates with an interaction term between income and student.

`lm(Balance~Income*Student)$coef`

```
## (Intercept) Income StudentYes Income:StudentYes
## 200.623153 6.218169 476.675843 -1.999151
```