# Lesson 3 Predictive Regression Part I

## 3.1 Introduction

This is the first of two lessons that will teach you how to implement and interpret *predictive* linear regression in R. For the moment we won't worry about causality and we won't talk about heteroskedasticity or autocorrelation. In this first lesson, we'll introduce the basics using a simple dataset that you can download from my website and display as follows:

```
library(readr)
<- read_csv("http://ditraglia.com/data/child_test_data.csv")
kids kids
```

```
## # A tibble: 434 × 4
## kid.score mom.hs mom.iq mom.age
## <dbl> <dbl> <dbl> <dbl>
## 1 65 1 121. 27
## 2 98 1 89.4 25
## 3 85 1 115. 27
## 4 83 1 99.4 25
## 5 115 1 92.7 27
## 6 98 0 108. 18
## 7 69 1 139. 20
## 8 106 1 125. 23
## 9 102 1 81.6 24
## 10 95 1 95.1 19
## # … with 424 more rows
```

Each row of the tibble `kids`

contains information on a three-year old child. The first column gives the child's test score at age three, while the remaining columns provide information about each child's mother:

`kid.score`

child test score at age 3`mom.age`

age of mother at birth of child`mom.hs`

mother completed high school? (1 = yes)`mom.iq`

mother's IQ score

The columns `kid.score`

gives the child's test score at age three. The remaining columns describe the child's mother: `mom.age`

is mother's age at the birth of the child, `mom.hs`

is a dummy variable that equals one the mother completed high school, and `mom.iq`

is the mother's IQ score. Our main goal will be to predict a child's test score based on mother characteristics. But **stay alert**: in some of the exercises I may be a bit devious and ask you to predict *something else*!

### 3.1.1 Exercise

Using a dot `.`

to separate words in a variable name isn't great coding style: it's better to use an underscore `_`

. Search the `dplyr`

help files for the command `rename()`

and then use this command to replace each instance of a `.`

in the column names of `kids`

with an underscore `_`

.

```
library(dplyr)
<- kids %>%
kids rename(kid_score = kid.score,
mom_hs = mom.hs,
mom_iq = mom.iq,
mom_age = mom.age)
```

## 3.2 The Least Squares Problem

Suppose we observe a dataset with \(n\) observations \((Y_i, X_i)\) where \(Y_i\) is an **outcome** variable for person \(i\)--the thing we want to predict--and \(X_i\) is a vector of \(p\) **predictor** variables--the things we'll use to make our prediction. In the `kids`

dataset, our outcome is `kid_score`

and our predictors are `mom_hs`

, `mom_age`

, and `mom_iq`

. Our goal is to build a model of the form \(X'\beta = \sum_{j=1}^p \beta_j X_{j}\) that we can use to predict \(Y\) for a person who is *not* in our dataset. The constants \(\beta_j\) are called **coefficients** and a model of this form is called a **linear model** because the \(\beta_j\) enter linearly: they're not raised to any powers etc. Ordinary least squares (OLS) uses the observed data to find the coefficients \(\widehat{\beta}\) that solve the **least squares problem**

\[
\underset{\beta}{\text{minimize}} \sum_{i=1}^n (Y_i - X_i'\beta)^2.
\]
In case you were wondering "but wait, where's the intercept?" I should point out that some people prefer to write \((Y_i - \beta_0 - X_i' \beta)\) rather than \((Y_i - X_i'\beta)\). To allow an intercept using my notation, simply treat the first element of my \(X_i\) vector as a \(1\) and the first element of my \(\beta\) vector as the intercept.

### 3.2.1 Exercise

Suppose we want to regress the outcome \(Y_i\) on an *intercept only*, in other words we want to minimize \(\sum_{i=1}^n (Y_i - \beta)^2\) over \(\beta\). What is the solution? Does this make sense?

Differentiating with respect to \(\beta\), the first order condition is \(-2 \sum_{i=1}^n (Y_i - \widehat{\beta}) = 0\). Because the objective function is convex, this characterizes the global minimum. Re-arranging and solving for \(\widehat{\beta}\) gives \(\widehat{\beta} = \frac{1}{n}\sum_{i=1}^n Y_i\). In other words \(\widehat{\beta} = \bar{Y}\), the sample mean. This makes sense: the sample mean is a reasonable prediction of the next \(Y\)-observation if you have no other information to work with. Here we've shown that it is also the least squares solution.

## 3.3 Linear Regression with `lm()`

The R function `lm()`

, short for **linear model**, solves the least squares problem. Its basic syntax is `lm([formula], [dataframe])`

where `[formula]`

is an R *formula*--an object that describes the regression we want to run--and `[dataframe]`

is the name of a data frame containing our \(X\) and \(Y\) observations, e.g. `kids`

. R formulas can be a bit confusing when you first encounter them, so I'll explain the details in stages. For the moment, there are two symbols you need to learn: `~`

and `+`

The tilde symbol `~`

is used to separate the "left hand side" and "right hand side" of a formula: the *outcome* goes on the left of the `~`

and the predictors go on the right. For example, to regress `kid_score`

on `mom_iq`

we use the command

`lm(kid_score ~ mom_iq, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq, data = kids)
##
## Coefficients:
## (Intercept) mom_iq
## 25.80 0.61
```

This tells R: "please solve the least squares problem to predict `kid_score`

using `mom_iq`

based on the data contained in `kids`

." Notice that R includes an intercept in the regression automatically. This is a good default, because it seldom makes sense to run a regression without an intercept. When you want to run a regression with *multiple* right-hand side predictors, use the plus sign `+`

to separate them. For example, to regress `kid_score`

on `mom_iq`

and `mom_age`

use the command

`lm(kid_score ~ mom_iq + mom_age, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_age, data = kids)
##
## Coefficients:
## (Intercept) mom_iq mom_age
## 17.5962 0.6036 0.3881
```

### 3.3.1 Exercise

- Interpret the regression coefficients from
`lm(kid_score ~ mom_iq, kids)`

.

Consider two kids whose mothers differ by 1 point in `mom_iq`

. We would predict that the kid whose mom has the higher value of `mom_iq`

will score about 0.6 points higher in `kid_score`

.

`lm(kid_score ~ mom_iq, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq, data = kids)
##
## Coefficients:
## (Intercept) mom_iq
## 25.80 0.61
```

- Run a linear regression to predict
`mom_hs`

using`kid_score`

and`mom_iq`

.

`lm(mom_hs ~ kid_score + mom_iq, kids)`

```
##
## Call:
## lm(formula = mom_hs ~ kid_score + mom_iq, data = kids)
##
## Coefficients:
## (Intercept) kid_score mom_iq
## -0.060135 0.002775 0.006050
```

## 3.4 Plotting the Regression Line

The `ggplot2`

package makes it easy to produce an attractive and informative plot of the results of a simple linear regression. Using what we learned in our last lesson, we know how to make a scatter plot of `mom_iq`

and `kid_score`

:

```
library(ggplot2)
ggplot(kids) +
geom_point(aes(x = mom_iq, y = kid_score))
```

To add the regression line, we'll use `geom_smooth()`

, a function for plotting *smoothed conditional means*. We can use `geom_smooth()`

in exactly the same way as `geom_point()`

by specifying `aes()`

to set the `x`

and `y`

variables. By default, `geom_smooth()`

plots a *non-parametric* regression curve rather than a linear regression:

```
ggplot(kids) +
geom_point(aes(x = mom_iq, y = kid_score)) +
geom_smooth(aes(x = mom_iq, y = kid_score))
```

`## `geom_smooth()` using method = 'loess' and formula 'y ~ x'`

Notice that we had to type `aes(x = mom_iq, y = kid_score)`

*twice* in the preceding code chunk. This is tedious and error-prone. In more complicated plots that contain multiple `geom`

functions, life is much simpler if we specify our desired `aes()`

*once*. To do this, pass it as an argument to `ggplot`

rather than to the `geom`

, for example

```
ggplot(kids, aes(x = mom_iq, y = kid_score)) +
geom_point() +
geom_smooth()
```

To plot a regression *line* rather than this non-parametric regression function, we merely need to set `method = 'lm`

in `geom_smooth()`

, for example

```
ggplot(kids, aes(x = mom_iq, y = kid_score)) +
geom_point() +
geom_smooth(method = 'lm')
```

By default this plots the regression line from the regression `y ~ x`

.

### 3.4.1 Exercise

- Use
`ggplot()`

to make a scatter plot with`mom_age`

on the horizontal axis and`kid_score`

on the vertical axis.

```
ggplot(kids, aes(x = mom_age, y = kid_score)) +
geom_point()
```

- Use
`geom_smooth()`

to including the non-parametric regression function on the preceding plot.

```
ggplot(kids, aes(x = mom_age, y = kid_score)) +
geom_point() +
geom_smooth()
```

- Modify the preceding to include the regression line corresponding to
`kid_score ~ mom_age`

on the scatter plot rather than the non-parametric regression function.

```
ggplot(kids, aes(x = mom_age, y = kid_score)) +
geom_point() +
geom_smooth(method = 'lm')
```

## 3.5 Getting More from `lm()`

If we simply run `lm`

as above, R will display only the estimated regression coefficients and the command that we used to run the regression: `Call`

.
To get more information, we need to *store* the results of our regression using the assignment operator `<-`

for example:

`<- lm(kid_score ~ mom_iq, kids) reg1 `

If you run the preceding line of code in the R console, it won't produce any output. But if you check your R environment after running it, you'll see a new `List`

object: `reg1`

. To see what's inside this list, we can use the command `str`

:

`str(reg1)`

```
## List of 12
## $ coefficients : Named num [1:2] 25.8 0.61
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "mom_iq"
## $ residuals : Named num [1:434] -34.68 17.69 -11.22 -3.46 32.63 ...
## ..- attr(*, "names")= chr [1:434] "1" "2" "3" "4" ...
## $ effects : Named num [1:434] -1808.22 190.39 -8.77 -1.96 33.73 ...
## ..- attr(*, "names")= chr [1:434] "(Intercept)" "mom_iq" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:434] 99.7 80.3 96.2 86.5 82.4 ...
## ..- attr(*, "names")= chr [1:434] "1" "2" "3" "4" ...
....
```

Don't panic: you don't need to know what all of these list elements are.
The important thing to understand is that `lm`

returns a *list* from which we can extract important information about the regression we have run.
To extract the regression coefficient estimates, we use the function `coefficients()`

or `coef()`

for short

`coef(reg1)`

```
## (Intercept) mom_iq
## 25.7997778 0.6099746
```

To extract the regression residuals, we use the function `residuals()`

or `resid()`

for short

`resid(reg1)`

```
## 1 2 3 4 5 6
## -34.67839049 17.69174662 -11.21717291 -3.46152907 32.62769741 6.38284487
## 7 8 9 10 11 12
## -41.52104074 3.86488149 26.41438662 11.20806784 11.17050590 -25.66178773
## 13 14 15 16 17 18
## 3.93517580 -17.40659730 14.87699469 10.74760539 6.40273692 5.13172163
## 19 20 21 22 23 24
## -2.44440289 15.87116678 9.04396998 11.90180808 14.47664178 0.24807074
## 25 26 27 28 29 30
## 13.66974891 -4.06297022 -14.03359486 -7.52559318 2.18354609 2.47533381
....
```

To extract the *fitted values* i.e. \(\hat{Y}_i \equiv X_i'\hat{\beta}\), the predicted values of, we use `fitted.values`

`fitted.values(reg1)`

```
## 1 2 3 4 5 6 7 8
## 99.67839 80.30825 96.21717 86.46153 82.37230 91.61716 110.52104 102.13512
## 9 10 11 12 13 14 15 16
## 75.58561 83.79193 79.82949 83.66179 80.06482 95.40660 87.12301 99.25239
## 17 18 19 20 21 22 23 24
## 95.59726 93.86828 107.44440 85.12883 92.95603 103.09819 85.52336 86.75193
## 25 26 27 28 29 30 31 32
## 85.33025 100.06297 86.03359 85.52559 74.81645 95.52467 92.37138 87.90567
## 33 34 35 36 37 38 39 40
## 97.75549 92.06345 84.67978 82.44896 84.29520 91.07649 78.98774 80.30825
....
```

### 3.5.1 Exercise

- Plot a histogram of the residuals from
`reg1`

using`ggplot`

with a bin width of 5. Is there anything noteworthy about this plot?

There seems to be a bit of left skewness in the residuals.

```
library(ggplot2)
ggplot() +
geom_histogram(aes(x = resid(reg1)), binwidth = 5)
```

- Calculate the residuals "by hand" by subtracting the fitted values from
`reg1`

from the column`kid_score`

in`kids`

. Use the R function`all.equal`

to check that this gives the same result as`resid()`

.

They give exactly the same result:

`all.equal(resid(reg1), kids$kid_score - fitted.values(reg1))`

`## [1] TRUE`

- As long as you include an intercept in your regression, the residuals will sum to zero. Verify that this is true (up to machine precision!) of the residuals from
`reg1`

Close enough!

`sum(resid(reg1))`

`## [1] 1.056155e-12`

- By construction, the regression residuals are uncorrelated with any predictors included in the regression. Verify that this holds (up to machine precision!) for
`reg1`

.

Again, close enough!

`cor(resid(reg1), kids$mom_iq)`

`## [1] 3.121525e-16`

## 3.6 Summarizing The Ouput of `lm()`

To view the "usual" summary of regression output, we use the `summary()`

function:

`summary(reg1)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq, data = kids)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.753 -12.074 2.217 11.710 47.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.79978 5.91741 4.36 1.63e-05 ***
## mom_iq 0.60997 0.05852 10.42 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 432 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.1991
## F-statistic: 108.6 on 1 and 432 DF, p-value: < 2.2e-16
```

Among other things, `summary`

shows us the coefficient estimates and associated standard errors for each regressor. It also displays the t-value (Estimate / SE) and associated p-value for a test of the null hypothesis \(H_0\colon \beta = 0\) versus \(H_1\colon \beta \neq 0\). Farther down in the output, `summary`

provides the residual standard error, the R-squared, and the F-statistic and associated p-value for a test of the null hypothesis that all regression coefficients except for the intercept are zero.^{10}

**Health warning: by default, lm() computes standard errors and p-values under the classical regression assumptions.** In particular, unless you explicitly tell R to do otherwise, it will assume that the regression errors \(\varepsilon_i \equiv Y_i - X_i' \beta\) are homoskedastic, and iid. If you're not quite sure what this means, or if you're worried that I'm sweeping important details under the rug, fear not: we'll revisit this in a later lesson. For the moment, let me offer you the following mantra, paraphrasing the wisdom of my favorite professor from grad school:

You can

alwaysrun a [predictive] linear regression; it's inference that requires assumptions.

### 3.6.1 Exercise

Use the `kids`

tibble to run a regression that uses `kid_score`

and `mom_hs`

to predict `mom_iq`

. Store your results in an object called `reg_reverse`

and then display a summary of the regression results.

```
<- lm(mom_iq ~ mom_hs + kid_score, kids)
reg_reverse summary(reg_reverse)
```

```
##
## Call:
## lm(formula = mom_iq ~ mom_hs + kid_score, data = kids)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.046 -10.412 -1.762 8.839 42.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.86638 2.82454 24.381 < 2e-16 ***
## mom_hs 6.82821 1.58450 4.309 2.03e-05 ***
## kid_score 0.29688 0.03189 9.309 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 431 degrees of freedom
## Multiple R-squared: 0.234, Adjusted R-squared: 0.2304
## F-statistic: 65.82 on 2 and 431 DF, p-value: < 2.2e-16
```

## 3.7 Tidying up with `broom`

We saw above that `lm()`

returns a list. It turns out that `summary()`

, when applied to an `lm()`

object, *also* returns a list:

`str(summary(reg1))`

```
## List of 11
## $ call : language lm(formula = kid_score ~ mom_iq, data = kids)
## $ terms :Classes 'terms', 'formula' language kid_score ~ mom_iq
## .. ..- attr(*, "variables")= language list(kid_score, mom_iq)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "kid_score" "mom_iq"
## .. .. .. ..$ : chr "mom_iq"
## .. ..- attr(*, "term.labels")= chr "mom_iq"
## .. ..- attr(*, "order")= int 1
....
```

In principle, this gives us a way of extracting particular pieces of information from a table of regression output generated by `summary()`

. For example, if you carefully examine the output of `str(summary(reg1))`

you'll find a named list element called `r.squared`

. By accessing this element, you can pluck out the R-squared from `summary(reg1)`

as follows:

`summary(reg1)$r.squared`

`## [1] 0.2009512`

Similarly, you could extract F-statistics and associated degrees of freedom by accessing You could extract the information
That wasn't so bad! But now suppose you wanted to extract the estimates, standard errors, and p-values from `reg1`

. While it's *possible* to do this by poring over the output of `str(summary(reg1))`

, there's a much easier way.

The `broom`

package provides some extremely useful functions for extracting regression output. Best of all, the same tools apply to models that we'll meet in later lessons. Use `tidy()`

to create a tibble containing regression estimates, standard errors, t-statistics, and p-values e.g.

```
library(broom)
tidy(reg1)
```

```
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 25.8 5.92 4.36 1.63e- 5
## 2 mom_iq 0.610 0.0585 10.4 7.66e-23
```

Use `glance()`

to create a tibble that summarizes various measures of model fit:

`glance(reg1)`

```
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.201 0.199 18.3 109. 7.66e-23 1 -1876. 3757. 3769.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

Finally, use `augment()`

to create a tibble that *merges* the tibble you used to run your regression with the corresponding regression fitted values, residuals, etc.

`augment(reg1, kids)`

```
## # A tibble: 434 × 10
## kid_score mom_hs mom_iq mom_age .fitted .resid .hat .sigma .cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 65 1 121. 27 99.7 -34.7 0.00688 18.2 0.0126
## 2 98 1 89.4 25 80.3 17.7 0.00347 18.3 0.00164
## 3 85 1 115. 27 96.2 -11.2 0.00475 18.3 0.000905
## 4 83 1 99.4 25 86.5 -3.46 0.00231 18.3 0.0000416
## 5 115 1 92.7 27 82.4 32.6 0.00284 18.2 0.00456
## 6 98 0 108. 18 91.6 6.38 0.00295 18.3 0.000181
## 7 69 1 139. 20 111. -41.5 0.0178 18.2 0.0478
## 8 106 1 125. 23 102. 3.86 0.00879 18.3 0.000200
## 9 102 1 81.6 24 75.6 26.4 0.00577 18.2 0.00611
## 10 95 1 95.1 19 83.8 11.2 0.00255 18.3 0.000483
## # … with 424 more rows, and 1 more variable: .std.resid <dbl>
```

Notice that `augment()`

uses a dot "." to begin the name of any column that it merges. This avoids potential clashes with columns you already have in your dataset. After all, you'd never start a column name with a dot would you?

### 3.7.1 Exercise

To answer the following, you may need to consult the help files for `tidy.lm()`

, `glance.lm()`

, and `augment.lm()`

from the `broom`

package.

- Use
`dplyr`

and`tidy()`

to display the regression estimate, standard error, t-statistic, and p-value for the predictor`kid_score`

in`reg_reverse`

from above.

```
%>%
reg_reverse tidy() %>%
filter(term == 'kid_score')
```

```
## # A tibble: 1 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 kid_score 0.297 0.0319 9.31 6.61e-19
```

- Use
`ggplot()`

and`augment()`

to make a scatterplot with the fitted values from`reg_reverse`

on the horizontal axis and`mom_iq`

on the vertical axis. Use`geom_abline()`

to add a 45-degree line to your plot. (You may need to read the help file for this function.)

```
augment(reg_reverse, kids) %>%
ggplot(aes(x = .fitted, y = mom_iq)) +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```

- Continuing from the preceding exercise, run a regression of
`mom_iq`

on the fitted values from`reg_reverse`

and display the estimated regression coefficients. Compare the R-squared of this regression to that of`reg_reverse`

. Explain your results.

When we regress \(Y_i\) on \(\widehat{Y}_i\), the fitted values from a regression of \(Y_i\) on \(X_i\), we get an intercept of zero and a slope of one:

```
<- augment(reg_reverse, kids)
kids_augmented <- lm(mom_iq ~ .fitted, kids_augmented)
reg_y_vs_fitted tidy(reg_y_vs_fitted)
```

```
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.96e-13 8.73 2.25e-14 1.00e+ 0
## 2 .fitted 1.00e+ 0 0.0871 1.15e+ 1 7.85e-27
```

This makes sense. Suppose we wanted to choose \(\alpha_0\) and \(\alpha_1\) to minimize \(\sum_{i=1}^n (Y_i - \alpha_0 - \alpha_1 \widehat{Y}_i)^2\) where \(\widehat{Y}_i = \widehat{\beta}_0 + X_i'\widehat{\beta}_1\). This is equivalent to minimizing \[ \sum_{i=1}^n \left[Y_i - (\alpha_0 + \widehat{\beta}_0) - X_i'(\alpha_1\widehat{\beta}_1)\right]^2. \] By construction \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) minimize \(\sum_{i=1}^n (Y_i - \beta_0 - X_i'\beta_1)^2\), so unless \(\widehat{\alpha_0} = 0\) and \(\widehat{\alpha_1} = 1\) we'd have a contradiction! Similar reasoning explains why the R-squared values for the two regressions are the same:

`c(glance(reg_reverse)$r.squared, glance(reg_y_vs_fitted)$r.squared)`

`## [1] 0.2339581 0.2339581`

The R-squared of a regression equals \(1 - \text{SS}_{\text{residual}} / \text{SS}_{\text{total}}\) \[ \text{SS}_{\text{total}} = \sum_{i=1}^n (Y_i - \bar{Y})^2,\quad \text{SS}_{\text{residual}} = \sum_{i=1}^n (Y_i - \widehat{Y}_i^2) \] The total sum of squares is the same for both regressions because they have the same outcome variable. The residual sum of squares is the same because \(\widehat{\alpha}_0 = 0\) and \(\widehat{\alpha}_1 = 1\) together imply that both regressions have the same fitted values.

## 3.8 Dummy Variables with `lm()`

The column `mom_hs`

in `kids`

is a *dummy variable*, also known as a binary variable. It equals `1`

if a child's mother graduated from college and `0`

otherwise. For this reason, the coefficient on `mom_hs`

in the following regression tells us the *difference of mean test scores* between kids whose mothers graduated from college and those whose mothers did not, while the intercept tells us the mean of `kid_score`

for children whose mothers didn't graduate from high school:

`lm(kid_score ~ mom_hs, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_hs, data = kids)
##
## Coefficients:
## (Intercept) mom_hs
## 77.55 11.77
```

Although it's represented using the numerical values `0`

and `1`

, `mom_hs`

doesn't actually encode quantitative information. The numerical values are just shorthand for two different categories: `mom_hs`

is a *categorical* variable. To keep from getting confused, it's good practice to make categorical variables *obvious* by storing them as character or factor data. Here I create a new column, `mom_education`

, that stores the same information as `mom_hs`

as a *factor*:

```
<- kids %>%
kids mutate(mom_education = if_else(mom_hs == 1, 'High School', 'No High School')) %>%
mutate(mom_education = factor(mom_education, levels = unique(mom_education)))
```

The column `mom_education`

is a *factor*, R's built-in representation of a categorical variable. So what happens if we include `mom_education`

in our regression in place of `mom_hs`

?

`lm(kid_score ~ mom_education, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_education, data = kids)
##
## Coefficients:
## (Intercept) mom_educationNo High School
## 89.32 -11.77
```

Wait a minute; now the estimate is *negative*! We can't run a regression that includes an intercept and a coefficient for each level of a dummy variable--this is the dummy variable trap!--so R has excluded one of them. Rather capriciously, `lm()`

has chosen to treat `High School`

as the omitted category.

We can override this behavior by using `fct_relevel()`

from the `forcats`

package. The following code tells R that we want 'No High School' to be the *first* ordered factor level, the level that `lm()`

treats as the omitted category by default:

```
library(forcats)
<- kids %>%
kids mutate(mom_education = fct_relevel(mom_education, 'No High School'))
lm(kid_score ~ mom_education, kids)
```

```
##
## Call:
## lm(formula = kid_score ~ mom_education, data = kids)
##
## Coefficients:
## (Intercept) mom_educationHigh School
## 77.55 11.77
```

In the exercise you'll explore how `lm()`

handles categorical variables that take on *more than two* values. In your econometrics class you probably learned that we can use two dummy variables to encode a three-valued categorical variable, four dummy variables to encode a four-valued categorical variable, and so on. You may wonder if we need to explicitly construct these dummy variables in R. The answer is *no*: `lm()`

handles things for us automatically, as you're about to discover. This is worth putting in bold: **you don't have to explicitly construct dummy variables in R**. The `lm()`

function will construct them for you.

### 3.8.1 Exercise

- Read the help file for the base R function
`cut()`

. Setting the argument`breaks`

to`c(16, 19, 23, 26, 29)`

, use this function in concert with`dplyr()`

to create a factor variable called`mom_age_bins`

to`kids`

. Use the base R function`levels()`

to display the factor levels of`mom_age_bins`

.

```
<- kids %>%
kids mutate(mom_age_bins = cut(mom_age, c(16, 19, 23, 26, 29)))
levels(kids$mom_age_bins)
```

`## [1] "(16,19]" "(19,23]" "(23,26]" "(26,29]"`

- Run a linear regression of
`kid_score`

on`mom_age_bins`

and display the coefficients. Explain the results.

R "expands" the factor variable `mom_age_bins`

into a collection of dummy variables before running the regression. It names these dummies in an obvious way based on the levels that `mom_age_bins`

takes on:

`lm(kid_score ~ mom_age_bins, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_age_bins, data = kids)
##
## Coefficients:
## (Intercept) mom_age_bins(19,23] mom_age_bins(23,26]
## 87.054 -2.308 2.293
## mom_age_bins(26,29]
## 2.232
```

In words: R has run a regression of `kid_score`

on a constant (the intercept), a dummy for mother's age in the range `(19,23]`

, a dummy for mother's age in the range `(23,26]`

and a dummy for mother's age in the range `(26,29]`

.

From running `levels(kids$mom_age_bins)`

above, we know this means the omitted category is `(16,19]`

. This corresponds to the intercept in the regression, so the average value of `kid_score`

for a child whose mother's age lies in the range `(16,19]`

is around 87 points. All the other coefficients are differences of means relative to this category of "teen moms." For example, kids whose mothers' age is in the range `(19,23]`

score about two points lower, on average, then kids whose mothers' age is in the range `(16,19]`

.

- Re-run the preceding regression, but use
`fct_relevel()`

from the`forcats`

package make`(26,29]`

the omitted category for`mom_age_bins`

.

```
<- kids %>%
kids mutate(mom_age_bins = fct_relevel(mom_age_bins, '(26,29]'))
lm(kid_score ~ mom_age_bins, kids)
```

```
##
## Call:
## lm(formula = kid_score ~ mom_age_bins, data = kids)
##
## Coefficients:
## (Intercept) mom_age_bins(16,19] mom_age_bins(19,23]
## 89.28571 -2.23214 -4.54043
## mom_age_bins(23,26]
## 0.06106
```

## 3.9 Fun with R Formulas

It's time to learn some more about R formulas. But before we do, you may ask "why bother?" It's true that you run just about any regression you need using nothing more complicated than `+`

and `~`

as introduced above. I know, because I did this for the better part of a decade! But a key goal of this book is showing you how to work *smarter* rather than harder, both to make your own life easier and help others replicate your work. If you ever plan to fit more than a handful of models with more than a handful of variables, it's worth your time to learn about formulas. You've already met the special symbols `~`

and `+`

explained in the following table. In the next few sub-sections, I'll walk you through the others: `.`

, `-`

, `1`

, `:`

, `*`

, `^`

, and `I()`

.

Symbol | Purpose | Example | In Words |
---|---|---|---|

`~` |
separate LHS and RHS of formula | `y ~ x` |
regress `y` on `x` |

`+` |
add variable to a formula | `y ~ x + z` |
regress `y` on `x` and `z` |

`.` |
denotes "everything else" | `y ~ .` |
regress `y` on all other variables in a data frame |

`-` |
remove variable from a formula | `y ~ . - x` |
regress `y` on all other variables except `z` |

`1` |
denotes intercept | `y ~ x - 1` |
regress `y` on `x` without an intercept |

`:` |
construct interaction term | `y ~ x + z + x:z` |
regress `y` on `x` , `z` , and the product `x` times `z` |

`*` |
shorthand for levels plus interaction | `y ~ x * z` |
regress `y` on `x` , `z` , and the product `x` times `z` |

`^` |
higher order interactions | `y ~ (x + z + w)^3` |
regress `y` on `x` , `z` , `w` , all two-way interactions, and the three-way interactions |

`I()` |
"as-is" - override special meanings of other symbols from this table | `y ~ x + I(x^2)` |
regress `y` on `x` and `x` squared |

### 3.9.1 "Everything Else" - The Dot `.`

Sometimes all you want to do is run a regression of one variable on *everything else*. If you have lots of predictors, typing out all of their names, each separated by a `+`

sign, is painful and error-prone. Fortunately there's a shortcut: the dot `.`

`lm(kid_score ~ ., kids)`

```
##
## Call:
## lm(formula = kid_score ~ ., data = kids)
##
## Coefficients:
## (Intercept) mom_hs mom_iq
## -8.571 5.638 0.571
## mom_age mom_educationHigh School mom_age_bins(16,19]
## 1.155 NA 14.048
## mom_age_bins(19,23] mom_age_bins(23,26]
## 7.205 7.631
```

This command tells R to regress `kid_score`

on *everything else* in `kids`

. Notice that the coefficient `mom_educationHigh School`

is `NA`

, in other words **missing**. This is because you can't run a regression on `mom_hs`

*and* `mom_education`

at the same time: they're two versions of exactly the same information and hence are perfectly co-linear, so R drops one of them.

We'll encounter the dot in many guises later in this lesson and elsewhere. Wherever you see it, replace it mentally with the word "everything" and you'll never be confused. The rest will be clear from context.

### 3.9.2 Removing Predictors with `-`

The regression we ran above using the dot `.`

was very silly: it included both `mom_hs`

and `mom_education`

, and it also included both `mom_age_bins`

and `mom_age`

. Suppose we wanted to run a regression of `kid_score`

on everything *except* `mom_age_bins`

and `mom_education`

. This is easy to achieve using the minus sign `-`

as follows:

`lm(kid_score ~ . - mom_age_bins - mom_education, kids)`

```
##
## Call:
## lm(formula = kid_score ~ . - mom_age_bins - mom_education, data = kids)
##
## Coefficients:
## (Intercept) mom_hs mom_iq mom_age
## 20.9847 5.6472 0.5625 0.2248
```

Think of `+`

as saying "add me to the regression" and `-`

as saying "remove me from the regression." This use of `-`

is very similar to what you've seen in the `select()`

function from `dplyr`

. And as in `dplyr`

, we can use it to remove one variable or more than one.

### 3.9.3 The Intercept: `1`

It almost always makes sense to include an intercept when you run a linear regression. Without one, we're forced to predict that \(Y\) will be zero when \(X\) is zero. Because this is usually a bad idea, `lm()`

includes an intercept by default:

`lm(kid_score ~ mom_iq, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq, data = kids)
##
## Coefficients:
## (Intercept) mom_iq
## 25.80 0.61
```

In some special cases, however, we may have a reason to run a regression without an intercept. R's formula syntax denotes the intercept by `1`

. Armed with this knowledge, we can remove it from our regression using `-`

as introduced above:

`lm(kid_score ~ mom_iq - 1, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_iq - 1, data = kids)
##
## Coefficients:
## mom_iq
## 0.8623
```

Another situation in which we may wish to remove the intercept is when running a regression with a categorical variable. We can't include an intercept *and* a coefficient for each value of a categorical variable in our regression: this is the dummy variable trap. We either have to drop one level of the categorical variable (the baseline or omitted category) or drop the intercept. Above we saw how to choose which category to omit. But another option is to drop the intercept. In the first regression, the intercept equals the mean of `kid_score`

for the omitted category `mom_education == "No High School"`

while the intercept gives the difference of means:

`lm(kid_score ~ mom_education, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_education, data = kids)
##
## Coefficients:
## (Intercept) mom_educationHigh School
## 77.55 11.77
```

In the second, we obtain the mean of `kid_score`

for each group:

`lm(kid_score ~ mom_education - 1, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_education - 1, data = kids)
##
## Coefficients:
## mom_educationNo High School mom_educationHigh School
## 77.55 89.32
```

### 3.9.4 Exercise

- Run a regression of
`mom_hs`

on everything in`kids`

*except*`mom_age`

and`mom_hs`

.

`lm(kid_score ~ . - mom_age - mom_hs, kids)`

```
##
## Call:
## lm(formula = kid_score ~ . - mom_age - mom_hs, data = kids)
##
## Coefficients:
## (Intercept) mom_iq mom_educationHigh School
## 23.2831 0.5688 6.0147
## mom_age_bins(16,19] mom_age_bins(19,23] mom_age_bins(23,26]
## 3.7181 0.3138 4.4481
```

- Write
`dplyr`

code to verify that`lm(kid_score ~ mom_education - 1, kids)`

does indeed calculate mean of`kid_score`

for each group, as asserted.

```
%>%
kids group_by(mom_education) %>%
summarize(mean(kid_score))
```

```
## # A tibble: 2 × 2
## mom_education `mean(kid_score)`
## <fct> <dbl>
## 1 No High School 77.5
## 2 High School 89.3
```

- What do you get if you run the regression
`lm(kid_score ~ 1, kids)`

? Explain.

This is a regression with *only an intercept*, so it calculates the sample mean of `kid_score`

`lm(kid_score ~ 1, kids)`

```
##
## Call:
## lm(formula = kid_score ~ 1, data = kids)
##
## Coefficients:
## (Intercept)
## 86.8
```

```
%>%
kids summarize(mean(kid_score))
```

```
## # A tibble: 1 × 1
## `mean(kid_score)`
## <dbl>
## 1 86.8
```

See the exercise earlier in this lesson for a mathematical proof that regression with only an intercept is equivalent to computing the sample mean of \(Y\).

- Run a regression of
`kid_score`

on`mom_age_bins`

and`mom_iq`

, but rather than including an intercept and an omitted category fit a separate coefficient for each level of`mom_age_bins`

.

`lm(kid_score ~ mom_age_bins + mom_iq - 1, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_age_bins + mom_iq - 1, data = kids)
##
## Coefficients:
## mom_age_bins(26,29] mom_age_bins(16,19] mom_age_bins(19,23]
## 23.9233 26.1812 23.8011
## mom_age_bins(23,26] mom_iq
## 28.2905 0.6139
```

### 3.9.5 Transforming Outcomes and Predictors

What if you wanted to regress the logarithm of `kid_score`

on `mom_age`

and `mom_age^2`

? One way to do this is by creating a new data frame:

```
<- kids %>%
new_kids mutate(log_kid_score = log(kid_score),
mom_age_sq = mom_age^2)
lm(log_kid_score ~ mom_age + mom_age_sq, new_kids)
```

```
##
## Call:
## lm(formula = log_kid_score ~ mom_age + mom_age_sq, data = new_kids)
##
## Coefficients:
## (Intercept) mom_age mom_age_sq
## 4.4586761 -0.0109575 0.0004214
```

It worked! But that required an awful lot of typing. What's more, I had to clutter up my R environment with *another* data frame: `new_kids`

. A more elegant approach uses R's formula syntax to do all the heavy lifting. First I'll show you the syntax and then I'll explain it:

`lm(log(kid_score) ~ mom_age + I(mom_age^2), kids)`

```
##
## Call:
## lm(formula = log(kid_score) ~ mom_age + I(mom_age^2), data = kids)
##
## Coefficients:
## (Intercept) mom_age I(mom_age^2)
## 4.4586761 -0.0109575 0.0004214
```

The key point here is that we can use functions *within an R formula*. When `lm()`

encounters `log(kid_score) ~ mom_age + I(mom_age^2)`

it looks at the data frame `kids`

, and then *parses* the formula to construct all the variables that it needs to run the regression. There's no need for us to construct and store these in advance: R does everything for us. Remember how `lm()`

automatically constructed all the dummy variables required for a regression with a categorical predictor in the exercise from above? Roughly the same thing is going on here.

The only awkward part is the function `I()`

. What on earth is this mean?! Formulas have their own special syntax: a `+`

inside a formula doesn't denote addition and a `.`

doesn't indicate a decimal point. The full set of characters that have a "special meaning" within an R formula is as follows: `~ + . - : * ^ |`

. We've already met the first four of these; we'll encounter the next three in the following section. If you want to use any of these characters in their *ordinary meaning* you need to "wrap them" in an `I()`

. In words, `I()`

means "as-is," in other words "don't treat the things inside these parentheses as formula syntax; treat them as you would a plain vanilla R expression. Since `^`

has a special meaning within a formula, we need to wrap `mom_age^2`

inside of `I()`

to include the square of `mom_age`

. We *don't* have to wrap `log(kid_score)`

inside of `I()`

because `log()`

isn't one of the special characters `~ + . - : * ^ |`

.

### 3.9.6 Adding Interactions With `:`

, `*`

, and `^`

An *interaction* is a predictor that is constructed by taking the *product* of two "basic" predictors: for example \(X_1 \times X_2\). In R formula syntax we use a colon `:`

to denote an interaction, for example:

`lm(kid_score ~ mom_age:mom_iq, kids)`

```
##
## Call:
## lm(formula = kid_score ~ mom_age:mom_iq, data = kids)
##
## Coefficients:
## (Intercept) mom_age:mom_iq
## 48.04678 0.01698
```

runs a regression of `kid_score`

on the product of `mom_age`

and `mom_iq`

. This is a rather strange regression to run. Unless you have a very good reason for doing so, it's strange to use the *product* of `mom_age`

and `mom_iq`

to predict `kid_score`

without using the *individual variables* as well. In statistical parlance, the coefficient on a predictor that is *not* an interaction, is often called a *main effect*. And as a rule of thumb, just as it rarely makes sense to run a regression without an intercept, it rarely makes sense to include an interaction term without including the associated main effects. Taking this to heart, we can run a regression that includes `mom_age`

and `mom_iq`

in addition to their interaction as follows:

`coef(lm(kid_score ~ mom_age + mom_iq + mom_age:mom_iq, kids))`

```
## (Intercept) mom_age mom_iq mom_age:mom_iq
## -32.69120826 2.55287614 1.09946961 -0.02130665
```

Because including two variables as main effects along with their interaction is such a common pattern in practice, R's formula syntax include a special shorthand symbol for this: `*`

. Somewhat confusingly, the expression `x1 * x2`

within an R formula *does not* refer to the product of `x1`

and `x2`

. Instead it denotes `x1 + x2 + x1:x2`

. For example, we could run exactly the same regression as above with less typing as follows:

`coef(lm(kid_score ~ mom_age * mom_iq, kids))`

```
## (Intercept) mom_age mom_iq mom_age:mom_iq
## -32.69120826 2.55287614 1.09946961 -0.02130665
```

Both `:`

and `*`

can be chained together to create higher-order interactions. For example, the following command regresses `kid_score`

on the three-way interaction of `mom_hs`

, `mom_iq`

, and `mom_age`

`coef(lm(kid_score ~ mom_hs:mom_iq:mom_age, kids))`

```
## (Intercept) mom_hs:mom_iq:mom_age
## 74.860699531 0.006427933
```

If we want to include the corresponding levels, and two-way interactions *alongside* the three-way interaction, we can achieve this by chaining the `*`

symbol as follows:

`coef(lm(kid_score ~ mom_hs * mom_iq * mom_age, kids))`

```
## (Intercept) mom_hs mom_iq
## -123.19702677 128.77032321 2.29825977
## mom_age mom_hs:mom_iq mom_hs:mom_age
## 5.23979255 -1.62184546 -3.72860045
## mom_iq:mom_age mom_hs:mom_iq:mom_age
## -0.06249009 0.05390860
```

The caret `^`

is another shorthand symbol for running regressions with interactions. I rarely use it myself, but you may come across it occasionally. The syntax `(x1 + x2)^2`

is equivalent to `x1 * x2`

, for example:

`coef(lm(kid_score ~ (mom_age + mom_iq)^2, kids))`

```
## (Intercept) mom_age mom_iq mom_age:mom_iq
## -32.69120826 2.55287614 1.09946961 -0.02130665
```

and `(x1 + x2 + x2)^3`

is equivalent to `x1 * x2 * x3`

`coef(lm(kid_score ~ (mom_hs + mom_iq + mom_age)^3, kids))`

```
## (Intercept) mom_hs mom_iq
## -123.19702677 128.77032321 2.29825977
## mom_age mom_hs:mom_iq mom_hs:mom_age
## 5.23979255 -1.62184546 -3.72860045
## mom_iq:mom_age mom_hs:mom_iq:mom_age
## -0.06249009 0.05390860
```

One place where `^`

can come in handy is if you want to include levels and all *two way* interactions of more than two variables. For example,

`coef(lm(kid_score ~ (mom_hs + mom_iq + mom_age)^2, kids))`

```
## (Intercept) mom_hs mom_iq mom_age mom_hs:mom_iq
## -29.68240313 15.83347387 1.30109421 0.94781914 -0.43693774
## mom_hs:mom_age mom_iq:mom_age
## 1.39044445 -0.01656256
```

is a more compact way of specifying

`coef(lm(kid_score ~ mom_hs * mom_iq * mom_age - mom_hs:mom_iq:mom_age, kids))`

```
## (Intercept) mom_hs mom_iq mom_age mom_hs:mom_iq
## -29.68240313 15.83347387 1.30109421 0.94781914 -0.43693774
## mom_hs:mom_age mom_iq:mom_age
## 1.39044445 -0.01656256
```

### 3.9.7 Exercise

- Run a regression of
`kid_score`

on`mom_iq`

and`mom_iq^2`

and display a tidy summary of the regression output, including standard errors.

`tidy(lm(kid_score ~ mom_iq + I(mom_iq^2), kids))`

```
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -99.0 37.3 -2.65 0.00823
## 2 mom_iq 3.08 0.730 4.21 0.0000307
## 3 I(mom_iq^2) -0.0119 0.00352 -3.39 0.000767
```

- Use
`ggplot2`

to make a scatter-plot with`mom_iq`

on the horizontal axis and`kid_score`

on the vertical axis. Add the quadratic regression function from the first part.

Add the argument `formula = y ~ x + I(x^2)`

to `geom_smooth()`

.

```
ggplot(kids, aes(x = mom_iq, y = kid_score)) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x + I(x^2))
```

- Based on the results of the preceding two parts, is there any evidence that the slope of the predictive relationship between
`kid_score`

and`mom_iq`

varies with`mom_iq`

?

Yes: the estimated coefficient on `mom_iq^2`

is highly statistically significant, and from the plot we see that there is a "practically significant" amount of curvature in the predictive relationship.

- Suppose we wanted to run a regression of
`kid_score`

on`mom_iq`

,`mom_hs`

, and their interaction. Write down*three*different ways of specifying this regression using R's formula syntax.

`kid_score ~ mom_iq + mom_hs + mom_iq:mom_hs`

`kid_score ~ mom_iq * mom_hs`

`kid_score ~ (mom_iq + mom_hs)^2`

- Run a regression of
`kid_score`

on`mom_hs`

,`mom_iq`

and their interaction. Display a tidy summary of the estimates, standard errors, etc. Is there any evidence that the predictive relationship between`kid_score`

and`mom_iq`

varies depending on whether a given child's mother graduated from high school? Explain.

`tidy(lm(kid_score ~ mom_iq * mom_hs, kids))`

```
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -11.5 13.8 -0.835 4.04e- 1
## 2 mom_iq 0.969 0.148 6.53 1.84e-10
## 3 mom_hs 51.3 15.3 3.34 9.02e- 4
## 4 mom_iq:mom_hs -0.484 0.162 -2.99 2.99e- 3
```

Let's introduce a bit of notation to make things clearer. The regression specification is as follows:
\[
\texttt{kid_score} = \widehat{\alpha} + \widehat{\beta} \times \texttt{mom_hs} + \widehat{\gamma} \times \texttt{mom_iq} + \widehat{\delta} \times (\texttt{mom_hs} \times \texttt{mom_iq})
\]
Now, `mom_hs`

is a dummy variable that equals 1 if a child's mother graduated from high school For a child whose mother did not graduate from high school, our prediction becomes
\[
\texttt{kid_score} = \widehat{\alpha} + \widehat{\gamma} \times \texttt{mom_iq}
\]
compared to the following for a child whose mother *did* graduate from high school:
\[
\texttt{kid_score} = (\widehat{\alpha} + \widehat{\beta}) + (\widehat{\gamma} + \widehat{\delta})\times \texttt{mom_iq}
\]
Thus, the coefficient \(\widehat{\delta}\) on the interaction `mom_iq:mom_hs`

is the *difference of slopes*: high-school minus no high school. The point estimate is negative, large, and highly statistically significant. We have fairly strong evidence that the predictive relationship between `kid_score`

and `mom_iq`

is *less steep* for children whose mothers attended high school. For more discussion of this example, see this document.

If you're rusty on the F-test, this may help.↩︎