## 3.1 Derivation and foundations

### 3.1.1 Derivation

#### 3.1.1.3 Implementation

Suppose that we want to fit a linear model: $y_i=\beta_0+\beta_1x_i+e_i$ For instance, in the bwght data we might want to relate birthweight to cigarette smoking or $\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+e_i$ The fitted model would then be: $\text{bwght}_i=\hat\beta_0+\hat\beta_1\text{cigs}_i+\hat e_i$ $\text{bwght}_i=119.77-0.5138\text{cigs}_i+\hat e_i$ This means that for every 1 cigarette smoked by the mother per day during pregnancy is associated with a 0.514 oz decrease in the expected birthweight of the child.

modela <- lm(bwght~cigs,data=bwght)
modela
##
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
##
## Coefficients:
## (Intercept)         cigs
##    119.7719      -0.5138

If we want to get summary statistics for the model (hypothesis testing, R-squared, etc.), we just call:

summary(modela)
##
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -96.772 -11.772   0.297  13.228 151.228
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.77190    0.57234 209.267  < 2e-16 ***
## cigs         -0.51377    0.09049  -5.678 1.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.13 on 1386 degrees of freedom
## Multiple R-squared:  0.02273,    Adjusted R-squared:  0.02202
## F-statistic: 32.24 on 1 and 1386 DF,  p-value: 1.662e-08

To plot linear models, we are again going to use the ggplot2 package (Wickham et al. 2018). This plot is the same as the ones we created before just with a different data set:

plot4_basic <- ggplot(bwght,aes(x=cigs,y=bwght)) + geom_point()
plot4_basic <- plot4_basic + scale_x_continuous(name="Cigarettes smoked during pregnancy per day") + scale_y_continuous(name="Birthweight (oz)")
plot4_basic

Now we need to add the line from our linear model, modela. we are going to use the geom_line() function to add the line. We have to specify the y values of the line associated with the x points of our data. To find these y values, we just need to use the predict function from before:

plot4_ols <- plot4_basic + geom_line(aes(y=predict(modela)),color="red",size=2)
plot4_ols

The broom package (Robinson and Hayes 2018) has various functions which standardize output from models. After installing the broom package it can be read into memory with:

library(broom)

The first and most import broom command is tidy which takes the summary output from a linear model (or other models later) and formats the table of coefficients into a clean format:

tidy(modela)
Table 10: Tidy results from model3
term estimate std.error statistic p.value
(Intercept) 119.7719004 0.5723407 209.266802 0
cigs -0.5137721 0.0904909 -5.677609 0

The summary output also produces some various statistis about the model such as the $$R^2$$ value. If you want a simple format for those results, you can use the glance function on the model:

glance(modela)
Table 11: Glance results from modela
Variables Values
r.squared 2.272910e-02
adj.r.squared 2.202400e-02
sigma 2.012858e+01
statistic 3.223524e+01
p.value 0.000000e+00
df 2.000000e+00
logLik -6.135457e+03
AIC 1.227691e+04
BIC 1.229262e+04
deviance 5.615513e+05
df.residual 1.386000e+03

This also gives other statistics such as the AIC and BIC which are useful for model selection. Augment gives the original data with the residuals and fitted values:

summary(augment(modela))
##      bwght            cigs           .fitted          .se.fit
##  Min.   : 23.0   Min.   : 0.000   Min.   : 94.08   Min.   :0.5403
##  1st Qu.:107.0   1st Qu.: 0.000   1st Qu.:119.77   1st Qu.:0.5723
##  Median :120.0   Median : 0.000   Median :119.77   Median :0.5723
##  Mean   :118.7   Mean   : 2.087   Mean   :118.70   Mean   :0.6741
##  3rd Qu.:132.0   3rd Qu.: 0.000   3rd Qu.:119.77   3rd Qu.:0.5723
##  Max.   :271.0   Max.   :50.000   Max.   :119.77   Max.   :4.3692
##      .resid             .hat               .sigma         .cooksd
##  Min.   :-96.772   Min.   :0.0007206   Min.   :19.72   Min.   :5.000e-08
##  1st Qu.:-11.772   1st Qu.:0.0008085   1st Qu.:20.13   1st Qu.:3.877e-05
##  Median :  0.297   Median :0.0008085   Median :20.13   Median :1.771e-04
##  Mean   :  0.000   Mean   :0.0014409   Mean   :20.13   Mean   :7.080e-04
##  3rd Qu.: 13.228   3rd Qu.:0.0008085   3rd Qu.:20.14   3rd Qu.:5.403e-04
##  Max.   :151.228   Max.   :0.0471172   Max.   :20.14   Max.   :5.279e-02
##    .std.resid
##  Min.   :-4.809631
##  1st Qu.:-0.585072
##  Median : 0.014764
##  Mean   : 0.000038
##  3rd Qu.: 0.657446
##  Max.   : 7.516143

### 3.1.2 Conditional expectation and predictions

#### 3.1.2.3 Predictions and residuals

To make predictions based on a model, e.g., $\hat{bwght}=\hat\beta_0+\hat\beta_1cigs$, we can use the predict() function (e.g., predict(modela)). Because we only specify the model with the predictions, R doesn’t know what x values we want to use. So in this case it defaults to using the original x variables. This gives us the so-called fitted values. of the model:

summary(predict(modela))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   94.08  119.77  119.77  118.70  119.77  119.77

If we want to make new predictions for two data points $$cigs=20$$ and $$cigs=0$$, we can add these x-values to the predict function:

predict(modela,data.table(cigs=c(20,0)))
##        1        2
## 109.4965 119.7719

### 3.1.4 Special models

#### 3.1.4.1 Binary independent variables

Suppose we want to run a multiple regression: $\text{bwght}_i=\beta_0+\beta_1 \text{cigs}_i+\beta_2 \text{faminc}_i + \text{male}_i+e_i$ To run this model, we simply enter:

modelb <- lm(bwght~cigs+faminc+male,data=bwght)
summary(modelb)
##
## Call:
## lm(formula = bwght ~ cigs + faminc + male, data = bwght)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -97.521 -11.577   0.994  13.194 151.655
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.22771    1.20788  95.397  < 2e-16 ***
## cigs         -0.46105    0.09134  -5.048 5.07e-07 ***
## faminc        0.09688    0.02915   3.324  0.00091 ***
## male          3.11397    1.07640   2.893  0.00388 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.01 on 1384 degrees of freedom
## Multiple R-squared:  0.03564,    Adjusted R-squared:  0.03355
## F-statistic: 17.05 on 3 and 1384 DF,  p-value: 7.105e-11

This notation inside the lm command is called formula notation. A helpful guide on formula notation in R is available from the Booth School of Business.

Here, a $$1$$ cigarette increase is associated with a $$-0.461$$ decrease in $$oz$$ of birthweight, but now we are controlling for family income and whether or not the child was male. To understand why we’re saying this consider two predictions. One child is a boy who is born with a mother who smokes $$20$$ cigarettes per day has a family income of $$\30,000$$. The other child is also a boy whose mother smokes $$0$$ cigarettes per day and also has a family income of $$\30,000$$. Clearly we are keeping family income and child sex constant. In this case the difference between the two predicted birthweights would be: $\hat y_{20}-\hat y_{0}= 20\hat\beta_1+30\hat\beta_2+\hat\beta_3-0\hat\beta_1-30\hat\beta_2-\hat\beta_3$ $\hat y_{20}-\hat y_{0}= 20\hat\beta_1=-9.22oz$ which is $$20$$ times the estimated $$\beta_1$$ value because this is a difference of $$20$$ cigarettes per day.

In this linear model, we could also consider an interpretation for the $$\beta_3$$ coefficient. If a child is male, then $$\text{male}_i=1$$ otherwise $$\text{male}_i=0$$. This means that if the child is male, we expect the birthweight to increase by $$\hat\beta_3=3.11oz$$ controlling for cigarette smoking and family income. This is how we interpret models with binary or “dummy” variables.

#### 3.1.4.2 Quadratics and polynomial functions

To perform a quadratic regression such as: $\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+\beta_2\text{cigs}_i^2+e_i$ we have to use the I() function in R. I() tells R to ignore the special formula notation and do literally what commands we input. Here we are taking the square. So the I(cigs^2) says add $$\text{cigs}^2$$ to the linear model:

modelc <- lm(bwght~cigs+I(cigs^2),data=bwght)
summary(modelc)
##
## Call:
## lm(formula = bwght ~ cigs + I(cigs^2), data = bwght)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -96.949 -11.949   1.051  13.051 151.051
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.949119   0.579843 206.865  < 2e-16 ***
## cigs         -0.858407   0.207398  -4.139  3.7e-05 ***
## I(cigs^2)     0.013551   0.007339   1.846    0.065 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.11 on 1385 degrees of freedom
## Multiple R-squared:  0.02513,    Adjusted R-squared:  0.02372
## F-statistic: 17.85 on 2 and 1385 DF,  p-value: 2.218e-08

#### 3.1.4.3 Interactions

To include interactions between two variables, you can use the formula notaion for interactions which is a:b to interact (essentially multiply) a and b and include the result in the model:

bwght$smokes <- as.numeric(bwght$cigs>0)
modele <- lm(bwght~smokes+faminc+male+white+smokes:faminc+smokes:male+smokes:white,data=bwght)
summary(modele)
##
## Call:
## lm(formula = bwght ~ smokes + faminc + male + white + smokes:faminc +
##     smokes:male + smokes:white, data = bwght)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -93.340 -11.681   0.563  12.981 150.882
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   112.02091    1.54585  72.465  < 2e-16 ***
## smokes         -5.54028    3.83280  -1.445  0.14855
## faminc          0.04112    0.03225   1.275  0.20255
## male            3.39415    1.16286   2.919  0.00357 **
## white           6.34911    1.48836   4.266 2.13e-05 ***
## smokes:faminc   0.12468    0.10066   1.239  0.21569
## smokes:male    -2.85699    3.00414  -0.951  0.34176
## smokes:white   -5.13610    3.73596  -1.375  0.16942
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.88 on 1380 degrees of freedom
## Multiple R-squared:  0.05043,    Adjusted R-squared:  0.04561
## F-statistic: 10.47 on 7 and 1380 DF,  p-value: 7.211e-13

Here, I have interacted a dummy for smoking behavior with faminc, male, white. To include all the pairwise interactions we can enter:

bwght$smokes <- as.numeric(bwght$cigs>0)
modelf <- lm(bwght~(smokes+faminc+male+white)^2,data=bwght)
summary(modelf)
##
## Call:
## lm(formula = bwght ~ (smokes + faminc + male + white)^2, data = bwght)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -93.463 -11.706   0.626  13.163 150.943
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   113.08465    2.44118  46.324   <2e-16 ***
## smokes         -6.05461    3.88408  -1.559   0.1193
## faminc         -0.02513    0.08240  -0.305   0.7604
## male            2.92238    2.65627   1.100   0.2714
## white           5.87552    2.80561   2.094   0.0364 *
## smokes:faminc   0.12521    0.10109   1.239   0.2157
## smokes:male    -2.37318    3.06905  -0.773   0.4395
## smokes:white   -4.75201    3.79282  -1.253   0.2105
## faminc:male     0.04539    0.06128   0.741   0.4590
## faminc:white    0.05094    0.08315   0.613   0.5402
## male:white     -1.16949    2.74399  -0.426   0.6700
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.9 on 1377 degrees of freedom
## Multiple R-squared:  0.05112,    Adjusted R-squared:  0.04423
## F-statistic: 7.418 on 10 and 1377 DF,  p-value: 1.482e-11

Incidentally this is why we have to use the I() function to create quadratics. In formula notation a^b is used for adding interactions, not making things quadratic. Be careful about your notation here.

#### 3.1.4.4 Log-linear specifications

If we want to take the natural logarithm of one of the variables, we simply surround that variable with the log() function:

modeld <- lm(log(bwght)~cigs+faminc,data=bwght)
summary(modeld)
##
## Call:
## lm(formula = log(bwght) ~ cigs + faminc, data = bwght)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.62745 -0.08761  0.02148  0.12155  0.82230
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.7439566  0.0098431 481.958  < 2e-16 ***
## cigs        -0.0040326  0.0008593  -4.693 2.96e-06 ***
## faminc       0.0008437  0.0002739   3.081  0.00211 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1883 on 1385 degrees of freedom
## Multiple R-squared:  0.02646,    Adjusted R-squared:  0.02505
## F-statistic: 18.82 on 2 and 1385 DF,  p-value: 8.608e-09

### References

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Robinson, David, and Alex Hayes. 2018. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.