3.1 Derivation and foundations

3.1.1 Derivation

3.1.1.1 Problem specification

3.1.1.2 Solution

3.1.1.3 Implementation

Suppose that we want to fit a linear model: \[y_i=\beta_0+\beta_1x_i+e_i\] For instance, in the bwght data we might want to relate birthweight to cigarette smoking: \[\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+e_i\] The fitted model would then be: \[\text{bwght}_i=\hat\beta_0+\hat\beta_1\text{cigs}_i+\hat e_i\] \[\text{bwght}_i=119.77-0.5138\text{cigs}_i+\hat e_i\] This means that each additional cigarette smoked per day by the mother during pregnancy is associated with a 0.514 oz decrease in the expected birthweight of the child.

modela <- lm(bwght~cigs,data=bwght)
modela
## 
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
## 
## Coefficients:
## (Intercept)         cigs  
##    119.7719      -0.5138

If we want to get summary statistics for the model (hypothesis testing, R-squared, etc.), we just call:

summary(modela)
## 
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.772 -11.772   0.297  13.228 151.228 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.77190    0.57234 209.267  < 2e-16 ***
## cigs         -0.51377    0.09049  -5.678 1.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.13 on 1386 degrees of freedom
## Multiple R-squared:  0.02273,    Adjusted R-squared:  0.02202 
## F-statistic: 32.24 on 1 and 1386 DF,  p-value: 1.662e-08

To plot linear models, we are again going to use the ggplot2 package (Wickham et al. 2018). This plot is the same as the ones we created before, just with a different data set:

plot4_basic <- ggplot(bwght,aes(x=cigs,y=bwght)) + geom_point() 
plot4_basic <- plot4_basic + scale_x_continuous(name="Cigarettes smoked during pregnancy per day") + scale_y_continuous(name="Birthweight (oz)")
plot4_basic

Now we need to add the line from our linear model, modela. We are going to use the geom_line() function to add the line, and we have to specify the y values of the line associated with the x points of our data. To find these y values, we just need to use the predict() function from before:

plot4_ols <- plot4_basic + geom_line(aes(y=predict(modela)),color="red",size=2)
plot4_ols
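
As an aside, ggplot2 can also fit and draw the regression line in a single step with geom_smooth(). Here is a minimal alternative to the predict() approach above (plot4_smooth is just an illustrative name):

plot4_smooth <- plot4_basic + geom_smooth(method="lm",se=FALSE,color="red")
plot4_smooth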

The broom package (Robinson and Hayes 2018) has various functions that standardize output from models. After installing the broom package, it can be read into memory with:

library(broom)

The first and most important broom command is tidy, which takes the summary output from a linear model (or other models later) and formats the table of coefficients as a clean data frame:

tidy(modela)
Table 10: Tidy results from modela

term          estimate      std.error   statistic    p.value
(Intercept)   119.7719004   0.5723407   209.266802   0
cigs           -0.5137721   0.0904909    -5.677609   0
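
tidy() can also report confidence intervals for the coefficients by setting conf.int = TRUE (the confidence level defaults to 95%):

tidy(modela, conf.int=TRUE)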

The summary output also produces various statistics about the model, such as the \(R^2\) value. If you want a simple format for those results, you can use the glance function on the model:

glance(modela)
Table 11: Glance results from modela

Variable        Value
r.squared        2.272910e-02
adj.r.squared    2.202400e-02
sigma            2.012858e+01
statistic        3.223524e+01
p.value          0.000000e+00
df               2.000000e+00
logLik          -6.135457e+03
AIC              1.227691e+04
BIC              1.229262e+04
deviance         5.615513e+05
df.residual      1.386000e+03

This also gives other statistics, such as the AIC and BIC, which are useful for model selection. The augment function returns the original data along with the fitted values, residuals, and other observation-level diagnostics:

summary(augment(modela))
##      bwght            cigs           .fitted          .se.fit      
##  Min.   : 23.0   Min.   : 0.000   Min.   : 94.08   Min.   :0.5403  
##  1st Qu.:107.0   1st Qu.: 0.000   1st Qu.:119.77   1st Qu.:0.5723  
##  Median :120.0   Median : 0.000   Median :119.77   Median :0.5723  
##  Mean   :118.7   Mean   : 2.087   Mean   :118.70   Mean   :0.6741  
##  3rd Qu.:132.0   3rd Qu.: 0.000   3rd Qu.:119.77   3rd Qu.:0.5723  
##  Max.   :271.0   Max.   :50.000   Max.   :119.77   Max.   :4.3692  
##      .resid             .hat               .sigma         .cooksd         
##  Min.   :-96.772   Min.   :0.0007206   Min.   :19.72   Min.   :5.000e-08  
##  1st Qu.:-11.772   1st Qu.:0.0008085   1st Qu.:20.13   1st Qu.:3.877e-05  
##  Median :  0.297   Median :0.0008085   Median :20.13   Median :1.771e-04  
##  Mean   :  0.000   Mean   :0.0014409   Mean   :20.13   Mean   :7.080e-04  
##  3rd Qu.: 13.228   3rd Qu.:0.0008085   3rd Qu.:20.14   3rd Qu.:5.403e-04  
##  Max.   :151.228   Max.   :0.0471172   Max.   :20.14   Max.   :5.279e-02  
##    .std.resid       
##  Min.   :-4.809631  
##  1st Qu.:-0.585072  
##  Median : 0.014764  
##  Mean   : 0.000038  
##  3rd Qu.: 0.657446  
##  Max.   : 7.516143
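
Because augment() returns an ordinary data frame, its columns (such as .fitted and .resid above) can be passed straight to ggplot2. For example, a quick residuals-versus-fitted plot (plot_resid is just an illustrative name):

plot_resid <- ggplot(augment(modela),aes(x=.fitted,y=.resid)) + geom_point()
plot_resid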

3.1.1.4 R-squared
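
As a quick sketch of what the \(R^2\) reported by summary() measures, we can compute \(R^2=1-SSR/SST\) by hand from modela (assuming the bwght data are still loaded); the result should match the Multiple R-squared of 0.02273 above:

ssr <- sum(resid(modela)^2)                      # sum of squared residuals
sst <- sum((bwght$bwght-mean(bwght$bwght))^2)    # total sum of squares
1 - ssr/sst                                      # R-squared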

3.1.2 Conditional expectation and predictions

3.1.2.1 Conditional expectation

3.1.2.2 Linear models

3.1.2.3 Predictions and residuals

To make predictions based on a model, e.g., \[\widehat{\text{bwght}}=\hat\beta_0+\hat\beta_1\text{cigs},\] we can use the predict() function (e.g., predict(modela)). Because we only pass the model to predict(), R doesn’t know what x values we want to use, so in this case it defaults to using the original x variables. This gives us the so-called fitted values of the model:

summary(predict(modela))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   94.08  119.77  119.77  118.70  119.77  119.77

If we want to make new predictions for two data points, \(\text{cigs}=20\) and \(\text{cigs}=0\), we can supply these x values to the predict function as new data:

predict(modela,data.table(cigs=c(20,0)))
##        1        2 
## 109.4965 119.7719
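
These predictions are just the fitted equation evaluated at the new x values; for example, the first one can be reproduced by hand from the stored coefficients:

coef(modela)["(Intercept)"] + coef(modela)["cigs"]*20   # matches the 109.4965 above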

3.1.3 Matrix algebra representation

3.1.3.1 Multiple regression

3.1.3.2 Derivation

3.1.3.3 Implementation
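
As a minimal sketch of the matrix-algebra formula \(\hat\beta=(X'X)^{-1}X'y\) (assuming the bwght data are loaded), the simple regression coefficients can be reproduced directly:

X <- model.matrix(bwght~cigs,data=bwght)   # design matrix: a column of ones and cigs
y <- bwght$bwght                           # response vector
solve(t(X)%*%X, t(X)%*%y)                  # should match coef(modela)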

3.1.3.4 Multiple regression with LM

3.1.4 Special models

3.1.4.1 Binary independent variables

Suppose we want to run a multiple regression: \[\text{bwght}_i=\beta_0+\beta_1 \text{cigs}_i+\beta_2 \text{faminc}_i + \beta_3\text{male}_i+e_i\] To run this model, we simply enter:

modelb <- lm(bwght~cigs+faminc+male,data=bwght)
summary(modelb)
## 
## Call:
## lm(formula = bwght ~ cigs + faminc + male, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -97.521 -11.577   0.994  13.194 151.655 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 115.22771    1.20788  95.397  < 2e-16 ***
## cigs         -0.46105    0.09134  -5.048 5.07e-07 ***
## faminc        0.09688    0.02915   3.324  0.00091 ***
## male          3.11397    1.07640   2.893  0.00388 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.01 on 1384 degrees of freedom
## Multiple R-squared:  0.03564,    Adjusted R-squared:  0.03355 
## F-statistic: 17.05 on 3 and 1384 DF,  p-value: 7.105e-11

This notation inside the lm command is called formula notation. A helpful guide on formula notation in R is available from the Booth School of Business.

Here, a \(1\) cigarette increase is associated with a \(0.461\) oz decrease in birthweight, but now we are controlling for family income and whether or not the child is male. To understand why we say this, consider two predictions. One child is a boy whose mother smokes \(20\) cigarettes per day and whose family income is \(\$30,000\) (faminc \(=30\), since family income is measured in thousands of dollars). The other child is also a boy whose mother smokes \(0\) cigarettes per day and also has a family income of \(\$30,000\). Clearly we are keeping family income and child sex constant. In this case the difference between the two predicted birthweights would be: \[\hat y_{20}-\hat y_{0}= 20\hat\beta_1+30\hat\beta_2+\hat\beta_3-0\hat\beta_1-30\hat\beta_2-\hat\beta_3\] \[\hat y_{20}-\hat y_{0}= 20\hat\beta_1=-9.22\text{ oz}\] which is \(20\) times the estimated \(\beta_1\) value because this is a difference of \(20\) cigarettes per day.
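
A quick check of this number in R, using the stored coefficients from modelb:

20*coef(modelb)["cigs"]   # approximately -9.22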

In this linear model, we could also consider an interpretation for the \(\beta_3\) coefficient. If a child is male, then \(\text{male}_i=1\); otherwise \(\text{male}_i=0\). This means that if the child is male, we expect the birthweight to increase by \(\hat\beta_3=3.11\) oz, controlling for cigarette smoking and family income. This is how we interpret models with binary or “dummy” variables.

3.1.4.2 Quadratics and polynomial functions

To perform a quadratic regression such as: \[\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+\beta_2\text{cigs}_i^2+e_i\] we have to use the I() function in R. I() tells R to ignore the special formula notation and evaluate the expression inside it literally. Here we are taking the square, so I(cigs^2) says to add \(\text{cigs}_i^2\) to the linear model:

modelc <- lm(bwght~cigs+I(cigs^2),data=bwght)
summary(modelc)
## 
## Call:
## lm(formula = bwght ~ cigs + I(cigs^2), data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.949 -11.949   1.051  13.051 151.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.949119   0.579843 206.865  < 2e-16 ***
## cigs         -0.858407   0.207398  -4.139  3.7e-05 ***
## I(cigs^2)     0.013551   0.007339   1.846    0.065 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.11 on 1385 degrees of freedom
## Multiple R-squared:  0.02513,    Adjusted R-squared:  0.02372 
## F-statistic: 17.85 on 2 and 1385 DF,  p-value: 2.218e-08
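
In a quadratic specification, the implied marginal effect of cigs is \(\hat\beta_1+2\hat\beta_2\,\text{cigs}\), so the fitted relationship flattens out at \(\text{cigs}=-\hat\beta_1/(2\hat\beta_2)\approx 0.858/(2\times 0.01355)\approx 31.7\) cigarettes per day. This turning point can be computed from the stored coefficients:

-coef(modelc)["cigs"]/(2*coef(modelc)["I(cigs^2)"])   # turning point, about 31.7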

3.1.4.3 Interactions

To include interactions between two variables, you can use the formula notation for interactions, which is a:b; this interacts (essentially multiplies) a and b and includes the result in the model:

bwght$smokes <- as.numeric(bwght$cigs>0)
modele <- lm(bwght~smokes+faminc+male+white+smokes:faminc+smokes:male+smokes:white,data=bwght)
summary(modele)
## 
## Call:
## lm(formula = bwght ~ smokes + faminc + male + white + smokes:faminc + 
##     smokes:male + smokes:white, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.340 -11.681   0.563  12.981 150.882 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   112.02091    1.54585  72.465  < 2e-16 ***
## smokes         -5.54028    3.83280  -1.445  0.14855    
## faminc          0.04112    0.03225   1.275  0.20255    
## male            3.39415    1.16286   2.919  0.00357 ** 
## white           6.34911    1.48836   4.266 2.13e-05 ***
## smokes:faminc   0.12468    0.10066   1.239  0.21569    
## smokes:male    -2.85699    3.00414  -0.951  0.34176    
## smokes:white   -5.13610    3.73596  -1.375  0.16942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.88 on 1380 degrees of freedom
## Multiple R-squared:  0.05043,    Adjusted R-squared:  0.04561 
## F-statistic: 10.47 on 7 and 1380 DF,  p-value: 7.211e-13

Here, I have interacted a dummy for smoking behavior with faminc, male, and white. To include all of the pairwise interactions, we can enter:

bwght$smokes <- as.numeric(bwght$cigs>0)
modelf <- lm(bwght~(smokes+faminc+male+white)^2,data=bwght)
summary(modelf)
## 
## Call:
## lm(formula = bwght ~ (smokes + faminc + male + white)^2, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.463 -11.706   0.626  13.163 150.943 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   113.08465    2.44118  46.324   <2e-16 ***
## smokes         -6.05461    3.88408  -1.559   0.1193    
## faminc         -0.02513    0.08240  -0.305   0.7604    
## male            2.92238    2.65627   1.100   0.2714    
## white           5.87552    2.80561   2.094   0.0364 *  
## smokes:faminc   0.12521    0.10109   1.239   0.2157    
## smokes:male    -2.37318    3.06905  -0.773   0.4395    
## smokes:white   -4.75201    3.79282  -1.253   0.2105    
## faminc:male     0.04539    0.06128   0.741   0.4590    
## faminc:white    0.05094    0.08315   0.613   0.5402    
## male:white     -1.16949    2.74399  -0.426   0.6700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.9 on 1377 degrees of freedom
## Multiple R-squared:  0.05112,    Adjusted R-squared:  0.04423 
## F-statistic: 7.418 on 10 and 1377 DF,  p-value: 1.482e-11

Incidentally, this is why we have to use the I() function to create quadratics. In formula notation, a^b is used for crossing terms to create interactions up to order b, not for raising a variable to a power. Be careful about your notation here.
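
Formula notation also has a related shorthand, a*b, which expands to a + b + a:b. For example, the following (modelg is just an illustrative name) is equivalent to writing smokes + faminc + smokes:faminc:

modelg <- lm(bwght~smokes*faminc,data=bwght)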

3.1.4.4 Log-linear specifications

If we want to take the natural logarithm of one of the variables, we simply surround that variable with the log() function:

modeld <- lm(log(bwght)~cigs+faminc,data=bwght)
summary(modeld)
## 
## Call:
## lm(formula = log(bwght) ~ cigs + faminc, data = bwght)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62745 -0.08761  0.02148  0.12155  0.82230 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.7439566  0.0098431 481.958  < 2e-16 ***
## cigs        -0.0040326  0.0008593  -4.693 2.96e-06 ***
## faminc       0.0008437  0.0002739   3.081  0.00211 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1883 on 1385 degrees of freedom
## Multiple R-squared:  0.02646,    Adjusted R-squared:  0.02505 
## F-statistic: 18.82 on 2 and 1385 DF,  p-value: 8.608e-09
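
In a log-linear model, \(100\hat\beta_1\) approximates the percent change in the outcome from a one-unit increase in the regressor. Here, \(100\times(-0.0040)\approx -0.40\), so each additional cigarette per day is associated with roughly a \(0.4\%\) lower birthweight, holding family income constant.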

References

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Robinson, David, and Alex Hayes. 2018. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.