3.1 Derivation and foundations
3.1.1 Derivation
3.1.1.1 Problem specification
3.1.1.2 Solution
3.1.1.3 Implementation
Suppose that we want to fit a linear model: \[y_i=\beta_0+\beta_1x_i+e_i\] For instance, in the bwght
data we might want to relate birthweight to cigarette smoking or \[\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+e_i\] The fitted model would then be: \[\text{bwght}_i=\hat\beta_0+\hat\beta_1\text{cigs}_i+\hat e_i\] \[\text{bwght}_i=119.77-0.5138\text{cigs}_i+\hat e_i\] This means that for every 1 cigarette smoked by the mother per day during pregnancy is associated with a 0.514 oz decrease in the expected birthweight of the child.
modela <- lm(bwght~cigs,data=bwght)
modela
##
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
##
## Coefficients:
## (Intercept) cigs
## 119.7719 -0.5138
If we want to get summary statistics for the model (hypothesis testing, R-squared, etc.), we just call:
summary(modela)
##
## Call:
## lm(formula = bwght ~ cigs, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.772 -11.772 0.297 13.228 151.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.77190 0.57234 209.267 < 2e-16 ***
## cigs -0.51377 0.09049 -5.678 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.13 on 1386 degrees of freedom
## Multiple R-squared: 0.02273, Adjusted R-squared: 0.02202
## F-statistic: 32.24 on 1 and 1386 DF, p-value: 1.662e-08
To plot linear models, we are again going to use the ggplot2
package (Wickham et al. 2018). This plot is the same as the ones we created before just with a different data set:
plot4_basic <- ggplot(bwght,aes(x=cigs,y=bwght)) + geom_point()
plot4_basic <- plot4_basic + scale_x_continuous(name="Cigarettes smoked during pregnancy per day") + scale_y_continuous(name="Birthweight (oz)")
plot4_basic
Now we need to add the line from our linear model, modela
. we are going to use the geom_line()
function to add the line. We have to specify the y values of the line associated with the x points of our data. To find these y values, we just need to use the predict
function from before:
plot4_ols <- plot4_basic + geom_line(aes(y=predict(modela)),color="red",size=2)
plot4_ols
The broom
package (Robinson and Hayes 2018) has various functions which standardize output from models. After installing the broom package it can be read into memory with:
library(broom)
The first and most import broom command is tidy
which takes the summary output from a linear model (or other models later) and formats the table of coefficients into a clean format:
tidy(modela)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 119.7719004 | 0.5723407 | 209.266802 | 0 |
cigs | -0.5137721 | 0.0904909 | -5.677609 | 0 |
The summary output also produces some various statistis about the model such as the \(R^2\) value. If you want a simple format for those results, you can use the glance function on the model:
glance(modela)
Variables | Values |
---|---|
r.squared | 2.272910e-02 |
adj.r.squared | 2.202400e-02 |
sigma | 2.012858e+01 |
statistic | 3.223524e+01 |
p.value | 0.000000e+00 |
df | 2.000000e+00 |
logLik | -6.135457e+03 |
AIC | 1.227691e+04 |
BIC | 1.229262e+04 |
deviance | 5.615513e+05 |
df.residual | 1.386000e+03 |
This also gives other statistics such as the AIC and BIC which are useful for model selection. Augment gives the original data with the residuals and fitted values:
summary(augment(modela))
## bwght cigs .fitted .se.fit
## Min. : 23.0 Min. : 0.000 Min. : 94.08 Min. :0.5403
## 1st Qu.:107.0 1st Qu.: 0.000 1st Qu.:119.77 1st Qu.:0.5723
## Median :120.0 Median : 0.000 Median :119.77 Median :0.5723
## Mean :118.7 Mean : 2.087 Mean :118.70 Mean :0.6741
## 3rd Qu.:132.0 3rd Qu.: 0.000 3rd Qu.:119.77 3rd Qu.:0.5723
## Max. :271.0 Max. :50.000 Max. :119.77 Max. :4.3692
## .resid .hat .sigma .cooksd
## Min. :-96.772 Min. :0.0007206 Min. :19.72 Min. :5.000e-08
## 1st Qu.:-11.772 1st Qu.:0.0008085 1st Qu.:20.13 1st Qu.:3.877e-05
## Median : 0.297 Median :0.0008085 Median :20.13 Median :1.771e-04
## Mean : 0.000 Mean :0.0014409 Mean :20.13 Mean :7.080e-04
## 3rd Qu.: 13.228 3rd Qu.:0.0008085 3rd Qu.:20.14 3rd Qu.:5.403e-04
## Max. :151.228 Max. :0.0471172 Max. :20.14 Max. :5.279e-02
## .std.resid
## Min. :-4.809631
## 1st Qu.:-0.585072
## Median : 0.014764
## Mean : 0.000038
## 3rd Qu.: 0.657446
## Max. : 7.516143
3.1.1.4 R-squared
3.1.2 Conditional expectation and predictions
3.1.2.1 Conditional expectation
3.1.2.2 Linear models
3.1.2.3 Predictions and residuals
To make predictions based on a model, e.g., \[\hat{bwght}=\hat\beta_0+\hat\beta_1cigs\], we can use the predict()
function (e.g., predict(modela)
). Because we only specify the model with the predictions, R doesn’t know what x values we want to use. So in this case it defaults to using the original x variables. This gives us the so-called fitted values. of the model:
summary(predict(modela))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.08 119.77 119.77 118.70 119.77 119.77
If we want to make new predictions for two data points \(cigs=20\) and \(cigs=0\), we can add these x-values to the predict function:
predict(modela,data.table(cigs=c(20,0)))
## 1 2
## 109.4965 119.7719
3.1.3 Matrix algebra representation
3.1.3.1 Multiple regression
3.1.3.2 Derivation
3.1.3.3 Implementation
3.1.3.4 Multiple regression with LM
3.1.4 Special models
3.1.4.1 Binary independent variables
Suppose we want to run a multiple regression: \[\text{bwght}_i=\beta_0+\beta_1 \text{cigs}_i+\beta_2 \text{faminc}_i + \text{male}_i+e_i\] To run this model, we simply enter:
modelb <- lm(bwght~cigs+faminc+male,data=bwght)
summary(modelb)
##
## Call:
## lm(formula = bwght ~ cigs + faminc + male, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.521 -11.577 0.994 13.194 151.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.22771 1.20788 95.397 < 2e-16 ***
## cigs -0.46105 0.09134 -5.048 5.07e-07 ***
## faminc 0.09688 0.02915 3.324 0.00091 ***
## male 3.11397 1.07640 2.893 0.00388 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.01 on 1384 degrees of freedom
## Multiple R-squared: 0.03564, Adjusted R-squared: 0.03355
## F-statistic: 17.05 on 3 and 1384 DF, p-value: 7.105e-11
This notation inside the lm
command is called formula notation. A helpful guide on formula notation in R is available from the Booth School of Business.
Here, a \(1\) cigarette increase is associated with a \(-0.461\) decrease in \(oz\) of birthweight, but now we are controlling for family income and whether or not the child was male. To understand why we’re saying this consider two predictions. One child is a boy who is born with a mother who smokes \(20\) cigarettes per day has a family income of \(\$30,000\). The other child is also a boy whose mother smokes \(0\) cigarettes per day and also has a family income of \(\$30,000\). Clearly we are keeping family income and child sex constant. In this case the difference between the two predicted birthweights would be: \[\hat y_{20}-\hat y_{0}= 20\hat\beta_1+30\hat\beta_2+\hat\beta_3-0\hat\beta_1-30\hat\beta_2-\hat\beta_3\] \[\hat y_{20}-\hat y_{0}= 20\hat\beta_1=-9.22oz\] which is \(20\) times the estimated \(\beta_1\) value because this is a difference of \(20\) cigarettes per day.
In this linear model, we could also consider an interpretation for the \(\beta_3\) coefficient. If a child is male, then \(\text{male}_i=1\) otherwise \(\text{male}_i=0\). This means that if the child is male, we expect the birthweight to increase by \(\hat\beta_3=3.11oz\) controlling for cigarette smoking and family income. This is how we interpret models with binary or “dummy” variables.
3.1.4.2 Quadratics and polynomial functions
To perform a quadratic regression such as: \[\text{bwght}_i=\beta_0+\beta_1\text{cigs}_i+\beta_2\text{cigs}_i^2+e_i\] we have to use the I()
function in R. I()
tells R to ignore the special formula notation and do literally what commands we input. Here we are taking the square. So the I(cigs^2)
says add \(\text{cigs}^2\) to the linear model:
modelc <- lm(bwght~cigs+I(cigs^2),data=bwght)
summary(modelc)
##
## Call:
## lm(formula = bwght ~ cigs + I(cigs^2), data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.949 -11.949 1.051 13.051 151.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.949119 0.579843 206.865 < 2e-16 ***
## cigs -0.858407 0.207398 -4.139 3.7e-05 ***
## I(cigs^2) 0.013551 0.007339 1.846 0.065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.11 on 1385 degrees of freedom
## Multiple R-squared: 0.02513, Adjusted R-squared: 0.02372
## F-statistic: 17.85 on 2 and 1385 DF, p-value: 2.218e-08
3.1.4.3 Interactions
To include interactions between two variables, you can use the formula notaion for interactions which is a:b
to interact (essentially multiply) a
and b
and include the result in the model:
bwght$smokes <- as.numeric(bwght$cigs>0)
modele <- lm(bwght~smokes+faminc+male+white+smokes:faminc+smokes:male+smokes:white,data=bwght)
summary(modele)
##
## Call:
## lm(formula = bwght ~ smokes + faminc + male + white + smokes:faminc +
## smokes:male + smokes:white, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.340 -11.681 0.563 12.981 150.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.02091 1.54585 72.465 < 2e-16 ***
## smokes -5.54028 3.83280 -1.445 0.14855
## faminc 0.04112 0.03225 1.275 0.20255
## male 3.39415 1.16286 2.919 0.00357 **
## white 6.34911 1.48836 4.266 2.13e-05 ***
## smokes:faminc 0.12468 0.10066 1.239 0.21569
## smokes:male -2.85699 3.00414 -0.951 0.34176
## smokes:white -5.13610 3.73596 -1.375 0.16942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.88 on 1380 degrees of freedom
## Multiple R-squared: 0.05043, Adjusted R-squared: 0.04561
## F-statistic: 10.47 on 7 and 1380 DF, p-value: 7.211e-13
Here, I have interacted a dummy for smoking behavior with faminc, male, white. To include all the pairwise interactions we can enter:
bwght$smokes <- as.numeric(bwght$cigs>0)
modelf <- lm(bwght~(smokes+faminc+male+white)^2,data=bwght)
summary(modelf)
##
## Call:
## lm(formula = bwght ~ (smokes + faminc + male + white)^2, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.463 -11.706 0.626 13.163 150.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.08465 2.44118 46.324 <2e-16 ***
## smokes -6.05461 3.88408 -1.559 0.1193
## faminc -0.02513 0.08240 -0.305 0.7604
## male 2.92238 2.65627 1.100 0.2714
## white 5.87552 2.80561 2.094 0.0364 *
## smokes:faminc 0.12521 0.10109 1.239 0.2157
## smokes:male -2.37318 3.06905 -0.773 0.4395
## smokes:white -4.75201 3.79282 -1.253 0.2105
## faminc:male 0.04539 0.06128 0.741 0.4590
## faminc:white 0.05094 0.08315 0.613 0.5402
## male:white -1.16949 2.74399 -0.426 0.6700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.9 on 1377 degrees of freedom
## Multiple R-squared: 0.05112, Adjusted R-squared: 0.04423
## F-statistic: 7.418 on 10 and 1377 DF, p-value: 1.482e-11
Incidentally this is why we have to use the I()
function to create quadratics. In formula notation a^b
is used for adding interactions, not making things quadratic. Be careful about your notation here.
3.1.4.4 Log-linear specifications
If we want to take the natural logarithm of one of the variables, we simply surround that variable with the log()
function:
modeld <- lm(log(bwght)~cigs+faminc,data=bwght)
summary(modeld)
##
## Call:
## lm(formula = log(bwght) ~ cigs + faminc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62745 -0.08761 0.02148 0.12155 0.82230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7439566 0.0098431 481.958 < 2e-16 ***
## cigs -0.0040326 0.0008593 -4.693 2.96e-06 ***
## faminc 0.0008437 0.0002739 3.081 0.00211 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1883 on 1385 degrees of freedom
## Multiple R-squared: 0.02646, Adjusted R-squared: 0.02505
## F-statistic: 18.82 on 2 and 1385 DF, p-value: 8.608e-09
References
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Robinson, David, and Alex Hayes. 2018. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.