3.2 Model selection and endogeneity

3.2.1 Maximum likelihood

3.2.2 Model selection criteria

Consider the bwght example from Chapter ??. Suppose we run the following model:

model1 <- lm(log(bwght)~cigs+faminc+male+white+motheduc+fatheduc,data=bwght)

I am using complete.cases here because the following methods are not robust to missing values (which complete.cases drops). To compute the Akaike information criterion (AIC) of the model, simply type:

AIC(model1)

## [1] -638.222

Similarly to get the BIC, use the BIC function.

3.2.3 Step-wise regression

To use step-wise regression to simplify the model by finding the lowest AIC value, just type:

model2 <- step(model1)

Alternatively, we can find the best model using the BIC by adding

model3 <- step(model1,k=log(nrow(bwght)))

Note that because we specified $k=\ln(n)$, we know this is using the BIC even though the output says AIC.

3.2.4 A note on hypothesis testing

3.2.5 Endogeneity definition

3.2.5.1 Expected value of the coefficients

3.2.5.2 Exogeneity assumption

3.2.5.3 Unbiased simulation

We covered the build.nsim function in chapter ??:

dt <- build.nsim(nsim,c(10,25,50,100,250))
bigN <- nrow(dt)

Once the collections of simulation environment is created, we need to generate some data. For this, we can set up a few basic variables, dt$x and dt$y:

dt$x <- rnorm(bigN,2,1)
dt$y <- 2*dt$x + rnorm(bigN,0,1)

Here, $x_i\sim iid\mathcal N(2,1)$ and $e_i\sim iid\mathcal N(0,1)$ and \[y_i=\beta_1 x_i + e_i = 2x_i+e_i\]. Next we can run the linear models. The issue is that in order to do this inside our group-by statement, we need to output the result in some tidy format. Sometimes, we might be interest in simulating the glance statistics:

dt[,glance(lm(y~x)),by=.(n,sim)]

##          n   sim r.squared adj.r.squared     sigma  statistic      p.value
##     1:  10     1 0.7905678     0.7643888 1.0791416   30.19852 5.769226e-04
##     2:  10     2 0.7803381     0.7528803 1.1099545   28.41960 7.016822e-04
##     3:  10     3 0.8025502     0.7778689 1.0611060   32.51662 4.531588e-04
##     4:  10     4 0.8521860     0.8337092 0.9665015   46.12207 1.390489e-04
##     5:  10     5 0.6918718     0.6533557 1.1480775   17.96322 2.844030e-03
##    ---                                                                    
## 49996: 250 49996 0.7926300     0.7917938 1.0153844  947.92974 1.073480e-86
## 49997: 250 49997 0.8283488     0.8276566 0.9684280 1196.79010 6.934725e-97
## 49998: 250 49998 0.8094907     0.8087225 1.0152319 1053.77373 2.879841e-91
## 49999: 250 49999 0.7974613     0.7966446 0.9822531  976.45741 5.754041e-88
## 50000: 250 50000 0.7894579     0.7886089 1.0259765  929.91148 7.066252e-86
##        df     logLik       AIC       BIC   deviance df.residual
##     1:  2  -13.83533  33.67065  34.57841   9.316372           8
##     2:  2  -14.11686  34.23371  35.14147   9.855991           8
##     3:  2  -13.66679  33.33357  34.24133   9.007568           8
##     4:  2  -12.73294  31.46589  32.37364   7.473002           8
##     5:  2  -14.45456  34.90911  35.81687  10.544655           8
##    ---                                                         
## 49996:  2 -357.54742 721.09484 731.65923 255.689348         248
## 49997:  2 -345.71032 697.42063 707.98501 232.587472         248
## 49998:  2 -357.50988 721.01977 731.58415 255.612577         248
## 49999:  2 -349.25406 704.50812 715.07250 239.275669         248
## 50000:  2 -360.14181 726.28363 736.84801 261.051672         248

Here, we are interested in obtaining the estimated $\beta_1$ coefficients:

result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)]
result

##          n   sim term estimate  std.error statistic      p.value
##     1:  10     1    x 1.522488 0.27705177  5.495318 5.769226e-04
##     2:  10     2    x 1.893609 0.35520678  5.331004 7.016822e-04
##     3:  10     3    x 2.094013 0.36722029  5.702335 4.531588e-04
##     4:  10     4    x 2.415660 0.35569795  6.791323 1.390489e-04
##     5:  10     5    x 2.200994 0.51931009  4.238303 2.844030e-03
##    ---                                                          
## 49996: 250 49996    x 2.049229 0.06655834 30.788468 1.073480e-86
## 49997: 250 49997    x 2.095238 0.06056537 34.594654 6.934725e-97
## 49998: 250 49998    x 2.005738 0.06178749 32.461881 2.879841e-91
## 49999: 250 49999    x 1.933262 0.06186771 31.248319 5.754041e-88
## 50000: 250 50000    x 2.070814 0.06790789 30.494450 7.066252e-86

This gives us 5000 different $\hat\beta_1$ values. We can then group-by the sample size used to find the mean and variance of $\hat\beta_1$:

result[,.(mean(estimate),var(estimate)),by=n]

##      n       V1          V2
## 1:  10 1.996826 0.143108233
## 2:  25 2.000226 0.045006980
## 3:  50 2.002831 0.021289006
## 4: 100 2.000351 0.010120972
## 5: 250 1.999203 0.003926791

You can clean up the results by creating the following function:

tab.sim <- function(dtinput,column=quote(estimate)){
  dt <- dtinput[,.(mean(eval(column)),var(eval(column))),by=n]
  names(dt) <- c("n","mean","variance")
  return(dt)
}

tab.sim(result)

Table 12: Beta hat distribution without endogeneity
n	mean	variance
10	1.996826	0.1431082
25	2.000226	0.0450070
50	2.002831	0.0212890
100	2.000351	0.0101210
250	1.999203	0.0039268

Notice that the $\hat\beta_1$ values from the model without endogeneity are centered around $\beta_1=2$ with a variance that approaches zero as $n\rightarrow\infty$.

3.2.6 Sources of endogeneity

3.2.6.1 Sample selection bias

3.2.6.2 Omitted variable bias

dt$w <- rnorm(bigN,0,0.5)
dt$x <- rnorm(bigN,2,0.5) + dt$w
dt$z <- rnorm(bigN,1,0.5) + dt$w
dt$y <- 2*dt$x + 3*dt$z + rnorm(bigN,0,1)

result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)]

result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)]
tab.sim(result)

Table 13: Beta hat distribution under omitted variable bias
n	mean	variance
10	3.510818	1.2778899
25	3.508936	0.3905536
50	3.502884	0.1853704
100	3.497456	0.0902560
250	3.498624	0.0357755

3.2.6.3 Outlier bias

3.2.6.4 Measurement error bias

dt$x <- rnorm(bigN,2,1)
dt$xstar <- dt$x + rnorm(bigN,0,1)
dt$y <- 2*dt$x + rnorm(bigN,0,1)
dt$ystar <- dt$y + rnorm(bigN,0,1)
tab.sim(dt[,tidy(lm(ystar~xstar))[2,],by=.(n,sim)])

##      n      mean    variance
## 1:  10 0.9996409 0.287528741
## 2:  25 1.0017050 0.089506290
## 3:  50 1.0000425 0.042499906
## 4: 100 0.9974228 0.020793890
## 5: 250 1.0007813 0.008082634

tab.sim(dt[,tidy(lm(y~xstar))[2,],by=.(n,sim)])

##      n      mean    variance
## 1:  10 1.0007801 0.215746414
## 2:  25 1.0017694 0.067357052
## 3:  50 1.0008632 0.031392542
## 4: 100 0.9981413 0.015638762
## 5: 250 1.0007376 0.006033224

tab.sim(dt[,tidy(lm(ystar~x))[2,],by=.(n,sim)])

##      n     mean    variance
## 1:  10 2.003924 0.286280370
## 2:  25 2.003591 0.090385638
## 3:  50 1.999744 0.042309487
## 4: 100 1.996857 0.020447698
## 5: 250 2.000420 0.008124784

3.2.6.5 Reverse regression bias

dt$x <- rnorm(bigN,2,1)
dt$y <- 2*dt$x + rnorm(bigN,0,1)
result <- dt[,tidy(lm(x~y))[2,],by=.(n,sim)]
result$estimate <- 1/result$estimate
tab.sim(result[estimate<5])

##      n     mean    variance
## 1:  10 2.577353 0.255151301
## 2:  25 2.527498 0.075751517
## 3:  50 2.515748 0.035413432
## 4: 100 2.505908 0.016630351
## 5: 250 2.502658 0.006385136