## 3.2 Model selection and endogeneity

### 3.2.2 Model selection criteria

Consider the bwght example from Chapter ??. Suppose we run the following model:

model1 <- lm(log(bwght)~cigs+faminc+male+white+motheduc+fatheduc,data=bwght)

I am using complete.cases here because the following methods are not robust to missing values (which complete.cases drops). To compute the Akaike information criterion (AIC) of the model, simply type:

AIC(model1)
## [1] -638.222

Similarly to get the BIC, use the BIC function.

### 3.2.3 Step-wise regression

To use step-wise regression to simplify the model by finding the lowest AIC value, just type:

model2 <- step(model1)

Alternatively, we can find the best model using the BIC by adding

model3 <- step(model1,k=log(nrow(bwght)))

Note that because we specified $$k=\ln(n)$$, we know this is using the BIC even though the output says AIC.

### 3.2.5 Endogeneity definition

#### 3.2.5.3 Unbiased simulation

We covered the build.nsim function in chapter ??:

dt <- build.nsim(nsim,c(10,25,50,100,250))
bigN <- nrow(dt)

Once the collections of simulation environment is created, we need to generate some data. For this, we can set up a few basic variables, dt$x and dt$y:

dt$x <- rnorm(bigN,2,1) dt$y <- 2*dt$x + rnorm(bigN,0,1) Here, $$x_i\sim iid\mathcal N(2,1)$$ and $$e_i\sim iid\mathcal N(0,1)$$ and $y_i=\beta_1 x_i + e_i = 2x_i+e_i$. Next we can run the linear models. The issue is that in order to do this inside our group-by statement, we need to output the result in some tidy format. Sometimes, we might be interest in simulating the glance statistics: dt[,glance(lm(y~x)),by=.(n,sim)] ## n sim r.squared adj.r.squared sigma statistic p.value ## 1: 10 1 0.7905678 0.7643888 1.0791416 30.19852 5.769226e-04 ## 2: 10 2 0.7803381 0.7528803 1.1099545 28.41960 7.016822e-04 ## 3: 10 3 0.8025502 0.7778689 1.0611060 32.51662 4.531588e-04 ## 4: 10 4 0.8521860 0.8337092 0.9665015 46.12207 1.390489e-04 ## 5: 10 5 0.6918718 0.6533557 1.1480775 17.96322 2.844030e-03 ## --- ## 49996: 250 49996 0.7926300 0.7917938 1.0153844 947.92974 1.073480e-86 ## 49997: 250 49997 0.8283488 0.8276566 0.9684280 1196.79010 6.934725e-97 ## 49998: 250 49998 0.8094907 0.8087225 1.0152319 1053.77373 2.879841e-91 ## 49999: 250 49999 0.7974613 0.7966446 0.9822531 976.45741 5.754041e-88 ## 50000: 250 50000 0.7894579 0.7886089 1.0259765 929.91148 7.066252e-86 ## df logLik AIC BIC deviance df.residual ## 1: 2 -13.83533 33.67065 34.57841 9.316372 8 ## 2: 2 -14.11686 34.23371 35.14147 9.855991 8 ## 3: 2 -13.66679 33.33357 34.24133 9.007568 8 ## 4: 2 -12.73294 31.46589 32.37364 7.473002 8 ## 5: 2 -14.45456 34.90911 35.81687 10.544655 8 ## --- ## 49996: 2 -357.54742 721.09484 731.65923 255.689348 248 ## 49997: 2 -345.71032 697.42063 707.98501 232.587472 248 ## 49998: 2 -357.50988 721.01977 731.58415 255.612577 248 ## 49999: 2 -349.25406 704.50812 715.07250 239.275669 248 ## 50000: 2 -360.14181 726.28363 736.84801 261.051672 248 Here, we are interested in obtaining the estimated $$\beta_1$$ coefficients: result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)] result ## n sim term estimate std.error statistic p.value ## 1: 10 1 x 1.522488 0.27705177 5.495318 5.769226e-04 ## 2: 10 2 x 1.893609 0.35520678 5.331004 7.016822e-04 ## 3: 10 3 x 2.094013 0.36722029 5.702335 4.531588e-04 ## 4: 10 4 x 2.415660 0.35569795 6.791323 1.390489e-04 ## 5: 10 5 x 2.200994 0.51931009 4.238303 2.844030e-03 ## --- ## 49996: 250 49996 x 2.049229 0.06655834 30.788468 1.073480e-86 ## 49997: 250 49997 x 2.095238 0.06056537 34.594654 6.934725e-97 ## 49998: 250 49998 x 2.005738 0.06178749 32.461881 2.879841e-91 ## 49999: 250 49999 x 1.933262 0.06186771 31.248319 5.754041e-88 ## 50000: 250 50000 x 2.070814 0.06790789 30.494450 7.066252e-86 This gives us 5000 different $$\hat\beta_1$$ values. We can then group-by the sample size used to find the mean and variance of $$\hat\beta_1$$: result[,.(mean(estimate),var(estimate)),by=n] ## n V1 V2 ## 1: 10 1.996826 0.143108233 ## 2: 25 2.000226 0.045006980 ## 3: 50 2.002831 0.021289006 ## 4: 100 2.000351 0.010120972 ## 5: 250 1.999203 0.003926791 You can clean up the results by creating the following function: tab.sim <- function(dtinput,column=quote(estimate)){ dt <- dtinput[,.(mean(eval(column)),var(eval(column))),by=n] names(dt) <- c("n","mean","variance") return(dt) } tab.sim(result) Table 12: Beta hat distribution without endogeneity n mean variance 10 1.996826 0.1431082 25 2.000226 0.0450070 50 2.002831 0.0212890 100 2.000351 0.0101210 250 1.999203 0.0039268 Notice that the $$\hat\beta_1$$ values from the model without endogeneity are centered around $$\beta_1=2$$ with a variance that approaches zero as $$n\rightarrow\infty$$. ### 3.2.6 Sources of endogeneity #### 3.2.6.1 Sample selection bias #### 3.2.6.2 Omitted variable bias dt$w <- rnorm(bigN,0,0.5)
dt$x <- rnorm(bigN,2,0.5) + dt$w
dt$z <- rnorm(bigN,1,0.5) + dt$w
dt$y <- 2*dt$x + 3*dt$z + rnorm(bigN,0,1) result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)] result <- dt[,tidy(lm(y~x))[2,],by=.(n,sim)] tab.sim(result) Table 13: Beta hat distribution under omitted variable bias n mean variance 10 3.510818 1.2778899 25 3.508936 0.3905536 50 3.502884 0.1853704 100 3.497456 0.0902560 250 3.498624 0.0357755 #### 3.2.6.3 Outlier bias #### 3.2.6.4 Measurement error bias dt$x <- rnorm(bigN,2,1)
dt$xstar <- dt$x + rnorm(bigN,0,1)
dt$y <- 2*dt$x + rnorm(bigN,0,1)
dt$ystar <- dt$y + rnorm(bigN,0,1)
tab.sim(dt[,tidy(lm(ystar~xstar))[2,],by=.(n,sim)])
##      n      mean    variance
## 1:  10 0.9996409 0.287528741
## 2:  25 1.0017050 0.089506290
## 3:  50 1.0000425 0.042499906
## 4: 100 0.9974228 0.020793890
## 5: 250 1.0007813 0.008082634
tab.sim(dt[,tidy(lm(y~xstar))[2,],by=.(n,sim)])
##      n      mean    variance
## 1:  10 1.0007801 0.215746414
## 2:  25 1.0017694 0.067357052
## 3:  50 1.0008632 0.031392542
## 4: 100 0.9981413 0.015638762
## 5: 250 1.0007376 0.006033224
tab.sim(dt[,tidy(lm(ystar~x))[2,],by=.(n,sim)])
##      n     mean    variance
## 1:  10 2.003924 0.286280370
## 2:  25 2.003591 0.090385638
## 3:  50 1.999744 0.042309487
## 4: 100 1.996857 0.020447698
## 5: 250 2.000420 0.008124784

#### 3.2.6.5 Reverse regression bias

dt$x <- rnorm(bigN,2,1) dt$y <- 2*dt$x + rnorm(bigN,0,1) result <- dt[,tidy(lm(x~y))[2,],by=.(n,sim)] result$estimate <- 1/result\$estimate
tab.sim(result[estimate<5])
##      n     mean    variance
## 1:  10 2.577353 0.255151301
## 2:  25 2.527498 0.075751517
## 3:  50 2.515748 0.035413432
## 4: 100 2.505908 0.016630351
## 5: 250 2.502658 0.006385136