EC295
Justin Smith
Wilfrid Laurier University
Fall 2022
We have discussed regression with one independent variable
Helps provide intuition
But, actual models rarely involve only one
In this section, we will cover multiple regressors
Regression models with more than one independent variable
Most real models have many such variables
Multiple regression allows us to
Explicitly hold constant other variables in a regression
More accurately estimate causal effects
Incorporate more general relationships between variables
The same intuition carries forward for multiple regression
A regression model with two regressors is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ u_{i}\]
\(X_{1i}\) is the first independent variable
\(X_{2i}\) is the second independent variable
\(\beta_{0}\) is the intercept
\(\beta_{1}\) is the all else equal effect of \(X_{1i}\) on \(Y_{i}\)
\(\beta_{2}\) is the all else equal effect of \(X_{2i}\) on \(Y_{i}\)
\(u_{i}\) is all factors other than \(X_{1i},X_{2i}\) that affect \(Y_{i}\)
The corresponding population regression function is
\[E[Y_{i}| X_{1i}, X_{2i}]= \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}\]
Key difference vs model with one regressor: when measuring slope, we explicitly hold other factors constant
\(\beta_{1}\) is effect of \(X_{1i}\) on \(Y_{i}\) holding \(X_{2i}\) fixed
\(\beta_{2}\) is effect of \(X_{2i}\) on \(Y_{i}\) holding \(X_{1i}\) fixed
To see this, take total change in \(E[Y_{i}| X_{1i}, X_{2i}]\) \[\Delta E[Y_{i}| X_{1i}, X_{2i}] = \beta_{1}\Delta X_{1i}+ \beta_{2}\Delta X_{2i}\]
Now hold \(X_{2i}\) fixed by setting \(\Delta X_{2i} = 0\) \[\Delta E[Y_{i}| X_{1i}, X_{2i}] = \beta_{1}\Delta X_{1i}\]
and as a result \[\beta_{1} = \frac{\Delta E[Y_{i}| X_{1i}, X_{2i}] }{\Delta X_{1i}}\]
Key is that \(\beta_{1}\) is calculated holding \(X_{2i}\) constant
Intercept also has slightly different interpretation
Example: Wages, schooling, and experience
\[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + u_{i}\]
\(\beta_{1}\) is effect of education on wages, holding experience fixed
\(\beta_{2}\) is effect of experience on wages, holding education fixed
\(u\) is other variables that affect wages
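As a purely illustrative example (the numbers are invented, not estimates from any data), suppose \(\beta_{0} = 10\), \(\beta_{1} = 2\), and \(\beta_{2} = 0.5\), with wages in dollars per hour
\[E[wage_{i}| educ_{i}, exper_{i}] = 10 + 2educ_{i} + 0.5exper_{i}\]
Then one extra year of education, holding experience fixed, raises expected wages by $2 per hour; one extra year of experience, holding education fixed, raises them by $0.50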
We will see that adding more variables to the model helps avoid bias in OLS
For OLS to be unbiased, need to assume unobserved variables have zero conditional mean
For variables included in the model, no longer need to make that assumption
Strengthens likelihood that OLS estimator is unbiased
The definitions of homoskedasticity and heteroskedasticity are natural extensions
Homoskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i}) = \sigma^2_{u}\)
Heteroskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i}) = \sigma^2_{ui}\)
A regression model with \(k\) regressors is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ... + \beta_{k}X_{ki} + u_{i}\]
\(X_{1i}, X_{2i}, ..., X_{ki}\) are independent variables
\(\beta_{0}\) is the intercept
\(\beta_{1}, \beta_{2}, ..., \beta_{k}\) are slope parameters
\(u_{i}\) is all factors other than \(X_{1i}, X_{2i}, ..., X_{ki}\) that affect \(Y_{i}\)
Population regression function is
\[E[Y_{i}| X_{1i}, X_{2i}, ...,X_{ki}]= \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ... + \beta_{k}X_{ki}\]
The partial effect of \(X_{ji}\) on \(Y_{i}\) is
\[\beta_{j} = \frac{\Delta E[Y_{i}| X_{1i}, X_{2i},..., X_{ki}] }{\Delta X_{ji}}\]
Intercept is value of \(E[Y_{i}| X_{1i}, X_{2i},..., X_{ki}]\) when all independent variables equal zero
As before, assumptions for errors are natural extensions
Homoskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i},..., X_{ki}) = \sigma^2_{u}\)
Heteroskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i},..., X_{ki}) = \sigma^2_{ui}\)
Remember that OLS chooses estimates of \(\beta\) to minimize the sum of the squared residuals in the sample
The population model is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ...+ \beta_{k}X_{ki}+ u_{i}\]
When we replace parameters with estimates, we get \[Y_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}+ ...+ \hat{\beta}_{k}X_{ki}+ \hat{u}_{i}\]
The sum of the squared residuals is \[\sum_{i=1}^{n} \left ( Y_{i}- \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right )^2\]
Minimizing with respect to each \(\hat{\beta}\) gives the first-order conditions \[\sum_{i=1}^{n} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) =0\] \[\sum_{i=1}^{n} X_{1i} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) = 0\] \[\vdots\] \[\sum_{i=1}^{n} X_{ki} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) = 0\]
The OLS fitted values and residuals are
\[\hat{Y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}+ ...+ \hat{\beta}_{k}X_{ki}\]
\[\hat{u}_{i} = Y_{i} - \hat{Y}_{i} = Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki}\]
\(\hat{\beta}_{0}\) is the estimated intercept
\(\hat{\beta}_{1},\hat{\beta}_{2},...,\hat{\beta}_{k}\) are the estimated partial effects
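In Stata, the fitted values and residuals can be recovered after estimation; a minimal sketch, using hypothetical variable names y, x1, and x2
quietly regress y x1 x2
predict yhat, xb         // fitted values
predict uhat, residuals  // residuals
Here yhat holds \(\hat{Y}_{i}\) and uhat holds \(\hat{u}_{i}\) for each observation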
To see partial effect interpretation, take the total change in \(\hat{Y}_{i}\)
\[\Delta \hat{Y}_{i} = \hat{\beta}_{1}\Delta X_{1i} + \hat{\beta}_{2}\Delta X_{2i}+ ...+ \hat{\beta}_{k}\Delta X_{ki}\]
If we set \(\Delta X_{2i} = 0,...,\Delta X_{ki} = 0\) then
\[\hat{\beta}_{1} = \frac{\Delta \hat{Y}_{i}} {\Delta X_{1i}}\]
Similarly, if we set \(\Delta X_{1i} = 0,\Delta X_{3i} = 0, ... ,\Delta X_{ki} = 0\) then
\[\hat{\beta}_{2} = \frac{\Delta \hat{Y}_{i}}{\Delta X_{2i}}\]
We saw that \(\hat{\beta}_{j}\) is the partial effect of \(X_{ji}\) on \(\hat{Y}_{i}\)
We can express \(\hat{\beta}_{j}\) with a simple formula consistent with this interpretation
Imagine a regression with \(k=2\)
\[\hat{Y}_{i} = \hat{\beta}_{0}+ \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}\]
Now, perform this two-step procedure
Regress \(X_{1i}\) on \(X_{2i}\), and obtain the residuals \(\hat{r}_{1i}\)
Estimate a regression of \(Y_{i}\) on \(\hat{r}_{1i}\)
The resulting slope estimate is \[\hat{\beta}_{1} = \frac{\sum_{i=1}^{n} \hat{r}_{1i}Y_{i}}{\sum_{i=1}^{n} \hat{r}_{1i}^2}\]
Why?
In step 1, we purge \(X_{1i}\) of the part that is correlated with \(X_{2i}\)
\(\hat{r}_{1i}\) is the piece of \(X_{1i}\) unrelated to \(X_{2i}\)
By using \(\hat{r}_{1i}\) in the step 2 regression, it is as if we have already controlled for \(X_{2i}\)
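In Stata, the two-step procedure could be sketched as follows, using hypothetical variable names y, x1, and x2
regress x1 x2
predict r1hat, residuals  // part of x1 unrelated to x2
regress y r1hat
The slope on r1hat from the last command reproduces \(\hat{\beta}_{1}\) from the full regression of y on x1 and x2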
Highlights intuition behind regression
When you regress \(Y_{i}\) on \(X_{1i}\), you separate \(Y_{i}\) into two pieces
Part that is correlated with \(X_{1i} \rightarrow \hat{Y}_{i}\)
Part that is uncorrelated with \(X_{1i} \rightarrow \hat{u}_{i}\)
Compare the simple regression (coefficients with tildes) to the multiple regression (coefficients with hats) \[\tilde{Y}_{i} = \tilde{\beta}_{0}+ \tilde{\beta}_{1}X_{1i}\] \[\hat{Y}_{i} = \hat{\beta}_{0}+ \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}\]
The slope coefficients have the following relationship
\[\tilde{\beta}_{1} =\hat{\beta}_{1} + \hat{\beta}_{2}\tilde{\delta}_{1}\]
where \(\tilde{\delta}_{1}\) is the slope from a simple regression of \(X_{2i}\) on \(X_{1i}\)
Simple and multiple regression coefficients are equal when
\(\hat{\beta}_{2} = 0\), which means \(X_{2i}\) is unrelated to \(Y_{i}\)
\(\tilde{\delta}_{1} = 0\), which means \(X_{1i}\) is unrelated to \(X_{2i}\)
Suppose we are interested in \(\beta_{1}\), the effect of \(X_{i}\) on \(Y_{i}\)
In regression of \(Y_{i}\) on \(X_{i}\), \(\hat{\beta}_{1}\) is unbiased if unobserved factors are unrelated to \(X_{i}\)
Mathematically, this means \(E[u_{i}|X_{i}] = 0\)
When this is true, we estimate a causal effect
Recall this is an assumption
When unobserved factors are related to \(X_{i}\), a solution is to add them to the model
This is multiple regression
It explicitly holds them fixed when measuring effect of \(X_{i}\) on \(Y_{i}\)
For these variables, we do not have to assume they are unrelated to \(X_{i}\)
It can often be difficult to hold all relevant factors fixed
Ex: class size and test scores
Many factors besides class size determine test scores
Family background, school quality, teachers, etc
Many are likely related to \(X_{i}\)
What if we exclude a relevant factor that is related to \(X_{i}\)?
Cannot assume \(E[u_{i}|X_{i}] = 0\)
Estimator \(\hat{\beta}_{1}\) will suffer from Omitted Variables Bias
\(E[\hat{\beta}_{1}] \neq \beta_{1}\)
\(\hat{\beta}_{1}\) will not measure all else equal (causal) effect \(X_{i}\) on \(Y_{i}\)
Instead, it partly measures the effect of \(X_{i}\) on \(Y_{i}\) and partly the effect of the excluded variable on \(Y_{i}\)
Suppose you are measuring effect of class size on test scores
Below are some potential omitted factors
Percentage of English as a second language (ESL) students
School districts with big classes tend to have more ESL students
ESL students tend to perform worse on standardized tests
Even if class size has no independent effect, bigger classes will perform worse
Parental background
Wealthier areas tend to have schools with smaller classes
Richer students may perform better on standardized tests, due to better resources at home
Even if class size has no independent effect, bigger classes will perform worse
Imagine that the correct model for outcome \(Y_{i}\) is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + u_{i}\]
Assume that this model satisfies the OLS assumptions
Most importantly, \(u_{i}\) is unrelated to both \(X_{1i}\) and \(X_{2i}\)
Mathematically \(E[u_{i}|X_{1i},X_{2i}] = 0\)
Suppose we omit \(X_{2i}\), pushing it to the error term so that \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + v_{i}\]
What happens if we try to estimate this second model?
We know from the simple regression model that
\[\tilde{\beta}_{1} =\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})v_{i}}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]
\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})E[v_{i} |X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]
\(\tilde{\beta}_{1}\) is unbiased if \(E[v_{i} |X_{1i}] = 0\)
Working out that expectation, we get
\[E[v_{i} |X_{1i}]=E[\beta_{2}X_{2i} + u_{i}|X_{1i}]= \beta_{2}E[X_{2i}|X_{1i}] + E[u_{i}|X_{1i}]\]
Based on assumption that \(u_{i}\) is unrelated to \(X_{1i}\), \(E[u_{i}|X_{1i}] = 0\)
Therefore
\[E[v_{i} |X_{1i}]= \beta_{2}E[X_{2i}|X_{1i}]\]
In general, this is not equal to zero, so \(\tilde{\beta}_{1}\) is biased
To quantify the bias, substitute back into the original formula to get
\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})\beta_{2}E[X_{2i}|X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]
Let \(\delta_{1} = \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})E[X_{2i}|X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\)
The formula above then simplifies to
\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} +\beta_{2}\delta_{1}\]
The bias in OLS estimators from an omitted variable is \(\beta_{2}\delta_{1}\)
Depends on two factors
\(\beta_{2}\), the relationship between \(Y_{i}\) and \(X_{2i}\)
\(\delta_{1}\), the relationship between \(X_{2i}\) and \(X_{1i}\)
We can use this to quantify the direction of the bias
| | \(corr(X_{2i}, X_{1i}) > 0\) | \(corr(X_{2i}, X_{1i}) < 0\) |
|---|---|---|
| \(\beta_{2} > 0\) | Positive Bias | Negative Bias |
| \(\beta_{2} < 0\) | Negative Bias | Positive Bias |
There are two situations where omitted variables do not cause bias
\(\beta_{2} = 0\), which means \(X_{2i}\) is unrelated to \(Y_{i}\)
\(\delta_{1} = 0\), which means \(X_{2i}\) is unrelated to \(X_{1i}\)
Example: Ability Bias
The classic example of omitted variables bias is leaving ability out of a regression of wages on schooling
Suppose the true model linking wages to schooling is
\[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}abil_{i}+ u_{i}\]
Often we do not have measures of ability
If you were to leave it out and estimate
\[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + v_{i}\]
The estimator would be biased
\(\beta_{2} > 0\), ability is positively related to wages
\(\delta_{1} > 0\), higher ability people get more schooling
Bias is positive: the return to education will appear higher than it really is
In more complicated models, omitting one variable can bias all estimators
Suppose the true model is
\[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + \beta_{3}abil_{i} + u_{i}\]
If we leave ability out of the model and estimate
\[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} +v_{i}\]
Then estimators for \(\beta_{1}\) and \(\beta_{2}\) are biased
This is true even if only one variable is correlated with \(abil_{i}\)
Imagine that \(educ_{i}\) and \(abil_{i}\) are correlated
But, \(exper_{i}\) and \(abil_{i}\) are uncorrelated
Estimator for \(\beta_{2}\) is still biased, unless \(exper_{i}\) and \(educ_{i}\) are unrelated
We will again illustrate Stata commands in the context of a research question
Research Question: Are test scores related to class size?
Previous model is extended with another independent variable
Add percent free/reduced price lunch in district
Acts as a proxy for parent income
Students only qualify when parent income is below a cutoff
We will see how the model changes with the additional variable
We will assume the model relating math scores to explanatory factors is
\[testscr_{i} = \beta_{0} + \beta_{1}str_{i} + \beta_{2}mealpct_{i} + u_{i}\]
\(\beta_{1}\) is effect of one extra student per teacher on test scores, all else equal
\(\beta_{2}\) is effect of a 1-percentage point increase in free meal status on test scores, all else equal
\(\beta_{0}\) is math scores when str and mealpct are zero
\(u\) are things other than the model variables that explain math scores
Meal percent variable is explicitly correlated with student teacher ratio
Intercept is 700
Slope on str is -1
Slope on mealpct is -0.5
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
testscr | 420 657.2409 16.86402 612.8703 704.8796
str | 420 20.13071 2.103167 14.2861 27.30753
mealpct | 420 46.91004 24.46755 0 100
Things to note
Test scores are simulated data, scaled to have a mean of roughly 650 and an SD of roughly 20
All data is simulated to look like real California data
In real data, you might need to deal with things like missing values
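The summary statistics above and the regression output below can be produced with commands along these lines (assuming the simulated dataset is in memory)
summarize testscr str mealpct
regress testscr str mealpct, robust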
Linear regression Number of obs = 420
F(2, 417) = 383.40
Prob > F = 0.0000
R-squared = 0.6376
Root MSE = 10.177
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
str | -.8082448 .2293604 -3.52 0.000 -1.259091 -.3573982
mealpct | -.5341559 .0203418 -26.26 0.000 -.5741412 -.4941706
_cons | 698.5687 4.558356 153.25 0.000 689.6085 707.5289
------------------------------------------------------------------------------
The output indicates that
\(\hat{\beta}_{1} = -0.81\): 1-student increase in class size reduces scores by 0.81 points, holding mealpct fixed
\(\hat{\beta}_{2} = -0.53\): 1-percentage point increase in free meals reduces scores by 0.53 points, holding str fixed
Linear regression Number of obs = 420
F(1, 418) = 3.89
Prob > F = 0.0494
R-squared = 0.0099
Root MSE = 16.801
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
res | -.8082448 .4100521 -1.97 0.049 -1.614266 -.0022236
_cons | 657.2409 .8197914 801.72 0.000 655.6295 658.8523
------------------------------------------------------------------------------
The first step keeps the part of class size that is unrelated to free meal status (res)
The second step regresses test scores on the residuals (res) from the first step
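The two steps could be carried out with commands along these lines (assuming the dataset is in memory); the output above is from the final command
quietly regress str mealpct
predict res, residuals
regress testscr res, robust
Note that the coefficient on res (-0.808) is identical to the coefficient on str in the multiple regression shown earlier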
Linear regression Number of obs = 420
F(1, 418) = 22.09
Prob > F = 0.0000
R-squared = 0.0547
Root MSE = 16.416
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
str | -1.874765 .3988784 -4.70 0.000 -2.658823 -1.090708
_cons | 694.9813 7.995686 86.92 0.000 679.2645 710.698
------------------------------------------------------------------------------
Notice the coefficient on str is more negative (larger in magnitude) than in the multiple regression
Because str and mealpct are positively related and mealpct and testscr are negatively related, the omitted variable biases the estimate downward
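As a quick check of the earlier formula \(\tilde{\beta}_{1} =\hat{\beta}_{1} + \hat{\beta}_{2}\tilde{\delta}_{1}\), using the rounded estimates above
\[\tilde{\delta}_{1} = \frac{\tilde{\beta}_{1} - \hat{\beta}_{1}}{\hat{\beta}_{2}} = \frac{-1.875-(-0.808)}{-0.534} \approx 2.0\]
so a one-student increase in str is associated with roughly a 2-percentage-point increase in mealpct, consistent with how the data were simulated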
\[R^2 = \frac{ESS}{TSS} = 1-\frac{SSR}{TSS}\]
Recall that an \(R^2\) is between 0 and 1, with higher values meaning better fit
\(R^2\) is still the square of the correlation coefficient between \(Y_{i}\) and \(\hat{Y}_{i}\)
\[R^2 = \frac{\left ( \sum_{i=1}^{n}(y_{i} - \bar{y})(\hat{y}_{i} - \bar{\hat{y}}) \right )^2}{\left (\sum_{i=1}^{n}(y_{i} - \bar{y})^2 \right ) \left (\sum_{i=1}^{n}(\hat{y}_{i} - \bar{\hat{y}})^2 \right )}\]
\[TSS = \sum_{i=1}^{n}(Y_{i} - \bar{Y})^2\]
\[ESS = \sum_{i=1}^{n}(\hat{Y}_{i} - \bar{Y})^2\]
\[SSR = \sum_{i=1}^{n}\hat{u}_{i}^2\]
\[TSS = ESS + SSR\]
Recall that a low \(R^2\) does not mean regression is bad
Simply means we have not explained large proportion of variation in \(Y_{i}\)
Does not affect whether \(\hat{\beta}_{j}\) is good estimate for \(\beta_{j}\)
Important property of \(R^2\) is that it never decreases with additional variables
Adding variables cannot reduce explanatory power of regression
ESS cannot fall when variables added to regression
Makes \(R^2\) bad tool for deciding whether to add or drop variables
Ex: adding variable with random values slightly increases \(R^2\)
But clearly random values do not belong in regression
We discussed that \(R^2\) measures goodness of fit
The fraction of variation in \(Y_{i}\) explained by \(X_{i}\)
Correlation between actual \(Y_{i}\) and fitted values
How closely datapoints fall along a straight line
One issue with \(R^2\) is that it never falls with additional variables
With new variables, ESS stays the same or goes up
You cannot explain less of variation in \(Y_{i}\) with more variables
\(R^2\) is therefore not a useful tool for deciding whether a variable should be added to the model
Recall that the \(R^2\) is written as \[R^2 = 1- \frac{SSR}{TSS}\]
You can also write that as \[R^2 = 1- \frac{SSR/n}{TSS/n}\]
Think of \(SSR/n\) as estimate of error variance, and \(TSS/n\) as estimate of variance in \(y\)
But
\(SSR/n\) is biased estimator of error variance
\(TSS/n\) is biased estimator of variance in \(y\)
Replacing these with unbiased estimators gives the adjusted \(R^2\) \[\bar{R}^2 = 1- \frac{SSR/(n-k-1)}{TSS/(n-1)} = 1- \frac{n-1}{n-k-1} \frac{SSR}{TSS}\]
This measure does not always rise when a variable is added to the model
Imagine adding one new variable
\(SSR\) will fall
\(\frac{n-1}{n-k-1}\) rises (because \(k\) rises)
Effect on \(\bar{R}^2\) depends which is stronger
Thus, \(\bar{R}^2\) can fall with new variables
\(\bar{R}^2\) is always less than \({R}^2\)
\(\frac{n-1}{n-k-1}\) is always greater than 1
So we subtract a bigger number from 1 in the \(\bar{R}^2\) formula
\(\bar{R}^2\) can be negative if the model has a very poor fit
The SER is still the estimated standard deviation of the residuals
The formula changes due to degrees of freedom adjustment
\[SER = s_{\hat{u}} = \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n}\hat{u}_{i}^2}\] \[= \sqrt{\frac{SSR}{n-k-1}}\]
\(k\) represents the number of independent variables in the model
Each parameter estimate uses information, which is the reason for the adjustment
If \(k=1\), then formula is same as we learned previously
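As a quick worked example, the class-size regression reported below has \(n = 420\), \(k = 2\), and \(R^2 = 0.6376\), so
\[\bar{R}^2 = 1- \frac{420-1}{420-2-1}(1 - 0.6376) = 1- \frac{419}{417}(0.3624) \approx 0.636\]
which matches (up to rounding) the Adj R-squared of 0.6358 shown later in these notes in the non-robust output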
Linear regression Number of obs = 420
F(2, 417) = 383.40
Prob > F = 0.0000
R-squared = 0.6376
Root MSE = 10.177
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
str | -.8082448 .2293604 -3.52 0.000 -1.259091 -.3573982
mealpct | -.5341559 .0203418 -26.26 0.000 -.5741412 -.4941706
_cons | 698.5687 4.558356 153.25 0.000 689.6085 707.5289
------------------------------------------------------------------------------
\(R^2\) is in the top right (\(\bar{R}^2\) only appears without robust option)
The SER is equal to the “Root MSE”
Like before, we need to make assumptions for causal inference
Maintain the same three assumptions we learned in the last section
Add extra assumption about relationship between independent variables
Assumption 1: the average error term \(u_{i}\) conditional on all X variables is zero \[E[u_{i}| X_{1i},X_{2i},...,X_{ki}]=0\]
This is the same condition as in the simple regression model
Difference is that error has average of zero for all combinations of \(X_{1i},X_{2i},...,X_{ki}\)
Also implies that \(u_{i}\) is not systematically related to any of \(X_{1i},X_{2i},...,X_{ki}\)
How can this assumption fail?
Model misspecification
Omitted variables
Simultaneous equations
Assumption 2: the observations \((Y_{i}, X_{1i}, X_{2i},...,X_{ki})\) are independent across \(i\)
They must also come from the same distribution (i.i.d.)
As discussed before, this holds true when we take a simple random sample
This assumption is used to establish that the slope estimators are consistent and have a Normal distribution in large samples
Assumption 3: large outliers are unlikely
Outliers are values far from the usual range of the data
They can have a large effect on slope estimates
Assumption 4: there is no perfect multicollinearity among the regressors
Collinearity refers to correlations among independent variables
Perfect collinearity is an exact linear relationship among the independent variables (possibly involving the constant term)
Perfectly collinear variables cannot all be included in the model
Collinear variables can be included
But not perfectly collinear ones
Example of perfect collinearity: Height on income
Suppose you want to estimate the effect of height on income
Your regression model is
\[Y_{i} = \beta_{0} + \beta_{1}h\_inch_{i} + \beta_{2}h\_cm_{i} + u_{i}\]
Where \(h\_cm_{i} = 2.54*h\_inch_{i}\)
You cannot estimate a model with both \(h\_cm_{i}\) and \(h\_inch_{i}\)
\(h\_cm\) does not move independently from \(h\_inch_{i}\)
These variables are perfectly collinear
Ex of imperfect collinearity: Height and weight on income
Suppose you want effect of height and weight on income
Your regression model is \(y = \beta_{0} + \beta_{1}h\_inch + \beta_{2}w\_lbs + u\)
Height and weight are very highly related
But there is no exact, linear relationship
This model can be estimated
Ex of imperfect collinearity: Height and height\(^2\) on income
Regression model is \(y = \beta_{0} + \beta_{1}h\_inch + \beta_{2}h\_inch^2 + u\)
\(h\_inch^2\) is a perfect nonlinear function of \(h\_inch\)
This is allowed because there is no exact, linear relationship
This model can also be estimated
Problem of perfect collinearity is we cannot estimate the model
Intuitively, it asks the regression to answer an illogical question
As we discuss later, the solution is to drop one of the collinear variables
We can show that if assumptions 1 - 4 are true, then
\[E[\hat{\beta}_{j}] = \beta_{j}, j=0,1,...,k\]
Which means that each \(\hat{\beta}_{j}\) is an unbiased estimator for \(\beta_{j}\)
The proof is long, but follows similar logic to the simple regression model
Implies that we are not making any systematic errors in estimating \(\beta_{j}\)
Remember that this is a statistical property
Remember that from one sample to the next, the value of OLS estimators will differ
Variance and standard deviation describe this variation
Variance of OLS estimators under heteroskedasticity is complicated
If we are willing to assume homoskedasticity, we can simplify
Imposing this assumption, the variance of the \(\hat{\beta}_{j}\) is \[Var[\hat{\beta}_{j}|X_{1i},X_{2i},...,X_{ki}] = \frac{\sigma_{u}^2 }{(\sum_{i=1}^{n}(X_{ji} - \bar{X}_{j})^2)(1-R^2_{j})}\]
Formula applies for all slope estimates from \(j = 1,...,k\)
The term \(R^2_{j}\) is the \(R^2\) from a regression of \({X}_{ji}\) on all other independent variables
The part of \({X}_{ji}\) that is explained by all other variables
Note: this is not the original model \(R^2\)
This is an auxiliary \(R^2\) from regressing \({X}_{ji}\) on other independent variables
The variance of \(\hat{\beta}_{j}\) consists of 3 components
The error variance \(\sigma_{u}^2\)
If the error is more variable, the estimate \(\hat{\beta}_{j}\) is more variable
Means that values of \(Y_{i}\) are more spread out around \(E[Y_{i}|X_{1i},X_{2i},...,X_{ki}]\)
More “noise” in the regression
Noise carries into the slope estimators
Sample variation in \(X_{j}\), \(\sum_{i=1}^{n}(X_{ji} - \bar{X}_{j})^2\)
More variation in \(X_{j}\) reduces variance in \(\hat{\beta}_{j}\)
More spread in \(X_{j}\) makes it easier to estimate slope parameters
Important to sample such that \(X_{j}\) is spread out widely
\(R^2_{j}\) is the \(R^2\) from a regression of \({X}_{ji}\) on all other independent variables
A larger \(R^2_{j}\) increases the variance of \(\hat{\beta}_{j}\)
Larger \(R^2_{j}\) means other variables explain more variation in \({X}_{j}\)
Means \({X}_{j}\) is more collinear with other independent variables
There is less independent variation in \(X_{j}\)
Higher values will increase the variance of \(\hat{\beta}_{j}\)
Higher \(R^2_{j}\) illustrates problems created by collinearity
Larger \(R^2_{j}\) means \({X}_{j}\) more collinear with other independent variables
At the extreme, \(R^2_{j} = 1\) means perfect collinearity
As \(R^2_{j} \rightarrow 1\), \(var(\hat{\beta}_{j}) \rightarrow \infty\)
Thus, too much collinearity creates imprecise estimates of \(\hat{\beta}_{j}\)
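In our class-size example, \(R^2_{j}\) for str can be obtained from the auxiliary regression; a minimal sketch, assuming the simulated dataset is in memory
quietly regress str mealpct
display "R-squared_j for str = " e(r2)
The closer e(r2) is to 1, the more collinear str is with mealpct and the larger the variance of \(\hat{\beta}_{1}\)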
As before, results from simple regression carry over to multiple regression
Given 4 assumptions, OLS estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}, ...\hat{\beta}_{k}\) are Normally distributed in large samples
Mean and variance are as discussed previously
Recall that variance formula above only appropriate under homoskedasticity
More complicated formula for heteroskedastic errors
Knowing distribution of \(\hat{\beta}_{0}, \hat{\beta}_{1}, ...\hat{\beta}_{k}\) will help us for hypothesis testing
In our example we can simulate the sampling distribution of the slope
Below we simulate the distribution for \(\hat{\beta}_{1}\)
clear all
local sims = 9999
set obs `sims'
set seed 12345
set more off
// beta1 will store the estimated slope on str from each simulated sample
gen beta1 = .
forvalues x = 1/`sims' {
    preserve
    clear
    // generate one simulated sample of 420 districts
    qui set obs 420
    gen str = rnormal(20,2)
    gen mealpct = 7 + 2*str + rnormal(0,25)
    qui replace mealpct = 0 if mealpct < 0
    qui replace mealpct = 100 if mealpct > 100
    gen u = rnormal(0,10)
    gen testscr = 700 - 1*str - 0.5*mealpct + u
    // estimate the regression and keep the slope on str
    qui regress testscr str mealpct
    restore
    qui replace beta1 = _b[str] in `x'
    display "Iteration `x'"
}
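After the loop finishes, the simulated sampling distribution of \(\hat{\beta}_{1}\) can be inspected with, for example
summarize beta1
histogram beta1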
No perfect multicollinearity is one of the assumptions of the model
Consequence of violating the assumption is the model cannot be estimated
Effectively like trying to estimate effect of a variable while holding it fixed
OLS does not work in this case
We discussed some examples, and here we expand on that list
Also add a few other details
In each example below, imagine we start with the model \[TestScore_{i} = \beta_{0} + \beta_{1}STR_{i} + \beta_{2}PctESL_{i} + u_{i}\]
Then we create a multicollinearity problem by adding a third variable
Adding Fraction ESL
Fraction ESL varies between 0 and 1
It is perfectly functionally related to Percent ESL: \[PctESL_{i} = 100 \times FracESL_{i}\]
Impossible to estimate effect of \(PctESL_{i}\) holding \(FracESL_{i}\) constant
Because they measure exactly the same thing
“Not very small” classes
Suppose \(NVS_{i} = 1\{STR \ge 12\}\)
Also creates perfect collinearity for subtle reason
In the data no classes have \(STR_{i} < 12\), so \(NVS_{i} = 1\) for all observations
Perfectly collinear with the constant term in the regression
Percent English Speakers
Define English speakers as those who are not ESL students \[PctES_{i} = 100 - PctESL_{i}\]
\(PctES_{i}\) is an exact function of the constant and \(PctESL_{i}\) \[PctES_{i} = 100 \times (1) - PctESL_{i}\]
Dummy variables create special case of perfect collinearity
Consider estimating a gender wage gap
Create two dummy variables
\(male = 1\) if person is male, and 0 otherwise
\(female = 1\) if person is female, and 0 otherwise
Adding both \(male\) and \(female\) to regression creates perfect multicollinearity
\(male + female = 1\)
This is perfectly collinear with the constant
This is called the dummy variable trap
Generally, dummy variable trap is
There are G binary variables, and each observation falls into one category
Including all G variables creates perfect multicollinearity
To avoid dummy variable trap, you can only include \(G-1\) dummy variables
Male-female example
G = 2, since we divide people into male or female
Based on rule above, can only include one
Either include \(male\) or \(female\), but not both
Which one you include depends on preference
Typically, perfect multicollinearity arises from a mistake
Solution is to find mistake, and drop collinear variable
Sometimes it is easy to find mistake, sometimes not
In practice, Stata does this automatically
It looks for perfect linear relationships among \(X\) variables
If it finds one, it drops one or more variables until no more relationship exists
A perfect relationship between regressors makes us unable to produce OLS estimates
However, regressors are still allowed to be related
A relationship between regressors that is not exact is called imperfect multicollinearity
Imperfect multicollinearity is allowed
Key issue for imperfect multicollinearity is that it causes variance of \(\hat{\beta}_{j}\) to rise
Intuition: OLS estimates independent effect of each \(X_{j}\) on \(Y_{i}\)
If \(X_{j}\) is highly related to another variable, there is little independent variation
There is no obvious solution to imperfect multicollinearity
You can still estimate all the \(\hat{\beta}_{j}\)
But they will not be very precise (i.e. their variance will be high)
Suppose we try to add a collinear variable to the model
mealdec is mealpct divided by 100
These two variables are perfectly linearly related
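Commands along these lines produce the output below (assuming the dataset is in memory)
generate mealdec = mealpct/100
regress testscr str mealpct mealdec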
note: mealdec omitted because of collinearity.
Source | SS df MS Number of obs = 420
-------------+---------------------------------- F(2, 417) = 366.81
Model | 75975.9601 2 37987.98 Prob > F = 0.0000
Residual | 43185.6115 417 103.562617 R-squared = 0.6376
-------------+---------------------------------- Adj R-squared = 0.6358
Total | 119161.572 419 284.395159 Root MSE = 10.177
------------------------------------------------------------------------------
testscr | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
str | -.8082448 .2399457 -3.37 0.001 -1.279899 -.3365909
mealpct | -.5341559 .0206251 -25.90 0.000 -.5746981 -.4936138
mealdec | 0 (omitted)
_cons | 698.5687 4.786449 145.95 0.000 689.1601 707.9773
------------------------------------------------------------------------------
Now suppose we replace str with a dummy variable, smallclass, equal to 1 for smaller classes
Linear regression Number of obs = 420
F(2, 417) = 376.52
Prob > F = 0.0000
R-squared = 0.6355
Root MSE = 10.206
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
smallclass | 3.018267 1.010161 2.99 0.003 1.032625 5.003909
mealpct | -.534675 .020274 -26.37 0.000 -.574527 -.494823
_cons | 680.8421 1.268558 536.71 0.000 678.3486 683.3357
------------------------------------------------------------------------------
Next try to add both dummies
Stata drops one of the dummies because \(smallclass + bigclass = 1\)
This is perfectly collinear with the constant
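A sketch of the commands (the cutoff defining a small class is not given in these notes, so 20 students per teacher is used purely for illustration)
generate smallclass = (str < 20)   // hypothetical cutoff
generate bigclass = 1 - smallclass
regress testscr smallclass bigclass mealpct, robust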
note: bigclass omitted because of collinearity.
Linear regression Number of obs = 420
F(2, 417) = 376.52
Prob > F = 0.0000
R-squared = 0.6355
Root MSE = 10.206
------------------------------------------------------------------------------
| Robust
testscr | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
smallclass | 3.018267 1.010161 2.99 0.003 1.032625 5.003909
bigclass | 0 (omitted)
mealpct | -.534675 .020274 -26.37 0.000 -.574527 -.494823
_cons | 680.8421 1.268558 536.71 0.000 678.3486 683.3357
------------------------------------------------------------------------------