Linear Regression with Multiple Regressors

EC295

Justin Smith

Wilfrid Laurier University

Fall 2022

Motivation

Motivation

  • We have discussed regression with one independent variable

    • Helps provide intuition

    • But, actual models rarely involve only one

  • In this section, we will cover multiple regressors

    • Regression models with more than one independent variable

    • Most real models have many such variables

  • Multiple regression allows us to

    • Explicitly hold constant other variables in a regression

    • More accurately estimate causal effects

    • Incorporate more general relationships between variables

  • The same intuition carries forward for multiple regression

    • Math is slightly more involved

The Multiple Regression Model

Model with Two Regressors

  • A regression model with two regressors is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ u_{i}\]

    • \(X_{1i}\) is the first independent variable

    • \(X_{2i}\) is the second independent variable

    • \(\beta_{0}\) is the intercept

    • \(\beta_{1}\) is the all else equal effect of \(X_{1i}\) on \(Y_{i}\)

    • \(\beta_{2}\) is the all else equal effect of \(X_{2i}\) on \(Y_{i}\)

    • \(u_{i}\) is all factors other than \(X_{1i},X_{2i}\) that affect \(Y_{i}\)

  • The corresponding population regression function is

\[E[Y_{i}| X_{1i}, X_{2i}]= \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}\]

Model with Two Regressors

  • Key difference vs model with one regressor: when measuring slope, we explicitly hold other factors constant

    • \(\beta_{1}\) is effect of \(X_{1i}\) on \(Y_{i}\) holding \(X_{2i}\) fixed

    • \(\beta_{2}\) is effect of \(X_{2i}\) on \(Y_{i}\) holding \(X_{1i}\) fixed

  • To see this, take total change in \(E[Y_{i}| X_{1i}, X_{2i}]\) \[\Delta E[Y_{i}| X_{1i}, X_{2i}] = \beta_{1}\Delta X_{1i}+ \beta_{2}\Delta X_{2i}\]

  • Now hold \(X_{2i}\) fixed by setting \(\Delta X_{2i} = 0\) \[\Delta E[Y_{i}| X_{1i}, X_{2i}] = \beta_{1}\Delta X_{1i}\]

  • and as a result \[\beta_{1} = \frac{\Delta E[Y_{i}| X_{1i}, X_{2i}] }{\Delta X_{1i}}\]
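
  • For example, with hypothetical values \(\beta_{1} = 2\) and \(\beta_{2} = -3\), a change of \(\Delta X_{1i} = 0.5\) while holding \(X_{2i}\) fixed implies \[\Delta E[Y_{i}| X_{1i}, X_{2i}] = 2 \times 0.5 = 1\]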

Model with Two Regressors

  • Key is that \(\beta_{1}\) is calculated holding \(X_{2i}\) constant

    • Called the partial effect of \(X_{1i}\) on \(Y_{i}\)
  • Intercept also has slightly different interpretation

    • Average value of \(Y_{i}\) when \(X_{1i}\) and \(X_{2i}\) are zero
  • Example: Wages, schooling, and experience

    • If \(Y\) is wages, \(X_{1i}\) is education, and \(X_{2i}\) is experience, then

    \[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + u_{i}\]

    • \(\beta_{1}\) is effect of education on wages, holding experience fixed

    • \(\beta_{2}\) is the effect of experience on wages, holding education fixed

    • \(u\) is other variables that affect wages

Model with Two Regressors

  • We will see that adding more variables to the model helps avoid bias in OLS

    • For OLS to be unbiased, need to assume unobserved variables have zero conditional mean

    • For variables included in the model, no longer need to make that assumption

      • Though we still do for variables that remain unobserved
    • Strengthens likelihood that OLS estimator is unbiased

  • The definitions of homoskedasticity and heteroskedasticity are natural extensions

    • Homoskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i}) = \sigma^2_{u}\)

      • A constant number that does not vary across individuals
    • Heteroskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i}) = \sigma^2_{ui}\)

      • Does vary across individuals

Model with \(k\) Regressors

  • A regression model with \(k\) regressors is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ... + \beta_{k}X_{ki} + u_{i}\]

    • \(X_{1i}, X_{2i}, ..., X_{ki}\) are independent variables

    • \(\beta_{0}\) is the intercept

    • \(\beta_{1}, \beta_{2}, ..., \beta_{k}\) are slope parameters

      • Partial effects holding other factors in model constant
    • \(u_{i}\) is all factors other than \(X_{1i}, X_{2i}, ..., X_{ki}\) that affect \(Y_{i}\)

  • Population regression function is

\[E[Y_{i}| X_{1i}, X_{2i}, ...,X_{ki}]= \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ... + \beta_{k}X_{ki}\]

Model with \(k\) Regressors

  • The partial effect of \(X_{ji}\) on \(Y_{i}\) is

    \[\beta_{j} = \frac{\Delta E[Y_{i}| X_{1i}, X_{2i},..., X_{ki}] }{\Delta X_{ji}}\]

    • The effect of \(X_{ji}\) on \(Y_{i}\) holding all other variables fixed
  • Intercept is value of \(E[Y_{i}| X_{1i}, X_{2i},..., X_{ki}]\) when all independent variables equal zero

  • As before, assumptions for errors are natural extensions

    • Homoskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i},..., X_{ki}) = \sigma^2_{u}\)

    • Heteroskedastic errors are \(VAR(u_{i}|X_{1i}, X_{2i},..., X_{ki}) = \sigma^2_{ui}\)

Estimation by OLS

Ordinary Least Squares

  • Remember that OLS chooses estimates of \(\beta\) to minimize the sum of the squared residuals in the sample

  • The population model is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i}+ ...+ \beta_{k}X_{ki}+ u_{i}\]

  • When we replace parameters with estimates, we get \[Y_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}+ ...+ \hat{\beta}_{k}X_{ki}+ \hat{u}_{i}\]

  • The sum of the squared residuals is \[\sum_{i=1}^{n} \left ( Y_{i}- \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right )^2\]

Ordinary Least Squares

  • Minimizing this function, we get the following equations

\[\sum_{i=1}^{n} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) =0\] \[\sum_{i=1}^{n} X_{1i} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) = 0\] \[\vdots\] \[\sum_{i=1}^{n} X_{ki} \left ( Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki} \right ) = 0\]

  • The values \(\hat{\beta}_{0}, \hat{\beta}_{1} ,...,\hat{\beta}_{k}\) are the solutions to these equations
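
  • As a quick check (a sketch using Stata's built-in auto dataset, not the course data), the fitted residuals satisfy these equations: they sum to zero and are orthogonal to each regressor

* Sketch: verify the OLS first-order conditions with Stata's auto data
sysuse auto, clear
quietly regress price mpg weight
predict uhat, resid

* Sum of residuals is (numerically) zero
quietly sum uhat
display r(sum)

* Sum of X_ji * uhat is (numerically) zero for each regressor
gen mpg_uhat = mpg*uhat
gen wgt_uhat = weight*uhat
quietly sum mpg_uhat
display r(sum)
quietly sum wgt_uhat
display r(sum)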

Ordinary Least Squares

  • The predicted values are computed as

\[\hat{Y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}+ ...+ \hat{\beta}_{k}X_{ki}\]

  • The residuals are then computed as

\[\hat{u}_{i} = Y_{i} - \hat{Y}_{i} = Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}X_{1i} - \hat{\beta}_{2}X_{2i}- ...- \hat{\beta}_{k}X_{ki}\]

  • \(\hat{\beta}_{0}\) is the estimated intercept

    • Predicted value \(\hat{Y}_{i}\) when all independent variables equal zero
  • \(\hat{\beta}_{1},\hat{\beta}_{2},...,\hat{\beta}_{k}\) are the estimated partial effects

    • Change in \(\hat{Y}_{i}\) when one variable changes, holding all other variables fixed

Ordinary Least Squares

  • To see partial effect interpretation, take the total change in \(\hat{Y_{i}}\)

    \[\Delta \hat{Y_{i}} = \hat{\beta}_{1}\Delta X_{1i} + \hat{\beta}_{2}\Delta X_{2i}+ ...+ \hat{\beta}_{k}\Delta X_{ki}\]

    • If we set \(\Delta X_{2i} = 0,...,\Delta X_{ki} = 0\) then

      \[\hat{\beta}_{1} = \frac{\Delta \hat{Y}_{i}} {\Delta X_{1i}}\]

    • Similarly, if we set \(\Delta X_{1i} = 0,\Delta X_{3i} = 0, ... ,\Delta X_{ki} = 0\) then

      \[\hat{\beta}_{2} = \frac{\Delta \hat{Y}_{i}}{\Delta X_{2i}}\]

Partialling Out Interpretation

  • We saw that \(\hat{\beta}_{j}\) is the partial effect of \(X_{ji}\) on \(\hat{Y}_{i}\)

  • We can express \(\hat{\beta}_{j}\) with a simple formula consistent with this interpretation

  • Imagine a regression with \(k=2\)

\[\hat{Y}_{i} = \hat{\beta}_{0}+ \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}\]

  • Now, perform this two-step procedure

    1. Regress \(X_{1i}\) on \(X_{2i}\), and obtain the residuals \(\hat{r}_{i1}\)

    2. Estimate a regression of \(Y_{i}\) on \(\hat{r}_{i1}\)

Partialling Out Interpretation

  • \(\hat{\beta}_{1}\) equals the slope estimate from the step 2 regression

\[\hat{\beta}_{1} = \frac{\sum_{i=1}^{n} \hat{r}_{i1}Y_{i}}{\sum_{i=1}^{n} \hat{r}_{i1}^2}\]

  • Why?

    • In step 1, we purge \(X_{1i}\) of part that is correlated with \(X_{2i}\)

    • \(\hat{r}_{i1}\) is the piece of \(X_{1i}\) that is unrelated to \(X_{2i}\)

    • By using \(\hat{r}_{i1}\) in the step 2 regression, it is as if we have already controlled for \(X_{2i}\)

  • Highlights intuition behind regression

    • When you regress \(Y_{i}\) on \(X_{1i}\), you separate \(Y_{i}\) into two pieces

      • Part that is correlated with \(X_{1i} \rightarrow \hat{Y}_{i}\)

      • Part that is uncorrelated with \(X_{1i} \rightarrow \hat{u}_{i}\)

Comparing Simple and Multiple Regression

  • Imagine the following predicted values from two different regressions

\[\tilde{Y}_{i} = \tilde{\beta}_{0}+ \tilde{\beta}_{1}X_{1i}\] \[\hat{Y}_{i} = \hat{\beta}_{0}+ \hat{\beta}_{1}X_{1i} + \hat{\beta}_{2}X_{2i}\]

  • The slope coefficients have the following relationship

    \[\tilde{\beta}_{1} =\hat{\beta}_{1} + \hat{\beta}_{2}\tilde{\delta}_{1}\]

    • \(\tilde{\delta}_{1}\) is the slope in a regression of \(X_{2i}\) on \(X_{1i}\)
  • Simple and multiple regression coefficients are equal when

    1. \(\hat{\beta}_{2} = 0\), which means \(X_{2i}\) is unrelated to \(Y_{i}\)

    2. \(\tilde{\delta}_{1} = 0\), which means \(X_{1i}\) is unrelated to \(X_{2i}\)

Omitted Variables Bias

Introduction

  • Suppose we are interested in \(\beta_{1}\), the effect of \(X_{i}\) on \(Y_{i}\)

  • In regression of \(Y_{i}\) on \(X_{i}\), \(\hat{\beta}_{1}\) is unbiased if unobserved factors are unrelated to \(X_{i}\)

    • Mathematically, this means \(E[u_{i}|X_{i}] = 0\)

    • When this is true, we estimate a causal effect

    • Recall this is an assumption

  • When unobserved factors are related to \(X_{i}\), a solution is to add them to the model

    • This is multiple regression

    • It explicitly holds them fixed when measuring effect of \(X_{i}\) on \(Y_{i}\)

    • For these variables, we do not have to assume they are unrelated to \(X_{i}\)

Introduction

  • It can often be difficult to hold all relevant factors fixed

    • Ex: class size and test scores

      • Many factors besides class size determine test scores

      • Family background, school quality, teachers, etc

      • Many are likely related to \(X_{i}\)

  • What if we exclude a relevant factor that is related to \(X_{i}\)?

  • Cannot assume \(E[u_{i}|X_{i}] = 0\)

  • Estimator \(\hat{\beta}_{1}\) will suffer from Omitted Variables Bias

    • \(E[\hat{\beta}_{1}] \neq \beta_{1}\)

    • \(\hat{\beta}_{1}\) will not measure the all else equal (causal) effect of \(X_{i}\) on \(Y_{i}\)

    • Instead it measures partly the effect of \(X_{i}\) on \(Y_{i}\) and partly the effect of the excluded variable on \(Y_{i}\)

Introduction

  • Suppose you are measuring effect of class size on test scores

  • Below are some potential omitted factors

    1. Percentage of English as a second language (ESL) students

      • School districts with big classes have lots of ESL students

      • ESL students tend to perform worse on standardized tests

      • Even if class size has no independent effect, bigger classes will perform worse

    2. Parental background

      • Wealthier areas tend to have schools with smaller classes

      • Richer students may perform better on standardized tests, due to better resources at home

      • Even if class size has no independent effect, bigger classes will perform worse

Formula for Omitted Variables Bias

  • Imagine that the correct model for outcome \(Y_{i}\) is \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + u_{i}\]

    • Assume that this model satisfies the OLS assumptions

      • Most importantly, \(u_{i}\) is unrelated to both \(X_{1i}\) and \(X_{2i}\)

      • Mathematically \(E[u_{i}|X_{1i},X_{2i}] = 0\)

  • Suppose we omit \(X_{2i}\), pushing it to the error term so that \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + v_{i}\]

    • where the new error is \(v_{i} =\beta_{2}X_{2i} + u_{i}\)

Formula for Omitted Variables Bias

  • What happens if we try to estimate this second model?

  • We know from the simple regression model that

\[\tilde{\beta}_{1} =\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})v_{i}}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]

  • The expected value of this estimator is

\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})E[v_{i} |X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]

  • \(\tilde{\beta}_{1}\) is unbiased if \(E[v_{i} |X_{1i}] = 0\)

  • Working out that expectation, we get

\[E[v_{i} |X_{1i}]=E[\beta_{2}X_{2i} + u_{i}|X_{1i}]= \beta_{2}E[X_{2i}|X_{1i}] + E[u_{i}|X_{1i}]\]

Formula for Omitted Variables Bias

  • Based on assumption that \(u_{i}\) is unrelated to \(X_{1i}\), \(E[u_{i}|X_{1i}] = 0\)

  • Therefore

\[E[v_{i} |X_{1i}]= \beta_{2}E[X_{2i}|X_{1i}]\]

  • In general, this is not equal to zero, so \(\tilde{\beta}_{1}\) is biased

    • We discuss two exceptions below
  • To quantify the bias, substitute back into the original formula to get

\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} + \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})\beta_{2}E[X_{2i}|X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\]

Formula for Omitted Variables Bias

  • Let \(\delta_{1} = \frac{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})E[X_{2i}|X_{1i}]}{\sum_{i=1}^{n}(X_{1i} - \bar{X}_{1})^2}\)

    • \(\delta_{1}\) represents the relationship between \(X_{2i}\) and \(X_{1i}\)
  • The formula above then simplifies to

\[E[\tilde{\beta}_{1}| X_{1i}]=\beta_{1} +\beta_{2}\delta_{1}\]

  • The bias in OLS estimators from an omitted variable is \(\beta_{2}\delta_{1}\)

  • Depends on two factors

    • \(\beta_{2}\), the relationship between \(Y_{i}\) and \(X_{2i}\)

    • \(\delta_{1}\), the relationship between \(X_{2i}\) and \(X_{1i}\)
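
  • For example, with hypothetical values \(\beta_{2} = -0.5\) and \(\delta_{1} = 2\), the bias is \(\beta_{2}\delta_{1} = -1\), so \[E[\tilde{\beta}_{1}| X_{1i}] = \beta_{1} - 1\]

    • The estimator is centered one full unit below the true slope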

Formula for Omitted Variables Bias

  • We can use this to quantify the direction of the bias

    Direction of Omitted Variables Bias

                          \(corr(X_{2i}, X_{1i}) > 0\)   \(corr(X_{2i}, X_{1i}) < 0\)
    \(\beta_{2} > 0\)     Positive bias                  Negative bias
    \(\beta_{2} < 0\)     Negative bias                  Positive bias
  • There are two situations where omitted variables do not cause bias

    1. \(\beta_{2} = 0\), which means \(X_{2i}\) is unrelated to \(Y_{i}\)

      • If \(X_{2i}\) is irrelevant, then it will not cause bias
    2. \(\delta_{1} = 0\), which means \(X_{2i}\) is unrelated to \(X_{1i}\)

      • Even if \(X_{2i}\) is relevant, it will not cause bias if it is unrelated to \(X_{1i}\)

Formula for Omitted Variables Bias

  • Example: Ability Bias

    • The classic example of omitted variables bias is leaving ability out of a regression of wages on schooling

    • Suppose the true model linking wages to schooling is

      \[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}abil_{i}+ u_{i}\]

    • Often we do not have measures of ability

    • If you were to leave it out and estimate

      \[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + v_{i}\]

    • The estimator would be biased

      • \(\beta_{2} > 0\), ability is positively related to wages

      • \(\delta_{1} > 0\), higher ability people get more schooling

      • Bias is positive: the return to education will appear higher than it really is

Formula for Omitted Variables Bias

  • In more complicated models, omitting one variable can bias all estimators

    • Suppose the true model is

      \[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + \beta_{3}abil_{i} + u_{i}\]

    • If we leave ability out of the model and estimate

      \[wage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} +v_{i}\]

    • Then estimators for \(\beta_{1}\) and \(\beta_{2}\) are biased

    • This is true even if only one variable is correlated with \(abil_{i}\)

      • Imagine that \(educ_{i}\) and \(abil_{i}\) are correlated

      • But, \(exper_{i}\) and \(abil_{i}\) are uncorrelated

      • Estimator for \(\beta_{2}\) is still biased, unless \(exper_{i}\) and \(educ_{i}\) are unrelated

Example with Stata

  • We will again illustrate Stata commands in the context of a research question

  • Research Question: Are test scores related to class size?

  • Previous model is extended with another independent variable

    • Add percent free/reduced price lunch in district

    • Acts as a proxy for parent income

    • Students only qualify when parent income is below a cutoff

  • We will see how the model changes with the additional variable

Example with Stata

  • We will assume the model relating math scores to explanatory factors is

    \[testscr_{i} = \beta_{0} + \beta_{1}str_{i} + \beta_{2}mealpct_{i} + u_{i}\]

    • \(\beta_{1}\) is effect of one extra student per teacher on test scores, all else equal

    • \(\beta_{2}\) is effect of a 1-percentage point increase in free meal status on test scores, all else equal

    • \(\beta_{0}\) is the average test score when str and mealpct are both zero

    • \(u\) are things other than the model variables that explain math scores

Example with Stata

  • First create the simulated data
clear
set obs 420
set seed 12345

gen str = rnormal(20,2)
gen mealpct = 7 + 2*str + rnormal(0,25)
    replace mealpct = 0 if mealpct < 0
    replace mealpct = 100 if mealpct > 100
gen u = rnormal(0,10)

gen testscr = 700 -1 * str - 0.5*mealpct + u
  • Meal percent variable is explicitly correlated with student teacher ratio

  • Intercept is 700

  • Slope on str is -1

  • Slope on mealpct is -0.5

Example with Stata

  • Summarize the relevant data
sum testscr str mealpct
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     testscr |        420    657.2409    16.86402   612.8703   704.8796
         str |        420    20.13071    2.103167    14.2861   27.30753
     mealpct |        420    46.91004    24.46755          0        100
  • Things to note

    • Test scores are simulated to be scaled roughly like real data, with a mean near 650 and an SD near 20

    • All data are simulated to look like the real California data

    • In real data, you might need to deal with things like missing values

      • But we have avoided those complications here

Example with Stata

  • Estimate model by OLS
regress testscr str mealpct, robust
Linear regression                               Number of obs     =        420
                                                F(2, 417)         =     383.40
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6376
                                                Root MSE          =     10.177

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         str |  -.8082448   .2293604    -3.52   0.000    -1.259091   -.3573982
     mealpct |  -.5341559   .0203418   -26.26   0.000    -.5741412   -.4941706
       _cons |   698.5687   4.558356   153.25   0.000     689.6085    707.5289
------------------------------------------------------------------------------
  • The output indicates that

    • \(\hat{\beta}_{1} = -0.81\): a 1-student increase in class size reduces scores by 0.81 points, holding mealpct fixed

    • \(\hat{\beta}_{2} = -0.53\): a 1-percentage point increase in free meals reduces scores by 0.53 points, holding str fixed

Example with Stata

  • We can also estimate \(\hat{\beta}_{1}\) by “partialling out” mealpct from str
regress str mealpct
predict res, resid
regress testscr res, robust
Linear regression                               Number of obs     =        420
                                                F(1, 418)         =       3.89
                                                Prob > F          =     0.0494
                                                R-squared         =     0.0099
                                                Root MSE          =     16.801

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         res |  -.8082448   .4100521    -1.97   0.049    -1.614266   -.0022236
       _cons |   657.2409   .8197914   801.72   0.000     655.6295    658.8523
------------------------------------------------------------------------------
  • First step keeps part of class size that is unrelated to free meal status (res)

  • Second step regresses math scores on values from the first step
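
  • A related check (a sketch, reusing the res variable created above): if we also partial mealpct out of testscr and regress residuals on residuals, the slope on the residualized class size is identical
regress testscr mealpct
predict resy, resid
regress resy res, robust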

Example with Stata

  • We can illustrate omitted variables bias by leaving mealpct out of the regression
regress testscr str, robust
Linear regression                               Number of obs     =        420
                                                F(1, 418)         =      22.09
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0547
                                                Root MSE          =     16.416

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         str |  -1.874765   .3988784    -4.70   0.000    -2.658823   -1.090708
       _cons |   694.9813   7.995686    86.92   0.000     679.2645     710.698
------------------------------------------------------------------------------
  • Notice the coefficient on str is now much more negative (\(-1.87\) vs \(-0.81\))

  • Because str and mealpct are positively related and mealpct and testscr are negatively related, the omitted variables bias is negative
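
  • We can also verify the omitted variables bias formula numerically (a sketch, assuming the simulated data above are still in memory): the simple-regression slope equals \(\hat{\beta}_{1} + \hat{\beta}_{2}\tilde{\delta}_{1}\)
quietly regress mealpct str
local delta1 = _b[str]
quietly regress testscr str mealpct
display _b[str] + _b[mealpct]*`delta1'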

Measures of Fit in Multiple Regression

\(R^2\)

  • We continue to use \(R^2\) as a measure of goodness of fit

\[R^2 = \frac{ESS}{TSS} = 1-\frac{SSR}{TSS}\]

  • Recall that an \(R^2\) is between 0 and 1, with higher values meaning better fit

  • \(R^2\) is still the square of the correlation coefficient between \(Y_{i}\) and \(\hat{Y}_{i}\)

\[R^2 = \frac{\left ( \sum_{i=1}^{n}(Y_{i} - \bar{Y})(\hat{Y}_{i} - \bar{\hat{Y}}) \right )^2}{\left (\sum_{i=1}^{n}(Y_{i} - \bar{Y})^2 \right ) \left (\sum_{i=1}^{n}(\hat{Y}_{i} - \bar{\hat{Y}})^2 \right )}\]

\(R^2\)

  • The definitions of the sums of squares are the same

\[TSS = \sum_{i=1}^{n}(Y_{i} - \bar{Y})^2\]

\[ESS = \sum_{i=1}^{n}(\hat{Y}_{i} - \bar{Y})^2\]

\[SSR = \sum_{i=1}^{n}\hat{u}_{i}^2\]

  • It is still true that you can decompose total sum of squares into the explained and residual component

\[TSS = ESS + SSR\]
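
  • A quick check of this decomposition (a sketch, assuming the simulated data from the earlier example are in memory), using the results Stata stores after regress
quietly regress testscr str mealpct
display "ESS = " e(mss)
display "SSR = " e(rss)
display "TSS = ESS + SSR = " e(mss) + e(rss)
display "R-squared = " e(mss)/(e(mss) + e(rss))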

\(R^2\)

  • Recall that a low \(R^2\) does not mean regression is bad

    • Simply means we have not explained large proportion of variation in \(Y_{i}\)

    • Does not affect whether \(\hat{\beta}_{j}\) is good estimate for \(\beta_{j}\)

  • Important property of \(R^2\) is that it never decreases with additional variables

    • Adding variables cannot reduce explanatory power of regression

    • ESS cannot fall when variables added to regression

      • Equivalently, SSR cannot rise with more variables in regression
    • Makes \(R^2\) bad tool for deciding whether to add or drop variables

      • Ex: adding variable with random values slightly increases \(R^2\)

      • But clearly random values do not belong in regression

Adjusted \(R^2\)

  • We discussed that \(R^2\) measures goodness of fit

    • The fraction of variation in \(Y_{i}\) explained by the regressors

    • The squared correlation between actual \(Y_{i}\) and the fitted values

    • How closely the data points fall along the regression line

  • One issue with \(R^2\) is that it never falls with additional variables

    • With new variables, ESS stays the same or goes up

    • You cannot explain less of variation in \(Y_{i}\) with more variables

  • As a result, \(R^2\) is not a useful tool for deciding whether a variable should be added to the model

Adjusted \(R^2\)

  • Recall that the \(R^2\) is written as \[R^2 = 1- \frac{SSR}{TSS}\]

  • You can also write that as \[R^2 = 1- \frac{SSR/n}{TSS/n}\]

  • Think of \(SSR/n\) as estimate of error variance, and \(TSS/n\) as estimate of variance in \(y\)

  • But

    • \(SSR/n\) is biased estimator of error variance

    • \(TSS/n\) is biased estimator of variance in \(y\)

Adjusted \(R^2\)

  • The adjusted \(R^2\) replaces these biased estimators with unbiased ones

\[\bar{R}^2 = 1- \frac{SSR/(n-k-1)}{TSS/(n-1)} = 1- \frac{n-1}{n-k-1} \frac{SSR}{TSS}\]

  • There are three useful things about the adjusted \(R^2\)
  1. This measure does not always rise when a variable is added to the model

    • Imagine adding one new variable

      • \(SSR\) will fall

      • \(\frac{n-1}{n-k-1}\) rises (because \(k\) rises)

      • Effect on \(\bar{R}^2\) depends which is stronger

      • Thus, \(\bar{R}^2\) can fall with new variables

Adjusted \(R^2\)

  2. \(\bar{R}^2\) is always less than \({R}^2\)

    • \(\frac{n-1}{n-k-1}\) is always greater than 1

    • So we subtract a bigger number from 1 in the \(\bar{R}^2\) formula

  3. \(\bar{R}^2\) can be negative if the model has a very poor fit

    • Happens mainly when \(SSR\) is large, \(n\) is small, and \(k\) is large
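
  • To see the formula in action (a sketch, assuming the simulated data from the earlier example are in memory), compute \(\bar{R}^2\) by hand and compare it with the value Stata stores
quietly regress testscr str mealpct
display "R-squared          = " e(r2)
display "Adjusted R-squared = " 1 - (e(rss)/(e(N)-e(df_m)-1))/((e(rss)+e(mss))/(e(N)-1))
display "Stata's e(r2_a)    = " e(r2_a)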

Standard Error of Regression (SER)

  • The SER is still the estimated standard deviation of the residuals

    • Measures how far the \(Y_{i}\) values are from the line, on average
  • The formula changes due to degrees of freedom adjustment

\[SER = s_{\hat{u}} = \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n}\hat{u}_{i}^2}\] \[= \sqrt{\frac{SSR}{n-k-1}}\]

  • \(k\) represents the number of independent variables in the model

    • Each estimated parameter uses up a degree of freedom, which is the reason for the adjustment

    • If \(k=1\), then formula is same as we learned previously

Example with Stata

  • Below is the regression output from earlier
regress testscr str mealpct, robust
Linear regression                               Number of obs     =        420
                                                F(2, 417)         =     383.40
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6376
                                                Root MSE          =     10.177

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         str |  -.8082448   .2293604    -3.52   0.000    -1.259091   -.3573982
     mealpct |  -.5341559   .0203418   -26.26   0.000    -.5741412   -.4941706
       _cons |   698.5687   4.558356   153.25   0.000     689.6085    707.5289
------------------------------------------------------------------------------
  • \(R^2\) is in the top right (\(\bar{R}^2\) only appears without robust option)

  • The SER is equal to the “Root MSE”
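
  • As a check (a sketch, assuming the simulated data are in memory), the Root MSE can be reproduced from the stored sums of squares
quietly regress testscr str mealpct
display "SER by hand    = " sqrt(e(rss)/(e(N)-e(df_m)-1))
display "Stata Root MSE = " e(rmse)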

Least Squares Assumptions for Multiple Regression

Introduction

  • Like before, we need to make assumptions for causal inference

    • To estimate the causal (all else equal) effect of an independent variable on the outcome
  • Maintain the same three assumptions we learned in last section

  • Add extra assumption about relationship between independent variables

    • They cannot be perfectly related to each other

Assumption 1: Zero Conditional Mean of the Error

  • The average error term \(u_{i}\) conditional on all X variables is zero \[E[u_{i}| X_{1i},X_{2i},...,X_{ki}]=0\]

  • This is the same condition as in the simple regression model

  • Difference is that error has average of zero for all combinations of \(X_{1i},X_{2i},...,X_{ki}\)

  • Also implies that \(u_{i}\) is not systematically related to any of \(X_{1i},X_{2i},...,X_{ki}\)

  • How can this assumption fail?

    • Model misspecification

    • Omitted variables

    • Simultaneous equations

Assumption 2: (\(Y_{i},X_{1i},X_{2i},...,X_{ki})\) are iid

  • Draws must be independent across observations

  • Each draw must also come from the same distribution

  • As discussed before, this holds when we take a simple random sample

  • This assumption is used to establish that the slope estimators are consistent and Normally distributed in large samples

Assumption 3: Large Outliers Unlikely

  • Outliers are values far from the usual range of data

  • They can have large effect on slope estimates

Assumption 4: No Perfect Multicollinearity

  • Collinearity refers to correlations among independent variables

  • Perfect collinearity is an exact linear relationship among the independent variables (including the constant)

  • Perfectly collinear variables cannot be included in the model

    • Collinear variables can be included

    • But not perfectly collinear ones

Assumption 4: No Perfect Multicollinearity

  • Example of perfect collinearity: Height on income

    • Suppose you want to estimate the effect of height on income

    • Your regression model is

    \[Y_{i} = \beta_{0} + \beta_{1}h\_inch_{i} + \beta_{2}h\_cm_{i} + u_{i}\]

    • Where \(h\_cm_{i} = 2.54*h\_inch_{i}\)

    • You cannot estimate a model with both \(h\_cm_{i}\) and \(h\_inch_{i}\)

    • \(h\_cm\) does not move independently from \(h\_inch_{i}\)

    • These variables are perfectly collinear

Assumption 4: No Perfect Multicollinearity

  • Example of imperfect collinearity: Height and weight on income

    • Suppose you want the effect of height and weight on income

    • Your regression model is \(y = \beta_{0} + \beta_{1}h\_inch + \beta_{2}w\_lbs + u\)

    • Height and weight are very highly related

    • But there is no exact, linear relationship

    • This model can be estimated

Assumption 4: No Perfect Multicollinearity

  • Example of imperfect collinearity: Height and height\(^2\) on income

    • Regression model is \(y = \beta_{0} + \beta_{1}h\_inch + \beta_{2}h\_inch^2 + u\)

    • \(h\_inch^2\) is a perfect nonlinear function of \(h\_inch\)

    • This is allowed because there is no exact, linear relationship

    • This model can also be estimated

  • Problem of perfect collinearity is we cannot estimate the model

  • Intuitively, it asks the regression to answer an illogical question

    • What is effect of \(X_{i}\) on \(Y_{i}\) holding \(X_{i}\) fixed?
  • As we discuss later, the solution is to drop one of the collinear variables

Distribution of the OLS Estimators in Multiple Regression

Expected Value of OLS Estimator

  • We can show that if assumptions 1 - 4 are true, then

    \[E[\hat{\beta}_{j}] = \beta_{j}, j=0,1,...,k\]

  • Which means that each \(\hat{\beta}_{j}\) is an unbiased estimator for \(\beta_{j}\)

  • The proof is long, but follows similar logic to the simple regression model

  • Implies that we are not making any systematic errors in estimating \(\beta_{j}\)

    • Only reason \(\hat{\beta}_{j}\) differs from \(\beta_{j}\) is sampling error
  • Remember that this is a statistical property

    • In theory, over many repeated samples, the average of all \(\hat{\beta}_{j}\) equals the true value

Variance of OLS Estimator

  • Remember that from one sample to the next, the value of OLS estimators will differ

    • This is sampling variation
  • Variance and standard deviation describe this variation

  • Variance of OLS estimators under heteroskedasticity is complicated

  • If we are willing to assume homoskedasticity, we can simplify

  • Imposing this assumption, the variance of the \(\hat{\beta}_{j}\) is \[Var[\hat{\beta}_{j}|X_{1i},X_{2i},...,X_{ki}] = \frac{\sigma_{u}^2 }{(\sum_{i=1}^{n}(X_{ji} - \bar{X}_{j})^2)(1-R^2_{j})}\]

Variance of OLS Estimator

  • Formula applies for all slope estimates from \(j = 1,...,k\)

    • Intercept has different variance formula
  • The term \(R^2_{j}\) is the \(R^2\) from a regression of \({X}_{ji}\) on all other independent variables

    • The part of \({X}_{ji}\) that is explained by all other variables

    • Note: this is not the original model \(R^2\)

    • This is an auxiliary \(R^2\) from regressing \({X}_{ji}\) on other independent variables

  • The variance of \(\hat{\beta}_{j}\) consists of 3 components

Variance of OLS Estimator

  1. The error variance \(\sigma_{u}^2\)

    • If the error is more variable, the estimate \(\hat{\beta}_{j}\) is more variable

      • Means that values of \(Y_{i}\) are more spread out around \(E[Y_{i}|X_{1i},X_{2i},...,X_{ki}]\)

      • More “noise” in the regression

      • Noise carries into the slope estimators

  2. Sample variation in \(X_{j}\), \(\sum_{i=1}^{n}(X_{ji} - \bar{X}_{j})^2\)

    • More variation in \(X_{j}\) reduces variance in \(\hat{\beta}_{j}\)

      • More spread in \(X_{j}\) makes it easier to estimate slope parameters

      • Important to sample such that \(X_{j}\) is spread out widely

Variance of OLS Estimator

  3. \(R^2_{j}\) is the \(R^2\) from a regression of \({X}_{ji}\) on all other independent variables

    • A larger \(R^2_{j}\) increases the variance of \(\hat{\beta}_{j}\)

      • Larger \(R^2_{j}\) means other variables explain more variation in \({X}_{j}\)

      • Means \({X}_{j}\) is more collinear with other independent variables

      • There is less independent variation in \({X}_{j}\)

      • Higher values will increase the variance of \(\hat{\beta}_{j}\)

  • Higher \(R^2_{j}\) illustrates problems created by collinearity

    • Larger \(R^2_{j}\) means \({X}_{j}\) more collinear with other independent variables

    • At the extreme, \(R^2_{j} = 1\) means perfect collinearity

    • As \(R^2_{j} \rightarrow 1\), \(var(\hat{\beta}_{j}) \rightarrow \infty\)

  • Thus, too much collinearity makes the estimates \(\hat{\beta}_{j}\) imprecise
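
  • Under homoskedasticity we can rebuild the standard error of \(\hat{\beta}_{1}\) from these three components (a sketch, assuming the simulated data are in memory; the non-robust standard error is the benchmark)
* Auxiliary regression for R-squared_j
quietly regress str mealpct
local r2j = e(r2)
* Sample variation in str
quietly sum str
local sstj = r(Var)*(r(N)-1)
* Main regression (non-robust) and comparison
quietly regress testscr str mealpct
display "SE from formula = " sqrt(e(rmse)^2/(`sstj'*(1-`r2j')))
display "Stata SE        = " _se[str]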

Distribution of OLS Estimators

  • As before, results from simple regression carry over to multiple regression

  • Given 4 assumptions, OLS estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}, ...\hat{\beta}_{k}\) are Normally distributed in large samples

    • A result of the Central Limit Theorem
  • Mean and variance are as discussed previously

    • Recall that variance formula above only appropriate under homoskedasticity

    • More complicated formula for heteroskedastic errors

  • Knowing distribution of \(\hat{\beta}_{0}, \hat{\beta}_{1}, ...\hat{\beta}_{k}\) will help us for hypothesis testing

Example with Stata

  • In our example we can simulate the sampling distribution of the slope

  • Below we simulate the distribution for \(\hat{\beta}_{1}\)

clear all
local sims = 9999
set obs `sims'
set seed 12345
set more off

gen beta1 = .

forvalues x = 1/`sims' {

    preserve
    clear
    
    qui set obs 420

    gen str = rnormal(20,2)
    gen mealpct = 7 + 2*str + rnormal(0,25)
        qui replace mealpct = 0 if mealpct < 0
        qui replace mealpct = 100 if mealpct > 100
    gen u = rnormal(0,10)

    gen testscr = 700 -1 * str - 0.5*mealpct + u
    
    qui regress testscr str mealpct
    restore
    
    qui replace beta1 = _b[str] in `x'
    
    display "Iteration `x'"
}

Example with Stata

sum beta1
twoway hist beta1, title(Sampling Distribution of Beta1) scheme(s2mono)
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       beta1 |      9,999   -1.000459    .2472745  -2.063423  -.0158269

Multicollinearity

Introduction

  • No perfect multicollinearity is one of the assumptions of the model

  • Consequence of violating the assumption is the model cannot be estimated

    • Effectively like trying to estimate effect of a variable while holding it fixed

    • OLS does not work in this case

  • We discussed some examples, and here we expand on that list

  • Also add a few other details

Examples of Perfect Multicollinearity

  • In each example below, imagine we start with the model \[TestScore_{i} = \beta_{0} + \beta_{1}STR_{i} + \beta_{2}PctESL_{i} + u_{i}\]

  • Then we create a multicollinearity problem by adding a third variable

  1. Adding Fraction ESL

    • Fraction ESL varies between 0 and 1

    • It is perfectly functionally related to Percent ESL: \[PctESL_{i} = 100 \times FracESL_{i}\]

    • Impossible to estimate effect of \(PctESL_{i}\) holding \(FracESL_{i}\) constant

    • Because they measure exactly the same thing

Examples of Perfect Multicollinearity

  2. “Not very small” classes

    • Suppose \(NVS_{i} = 1\{STR \ge 12\}\)

    • Also creates perfect collinearity for subtle reason

    • In the data no classes have \(STR_{i} < 12\), so \(NVS_{i} = 1\) for all observations

    • Perfectly collinear with the constant term in the regression

      • In linear regression models, the intercept can be thought of as the coefficient on a variable equal to 1 for all observations
  3. Percent English Speakers

    • Define English speakers as those who are not ESL students \[PctES_{i} = 100 - PctESL_{i}\]

    • \(PctES_{i}\) is an exact function of the constant and \(PctESL_{i}\) \[PctES_{i} = 100 \times (1) - PctESL_{i}\]

Dummy Variable Trap

  • Dummy variables create special case of perfect collinearity

  • Consider estimating a gender wage gap

  • Create two dummy variables

    • \(male = 1\) if person is male, and 0 otherwise

    • \(female = 1\) if person is female, and 0 otherwise

  • Adding both \(male\) and \(female\) to regression creates perfect multicollinearity

    • \(male + female = 1\)

    • This is perfectly collinear with the constant

  • This is called the dummy variable trap

Dummy Variable Trap

  • Generally, dummy variable trap is

    • There are G binary variables, and each observation falls into exactly one category

    • Including all G variables creates perfect multicollinearity

  • To avoid dummy variable trap, you can only include \(G-1\) dummy variables

  • Male-female example

    • G = 2, since we divide people into male or female

    • Based on rule above, can only include one

    • Either include \(male\) or \(female\), but not both

    • Which one you include depends on preference

Solutions to Multicollinearity

  • Typically, perfect multicollinearity arises from a mistake

  • Solution is to find mistake, and drop collinear variable

    • In dummy variable example, drop either \(male\) or \(female\) dummy
  • Sometimes it is easy to find mistake, sometimes not

    • Especially when collinearity is subtle
  • In practice, Stata does this automatically

    • It looks for perfect linear relationships among \(X\) variables

    • If it finds one, it drops one or more variables until no more relationship exists

Imperfect Multicollinearity

  • A perfect linear relationship between regressors means the OLS estimates cannot be computed

  • However, regressors are still allowed to be related

    • Just not perfectly related
  • A relationship between variables that is not exact is called imperfect multicollinearity

  • Imperfect multicollinearity is allowed

  • Key issue for imperfect multicollinearity is that it causes variance of \(\hat{\beta}_{j}\) to rise

    • Depending on how closely variables are related

Solutions to Multicollinearity

  • Intuition: OLS estimates independent effect of each \(X_{j}\) on \(Y_{i}\)

  • If \(X_{j}\) is highly related to another variable, there is little independent variation

    • With little independent movement in \(X_{j}\), \(\hat{\beta}_{j}\) will fluctuate more from sample to sample
  • There is no obvious solution to imperfect multicollinearity

    • You can still estimate all the \(\hat{\beta}_{j}\)

    • But they will not be very precise (i.e. their variance will be high)
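
  • To illustrate (a sketch, assuming the simulated str, mealpct, and testscr data from the earlier examples are in memory): add a regressor that is almost, but not perfectly, collinear with str and compare the standard errors on str
* str_noisy is nearly collinear with str
gen str_noisy = str + rnormal(0,0.5)
regress testscr str mealpct, robust
regress testscr str str_noisy mealpct, robust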

Example with Stata

  • Suppose we try to add a collinear variable to the model

    • mealdec is mealpct divided by 100

    • These two variables are perfectly linearly related

clear
set obs 420
set seed 12345


gen str = rnormal(20,2)
gen mealpct = 7 + 2*str + rnormal(0,25)
    replace mealpct = 0 if mealpct < 0
    replace mealpct = 100 if mealpct > 100
gen mealdec = mealpct/100   
gen u = rnormal(0,10)

gen testscr = 700 -1 * str - 0.5*mealpct + u

Example with Stata

regress testscr str mealpct mealdec
note: mealdec omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =       420
-------------+----------------------------------   F(2, 417)       =    366.81
       Model |  75975.9601         2    37987.98   Prob > F        =    0.0000
    Residual |  43185.6115       417  103.562617   R-squared       =    0.6376
-------------+----------------------------------   Adj R-squared   =    0.6358
       Total |  119161.572       419  284.395159   Root MSE        =    10.177

------------------------------------------------------------------------------
     testscr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         str |  -.8082448   .2399457    -3.37   0.001    -1.279899   -.3365909
     mealpct |  -.5341559   .0206251   -25.90   0.000    -.5746981   -.4936138
     mealdec |          0  (omitted)
       _cons |   698.5687   4.786449   145.95   0.000     689.1601    707.9773
------------------------------------------------------------------------------

Example with Stata

  • Finally, we create two dummy variables
gen smallclass = str < 20
gen bigclass = str >= 20
  • First use the small class dummy only in the regression
regress testscr smallclass mealpct, robust
Linear regression                               Number of obs     =        420
                                                F(2, 417)         =     376.52
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6355
                                                Root MSE          =     10.206

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
  smallclass |   3.018267   1.010161     2.99   0.003     1.032625    5.003909
     mealpct |   -.534675    .020274   -26.37   0.000     -.574527    -.494823
       _cons |   680.8421   1.268558   536.71   0.000     678.3486    683.3357
------------------------------------------------------------------------------

Example with Stata

  • Next try to add both dummies

    • Stata drops one of the dummies because \(smallclass + bigclass = 1\)

    • This is perfectly collinear with the constant

regress testscr smallclass bigclass mealpct, robust
note: bigclass omitted because of collinearity.

Linear regression                               Number of obs     =        420
                                                F(2, 417)         =     376.52
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6355
                                                Root MSE          =     10.206

------------------------------------------------------------------------------
             |               Robust
     testscr | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
  smallclass |   3.018267   1.010161     2.99   0.003     1.032625    5.003909
    bigclass |          0  (omitted)
     mealpct |   -.534675    .020274   -26.37   0.000     -.574527    -.494823
       _cons |   680.8421   1.268558   536.71   0.000     678.3486    683.3357
------------------------------------------------------------------------------