The Linear Regression Model


Justin Smith

Wilfrid Laurier University

Fall 2022



  • Questions in economics often involve explaining a variable in terms of others

    • How does age of school entry affect test scores?

    • Does childhood health insurance affect adult health?

    • Does foreign competition affect domestic innovation?

  • Often we are interested in the causal relationship

    • The independent effect of one variable on another

    • Causal relationships are important for policy

  • Econometrics provides a framework for examining these relationships

    • Strong focus on causality

    • We discussed this and will revisit it

What Are We Trying To Model?

Conditional Expectation Function

  • As noted above, we want to relate dependent variable \(y\) to independent variables \(\mathbf{x}\)

  • Specifically want to know systematically what happens to \(y\) when \(\mathbf{x}\) changes

  • Difficult because \(y\) and \(\mathbf{x}\) are random variables

    • \(y\) can take many different values for each \(\mathbf{x}\)

    • This randomness makes it difficult to see relationships

  • One way to see a systematic pattern is to focus on average \(y\) at each \(\mathbf{x}\)

    • Does \(y\) change on average as we increase \(\mathbf{x}\)?

    • Ex: Does academic achievement fall on average as we increase class size?

  • Mathematically, this is the conditional expectation \(\mathbf{E}[y|\mathbf{x}]\)

Conditional Expectation Function

  • Idea is illustrated in figure

  • Log earnings on vertical axis, years of schooling on horizontal

  • Grey shaded areas are distribution of log earnings at each level of schooling

    • Big spread incomes for each level of schooling

    • Hard to see relationship

  • Black line is the conditional mean of earnings at each level of schooling

    • Increasing pattern between school and earnings is much easier to see

    • Note how it is not linear

Conditional Expectation Function

  • The Conditional Expectation Function (CEF) highlights the pattern through the randomness

  • It is therefore appealing as a way to measure systematic relationships

  • It is also the optimal predictor of \(y\) given \(\mathbf{x}\)

    • It minimizes the mean squared error in predicting \(y\)
  • We would therefore like to use the CEF to measure relationships between \(\mathbf{E}[y|\mathbf{x}]\)

  • Problem: as a population value, it is not known

    • It is not observable because we do not see the population

    • Therefore cannot say anything about its value or functional form

Conditional Expectation Function

  • We can use linear regression to approximate CEF

  • This approximation is justified in several ways

    • If the CEF is linear, it is equivalent to population regression function

    • The population regression function is the best linear predictor of \(y\) given \(\mathbf{x}\)

    • The population regression function is the best linear approximation to the CEF

  • This is partly why linear regression is so popular in economics

  • Next section examines the population regression function

    • Derive the population slope before thinking about samples

    • This derivation will probably be new to you

The Population Regression Model


  • A linear model relating \(y\) to one or more explanatory variables \(\mathbf{x}\) is

\[y = \mathbf{x}\boldsymbol{\beta} + u\]

  • Where

    • \(y\) is a scalar observable random outcome variable

    • \(\mathbf{x}\) is a \(1\times (k + 1)\) vector of random explanatory factors

    • \(\boldsymbol{\beta}\) is a \((k + 1) \times 1\) vector of slope parameters (non-random)

    • \(u\) is a scalar population residual term

  • This is our model for the (unobserved) population

    • Sometimes called the Data Generating Process (DGP)
  • \(\mathbf{x}\boldsymbol{\beta}\) is called the Population Regression Function (PRF)

    • The part of \(y\) that is predictable by \(\mathbf{x}\)


  • Recall we are using it to approximate the CEF

    • Goal is not necessarily to get approximation exactly right

    • But to capture essential features of relationship

  • In undergrad courses it is typical to just assume the CEF is linear

    • This is not necessarily true

    • But avoids complications of non-linear CEF

  • In some cases the CEF is inherently linear

    • In last section of the course, we saw the CEF for a binary treatment

    • This type of CEF is linear, so it equals the PRF

Population Regression Slope Vector

  • In undergrad it is typical to next estimate \(\boldsymbol{\beta}\) with a sample

  • You can also derive a population least squares vector

  • It is the slope that minimizes the mean squared prediction error (MSPE)

\[\min_\beta \textbf{E}[(y-\mathbf{x}\boldsymbol{\beta})^2]\]

  • If you take the derivative with respect to \(\boldsymbol{\beta}\), you get

\[\textbf{E}[\mathbf{x}'(y-\mathbf{x}\boldsymbol{\beta})]= \textbf{E}[\mathbf{x}'u]=\mathbf{0}\]

Population Regression Slope Vector

  • Solving for \(\boldsymbol{\beta}\), we get \[\textbf{E}[\mathbf{x}'(y-\mathbf{x}\boldsymbol{\beta})]= \mathbf{0}\] \[\textbf{E}[\mathbf{x}'y]= \textbf{E}[\mathbf{x'x}\boldsymbol{\beta}]\] \[\textbf{E}[\mathbf{x}'y]= \textbf{E}[\mathbf{x'x}]\boldsymbol{\beta}\] \[(\textbf{E}[\mathbf{x'x}])^{-1} \textbf{E}[\mathbf{x}'y]= \boldsymbol{\beta}\]

  • This is the formula for the slope in the PRF

    • With a single \(x\), it is the population covariance between \(y\) and \(x\) divided by the population variance of \(x\)
  • This is the same least squares process you use to get OLS estimator

    • Except we are doing it with the population instead of a sample

Population Regression Slope Vector

  • Minimizing the MSPE implies that


  • Expanding that equation, we get

\[\begin{bmatrix} \textbf{E}(u)\\ \textbf{E}(x_{1}u)\\ \vdots\\ \textbf{E}(x_{k}u) \end{bmatrix} =\mathbf{0}\]

  • This says the following important things

    • The average value of the population residual \(u\) is zero

    • The covariance between each \(x\) and \(u\) is zero

Population Regression Slope Vector

  • Note that

\[\text{cov}(x_{1},u) = \mathbf{E}[(x_{1} - \mathbf{E}(x_{1}))(u - \mathbf{E}(u))]\]

  • From above, we know that \(\mathbf{E}(u) =0\), so

\[\text{cov}(x_{1},u) = \mathbf{E}[x_{1}u - \mathbf{E}(x_{1})u]\]

  • Bringing the expectation through the brackets

\[\text{cov}(x_{1},u) = \mathbf{E}(x_{1}u) - \mathbf{E}(x_{1})\mathbf{E}(u) = \mathbf{E}(x_{1}u)\]

  • Says that \(u\) is mean zero and uncorrelated with \(\mathbf{x}\)

Population Regression Slope Vector

  • If we observed the population we could compute \(\boldsymbol{\beta}\)

  • Problem again is we do not observe the population

  • So we cannot compute \(\textbf{E}[\mathbf{x}'y]\) or \((\textbf{E}[\mathbf{x'x}])^{-1}\)

  • Instead, we collect a sample of data and estimate \(\boldsymbol{\beta}\)

  • Before we do that, we briefly discuss causality in regression models

Regression and Causality

Why is Causality Important?

  • Empirical economists are often interested in a causal effect

  • For policy, it is often key to have estimate causal effect

    • E.g. a school district looking to implement pre-kindergarten program

    • This is generally funded with public money

    • Need to know if pre-k has independent effects on current and future outcomes

      • Do not want this estimate confounded with parent background
  • When can we interpret a regression slope as causal?

  • Answer: when the model is structural

    • Structural model is one where the coefficients have a causal interpretation

Model with One Binary Regressor

  • In the last section we defined the underlying potential outcomes as

\[y_{0} = \alpha + \eta\] \[y_{1} = y_{0} + \rho\]

  • With the observed outcome

\[y = \alpha + \rho w + \eta\]

  • This regression model is structural because \(\rho\) is the causal effect
  • We derived that the difference in conditional expectations is

\[E(y|w=1) - E(y|w=0) = \rho + E(\eta |w=1) - E(\eta |w=0)\]

Model with One Binary Regressor

  • The population regression function with a binary regressor is

\[y = \beta_{0} + \beta_{1}w + u\]

  • The population least squares slope \(\beta_{1}\) from minimizing the MSPE is

\[\beta_{1} = E(y|w=1) - E(y|w=0)\]

  • Combining this equation with the structural model

\[\beta_{1} = \rho + E(\eta |w=1) - E(\eta |w=0)\]

Model with One Binary Regressor

  • The regression slope \(\beta_{1}\) equals the treatment effect \(\rho\) when

    \[E(\eta |w=1) - E(\eta |w=0)\]

  • We saw cases when this is true

    • Randomization

    • Mean independence of \(\eta\)

  • If none of these are true, then \(\beta_{1} \neq \rho\) and \(\beta_{1}\) is not a causal effect

Model with Continuous Regressor

  • With a continuous independent variable \(s\), suppose the structural model is

\[y = \alpha + \rho s + \eta\]

  • Where the definition of \(\rho\) is

    \[\rho = E(y_{s_{0}}|s=s_{0}) - E(y_{s_{0}-1}|s = s_{0} - 1)\]

  • Where \(y_{s_{0}}\) and \(y_{s_{0}-1}\) are potential outcomes with two different levels of \(s\)

    • \(\rho\) is the causal effect of a one-unit increase in \(s\)
  • If we set the population regression function as

\[y = \beta_{0} + \beta_{1}s + u\]

Model with Continuous Regressor

  • The regression slope is

\[\beta_{1} = \frac{cov(y,s)}{var(s)}\]

  • To relate \(\beta_{1}\) to \(\rho\), sub in the structural model for \(y\)

\[\beta_{1} = \frac{cov(\alpha + \rho s + \eta ,s)}{var(s)}\]

  • Simplifying we get

\[\beta_{1} = \rho + \frac{cov(\eta ,s)}{var(s)}\]

Model with Continuous Regressor

  • \(\beta_{1}\) equals \(\rho\) when \(\eta\) and \(s\) are uncorrelated

    • Randomization, mean independence both mean this is true
  • So if we assume

\[E(\eta | s) = 0\]

  • Then the second term in equation above is zero and the population slope is the causal effect

Model with Continuous Regressor

  • Now imagine that the structural model is

\[y = \alpha + \rho s + \gamma x + \eta\]

  • The definition of \(\rho\) is

\[\rho = E(y_{s_{0}}|x, s=s_{0}) - E(y_{s_{0}-1}|x, s = s_{0} - 1)\]

  • If we set the population regression function as

\[y = \beta_{0} + \beta_{1}s + \beta_{2} x+ u\]

Model with Continuous Regressor

  • Then \(\beta_{1}\) equals \(\rho\) if we assume

    • Conditional independence of \(\eta\)

    • Conditional mean independence of \(\eta\)

  • Conditional mean independence means

\[E(\eta | s, x) = E(\eta | x)\]

  • In words, this means \(s\) is related to potential outcomes only through \(x\)

    • So holding \(x\) constant breaks this relationship
  • Even though \(\beta_{1}\) equals \(\rho\), it is important to note that \(\beta_{0} \neq \alpha\) and \(\beta_{2} \neq \gamma\)

    • With regression we do not measure the structural intercept or effect of \(x\)

Model with Continuous Regressor

  • To see this, take expectation of \(y\) in structural model

\[E[y|s,x] = \alpha + \rho s + \gamma x + E[\eta|s,x]\]

  • If we impose conditional mean independence, then \(E(\eta | s, x) = E(\eta | x)\)

\[E[y|s,x] = \alpha + \rho s + \gamma x + E[\eta|x]\]

  • The error is not a function of \(s\) anymore, but it is a function of \(x\)

  • For example, suppose

\[\eta = \theta_{0} + \theta_{1} x + \epsilon\]

Model with Continuous Regressor

  • Assume that \(\epsilon\) is just a random error unrelated to \(x\) and \(s\)

  • Sub into structural model

\[y = \alpha + \rho s + \gamma x + \theta_{0} + \theta_{1} x + \epsilon\] \[y = (\alpha +\theta_{0})+ \rho s + (\gamma + \theta_{1})x \epsilon\] \[y = \lambda + \rho s + \pi x + \epsilon\]

  • The intercept and slope on \(x\) are now redefined

    • The are no longer causal effects
  • Slope on \(s\) is still the causal effect \(\rho\)

Model with Continuous Regressor

  • If the regression function is

\[y = \beta_{0} + \beta_{1}s + \beta_{2} x+ u\]

  • Then if \(E[\epsilon|s,x] = 0\)

\[\beta_{0} = \lambda\] \[\beta_{1} = \rho\] \[\beta_{2} = \pi\]

Omitted Variables Bias

  • In the regression model above, what happens if we leave out \(x\)?

  • Continue to assume conditional mean independence

\[y = \beta_{0} + \beta_{1}s + u\]

  • Remember the regression slope is

\[\beta_{1} = \frac{cov(y ,s)}{var(s)}\]

  • Sub in the structural model

\[\beta_{1} = \frac{cov(\lambda + \rho s + \pi x + \epsilon ,s)}{var(s)}\]

Omitted Variables Bias

\[\beta_{1} = \rho + \pi* \frac{cov( x ,s)}{var(s)} + \frac{cov( \epsilon ,s)}{var(s)}\]

  • The last term is zero because we assume \(\epsilon\) is unrelated to \(x\) and \(s\)

    \[\beta_{1} = \rho + \pi* \frac{cov( x ,s)}{var(s)}\]

  • The regression slope does not measure the causal effect in this case

  • The bias is

\[\pi* \frac{cov( x ,s)}{var(s)}\]

Omitted Variables Bias

  • Bias has two parts

    • \(\pi \rightarrow\) the effect of \(x\) on \(y\)

    • \(\frac{cov( x ,s)}{var(s)} \rightarrow\) the effect of \(s\) on \(x\)

  • If \(x\) is related to \(y\) and \(x\) is related to \(s\), we have bias

  • Direction of bias depends on signs of each term

    • If both positive or both negative \(\rightarrow\) positive bias

    • If one positive and one negative \(\rightarrow\) negative bias

  • If either \(y\) or \(s\) is unrelated to \(x\), there is no bias

  • In vector notation, restate the structural model as

\[y = \mathbf{x_{1}}\boldsymbol{\alpha_{1}} + \mathbf{x_{2}}\boldsymbol{\alpha_{2}} + \eta\]

Omitted Variables Bias

  • If we try to approximate it with the population regression function

\[y = \mathbf{x_{1}}\boldsymbol{\beta_{1}} + u\]

  • We get the population regression slope as

\[\boldsymbol{\beta_{1}}=\left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1} E[\mathbf{x_{1}'}y]\]

  • Sub the structural model into the population slope function

\[\boldsymbol{\beta_{1}}=\left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1} E[\mathbf{x_{1}'}( \mathbf{x_{1}}\boldsymbol{\alpha_{1}} + \mathbf{x_{2}}\boldsymbol{\alpha_{2}} + \eta )]\] \[=\left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1} E[\mathbf{x_{1}'} \mathbf{x_{1}}\boldsymbol{\alpha_{1}} + \mathbf{x_{1}'x_{2}}\boldsymbol{\alpha_{2}} + \mathbf{x_{1}'}\eta ]\] \[=\left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1} E[\mathbf{x_{1}'} \mathbf{x_{1}}]\boldsymbol{\alpha_{1}} + \left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1}E[\mathbf{x_{1}'x_{2}}]\boldsymbol{\alpha_{2}} + \left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1}E[\mathbf{x_{1}'}\eta ]\]

Omitted Variables Bias

\[=\boldsymbol{\alpha_{1}} + \left ( E[\mathbf{x_{1}'x_{1}}\right] )^{-1}E[\mathbf{x_{1}'x_{2}}]\boldsymbol{\alpha_{2}}\]

  • The population slope vector on \(\mathbf{x_{1}}\) equals the sum of

    • The causal slope vector \(\boldsymbol{\alpha_{1}}\)

    • A bias term containing

      • the regression of \(\mathbf{x_{2}}\) on \(\mathbf{x_{1}}\)

      • the slope on \(\mathbf{x_{2}}\) in the structural for \(y\)

  • A key lesson here is that a single omitted variable will bias all population slopes \(\boldsymbol{\beta_{1}}\)

    • Unless it is unrelated to y

    • Or it is uncorrelated with all but one included regressor, and that regressor is uncorrelated with the others