Prediction

EC295

Justin Smith

Wilfrid Laurier University

Fall 2022

Introduction

Introduction

  • Our focus in this class has been on estimating the causal effect of \(X\) on \(Y\)

  • You can also use regressions to predict unknown outcomes

  • Examples

    • The performance of a school that has not yet been built

    • A stock price in the future

    • Consumer spending under various levels of the interest rate

  • We can use our model with values of independent variables to estimate outcomes

  • Predictions are estimates that are subject to error

  • It is important to account for this error

    • Do this through prediction intervals

Introduction

  • Like before, we want our predictions from OLS to be unbiased

    • Accurate on average
  • Assumptions for this are different than before

    • Key difference: causal effects are not necessary for prediction
  • Finally, we will examine how to compare different predictions

    • Based on the average error in prediction

Model and Estimation

Model

  • Imagine you are interested in an unknown school test score \(TestScore\)

    • You are attending a new school being built
  • Also pretend you know how big the class size will be

  • An intuitive (and mathematically optimal) way to predict \(TestScore\) is the conditional average

\[E[TestScore| STR]\]

  • The actual test score will not exactly equal this average

    • Because schools differ in many ways (e.g. student body, location, etc)
  • The difference between the real test score and the average is

\[ u = TestScore - E[TestScore| STR]\]

Model

  • You can rearrange this to put \(Y\) on the left side

\[TestScore = E[TestScore| STR] + u\]

  • For our purposes, suppose \(E[TestScore| STR]\) is a straight line

\[E[TestScore| STR] = \beta_{0} + \beta_{1}STR\]

  • Says average scores change linearly with extra students

    • The true relationship is probably not linear, but pretend it is
  • If we knew \(\beta_{0}\) and \(\beta_{1}\), you could get average test scores for any \(STR\)

  • You could use this as your prediction, and you would be done

Estimation

  • Problem: you do not know \(\beta_{0}\) and \(\beta_{1}\)

  • But we can estimate them using data

  • An estimation technique we already know is OLS

  • OLS estimates of the slope and intercept are

\[\hat{\beta}_{0} = \overline{Y} - \hat{\beta}_{1}\overline{X}\] \[\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(X_{i} - \overline{X})(Y_{i} - \overline{Y})}{\sum_{i=1}^{n}(X_{i} - \overline{X})^2}\]

Estimation

  • You can then predict \(TestScore\) using an estimate of the conditional mean

\[\widehat{TestScore} = \hat{\beta}_{0} + \hat{\beta}_{1}STR\]

  • Plug in any value of \(STR\) and this function will produce an estimated \(TestScore\)

  • Notice that this is a two-step process

    1. Use \(E[TestScore| STR]\) to predict \(TestScore\)

    2. Use \(\widehat{TestScore}\) to estimate \(E[TestScore| STR]\)

Example with Stata

  • Pretend a dad has hired you to predict his child’s test score

  • The only thing you know is that there will be 25 students in the class

  • Your strategy is to provide an estimate of the mean test score with 25 students

  • You assume average test scores and class size are linearly related

  • First step is to estimate the slope and intercept of the mean test score

  • The generate the prediction using these estimates

Example with Stata

  • Below we simulate 420 observations on test scores and student/teacher ratio
clear  
set obs 420  
set seed 12345  
      
gen str = rnormal(20,2)  
gen u = rnormal(0,20)  
      
gen testscr = 700 -2 * str + u 

Example with Stata

  • Estimate the slope and intercept
regress testscr str
      Source |       SS           df       MS      Number of obs   =       420
-------------+----------------------------------   F(1, 418)       =     15.45
       Model |  6383.10498         1  6383.10498   Prob > F        =    0.0001
    Residual |  172661.265       418  413.065226   R-squared       =    0.0357
-------------+----------------------------------   Adj R-squared   =    0.0333
       Total |  179044.369       419  427.313531   Root MSE        =    20.324

------------------------------------------------------------------------------
     testscr | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         str |  -1.855817    .472094    -3.93   0.000    -2.783791   -.9278429
       _cons |   696.4934    9.55519    72.89   0.000     677.7112    715.2756
------------------------------------------------------------------------------
  • Compute predicted test score with 25 students
display _b[_cons] + _b[str]*25
650.098

Example with Stata

  • Visualize predictions for any class size
twoway (function y = _b[_cons] + _b[str]*x, range(5 25)), title(Predicted Test Scores and Student Teacher Ratio) 

Assumptions

  • Predictions are subject to two kinds of errors

    • Error in using the mean to predict the value of \(TestScore\)

    • Sampling error from estimating the mean of \(TestScore\)

  • We would like our predictions to be good on average

  • In addition to \(E[TestScore| STR]\) being linear, the following needs to be true

  1. Out of sample observations are drawn from the same population as the data

  2. Sample data are iid

  3. Large outliers are unlikely

Assumptions

  • First assumption is key

    • Intuition: if our data used to estimate model come from a specific population, they apply best to predicting out of sample in the same population

    • Means estimates from Canadian data might not be good for predicting USA test scores

    • Also implies that \(E[TestScore| STR]\) is the best way to predict both in and out of sample

  • Second and third assumptions ensure estimates of intercept and slope parameters are consistent

    • They are close to the real values in big samples

Errors in Prediction

Introduction

  • No prediction is perfect

  • In addition to a prediction, we need to estimate its accuracy

    • We would prefer predictions that are more accurate
  • In this section we review methods for assessing accuracy

  • Depends on the objective of your prediction

  • There are two possible objectives

    • Predicting the unknown mean \(E[TestScore| STR]\)

    • Predicting some unknown specific value of \(TestScore\)

  • The way you assess accuracy is different

Errors Predicting the Population Mean

  • For some applications, you only want to predict the mean \(E[TestScore| STR]\)

    • You are not interested in a school outcome, but the average school
  • To assess accuracy, you can calculate

    • The standard error of your prediction

    • A confidence interval based on that standard error and the prediction

  • The prediction of the mean is the predicted value from the regression at a value \(STR = STR^{oos}\)

\[\widehat{TestScore} = \hat{\beta}_{0} + \hat{\beta}_{1}STR\]

  • The value \(STR^{oos}\) is some value of \(STR\) out of sample (oos)

  • The standard error of this prediction is the estimated square root of the variance of \(\widehat{TestScore}\)

\[se(\widehat{TestScore}) = \sqrt{\hat{Var}(\hat{\beta}_{0} + \hat{\beta}_{1}STR^{oos} )} \]

Errors Predicting the Population Mean

  • This is a complicated function of the variances and covariances of \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)

    • So we will not write it down explicitly
  • But we know that it depends on:

    1. The number of observations

      • The standard error is smaller with more observations
    2. The variation of the error

      • The higher the variation, the higher the standard error
    3. The distance between \(STR^{oos}\) and the sample mean of \(STR\)

      • The closer \(STR^{oos}\) is to the sample mean, the smaller the standard error

Errors Predicting the Population Mean

  • You can use this to construct a confidence interval for the prediction of \(E[TestScore| STR]\)

  • Use same method as before

\[CI = estimate \pm \text{margin of error}\] - In this context, a 95% confidence interval is is

\[CI = \widehat{TestScore} \pm t^c \times se(\widehat{TestScore}) \]

  • As before \(t^c\) is the critical value from a two-sided t-test

Example with Stata

  • Let us reestimate the regression of test scores on class size
regress testscr str
  • Find the predicted values of test score for each observation
predict pred_testscr, xb
  • Generate the standard error of the prediction of the population mean

    • Note that this computes a separate value for each observation in the sample
predict se_ymean, stdp
  • Compute the upper and lower bounds of the confidence interval
gen ci_upper = pred_testscr + 1.96*se_ymean
gen ci_lower = pred_testscr - 1.96*se_ymean

Example with Stata

  • Visualize predictions and confidence intervals
twoway (line pred_testscr str) (line ci_upper str, lpattern(dash) sort) (line ci_lower str,lpattern(dash) sort), title(Predicted Test Scores and Confidence Intervals) 
> ntervals) 

Errors Predicting Specific Test Scores

  • You might also be interested in predicting a specific test score

    • Example: you want to predict the scores of one school

    • Perhaps because it did not report data on its scores

  • To predict a specific test score, you also use \(\widehat{TestScore}\)

    • The prediction of the mean and the prediction of one test score is the same value
  • Difference is that there is an additional source of error

    • Error from predicting \(TestScore\) using \(E[TestScore| STR]\)

    • And from estimating \(E[TestScore| STR]\) using data

  • Thus the prediction is the same, but the standard error of the prediction is different

Errors Predicting Specific Test Scores

  • Suppose the specific test score you are trying to predict is \(Y^{oos}\)

  • Given our assumptions it is defined as

\[TestScore^{oos} = \beta_{0} + \beta_{1}STR^{oos} + u^{oos}\]

  • The prediction error is

\[\hat{e}^{oos} = TestScore^{oos} - \widehat{TestScore^{oos}} \]

Errors Predicting Specific Test Scores

  • To emphasize the source of the errors, you can reexpress this as

\[\hat{e}^{oos} = E[TestScore| STR = STR^{oos}] + u^{oos} - \widehat{TestScore^{oos}})\] \[\hat{e}^{oos} = u^{oos} + (E[TestScore| STR = STR^{oos}] - \widehat{TestScore^{oos}})\]

  • The first part is the error from using the mean to predict \(TestScore\)

  • The second part is the error in predicting the mean using \(\widehat{TestScore^{oos}}\)

  • The variance of this prediction error is

\[var(\hat{e}^{oos}) = var(u^{oos}) + var(\widehat{TestScore^{oos}})\]

  • The prediction is less accurate when

    • Test scores are spread more widely we around their mean in the population

    • There is lots of sampling variation in the prediction of test scores

Errors Predicting Specific Test Scores

  • You can construct an interval around the prediction for \(TestScore^{oos}\)

  • This is called a prediction interval

    • Different from a confidence interval

    • Prediction intervals are for predictions of a specific value of the outcome

    • Interpretation is also slightly different (we will not get into it)

  • The prediction interval is

\[PI = \widehat{TestScore^{oos}} \pm t^c \times se(\widehat{TestScore^{oos}})\]

  • The value \(t^{c}\) is chosen depending on the confidence level

  • Note that this is wider than the confidence interval from earlier because of two sources of error

Example with Stata

  • We again reestimate the regression
regress testscr str
  • We already have the predicted values from previous slides

  • Generate the standard error of the prediction for test score

predict se_y, stdf
  • Compute the upper and lower bounds of the confidence interval
gen pi_upper = pred_testscr + 1.96*se_y
gen pi_lower = pred_testscr - 1.96*se_y

Example with Stata

  • Visualize prediction interval
twoway (line pred_testscr str) (line ci_upper str, lpattern(dash) sort) (line ci_lower str,lpattern(dash) sort) (line pi_upper str, lpattern(dash_dot) sort) (line pi_lower str,lpattern(dash_dot) sort), title(Predicted Test Scores and Prediction Intervals) 
> (line pi_lower str,lpattern(dash_dot) sort), title(Predicted Test Scores and Prediction Intervals) 

Evaluating Predictions

Mean Squared Prediction Error

  • Suppose you want to evaluate how good your model is at prediction

    • You can use to compare against other models
  • Normally you would predict several values at the same time

  • One way to evaluate a group of predictions is with the Mean Squared Prediction Error (MSPE)

  • Typically you want to evaluate the predictions on data that has not been used to estimate the prediction

    • Models are always better at predicting values in the sample used for estimation
  • One way people do this is to split the sample randomly

    • Estimate model on half the sample, called the training data

    • Evaluate model using the other half, called the test data

  • The MSPE in the test data is

\[MSPE = \frac{1}{n_{test}} \sum(TestScore_i - \widehat{TestScore}_{i})^2\]

Example with Stata

  • Split the sample randomly in half
gen half = runiform() >=0.5
  • Estimate regression on half the sample
regress testscr str if half == 1
  • Predict values of \(TestScore\) on other half
predict pred_test if half == 0, xb
  • Compute the prediction error, square it, and find the mean
gen pred_error2 = (testscr - pred_test)^2
summarize pred_error2
(212 missing values generated)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
 pred_error2 |        208    437.0696    599.9934   1.49e-06   4363.157