EC295
Justin Smith
Wilfrid Laurier University
Fall 2022
Our focus in this class has been on estimating the causal effect of \(X\) on \(Y\)
You can also use regressions to predict unknown outcomes
Examples
The performance of a school that has not yet been built
A stock price in the future
Consumer spending under various levels of the interest rate
We can use our model with values of independent variables to estimate outcomes
Predictions are estimates that are subject to error
It is important to account for this error
Like before, we want our predictions from OLS to be unbiased
The assumptions required for this differ from those we used before
Finally, we will examine how to compare different predictions
Imagine you are interested in an unknown school test score \(TestScore\)
Also pretend you know how big the class size will be
An intuitive (and mathematically optimal) way to predict \(TestScore\) is the conditional average
\[E[TestScore| STR]\]
The actual test score will not exactly equal this average
The difference between the real test score and the average is
\[ u = TestScore - E[TestScore| STR]\]
\[TestScore = E[TestScore| STR] + u\]
\[E[TestScore| STR] = \beta_{0} + \beta_{1}STR\]
This says that average test scores change linearly with class size
If you knew \(\beta_{0}\) and \(\beta_{1}\), you could get average test scores for any \(STR\)
You could use this as your prediction, and you would be done
Problem: you do not know \(\beta_{0}\) and \(\beta_{1}\)
But we can estimate them using data
An estimation technique we already know is OLS
OLS estimates of the slope and intercept are
\[\hat{\beta}_{0} = \overline{Y} - \hat{\beta}_{1}\overline{X}\] \[\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(X_{i} - \overline{X})(Y_{i} - \overline{Y})}{\sum_{i=1}^{n}(X_{i} - \overline{X})^2}\]
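The two formulas above can be computed directly; a minimal sketch in Python, using made-up numbers rather than the California data:

```python
# Illustrative data (NOT the California sample): class sizes and scores
str_ = [18.0, 20.0, 22.0, 24.0, 26.0]         # X: student-teacher ratio
scores = [660.0, 655.0, 648.0, 646.0, 640.0]  # Y: test scores

n = len(str_)
x_bar = sum(str_) / n
y_bar = sum(scores) / n

# beta1_hat = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(str_, scores))
sxx = sum((x - x_bar) ** 2 for x in str_)
beta1_hat = sxy / sxx

# beta0_hat = Ybar - beta1_hat * Xbar
beta0_hat = y_bar - beta1_hat * x_bar

def predict(str_value):
    """Plug a class size into the fitted line to get a predicted score."""
    return beta0_hat + beta1_hat * str_value
```

With these toy numbers the fitted line slopes downward, and `predict` returns an estimated test score for any class size you plug in.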
\[\widehat{TestScore} = \hat{\beta}_{0} + \hat{\beta}_{1}STR\]
Plug in any value of \(STR\) and this function will produce an estimated \(TestScore\)
Notice that this is a two-step process
Use \(E[TestScore| STR]\) to predict \(TestScore\)
Use \(\widehat{TestScore}\) to estimate \(E[TestScore| STR]\)
Pretend a dad has hired you to predict his child’s test score
The only thing you know is that there will be 25 students in the class
Your strategy is to provide an estimate of the mean test score with 25 students
You assume average test scores and class size are linearly related
First step is to estimate the slope and intercept of the mean test score
Then generate the prediction using these estimates
Source | SS df MS Number of obs = 420
-------------+---------------------------------- F(1, 418) = 15.45
Model | 6383.10498 1 6383.10498 Prob > F = 0.0001
Residual | 172661.265 418 413.065226 R-squared = 0.0357
-------------+---------------------------------- Adj R-squared = 0.0333
Total | 179044.369 419 427.313531 Root MSE = 20.324
------------------------------------------------------------------------------
testscr | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
str | -1.855817 .472094 -3.93 0.000 -2.783791 -.9278429
_cons | 696.4934 9.55519 72.89 0.000 677.7112 715.2756
------------------------------------------------------------------------------
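Plugging the estimated coefficients from the output above into the fitted line gives the prediction for a class of 25 students; a quick check in Python:

```python
# Coefficients taken from the regression output above
beta0_hat = 696.4934
beta1_hat = -1.855817

str_oos = 25  # the one thing the dad knows: 25 students in the class
predicted_score = beta0_hat + beta1_hat * str_oos  # about 650.1
```

So the dad would be told to expect a score of roughly 650.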
Predictions are subject to two kinds of errors
Error in using the mean to predict the value of \(TestScore\)
Sampling error from estimating the mean of \(TestScore\)
We would like our predictions to be good on average
In addition to \(E[TestScore| STR]\) being linear, the following needs to be true
Out of sample observations are drawn from the same population as the data
Sample data are iid
Large outliers are unlikely
First assumption is key
Intuition: if our data used to estimate model come from a specific population, they apply best to predicting out of sample in the same population
Means estimates from Canadian data might not be good for predicting USA test scores
Also implies that \(E[TestScore| STR]\) is the best way to predict both in and out of sample
Second and third assumptions ensure estimates of intercept and slope parameters are consistent
No prediction is perfect
In addition to a prediction, we need to estimate its accuracy
In this section we review methods for assessing accuracy
Depends on the objective of your prediction
There are two possible objectives
Predicting the unknown mean \(E[TestScore| STR]\)
Predicting some unknown specific value of \(TestScore\)
The way you assess accuracy is different
For some applications, you only want to predict the mean \(E[TestScore| STR]\)
To assess accuracy, you can calculate
The standard error of your prediction
A confidence interval based on that standard error and the prediction
The prediction of the mean is the predicted value from the regression at a value \(STR = STR^{oos}\)
\[\widehat{TestScore} = \hat{\beta}_{0} + \hat{\beta}_{1}STR^{oos}\]
The value \(STR^{oos}\) is some value of \(STR\) out of sample (oos)
The standard error of this prediction is the square root of the estimated variance of \(\widehat{TestScore}\)
\[se(\widehat{TestScore}) = \sqrt{\hat{Var}(\hat{\beta}_{0} + \hat{\beta}_{1}STR^{oos} )} \]
This is a complicated function of the variances and covariances of \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)
But we know that it depends on:
The number of observations
The variation of the error
The distance between \(STR^{oos}\) and the sample mean of \(STR\)
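Under homoskedasticity these dependencies can be made explicit; a standard textbook expression for the simple regression case (stated here without derivation) is

\[se(\widehat{TestScore}) = \hat{\sigma}_{u}\sqrt{\frac{1}{n} + \frac{(STR^{oos} - \overline{STR})^{2}}{\sum_{i=1}^{n}(STR_{i} - \overline{STR})^{2}}}\]

Fewer observations, a larger error variance, and a value of \(STR^{oos}\) far from \(\overline{STR}\) all raise the standard error.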
You can use this to construct a confidence interval for the prediction of \(E[TestScore| STR]\)
Use same method as before
\[CI = estimate \pm \text{margin of error}\]
In this context, a 95% confidence interval is
\[CI = \widehat{TestScore} \pm t^c \times se(\widehat{TestScore}) \]
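A sketch of the standard error and 95% confidence interval in Python, assuming homoskedastic errors; the sample mean of \(STR\) and the sum of squared \(STR\) deviations below are assumed values for illustration, not computed from the data:

```python
import math

# Root MSE and coefficients come from the regression output;
# x_bar and sxx are ASSUMED values, not computed from the data.
n = 420
sigma_hat = 20.324                       # Root MSE
beta0_hat, beta1_hat = 696.4934, -1.855817
x_bar, sxx = 19.64, 1853.3               # assumed for illustration

str_oos = 25.0
pred_mean = beta0_hat + beta1_hat * str_oos

# se = sigma_hat * sqrt(1/n + (x0 - x_bar)^2 / Sxx)
se_mean = sigma_hat * math.sqrt(1.0 / n + (str_oos - x_bar) ** 2 / sxx)

t_c = 1.96                               # 95% critical value, large n
ci = (pred_mean - t_c * se_mean, pred_mean + t_c * se_mean)
```

The interval is narrow relative to the spread of scores because it targets the *mean* score at \(STR = 25\), not any single school's score.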
Generate the standard error of the prediction of the population mean
You might also be interested in predicting a specific test score
Example: you want to predict the scores of one school
Perhaps because it did not report data on its scores
To predict a specific test score, you also use \(\widehat{TestScore}\)
Difference is that there is an additional source of error
Error from predicting \(TestScore\) using \(E[TestScore| STR]\)
And from estimating \(E[TestScore| STR]\) using data
Thus the prediction is the same, but the standard error of the prediction is different
Suppose the specific test score you are trying to predict is \(TestScore^{oos}\)
Given our assumptions it is defined as
\[TestScore^{oos} = \beta_{0} + \beta_{1}STR^{oos} + u^{oos}\]
\[\hat{e}^{oos} = TestScore^{oos} - \widehat{TestScore^{oos}} \]
\[\hat{e}^{oos} = E[TestScore| STR = STR^{oos}] + u^{oos} - \widehat{TestScore^{oos}}\]
\[\hat{e}^{oos} = u^{oos} + (E[TestScore| STR = STR^{oos}] - \widehat{TestScore^{oos}})\]
The first part is the error from using the mean to predict \(TestScore\)
The second part is the error in predicting the mean using \(\widehat{TestScore^{oos}}\)
The variance of this prediction error is
\[var(\hat{e}^{oos}) = var(u^{oos}) + var(\widehat{TestScore^{oos}})\]
The prediction is less accurate when
Test scores are spread more widely around their mean in the population
There is lots of sampling variation in the prediction of test scores
You can construct an interval around the prediction for \(TestScore^{oos}\)
This is called a prediction interval
Different from a confidence interval
Prediction intervals are for predictions of a specific value of the outcome
Interpretation is also slightly different (we will not get into it)
The prediction interval is
\[PI = \widehat{TestScore^{oos}} \pm t^c \times se(\widehat{TestScore^{oos}})\]
The value \(t^{c}\) is chosen depending on the confidence level
Note that this is wider than the confidence interval from earlier because of two sources of error
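The wider prediction interval can be sketched the same way; again the homoskedastic case, with assumed values for the \(STR\) mean and sum of squared deviations:

```python
import math

# The extra "1 +" inside the square root is the var(u_oos) term,
# which is what makes the PI wider than the CI for the mean.
# x_bar and sxx are ASSUMED values for illustration.
n = 420
sigma_hat = 20.324
beta0_hat, beta1_hat = 696.4934, -1.855817
x_bar, sxx = 19.64, 1853.3

str_oos = 25.0
pred = beta0_hat + beta1_hat * str_oos

se_mean = sigma_hat * math.sqrt(1.0 / n + (str_oos - x_bar) ** 2 / sxx)
se_pred = sigma_hat * math.sqrt(1.0 + 1.0 / n + (str_oos - x_bar) ** 2 / sxx)

t_c = 1.96
pi = (pred - t_c * se_pred, pred + t_c * se_pred)
```

With these numbers `se_pred` is roughly 20, close to the Root MSE: most of the prediction error for a single school comes from \(u^{oos}\), not from sampling error in the coefficients.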
We already have the predicted values from previous slides
Generate the standard error of the prediction for test score
twoway (line pred_testscr str) ///
    (line ci_upper str, lpattern(dash) sort) ///
    (line ci_lower str, lpattern(dash) sort) ///
    (line pi_upper str, lpattern(dash_dot) sort) ///
    (line pi_lower str, lpattern(dash_dot) sort), ///
    title(Predicted Test Scores and Prediction Intervals)
Suppose you want to evaluate how good your model is at prediction
Normally you would predict several values at the same time
One way to evaluate a group of predictions is with the Mean Squared Prediction Error (MSPE)
Typically you want to evaluate the predictions on data that have not been used to estimate the model
One way people do this is to split the sample randomly
Estimate model on half the sample, called the training data
Evaluate model using the other half, called the test data
The MSPE in the test data is
\[MSPE = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}}(TestScore_i - \widehat{TestScore}_{i})^2\]
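The split-sample procedure can be sketched end to end with simulated data (not the California sample); scores are generated as \(700 - 2 \times STR\) plus noise, so the model should predict reasonably well:

```python
import random

# Simulate (STR, score) pairs: score = 700 - 2*STR + noise
random.seed(1)
data = []
for _ in range(200):
    s = random.uniform(14.0, 26.0)               # class size
    y = 700.0 - 2.0 * s + random.gauss(0, 10)    # test score
    data.append((s, y))

# Split the sample randomly into training and test halves
random.shuffle(data)
train, test = data[:100], data[100:]

# Fit OLS on the training half only
n = len(train)
x_bar = sum(x for x, _ in train) / n
y_bar = sum(y for _, y in train) / n
sxx = sum((x - x_bar) ** 2 for x, _ in train)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in train) / sxx
b0 = y_bar - b1 * x_bar

# Evaluate predictions on the held-out test half
mspe = sum((y - (b0 + b1 * x)) ** 2 for x, y in test) / len(test)
```

Because the noise has variance 100, an MSPE in that neighbourhood indicates the model is predicting about as well as possible for this data-generating process.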
(212 missing values generated)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
pred_error2 | 208 437.0696 599.9934 1.49e-06 4363.157