EC655
Justin Smith
Wilfrid Laurier University
Fall 2022
Recall that the population linear regression model is
\[y = \mathbf{x}\boldsymbol{\beta} + u\]
We wish to estimate the slope parameters of this model
The first step is to collect a sample \(\{(\mathbf{x}_{i},y_{i}): i=1,2,...,n)\}\)
Samples are drawn independently from the same distribution
This means each is independent and identically distributed (iid)
The usual method we use for estimation is Ordinary Least Squares (OLS)
You get the same result with the Method of Moments
Above we found that the population slopes are
\[\boldsymbol{\beta}=(\textbf{E}[\mathbf{x'x}])^{-1} \textbf{E}[\mathbf{x}'y]\]
Replacing the population expectations with sample means across \(n\) observations
\[\boldsymbol{\hat{\beta}}=\left ( \frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right )^{-1} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right ) = \left ( \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right )^{-1} \left ( \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right )\]
Here \(i\) indexes the sample observation
\(\mathbf{x}_{i}\) is a \(1 \times (k+1)\) vector, so \(\mathbf{x_{i}'x_{i}}\) is a \((k+1) \times (k+1)\) matrix
Recall from the matrix review the concept of partitioned matrix multiplication
Based on that concept, we can rewrite
\[\left ( \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right ) = \mathbf{X'X}\] \[\left ( \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right ) = \mathbf{X'y}\]
Substituting in, we get an equivalent expression for the estimated slope
\[\boldsymbol{\hat{\beta}}=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'y}\]
You get the same the \(\boldsymbol{\hat{\beta}}\) if you use OLS
\[\min_{\boldsymbol{\hat{\beta}}} \sum_{i=1}^{n} (y_{i} - \mathbf{x}_{i}\boldsymbol{\hat{\beta}})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})'(\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})\]
Because of this, we will call \(\boldsymbol{\hat{\beta}}\) the OLS Estimator
You can use \(\boldsymbol{\hat{\beta}}\) to obtain predicted values of \(y\)
\[\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}}\]
By definition, the residual is the difference between \(\mathbf{y}\) and its predicted value
\[\mathbf{y} = \mathbf{X}\boldsymbol{\hat{\beta}} + \mathbf{\hat{u}}\]
OLS is one way to estimate a linear regression model
It is important to know how well the method works
One way is to examine the “fit” of our regression line
How close to the line are the datapoints?
Does \(\mathbf{x}\) explain a large fraction of variation in \(y\)?
These are the algebraic properties of our estimator
Different from the statistical properties
The first properties relate to the OLS residuals
\[\mathbf{X}'\hat{\mathbf{u}} = \begin{bmatrix} \sum_{i=1}^{n}\hat{u}_{i}\\ \sum_{i=1}^{n}x_{1i}\hat{u}_{i}\\ \vdots\\ \sum_{i=1}^{n}x_{ki}\hat{u}_{i}\\ \end{bmatrix} = \mathbf{0}\]
The sum (and the mean) of the residuals is zero
The sample covariance between \(x\) and the residuals is zero
These are the sample versions of \(\textbf{E}[\mathbf{x}'\mathbf{u}]=\mathbf{0}\)
The Coefficient of Determination (\(R^2\)) measures the fraction of the variation in \(y\) that is explained by the independent variables \[R^2 = \frac{ESS}{TSS}\]
TSS is the Total Sum of Squares \[TSS = \sum_{i=1}^{N} (y_{i} - \bar{y})^2 = \mathbf{(Ny)'Ny} = \mathbf{y'Ny}\]
where \(\mathbf{N} = (\mathbf{I} - \frac{1}{n}\mathbf{11'})\) is a symmetric, idempotent matrix
\(\mathbf{N}\) turns a vector into deviations from means
ESS is the Explained Sum of Squares \[ESS = \sum_{i=1}^{N} (\hat{y}_{i} - \bar{y})^2 =\mathbf{(N\hat{y})'N\hat{y}} = \mathbf{\hat{y}'N\hat{y}}\]
And the Residual Sum of Squares (SSR) is \[SSR= \sum_{i=1}^{N} (\hat{u}_{i})^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}\]
\(R^2\) ranges between 0 and 1
\(R^2 = 0\) means that \(\mathbf{x}\) explains none of the variation in \(y\)
\(R^2 = 1\) means that \(\mathbf{x}\) explains all of the variation in \(y\)
\(R^2\) is also equal to the square of correlation coefficient between \(y_{i}\) and \(\hat{y}_{i}\)
An important relationship between sums of squares is \[TSS= ESS + SSR\]
Using matrix notation,
\[\mathbf{\hat{y}'\mathbf{N}\hat{y}} = (\mathbf{X}\boldsymbol{\hat{\beta}} + \hat{\mathbf{u}})'\mathbf{N}(\mathbf{X}\boldsymbol{\hat{\beta} + \hat{\mathbf{u}}} )\] \[= (\boldsymbol{\hat{\beta}}' \mathbf{X}'+ \hat{\mathbf{u}}')(\mathbf{N}\mathbf{X}\boldsymbol{\hat{\beta} + \mathbf{N}\hat{\mathbf{u}}} )\] \[= \boldsymbol{\hat{\beta}}' \mathbf{X}' N\mathbf{X}\boldsymbol{\hat{\beta}} + \boldsymbol{\hat{\beta}}' \mathbf{X}' \mathbf{N}\hat{\mathbf{u}} + \hat{\mathbf{u}}'N\mathbf{X}\boldsymbol{\hat{\beta} +\hat{\mathbf{u}}' \mathbf{N}\hat{\mathbf{u}}}\]
Because \(\mathbf{N}\hat{\mathbf{u}} =\hat{\mathbf{u}}\) and \(\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}\) we can simplify to
\[= \boldsymbol{\hat{\beta}}' N\mathbf{X}' \mathbf{X}\boldsymbol{\hat{\beta}} +\hat{\mathbf{u}}' \hat{\mathbf{u}} = \mathbf{\hat{y}'N\hat{y}}+\hat{\mathbf{u}}' \hat{\mathbf{u}}\]
We noted above that \(ESS = \mathbf{\hat{y}'N\hat{y}}\) and \(SSR=\hat{\mathbf{u}}' \hat{\mathbf{u}}\)
As a result, you can reexpress \[R^2 = \frac{ESS}{TSS} = 1- \frac{SSR}{TSS}\]
Important to be cautious when using \(R^2\)
In real applications, \(R^2\) is often very low
Does not mean regression is bad
Just means we have not captured all factors that explain \(Y\)
A low \(R^2\) does not imply a poor estimate of \(\mathbf{\beta}\)
\(\mathbf{\beta}\) measures effect on \(y\) from changing \(\mathbf{x}\), all else equal
\(R^2\) measures fraction of total variation in \(y\) is explained by \(\mathbf{x}\)
Concepts are independent of each other
Whether estimate of \(\mathbf{\beta}\) is good or bad depends on statistical properties
Consistency and unbiasedness
We cover these in detail below
In practice, researchers do not pay much attention to \(R^2\)
Can also measure fit with spread of data around regression line
The residual \(\hat{\mathbf{u}}\) is deviation of \(\mathbf{y}\) from prediction \[\mathbf{\hat{u}} = \mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}}\]
The standard error of regression (SER) is the standard deviation of \(\hat{u}_{i}\)
\[SER = s_{\hat{u}} = \sqrt{\frac{\hat{\mathbf{u}}' \hat{\mathbf{u}}}{n-k-1} }= \sqrt{\frac{SSR}{n-k-1} }\]
\(\boldsymbol{\hat{\beta}}\) is an estimator for \(\boldsymbol{\beta}\)
Just like sample mean is an estimator for population mean
\(\boldsymbol{\beta}\) is a population parameter, \(\boldsymbol{\hat{\beta}}\) is a function of the sample
Like all estimators, \(\boldsymbol{\hat{\beta}}\) are random variables with a sampling distribution
We want estimators to have certain statistical properties
Consistency
Efficiency (lowest variance)
Unbiasedness
There are large sample properties and finite sample properties
Large sample (asymptotic) properties hold as sample size grows large
Finite sample ones hold in any sample size
We will focus most on large sample properties
Consistency and large sample distribution
These are often what matter most in applied work
This will be different from your undergraduate work
Occasionally we will also touch on the finite sample properties
To derive the statistical properties, we need some assumptions
We will focus on the assumptions necessary for the large sample properties
\(\textbf{E}[\mathbf{x}'u]=\mathbf{0}\)
Says that the population residual is mean zero and uncorrelated with \(\mathbf{x}\)
This is true in our model by definition, so not really an assumption
\(\text{rank } \textbf{E}[\mathbf{x'x}] = k+1\)
Says that there is no linear dependence in \(\mathbf{x}\)
Same as saying there is no perfect multicollinearity among regressors
\(\{(\mathbf{x}_{i},y_{i}): i=1,2,...,n)\}\) are a random sample
Implies that the observations are independent and identically distributed (iid)
This is necessary to establish consistency
Under these three assumptions you can show that \(\boldsymbol{\hat{\beta}}\)
is a consistent estimator for \(\boldsymbol{\beta}\)
has a large sample Normal distribution with
mean \(\boldsymbol{\beta}\)
variance \(n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\)
Consistency:an estimator approaches the parameter when the sample gets big
Econometricians like estimators to be consistent at minimum
To establish consistency, we need the three assumptions noted above
Population residual is mean zero and uncorrelated with \(\mathbf{x}\)
No linear dependence among \(\mathbf{x}\)
Random sampling
Mathematically, write the OLS estimator as
\[\boldsymbol{\hat{\beta}}=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'y}\] \[=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'(X'\boldsymbol{\beta} + u)}\] \[=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'X}\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}u\] \[=\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}u\]
If we partition the matrices, we can write this equivalently as
\[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]
Consistency means showing \(\boldsymbol{\hat{\beta}}\) approaches \(\boldsymbol{\beta}\) as \(n\) gets large
Consistency is established with the probability limit
Probability that an estimator is in small range around a parameter as \(n \rightarrow \infty\)
The probability limit of OLS slope estimator \(\hat{\beta}_{k}\) is \(\beta_{k}\) if \[\lim_{n \rightarrow \infty} Pr(\beta_{k} - \epsilon < \hat{\beta}_{k} < \beta_{k} + \epsilon) \rightarrow 1\]
The short form of this is \[\text{plim }(\hat{\beta}_{k}) = \beta_{k}\]
\[\text{plim}(x+y) = \text{plim}(x) + \text{plim}(y)\] \[\text{plim}(xy) = \text{plim}(x)\text{plim}(y)\] \[\text{plim}(\frac{x}{y}) = \frac{\text{plim}(x)}{ \text{plim}(y)}\]
\[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta} + \text{plim} \left( \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\right )\] \[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta} + \text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \text{plim} \left( \frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]
You can show that
\[\text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} = \left( \mathbf{E}[\mathbf{x'x}]\right )^{-1}\] \[\text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x'}u\right ) = \mathbf{E}\left( \mathbf{x'}u\right )\]
A property of our model is \(\mathbf{E}\left( \mathbf{x'}u\right ) =\mathbf{0}\), so
\[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta}\]
Given our assumptions, the OLS estimator is consistent for \(\boldsymbol{\beta}\)
As the sample size gets large, the OLS estimator approaches the true parameter
Many economic researchers work with big samples, so asymptotic approach works
Economists generally think at minimum an estimator should be consistent
Another desirable property is unbiasedness
This finite sample property holds for any sample size
The OLS estimator is unbiased under stricter assumptions than those made above
We need to replace assumption \(\textbf{E}[\mathbf{x}'u]=0\) with \(\textbf{E}[u|\mathbf{x}]=0\)
This is stronger than assuming \(\textbf{E}[\mathbf{x}'u]=\mathbf{0}\)
Zero correlation means no linear relationship between \(u\) and \(\mathbf{x}\)
\(\textbf{E}[u|\mathbf{x}]=0\) means no linear or nonlinear relationship
With this assumption we can show \(\boldsymbol{\hat{\beta}}\) is unbiased
Recall the OLS estimator is \[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}\]
We assume \(u\) and \(\mathbf{x}\) are unrelated in the population, so \(\textbf{E}[\mathbf{u}|\mathbf{X}]=\mathbf{0}\)
Taking expectations
\[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]=\mathbf{E}[\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}|\mathbf{X}]\]
Bringing the expectation operator through the bracket we get
\[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]=\boldsymbol{\beta} + \mathbf{E}[\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}|\mathbf{X}]\] \[=\boldsymbol{\beta} + \mathbf{E}[\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{E}[\mathbf{u}|\mathbf{X}]\] \[=\boldsymbol{\beta}\]
Says \(\boldsymbol{\hat{\beta}}\) at each \(\mathbf{X}\)
If you take the average across all values of \(\mathbf{X}\), you get \[\mathbf{E}[\boldsymbol{\hat{\beta}}] =\mathbf{E}[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]] = \boldsymbol{\beta}\]
Important to emphasize unbiasedness requires stronger assumptions
Some estimators are consistent but biased in finite samples
Instrumental variables (IV) estimator is one of them
For this reason only use IV with large samples
This is why researchers focus more on consistency as basis for good estimator
We now discuss the distribution and variance of \(\boldsymbol{\hat{\beta}}\)
This will allow us to move on to talking about inference
We will emphasize that inference is as important as estimation
For doing hypothesis tests on \(\boldsymbol{\beta}\), we need to know the distribution of \(\boldsymbol{\hat{\beta}}\)
Need to know where estimate falls in distribution of \(\boldsymbol{\hat{\beta}}\) under an assumption about \(\boldsymbol{\beta}\)
If our estimate is in the tails of the distribution, we reject
If it is in the center, we fail to reject
We will focus on the large-sample distribution
Distribution as sample size gets large
If you make assumptions about the errors, you can get the finite distribution
To get large sample distribution, we rely on the Central Limit Theorem (CLT)
The CLT says
Random vectors \(w_{i}, i=1,2,...\) are iid with mean \(\mathbf{E}(\mathbf{w_{i}})\) and variance \(\text{var}(\mathbf{w_{i}})\)
Then \(\bar{\mathbf{w}} = \frac{1}{n} \sum_{1=1}^{n}\mathbf{w_{i}}\) converges to \(N(\mathbf{E}(\mathbf{w_{i}}), n^{-1}\text{var}(\mathbf{w_{i}}))\)
If you write the slope estimator as
\[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]
You can see that taking mean of \(\boldsymbol{\hat{\beta}}-\boldsymbol{\beta}\) is equivalent to a mean of \(\mathbf{x_{i}'}u_{i}\)
Applying the CLT, \(\boldsymbol{\hat{\beta}}\) has a Normal distribution with mean \(\boldsymbol{\beta}\) and variance
\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\]
Note that this assumes nothing about the distribution of the error terms
Sometimes it makes sense to assume homoskedasticity of the errors
Usually this is stated as
\[\text{var}(u|\mathbf{x}) = \sigma^2\]
Note you can write the variance of \(u\) given \(\mathbf{x}\) as \[\text{var}(u|\mathbf{x}) = \mathbf{E}(u^2|\mathbf{x}) - \left [ \mathbf{E}(u|\mathbf{x}) \right ]^2\]
Turns out we only need to assume the first component is constant
\[\mathbf{E}(u^2|\mathbf{x}) = \sigma^2\]
We can then simplify the expression for \(\text{var}(\boldsymbol{\hat{\beta}})\)
First, note that because of the law of iterated expectations
\[\mathbf{E}(u^2\mathbf{x'x}) = \mathbf{E}(\mathbf{E}(u^2|\mathbf{x})\mathbf{x'x})\]
Substituting in \(\mathbf{E}(u^2|\mathbf{x}) = \sigma^2\), we get
\[\mathbf{E}(u^2\mathbf{x'x}) = \sigma^2\mathbf{E}(\mathbf{x'x})\]
Finally, substituting in to the expression for \(\text{var}(\boldsymbol{\hat{\beta}})\)
\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\sigma^2\mathbf{E}(\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[= \sigma^2 n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[= \sigma^2 n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\]
This is the version of the variance of \(\boldsymbol{\hat{\beta}}\) that is given by default in Stata
It is only valid under homoskedasticity
In practice this is rarely the case
Academic researchers almost never assume homoskedastic errors
Note that these are the population variances
Just like we estimated \(\beta\), we must also estimate \(\text{var}(\boldsymbol{\hat{\beta}})\)
Follow the same procedure by replacing population moments with sample ones
\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = n^{-1}\left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1}\left ( \frac{1}{n} \sum_{i=1}^{n}\hat{u}_{i}^2\mathbf{x_{i}'x_{i}}\right )\left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1}\]
\[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = \left (\mathbf{X'X}\right )^{-1}\left ( \sum_{i=1}^{n}\hat{u}_{i}^2\mathbf{x_{i}'x_{i}}\right )\left ( \mathbf{X'X}\right )^{-1}\]
This is the Heteroskedasticity-Robust estimator of the variance
It is what Stata produces when you use the “robust” option in regression
The square root of the diagonal elements are the Robust Standard Errors
If we assume homoskedastic errors, we get
\[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = s_{\hat{u}}^2 \left (\mathbf{X'X}\right )^{-1}\]
where \(s_{\hat{u}}^2\) is a consistent estimator of \(\sigma^2\) and equals
\[s_{\hat{u}}^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k-1}\]
The square root of the diagonals are Stata’s default standard errors
Recall that we will never know the value of the true slope parameters
But, we can use our estimator and estimate to test hypotheses about them
Procedure is as follows
Make tentative assumption about true slope
Choose a test statistic, with known distribution under assumption
Formulate a decision rule
Check whether estimate is likely to occur under that rule
If no, then reject
If yes, fail to reject
Often in applied econometrics we test hypotheses about one parameter
Usually we are interested in effect of one of regressors on outcome
We test whether that slope parameter is zero
The standard method is to use a \(t\)-test
In the linear regression model, the null and alternative hypotheses are
\(H_{0}: \beta_{j} = \beta_{j,0}\)
\(H_{1}: \beta_{j} \neq \beta_{j,0}\)
The test statistic is the \(t\) statistic
\[t=\frac{\hat{\beta}_{j} - \beta_{j,0}}{se(\hat{\beta}_{j} )}\]
The \(t\)-statistic is a random variable that varies across samples
It has a Standard Normal distribution in large samples
To compute the \(t\)-statistic, we need \(se(\hat{\beta}_{j} )\)
This is the square root of the \(j\)th diagonal element of \(\hat{\text{var}}(\boldsymbol{\hat{\beta}})\)
The formula we use for \(\hat{\text{var}}(\boldsymbol{\hat{\beta}})\) depends on whether we assume homoskedasticity
We then make a decision rule for rejection
Usually this is defined in terms of a Significance Level
Typically, this is \(\alpha = 0.05\)
Choice of significance level divides sampling distribution into two regions
Rejection Region: values for \(t\) where we reject \(H_{0}\)
One-sided test: the upper or lower \(\alpha \%\) of values
Two-sided test: the upper and lower \(\frac{\alpha}{2} \%\) of values
Acceptance Region: values for \(t\) where we accept \(H_{0}\)
One-sided test: the upper or lower \((1-\alpha) \%\) of values
Two-sided test: the middle \((1-\alpha) \%\) of values
The Critical Value separates the acceptance and rejection regions
On graphs below it is value between green and white regions
For two-tailed tests, there are two critical values
Value depends on sampling distribution and \(\alpha\)
At \(5\%\) significance with a t-distribution and a large sample
Upper tailed test: \(t^{c}\) = 1.64
Lower tailed test: \(t^{c}\) = -1.64
Two tailed test: \(|t^{c}|\) = 1.96
Finally, we compare our realized test statistic to the critical values
In two-tailed test reject if \(|t| > |t^{c}|\)
In lower-tailed test reject if \(t < t^{c}\)
In upper-tailed test reject if \(t > t^{c}\)
If you do not reject, you fail to reject
Sometimes you need to test multiple hypotheses at the same time
Testing significance of the model
Testing whether two parameters are equal
We follow a similar procedure
Make tentative assumption about parameters
Choose a test statistic with known distribution
Make decision rule
Compute test statistic and apply decision rule
Key difference is in the test statistic we use
We can write multiple hypotheses with matrix notation
\[H_{0}: \mathbf{R}\boldsymbol{\beta} =\mathbf{r}\] \[H_{1}: \mathbf{R}\boldsymbol{\beta} \neq \mathbf{r}\]
\(\mathbf{R}\) is a \(q \times k+1\) matrix of constants
\(\boldsymbol{\beta}\) is a \(k+1 \times 1\) vector of slope parameters
\(r\) is a \(q \times 1\) vector of constants
For example, suppose you want to test \(H_{0}: \beta_{2} = 0\). In this case
\[\mathbf{R} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}\] \[r = 0\]
If you multiply this out you get
\[\mathbf{R}\boldsymbol{\beta} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2} \\ \vdots\\ \beta_{k} \end{bmatrix} =\beta_{2}\]
So that \(\mathbf{R}\boldsymbol{\beta} =r\) is equivalent to \(\beta_{2} = 0\)
Now imagine you want to test \(H_{0}: \beta_{2} = 0, \beta_{4} = 2\)
\[\mathbf{R} = \begin{bmatrix} 0 & 0 & 1 & 0 &0&\cdots & 0\\ 0 & 0 & 0 & 0 &1&\cdots & 0 \end{bmatrix}\] \[\mathbf{r} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}\]
If you multiply out in this case
\[\mathbf{R}\boldsymbol{\beta} = \begin{bmatrix} 0 & 0 & 1 & 0 &0&\cdots & 0\\ 0 & 0 & 0 & 0 &1&\cdots & 0 \end{bmatrix} \begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2} \\ \vdots\\ \beta_{k} \end{bmatrix} = \begin{bmatrix} \beta_{2}\\ \beta_{4} \end{bmatrix}\]
So that \(\mathbf{R}\boldsymbol{\beta} =r\) is equivalent to \[\begin{bmatrix} \beta_{2}\\ \beta_{4} \end{bmatrix} = \begin{bmatrix} 0\\ 2 \end{bmatrix}\]
When testing multiple linear restrictions, it is common to use the Wald Statistic
\[W = (\mathbf{R}\boldsymbol{\hat{\beta}} -\mathbf{r} )'(\mathbf{R}\hat{\text{var}}(\boldsymbol{\hat{\beta}}) \mathbf{R}')^{-1} (\mathbf{R}\boldsymbol{\hat{\beta}} -\mathbf{r} )\]
Intuition is \(W\) computes squared distance between estimate and null hypothesis
Distance is scaled by the variance
If that distance is too far, we reject the null
Very similar to intuition of \(t\)-test
\(\text{var}(\boldsymbol{\hat{\beta}})\) is variance matrix of OLS estimator computed above
Note that you can test single restrictions this way
Suppose you are testing \(\beta_{2} = 0\). The \(W\) statistic reduces to
\[W = \frac{(\hat{\beta}_{2} - 0)^2}{\text{var}(\hat{\beta}_{2})}\]
The \(W\) statistic has a \(\chi^2_{q}\) distribution in large samples
In Stata, this is sometimes implemented as an \(F\)-test, where
\[F = \frac{W}{q}\]
\(F\) has an \(F\) distribution with \((1,n-k-1)\) degrees of freedom