Ordinary Least Squares

EC655

Justin Smith

Wilfrid Laurier University

Fall 2022

Estimating Parameters of the Linear Regression Model

Estimation by Method of Moments

Recall that the population linear regression model is

\[y = \mathbf{x}\boldsymbol{\beta} + u\]
We wish to estimate the slope parameters of this model
The first step is to collect a sample \(\{(\mathbf{x}_{i},y_{i}): i=1,2,...,n)\}\)
- Samples are drawn independently from the same distribution
- This means each is independent and identically distributed (iid)
The usual method we use for estimation is Ordinary Least Squares (OLS)
- Minimize the sum of the squared residuals

Estimation by Method of Moments

You get the same result with the Method of Moments
- Replace the unknown population moments with their sample equivalents
Above we found that the population slopes are

\[\boldsymbol{\beta}=(\textbf{E}[\mathbf{x'x}])^{-1} \textbf{E}[\mathbf{x}'y]\]
Replacing the population expectations with sample means across \(n\) observations

\[\boldsymbol{\hat{\beta}}=\left ( \frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right )^{-1} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right ) = \left ( \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right )^{-1} \left ( \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right )\]
Here \(i\) indexes the sample observation
\(\mathbf{x}_{i}\) is a \(1 \times (k+1)\) vector, so \(\mathbf{x_{i}'x_{i}}\) is a \((k+1) \times (k+1)\) matrix

Estimation by Method of Moments

Recall from the matrix review the concept of partitioned matrix multiplication
- You can break matrices into conformable vectors before multiplication
Based on that concept, we can rewrite

\[\left ( \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}} \right ) = \mathbf{X'X}\] \[\left ( \sum_{i=1}^{n}\mathbf{x}_{i}'y_{i}\right ) = \mathbf{X'y}\]
Substituting in, we get an equivalent expression for the estimated slope

\[\boldsymbol{\hat{\beta}}=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'y}\]

Ordinary Least Squares

You get the same the \(\boldsymbol{\hat{\beta}}\) if you use OLS

\[\min_{\boldsymbol{\hat{\beta}}} \sum_{i=1}^{n} (y_{i} - \mathbf{x}_{i}\boldsymbol{\hat{\beta}})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})'(\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})\]
Because of this, we will call \(\boldsymbol{\hat{\beta}}\) the OLS Estimator
You can use \(\boldsymbol{\hat{\beta}}\) to obtain predicted values of \(y\)

\[\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}}\]
By definition, the residual is the difference between \(\mathbf{y}\) and its predicted value

\[\mathbf{y} = \mathbf{X}\boldsymbol{\hat{\beta}} + \mathbf{\hat{u}}\]

Ordinary Least Squares

Algebraic Properties of OLS

Introduction

OLS is one way to estimate a linear regression model
It is important to know how well the method works
One way is to examine the “fit” of our regression line
- How close to the line are the datapoints?
- Does \(\mathbf{x}\) explain a large fraction of variation in \(y\)?
These are the algebraic properties of our estimator
- Mathematical relationships hold true in each sample
Different from the statistical properties
- The behaviour of estimators across (hypothetical) repeated samples

Properties of the Residuals

The first properties relate to the OLS residuals

\[\mathbf{X}'\hat{\mathbf{u}} = \begin{bmatrix} \sum_{i=1}^{n}\hat{u}_{i}\\ \sum_{i=1}^{n}x_{1i}\hat{u}_{i}\\ \vdots\\ \sum_{i=1}^{n}x_{ki}\hat{u}_{i}\\ \end{bmatrix} = \mathbf{0}\]
- The sum (and the mean) of the residuals is zero
- The sample covariance between \(x\) and the residuals is zero
These are the sample versions of \(\textbf{E}[\mathbf{x}'\mathbf{u}]=\mathbf{0}\)
- Also come from the first order conditions of OLS

\(R^2\)

The Coefficient of Determination (\(R^2\)) measures the fraction of the variation in \(y\) that is explained by the independent variables \[R^2 = \frac{ESS}{TSS}\]
TSS is the Total Sum of Squares \[TSS = \sum_{i=1}^{N} (y_{i} - \bar{y})^2 = \mathbf{(Ny)'Ny} = \mathbf{y'Ny}\]
- where \(\mathbf{N} = (\mathbf{I} - \frac{1}{n}\mathbf{11'})\) is a symmetric, idempotent matrix
  - \(\mathbf{1}\) is a vector of all ones
- \(\mathbf{N}\) turns a vector into deviations from means

\(R^2\)

ESS is the Explained Sum of Squares \[ESS = \sum_{i=1}^{N} (\hat{y}_{i} - \bar{y})^2 =\mathbf{(N\hat{y})'N\hat{y}} = \mathbf{\hat{y}'N\hat{y}}\]
And the Residual Sum of Squares (SSR) is \[SSR= \sum_{i=1}^{N} (\hat{u}_{i})^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}\]
\(R^2\) ranges between 0 and 1
- \(R^2 = 0\) means that \(\mathbf{x}\) explains none of the variation in \(y\)
- \(R^2 = 1\) means that \(\mathbf{x}\) explains all of the variation in \(y\)

\(R^2\)

\(R^2\) is also equal to the square of correlation coefficient between \(y_{i}\) and \(\hat{y}_{i}\)
- \(R^2 = 1\) is perfect correlation between prediction and actual values
An important relationship between sums of squares is \[TSS= ESS + SSR\]
- Movement of \(y_{i}\) away from its average is explained by regression and other factors
Using matrix notation,

\[\mathbf{\hat{y}'\mathbf{N}\hat{y}} = (\mathbf{X}\boldsymbol{\hat{\beta}} + \hat{\mathbf{u}})'\mathbf{N}(\mathbf{X}\boldsymbol{\hat{\beta} + \hat{\mathbf{u}}} )\] \[= (\boldsymbol{\hat{\beta}}' \mathbf{X}'+ \hat{\mathbf{u}}')(\mathbf{N}\mathbf{X}\boldsymbol{\hat{\beta} + \mathbf{N}\hat{\mathbf{u}}} )\] \[= \boldsymbol{\hat{\beta}}' \mathbf{X}' N\mathbf{X}\boldsymbol{\hat{\beta}} + \boldsymbol{\hat{\beta}}' \mathbf{X}' \mathbf{N}\hat{\mathbf{u}} + \hat{\mathbf{u}}'N\mathbf{X}\boldsymbol{\hat{\beta} +\hat{\mathbf{u}}' \mathbf{N}\hat{\mathbf{u}}}\]

\(R^2\)

Because \(\mathbf{N}\hat{\mathbf{u}} =\hat{\mathbf{u}}\) and \(\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}\) we can simplify to

\[= \boldsymbol{\hat{\beta}}' N\mathbf{X}' \mathbf{X}\boldsymbol{\hat{\beta}} +\hat{\mathbf{u}}' \hat{\mathbf{u}} = \mathbf{\hat{y}'N\hat{y}}+\hat{\mathbf{u}}' \hat{\mathbf{u}}\]
We noted above that \(ESS = \mathbf{\hat{y}'N\hat{y}}\) and \(SSR=\hat{\mathbf{u}}' \hat{\mathbf{u}}\)
As a result, you can reexpress \[R^2 = \frac{ESS}{TSS} = 1- \frac{SSR}{TSS}\]
Important to be cautious when using \(R^2\)
In real applications, \(R^2\) is often very low
- Does not mean regression is bad
- Just means we have not captured all factors that explain \(Y\)

\(R^2\)

A low \(R^2\) does not imply a poor estimate of \(\mathbf{\beta}\)
- \(\mathbf{\beta}\) measures effect on \(y\) from changing \(\mathbf{x}\), all else equal
- \(R^2\) measures fraction of total variation in \(y\) is explained by \(\mathbf{x}\)
- Concepts are independent of each other
Whether estimate of \(\mathbf{\beta}\) is good or bad depends on statistical properties
- Consistency and unbiasedness
- We cover these in detail below
In practice, researchers do not pay much attention to \(R^2\)

Standard Error of Regression (SER)

Can also measure fit with spread of data around regression line
The residual \(\hat{\mathbf{u}}\) is deviation of \(\mathbf{y}\) from prediction \[\mathbf{\hat{u}} = \mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}}\]
The standard error of regression (SER) is the standard deviation of \(\hat{u}_{i}\)
- The average distance of \(y_{i}\) from its prediction \(\hat{y}_{i}\)
\[SER = s_{\hat{u}} = \sqrt{\frac{\hat{\mathbf{u}}' \hat{\mathbf{u}}}{n-k-1} }= \sqrt{\frac{SSR}{n-k-1} }\]

Statistical Properties of OLS Estimator

Introduction

\(\boldsymbol{\hat{\beta}}\) is an estimator for \(\boldsymbol{\beta}\)
- Just like sample mean is an estimator for population mean
- \(\boldsymbol{\beta}\) is a population parameter, \(\boldsymbol{\hat{\beta}}\) is a function of the sample
Like all estimators, \(\boldsymbol{\hat{\beta}}\) are random variables with a sampling distribution
- Realized values change from sample to sample
We want estimators to have certain statistical properties
- Consistency
- Efficiency (lowest variance)
- Unbiasedness

Introduction

There are large sample properties and finite sample properties
- Large sample (asymptotic) properties hold as sample size grows large
- Finite sample ones hold in any sample size
We will focus most on large sample properties
- Consistency and large sample distribution
- These are often what matter most in applied work
- This will be different from your undergraduate work
Occasionally we will also touch on the finite sample properties
- For example, unbiasedness

Assumptions

To derive the statistical properties, we need some assumptions
We will focus on the assumptions necessary for the large sample properties

\(\textbf{E}[\mathbf{x}'u]=\mathbf{0}\)
- Says that the population residual is mean zero and uncorrelated with \(\mathbf{x}\)
- This is true in our model by definition, so not really an assumption
\(\text{rank } \textbf{E}[\mathbf{x'x}] = k+1\)
- Says that there is no linear dependence in \(\mathbf{x}\)
- Same as saying there is no perfect multicollinearity among regressors
  - None of the \(x\) variables are linear functions of another

Assumptions

\(\{(\mathbf{x}_{i},y_{i}): i=1,2,...,n)\}\) are a random sample
- Implies that the observations are independent and identically distributed (iid)
- This is necessary to establish consistency

Under these three assumptions you can show that \(\boldsymbol{\hat{\beta}}\)
- is a consistent estimator for \(\boldsymbol{\beta}\)
- has a large sample Normal distribution with
  - mean \(\boldsymbol{\beta}\)
  - variance \(n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\)

Consistency

Consistency:an estimator approaches the parameter when the sample gets big
- Removing sampling variation leaves us with the true parameter
Econometricians like estimators to be consistent at minimum
To establish consistency, we need the three assumptions noted above
- Population residual is mean zero and uncorrelated with \(\mathbf{x}\)
- No linear dependence among \(\mathbf{x}\)
- Random sampling

Consistency

Mathematically, write the OLS estimator as

\[\boldsymbol{\hat{\beta}}=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'y}\] \[=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'(X'\boldsymbol{\beta} + u)}\] \[=\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'X}\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}u\] \[=\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}u\]
If we partition the matrices, we can write this equivalently as

\[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]
Consistency means showing \(\boldsymbol{\hat{\beta}}\) approaches \(\boldsymbol{\beta}\) as \(n\) gets large

Consistency

Consistency is established with the probability limit
- Probability that an estimator is in small range around a parameter as \(n \rightarrow \infty\)
- The probability limit of OLS slope estimator \(\hat{\beta}_{k}\) is \(\beta_{k}\) if \[\lim_{n \rightarrow \infty} Pr(\beta_{k} - \epsilon < \hat{\beta}_{k} < \beta_{k} + \epsilon) \rightarrow 1\]
- The short form of this is \[\text{plim }(\hat{\beta}_{k}) = \beta_{k}\]

Consistency

You can use “plim” as an operator with the following rules

\[\text{plim}(x+y) = \text{plim}(x) + \text{plim}(y)\] \[\text{plim}(xy) = \text{plim}(x)\text{plim}(y)\] \[\text{plim}(\frac{x}{y}) = \frac{\text{plim}(x)}{ \text{plim}(y)}\]

Consistency

Applying this to our slope estimator

\[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta} + \text{plim} \left( \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\right )\] \[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta} + \text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \text{plim} \left( \frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]

Consistency

You can show that

\[\text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} = \left( \mathbf{E}[\mathbf{x'x}]\right )^{-1}\] \[\text{plim} \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x'}u\right ) = \mathbf{E}\left( \mathbf{x'}u\right )\]
A property of our model is \(\mathbf{E}\left( \mathbf{x'}u\right ) =\mathbf{0}\), so

\[\text{plim}(\boldsymbol{\hat{\beta}})=\boldsymbol{\beta}\]
Given our assumptions, the OLS estimator is consistent for \(\boldsymbol{\beta}\)
As the sample size gets large, the OLS estimator approaches the true parameter
Many economic researchers work with big samples, so asymptotic approach works

Unbiasedness

Economists generally think at minimum an estimator should be consistent
- Estimator should be close to the real value with big samples
Another desirable property is unbiasedness
- On average, the estimator should equal the parameter
This finite sample property holds for any sample size
The OLS estimator is unbiased under stricter assumptions than those made above
We need to replace assumption \(\textbf{E}[\mathbf{x}'u]=0\) with \(\textbf{E}[u|\mathbf{x}]=0\)
- Assumption says that \(u\) is unrelated to any function of \(\mathbf{x}\)

Unbiasedness

This is stronger than assuming \(\textbf{E}[\mathbf{x}'u]=\mathbf{0}\)
- Zero correlation means no linear relationship between \(u\) and \(\mathbf{x}\)
- \(\textbf{E}[u|\mathbf{x}]=0\) means no linear or nonlinear relationship
With this assumption we can show \(\boldsymbol{\hat{\beta}}\) is unbiased
Recall the OLS estimator is \[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}\]
We assume \(u\) and \(\mathbf{x}\) are unrelated in the population, so \(\textbf{E}[\mathbf{u}|\mathbf{X}]=\mathbf{0}\)
- Says each element of \(\mathbf{u}\) is unrelated to all parts of \(\mathbf{X}\)

Unbiasedness

Taking expectations

\[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]=\mathbf{E}[\boldsymbol{\beta} + \left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}|\mathbf{X}]\]
Bringing the expectation operator through the bracket we get

\[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]=\boldsymbol{\beta} + \mathbf{E}[\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{u}|\mathbf{X}]\] \[=\boldsymbol{\beta} + \mathbf{E}[\left ( \mathbf{X'X}\right )^{-1} \mathbf{X'}\mathbf{E}[\mathbf{u}|\mathbf{X}]\] \[=\boldsymbol{\beta}\]
Says \(\boldsymbol{\hat{\beta}}\) at each \(\mathbf{X}\)
If you take the average across all values of \(\mathbf{X}\), you get \[\mathbf{E}[\boldsymbol{\hat{\beta}}] =\mathbf{E}[\mathbf{E}[\boldsymbol{\hat{\beta}}|\mathbf{X}]] = \boldsymbol{\beta}\]

Unbiasedness

Important to emphasize unbiasedness requires stronger assumptions
Some estimators are consistent but biased in finite samples
- Instrumental variables (IV) estimator is one of them
- For this reason only use IV with large samples
This is why researchers focus more on consistency as basis for good estimator
We now discuss the distribution and variance of \(\boldsymbol{\hat{\beta}}\)
- This will allow us to move on to talking about inference
- We will emphasize that inference is as important as estimation

Large Sample Distribution of \(\boldsymbol{\hat{\beta}}\)

For doing hypothesis tests on \(\boldsymbol{\beta}\), we need to know the distribution of \(\boldsymbol{\hat{\beta}}\)
- Need to know where estimate falls in distribution of \(\boldsymbol{\hat{\beta}}\) under an assumption about \(\boldsymbol{\beta}\)
- If our estimate is in the tails of the distribution, we reject
- If it is in the center, we fail to reject
We will focus on the large-sample distribution
- Distribution as sample size gets large
- If you make assumptions about the errors, you can get the finite distribution
To get large sample distribution, we rely on the Central Limit Theorem (CLT)

Large Sample Distribution of \(\boldsymbol{\hat{\beta}}\)

The CLT says
- Random vectors \(w_{i}, i=1,2,...\) are iid with mean \(\mathbf{E}(\mathbf{w_{i}})\) and variance \(\text{var}(\mathbf{w_{i}})\)
- Then \(\bar{\mathbf{w}} = \frac{1}{n} \sum_{1=1}^{n}\mathbf{w_{i}}\) converges to \(N(\mathbf{E}(\mathbf{w_{i}}), n^{-1}\text{var}(\mathbf{w_{i}}))\)
If you write the slope estimator as

\[\boldsymbol{\hat{\beta}}=\boldsymbol{\beta} + \left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1} \left (\frac{1}{n}\sum_{i=1}^{n}\mathbf{x_{i}'}u_{i} \right )\]
You can see that taking mean of \(\boldsymbol{\hat{\beta}}-\boldsymbol{\beta}\) is equivalent to a mean of \(\mathbf{x_{i}'}u_{i}\)
Applying the CLT, \(\boldsymbol{\hat{\beta}}\) has a Normal distribution with mean \(\boldsymbol{\beta}\) and variance

\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\]

Large Sample Distribution of \(\boldsymbol{\hat{\beta}}\)

Note that this assumes nothing about the distribution of the error terms
Sometimes it makes sense to assume homoskedasticity of the errors
- Variation in the errors is not a function of \(\mathbf{x}\)
Usually this is stated as

\[\text{var}(u|\mathbf{x}) = \sigma^2\]
Note you can write the variance of \(u\) given \(\mathbf{x}\) as \[\text{var}(u|\mathbf{x}) = \mathbf{E}(u^2|\mathbf{x}) - \left [ \mathbf{E}(u|\mathbf{x}) \right ]^2\]
- The sum of two conditional expectations

Large Sample Distribution of \(\boldsymbol{\hat{\beta}}\)

Turns out we only need to assume the first component is constant

\[\mathbf{E}(u^2|\mathbf{x}) = \sigma^2\]
- Means zero covariance between \(u^2\) and squares and cross-products of regressors
We can then simplify the expression for \(\text{var}(\boldsymbol{\hat{\beta}})\)
First, note that because of the law of iterated expectations

\[\mathbf{E}(u^2\mathbf{x'x}) = \mathbf{E}(\mathbf{E}(u^2|\mathbf{x})\mathbf{x'x})\]
Substituting in \(\mathbf{E}(u^2|\mathbf{x}) = \sigma^2\), we get

\[\mathbf{E}(u^2\mathbf{x'x}) = \sigma^2\mathbf{E}(\mathbf{x'x})\]

Large Sample Distribution of \(\boldsymbol{\hat{\beta}}\)

Finally, substituting in to the expression for \(\text{var}(\boldsymbol{\hat{\beta}})\)

\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\sigma^2\mathbf{E}(\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[= \sigma^2 n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[= \sigma^2 n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\]
This is the version of the variance of \(\boldsymbol{\hat{\beta}}\) that is given by default in Stata
It is only valid under homoskedasticity
- In practice this is rarely the case
- Academic researchers almost never assume homoskedastic errors
Note that these are the population variances
- Based on population moments that we do not know

Variance Estimator for \(\boldsymbol{\hat{\beta}}\)

Just like we estimated \(\beta\), we must also estimate \(\text{var}(\boldsymbol{\hat{\beta}})\)
Follow the same procedure by replacing population moments with sample ones

\[\text{var}(\boldsymbol{\hat{\beta}}) = n^{-1}[\mathbf{E}(\mathbf{x'x})^{-1}]\mathbf{E}(u^2\mathbf{x'x})[\mathbf{E}(\mathbf{x'x})^{-1}]\] \[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = n^{-1}\left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1}\left ( \frac{1}{n} \sum_{i=1}^{n}\hat{u}_{i}^2\mathbf{x_{i}'x_{i}}\right )\left ( \frac{1}{n} \sum_{i=1}^{n}\mathbf{x_{i}'x_{i}}\right )^{-1}\]

\[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = \left (\mathbf{X'X}\right )^{-1}\left ( \sum_{i=1}^{n}\hat{u}_{i}^2\mathbf{x_{i}'x_{i}}\right )\left ( \mathbf{X'X}\right )^{-1}\]
This is the Heteroskedasticity-Robust estimator of the variance
- Sometimes just called the robust estimator

Variance Estimator for \(\boldsymbol{\hat{\beta}}\)

It is what Stata produces when you use the “robust” option in regression
The square root of the diagonal elements are the Robust Standard Errors
If we assume homoskedastic errors, we get

\[\hat{\text{var}}(\boldsymbol{\hat{\beta}}) = s_{\hat{u}}^2 \left (\mathbf{X'X}\right )^{-1}\]
where \(s_{\hat{u}}^2\) is a consistent estimator of \(\sigma^2\) and equals

\[s_{\hat{u}}^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k-1}\]
The square root of the diagonals are Stata’s default standard errors

Hypothesis Testing

Introduction

Recall that we will never know the value of the true slope parameters
- This is why we are estimating it
But, we can use our estimator and estimate to test hypotheses about them
Procedure is as follows
- Make tentative assumption about true slope
- Choose a test statistic, with known distribution under assumption
- Formulate a decision rule
- Check whether estimate is likely to occur under that rule
  - If no, then reject
  - If yes, fail to reject

Testing Single Linear Hypotheses

Often in applied econometrics we test hypotheses about one parameter
- Usually we are interested in effect of one of regressors on outcome
- We test whether that slope parameter is zero
The standard method is to use a \(t\)-test
In the linear regression model, the null and alternative hypotheses are
- \(H_{0}: \beta_{j} = \beta_{j,0}\)
- \(H_{1}: \beta_{j} \neq \beta_{j,0}\)
The test statistic is the \(t\) statistic

\[t=\frac{\hat{\beta}_{j} - \beta_{j,0}}{se(\hat{\beta}_{j} )}\]

Testing Single Linear Hypotheses

The \(t\)-statistic is a random variable that varies across samples
It has a Standard Normal distribution in large samples
- It is a standardized version of a Normal random variable
To compute the \(t\)-statistic, we need \(se(\hat{\beta}_{j} )\)
- This is the square root of the \(j\)th diagonal element of \(\hat{\text{var}}(\boldsymbol{\hat{\beta}})\)
- The formula we use for \(\hat{\text{var}}(\boldsymbol{\hat{\beta}})\) depends on whether we assume homoskedasticity
We then make a decision rule for rejection
- Usually this is defined in terms of a Significance Level
  - The significance level \(\alpha\) is the maximum proportion of all possible \(t\) values unusual enough to reject \(H_{0}\)
- Typically, this is \(\alpha = 0.05\)

Testing Single Linear Hypotheses

Choice of significance level divides sampling distribution into two regions
- Rejection Region: values for \(t\) where we reject \(H_{0}\)
  - One-sided test: the upper or lower \(\alpha \%\) of values
  - Two-sided test: the upper and lower \(\frac{\alpha}{2} \%\) of values
- Acceptance Region: values for \(t\) where we accept \(H_{0}\)
  - One-sided test: the upper or lower \((1-\alpha) \%\) of values
  - Two-sided test: the middle \((1-\alpha) \%\) of values
The Critical Value separates the acceptance and rejection regions
- On graphs below it is value between green and white regions
- For two-tailed tests, there are two critical values

Testing Single Linear Hypotheses

Value depends on sampling distribution and \(\alpha\)
- At \(5\%\) significance with a t-distribution and a large sample
  - Upper tailed test: \(t^{c}\) = 1.64
  - Lower tailed test: \(t^{c}\) = -1.64
  - Two tailed test: \(|t^{c}|\) = 1.96
Finally, we compare our realized test statistic to the critical values
- In two-tailed test reject if \(|t| > |t^{c}|\)
- In lower-tailed test reject if \(t < t^{c}\)
- In upper-tailed test reject if \(t > t^{c}\)
If you do not reject, you fail to reject
- Not the same as accepting the null hypothesis

Testing Multiple Linear Hypotheses

Sometimes you need to test multiple hypotheses at the same time
- Testing significance of the model
- Testing whether two parameters are equal
We follow a similar procedure
- Make tentative assumption about parameters
- Choose a test statistic with known distribution
- Make decision rule
- Compute test statistic and apply decision rule
Key difference is in the test statistic we use

Testing Multiple Linear Hypotheses

We can write multiple hypotheses with matrix notation

\[H_{0}: \mathbf{R}\boldsymbol{\beta} =\mathbf{r}\] \[H_{1}: \mathbf{R}\boldsymbol{\beta} \neq \mathbf{r}\]
- \(\mathbf{R}\) is a \(q \times k+1\) matrix of constants
- \(\boldsymbol{\beta}\) is a \(k+1 \times 1\) vector of slope parameters
- \(r\) is a \(q \times 1\) vector of constants
For example, suppose you want to test \(H_{0}: \beta_{2} = 0\). In this case

\[\mathbf{R} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}\] \[r = 0\]

Testing Multiple Linear Hypotheses

If you multiply this out you get

\[\mathbf{R}\boldsymbol{\beta} = \begin{bmatrix} 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2} \\ \vdots\\ \beta_{k} \end{bmatrix} =\beta_{2}\]
So that \(\mathbf{R}\boldsymbol{\beta} =r\) is equivalent to \(\beta_{2} = 0\)
Now imagine you want to test \(H_{0}: \beta_{2} = 0, \beta_{4} = 2\)

\[\mathbf{R} = \begin{bmatrix} 0 & 0 & 1 & 0 &0&\cdots & 0\\ 0 & 0 & 0 & 0 &1&\cdots & 0 \end{bmatrix}\] \[\mathbf{r} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}\]

Testing Multiple Linear Hypotheses

If you multiply out in this case

\[\mathbf{R}\boldsymbol{\beta} = \begin{bmatrix} 0 & 0 & 1 & 0 &0&\cdots & 0\\ 0 & 0 & 0 & 0 &1&\cdots & 0 \end{bmatrix} \begin{bmatrix} \beta_{0}\\ \beta_{1}\\ \beta_{2} \\ \vdots\\ \beta_{k} \end{bmatrix} = \begin{bmatrix} \beta_{2}\\ \beta_{4} \end{bmatrix}\]
So that \(\mathbf{R}\boldsymbol{\beta} =r\) is equivalent to \[\begin{bmatrix} \beta_{2}\\ \beta_{4} \end{bmatrix} = \begin{bmatrix} 0\\ 2 \end{bmatrix}\]

Testing Multiple Linear Hypotheses

When testing multiple linear restrictions, it is common to use the Wald Statistic

\[W = (\mathbf{R}\boldsymbol{\hat{\beta}} -\mathbf{r} )'(\mathbf{R}\hat{\text{var}}(\boldsymbol{\hat{\beta}}) \mathbf{R}')^{-1} (\mathbf{R}\boldsymbol{\hat{\beta}} -\mathbf{r} )\]
- Intuition is \(W\) computes squared distance between estimate and null hypothesis
- Distance is scaled by the variance
- If that distance is too far, we reject the null
- Very similar to intuition of \(t\)-test
\(\text{var}(\boldsymbol{\hat{\beta}})\) is variance matrix of OLS estimator computed above
- Use Robust or Homoskedastic version as appropriate

Testing Multiple Linear Hypotheses

Note that you can test single restrictions this way
Suppose you are testing \(\beta_{2} = 0\). The \(W\) statistic reduces to

\[W = \frac{(\hat{\beta}_{2} - 0)^2}{\text{var}(\hat{\beta}_{2})}\]

The \(W\) statistic has a \(\chi^2_{q}\) distribution in large samples
In Stata, this is sometimes implemented as an \(F\)-test, where

\[F = \frac{W}{q}\]
\(F\) has an \(F\) distribution with \((1,n-k-1)\) degrees of freedom