Appendix A — OLS: Technical Details
This section provides technical details about the linear model and the OLS estimator.
A.1 Probability toolbox
We will use some results from probability theory. Let V and W be random variables (or random vectors with compatible dimensions).
1.) Law of the iterated expectation (LIE):
E[V] = E[E[V|W]]. See Stock and Watson Section 2.3.
2.) Conditioning theorem (CT):
E[W V|W] = W E[V|W]. Moreover, E[g(W)V|W] = g(W)E[V|W] for any function g(\cdot).
3.) Independence rule (IR):
If V and W are independent, then E[V] = E[V|W]. Moreover, if V and W_2 are independent, then E[V|W_1] = E[V|W_1, W_2].
4.) Functions of independent random variables:
If V and W are independent and g(\cdot) and h(\cdot) are functions, then g(V) and h(W) are independent.
5.) Cauchy-Schwarz inequality:
|E[VW]| \leq \sqrt{E[V^2]} \sqrt{E[W^2]}. See Stock and Watson Appendix 18.2.
6.) Convergence in probability:
The sequence W_n converges in probability to the constant C, written W_n \overset{p}{\to} C, if, for any \delta > 0, P(|W_n - C| > \delta) \to 0 \quad (\text{as} \ n \to \infty). If W_n is a random vector or matrix, and C is a deterministic vector or matrix, then W_n \overset{p}{\to} C if each entry of W_n - C converges in probability to zero. W_n is called consistent for C if W_n \overset{p}{\to} C. A sufficient condition for consistency is that both E[W_n] \to C and Var[W_n] \to 0 as n \to \infty. See Stock and Watson Section 2.6 and 18.2.
7.) Law of large numbers:
If W_1, W_2, \ldots is an i.i.d. sequence with E[W_i^2] < \infty (or bounded second moment in each entry for vectors/matrices), then \frac{1}{n} \sum_{i=1}^n W_i \overset{p}{\to} E[W_i]. See Stock and Watson Section 2.6 and 18.2.
8.) Convergence in distribution:
Let F_n be the cumulative distribution function (CDF) of W_n and let G be the CDF of V. W_n converges in distribution to V, written W_n \overset{d}{\to} V, if F_n(a) \to G(a) for all a at which G is continuous. If V is \mathcal N(\mu, \Sigma) distributed, we also write W_n \overset{d}{\to} \mathcal N(\mu, \Sigma).
9.) Multivariate central limit theorem:
If W_1, W_2, \ldots is an i.i.d. sequence of vectors with bounded second moments in each entry, then \sqrt n \bigg( \frac{1}{n} \sum_{i=1}^n W_i - E[W_i] \bigg) \overset{d}{\to} \mathcal N(\boldsymbol 0, Var[W_i]).
See Stock and Watson Section 19.2.
10.) Continuous mapping theorem:
Let g(\cdot) be a continuous function. If W_n \overset{p}{\to} C, then g(W_n) \overset{p}{\to} g(C). Also, if W_n \overset{d}{\to} V, then g(W_n) \overset{d}{\to} g(V). See Stock and Watson Section 18.2.
11.) Slutsky’s theorem:
If V_n \overset{p}{\to} C and W_n \overset{p}{\to} D, then V_n W_n \overset{p}{\to} C D. If V_n \overset{p}{\to} C and W_n \overset{d}{\to} W, then V_n W_n \overset{d}{\to} C W. See Stock and Watson Section 18.2.
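The law of large numbers and the central limit theorem from the list above can be illustrated with a small simulation. The following is a minimal sketch, not part of the original derivations: the exponential distribution, the seed, and the sample sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical example: W_i ~ Exponential with E[W_i] = 2 and Var[W_i] = 4.
mean_w, var_w = 2.0, 4.0

# Law of large numbers: the sample mean gets close to E[W_i] as n grows.
lln_errors = {}
for n in (100, 10_000, 1_000_000):
    w = rng.exponential(scale=mean_w, size=n)
    lln_errors[n] = abs(w.mean() - mean_w)
    print(n, lln_errors[n])

# Central limit theorem: sqrt(n) * (sample mean - E[W_i]) is approximately
# N(0, Var[W_i]); we check its mean and variance across many replications.
n, reps = 500, 5_000
samples = rng.exponential(scale=mean_w, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mean_w)
clt_mean, clt_var = z.mean(), z.var()
print(clt_mean, clt_var)
```

With these settings, the absolute error of the sample mean should be small for the largest n, and the simulated variance of the scaled mean should be close to Var[W_i] = 4.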
A.2 Conditional Expectation
Inserting the model equation Y_i = \boldsymbol X_i' \boldsymbol \beta + u_i gives E[Y_i|\boldsymbol X_i] = E[\boldsymbol X_i' \boldsymbol \beta + u_i|\boldsymbol X_i] = \underbrace{E[\boldsymbol X_i' \boldsymbol \beta|\boldsymbol X_i]}_{\overset{\tiny (CT)}{=} \boldsymbol X_i' \boldsymbol \beta} + \underbrace{E[u_i|\boldsymbol X_i]}_{\overset{\tiny (A1)}{=}0} = \boldsymbol X_i' \boldsymbol \beta.
A.3 Weak exogeneity
(A1) and the law of iterated expectations (LIE) imply E[u_i] \overset{\tiny (LIE)}{=} E[\underbrace{E[u_i | \boldsymbol X_i]}_{=0}] = E[0] = 0, and the conditioning theorem (CT) yields \begin{align*} Cov(u_i, X_{il}) &= E[u_i X_{il}] - \underbrace{E[u_i]}_{=0} E[X_{il}] \\ &\overset{\tiny (LIE)}{=} E[E[u_i X_{il} | \boldsymbol X_{i}]] \overset{\tiny (CT)}{=} E[ \underbrace{E[u_i | \boldsymbol X_{i}]}_{=0}X_{il}] = 0. \end{align*}
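A quick numerical check can make the zero-covariance result concrete. The sketch below assumes a hypothetical design with a single standard normal regressor and an error whose conditional variance is 0.5 + x^2 (so E[u|x] = 0 holds, but u and x are not independent). The sample covariance of u and x should be near zero, while u^2 and x^2 remain clearly correlated.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed for reproducibility
n = 1_000_000

# Hypothetical design (an assumption for illustration only):
# x ~ N(0, 1) and u = e * sqrt(0.5 + x^2) with e ~ N(0, 1) independent of x.
x = rng.normal(size=n)
u = rng.normal(size=n) * np.sqrt(0.5 + x**2)

# E[u | x] = 0 implies Cov(u, x) = 0 ...
cov_u_x = np.cov(u, x)[0, 1]
# ... but the error variance still depends on x, so u^2 and x^2 are correlated.
cov_u2_x2 = np.cov(u**2, x**2)[0, 1]
print(cov_u_x, cov_u2_x2)
```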
A.4 Strict exogeneity
The i.i.d. assumption (A2) implies that \{(Y_i, \boldsymbol X_i',u_i), i=1,\ldots,n\} is an i.i.d. collection since u_i = Y_i - \boldsymbol X_i'\boldsymbol \beta is a function of a random sample, and functions of independent variables are independent as well. Therefore, u_i and \boldsymbol X_j are independent for i \neq j, and (IR) implies E[u_i | \boldsymbol X_1, \ldots, \boldsymbol X_n] = E[u_i | \boldsymbol X_i]. Then, E[u_i | \boldsymbol X] = E[u_i | \boldsymbol X_1, \ldots, \boldsymbol X_n] \overset{\tiny (A2)}{=} E[u_i | \boldsymbol X_i] \overset{\tiny (A1)}{=} 0, and, for all i and j, Cov(u_i, X_{jl}) = \underbrace{E[u_i X_{jl}]}_{=0} - \underbrace{E[u_i]}_{=0} E[X_{jl}] = 0, where E[u_i X_{jl}] = E[X_{jl} E[u_i | \boldsymbol X]] = 0 by the LIE and CT.
A.5 Heteroskedasticity
Var[u_i | \boldsymbol X] = E[u_i^2 | \boldsymbol X] \overset{\tiny (A2)}{=} E[u_i^2 | \boldsymbol X_i] =: \sigma_i^2 = \sigma^2(\boldsymbol X_i).
A.6 No autocorrelation
(A2) implies that u_i is independent of u_j for i \neq j, and E[u_i | u_j, \boldsymbol X] = E[u_i | \boldsymbol X] = 0, which implies
E[u_i u_j | \boldsymbol X] \overset{\tiny (LIE)}{=} E\big[E[u_i u_j | u_j, \boldsymbol X] | \boldsymbol X\big] \overset{\tiny (CT)}{=} E\big[u_j \underbrace{E[u_i | u_j, \boldsymbol X]}_{=0} | \boldsymbol X\big] = 0,
and Cov(u_i, u_j) = E[u_i u_j] \overset{\tiny (LIE)}{=} E[E[u_i u_j | \boldsymbol X]] = 0. The conditional covariance matrix is \boldsymbol D := Var[\boldsymbol u | \boldsymbol X] = E[\boldsymbol u \boldsymbol u' | \boldsymbol X] = \begin{pmatrix} \sigma_1^2 & 0 & \ldots & 0 \\ 0 & \sigma_2^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_n^2 \end{pmatrix}.
A.7 Existence
rank(\boldsymbol X) = k \quad \Leftrightarrow \quad rank(\boldsymbol X' \boldsymbol X) = k \quad \Leftrightarrow \quad \boldsymbol X' \boldsymbol X \ \text{is invertible}.
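The rank equivalence can be checked numerically. The following sketch uses two small hypothetical design matrices (one with full column rank, one with perfectly collinear columns) and NumPy's `matrix_rank`; the explicit tolerance is an assumption made to keep the rank decision robust to rounding.

```python
import numpy as np

# Hypothetical design with full column rank (k = 3): intercept and two regressors.
X_full = np.array([[1.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 3.0, 1.0]])

# Rank-deficient variant: the third column is exactly twice the second.
X_def = X_full.copy()
X_def[:, 2] = 2.0 * X_def[:, 1]

tol = 1e-10  # explicit threshold for treating a singular value as zero

rank_full = np.linalg.matrix_rank(X_full, tol=tol)
rank_gram_full = np.linalg.matrix_rank(X_full.T @ X_full, tol=tol)
rank_def = np.linalg.matrix_rank(X_def, tol=tol)
rank_gram_def = np.linalg.matrix_rank(X_def.T @ X_def, tol=tol)

# rank(X) and rank(X'X) agree in both cases; X'X is invertible only if rank = k.
print(rank_full, rank_gram_full, rank_def, rank_gram_def)
```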
A.8 Unbiasedness
(A4) ensures that \widehat{\boldsymbol \beta} is well defined. The following decomposition is useful: \begin{align*} \widehat{\boldsymbol \beta} &= (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol Y \\ &= (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' (\boldsymbol X \boldsymbol \beta + \boldsymbol u) \\ &= (\boldsymbol X' \boldsymbol X)^{-1} (\boldsymbol X' \boldsymbol X) \boldsymbol \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u \\ &= \boldsymbol \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u. \end{align*} Strict exogeneity implies E[\boldsymbol u | \boldsymbol X] = \boldsymbol 0, and E[\widehat{ \boldsymbol \beta} - \boldsymbol \beta | \boldsymbol X] = E[(\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \overset{\tiny (CT)}{=} (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \underbrace{E[\boldsymbol u | \boldsymbol X]}_{= \boldsymbol 0} = \boldsymbol 0. By the LIE, E[\widehat{ \boldsymbol \beta}] = E[E[\widehat{ \boldsymbol \beta} | \boldsymbol X]] = E[\boldsymbol \beta] = \boldsymbol \beta.
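Conditional unbiasedness can be illustrated by holding the regressor matrix fixed and redrawing the errors many times. This is a minimal Monte Carlo sketch under assumed values: the coefficients, sample size, number of replications, and the heteroskedastic error scale are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility
n, reps = 50, 20_000
beta = np.array([1.0, -2.0])  # hypothetical true coefficients

# Hold X fixed across replications, i.e. condition on X.
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Heteroskedastic errors with E[u | X] = 0: the standard deviation depends on X.
sd = 0.5 + np.abs(X[:, 1])
U = rng.normal(size=(reps, n)) * sd  # each row is one draw of the error vector u
Y = X @ beta + U                     # each row is one draw of Y given X

# OLS for all replications at once: solve (X'X) b = X'y for every column y.
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y.T).T  # shape (reps, 2)

# The average estimate should be close to beta, so the bias is near zero.
bias = beta_hats.mean(axis=0) - beta
print(bias)
```

Note that unbiasedness holds here even though the errors are heteroskedastic, since the argument only uses E[u | X] = 0.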
A.9 Conditional variance
Recall the matrix rule Var[\boldsymbol A \boldsymbol z] = \boldsymbol A Var[\boldsymbol z] \boldsymbol A' if \boldsymbol z is a random vector and \boldsymbol A is a matrix. Then, \begin{align*} Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] &= Var[\boldsymbol \beta + (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \\ &= Var[(\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u | \boldsymbol X] \\ &= (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' Var[\boldsymbol u | \boldsymbol X] ((\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X')' \\ &= (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol D \boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1}. \end{align*} (A5) implies \boldsymbol D = \sigma^2 \boldsymbol I_n and Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] = \sigma^2 (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X'\boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1} = \sigma^2 (\boldsymbol X'\boldsymbol X)^{-1}.
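The sandwich formula and its homoskedastic special case can be verified numerically. The sketch below uses a hypothetical design matrix and hypothetical variance values; the helper name `sandwich` is introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed for reproducibility
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical design


def sandwich(X, sigma2_i):
    """Return (X'X)^{-1} X' D X (X'X)^{-1} with D = diag(sigma2_i)."""
    D = np.diag(sigma2_i)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ D @ X @ XtX_inv


# Homoskedastic case: D = sigma^2 I_n, so the sandwich collapses to sigma^2 (X'X)^{-1}.
sigma2 = 1.5
V_homo = sandwich(X, np.full(n, sigma2))
V_classic = sigma2 * np.linalg.inv(X.T @ X)
print(np.allclose(V_homo, V_classic))

# Heteroskedastic case (assumed variance function 0.5 + x^2): no such simplification.
V_hetero = sandwich(X, 0.5 + X[:, 1] ** 2)
print(V_hetero)
```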
A.10 Consistency
Let \boldsymbol Q := E[\boldsymbol X_i \boldsymbol X_i'] and \boldsymbol \Omega := E[(\boldsymbol X_i u_i)(\boldsymbol X_i u_i)'] = E[E[u_i^2 | \boldsymbol X_i] \boldsymbol X_i \boldsymbol X_i'] = E[\sigma^2_i \boldsymbol X_i \boldsymbol X_i']. By (A3) and the Cauchy-Schwarz inequality, the entries of \boldsymbol X_i \boldsymbol X_i' and (\boldsymbol X_i u_i)(\boldsymbol X_i u_i)' have bounded second moments, and by (A2) these entries form i.i.d. sequences. Hence, the conditions for the Law of Large Numbers are satisfied, and we have \begin{align*} \frac{1}{n} \boldsymbol X' \boldsymbol X = \frac{1}{n}\sum_{i=1}^n \boldsymbol X_i \boldsymbol X_i' &\overset{p}{\rightarrow} \boldsymbol Q, \\ \frac{1}{n} \boldsymbol X' \boldsymbol D \boldsymbol X = \frac{1}{n}\sum_{i=1}^n \sigma^2_i \boldsymbol X_i \boldsymbol X_i' &\overset{p}{\rightarrow} \boldsymbol \Omega. \end{align*} Consequently, Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] = \underbrace{\frac{1}{n}}_{\to 0} \Big( \underbrace{\frac{1}{n} \boldsymbol X' \boldsymbol X}_{\overset{p}{\rightarrow} \boldsymbol Q} \Big)^{-1} \Big(\underbrace{\frac{1}{n} \boldsymbol X' \boldsymbol D \boldsymbol X}_{\overset{p}{\rightarrow} \boldsymbol \Omega} \Big) \Big( \underbrace{\frac{1}{n} \boldsymbol X' \boldsymbol X}_{\overset{p}{\rightarrow} \boldsymbol Q} \Big)^{-1}, which implies that Var[\widehat{ \boldsymbol \beta} | \boldsymbol X] \overset{p}{\rightarrow} \boldsymbol 0 by the Continuous Mapping Theorem and Slutsky’s Theorem. Since \widehat{\boldsymbol \beta} is unbiased, the LIE implies Var[\widehat{ \boldsymbol \beta}] \to \boldsymbol 0, and Chebyshev’s inequality implies that \widehat{\boldsymbol \beta} \overset{p}{\to} \boldsymbol \beta.
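Consistency can be made tangible by estimating the model on samples of increasing size. The following sketch assumes a hypothetical data-generating process (standard normal regressor, heteroskedastic errors with conditional variance 0.5 + x^2, and made-up coefficients); the helper names `ols` and `simulate` are introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed for reproducibility
beta = np.array([1.0, 0.5])     # hypothetical true coefficients


def ols(X, y):
    """OLS estimate, computed by solving the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def simulate(n):
    """Draw an i.i.d. sample of size n with E[u | x] = 0 and Var[u | x] = 0.5 + x^2."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    u = rng.normal(size=n) * np.sqrt(0.5 + x**2)
    return X, X @ beta + u


# The largest absolute estimation error should shrink as n grows.
errors = {}
for n in (100, 10_000, 1_000_000):
    X, y = simulate(n)
    errors[n] = np.max(np.abs(ols(X, y) - beta))
    print(n, errors[n])
```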
A.11 Asymptotic normality
Since \boldsymbol X_i u_i is i.i.d. and has bounded second moments by (A2) and (A3), the Multivariate Central Limit Theorem implies \frac{1}{\sqrt n} \boldsymbol X' \boldsymbol u = \sqrt n \Big( \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i u_i \Big) \overset{d}{\rightarrow} \mathcal N(\boldsymbol 0, \boldsymbol \Omega) because E[\boldsymbol X_i u_i] = \boldsymbol 0 by (A1) and the LIE, and Var[\boldsymbol X_i u_i] = \boldsymbol \Omega. Then, by Slutsky’s Theorem and the Continuous Mapping Theorem, \sqrt n (\widehat{\boldsymbol \beta} - \boldsymbol \beta) = \Big( \frac{1}{n} \boldsymbol X' \boldsymbol X\Big)^{-1} \Big( \frac{1}{\sqrt n} \boldsymbol X' \boldsymbol u \Big) \overset{d}{\rightarrow} \boldsymbol Q^{-1} \mathcal N(\boldsymbol 0, \boldsymbol \Omega), and the result follows from the fact that Var[\boldsymbol Q^{-1} \mathcal N(\boldsymbol 0, \boldsymbol \Omega)] = \boldsymbol Q^{-1} \boldsymbol \Omega \boldsymbol Q^{-1}.
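The limiting covariance matrix \boldsymbol Q^{-1} \boldsymbol \Omega \boldsymbol Q^{-1} can be compared with a Monte Carlo estimate. The sketch below assumes a hypothetical design with an intercept, a standard normal regressor x, and Var[u | x] = 0.5 + x^2. Under these assumptions \boldsymbol Q is the identity matrix and \boldsymbol \Omega = diag(1.5, 3.5), since E[x^2] = 1, E[x^3] = 0, and E[x^4] = 3.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed for reproducibility
beta = np.array([1.0, 0.5])     # hypothetical true coefficients
n, reps = 1_000, 4_000

# Simulate sqrt(n) * (beta_hat - beta) across independent samples.
draws = np.empty((reps, 2))
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    u = rng.normal(size=n) * np.sqrt(0.5 + x**2)  # Var[u | x] = 0.5 + x^2
    beta_hat = np.linalg.solve(X.T @ X, X.T @ (X @ beta + u))
    draws[r] = np.sqrt(n) * (beta_hat - beta)

# Empirical covariance of the scaled estimation error.
emp_cov = np.cov(draws, rowvar=False)

# Theoretical limit under the assumed design: Q = I, Omega = diag(1.5, 3.5).
Q_inv = np.eye(2)
Omega = np.diag([1.5, 3.5])
asy_cov = Q_inv @ Omega @ Q_inv

print(emp_cov)
print(asy_cov)
```

The two matrices should agree up to simulation noise; the match improves as n and the number of replications grow.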
Note that by the decomposition above, \sqrt n (\widehat{\boldsymbol \beta} - \boldsymbol \beta) = \sqrt n (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u, which, conditional on \boldsymbol X, is a linear combination of \boldsymbol u. Since linear combinations of normal variables are normal, \sqrt n (\widehat{\boldsymbol \beta} - \boldsymbol \beta) is normal under (A6) conditional on \boldsymbol X, with E[\sqrt n (\widehat{\boldsymbol \beta} - \boldsymbol \beta) | \boldsymbol X] = \boldsymbol 0 by the unbiasedness, and Var[\sqrt n (\widehat{\boldsymbol \beta} - \boldsymbol \beta) | \boldsymbol X] = n (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' \boldsymbol D \boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1} \overset{\tiny (A5)}{=} \sigma^2 \Big( \frac{1}{n} \boldsymbol X' \boldsymbol X \Big)^{-1}, which converges in probability to \sigma^2 \boldsymbol Q^{-1}.