Proving Convergence of Least Squares Regression with i.i.d. Gaussian Noise


I have a basic question that I can't seem to find an answer for -- perhaps I'm not wording it correctly. Suppose that we have an $n$-by-$d$ matrix $X$ that represents input features, and an $n$-by-$1$ vector of output labels, $y$. Furthermore, let these labels be a noisy linear transformation of the input features:

$$ y = X \cdot w + \epsilon$$

where $w$ is a $d$-dimensional vector of true weights and $\epsilon$ is an $n$-by-$1$ vector of i.i.d. zero-mean Gaussian noise. I am interested in inferring $w$ using ordinary least squares (OLS) linear regression.

I would like to prove that, as the number of data points $n$ increases, the weights estimated by OLS converge in probability to the true weights $w$ -- say, in the $\ell_2$ norm.

Can anyone help me to go about proving this, or point me to references? Thanks!

As pointed out in the comments, it is important to keep in mind how $X$ is constructed. Let $X$ be a random matrix.
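For concreteness, the data-generating process above can be simulated in a few lines of NumPy. The dimensions, true weights, and noise level below are hypothetical choices for illustration, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, weights, and noise level -- chosen only
# for illustration, not taken from the question.
n, d = 500, 3
sigma = 0.5
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))              # random design matrix
eps = rng.normal(scale=sigma, size=n)    # i.i.d. zero-mean Gaussian noise
y = X @ w_true + eps                     # y = X w + eps
```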

There are 2 answers below

BEST ANSWER

Here is a probabilistic approach to the proof. Assume that $X$ is a random matrix whose rows $x_i^T$ are i.i.d. draws, with $E\|x_i\|^2 < \infty$ and $Ey_i^2 < \infty$.

Your minimization problem is
$$\min_{w\in \mathbb{R}^d} \| Xw - y \|_2^2.$$
It is straightforward to show that the OLS estimator (the "weights", in your terminology) is given by
$$\hat{w}=(X'X)^{-1}X'y,$$
or equivalently
$$\hat{w}= \left( \frac 1 n \sum_{i=1}^n x_i x_i^T\right)^{-1} \left( \frac 1 n \sum_{i=1}^n x_i y_i\right),$$
where $x_i^T$ is the $i$-th row of $X$. Since $\hat w$ lives in $\mathbb{R}^d$ with $d$ fixed, convergence in probability in the $\ell_2$ norm is equivalent to entrywise convergence in probability, so the WLLN applied entrywise gives
$$\frac 1 n \sum_{i=1}^n x_i x_i^T \xrightarrow{p} E(x_i x_i^T)$$
as $n\to \infty$, and
$$\frac 1 n \sum_{i=1}^n x_i y_i \xrightarrow{p} E(x_i y_i).$$
Hence, by the continuous mapping theorem applied to $f(A,b)=A^{-1}b$ (assuming $E(x_i x_i^T)$ is invertible),
$$\hat{w} \xrightarrow{p} \left(E(x_i x_i^T)\right)^{-1} E(x_i y_i).$$
Finally, substituting $y_i = x_i^T w + \epsilon_i$ and using $E(x_i \epsilon_i) = 0$ yields $E(x_i y_i) = E(x_i x_i^T)\, w$, so the limit is exactly the true weight vector $w$.
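A quick numerical illustration of this convergence. This is a sketch with hypothetical weights, noise level, and standard-normal features; `np.linalg.lstsq` plays the role of the OLS solver:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical true weights
sigma = 0.5                           # hypothetical noise level

errs = []
for n in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(n, d))       # rows i.i.d., so the WLLN applies
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_w ||Xw - y||_2^2
    errs.append(np.linalg.norm(w_hat - w_true))     # l2 error ||w_hat - w||

# errs should shrink roughly like 1/sqrt(n)
```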


Here I'm going to weaken the assumptions: rather than "Gaussian" we'll say the errors have expected value $0$ and finite variance; rather than "identically distributed" we'll say they all have the same variance; and rather than "independent" we'll say "uncorrelated", i.e. all the covariances are $0$.

The least-squares estimator of $w$ is $$ \hat w = (X^T X)^{-1} X^T y. $$ (BTW, one must have $\operatorname{rank}(X) = d$; otherwise the matrix inverse above makes no sense and the least-squares estimator is not unique.) Then we have $$ \operatorname{var} (\hat w) = (\text{a }d\times d\text{ matrix }) = \operatorname{E}\big( (\hat w - w)(\hat w - w)^T\big) = \sigma^2(X^T X)^{-1} \tag 1 $$ where $\sigma^2$ is the common noise variance. This follows from the identity $$ \operatorname{var}(Av) = A\Big(\operatorname{var}(v) \Big) A^T, \text{ for any fixed matrix } A \text{ and random vector } v, $$ applied with $v = y$, $\operatorname{var}(y) = \sigma^2 I_n$, and $A = (X^T X)^{-1} X^T$, so that $\hat w = Ay$.
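Identity $(1)$ can be checked by Monte Carlo: hold the design fixed, redraw the noise many times, and compare the empirical covariance of $\hat w$ with $\sigma^2 (X^TX)^{-1}$. The design, weights, and noise variance below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fixed design, true weights, and noise variance.
n, d, sigma = 50, 2, 1.0
X = rng.normal(size=(n, d))           # held fixed across replications
w_true = np.array([2.0, -1.0])

theory = sigma**2 * np.linalg.inv(X.T @ X)   # the d x d matrix in (1)

reps = 20_000
w_hats = np.empty((reps, d))
for r in range(reps):
    # new noise, same X, recompute the least-squares estimate
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hats[r] = np.linalg.solve(X.T @ X, X.T @ y)

empirical = np.cov(w_hats, rowvar=False)     # empirical covariance of the estimates
# empirical should match theory entrywise, up to Monte Carlo error
```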

Increasing the sample size $n$ means increasing the number of rows of the matrix $X$. This leaves a big question: in what way do the new rows of $X$ vary within the $d$-dimensional row space as more and more rows keep getting added? Without specifying that, you don't have a well-defined question. So here I'll go out on a limb and increase the sample size from $n$ to $nk$ by stacking $k$ copies of $X$ on top of each other: $$ \begin{bmatrix} X \\ \vdots \\ X \end{bmatrix} = \text{an } (nk)\times d \text{ matrix} \tag 2 $$ and we'll let $k$ keep growing, with all $nk$ errors uncorrelated with each other and otherwise satisfying the assumptions. This just means we're doing replications with the same set of $X$ values. (In some contexts this would be considered repetition rather than replication: we get new $y$ values for each row of $X$, rather than new rows of $X$ different from those we already had.)

  • Show that $(1) \to 0$ as $k\to\infty$ (except that instead of $X$ in $(1)$ we'll have the matrix $(2)$.)
  • Observe that the variances of the components of $\hat w$ are the diagonal elements of $(1).$
  • Finally, apply Chebyshev's inequality.
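The first step above can be sketched numerically. Only the stacking in $(2)$ and formula $(1)$ come from the answer; the base design and noise variance are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical base design and noise variance.
n, d, sigma = 10, 2, 1.0
X = rng.normal(size=(n, d))

diags = []
for k in [1, 10, 100]:
    Xk = np.vstack([X] * k)                       # the (nk) x d matrix in (2)
    var_w = sigma**2 * np.linalg.inv(Xk.T @ Xk)   # (1) with X replaced by (2)
    diags.append(np.diag(var_w))                  # component variances of w_hat

# Since Xk.T @ Xk = k * (X.T @ X), each component variance shrinks
# exactly like 1/k; Chebyshev's inequality then gives convergence
# in probability of each component of w_hat.
```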