Proving Convergence of Least Squares Regression with i.i.d. Gaussian Noise


I have a basic question that I can't seem to find an answer for -- perhaps I'm not wording it correctly. Suppose that we have an $n$-by-$d$ matrix $X$ that represents input features, and an $n$-by-$1$ vector of output labels, $y$. Furthermore, let these labels be a noisy linear transformation of the input features:

$$ y = X \cdot w + \epsilon$$

where $w$ is a $d$-dimensional vector of true weights and $\epsilon$ is an $n$-by-$1$ vector of i.i.d. zero-mean Gaussian noise. I am interested in inferring $w$ using ordinary least squares (OLS) linear regression.

I would like to prove that, as the number of data points $n$ increases, the weights estimated by OLS converge in probability to the true weights $w$ -- say, in the $\ell_2$ norm.

Can anyone help me to go about proving this, or point me to references? Thanks!

As pointed out in the comments, it is important to keep in mind how $X$ is constructed. Let $X$ be a random matrix.
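For concreteness, the data-generating process above can be simulated in a few lines of NumPy. The dimensions, true weights, and noise level below are hypothetical choices for illustration, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, weights, and noise level -- chosen only
# for illustration, not taken from the question.
n, d = 500, 3
sigma = 0.5
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))              # random design matrix
eps = rng.normal(scale=sigma, size=n)    # i.i.d. zero-mean Gaussian noise
y = X @ w_true + eps                     # y = X w + eps
```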

There are 2 answers below

BEST ANSWER

Here is a probabilistic approach to the proof. Assume that $X$ is a random matrix whose rows $x_i^T$ are i.i.d. draws, with $E\|x_i\|^2 < \infty$ and $Ey_i^2 < \infty$.

Your minimization problem is
$$\min_{w\in \mathbb{R}^d} \| Xw - y \|_2^2.$$
It is straightforward to show that the OLS estimator (the "weights", in your terminology) is given by
$$\hat{w}=(X'X)^{-1}X'y,$$
or equivalently
$$\hat{w}= \left( \frac 1 n \sum_{i=1}^n x_i x_i^T\right)^{-1} \left( \frac 1 n \sum_{i=1}^n x_i y_i\right),$$
where $x_i^T$ is the $i$-th row of $X$. Since $\hat w$ lives in $\mathbb{R}^d$ with $d$ fixed, convergence in probability in the $\ell_2$ norm is equivalent to entrywise convergence in probability, so the WLLN applied entrywise gives
$$\frac 1 n \sum_{i=1}^n x_i x_i^T \xrightarrow{p} E(x_i x_i^T)$$
as $n\to \infty$, and
$$\frac 1 n \sum_{i=1}^n x_i y_i \xrightarrow{p} E(x_i y_i).$$
Hence, by the continuous mapping theorem applied to $f(A,b)=A^{-1}b$ (assuming $E(x_i x_i^T)$ is invertible),
$$\hat{w} \xrightarrow{p} \left(E(x_i x_i^T)\right)^{-1} E(x_i y_i).$$
Finally, substituting $y_i = x_i^T w + \epsilon_i$ and using $E(x_i \epsilon_i) = 0$ yields $E(x_i y_i) = E(x_i x_i^T)\, w$, so the limit is exactly the true weight vector $w$.
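A quick numerical illustration of this convergence. This is a sketch with hypothetical weights, noise level, and standard-normal features; `np.linalg.lstsq` plays the role of the OLS solver:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical true weights
sigma = 0.5                           # hypothetical noise level

errs = []
for n in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(n, d))       # rows i.i.d., so the WLLN applies
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_w ||Xw - y||_2^2
    errs.append(np.linalg.norm(w_hat - w_true))     # l2 error ||w_hat - w||

# errs should shrink roughly like 1/sqrt(n)
```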


Here I'm going to weaken the assumptions: rather than "Gaussian" we'll say the errors have expected value $0$ and finite variance; rather than "identically distributed" we'll say they all have the same variance; and rather than "independent" we'll say "uncorrelated", i.e. all the covariances are $0$.

The least-squares estimator of $w$ is $$ \hat w = (X^T X)^{-1} X^T y. $$ (BTW, one must have $\operatorname{rank}(X) = d$; otherwise the matrix inverse above makes no sense and the least-squares estimator is not unique.) Then we have $$ \operatorname{var} (\hat w) = (\text{a }d\times d\text{ matrix }) = \operatorname{E}\big( (\hat w - w)(\hat w - w)^T\big) = \sigma^2(X^T X)^{-1} \tag 1 $$ where $\sigma^2$ is the common noise variance. This follows from the identity $$ \operatorname{var}(Av) = A\Big(\operatorname{var}(v) \Big) A^T, \text{ for any fixed matrix } A \text{ and random vector } v, $$ applied with $v = y$, $\operatorname{var}(y) = \sigma^2 I_n$, and $A = (X^T X)^{-1} X^T$, so that $\hat w = Ay$.
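Identity $(1)$ can be checked by Monte Carlo: hold the design fixed, redraw the noise many times, and compare the empirical covariance of $\hat w$ with $\sigma^2 (X^TX)^{-1}$. The design, weights, and noise variance below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fixed design, true weights, and noise variance.
n, d, sigma = 50, 2, 1.0
X = rng.normal(size=(n, d))           # held fixed across replications
w_true = np.array([2.0, -1.0])

theory = sigma**2 * np.linalg.inv(X.T @ X)   # the d x d matrix in (1)

reps = 20_000
w_hats = np.empty((reps, d))
for r in range(reps):
    # new noise, same X, recompute the least-squares estimate
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hats[r] = np.linalg.solve(X.T @ X, X.T @ y)

empirical = np.cov(w_hats, rowvar=False)     # empirical covariance of the estimates
# empirical should match theory entrywise, up to Monte Carlo error
```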

Increasing the sample size $n$ means increasing the number of rows of the matrix $X$. This leaves a big question: in what way do the new rows of $X$ vary within the $d$-dimensional row space as more and more rows keep getting added? Without specifying that, you don't have a well-defined question. So here I'll go out on a limb and increase the sample size from $n$ to $nk$ by stacking $k$ copies of $X$ on top of each other: $$ \begin{bmatrix} X \\ \vdots \\ X \end{bmatrix} = \text{an } (nk)\times d \text{ matrix} \tag 2 $$ and we'll let $k$ keep growing, with all $nk$ errors uncorrelated with each other and otherwise satisfying the assumptions. This just means we're doing replications with the same set of $X$ values. (In some contexts this would be considered repetition rather than replication: we get new $y$ values for each row of $X$, rather than new rows of $X$ different from those we already had.)

  • Show that $(1) \to 0$ as $k\to\infty$ (except that instead of $X$ in $(1)$ we'll have the matrix $(2)$.)
  • Observe that the variances of the components of $\hat w$ are the diagonal elements of $(1).$
  • Finally, apply Chebyshev's inequality.
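The first step above can be sketched numerically. Only the stacking in $(2)$ and formula $(1)$ come from the answer; the base design and noise variance are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical base design and noise variance.
n, d, sigma = 10, 2, 1.0
X = rng.normal(size=(n, d))

diags = []
for k in [1, 10, 100]:
    Xk = np.vstack([X] * k)                       # the (nk) x d matrix in (2)
    var_w = sigma**2 * np.linalg.inv(Xk.T @ Xk)   # (1) with X replaced by (2)
    diags.append(np.diag(var_w))                  # component variances of w_hat

# Since Xk.T @ Xk = k * (X.T @ X), each component variance shrinks
# exactly like 1/k; Chebyshev's inequality then gives convergence
# in probability of each component of w_hat.
```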