Proof of least squares approximation formulas?


I'm trying to study from this pdf: http://math.mit.edu/~gs/linearalgebra/ila0403.pdf

I was just wondering if there were any established proofs for the following formulas:


$\hat{y} = Zw$, where $\hat{y}, w$ are column vectors and $Z$ is an $N \times (M+1)$ data matrix


$l(w) = ||Zw - t||^2$, $t$ being a column vector and where $||v||^2 = \sum_{n}{v_{n}^2}$


While searching for answers, I came across this post: Proof of least squares approximation lemma

Would it be fair to say that this proves the second formula? How is the first formula proven?

Edit: Here is a screenshot of the question


On BEST ANSWER

These two results are actually definitions. Imagine you have a dataset of the form $\{y_i,(x_1,x_2,\cdots,x_M)_i\}_{i=1}^N$. That is, you have $N$ observations $y_i$ of a variable that depends on $M$ independent features ${\bf x}_i = (x_1,x_2,\cdots,x_M)_i$. Imagine there exist $M+1$ numbers $(w_0,w_1,w_2,\cdots,w_M)$ such that each observation can be explained as a linear combination of the features with these weights, that is

\begin{eqnarray} y_1 &=& w_0 + w_1 x_{1,1} + \cdots + w_M x_{M,1} +\epsilon_1\\ y_2 &=& w_0 + w_1 x_{1,2} + \cdots + w_M x_{M,2} +\epsilon_2\\ &\vdots& \\ y_N &=& w_0 + w_1 x_{1,N} + \cdots + w_M x_{M,N} +\epsilon_N\\ \end{eqnarray}

which can be represented in matrix form as

$$ \left(\begin{array}{c} y_1 \\ \vdots \\ y_N \end{array}\right) = \left(\begin{array}{cccc} 1 & x_{1,1} & \cdots & x_{M,1} \\ 1 & x_{1,2} & \cdots & x_{M,2} \\ & & \vdots & \\ 1 & x_{1,N} & \cdots & x_{M,N} \\ \end{array}\right) \left(\begin{array}{c} w_0 \\ w_1 \\ \vdots \\ w_M \end{array}\right) + \left(\begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_N \end{array}\right) $$

or equivalently

$$ {\bf y} = {\bf Z}{\bf w} + {\bf \epsilon} = \hat{\bf y} + {\bf \epsilon} $$

with $\hat{\bf y}={\bf Z}{\bf w}$, where ${\bf Z}\in \mathbb{R}^{N\times(M+1)}$, ${\bf y}, {\bf \epsilon}\in \mathbb{R}^N$ and ${\bf w} \in \mathbb{R}^{M+1}$.
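As a concrete sketch of this definition (a minimal NumPy example with a small made-up dataset; the numbers are arbitrary, not from the linked notes): the design matrix ${\bf Z}$ is just the feature matrix with a column of ones prepended for the intercept $w_0$, and $\hat{\bf y}$ is simply the matrix-vector product ${\bf Z}{\bf w}$.

```python
import numpy as np

# Hypothetical example: N = 4 observations, M = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
N, M = X.shape

# Design matrix Z: prepend a column of ones for the intercept w_0,
# so Z has shape N x (M + 1)
Z = np.hstack([np.ones((N, 1)), X])

# Any weight vector w of length M + 1 defines a prediction y_hat = Z w
w = np.array([0.5, 1.0, -2.0])   # (w_0, w_1, w_2), chosen arbitrarily
y_hat = Z @ w

print(Z.shape)   # (4, 3), i.e. N x (M + 1)
print(y_hat)     # [-2.5  1.5  0.5 -1.5]
```

So the first "formula" is nothing to prove: $\hat{\bf y} = {\bf Z}{\bf w}$ is what the model *defines* the prediction to be.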

As for the second part, the idea is to find the weights ${\bf w}$ such that the prediction $\hat{\bf y}$ is as close as possible to the actual observations ${\bf y}$. That is why you define the squared distance between these two vectors as

$$ l({\bf w}) = ||{\bf Z}{\bf w} - {\bf y}||^2 $$

and try to minimize it (in your notation the target vector ${\bf y}$ is called $t$). Again, this is a definition of the objective, not a theorem; what *can* be proven is that its minimizer satisfies the normal equations, which is exactly what the linked Strang notes derive.
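To make the minimization concrete, here is a minimal sketch (NumPy, with hypothetical synthetic data and weights chosen for illustration): setting the gradient of $l({\bf w})$ to zero gives the normal equations ${\bf Z}^\top{\bf Z}\,{\bf w} = {\bf Z}^\top{\bf y}$, which we can solve directly and compare against NumPy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y generated from known weights plus small noise
N, M = 50, 2
X = rng.normal(size=(N, M))
Z = np.hstack([np.ones((N, 1)), X])          # N x (M + 1) design matrix
w_true = np.array([1.0, 2.0, -3.0])
y = Z @ w_true + 0.01 * rng.normal(size=N)   # epsilon: small noise term

# Minimizer of l(w) = ||Zw - y||^2 via the normal equations Z^T Z w = Z^T y
w_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

# np.linalg.lstsq performs the same minimization directly
w_lstsq, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(w_hat)                          # close to w_true = [1, 2, -3]
print(np.allclose(w_hat, w_lstsq))    # True: both solve the same problem
```

Because the noise is small, the recovered weights land close to `w_true`; with zero noise they would match exactly, since then ${\bf y}$ lies in the column space of ${\bf Z}$.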