After struggling for a few days I feel a bit more comfortable solving the cost function
$J = \sum{(y_n - \hat{y}_n)^2} + \lambda\lVert w \rVert^2 $
for $w$ using vector calculus. However, when I learned to solve the simpler ordinary least squares cost function
$J = \sum{(y_n - \hat{y}_n)^2}$
I learned with both vector calculus and linear algebra, and the linear algebra derivation was MUCH simpler.
So representing $J = \sum{(y_n - \hat{y}_n)^2}$
as
$A\hat{x}=b$
and solving for $\hat{x}$ yielded
$\hat{x} = (A^TA)^{-1}A^Tb$.
How can I represent the new cost function $J = \sum{(y_n - \hat{y}_n)^2} + \lambda\lVert w \rVert^2$ as a linear system like $A\hat{x}=b$, the way I did for the original cost function (using only linear algebra)? I believe I can solve it easily from there, but I'm unsure how to set it up without using calculus.
Edit: I know from the vector calculus derivation that the solution will be $\hat{x} = (\lambda I + X^TX)^{-1}X^TY$.
Edit 2: Working backward from my solution, I get the following:
$(\lambda I + X^TX)\hat{x} = X^TY$
and distributing $\hat{x}$ over the sum yields
$\lambda\hat{x} + X^TX\hat{x} = X^TY$
How do I get rid of the $X^T$ on both sides to figure out the starting expression? $X^T$ is probably not invertible, since it's likely a data matrix.
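The working-backward step above can be checked numerically. This is a small sketch with made-up data (the matrix sizes and the random data are my own choices, not from the question): it confirms that $\hat{x} = (\lambda I + X^TX)^{-1}X^TY$ satisfies $(\lambda I + X^TX)\hat{x} = X^TY$, and that $X^T$ is indeed not invertible here because $X$ is a tall, non-square data matrix.

```python
import numpy as np

# Hypothetical data: 20 samples, 3 features, so X^T is 3x20 and has no inverse.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
lam = 0.5

# Solve the rearranged system (lam*I + X^T X) x_hat = X^T y directly.
A = lam * np.eye(3) + X.T @ X
x_hat = np.linalg.solve(A, X.T @ y)

# The "working backward" identity holds to machine precision.
print(np.allclose(A @ x_hat, X.T @ y))  # True
```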
Your target function is
$$ J(w) = \|y - Xw\|_2^2 + \lambda\|w\|_2^2. $$
Taking the derivative with respect to $w$ and equating it to zero, you have
$$ J'(w) = -2X^T(y - Xw) + 2\lambda w = 0, $$
or
$$ X^TXw + \lambda w = X^Ty. $$
Taking $w$ as a common factor,
$$ (X^TX + \lambda I)w = X^Ty, $$
hence
$$ \hat{w} = (X^TX + \lambda I)^{-1}X^Ty = H(\lambda)y. $$
Note that the corresponding hat matrix $XH(\lambda) = X(X^TX + \lambda I)^{-1}X^T$ is no longer an orthogonal projection, so you cannot use $X(X^TX)^{-1}X^T$ as a projection matrix onto the column space of the data matrix $X$ without taking the "penalty" $\lambda$ into account.
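The two claims in the derivation can be verified numerically with a quick sketch (hypothetical random data, my own shapes and seed): the gradient $-2X^T(y - Xw) + 2\lambda w$ vanishes at $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$, and the matrix $X(X^TX + \lambda I)^{-1}X^T$ fails the idempotence test $S^2 = S$ that an orthogonal projection would pass.

```python
import numpy as np

# Hypothetical data: 30 samples, 4 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 2.0

# Ridge solution from the derivation: (X^T X + lam*I) w_hat = X^T y.
M = X.T @ X + lam * np.eye(4)
w_hat = np.linalg.solve(M, X.T @ y)

# The gradient of J(w) vanishes at w_hat, so it is a stationary point.
grad = -2 * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
print(np.allclose(grad, 0))  # True

# For lam > 0 the hat matrix S = X (X^T X + lam*I)^{-1} X^T is not
# idempotent, hence not an orthogonal projection.
S = X @ np.linalg.solve(M, X.T)
print(np.allclose(S @ S, S))  # False
```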