I am given a full-rank feature matrix $\mathbf{X}$ for which I am supposed to derive a closed-form solution for the weights $\hat{\mathbf{w}}_{ridge}$ of a ridge-regression optimization problem. The classic optimization problem is stated as follows:
\begin{align} \hat{\mathbf{w}}_{ridge} = \underset{\hat{\mathbf{w}}}{\arg\min} \| \mathbf{y} - \mathbf{X}\hat{\mathbf{w}} \|_2^2 + \lambda \| \hat{\mathbf{w}} \|_2^2 \end{align}
The feature matrix above does not include the bias of the model, i.e. its first column holds raw data points rather than the one-vector $\mathbf{1}$.
This is changed in the following by introducing an augmented feature matrix $\tilde{\mathbf{X}} = [ \mathbf{1} \ \ \mathbf{X} ]$, which is intended to take said bias into account. My question is how I can apply the formula above after introducing a bias weight $w_0$ into $\mathbf{w}$, without including $w_0$ in the penalty $\lambda \| \hat{\mathbf{w}} \|_2^2$, so as to find a closed-form solution like the one to the classic optimization problem:
\begin{align} \hat{\mathbf{w}} = (\mathbf{X}^\intercal\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\intercal\mathbf{y} \end{align}
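For reference, the classic closed form can be computed directly with NumPy (a minimal sketch on synthetic data; the matrix sizes and the value of $\lambda$ are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))  # synthetic full-rank feature matrix
y = rng.normal(size=n)       # synthetic targets
lam = 0.5                    # arbitrary regularization strength

# Classic ridge closed form: w = (X^T X + lambda I)^{-1} X^T y,
# computed via a linear solve rather than an explicit inverse.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is the standard numerically preferable way to evaluate this expression.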
Put differently: how is the bias weight computed in ridge regression in general, given that the penalty term itself alters the bias?
Let $e$ be the all-one vector.
Consider $$\arg \min_{\hat{w}, \hat{b}} \|y-\hat{b}e-X\hat{w}\|_2^2+\lambda \|\hat{w}\|_2^2$$
Differentiate with respect to $\hat{b}$ and set it to $0$,
$$e^T(y-\hat{b}e - X\hat{w})=0 $$
Since $e^Te = n$, this gives
$$\hat{b} = \frac1n(e^Ty -e^TX\hat{w} )\tag{1}$$
Differentiate with respect to $\hat{w}$ and set it to $0$,
$$-2X^T(y-\hat{b}e-X\hat{w}) + 2\lambda \hat{w} = 0$$
$$(X^TX+\lambda I)\hat{w}=X^T(y-\hat{b}e) \tag{2}$$
Substituting $(1)$ into $(2)$, we have
$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac1n(e^Ty -e^TX\hat{w} )e\right)$$
$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac1n(e^Tye -ee^TX\hat{w} )\right)$$
$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac1n(e^Ty)e \right) +\frac1nX^Tee^TX\hat{w} $$
$$\left(X^TX+\lambda I -\frac1nX^Tee^TX\right) \hat{w} = X^T\left(y- \frac1n(e^Ty)e \right) $$
$$\hat{w} = \left(X^TX+\lambda I -\frac1nX^Tee^TX\right) ^{-1}X^T\left(y- \frac1n(e^Ty)e \right) \tag{3}$$
Evaluating equation $(3)$ and then equation $(1)$ gives you your desired parameters.
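A minimal NumPy sketch of equations $(3)$ and $(1)$ (the data and the value of $\lambda$ are synthetic placeholders; this is an illustration of the formulas, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 4
X = rng.normal(size=(n, d))  # synthetic feature matrix (no bias column)
y = rng.normal(size=n)       # synthetic targets
lam = 0.7                    # arbitrary regularization strength
e = np.ones(n)               # the all-one vector

# Equation (3): w = (X^T X + lam I - (1/n) X^T e e^T X)^{-1} X^T (y - (1/n)(e^T y) e)
A = X.T @ X + lam * np.eye(d) - (1.0 / n) * np.outer(X.T @ e, e @ X)
rhs = X.T @ (y - (e @ y / n) * e)
w_hat = np.linalg.solve(A, rhs)

# Equation (1): b = (1/n)(e^T y - e^T X w)
b_hat = (e @ y - e @ X @ w_hat) / n
```

As a sanity check, one can compare this against solving the stationarity conditions of the same objective in block form, $(\tilde{X}^T\tilde{X} + \lambda D)\theta = \tilde{X}^Ty$ with $\tilde{X} = [e \ \ X]$ and $D = \operatorname{diag}(0, I)$, which penalizes only $\hat{w}$; both routes yield the same $(\hat{b}, \hat{w})$.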