How does the bias weight $w_0$ get computed during ridge regression?


I am given a full-rank feature matrix $\mathbf{X}$ for which I am supposed to provide a closed-form solution for the weights $\hat{\mathbf{w}}_{ridge}$ of a ridge-regression optimization problem. The classic optimization problem is stated as follows:

\begin{align} \hat{\mathbf{w}}_{ridge} = \underset{\hat{\mathbf{w}}}{\arg\min} \| \mathbf{y} - \mathbf{X}\hat{\mathbf{w}} \|_2^2 + \lambda \| \hat{\mathbf{w}} \|_2^2 \end{align}

The feature matrix above does not include the bias of the model, meaning its first column is not the one-vector $\mathbf{1}$ but the raw data points themselves.

This gets changed in the following by the introduction of an augmented feature matrix $\tilde{\mathbf{X}} = [ \mathbf{1} \ \ \mathbf{X} ]$, which is intended to take said bias into consideration. My question is how I can apply the above formula after introducing a bias weight $w_0$ into $\mathbf{w}$ without including it in the penalty $\lambda \| \hat{\mathbf{w}} \|_2^2$, in order to find a closed-form solution like the one to the classic optimization problem:

\begin{align} \hat{\mathbf{w}} = (\mathbf{X}^\intercal\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\intercal\mathbf{y} \end{align}
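As a sanity check, the classic closed form can be computed directly with NumPy. This is a minimal sketch with arbitrary illustrative sizes and an arbitrary $\lambda$; it verifies the solution by confirming the gradient of the objective vanishes at $\hat{\mathbf{w}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 4, 0.5          # illustrative sizes and penalty strength
X = rng.normal(size=(n, d))     # feature matrix without a bias column
y = rng.normal(size=n)

# w_hat = (X^T X + lam I)^{-1} X^T y, via a linear solve rather than an explicit inverse
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stationarity check: the gradient -2 X^T (y - X w) + 2 lam w should vanish
grad = -2 * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
assert np.allclose(grad, 0)
```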

Put differently: how is the bias weight computed in ridge regression in general, if including it in the penalty term would itself alter the bias?

Let $e \in \mathbb{R}^n$ be the all-one vector, where $n$ is the number of rows of $X$.

Consider $$\arg \min_{\hat{w}, \hat{b}} \|y-\hat{b}e-X\hat{w}\|_2^2+\lambda \|\hat{w}\|_2^2$$

Differentiating with respect to $\hat{b}$ and setting the result to $0$ gives

$$e^T(y-\hat{b}e - X\hat{w})=0 $$

$$\hat{b} = \frac1n(e^Ty -e^TX\hat{w} )\tag{1}$$

Differentiating with respect to $\hat{w}$ and setting the result to $0$ gives

$$-2X^T(y-\hat{b}e-X\hat{w}) + 2\lambda \hat{w} = 0$$

$$(X^TX+\lambda I)\hat{w}=X^T(y-\hat{b}e) \tag{2}$$

Substituting $(1)$ into $(2)$, we have

$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac{1}{n}\left(e^Ty -e^TX\hat{w} \right)e\right)$$

$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac{1}{n}\left((e^Ty)e -ee^TX\hat{w} \right)\right)$$

$$(X^TX+\lambda I) \hat{w} = X^T\left(y- \frac{1}{n}(e^Ty)e \right) +\frac{1}{n}X^Tee^TX\hat{w}$$

$$\left(X^TX+\lambda I -\frac{1}{n}X^Tee^TX\right) \hat{w} = X^T\left(y- \frac{1}{n}(e^Ty)e \right)$$

$$\hat{w} = \left(X^TX+\lambda I -\frac{1}{n}X^Tee^TX\right)^{-1}X^T\left(y- \frac{1}{n}(e^Ty)e \right) \tag{3}$$

Equation $(3)$, followed by equation $(1)$, gives you the desired parameters.
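The derivation can be checked numerically with synthetic data. The sketch below solves the joint stationarity conditions for $(\hat{b}, \hat{w})$ directly as a single $(d+1)\times(d+1)$ linear system (penalty on $\hat{w}$ only) and confirms that equations $(3)$ and $(1)$ reproduce the same solution; all sizes and $\lambda$ are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 3, 0.7          # illustrative sizes and penalty strength
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
e = np.ones(n)                  # the all-one vector

# Reference: joint normal equations for (b, w), penalty acting on w only:
#   [ n        e^T X         ] [b]   [ e^T y ]
#   [ X^T e    X^T X + lam I ] [w] = [ X^T y ]
A = np.block([[np.array([[n]]), (e @ X)[None, :]],
              [(X.T @ e)[:, None], X.T @ X + lam * np.eye(d)]])
sol = np.linalg.solve(A, np.concatenate([[e @ y], X.T @ y]))
b_ref, w_ref = sol[0], sol[1:]

# Closed form (3) for w_hat, then (1) for b_hat
M = X.T @ X + lam * np.eye(d) - np.outer(X.T @ e, e @ X) / n
w_hat = np.linalg.solve(M, X.T @ (y - (e @ y) / n * e))
b_hat = (e @ y - e @ X @ w_hat) / n

assert np.allclose(w_hat, w_ref) and np.isclose(b_hat, b_ref)
```

Note that $y - \frac{1}{n}(e^Ty)e$ is simply $y$ with its mean subtracted, so $(3)$ amounts to ridge regression on mean-centered data, with $(1)$ recovering the intercept afterwards.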