Find solution for optimal regression coefficients


Consider the cost function $E(\mathbf{w}) = \displaystyle\frac{1}{2} \sum_{i=1}^{N}{(d_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2$ where $(\mathbf{x}_1, d_1), \ldots, (\mathbf{x}_N, d_N)$ are observations such that each $\mathbf{x}_i$ is a vector of inputs, each $d_i$ is a scalar target, and $\mathbf{w}$ is a vector of regression coefficients.

I am interested in finding the optimal regression coefficients $\mathbf{\hat{w}}$ that minimize the cost function $E(\mathbf{w})$. I understand that I need to find $\displaystyle \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$, but how should I approach this?

I understand that, writing $\mathbf{X}$ for the matrix whose $i$th row is $\mathbf{x}_i^T$ and $\mathbf{d} = (d_1, \ldots, d_N)^T$, $$\frac{1}{2} \sum_{i=1}^{N}{(d_i - \mathbf{x}_i^T\mathbf{w})^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2 = \frac{1}{2} {\left\lVert\mathbf{d} - \mathbf{X}\mathbf{w}\right\rVert^2} + \frac{\lambda}{2} \left\lVert \mathbf{w}\right\rVert^2 = \frac{1}{2}(\mathbf{d} - \mathbf{X}\mathbf{w})^T(\mathbf{d} - \mathbf{X}\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},$$ but I am a little confused about how to actually carry out the differentiation and present the solution for the optimal regression coefficients.

I found in my textbook that $\displaystyle \frac{\partial\mathbf{A}\mathbf{X}}{\partial\mathbf{X}} = \mathbf{A}^T$, $\displaystyle \frac{\partial\mathbf{X}^T\mathbf{A}}{\partial\mathbf{X}} = \mathbf{A}$, $\displaystyle \frac{\partial\mathbf{X}^T\mathbf{X}}{\partial\mathbf{X}} = 2\mathbf{X}$, and $\displaystyle \frac{\partial\mathbf{X}^T\mathbf{A}\mathbf{X}}{\partial\mathbf{X}} = \mathbf{A}\mathbf{X} + \mathbf{A}^T\mathbf{X}$, but these still aren't helping me.

Best answer:

We can compactly write $E(w) = (1/2)\|Xw-y\|^2 + (1/2)\lambda\|w\|^2$, where $y=(d_1,\ldots,d_n)$ and $X$ is the design matrix, i.e. the $n \times p$ matrix whose $i$th row is $x_i^\top$. To avoid unnecessary issues, assume $\lambda>0$ (we can talk about the ridgeless case if needed). Since $\lambda > 0$, $E$ is strongly convex, and thus has a unique minimizer. Also, since $E$ is differentiable, this minimizer $\hat{w}_\lambda$ must satisfy $\nabla_w E(w)|_{w=\hat{w}_\lambda} = 0$.

Consider the change of variable $z := Xw-y \in \mathbb R^n$ and $f(z) := (1/2)\|z\|^2$, so that $E(w) = f(z) + (1/2)\lambda\|w\|^2$. By the chain rule (the Jacobian of $w \mapsto Xw - y$ is $X$, so the gradient picks up a factor of $X^\top$), one computes $$ \begin{split} \nabla_w E(w) &= X^\top \nabla_z f(z) + \nabla_w\Big(\frac{\lambda}{2}\|w\|^2\Big)\\ &= X^\top z + \lambda w = X^\top(Xw-y) + \lambda w. \end{split} $$ Setting this to zero gives $(X^\top X + \lambda I_p)w = X^\top y$, and solving for $w$ gives
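As a quick numerical sanity check (my addition, not part of the original derivation), the gradient $X^\top(Xw-y) + \lambda w$ can be compared against central finite differences in NumPy; the sizes, penalty, and data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 8, 3, 0.5          # arbitrary problem size and ridge penalty
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w = rng.standard_normal(p)

def E(w):
    """Ridge cost E(w) = (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
    r = X @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w

# analytic gradient from the chain-rule computation
grad = X.T @ (X @ w - y) + lam * w

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    fd[j] = (E(w + e) - E(w - e)) / (2 * eps)

print(np.max(np.abs(grad - fd)))   # should be very small
```

If the two disagree by more than roughly $O(\varepsilon^2)$, the analytic formula is wrong; this is a standard way to debug hand-derived gradients.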

$\hat{w}_\lambda = (X^\top X + \lambda I_p)^{-1}X^\top y$.
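To make the closed-form solution concrete, here is a small NumPy sketch (my addition, on arbitrary example data) that forms $\hat{w}_\lambda$ by solving the linear system and checks that the gradient vanishes there and that nearby points have higher cost:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 4, 2.0          # arbitrary problem size and penalty
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# closed-form ridge solution: solve (X^T X + lam I) w = X^T y
# (numerically preferable to forming the explicit inverse)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def E(w):
    r = X @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w

# the gradient at w_hat should vanish (up to floating-point error) ...
grad_at_min = X.T @ (X @ w_hat - y) + lam * w_hat
print(np.linalg.norm(grad_at_min))

# ... and random perturbations should only increase the cost,
# consistent with w_hat being the unique minimizer
worse = all(E(w_hat + 0.1 * rng.standard_normal(p)) > E(w_hat)
            for _ in range(100))
print(worse)   # True
```

Since $\lambda > 0$, the matrix $X^\top X + \lambda I_p$ is positive definite and hence invertible, so `np.linalg.solve` always succeeds here regardless of the rank of $X$.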