Matrix differentiation: $\frac{\partial{w}}{\partial{\alpha}}$ for $w=(X^\top X + \alpha \boldsymbol{I})^{-1}X^\top y$


What is $\frac{\partial{w}}{\partial{\alpha}}$ for $w=(X^\top X + \alpha \boldsymbol{I})^{-1} X^\top y$, where $X$ is an $N \times D$ matrix, $y$ is an $N$-dimensional vector, $\boldsymbol{I}$ is the $D \times D$ identity matrix, and $\alpha$ is a scalar?

Edit: Actually I was trying to differentiate $\mathcal{F} (\hat{w}(\alpha)) = (\boldsymbol{y}-X\hat{\boldsymbol{w}})^\top (\boldsymbol{y}-X\hat{\boldsymbol{w}})+ \alpha (||{\hat{\boldsymbol{w}}||}^2 - c^2)$ w.r.t. $\alpha$. Can I do $\dfrac{\partial{\mathcal{F} (\hat{w}(\alpha))}}{\partial{\alpha}}= \dfrac{\partial{\mathcal{F} (\hat{w}(\alpha))}}{\partial{\hat{w}}}\times \dfrac{\partial{\hat{w}}}{\partial{\alpha}}$?

If I do so, then $\dfrac{\partial{\mathcal{F} (\hat{w}(\alpha))}}{\partial{\alpha}}= (-2X^\top y + 2X^\top X \hat{w} )\dfrac{\partial{\hat{w}}}{\partial{\alpha}} + (||{\hat{\boldsymbol{w}}||}^2 - c^2)+ (2\alpha\hat{w}) \dfrac{\partial{\hat{w}}}{\partial{\alpha}} $

But the dimension of $\dfrac{\partial{\hat{w}}}{\partial{\alpha}}$ is $D \times 1$ since its expression is $-(X^\top X +\alpha \boldsymbol{I})^{-1}\boldsymbol{\hat{w}}$ and that of $\boldsymbol{\hat{w}}$ is also $D \times 1$ so they can't be multiplied in the order seen in the third term.
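(As a sanity check on that closed form for $\dfrac{\partial{\hat{w}}}{\partial{\alpha}}$: it can be confirmed by a central finite difference in NumPy. Random data and sizes below are purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 4                        # illustrative sizes only
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

def w_hat(alpha):
    """Ridge solution w = (X^T X + alpha I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)

alpha, eps = 0.5, 1e-6
# Closed form: dw/dalpha = -(X^T X + alpha I)^{-1} w
analytic = -np.linalg.solve(X.T @ X + alpha * np.eye(D), w_hat(alpha))
# Central finite difference in alpha
numeric = (w_hat(alpha + eps) - w_hat(alpha - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # should be ~0
```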

Is there a problem with the chain rule?


There are 2 answers below.


Define the matrix variable
$$\eqalign{A &= X^TX+\alpha I \cr dA &= I\,d\alpha}$$
Write the function in terms of this new variable, then find its differential and gradient.
$$\eqalign{
\def\c#1{\color{red}{#1}}
\def\a{\alpha}
\def\A{A^{-1}}
\def\F{\cal F}
w &= \A X^Ty \\
dw &= d\A\,X^Ty \\
&= -\big(\A\,dA\,\A\big)X^Ty \\
&= -\A\,dA\,w \\
&= -\A w\,\,d\a \\
\frac{\partial w}{\partial\a} &= -\A w \\
}$$
For typing convenience, define the vector variable
$$\eqalign{ z = (Xw-y) \quad\implies\quad dz = X\:dw }$$
Now we are ready to differentiate the main function. Since $\a$ enters both explicitly and through $w$, the differential also picks up a $(w^Tw-c^2)\,d\a$ term:
$$\eqalign{
\F &= z^Tz + \a w^Tw - \a c^2 \\
d\F &= 2z^T\c{dz} + 2\a w^Tdw + (w^Tw - c^2)\,d\a \\
&= 2\,z^T\c{X\:dw} + 2\,\a w^Tdw + (w^Tw - c^2)\,d\a \\
&= 2\,(z^TX + \a w^T)\:\c{dw} + (w^Tw - c^2)\,d\a \\
&= 2\,(z^TX + \a w^T)\:(\c{-\A w\:d\a}) + (w^Tw - c^2)\,d\a \\
\frac{d\F}{d\a} &= (w^Tw - c^2) - 2\,(X^Tz + \a w)^T\A w \\
}$$
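(A numerical check of this total derivative, with random data and illustrative sizes, assuming $\F$ includes both the explicit $\a$-terms and the dependence through $w$:)

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, c = 50, 4, 1.0                # illustrative sizes / constraint radius
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

def F(alpha):
    """F = ||y - Xw||^2 + alpha (||w||^2 - c^2) at w = w(alpha)."""
    w = np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)
    r = y - X @ w
    return r @ r + alpha * (w @ w - c**2)

alpha, eps = 0.7, 1e-6
A = X.T @ X + alpha * np.eye(D)
w = np.linalg.solve(A, X.T @ y)
z = X @ w - y
A_inv_w = np.linalg.solve(A, w)

# dF/dalpha = (w^T w - c^2) - 2 (X^T z + alpha w)^T A^{-1} w
analytic = (w @ w - c**2) - 2 * (X.T @ z + alpha * w) @ A_inv_w
numeric = (F(alpha + eps) - F(alpha - eps)) / (2 * eps)
print(abs(analytic - numeric))      # should be ~0
```

(Note that $X^Tz + \a w = Aw - X^Ty = 0$ at $w = \A X^Ty$, so the chain-rule term contributes nothing numerically here; the finite difference is matched by the explicit term alone.)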


So you tried to do $\frac{∂\mathcal{F}(w(α))}{∂α}= \frac{∂\mathcal{F}(w(α))}{∂w}×\frac{∂w}{∂α}$, but the chain rule actually states $\frac{∂\mathcal{F}(w(α))}{∂α}= \frac{∂\mathcal{F}(w(α))}{∂w}∘\frac{∂w}{∂α}$.

This is because, by definition, the derivative of a function $f:X→Y$ at a point $x∈X$ is a linear function $∂f(x):X→Y,\ {∆x} ↦ ∂f(x)({∆x})$. I like to refer to this as the functional form of the derivative.

The only reason "×" makes any sense is that linear functions can be expressed in a tensorial form, and the chaining of linear functions can be expressed as tensor contractions, of which matrix multiplication is a special case.

With this in mind, given $\mathcal{F}(w) = ‖y-Xw‖^2 + α‖w‖^2$ (dropping the constant $-αc^2$, which does not affect $∂/∂w$) and $w(α) = (X^⊤ X + α\mathbb{I})^{-1} X^⊤ y = Z^{-1}X^⊤y$, we can apply basic backpropagation:

$$\begin{aligned} \dfrac{∂\mathcal{F}(w(α))}{∂α} &= \dfrac{∂\mathcal{F}(w(α))}{∂w}∘\dfrac{∂w}{∂α} \\ &= [{∆w} ↦ 2⟨Xw-y∣X{∆w}⟩ + 2α⟨w∣{∆w}⟩] ∘ \dfrac{∂ Z^{-1} X^⊤ y}{∂α} \\ &= [{∆w} ↦ ⟨\underbrace{2X^⊤(Xw-y) + 2αw}_{≕g_1}∣{∆w}⟩] ∘ \dfrac{∂ Z^{-1} X^⊤ y}{∂Z^{-1}} ∘ \dfrac{∂ Z^{-1}}{∂α} \\ &= [{∆w} ↦ ⟨g_1∣{∆w}⟩] ∘ [{∆Z}^{-1} ↦ {∆Z}^{-1}X^⊤y] ∘ \dfrac{∂ Z^{-1}}{∂Z}∘ \dfrac{∂ Z}{∂α} \\&= [{∆Z}^{-1} ↦ ⟨g_1∣{∆Z}^{-1}X^⊤y⟩] ∘ \dfrac{∂ Z^{-1}}{∂Z} ∘ \dfrac{∂ Z}{∂α} \\&= [{∆Z}^{-1} ↦ ⟨\underbrace{g_1 y^⊤X }_{≕g_2}∣{∆Z}^{-1}⟩] ∘ [{∆Z} ↦ -Z^{-1}{∆Z}Z^{-1}] ∘ \dfrac{∂ Z}{∂α} \\&= [{∆Z} ↦ ⟨g_2∣-Z^{-1}{∆Z}Z^{-1}⟩] ∘ \dfrac{∂ Z}{∂α} \\&= [{∆Z} ↦ ⟨\underbrace{-Z^{-1} g_2 Z^{-1}}_{≕g_3}∣{∆Z}⟩] ∘ [{∆α} ↦ \mathbb{I}\,{∆α}] \\&= [{∆α} ↦ ⟨g_3∣\mathbb{I}⟩{∆α}] = [{∆α} ↦ \underbrace{\operatorname{tr}(g_3)}_{≕g_4}\,{∆α}] \end{aligned}$$

Plugging everything back together, and simplifying, we can also express this derivative in tensorial form as

$$\boxed{\dfrac{∂\mathcal{F}(w(α))}{∂α} = -2\operatorname{tr}\big(Z^{-1} (X^⊤(Xw-y) + αw)\,y^⊤ X Z^{-1}\big)}$$
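The adjoint steps of this backprop chain can be spot-checked numerically: each bracketed map is the transpose of the corresponding forward map, an identity that holds for an *arbitrary* cotangent $g_1$ (in the derivation above $g_1 = 2X^⊤(Xw-y)+2αw$). A sketch with random data and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 4                        # illustrative sizes only
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
alpha = 0.5

Z = X.T @ X + alpha * np.eye(D)
Zi = np.linalg.inv(Z)
w = Zi @ X.T @ y                    # w(alpha) = Z^{-1} X^T y

# Arbitrary cotangent standing in for g1 = 2 X^T (Xw - y) + 2 alpha w
g1 = rng.standard_normal(D)
g2 = np.outer(g1, X.T @ y)          # g1 y^T X
g3 = -Zi @ g2 @ Zi

# <g1 | dZi X^T y> = <g2 | dZi>  for any perturbation dZi
dZi = rng.standard_normal((D, D))
assert np.isclose(g1 @ (dZi @ X.T @ y), np.sum(g2 * dZi))

# <g2 | -Zi dZ Zi> = <g3 | dZ>  for any perturbation dZ (Z is symmetric)
dZ = rng.standard_normal((D, D))
assert np.isclose(np.sum(g2 * (-Zi @ dZ @ Zi)), np.sum(g3 * dZ))

# Contracting with dZ = I: tr(g3) = <g1 | dw/dalpha> = <g1 | -Zi w>
assert np.isclose(np.trace(g3), g1 @ (-Zi @ w))
print("all adjoint identities hold")
```

Here `np.sum(A * B)` is the Frobenius inner product $⟨A∣B⟩ = \operatorname{tr}(A^⊤B)$; the last assertion is exactly the tensorial form in the box, evaluated for a generic $g_1$.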