Chain rule with total and partial derivatives.

Let $g: \mathbb{R}^{N_\ell \times N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $g(W) = Wa$, be the function that takes a matrix as its argument and multiplies it by a fixed vector $a \in \mathbb{R}^{N_{\ell-1}}$.

Let $h: \mathbb{R}^{N_\ell} \to \mathbb{R}$ be a differentiable function.

I want to differentiate the composition $h \circ g$ with respect to the matrix $W$, so I differentiate with respect to each of its components. I want to use the total derivative of $h$, and my intuition says that

$\frac{\partial}{\partial W_{j,i}}(h \circ g) = Dh \, \dfrac{\partial g}{\partial W_{j,i}}$, where $Dh$ is the total derivative of $h$ (evaluated at $g(W)$) and $\dfrac{\partial g}{\partial W_{j,i}}$ is the partial derivative of $g$ with respect to the entry of $W$ in row $j$ and column $i$.

My questions are: is my intuition correct? If so, why? (I'm familiar with the chain rule for total derivatives, but I've never seen it mixed with partial derivatives.)

Best answer

Yes, that's right. You can do the same calculation in several different ways.

In index notation

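Writing $n=N_\ell$ and $m=N_{\ell-1}$, and $\partial_{ij}$ for $\partial/\partial W_{ij}$ (row $i$, column $j$), the chain rule gives $$\begin{align} \partial_{ij}(h\circ g) &= \sum_{k=1}^{n} (\partial_k h\circ g)\,\partial_{ij} g_k \\ &= \begin{bmatrix} \partial_1h\circ g & \cdots & \partial_nh\circ g \end{bmatrix} \begin{bmatrix} \partial_{ij}g_1 \\ \vdots \\ \partial_{ij}g_n \end{bmatrix} \end{align}$$ which is the row vector $Dh(g(W))$ applied to the column vector $\partial_{ij} g$, as you wanted.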

Now, for your specific problem, since $g_k(W)=\sum_{r=1}^{m} W_{kr}a_r$, you have $\partial_{ij}g_k=\delta_{ik}a_j$, so your derivative is $$\begin{align} \partial_{ij}(h\circ g) &= \sum_k (\partial_k h\circ g)\,\partial_{ij} g_k \\ &= \sum_k (\partial_k h\circ g)\,\delta_{ik}a_j \\ &= a_j(\partial_i h\circ g). \end{align}$$
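As a quick sanity check of the closed form $\partial_{ij}(h\circ g) = a_j(\partial_i h\circ g)$, here is a minimal sketch in plain Python that compares it with central finite differences. The function `h`, the matrix `W`, the vector `a`, and the helper names are all hypothetical choices made for illustration; none of them come from the question.

```python
import math

def h(y):
    # Hypothetical test function h: R^n -> R (illustration only).
    return sum(math.sin(v) for v in y) + y[0] * y[-1]

def grad_h(y):
    # Analytic partials of the h above: cos(y_k), plus the product-term contributions.
    dh = [math.cos(v) for v in y]
    dh[0] += y[-1]
    dh[-1] += y[0]
    return dh

def matvec(W, a):
    # g(W) = W a, with W stored as a list of rows.
    return [sum(W[k][r] * a[r] for r in range(len(a))) for k in range(len(W))]

def partial_fd(W, a, i, j, eps=1e-6):
    # Central finite difference of h(W a) in the single entry W[i][j].
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (h(matvec(Wp, a)) - h(matvec(Wm, a))) / (2 * eps)

W = [[0.3, -1.2], [0.7, 0.5], [-0.4, 0.9]]  # n = 3 rows, m = 2 columns
a = [1.5, -0.8]
dh = grad_h(matvec(W, a))

# Compare each numerical partial with the closed form a_j * (d_i h)(W a).
max_err = max(
    abs(partial_fd(W, a, i, j) - a[j] * dh[i])
    for i in range(len(W))
    for j in range(len(a))
)
print(max_err)
```

The printed value should be tiny, limited only by finite-difference and rounding error, which is what you expect if the formula is right.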

In differential notation

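This is also nice. Take $\nabla h$ as a row vector and $a$ as a column vector, and read $A:B$ as $\operatorname{tr}(AB)$, which is the reading under which every step below has matching shapes. Then $$\begin{align} d(h(Wa)) &= \nabla h(Wa):d(Wa) \\ &= \nabla h(Wa):dW\,a \\ &= dW\,a:\nabla h(Wa) \\ &= dW:a\nabla h(Wa) \end{align}$$ so that $$ \frac{dh(Wa)}{dW} = a\nabla h(Wa). $$ This is an $m\times n$ matrix whose $(j,i)$ entry is $a_j\,\partial_i h(Wa)$, i.e. the transpose of the array of partials $\partial_{ij}(h\circ g)$ found in index notation.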
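The only non-obvious step in the differential calculation is moving $a$ across the colon. A small sketch in plain Python can check that step, under the assumption that $A:B$ means $\operatorname{tr}(AB)$ (the answer does not define the colon, so this reading is an interpretation). The row vector `p` merely stands in for $\nabla h(Wa)$, and all numbers and helper names are hypothetical.

```python
def matmul(A, B):
    # Plain matrix product of two lists-of-rows.
    return [
        [sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
        for r in range(len(A))
    ]

def trace(A):
    # Sum of the diagonal entries of a square matrix.
    return sum(A[k][k] for k in range(len(A)))

p = [[0.2, -1.0, 0.5]]                             # 1 x n row vector, stand-in for grad h(Wa)
a = [[1.5], [-0.8]]                                # m x 1 column vector
dW = [[0.01, -0.02], [0.03, 0.0], [-0.01, 0.02]]   # n x m perturbation of W

lhs = matmul(matmul(p, dW), a)[0][0]   # grad_h dW a, a scalar
outer = matmul(a, p)                   # a grad_h, an m x n matrix
rhs = trace(matmul(dW, outer))         # tr(dW a grad_h), i.e. dW : a grad_h

print(lhs, rhs)
print(len(outer), len(outer[0]))       # shape of a grad_h
```

The two scalars agree up to rounding, and `outer` has shape $m \times n$, so under this reading the result $a\nabla h(Wa)$ is laid out as the transpose of $W$.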

By definition of the derivative

The derivative of a function $f$ at $W$ is a linear operator $Df(W)$ such that $$ f(W+H) - f(W) = Df(W)(H) $$ for any infinitesimal matrix $H$. You can also state this in terms of limits or with big-$O$ notation, but this way the notation is lighter. Now take $f(W)=h(Wa)$. Ignoring terms of higher order in $H$, we have $$ \begin{align} f(W+H) - f(W) &= -h(Wa) + h(Wa+Ha) \\ &= -h(Wa) + h(Wa) + Dh(Wa)(Ha) \\ &= Dh(Wa)(Ha) \\ &= \nabla h(Wa)Ha \\ &= \nabla h(Wa):Ha \\ &= Ha:\nabla h(Wa) \\ &= H:a\nabla h(Wa) \\ &= a\nabla h(Wa):H. \end{align} $$ The term $a\nabla h(Wa):H$ is linear in $H$, so it must be $Df(W)(H)$: $$ Df(W)(H) = a\nabla h(Wa):H. $$
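"Ignoring terms of higher order in $H$" can be made concrete: if $Df(W)(H) = \nabla h(Wa)\,Ha$ is the right linear part, then the remainder $f(W+tH) - f(W) - Df(W)(tH)$ should shrink like $t^2$. Here is a minimal sketch in plain Python, assuming a hypothetical cubic test function `h` and arbitrary illustrative values for `W`, `H`, and `a`.

```python
def h(y):
    # Hypothetical smooth test function h(y) = sum_k y_k^3 (illustration only).
    return sum(v ** 3 for v in y)

def grad_h(y):
    # Analytic gradient of the h above.
    return [3 * v ** 2 for v in y]

def matvec(M, x):
    # Matrix-vector product with M stored as a list of rows.
    return [sum(M[k][r] * x[r] for r in range(len(x))) for k in range(len(M))]

def remainder(W, H, a, t):
    # f(W + tH) - f(W) - Df(W)(tH), with Df(W)(H) = grad_h(Wa) . (Ha).
    W_shift = [[W[k][r] + t * H[k][r] for r in range(len(a))] for k in range(len(W))]
    y = matvec(W, a)
    linear = t * sum(p * u for p, u in zip(grad_h(y), matvec(H, a)))
    return h(matvec(W_shift, a)) - h(y) - linear

W = [[0.3, -1.2], [0.7, 0.5], [-0.4, 0.9]]
H = [[1.0, 0.5], [-0.5, 2.0], [0.25, -1.0]]
a = [1.5, -0.8]

r_big = remainder(W, H, a, 1e-2)
r_small = remainder(W, H, a, 1e-3)
ratio = r_big / r_small
print(r_big, r_small, ratio)
```

Shrinking $t$ by a factor of 10 should shrink the remainder by a factor of roughly 100, which is the quadratic behaviour that justifies dropping it.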