Derivative of simple matrix product


I have been trying for the past few days to find a solution to this very simple problem, yet every time I seem to hit a wall. (PS: please excuse the poor notations, it's my first time using this website).

Let's say we have $C = AB$, $A$ and $B$ matrices so that $C$ is defined.

What is $\frac{\partial C}{\partial A}$ and $\frac{\partial C}{\partial B}$, i.e., the derivatives of $C$ w.r.t $A$ and $B$ respectively?

Let us assume size $A = m \times n$, size $B = n \times p$, size $C = m \times p$

Writing $C_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ for $i = 1, \dots, m$ and $j = 1, \dots, p$,

we get $\frac{\partial C_{ij}}{\partial a_{ls}} = \frac{\partial}{\partial a_{ls}} \sum_{k=1}^{n} a_{ik} b_{kj} = b_{sj}$ if $i = l$, and $0$ otherwise.

Then shouldn't $\frac{\partial C}{\partial A}$ be the rate of change for each $C_{ij}$ w.r.t each element of $A$?

So that $\frac{\partial C}{\partial A}$ is a matrix $D$ of the size of $C$, where each entry $D_{ij}$ is itself a matrix of the size of $A$.
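This "matrix of matrices" view can be checked numerically: $\frac{\partial C}{\partial A}$ is a 4-D array of shape $(m, p, m, n)$ with entries $\delta_{il} b_{sj}$. A small NumPy sketch (shapes and random values are assumed for illustration):

```python
import numpy as np

# C = A @ B with A (m x n), B (n x p); the full derivative dC/dA is a
# 4-D array D with D[i, j, l, s] = dC_ij / dA_ls = delta_il * B_sj.
rng = np.random.default_rng(0)
m, n, p = 2, 3, 4
A, B = rng.standard_normal((m, n)), rng.standard_normal((n, p))

# Analytic formula derived above: dC_ij/dA_ls = delta_il * B_sj.
D_analytic = np.einsum('il,sj->ijls', np.eye(m), B)

# Finite-difference check, perturbing one entry of A at a time.
eps = 1e-6
D_numeric = np.zeros((m, p, m, n))
for l in range(m):
    for s in range(n):
        dA = np.zeros((m, n)); dA[l, s] = eps
        D_numeric[:, :, l, s] = ((A + dA) @ B - A @ B) / eps

print(np.allclose(D_analytic, D_numeric, atol=1e-4))  # True
```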

For instance $C = AB$, given $A$ of size $1 \times 3$ and $B$ of size $3 \times 2$.

$A = $$\begin{pmatrix}a_1 & a_2 & a_3 \end{pmatrix}$

$B = $$\begin{pmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32}\end{pmatrix}$

$C = \begin{pmatrix} a_1 b_{11} + a_2 b_{21} + a_3 b_{31} & a_1 b_{12} + a_2 b_{22} + a_3 b_{32} \end{pmatrix}$

Then, flattening $B$ row by row, $\frac{\partial C_1}{\partial B} = \frac{\partial (a_1 b_{11} + a_2 b_{21} + a_3 b_{31})}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \end{pmatrix}$ and $\frac{\partial C_2}{\partial B} = \frac{\partial (a_1 b_{12} + a_2 b_{22} + a_3 b_{32})}{\partial B} = \begin{pmatrix} 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$

So that $\frac{\partial C}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \\ 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$, is of size $2 \times 6$.
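As a sanity check, this $2 \times 6$ matrix is exactly a Kronecker product, $A \otimes I_2$, when $B$ is flattened row by row. A short NumPy sketch (the values $a_1, a_2, a_3$ are assumed placeholders):

```python
import numpy as np

# With B flattened row-major (b11, b12, b21, b22, b31, b32), the Jacobian
# d vec(C) / d vec(B) is the Kronecker product A (x) I_2.
a1, a2, a3 = 1.0, 2.0, 3.0
A = np.array([[a1, a2, a3]])             # 1 x 3
J = np.kron(A, np.eye(2))                # 2 x 6

hand_computed = np.array([[a1, 0, a2, 0, a3, 0],
                          [0, a1, 0, a2, 0, a3]])
print(np.array_equal(J, hand_computed))  # True
```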

In the machine-learning equations, however, it is given that $\frac{\partial C}{\partial A} = B^T$ and $\frac{\partial C}{\partial B} = A^T$, which does not match the $2 \times 6$ matrix computed above.

To be more precise, the context of this query is for machine learning and backpropagation.

We have $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$, where $L$ is a scalar function (the loss), $\hat{Y}$ is a $1 \times 2$ matrix (the activated output), $\text{out}$ is a $1 \times 2$ vector (pre-activation), and $W$ is a $3 \times 2$ matrix (the weights).

If you could also explain why we actually write:

$\frac{\partial L}{\partial W} = \frac{\partial \text{out}}{\partial W} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial L}{\partial \hat{Y}}$, instead of the forward notation used earlier $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$

Is there an explanation beyond "because the shapes happen to fit the matrix product, so we rearrange the terms of the chain rule"?
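One concrete way to see the reversal: with everything flattened to vectors, the chain rule really is a forward-ordered product of Jacobian matrices, but the *gradient* is that product transposed, and $(J_1 J_2 J_3)^T = J_3^T J_2^T J_1^T$ reverses the order. A NumPy sketch, assuming a sigmoid activation and squared loss to make things concrete (all values are placeholders):

```python
import numpy as np

# Forward Jacobian order: dL/dvecW = (dL/dyhat) J_yhat<-out J_out<-vecW.
# The gradient (a column vector) is its transpose, which reverses the order:
# grad = J_out<-vecW^T J_yhat<-out^T (dL/dyhat)^T.
rng = np.random.default_rng(1)
h2 = rng.standard_normal((1, 3))
W = rng.standard_normal((3, 2))
y = rng.standard_normal((1, 2))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
out = h2 @ W
yhat = sigmoid(out)

g_yhat = yhat - y                                        # dL/dyhat for L = 0.5*sum((yhat-y)^2)
J_yhat_out = np.diag((yhat * (1 - yhat)).ravel())        # 2 x 2, elementwise sigma'
J_out_vecW = np.kron(h2, np.eye(2))                      # 2 x 6, row-major vec(W)

g_vecW = J_out_vecW.T @ J_yhat_out.T @ g_yhat.T          # 6 x 1, reversed order
g_W_compact = h2.T @ (g_yhat * (yhat * (1 - yhat)))      # the usual 3 x 2 formula

print(np.allclose(g_vecW.reshape(3, 2), g_W_compact))    # True
```

So the "rearranged" formula is not ad hoc: it is the transpose of the forward chain rule, written in matrix shorthand.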

Asking ChatGPT these questions yields a different answer every time.

Thank you for your time.

Edit: the formatting was done using ChatGPT. I also checked the linked question "Not understanding derivative of a matrix-matrix product"; there the result for $AX$ is a Kronecker product, which also does not seem to be the right size.

The part where I usually get stuck is computing every $D_{ij}$ for $D = \frac{\partial C}{\partial B}$.

Other edit: here is more context for the chain derivation above.

$L(Y, \hat{Y}) = \frac{1}{2m} \sum_{i=1}^{m}(Y^{(i)} - \hat{Y}^{(i)})^2$ where $\hat{Y}$ is the predicted output,

$ \hat{Y} = \sigma_{o}(out) $

$ out = h_{2} W $

where $\sigma_{o}$ is the activation function for the out layer,

$h_{2}$ is of size $1 \times 3$, $W$ of size $3 \times 2$, $out$ of size $1 \times 2$ and $\hat{Y}$ of size $1 \times 2$.

Here the activation function $\sigma_{o}(X)$ is applied element-wise to $X$, a matrix.


Two answers follow.

First answer:

Let $\hat{\mathbf{y}}= \sigma(\mathbf{a})$ and $\mathbf{a} = \mathbf{h}_2 \mathbf{W}$, in your notation.

The sensitivities backpropagate as follows: $$ \frac{\partial L}{\partial \mathbf{a}} = \frac{\partial L}{\partial \hat{\mathbf{y}}} \odot \sigma'(\mathbf{a}) $$ where $\odot$ denotes the Hadamard (element-wise) product, and then

$$ \frac{\partial L}{\partial \mathbf{W}} = \mathbf{h}_2^T \frac{\partial L}{\partial \mathbf{a}} $$
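These two formulas are easy to verify against finite differences. A minimal NumPy sketch, assuming (to match the question's setup) a sigmoid activation and the squared loss with a single sample:

```python
import numpy as np

# Check dL/dW = h2^T (dL/da), with dL/da = (dL/dyhat) (.) sigma'(a).
rng = np.random.default_rng(2)
h2 = rng.standard_normal((1, 3))
W = rng.standard_normal((3, 2))
Y = rng.standard_normal((1, 2))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
loss = lambda W_: 0.5 * np.sum((Y - sigmoid(h2 @ W_)) ** 2)

a = h2 @ W
dL_dyhat = sigmoid(a) - Y                          # from L = 0.5*sum((Y - yhat)^2)
dL_da = dL_dyhat * sigmoid(a) * (1 - sigmoid(a))   # Hadamard with sigma'(a)
dL_dW = h2.T @ dL_da                               # 3 x 2

# Finite-difference gradient, entry by entry.
eps = 1e-6
num = np.zeros_like(W)
for r in range(3):
    for s in range(2):
        dW = np.zeros_like(W); dW[r, s] = eps
        num[r, s] = (loss(W + dW) - loss(W)) / eps

print(np.allclose(dL_dW, num, atol=1e-4))  # True
```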

Second answer:

Let $V≔ℝ^{m×n}⊕ℝ^{n×k}$ and $W≔ℝ^{m×k}$. By definition, the derivative of the function

$$f:V ⟶ W,(A, B) ⟼ A⋅B$$

at the point $(A^*, B^*)∈V$ is the unique linear function

$$ Df(A^*, B^*):V ⟶ W,\quad (∆A, ∆B) ⟼ Df(A^*, B^*)(∆A, ∆B)$$

with the property

$$ \lim_{‖(∆A, ∆B)‖_{V}→0} \frac{‖f(A^* + ∆A, B^* + ∆B) - f(A^*, B^*) - Df(A^*, B^*)(∆A, ∆B)‖_{W}}{‖(∆A, ∆B)‖_{V}} = 0$$

We can find $Df(A^*, B^*)$ simply by dropping higher-order terms:

$$\begin{aligned} f(A^* + ∆A, B^* + ∆B) - f(A^*, B^*) &= (A^* + ∆A)⋅(B^* + ∆B) - A^*⋅B^* \\&= A^*⋅B^* + A^*⋅∆B + ∆A⋅B^* + ∆A⋅∆B - A^*⋅B^* \\&= A^*⋅∆B + ∆A⋅B^* + o(‖(∆A, ∆B)‖_V) \end{aligned}$$

Hence, the derivative is the linear function

$$ Df(A^*, B^*):V ⟶ W,\quad (∆A, ∆B) ⟼ A^*⋅∆B + ∆A⋅B^*$$
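This linear map is easy to verify numerically: the residual after subtracting the linear part is exactly $∆A⋅∆B$, which is second order in the perturbation. A quick NumPy sketch (shapes and values assumed):

```python
import numpy as np

# For small perturbations: (A+dA)(B+dB) - AB = A dB + dA B + dA dB,
# and the last term is quadratic in the perturbation size.
rng = np.random.default_rng(3)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
t = 1e-5
dA, dB = t * rng.standard_normal((2, 3)), t * rng.standard_normal((3, 4))

exact_change = (A + dA) @ (B + dB) - A @ B
linear_part = A @ dB + dA @ B

# The residual is exactly dA @ dB, up to floating-point rounding.
print(np.allclose(exact_change - linear_part, dA @ dB))  # True
```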


Now, a linear function $V→W$ can be naturally encoded as an element of the tensor product $W⊗V^*$, where $V^*$ denotes the dual space.

Thus, we can encode $Df(A^*, B^*)$ as an element of the tensor product space $$ℝ^{m×k}⊗(ℝ^{m×n}⊕ℝ^{n×k})^* ≅ (ℝ^{m×k}⊗(ℝ^{m×n})^*)⊕(ℝ^{m×k}⊗(ℝ^{n×k})^*)$$

That is, as a tuple of two 4-dimensional tensors, whose components are simply the partial derivatives:

$$ Df(A^*, B^*) = \Big(\frac{∂f(A^*, B^*)_{α}}{∂(A^*, B^*)_{β}}\Big)_{α, β} $$

In the specific case at hand, we have

$$\begin{aligned} \frac{∂(A⋅B)_{kl}}{∂A_{mn}} &= \frac{∂\big(∑_{i}A_{ki}B_{il}\big)}{∂A_{mn}} = δ_{km}B_{nl} &&⟹ \frac{∂f}{∂A} = 𝟙_m ⊗ B^T \\ \frac{∂(A⋅B)_{kl}}{∂B_{rs}} &= \frac{∂\big(∑_{i}A_{ki}B_{il}\big)}{∂B_{rs}} = δ_{ls}A_{kr} &&⟹ \frac{∂f}{∂B} = A ⊗ 𝟙_k \end{aligned}$$

Depending on the convention/definition used, the results $\frac{∂f}{∂A} = 𝟙_m ⊗ B$ and $\frac{∂f}{∂B} = A^T ⊗ 𝟙_k$ are also possible.
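These Kronecker forms can be checked numerically once a flattening convention is fixed. The sketch below assumes row-major vectorization (NumPy's default ravel order), under which the flattened Jacobians are $𝟙_m ⊗ B^T$ and $A ⊗ 𝟙_k$; a column-major convention would give the transposed variants mentioned above:

```python
import numpy as np

# With row-major vec: d vec(AB)/d vec(A) = I_m (x) B^T,
#                     d vec(AB)/d vec(B) = A (x) I_k.
rng = np.random.default_rng(4)
m, n, k = 2, 3, 4
A, B = rng.standard_normal((m, n)), rng.standard_normal((n, k))

J_A = np.kron(np.eye(m), B.T)   # (m*k) x (m*n)
J_B = np.kron(A, np.eye(k))     # (m*k) x (n*k)

# Brute-force Jacobians from the component formulas
# d(AB)_ij/dA_ls = delta_il * B_sj  and  d(AB)_ij/dB_rs = A_ir * delta_js.
J_A_ref = np.zeros((m * k, m * n))
J_B_ref = np.zeros((m * k, n * k))
for i in range(m):
    for j in range(k):
        for l in range(m):
            for s in range(n):
                J_A_ref[i * k + j, l * n + s] = (i == l) * B[s, j]
        for r in range(n):
            for s in range(k):
                J_B_ref[i * k + j, r * k + s] = A[i, r] * (j == s)

print(np.allclose(J_A, J_A_ref), np.allclose(J_B, J_B_ref))  # True True
```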