Derivative where the variable is a matrix


In the context of neural nets, I'm trying to differentiate the objective function with respect to the weight matrices. I'm stuck at the following point:

Let $N$ and $D$ be two distinct integers, and let $A\in\mathcal M_{N, D}(\mathbb R)$ and $B\in\mathcal M_{D, N}(\mathbb R)$ be two matrices. We'll also consider two column vectors $\omega_A$ and $\omega_B$ of size $N$. At some point in the process, I must differentiate the following quantity $q$ with respect to $B$:

$$ q = \omega_A^T\cdot A\cdot B\cdot\omega_B $$

I know that $q$ is a scalar and that $\frac{\partial q}{\partial B}$ should be some sort of matrix. But I can't just apply the classic formulas of scalar differentiation and get $\frac{\partial q}{\partial B}=\omega_A^T\cdot A\cdot\omega_B$, because the dimensions don't match. How should I proceed?

I've thought of rearranging the terms to match dimensions, such as:

$$ \frac{\partial q}{\partial B}=\left(\omega_B\cdot\omega_A^T\cdot A\right)^T $$

but I don't "see" why this should be correct.

Besides, I don't know if this is relevant here, but the two $\omega$ are "one-hot" vectors, meaning they consist of a single $1$ and $0$s everywhere else. This means $q$ can also be understood as the scalar product between a row of $A$ (selected by $\omega_A$) and a column of $B$ (selected by $\omega_B$). In this case, only one column of the matrix $\frac{\partial q}{\partial B}$ would be non-zero, because $\omega_B\cdot\omega_A^T$ is a "one-hot" matrix.
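The one-hot observation is easy to check numerically. A minimal NumPy sketch (the sizes, the random matrices, and which entries of the one-hot vectors are set are all arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 6
A = rng.standard_normal((N, D))
B = rng.standard_normal((D, N))

# One-hot vectors: w_A selects row 2 of A, w_B selects column 1 of B.
w_A = np.zeros(N); w_A[2] = 1.0
w_B = np.zeros(N); w_B[1] = 1.0

q = w_A @ A @ B @ w_B
# q is the dot product of a row of A with a column of B.
assert np.isclose(q, A[2] @ B[:, 1])

# Numerical gradient dq/dB via finite differences.
eps = 1e-6
grad = np.zeros_like(B)
for j in range(D):
    for k in range(N):
        Bp = B.copy(); Bp[j, k] += eps
        grad[j, k] = (w_A @ A @ Bp @ w_B - q) / eps

# Only column 1 of the gradient (the column selected by w_B) is non-zero.
assert np.allclose(grad[:, [0, 2, 3]], 0.0, atol=1e-4)
```

Since $q$ is linear in $B$, the finite-difference gradient is exact up to floating-point error.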


BEST ANSWER

If you reshape the matrix $B$ into a large vector of dimension $D\cdot N$, you can consider $q$ as a scalar function of $D\cdot N$ variables, so the gradient of that function is a vector of dimension $D\cdot N$. In turn, this vector can be reshaped back into a $D \times N$ matrix whose $(j,k)$ entry is the partial derivative $\frac{\partial q}{\partial B_{jk}}$.

So let us work in coordinates, always bearing in mind the formula for the matrix product. To avoid cluttering the subindices, I rename the vectors so the scalar function is:
$$q=v^\top \, A\, B\, w$$
or, component-wise:
$$q_{11}=\sum_i\sum_j\sum_k v_{1i}^\top \, A_{ij}\, B_{jk}\, w_{k1}$$
Note that I am writing even the trivial subindices, to highlight the matching components in the matrix-product formula. Now take the partial derivative:
$$\frac{\partial q}{\partial B_{jk}}=\sum_i v_{1i}^\top \, A_{ij}\, w_{k1}$$
And that's it... but I guess you want a "nice" matrix formula, so let us rearrange, looking for the $jk$ component of some matrix:
$$\frac{\partial q}{\partial B_{jk}}=\sum_i v_{1i}^\top \, A_{ij}\, w_{k1}=\sum_i A_{ji}^\top \, v_{i1} \, w_{1k}^\top = \left(A^\top\, v\, w^\top \right)_{jk}$$
Therefore, with some abuse of notation, you can write the gradient as:
$$\frac{\partial q}{\partial B}=A^\top\, v\, w^\top $$
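The formula $\frac{\partial q}{\partial B}=A^\top v\, w^\top$ can be verified against a finite-difference gradient. A quick NumPy sketch (the sizes and random values are arbitrary; $v$ and $w$ play the roles of $\omega_A$ and $\omega_B$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 3, 5
A = rng.standard_normal((N, D))
B = rng.standard_normal((D, N))
v = rng.standard_normal(N)   # plays the role of omega_A
w = rng.standard_normal(N)   # plays the role of omega_B

q = lambda M: v @ A @ M @ w

# Analytic gradient from the derivation above: A^T v w^T, shape (D, N) like B.
analytic = A.T @ np.outer(v, w)

# Finite-difference check, entry by entry.
eps = 1e-6
numeric = np.zeros_like(B)
for j in range(D):
    for k in range(N):
        Bp = B.copy(); Bp[j, k] += eps
        numeric[j, k] = (q(Bp) - q(B)) / eps

assert np.allclose(analytic, numeric, atol=1e-4)
```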

EDIT: So your intuition is correct and everything that works by matching dimensions is correct... well, not everything :)

ANSWER

$q$ is essentially a real function of several variables. Therefore, taking the derivative with respect to $B$ is like computing a gradient, with $B$ thought of as a vector. So you need to compute the partial derivatives, denoted as below.

$$\frac{\partial q}{\partial B_{xy}}$$

$B_{xy}$ indicates the element of $B$, located at row $x$ and column $y$.

Note that $q$ is a sum of terms. In order to find $\frac{\partial q}{\partial B_{xy}}$, you need to find the terms of $q$ in which $B_{xy}$ appears as a variable; the derivative of any term without $B_{xy}$ is zero. Let's focus on $B.\omega_{b}$, which is a vector. Only its $x$-th element contains $B_{xy}$: $$(B.\omega_{b})_{x1}=\sum_{i}B_{xi} (\omega_{b})_{i}$$

Then notice that $\omega_{a}^T.A$ is a row vector, so the row vector $\omega_{a}^T.A$ multiplies the column vector $B.\omega_{b}$. Since $(B.\omega_{b})_{x1}$ is the only element of $B.\omega_{b}$ that contains $B_{xy}$, we only need the $x$-th element of $\omega_{a}^T.A$, which is $(\omega_{a}^T.A)_{1x}$. Therefore

$$\frac{\partial q}{\partial B_{xy}}=\frac{\partial \big( (\omega_{a}^T.A)_{1x}\times (B.\omega_{b})_{x1} \big)}{\partial B_{xy}}$$

$$\frac{\partial q}{\partial B_{xy}}=\frac{\partial \big( (\omega_{a}^T.A)_{1x}\times\sum_{i}B_{xi}(\omega_{b})_{i} \big)}{\partial B_{xy}}=(\omega_{a}^T.A)_{1x}\times (\omega_{b})_{y}$$

You can check that the matrix form is

$$A^T \omega_{a} \omega_{b}^T$$
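The claimed matrix form can be checked entry by entry against the componentwise result $(\omega_{a}^T A)_{1x}\,(\omega_{b})_{y}$. A small NumPy sketch (random sizes and values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 4, 6
A = rng.standard_normal((N, D))
w_a = rng.standard_normal(N)
w_b = rng.standard_normal(N)

row = w_a @ A                    # the row vector w_a^T A, shape (D,)
G = A.T @ np.outer(w_a, w_b)     # claimed matrix form of the gradient, shape (D, N)

# Entry (x, y) of the matrix form equals (w_a^T A)_{1x} * (w_b)_y.
for x in range(D):
    for y in range(N):
        assert np.isclose(G[x, y], row[x] * w_b[y])
```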

ANSWER

The inner/Frobenius product is a convenient infix notation for the trace, i.e. $$X:Y={\rm tr}(X^TY)$$ Using this and the cyclic property of the trace, the function can be written in a form which simplifies differentiation:
$$\eqalign{ q &= {\rm tr}\Big(w_a^TABw_b\Big) \cr &= {\rm tr}\Big(w_bw_a^TAB\Big) \cr &= {\rm tr}\Big((A^Tw_aw_b^T)^TB\Big) \cr &= A^Tw_aw_b^T:B \cr\cr dq &= A^Tw_aw_b^T:dB \cr\cr \frac{\partial q}{\partial B} &= A^Tw_aw_b^T \cr }$$
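Each rewriting step in this chain preserves the scalar value, which is easy to confirm numerically. A minimal NumPy sketch (arbitrary random sizes and values; the vectors are kept as $N\times 1$ columns so the trace expressions are well-formed):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 3, 5
A = rng.standard_normal((N, D))
B = rng.standard_normal((D, N))
w_a = rng.standard_normal((N, 1))   # column vector
w_b = rng.standard_normal((N, 1))   # column vector

frob = lambda X, Y: np.trace(X.T @ Y)   # the inner/Frobenius product X:Y

q = (w_a.T @ A @ B @ w_b).item()
# Cyclic property of the trace:
assert np.isclose(q, np.trace(w_b @ w_a.T @ A @ B))
# Frobenius-product form, whose second factor isolates B:
assert np.isclose(q, frob(A.T @ w_a @ w_b.T, B))
```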