I am trying to understand the chain rule applied to a series of transformations in the context of the backpropagation algorithm for deep learning. Let $x \in \mathbb{R}^K$ and let $A, B$ be real-valued matrices of size $K \times K$. Then consider a network defined as $$y = Ax$$ $$u = \sigma(y)$$ $$v = Bx$$ $$z = A(u * v)$$ $$w = Az$$ $$L = \|w\|^2,$$
where $L$ is regarded as a function of $x$, $A$, and $B$; here $u * v$ denotes the element-wise product, and $\sigma(y)$ is the element-wise application of the sigmoid function to $y$. I want to calculate $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$.
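For concreteness, here is a minimal numpy sketch of the forward pass as I understand it (the dimension $K = 4$ and the random inputs are placeholders of my own choosing):

```python
import numpy as np

def sigmoid(t):
    # element-wise sigmoid
    return 1.0 / (1.0 + np.exp(-t))

K = 4                     # placeholder dimension
rng = np.random.default_rng(0)
x = rng.standard_normal(K)
A = rng.standard_normal((K, K))
B = rng.standard_normal((K, K))

y = A @ x                 # y = Ax
u = sigmoid(y)            # u = sigma(y), element-wise
v = B @ x                 # v = Bx
z = A @ (u * v)           # u * v is the element-wise product
w = A @ z                 # w = Az
L = w @ w                 # L = ||w||^2
```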
From what I understand, $\frac{\partial L}{\partial A} = \frac{\partial L}{\partial w} \frac{\partial w}{\partial A}$.
I'm not sure how to express $\frac{\partial w}{\partial A}$, since $z$ is itself a function of $A$. My guess would be something like $\frac{\partial w}{\partial A} = \frac{dA}{dA}\, z + A \,\frac{dz}{dA}$, but I am not sure whether this step calls for the product rule or the chain rule.
I'm also not sure how to express $\frac{\partial z}{\partial A}$. Any insights appreciated.
The first thing to do is to draw the underlying computation graph correctly, and then apply the chain rule according to that graph. In this network, $A$ feeds directly into three nodes ($y = Ax$, $z = A(u*v)$, and $w = Az$), while $B$ feeds only into $v = Bx$.
The following is the chain rule that you should remember: for any node $a$ in the graph, $$\frac{dL}{da} = \sum_{c \,\in\, \mathrm{children}(a)} \frac{dL}{dc}\,\frac{\partial c}{\partial a},$$ where the sum runs over the children of $a$, i.e. the nodes $c$ that depend directly on $a$.
Therefore, the chain rule applied to node $A$ gives $$\frac{dL}{dA} = \frac{dL}{dw}\frac{\partial w}{\partial A} + \frac{dL}{dz}\frac{\partial z}{\partial A} + \frac{dL}{dy}\frac{\partial y}{\partial A}.$$
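To make those three terms concrete, under one common convention (column vectors, with $\frac{dL}{dA}$ arranged to have the same shape as $A$), each term is an outer product of an upstream gradient with the vector that $A$ multiplies at that node: $$\frac{dL}{dA} = \frac{dL}{dw}\, z^\top + \frac{dL}{dz}\,(u*v)^\top + \frac{dL}{dy}\, x^\top, \qquad \frac{dL}{dw} = 2w.$$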
The only unknown quantities above are $\frac{dL}{dz}$ and $\frac{dL}{dy}$, which can be computed by applying the same chain rule to the nodes $z$ and $y$, respectively. This is precisely how backpropagation works.
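Spelled out for this particular graph, every node between $L$ and the leaves has a single child, so each chain-rule step has one term. Writing $s = u*v$ for the product node (my notation), and using $\sigma'(y) = u*(1-u)$ for the sigmoid, one gets $$\frac{dL}{dz} = A^\top \frac{dL}{dw}, \qquad \frac{dL}{ds} = A^\top \frac{dL}{dz}, \qquad \frac{dL}{du} = \frac{dL}{ds} * v, \qquad \frac{dL}{dy} = \frac{dL}{du} * u * (1-u).$$ Similarly, since $B$'s only child is $v = Bx$, $$\frac{dL}{dB} = \frac{dL}{dv}\, x^\top = \left(\frac{dL}{ds} * u\right) x^\top.$$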
Check my answer here for a more detailed explanation: https://math.stackexchange.com/a/3865685/31498. You should be able to fully understand backpropagation after reading that.
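As a sanity check (my own sketch, not part of the linked answer), the recursions above can be implemented in numpy and compared against finite differences:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(A, B, x):
    u = sigmoid(A @ x)
    z = A @ (u * (B @ x))
    w = A @ z
    return w @ w

K = 4
rng = np.random.default_rng(1)
x = rng.standard_normal(K)
A = rng.standard_normal((K, K))
B = rng.standard_normal((K, K))

# Forward pass, keeping every node of the graph.
y = A @ x
u = sigmoid(y)
v = B @ x
s = u * v            # the element-wise product node
z = A @ s
w = A @ z

# Backward pass: one chain-rule step per node.
dL_dw = 2 * w
dL_dz = A.T @ dL_dw
dL_ds = A.T @ dL_dz
dL_dy = (dL_ds * v) * u * (1 - u)

# dL/dA collects one outer product per occurrence of A; dL/dB has one.
dL_dA = np.outer(dL_dw, z) + np.outer(dL_dz, s) + np.outer(dL_dy, x)
dL_dB = np.outer(dL_ds * u, x)

# Finite-difference check on one entry of each matrix.
eps = 1e-6
A2 = A.copy(); A2[0, 1] += eps
B2 = B.copy(); B2[0, 1] += eps
print((loss(A2, B, x) - loss(A, B, x)) / eps, dL_dA[0, 1])  # should agree
print((loss(A, B2, x) - loss(A, B, x)) / eps, dL_dB[0, 1])  # should agree
```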