I am working on backpropagation through fully-connected layers; suppose this architecture:

My ultimate goal is to find the gradient of the loss function $C$ with respect to $\overrightarrow{a}$, given the gradient with respect to $\overrightarrow{z}$. That is:
$$\frac{\partial C}{\partial \overrightarrow{a}} = \frac{\partial \overrightarrow{z}}{\partial \overrightarrow{a}} \frac{\partial C}{\partial \overrightarrow{z}}$$
where the latter part is just the gradient: $$\frac{\partial C}{\partial \overrightarrow{z}}=\begin{bmatrix} \frac{\partial C}{\partial z_1} \\ \frac{\partial C}{\partial z_2} \end{bmatrix}$$
I calculated the Jacobian matrix for the first part:
$$\begin{align} \textbf{w} &= \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3}\\ w_{2,1} & w_{2,2} & w_{2,3} \end{bmatrix}\\ \overrightarrow{z} &= \mathbf{w}\cdot \overrightarrow{a} \\ &= \begin{bmatrix} w_{1,1}\,a_1 + w_{1,2}\,a_2+w_{1,3}\,a_3 \\ w_{2,1}\,a_1 + w_{2,2}\,a_2+w_{2,3}\,a_3 \end{bmatrix}\\\\ \textbf{J}(\overrightarrow{z}) &= \begin{bmatrix} \frac{\partial z_1}{\partial a_1} & \frac{\partial z_1}{\partial a_2} & \frac{\partial z_1}{\partial a_3} \\ \frac{\partial z_2}{\partial a_1} & \frac{\partial z_2}{\partial a_2} & \frac{\partial z_2}{\partial a_3} \end{bmatrix}\\ &= \begin{bmatrix} w_{1, 1} & w_{1,2} & w_{1,3} \\ w_{2, 1} & w_{2,2} & w_{2,3} \end{bmatrix} \end{align}$$
But when I try to dot them together, their dimensions are not compatible: $$\begin{bmatrix} w_{1, 1} & w_{1,2} & w_{1,3} \\ w_{2, 1} & w_{2,2} & w_{2,3} \end{bmatrix}\cdot \begin{bmatrix} \frac{\partial C}{\partial z_1} \\ \frac{\partial C}{\partial z_2} \end{bmatrix}$$
However, if I take the transpose of the Jacobian Matrix, they are compatible: $$\begin{align} \mathbf{J}^\rm T \cdot \begin{bmatrix} \frac{\partial C}{\partial z_1} \\ \frac{\partial C}{\partial z_2} \end{bmatrix} &= \begin{bmatrix} w_{1,1} & w_{2,1} \\ w_{1,2} & w_{2,2} \\ w_{1,3} & w_{2,3} \end{bmatrix} \cdot \begin{bmatrix} \frac{\partial C}{\partial z_1} \\ \frac{\partial C}{\partial z_2} \end{bmatrix}\\ &=\begin{bmatrix} \frac{\partial C}{\partial z_1} \cdot w_{1,1} + \frac{\partial C}{\partial z_2} \cdot w_{2,1} \\ \frac{\partial C}{\partial z_1} \cdot w_{1,2} + \frac{\partial C}{\partial z_2} \cdot w_{2,2} \\ \frac{\partial C}{\partial z_1} \cdot w_{1,3} + \frac{\partial C}{\partial z_2} \cdot w_{2,3} \end{bmatrix} \end{align}$$
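To make the mismatch and its fix concrete, here is a quick numpy sketch (the weight entries and the values of $\partial C/\partial \overrightarrow{z}$ are made up for illustration):

```python
import numpy as np

# Illustrative 2x3 weight matrix and upstream gradient dC/dz.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3): z = W @ a
dC_dz = np.array([0.5, -1.0])     # shape (2,)

# W @ dC_dz would fail: (2, 3) @ (2,) is incompatible.
# Transposing fixes the dimensions:
dC_da = W.T @ dC_dz               # shape (3,)

# Sanity check against finite differences, using C(z) = dC_dz . z
# as a stand-in loss whose gradient w.r.t. z is exactly dC_dz.
def C(a):
    return dC_dz @ (W @ a)

a0 = np.zeros(3)
eps = 1e-6
numeric = np.array([(C(a0 + eps * np.eye(3)[i]) - C(a0)) / eps
                    for i in range(3)])
assert np.allclose(numeric, dC_da)
```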
And it is quite intuitive: each neuron $a_i$ influences both output neurons $z_1$ and $z_2$, so we should sum up its total impact on both of them.
My question is: why does the transpose of the Jacobian matrix show up in this scenario? Did I do the calculations correctly? Is this related to the fact that the Jacobian stores the transposes of the gradients? Thank you!
The root of your confusion is actually a problem with multivariable calculus conventions at large: there are two competing systems for matrix derivatives and no agreement on which to use.
The denominator layout: if $z = f(a)$ with $z \in \mathbb{R}^2$ and $a \in \mathbb{R}^3$, then the denominator-layout Jacobian is:
$J= \begin{bmatrix} \frac{\partial z_1}{\partial a_1} & \frac{\partial z_2}{\partial a_1} \\ \frac{\partial z_1}{\partial a_2} & \frac{\partial z_2}{\partial a_2} \\ \frac{\partial z_1}{\partial a_3} & \frac{\partial z_2}{\partial a_3} \end{bmatrix} $
Notice that this produces exactly the transposed weight matrix that you wrote in the second half of your question:
$ J = \begin{bmatrix} w_{1,1} & w_{2,1} \\ w_{1,2} & w_{2,2} \\ w_{1,3} & w_{2,3} \end{bmatrix} $
Note that if we define the Jacobian this way, the multivariable chain rule follows a left-multiplication pattern, as you wrote: $ \frac{\partial C}{\partial{a}} = \frac{\partial{z}} {\partial{a}} \frac{\partial C}{\partial{z}} $
The alternative numerator layout has essentially everything transposed: the Jacobian is the transpose of the one written above, and the chain rule multiplies the partial derivatives on the right (i.e., in the reverse order from above), with each component's Jacobian transposed.
The choice of numerator vs. denominator layout makes no material difference, but it does help to stay consistent within one layout. ML papers tend to be written with data as column vectors and matrix calculus in the denominator layout, whereas most major ML implementations (e.g. anything built on numpy) treat data as row vectors and use numerator-layout Jacobians.
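To see that the two conventions are just different bookkeeping, here is a small numpy sketch (weights and upstream gradient are made-up values) computing the same backward pass both ways:

```python
import numpy as np

# Illustrative weights: z = W @ a for a column vector a.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
g = np.array([0.5, -1.0])         # upstream gradient dC/dz

# Denominator layout, column vectors:
# Jacobian dz/da = W^T (3x2), chain rule left-multiplies: J @ g.
col = W.T @ g                     # shape (3,)

# Numerator layout, row vectors:
# Jacobian dz/da = W (2x3), chain rule right-multiplies: g_row @ J.
row = g @ W                       # shape (3,)

# Same numbers, different layout conventions.
assert np.allclose(col, row)
```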