What is the gradient of a reduction operation on a Matrix?


Let's say that at some point we obtain a matrix A: [[2, 4], [6, 8]]. If we sum along axis 1, we obtain a vector B: [[6], [14]]. If the gradient w.r.t. B is [[7], [2]], then the gradient w.r.t. A is [[7, 7], [2, 2]].

I don't understand why we replicate the gradient along the reduced dimension, rather than, for instance, splitting it like [[3.5, 3.5], [1, 1]]. If someone has a mathematical explanation, or a link to some resources, it would be helpful.
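One way to convince yourself empirically is a finite-difference check: perturbing any single entry A[i, j] changes B[i] by the same amount, so every entry in row i inherits the full gradient of B[i]. A minimal NumPy sketch (the concrete scalar function `phi` below, a dot product with the given gradient vector, is my own choice for illustration):

```python
import numpy as np

A = np.array([[2., 4.], [6., 8.]])
g = np.array([7., 2.])  # given gradient w.r.t. B = A.sum(axis=1)

def phi(A):
    # A scalar function whose gradient w.r.t. B is exactly g.
    return g @ A.sum(axis=1)

# Finite-difference gradient w.r.t. each entry of A.
eps = 1e-6
grad_A = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy()
        Ap[i, j] += eps
        grad_A[i, j] = (phi(Ap) - phi(A)) / eps

print(grad_A)  # ≈ [[7., 7.], [2., 2.]] — g replicated, not split
```

Each perturbation of eps in A[i, j] moves B[i] by exactly eps, never by eps/2, which is why splitting the gradient would be wrong.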

My question comes from the following comment in the TensorFlow source code:

```
// The partial derivative for any input along a "reduced" dimension
// is just 1, so we only need replicate the output gradient on such a
// dimension to its "expanded" shape.
// Running example:
// input is
//   [[a, b, c],
//    [d, e, f]]
// reduction_indices = [1]
// Sum = [a + b + c, d + e + f]
// if the gradient is [g1, g2]
// We want the propagated gradient to be
//   [[g1, g1, g1],
//    [g2, g2, g2]]
```
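The comment describes the general recipe: reshape the incoming gradient to the "expanded" shape (size 1 on each reduced dimension), then broadcast it back to the input shape. A NumPy sketch of that recipe (the helper name `sum_grad` is mine, not TensorFlow's):

```python
import numpy as np

def sum_grad(grad_out, input_shape, axis):
    """Gradient of np.sum(x, axis=axis) given the gradient w.r.t. the output.

    Insert size-1 dims at the reduced axes (the "expanded" shape), then
    broadcast: every input element along a reduced dimension receives the
    same gradient, because its partial derivative through the sum is 1.
    """
    expanded = list(input_shape)
    for ax in np.atleast_1d(axis):
        expanded[ax] = 1
    return np.broadcast_to(np.reshape(grad_out, expanded), input_shape)

# Running example from the comment, with gradient [g1, g2] = [7, 2]:
g = np.array([7., 2.])
print(sum_grad(g, (2, 3), axis=1))
# [[7. 7. 7.]
#  [2. 2. 2.]]
```

Broadcasting does the replication for free, so no data is actually copied until the result is used.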

I initially asked the question on reddit but the answers don't help me.

BEST ANSWER

Assume we have been given a scalar function of the vector $b$ and its gradient, i.e. $$\phi = \phi(b), \qquad g = \frac{\partial \phi}{\partial b}$$ Suppose we are subsequently told that $b$ is actually a function of a matrix $A$, specifically $$b=A\,1\,\implies\,\,db=d\!A\,1$$ where $1$ is a vector with all elements equal to one; summing $A$ along axis 1 is exactly multiplication by this vector.

We can find the gradient w.r.t. $A$ by way of the differential: $$\eqalign{ d\phi &= g:db \cr &= g:d\!A\,1 \cr &= g1^T:d\!A \cr \frac{\partial \phi}{\partial A} &= g1^T \cr }$$ where a colon is used to represent the trace/Frobenius product, i.e. $$A:B = {\rm tr}(A^TB)$$ So the replication, as you call it, is due to the presence of the $1^T$ term.
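Numerically, $g1^T$ is just the outer product of $g$ with the all-ones vector, which reproduces the replicated matrix from the question. A quick check with the question's numbers:

```python
import numpy as np

g = np.array([[7.], [2.]])   # column vector: gradient w.r.t. b = A @ 1
ones = np.ones((2, 1))       # the all-ones vector 1

grad_A = g @ ones.T          # g 1^T
print(grad_A)
# [[7. 7.]
#  [2. 2.]]
```

Every column of $g1^T$ is a copy of $g$, one copy per reduced position, which is exactly the replication the TensorFlow comment describes.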