Let's say that at some point we obtain a matrix $$A = \begin{bmatrix} 2 & 4 \\ 6 & 8 \end{bmatrix}$$ and we sum along axis 1, obtaining the vector $$B = \begin{bmatrix} 6 \\ 14 \end{bmatrix}$$ If the gradient w.r.t. $B$ is $$\begin{bmatrix} 7 \\ 2 \end{bmatrix}$$ then the gradient w.r.t. $A$ is $$\begin{bmatrix} 7 & 7 \\ 2 & 2 \end{bmatrix}$$ I don't understand why we replicate the gradient along the reduced dimension, rather than, for instance, splitting it like $$\begin{bmatrix} 3.5 & 3.5 \\ 1 & 1 \end{bmatrix}$$ If someone has a mathematical explanation or a link to some resources, it would be helpful.
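As a quick numerical sanity check (my own sketch, not part of TensorFlow): a gradient of $[[7],[2]]$ w.r.t. $B$ means some downstream scalar $\phi$ has those partial derivatives, e.g. $\phi = 7 B_1 + 2 B_2$. Finite differences on that $\phi$ recover the replicated gradient, not the split one:

```python
import numpy as np

A = np.array([[2., 4.], [6., 8.]])
g_B = np.array([[7.], [2.]])  # given gradient w.r.t. B

def phi(A):
    B = A.sum(axis=1, keepdims=True)  # B = [[6], [14]]
    return float((g_B * B).sum())     # scalar with d(phi)/dB = g_B

# central finite-difference gradient w.r.t. each entry of A
eps = 1e-6
grad_A = np.zeros_like(A)
for i in range(2):
    for j in range(2):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        grad_A[i, j] = (phi(Ap) - phi(Am)) / (2 * eps)

print(grad_A)  # ≈ [[7, 7], [2, 2]], not [[3.5, 3.5], [1, 1]]
```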
My question comes from the following comment in the TensorFlow source code:
// The partial derivative for any input along a "reduced" dimension
// is just 1, so we only need replicate the output gradient on such a
// dimension to its "expanded" shape.
// Running example:
// input is
// [[a, b, c],
// [d, e, f]]
// reduction_indices = [1]
// Sum = [a + b + c, d + e + f]
// if the gradient is [g1, g2]
// We want the propagated gradient to be
// [[g1, g1, g1],
// [g2, g2, g2]]
I initially asked the question on Reddit, but the answers there didn't help me.
Assume we have been given a scalar function of the vector $b$ and its gradient, i.e. $$\phi = \phi(b), \qquad g = \frac{\partial \phi}{\partial b}$$ Let's assume we are subsequently told that $b$ is actually a function of a matrix $A$, specifically $$b = A\,1 \,\implies\, db = d\!A\,1$$ where $1$ is a vector with all elements equal to one.
We can find the gradient w.r.t. $A$ by way of the differential $$\eqalign{ d\phi &= g:db \cr &= g:d\!A\,1 \cr &= g1^T:d\!A \cr \frac{\partial \phi}{\partial A} &= g1^T }$$ where a colon is used to represent the trace/Frobenius product, i.e. $$A:B = {\rm tr}(A^TB)$$ So the replication, as you call it, is due to the presence of the $1^T$ term.
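The identity $\frac{\partial \phi}{\partial A} = g1^T$ can be checked directly on the asker's numbers; the outer product with the ones vector is exactly the replication (a minimal NumPy sketch):

```python
import numpy as np

g = np.array([7., 2.])      # gradient w.r.t. b = A @ ones
ones = np.ones(2)           # A has 2 columns
grad_A = np.outer(g, ones)  # g 1^T: each g_i fills row i
print(grad_A)
# [[7. 7.]
#  [2. 2.]]
```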