I was thinking about convolutions in neural networks and how to backpropagate through them, and I came across something I'd like clarified.
Suppose I'm using a very high-resolution image $X$ as the input to a convolution in my neural network.
That means the output image from this convolution would also be big. Now, we want to train the filter to minimize some cost $C$, and to do that we need the derivative of $C$ w.r.t. the filter, $F$, which would be:
$$
\frac{\partial C}{\partial F} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial F} = \frac{\partial C}{\partial z} \cdot \frac{\partial}{\partial F}[X * F]
$$
$$
\frac{\partial C}{\partial F} = X * \frac{\partial C}{\partial z}
$$
Where $z = X * F$ (i.e. our convolution operation).
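To make the setup concrete, here is a minimal NumPy sketch of this identity (the helper `conv_valid` and the toy cost are my own; I treat $*$ as "valid" cross-correlation, the convention most deep-learning libraries use for convolution). The analytic filter gradient $X * \frac{\partial C}{\partial z}$ matches a finite-difference estimate:

```python
import numpy as np

def conv_valid(X, K):
    """'Valid' cross-correlation (what most DL frameworks call convolution)."""
    h = X.shape[0] - K.shape[0] + 1
    w = X.shape[1] - K.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out

rng = np.random.default_rng(0)
X = rng.random((8, 8))                # the "image"
F = rng.random((3, 3))                # the filter
z = conv_valid(X, F)
dC_dz = rng.standard_normal(z.shape)  # arbitrary upstream gradient
C = np.sum(z * dC_dz)                 # toy cost whose gradient w.r.t. z is dC_dz

# Analytic gradient: dC/dF = X * dC/dz (valid cross-correlation),
# which has exactly the filter's shape.
dC_dF = conv_valid(X, dC_dz)

# Finite-difference check: perturb each filter entry and re-run the forward pass.
eps = 1e-6
num = np.zeros_like(F)
for i in range(F.shape[0]):
    for j in range(F.shape[1]):
        Fp = F.copy()
        Fp[i, j] += eps
        num[i, j] = (np.sum(conv_valid(X, Fp) * dC_dz) - C) / eps

assert np.allclose(dC_dF, num, atol=1e-4)
```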
Considering that $\frac{\partial C}{\partial z}$ is the same size as $z$, we're essentially convolving a big image with another (still) big image to find $F$'s gradient, and here is where my problem lies: if we convolve an image with another image of almost the same size, each output value is more or less just a weighted sum of all the values in the image.
$$
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix} * \begin{bmatrix}
1 & 4 & 7 \\
2 & 5 & 8 \\
3 & 6 & 9
\end{bmatrix} = \begin{bmatrix}
1+8+21+8+25+48+21+48+81
\end{bmatrix}
$$
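(A two-line check of the product above, which collapses to the single value $261$ because the two arrays are the same size:)

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = A.T  # the second matrix in the example is the transpose of the first

# A "valid" convolution of two equal-size arrays has exactly one output:
# the elementwise product summed over all positions.
out = np.sum(A * B)
print(out)  # 261
```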
Yes, $\frac{\partial C}{\partial z}$ is going to be slightly smaller than $X$, but the difference is only a few pixels, which is practically nothing at the scale of these high-resolution images. That means the values in the filter $F$ would all receive (almost) the same gradient, and the same update would be applied across the board. I doubt a filter updating like this could truly train, but maybe I'm wrong.
Is this an actual problem? If so, solution/s? If not, why?
Thanks
Backpropagating convolution with big image to train (comparably) small filter
Asked 2026-03-28 14:37:18 by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail) · 97 views

1
The difference between a few pixels compared to the size of these high-resolution images doesn't have to be "negligible". Let's look at a simpler problem first:
If we're finding the gradient of the filter $F$ using $X * \frac{\partial C}{\partial z}$, where $X$ is a very big image (and consequently so is $\frac{\partial C}{\partial z}$), then why don't these gradient values blow up? If, for example, $X$ were a grayscale image with each pixel's brightness in $[0.0, 1.0]$, and $\frac{\partial C}{\partial z}$ were arbitrary, you can imagine how, depending on the distribution of $\frac{\partial C}{\partial z}$, big convolutions summing this many values could blow up.
$$ \begin{bmatrix} 0.70 & 0.17 & 0.28 \\ 0.06 & 0.77 & 0.27 \\ 0.85 & 0.23 & 0.27 \end{bmatrix} * \begin{bmatrix} 0.23 & 0.59 \\ 0.46 & 0.64 \\ \end{bmatrix} = \begin{bmatrix} 0.7817 & 0.7313 \\ 1.0063 & 0.615 \end{bmatrix} $$
So, how come when convolving $X$ and $\frac{\partial C}{\partial z}$, where each value in the resulting output is a sum of thousands of terms, those terms don't add up to something very big? Well, if the distribution of $\frac{\partial C}{\partial z}$ has a mean of $0$, the positive and negative terms tend to cancel, so each output value also has an expected value of $0$, and its fluctuations grow only like the square root of the number of terms rather than the number of terms itself. The convolution therefore doesn't blow up.
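A quick simulation of this cancellation effect (the arrays and sizes here are illustrative, treating a single output entry of the big convolution as one large dot product):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000  # number of terms contributing to one output entry

x = rng.random(N)              # image values in [0, 1]
g_pos = rng.random(N)          # upstream gradient, all positive (mean ~0.5)
g_zero = g_pos - g_pos.mean()  # the same values shifted to mean 0

# All-positive upstream gradient: the sum grows linearly with N.
blown_up = np.dot(x, g_pos)

# Zero-mean upstream gradient: positive and negative terms cancel,
# leaving a value many orders of magnitude smaller.
cancelled = np.dot(x, g_zero)

print(blown_up, cancelled)
```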
This also resolves the original question: if $\frac{\partial C}{\partial z}$ has mean $0$, those thousands of shared terms largely cancel out instead of dominating, so the difference of a few pixels between neighbouring windows is no longer "negligible", and the filter entries receive genuinely different gradients.
Of course, this argument rests on the assumption that the upstream gradient has a mean of $0$, which is not a given. The failure mode is in fact well known as the "exploding/vanishing gradients" problem, and there are standard countermeasures, such as batch normalization (which shifts and scales values to mean $0$ and standard deviation $1$) and better weight initialization (such as Xavier initialization), among others.
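A minimal sketch of those two fixes (simplified on purpose: real batch normalization works per mini-batch and channel and learns an extra scale and shift, and the shapes below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Batch-norm-style standardization: shift to mean 0, scale to std 1.
g = rng.random((256, 256)) * 3 + 5          # badly-scaled values, mean ~6.5
g_norm = (g - g.mean()) / (g.std() + 1e-8)  # epsilon guards against division by zero

# Xavier (Glorot) initialization: draw weights with variance 2 / (fan_in + fan_out)
# so activations and gradients keep a stable scale across layers.
fan_in, fan_out = 9, 1  # e.g. a 3x3 filter feeding one output channel
W = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(3, 3))
```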
So yes, this can be a problem when the upstream gradient isn't normalized: the gradients can become unstable and explode (and the filter entries can all receive nearly identical updates, as observed in the question). It is a well-known problem, and it can be mitigated by several techniques, a few of which are mentioned above.