I am stuck trying to derive the backpropagation rule for Convolutional Neural Networks (CNNs). There are other ways to derive it, but let us focus on this particular approach. One thing to note is that most machine learning frameworks actually implement cross-correlation, which is what I am trying to understand. Nevertheless, we will stick to the convention of calling cross-correlation "convolution".
Suppose we are convolving an image $I$ with a kernel $K \in \mathbb{R}^{m \times n}$ in layer $l$. From the point of view of the CNN, the inputs to the convolution are the elements $x^{l - 1}$ of the preceding layer, to be convolved with the kernel elements $w^{l}$ of the current layer $l$. The result of the convolution is $z^l_{i,j}$ in layer $l$ (we will ignore the bias). Note that the kernel is always anchored at the top-left corner (not at the center) and that the indexing is one-based (that is why the $- 1$ appears below).
First, we define the convolutional operation (forward propagation):
\begin{align} {\big( (\boldsymbol{I} * \boldsymbol{K})(i,j) \big)}^l &= \sum_{m} \sum_{n} x^{l - 1}_{i + m - 1,j + n - 1} \cdot w^l_{m,n} \label{eq:cross_correlation_2d_1}\\ z^l_{i,j} &= {\big( (\boldsymbol{I} * \boldsymbol{K})(i,j) \big)}^l\label{eq:cross_correlation_3d_3} \end{align}
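For reference, the forward pass above can be sketched in NumPy. This is zero-indexed, so the one-based offsets $i + m - 1$ become plain zero-based offsets; the function name is my own, not from any framework:

```python
import numpy as np

def cross_correlate2d(x, w):
    """'Valid' 2D cross-correlation, i.e. what frameworks call convolution.
    x: input from layer l-1, shape (H, W); w: kernel of layer l, shape (a, b).
    Returns z of shape (H - a + 1, W - b + 1)."""
    H, W = x.shape
    a, b = w.shape
    z = np.empty((H - a + 1, W - b + 1))
    for i in range(H - a + 1):
        for j in range(W - b + 1):
            # Kernel anchored at the top-left corner of the patch, matching
            # the sum over x[i + m, j + n] * w[m, n] (zero-based indices).
            z[i, j] = np.sum(x[i:i + a, j:j + b] * w)
    return z
```

For example, a $4 \times 4$ input with a $2 \times 2$ kernel yields a $3 \times 3$ output.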
Now, to backpropagate the error (among other things, which are not treated here), and following the approach of Nielsen's book, we need to compute the partial derivative of the cost function $J$ w.r.t. $z^l$ in layer $l$ in terms of $z^{l + 1}$ in layer $l + 1$.
This is where it gets tricky.
More concretely, the goal is to compute the error $\delta^l_{m,n}$ in layer $l$, defined as the partial derivative of the cost function $J$ w.r.t. $z^l_{m,n}$.
\begin{equation} \delta^l_{m,n} = \frac{\partial J}{\partial z^l_{m,n}} \end{equation}
We need to 1) collect the partial derivatives of all units that $z^l_{m,n}$ affects and 2) apply the chain rule of calculus.
Sub-question 1: Is this correct? (Note the indices. To reiterate, we are summing over all elements of the output that are affected by this particular element of the input, in a sense inverting the convolution.)
\begin{equation} \delta^l_{m,n} = \sum_i \sum_j \frac{\partial J}{\partial z^{l + 1}_{m - i + 1,n - j + 1}} \cdot \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} \end{equation}
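As a sanity check, one can probe these partial derivatives numerically with finite differences. A zero-indexed NumPy sketch (the helper names are my own):

```python
import numpy as np

def doutput_dinput(zl, w, m, n, eps=1e-6):
    """Finite-difference probe: perturb the single element z^l[m, n] and
    observe how every element of z^{l+1} changes. The result approximates
    the matrix of partials d z^{l+1}_{i,j} / d z^l_{m,n} (zero-indexed)."""
    def cc(x):  # 'valid' cross-correlation of x with w
        H, W = x.shape
        a, b = w.shape
        return np.array([[np.sum(x[i:i + a, j:j + b] * w)
                          for j in range(W - b + 1)]
                         for i in range(H - a + 1)])
    zp = zl.copy()
    zp[m, n] += eps
    return (cp := cc(zp) - cc(zl)) / eps
```

Empirically, the nonzero entries of this matrix sit exactly at the outputs whose receptive field contains $(m, n)$, and entry $(i, j)$ equals the single weight $w_{m-i,\,n-j}$ (zero-based), the counterpart of the one-based $w^{l+1}_{m - i + 1,\,n - j + 1}$.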
Now we can substitute the first factor on the right by its definition, $\delta^{l + 1}_{m - i + 1,n - j + 1}$ (it does not actually matter for this discussion, because it is considered a known quantity, but let us keep it for completeness):
\begin{equation} \delta^l_{m,n} = \sum_i \sum_j \delta^{l + 1}_{m - i + 1,n - j + 1} \cdot \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} \end{equation}
Sub-question 2: Is this correct? Let us focus on the second factor on the right and expand it:
\begin{align} \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} &= \frac{\partial \big( \sum_i \sum_j z^{l}_{(m - i + 1) + i - 1,(n - j + 1) + j - 1} \cdot w^{l + 1}_{i,j} \big)}{\partial z^l_{m,n}} \end{align}
Sub-question 3: Is this correct? Simplifying the indices, e.g., $z^{l}_{(m - i + 1) + i - 1,(n - j + 1) + j - 1} = z^{l}_{m,n}$:
\begin{align} \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} &= \frac{\partial \big( \sum_i \sum_j z^{l}_{m,n} \cdot w^{l + 1}_{i,j} \big)}{\partial z^l_{m,n}} \end{align}
Sub-question 4: Is this correct? Because $z^l_{m,n}$ does not carry the indices $i,j$, we can take it out of the double summation:
\begin{align} \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} &= \frac{\partial z^{l}_{m,n} \cdot \big( \sum_i \sum_j w^{l + 1}_{i,j} \big)}{\partial z^l_{m,n}}, \end{align}
Sub-question 5: Is this correct? From the above it would follow that
\begin{align} \frac{\partial z^{l + 1}_{m - i + 1,n - j + 1}}{\partial z^l_{m,n}} &= \sum_i \sum_j w^{l + 1}_{i,j}. \end{align}
We now plug this into the equation of Sub-question 1 (in its $\delta^{l + 1}$ form):
\begin{equation} \delta^l_{m,n} = \sum_i \sum_j \delta^{l + 1}_{m - i + 1,n - j + 1} \cdot \big( \sum_i \sum_j w^{l + 1}_{i,j} \big). \end{equation}
This looks very wild!
Sub-question 6: Is there such a thing as a double summation of a double summation with the same indices? If so, how can it be reduced?
The expected result is some kind of convolution (strided, valid, ...; conceptually not that important) of the error in layer $l + 1$ with a $180°$-rotated kernel $\operatorname{rot}_{180}(K)$, i.e., something of the form $\sum_\cdot \sum_\cdot \delta^{l + 1}_{\cdot,\cdot} \cdot w^{l + 1}_{\cdot,\cdot}$.
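For what it is worth, that expected result can be checked numerically. A zero-indexed NumPy sketch (function name is my own) that computes $\delta^l$ as the "full" cross-correlation of $\delta^{l+1}$ with the $180°$-rotated kernel:

```python
import numpy as np

def backprop_delta(delta_next, w):
    """delta^l from delta^{l+1}: 'full' cross-correlation of delta^{l+1}
    with the 180-degree-rotated kernel (zero-indexed sketch)."""
    a, b = w.shape
    wr = w[::-1, ::-1]                             # rot180(K)
    p = np.pad(delta_next, ((a - 1,), (b - 1,)))   # zero-pad for 'full' mode
    H, W = p.shape
    return np.array([[np.sum(p[i:i + a, j:j + b] * wr)
                      for j in range(W - b + 1)]
                     for i in range(H - a + 1)])
```

With a toy cost $J = \sum_{i,j} \delta^{l+1}_{i,j} \, z^{l+1}_{i,j}$ (so that $\partial J / \partial z^{l+1} = \delta^{l+1}$ by construction), finite differences of $J$ w.r.t. each $z^l_{m,n}$ agree with the output of this function.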
Help with the sub-questions and the final result would be much appreciated.