I'm currently studying deep learning with the book Deep Learning (Goodfellow et al., 2016) and have a question regarding the convolution operation in convolutional neural networks (CNNs).
More specifically, on page 337 (in section 9.5: Variants of the Basic Convolution Function) the authors give a mathematical description of what the convolution operation looks like in the context of neural networks as follows:
Let $\mathsf{K}$ be a 4-D tensor with element $K_{i, j, k, l}$ giving the connection strength between a unit in channel $i$ of the output and a unit in channel $j$ of the input, with an offset of $k$ rows and $l$ columns between the output unit and input unit.
Assume our input consists of observed data $\mathsf{V}$ with element $V_{i, j, k}$ giving the value of the input unit within channel $i$ at row $j$ and column $k$.
Assume our output consists of $\mathsf{Z}$ with the same format as $\mathsf{V}$. If $\mathsf{Z}$ is produced by convolving $\mathsf{K}$ across $\mathsf{V}$ without flipping $\mathsf{K}$, then:
$$ Z_{i,\ j,\ k} = \sum_{l,\ m,\ n} V_{l,\ j + m - 1,\ k + n - 1} \times K_{i,\ l,\ m,\ n}$$
where the summation over $l$, $m$, and $n$ is over all values for which the tensor indexing operations inside the summation are valid.
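To make sure I'm reading the equation correctly, here is a minimal NumPy sketch (my own code, not the book's) that implements it with 0-based indices, so the book's $j + m - 1$ becomes `j + m`:

```python
import numpy as np

def convolve(V, K):
    """Implements Z[i,j,k] = sum_{l,m,n} V[l, j+m, k+n] * K[i,l,m,n]
    (a 0-based version of the book's equation).

    V: input, shape (in_channels, H, W)
    K: kernel, shape (out_channels, in_channels, kh, kw)
    """
    out_ch, in_ch, kh, kw = K.shape
    _, H, W = V.shape
    # Summing only over indices where V[l, j+m, k+n] is valid
    # makes the output smaller than the input.
    Z = np.zeros((out_ch, H - kh + 1, W - kw + 1))
    for i in range(out_ch):
        for j in range(H - kh + 1):
            for k in range(W - kw + 1):
                # dot product between filter i and the patch at (j, k),
                # summed over all input channels l at once
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z
```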
It's not hard to see that the equation describes a dot product between a filter and a "patch" of the input data. But I don't quite understand how the authors arrived at this equation. My original understanding of the convolution operation for neural networks comes from this demo provided by Stanford University's CS231n course.
I suppose that if I were to summarize what's particularly puzzling me, it would be as follows:
1. My understanding of convolution is that the number of channels in the input and in a filter (or kernel) must match, and that each channel typically convolves only with its corresponding channel. However, the definition given by the authors (i.e. "between a unit in channel $i$ of the output and a unit in channel $j$ of the input") seems to imply that the channel counts need not match and that there may be convolution between different channels of the input and the filter. Is cross-convolution across different channels possible?
2. What is meant by "an offset of $k$ rows and $l$ columns between the output unit and input unit" in the first paragraph? Is this simply referring to the filter's size (i.e. $k \times l$)?
3. The authors also seem to assume that the output $\mathsf{Z}$ (i.e. the result of the dot product between a filter and an input patch) and the input $\mathsf{V}$ have the same size. However, as I understand it, after convolving an input with a filter, the output feature map is always smaller than the input. Why do the authors assume the sizes are the same? Is this even possible?
Your formula means
$$Z_i = \sum_l V_l \ast K_{i,l}$$
(where $\ast$ denotes 2-D convolution over your $j, k$ variables, and $Z_i$, $V_l$, $K_{i,l}$ are 2-D arrays)
A convolution is a weighted sum of shifted copies of the input; the "offset" is the shift.
So it is the same formula as for a fully connected (linear) layer, except that the activation of each neuron is a 2-D array (and likewise for the weights), and instead of multiplying an activation by a weight we convolve them.
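Here is a quick numerical sketch of this reading (assuming NumPy and SciPy are available; `correlate2d` is used because the book's formula does not flip $\mathsf{K}$, i.e. it is what signal processing calls cross-correlation):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 5, 5))     # input: 3 channels, each 5x5
K = rng.standard_normal((2, 3, 2, 2))  # kernel: 2 output channels, 3 input channels

# Z_i = sum_l (V_l correlated with K_{i,l}): one 2-D cross-correlation
# per (output channel, input channel) pair, summed over input channels.
# mode='valid' keeps only positions where the whole kernel fits inside
# the input, so each output map is 4x4 -- smaller than the 5x5 input.
Z = np.stack([
    sum(correlate2d(V[l], K[i, l], mode='valid') for l in range(V.shape[0]))
    for i in range(K.shape[0])
])
```

This also answers the size question: restricting the sum to valid indices is exactly "valid" convolution, which shrinks the output; the book's phrasing just leaves padding implicit.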