Convolution Layer CS231n


I am watching the cs231n stanford convolution series and have a question about the convolution part.

The image below shows a slide for the convolution part. The filter in this case is 3 dimensional.

[Slide: convolution of an input image with a 3-dimensional filter]

Question 1: Why is the filter 3 dimensional? Shouldn't a filter just be a single 2-D slice, which we then dot with each of the 3 layers of the input image?

e.g. Output of that single neuron = filter dotted with Red channel + filter dotted with Blue channel + filter dotted with Green channel.

Question 2: If the filter is 3 dimensional, is the math for the convolution as follows:

Output of that single neuron = 1st dimension filter dot with Red Channel + 2nd dimension filter dot with Blue Channel + 3rd dimension filter dot with Green Channel.

Am I doing the convolution math correctly? Thank you.

Answer:
Let the input image be denoted $X$ of dimension $w \times h \times d$, and we have $n$ filters, denoted $F_i$, each of dimension $f \times f \times d$. The filter is always taken to have the same depth as the input image.

Applying these filters results in an output response map, $S$, which is a stack $S = \{ S_1, S_2, \dots, S_n\}$ where

$$ S_i = X * F_i ~~~ (+b_i) $$

where $*$ is the convolution operator, and $b_i$ is an optional bias term (a single scalar added to every entry of $S_i$). So, with stride $1$ and no padding, the dimension of each $S_i$ is $(w-f+1) \times (h-f+1)$.

The filter is three dimensional so that each resulting response map $S_i$ has depth $1$: one filter collapses all $d$ input channels into a single output channel. The scheme in your Question 1, reusing a single depth-$1$ filter on every channel, is just the special case in which all $d$ slices of the filter are identical; in general each slice gets its own weights. The math in your Question 2 is correct: the per-channel dot products are summed to give the output of that single neuron.
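To see that the depth-$d$ convolution is exactly the channel-wise sum from Question 2, here is a minimal NumPy sketch (array names are illustrative; as in CS231n, "convolution" here means cross-correlation, i.e. the filter is not flipped):

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, d, f = 6, 6, 3, 3
X = rng.standard_normal((w, h, d))   # input image of depth d
F = rng.standard_normal((f, f, d))   # one filter, same depth as X

def conv2d(img, filt):
    """Valid cross-correlation of a 2-D image with a 2-D filter, stride 1."""
    out = np.empty((img.shape[0] - filt.shape[0] + 1,
                    img.shape[1] - filt.shape[1] + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+filt.shape[0], j:j+filt.shape[1]] * filt)
    return out

# Depth-d convolution: slide the full f x f x d block and sum everything.
S_3d = np.empty((w - f + 1, h - f + 1))
for i in range(S_3d.shape[0]):
    for j in range(S_3d.shape[1]):
        S_3d[i, j] = np.sum(X[i:i+f, j:j+f, :] * F)

# Question 2's formula: per-channel 2-D convolutions, then add the channels.
S_sum = sum(conv2d(X[:, :, c], F[:, :, c]) for c in range(d))

print(np.allclose(S_3d, S_sum))  # True: the two computations agree
```

The two loops compute the same numbers, so summing per-channel dot products and sliding one 3-D block are interchangeable descriptions of the same operation.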

To figure out how to do the convolution, let's focus on the easy case in which $d=1$, and take $f=3$ and $w = h = 6$. In this case, the dimension of $S$ is $4 \times 4 \times 1$.

Then our convolution would look something like:

$$ S = \begin{bmatrix} X_{11} & X_{12} & X_{13} & \dots & X_{16} \\ X_{21} & X_{22} & X_{23} & \dots & X_{26} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{61} & X_{62} & X_{63} & \dots & X_{66} \\ \end{bmatrix} * \begin{bmatrix} F_{11} & F_{12} & F_{13} \\ F_{21} & F_{22} & F_{23} \\ F_{31} & F_{32} & F_{33} \\ \end{bmatrix} $$

So we have to pass the $F$ matrix over the $3 \times 3$ submatrices of $X$; the first such placement gives us the sum of elementwise products:

$$ S_{11}= \sum \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \\ \end{bmatrix} \odot \begin{bmatrix} F_{11} & F_{12} & F_{13} \\ F_{21} & F_{22} & F_{23} \\ F_{31} & F_{32} & F_{33} \\ \end{bmatrix} = X_{11}F_{11} + X_{12}F_{12} + \dots + X_{33} F_{33} $$

where $\odot$ denotes the elementwise product.
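A short NumPy sketch of this $d=1$ case (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 6))  # w = h = 6, d = 1
F = rng.standard_normal((3, 3))  # f = 3

# S_11: elementwise product of the top-left 3x3 submatrix with F, summed.
S11 = np.sum(X[0:3, 0:3] * F)

# The full response map: slide F over every 3x3 submatrix of X.
S = np.array([[np.sum(X[i:i+3, j:j+3] * F) for j in range(4)]
              for i in range(4)])

print(S.shape)                   # (4, 4)
print(np.isclose(S[0, 0], S11))  # True
```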

Now, we can write this as the matrix multiplication:

$$ \begin{bmatrix} S_{11} \\ S_{12}\\ S_{13}\\ S_{14}\\ S_{21}\\ \vdots\\ S_{44} \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} & X_{13} & X_{21}& X_{22}& X_{23} & X_{31}& X_{32}& X_{33}\\ X_{12} & X_{13} & X_{14} &X_{22} & X_{23} & X_{24} & X_{32} & X_{33} & X_{34} \\ X_{13} &&&&\dots \\ X_{14} &&&&\dots \\ X_{21} &&&&\dots \\ \vdots &&&&\dots\\ X_{44} &&&&\dots \end{bmatrix} \begin{bmatrix} F_{11} \\ F_{12}\\ F_{13}\\ F_{21}\\ F_{22}\\ F_{23}\\ F_{31}\\ F_{32}\\ F_{33}\\ \end{bmatrix} $$

or more succinctly:

$$ \text{vec}(S) = M_X \text{vec}(F) $$

where $M_X$ is the $16 \times 9$ matrix above. Note that the first row of $M_X$ is the first $f \times f$ submatrix of $X$ flattened out; the second row is the next submatrix, obtained by sliding the window one step to the right, flattened out, and so on. In general, when $d=1$, the dimension of $M_X$ is $(w-f+1)(h-f+1) \times f^2$.
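This construction (often called im2col) can be sketched in NumPy and checked against the direct sliding-window computation; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
w = h = 6
f = 3
X = rng.standard_normal((w, h))
F = rng.standard_normal((f, f))

# Build M_X: one row per output position, each row a flattened f x f submatrix.
rows = []
for i in range(w - f + 1):
    for j in range(h - f + 1):
        rows.append(X[i:i+f, j:j+f].ravel())
M_X = np.stack(rows)                   # shape (16, 9)

vec_S = M_X @ F.ravel()                # vec(S) = M_X vec(F)
S = vec_S.reshape(w - f + 1, h - f + 1)

# Check against the direct sliding-window computation.
S_direct = np.array([[np.sum(X[i:i+f, j:j+f] * F) for j in range(h - f + 1)]
                     for i in range(w - f + 1)])
print(M_X.shape)                # (16, 9)
print(np.allclose(S, S_direct)) # True
```

The row-major `ravel` of each submatrix matches the order $X_{11}, X_{12}, X_{13}, X_{21}, \dots$ used in the equation above, so the same flattening must be applied to $F$.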

Next, in the depth-$d$ case, say $d=3$ for example, each placement of the filter involves $d$ times as many multiplications as the depth-$1$ case. So each row of $M_X$ is now of length $df^2$, i.e. $M_X$ has dimension $(w-f+1)(h-f+1) \times df^2$.
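Extending the same sketch to depth $d=3$ only changes the row length (again, names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
w = h = 6
f, d = 3, 3
X = rng.standard_normal((w, h, d))
F = rng.standard_normal((f, f, d))

# Each row now flattens an f x f x d block, so rows have length d * f**2.
M_X = np.stack([X[i:i+f, j:j+f, :].ravel()
                for i in range(w - f + 1)
                for j in range(h - f + 1)])

S = (M_X @ F.ravel()).reshape(w - f + 1, h - f + 1)

# Check against the direct depth-d sliding-window computation.
S_direct = np.array([[np.sum(X[i:i+f, j:j+f, :] * F)
                      for j in range(h - f + 1)]
                     for i in range(w - f + 1)])
print(M_X.shape)                # (16, 27), i.e. (w-f+1)(h-f+1) x d f^2
print(np.allclose(S, S_direct)) # True
```

Both `M_X`'s rows and `F.ravel()` flatten in the same C order, so the dot product of a row with the flattened filter reproduces the 3-D sum exactly.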

Note that I have left out the bias term in this discussion to avoid clogging up the presentation, but incorporating it should be straightforward.