How do I propagate errors back to the previous layers for convolutional neural network?


I'm trying to figure out the backward propagation for convolutional neural network.

I drew the following figure to illustrate the forward propagation of a 3-layer model. Someone might argue that the flatten step should be counted as a separate layer; that would only make me write a few more lines of derivation without adding anything useful.

[Figure: forward propagation through the 3-layer model]

To simplify the whole process, I decided to consider binary classification and just one training example (a single input image), so the output of the 1st layer would be

$a^{[1]} = ReLU (x * w^{[1]} + b^{[1]}) \tag{1}$

where the $*$ denotes the convolution operation, $w^{[1]}$ denotes the kernel/filter used in this convolution operation, $b^{[1]}$ denotes the bias/intercept for layer[1].

Given that the input image $x$ has size (8, 8) and the filter has size (3, 3), the output of layer[1], $a^{[1]}$, has size (6, 6).

$m$ denotes the number of training examples, so $m=1$ here, and the last dimension of (m, 8, 8, 1) is the number of color channels, which is also 1 here.

There are 10 learnable params in layer[1]: 9 for the kernel and 1 for the bias. The activation function in this model is ReLU.
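As a sanity check on equation (1) and the shapes above, here is a minimal NumPy sketch (the function name `conv2d_valid` and the random values are mine, not from any library; like most deep-learning frameworks, it actually computes cross-correlation):

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid'-mode cross-correlation of a 2-D image x with kernel w
    (what deep-learning libraries call convolution)."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # the input image
w1 = rng.standard_normal((3, 3))     # 9 kernel weights
b1 = 0.1                             # 1 bias -> 10 params total
a1 = np.maximum(conv2d_valid(x, w1) + b1, 0.0)   # ReLU, eq. (1)
print(a1.shape)   # (6, 6)
```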

There are no learnable params and no activation function in layer[2] (max pooling); nevertheless, I still use $a$ to denote its output. $a^{[2]}$ has size (3, 3), since the pooling window is (2, 2) with stride 2.
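The pooling step can be sketched like this (my own toy code, assuming non-overlapping 2x2 windows with stride 2, matching equation (11) below):

```python
import numpy as np

def maxpool2x2(a):
    """Non-overlapping 2x2 max pooling with stride 2."""
    H, W = a.shape
    out = np.zeros((H // 2, W // 2))
    for p in range(out.shape[0]):
        for r in range(out.shape[1]):
            out[p, r] = a[2*p:2*p+2, 2*r:2*r+2].max()
    return out

a1 = np.arange(36, dtype=float).reshape(6, 6)   # toy (6, 6) input
a2 = maxpool2x2(a1)
print(a2.shape)   # (3, 3)
print(a2[0, 0])   # 7.0 -- max of the top-left window {0, 1, 6, 7}
```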

$a^{[3]}$ denotes the output of the last layer of the model and could be computed using this formula

$a^{[3]} = \sigma{(z^{[3]})} \tag{2}$

where $\sigma{(\cdot)}$ denotes the sigmoid function and $z^{[3]} = a^{[2]} \cdot w^{[3]} + b^{[3]} \tag{3}$

So there are another 10 params to learn: 9 for $w^{[3]}$ and 1 for $b^{[3]}$.
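Equations (2)–(3) can be sketched as follows (my own toy code, assuming $a^{[2]}$ is flattened to a length-9 vector and $w^{[3]}$ is a length-9 weight vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
a2 = rng.standard_normal((3, 3))   # output of the pooling layer
w3 = rng.standard_normal(9)        # 9 weights
b3 = 0.05                          # 1 bias -> 10 params total
z3 = a2.reshape(-1) @ w3 + b3      # flatten, then eq. (3)
a3 = sigmoid(z3)                   # eq. (2)
print(0.0 < a3 < 1.0)   # True -- a valid probability for binary classification
```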

The loss function could be

$$ \mathcal{L}(a^{[3]}, y) = - [y \log a^{[3]} + (1-y) \log(1-a^{[3]})] \tag{4}$$

I clearly understand part of the backward propagation for this model, which is

$$ \dfrac{d\mathcal{L}(a^{[3]}, y)}{da^{[3]}} = -\dfrac{y}{a^{[3]}} + \dfrac{1-y}{1-a^{[3]}} \tag{5} $$

$$ \dfrac{da^{[3]}}{dz^{[3]}} = a^{[3]}\,(1-a^{[3]}) \tag{6} $$

\begin{align} \frac{d\mathcal{L}(a^{[3]}, y)}{dz^{[3]}} = a^{[3]} - y \tag{7} \end{align}
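Equation (7) is just the product of (5) and (6) via the chain rule, which is easy to verify numerically (my own sketch, with a finite-difference check of $d\mathcal{L}/dz^{[3]}$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z3, y = 0.3, 1.0
a3 = sigmoid(z3)
dL_da3 = -y / a3 + (1 - y) / (1 - a3)   # eq. (5)
da3_dz3 = a3 * (1 - a3)                 # eq. (6)
analytic = dL_da3 * da3_dz3             # chain rule -> should equal eq. (7)
print(np.isclose(analytic, a3 - y))     # True

# finite-difference check of dL/dz3
def loss(z):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

eps = 1e-6
numeric = (loss(z3 + eps) - loss(z3 - eps)) / (2 * eps)
print(np.isclose(numeric, a3 - y, atol=1e-6))   # True
```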

\begin{align} \frac{d\mathcal{L}(a^{[3]}, y)}{da^{[2]}} = \frac{d\mathcal{L}(a^{[3]}, y)}{dz^{[3]}} \cdot w^{[3]T} \tag{8} \end{align}

The $a^{[2]} \cdot w^{[3]}$ part of equation (3) could be expanded as

$ a^{[2]} \cdot w^{[3]} = a^{[2]}_{1,1} \ w^{[3]}_{1} + a^{[2]}_{1,2} \ w^{[3]}_{2}+ a^{[2]}_{1,3} \ w^{[3]}_{3} \\ + a^{[2]}_{2,1} \ w^{[3]}_{4} + a^{[2]}_{2,2} \ w^{[3]}_5+ a^{[2]}_{2,3} \ w^{[3]}_{6} \\ + a^{[2]}_{3,1} \ w^{[3]}_{7} + a^{[2]}_{3,2} \ w^{[3]}_{8}+ a^{[2]}_{3,3} \ w^{[3]}_{9} \tag{9} $

where $ a^{[2]}_{1,1} = \max(a^{[1]}_{1,1}, a^{[1]}_{1,2}, a^{[1]}_{2,1}, a^{[1]}_{2,2}) \tag{10} $

and, more generally, $ a^{[2]}_{p,r} = \max(a^{[1]}_{2p-1,2r-1}, a^{[1]}_{2p-1,2r}, a^{[1]}_{2p,2r-1}, a^{[1]}_{2p,2r}) \tag{11} $

How do I propagate errors back to layer[2] and [1]?

I'm now blocked at the step of propagating errors from $a^{[2]}$ to $a^{[1]}$.

If $a^{[1]}_{1,1}$ is the maximum in equation (10), is it correct to use the formula below to compute $ \dfrac{d\mathcal{L}(a^{[3]}, y)}{da^{[1]}} $?

$ \dfrac{\partial \mathcal{L}(a^{[3]}, y)}{\partial a^{[1]}_{1,1}} = \dfrac{\partial \mathcal{L}(a^{[3]}, y)}{\partial a^{[2]}_{1,1}} = \dfrac{\partial \mathcal{L}(a^{[3]}, y)}{\partial z^{[3]}} \ w^{[3]}_1 = (a^{[3]} - y)\ w^{[3]}_1 $

If yes, what about the gradients for the other three entries in the pooling window?
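To make the question concrete, here is a sketch of the routing I have in mind (my own toy code, assuming the upstream gradient goes entirely to the argmax of each window and the other three entries get zero):

```python
import numpy as np

def maxpool2x2_backward(a1, dL_da2):
    """Route each upstream gradient dL/da2[p, r] to the argmax of the
    corresponding 2x2 window of a1; all other entries get zero."""
    dL_da1 = np.zeros_like(a1)
    for p in range(dL_da2.shape[0]):
        for r in range(dL_da2.shape[1]):
            win = a1[2*p:2*p+2, 2*r:2*r+2]
            i, j = np.unravel_index(np.argmax(win), win.shape)
            dL_da1[2*p + i, 2*r + j] = dL_da2[p, r]
    return dL_da1

a1 = np.array([[1., 5.], [2., 3.]])              # a single 2x2 window
g = maxpool2x2_backward(a1, np.array([[7.]]))    # upstream gradient = 7
print(g)   # [[0. 7.]
           #  [0. 0.]] -- only the max entry (5.) receives the gradient
```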

I went through a post on this, but I still don't know how to complete the formulas after (8). Could someone give me a hint?

Note: I know what the convolution operation and cross-correlation are in deep learning. I'd just like to know the backward propagation part.

Best answer:

Maybe these will be helpful:

Convolutional layers are fairly simple: they are basically a set of weights dragged across the image plane. This means you first need to work out the gradient for a single patch of the image. Once you've done that, the gradient with respect to the convolutional weights is the sum of those per-patch gradients over all patches of the image.
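The sum-over-patches idea can be sketched like this (my own toy NumPy code, not the asker's model; `conv2d_valid` is a hypothetical helper implementing 'valid'-mode cross-correlation, and the check uses a finite difference on one weight):

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid'-mode cross-correlation of image x with kernel w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def conv2d_weight_grad(x, dL_dout, kshape):
    """dL/dw: each output position (i, j) contributes
    dL_dout[i, j] times its input patch; sum over all patches."""
    dw = np.zeros(kshape)
    for i in range(dL_dout.shape[0]):
        for j in range(dL_dout.shape[1]):
            dw += dL_dout[i, j] * x[i:i+kshape[0], j:j+kshape[1]]
    return dw

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
dL_dout = rng.standard_normal((6, 6))   # a made-up upstream gradient
dw = conv2d_weight_grad(x, dL_dout, w.shape)

# finite-difference check on a single weight, w[1, 2]
eps = 1e-6
wp, wm = w.copy(), w.copy()
wp[1, 2] += eps
wm[1, 2] -= eps
numeric = np.sum(dL_dout * (conv2d_valid(x, wp) - conv2d_valid(x, wm))) / (2 * eps)
print(np.isclose(numeric, dw[1, 2]))    # True
```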