Derivation of back-propagation equation $\frac{\partial E(\theta)}{\partial W^k}=x*\delta h^k+\tilde{h}^k*\delta y$ for convolutional autoencoders


I was reading the following paper on stacked convolutional auto-encoders, and they had the following convolutional neural network (for auto-encoders; note that I didn't write the offset terms as part of the network, to avoid clutter, but they are present):

[Figure: the convolutional auto-encoder network from the paper.]

where the first convolution is a valid convolution and the second is a full convolution (in MATLAB terminology). Also, note that there is usually no flipping/reversing of the kernel in the context of convolutional neural networks. Recall that the only difference between the two convolutions is that the valid one only convolves (in a deep neural sense) with valid elements of the matrix (i.e. it does not go outside the boundary of the image), while the full one does go outside the boundary and assumes the image is zero there. Hence they produce matrices of different dimensions: for an $m \times m$ image and an $n \times n$ filter, valid produces $(m-n+1) \times (m-n+1)$ and full produces $(m+n-1) \times (m+n-1)$. The reason the first one is valid is to get a compressed representation of $x$; the reason the second one is full is to get a proper reconstruction (at least in the sense that it has the correct dimensions/size).
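As a concrete check of these dimension claims, here is a small sketch using `scipy.signal.convolve2d`, which supports the same MATLAB-style `valid`/`full` modes (note that `convolve2d` does flip the kernel, unlike the cross-correlation typically used in CNNs, but the output shapes are the same either way):

```python
import numpy as np
from scipy.signal import convolve2d

m, n = 5, 3                       # image is m x m, filter is n x n
x = np.random.rand(m, m)          # input image
W = np.random.rand(n, n)          # filter

valid = convolve2d(x, W, mode='valid')  # stays inside the image boundary
full = convolve2d(x, W, mode='full')    # zero-pads outside the boundary

print(valid.shape)  # (m-n+1, m-n+1) -> (3, 3)
print(full.shape)   # (m+n-1, m+n-1) -> (7, 7)
```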

The above network results in the following output:

$$ y = \sigma(\sum_{k} h^k * \tilde{W}^k + c) $$

$$ y = \sigma(\sum_{k} \sigma(x * W^k + b^k)* \tilde{W}^k + c ) $$
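For concreteness, here is a minimal numpy sketch of this forward pass for a single input channel and $K$ feature maps, taking $\tilde{W}^k$ to be $W^k$ flipped over both dimensions (as I believe the paper intends), with $\sigma$ the logistic function; the filter sizes here are just illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic activation

rng = np.random.default_rng(0)
m, n, K = 5, 3, 4                     # image m x m, filters n x n, K feature maps
x = rng.standard_normal((m, m))
W = rng.standard_normal((K, n, n))
b = rng.standard_normal(K)            # one scalar offset per feature map
c = 0.0

# encoder: valid convolution -> compressed (m-n+1) x (m-n+1) maps h^k
h = np.array([sigma(convolve2d(x, W[k], mode='valid') + b[k]) for k in range(K)])

# decoder: full convolution with the flipped filters W~^k -> back to m x m
W_tilde = W[:, ::-1, ::-1]            # W~ = W flipped in both dimensions (assumption)
y = sigma(sum(convolve2d(h[k], W_tilde[k], mode='full') for k in range(K)) + c)

print(h.shape)  # (K, m-n+1, m-n+1) -> (4, 3, 3)
print(y.shape)  # (m, m) -> (5, 5): same size as the input, as desired
```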

What I am interested in is finding the gradient of the squared-error loss $E(\theta) = \frac{1}{2n} \sum^n_{i=1}(x_i - y_i)^2$ with respect to the matrix weights $W^k$ (and the offsets $b^k$). In the paper they claim that the equation is:

$$\frac{\partial E(\theta)}{\partial W^k}=x*\delta h^k+\tilde{h}^k*\delta y$$

However, I am not sure how they arrived at that equation. What is the derivation? Since the gradient uses partial derivatives with respect to a whole filter/kernel of weights, I assume it is taking derivatives with respect to a matrix. However, I am not sure, and there are a few details that bother me. For example, earlier in the paper they mention that the convolution $*$ they denote in their paper depends on context (since it could be a full convolution or a valid convolution; the former returns an $(m+n-1) \times (m+n-1)$ matrix and the latter an $(m-n+1) \times (m-n+1)$ matrix). Therefore, it is not clear to me whether that gradient equation uses full or valid convolution.
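One thing that can be pinned down without the full derivation: $\partial E(\theta)/\partial W^k$ must have the same shape as $W^k$ ($n \times n$), and a finite-difference gradient gives a ground truth against which any proposed analytic formula (and its choice of valid vs. full convolution) can be checked numerically. A self-contained sketch, assuming the forward pass described above with $\tilde{W}^k$ taken as the flip of $W^k$ and $c$ omitted for brevity:

```python
import numpy as np
from scipy.signal import convolve2d

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic activation

rng = np.random.default_rng(1)
m, n, K = 5, 3, 2
x = rng.standard_normal((m, m))
b = rng.standard_normal(K)

def loss(W):
    """Reconstruction error E(theta) for the autoencoder described above."""
    h = [sigma(convolve2d(x, W[k], mode='valid') + b[k]) for k in range(K)]
    W_tilde = W[:, ::-1, ::-1]        # assumption: W~ is W flipped both ways
    y = sigma(sum(convolve2d(h[k], W_tilde[k], mode='full') for k in range(K)))
    return 0.5 * np.mean((x - y) ** 2)

W = rng.standard_normal((K, n, n))
eps = 1e-6
num_grad = np.zeros_like(W)
for idx in np.ndindex(W.shape):       # central differences, one entry at a time
    dW = np.zeros_like(W); dW[idx] = eps
    num_grad[idx] = (loss(W + dW) - loss(W - dW)) / (2 * eps)

print(num_grad.shape)  # (K, n, n): the gradient has the shape of the filters
```

Evaluating the paper's formula $x * \delta h^k + \tilde{h}^k * \delta y$ with each candidate convolution mode and comparing against `num_grad` would settle which mode is meant.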

Furthermore, it is not clear to me what the equations for the deltas should be. Usually they are recursive, but it is not clear to me what they should look like in this case. For ordinary back-propagation they usually look as follows (taken from Andrew Ng's notes):

$$ \delta^{(l)}_{i} = ( (W^{(l)} )^T \delta^{(l+1)}) \bullet f'(z^{(l)})$$

however, I have never seen a delta with respect to a hidden unit $h$. Usually the delta is defined as:

$$ \delta^{(l)}_i = \frac{\partial Error(w)}{\partial s^{(l)}_i}$$

where $s^{(l)}_i$ denotes the input to an activation function.
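For comparison, the standard fully-connected recursion above is easy to write out directly; a tiny numpy sketch with hypothetical layer sizes, using the logistic $f$ so that $f'(z) = f(z)(1 - f(z))$:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda z: 1.0 / (1.0 + np.exp(-z))       # logistic activation

n_l, n_next = 4, 3                           # units in layers l and l+1
W_l = rng.standard_normal((n_next, n_l))     # W^(l): maps layer l -> layer l+1
z_l = rng.standard_normal(n_l)               # z^(l): pre-activation input of layer l
delta_next = rng.standard_normal(n_next)     # delta^(l+1), assumed already computed

# delta^(l) = ((W^(l))^T delta^(l+1)) .* f'(z^(l))
fprime = f(z_l) * (1.0 - f(z_l))             # derivative of the logistic at z^(l)
delta_l = (W_l.T @ delta_next) * fprime

print(delta_l.shape)  # (n_l,): one delta per unit in layer l
```

The question is what the analogue of this recursion looks like when the weighted sum is replaced by a (valid or full) convolution and the deltas are attached to feature maps $h^k$ rather than individual units.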

Anyway, this is not explained well in the paper, and I was hoping someone could help me figure out what the notation means, as well as some of the mathematical details of the derivation.