Backpropagation of position-wise feedforward neural network


I have read the paper "Attention Is All You Need" by Vaswani et al. (2017). This paper uses a so-called position-wise feed-forward network, whose input is a matrix $\mathbf{X} \in \mathbb{R}^{n \times d_\mathrm{model}}$ (not a vector $\mathbf{x} \in \mathbb{R}^{d_\mathrm{model}}$). If I am not mistaken, "position-wise" means that the same feed-forward layer is applied to every row vector $\mathbf{X}_{i*}$ ($i$th row of $\mathbf{X}$) for $i = 1, \dots, n$. Thus, the weights are shared across positions.

I want to do backpropagation for a position-wise network consisting of only a linear layer with no activation. Let the output dimensionality be $d_\mathrm{model}$ as well. Applying this network yields $\mathbf{Z} \in \mathbb{R}^{n \times d_\mathrm{model}}$, where each row $\mathbf{Z}_{i*}$, $i = 1, \dots, n$, is given by $\mathbf{Z}_{i*} = \mathbf{X}_{i*} \mathbf{W} + \mathbf{b}^\intercal$. Here, $\mathbf{W} \in \mathbb{R}^{d_\mathrm{model} \times d_\mathrm{model}}$ and $\mathbf{b} \in \mathbb{R}^{d_\mathrm{model}}$ are the weight matrix and bias vector, respectively.
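To make the setup concrete, here is a small numpy sketch (my own illustration, not from the paper; all names and shapes are assumptions) showing that applying the shared linear layer row by row is the same as a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 4, 8  # assumed toy sizes
X = rng.standard_normal((n, d_model))
W = rng.standard_normal((d_model, d_model))
b = rng.standard_normal(d_model)

# Position-wise: the same (W, b) applied to each row X[i]
Z_rowwise = np.stack([X[i] @ W + b for i in range(n)])

# Equivalent single matrix form; b broadcasts across the n rows
Z_matrix = X @ W + b

assert np.allclose(Z_rowwise, Z_matrix)
```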

Let $L$ be the loss function. Since $\mathbf{Z}_{ij} = \sum_k \mathbf{X}_{ik} \mathbf{W}_{kj} + \mathbf{b}_j$, we have $\dfrac{\partial \mathbf{Z}_{ij}}{\partial \mathbf{W}_{pq}} = \mathbf{X}_{ip}$ if $j = q$ and $0$ otherwise. For position $i$ I therefore get: $\dfrac{\partial L}{\partial \mathbf{W}_{pq}} = \dfrac{\partial L}{\partial \mathbf{Z}_{i1}} \dfrac{\partial \mathbf{Z}_{i1}}{\partial \mathbf{W}_{pq}} + \dfrac{\partial L}{\partial \mathbf{Z}_{i2}} \dfrac{\partial \mathbf{Z}_{i2}}{\partial \mathbf{W}_{pq}} + \dots + \dfrac{\partial L}{\partial \mathbf{Z}_{id_\mathrm{model}}} \dfrac{\partial \mathbf{Z}_{id_\mathrm{model}}}{\partial \mathbf{W}_{pq}} = \dfrac{\partial L}{\partial \mathbf{Z}_{iq}} \mathbf{X}_{ip}$, for $p, q = 1, \dots, d_\mathrm{model}$.

Thus, for position $i$ I end up with $\dfrac{\partial L}{\partial \mathbf{W}} = \mathbf{X}_{i*}^\intercal \dfrac{\partial L}{\partial \mathbf{Z}_{i*}}$ (an outer product of the two row vectors).

My question: does $\mathbf{X}_{1*}^\intercal \dfrac{\partial L}{\partial \mathbf{Z}_{1*}} = \mathbf{X}_{2*}^\intercal \dfrac{\partial L}{\partial \mathbf{Z}_{2*}} = \dots = \mathbf{X}_{n*}^\intercal \dfrac{\partial L}{\partial \mathbf{Z}_{n*}}$ hold?
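As a numerical sanity check (my own numpy sketch, with a made-up loss $L = \frac{1}{2}\sum_{ij} \mathbf{Z}_{ij}^2$ so that $\partial L/\partial \mathbf{Z} = \mathbf{Z}$; all sizes and names are assumptions), one can compute the per-position outer products $\mathbf{X}_{i*}^\intercal\, \partial L/\partial \mathbf{Z}_{i*}$, verify that their sum over $i$ equals the finite-difference gradient of $L$ with respect to $\mathbf{W}$, and compare the individual contributions directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model = 4, 3  # assumed toy sizes
X = rng.standard_normal((n, d_model))
W = rng.standard_normal((d_model, d_model))
b = rng.standard_normal(d_model)

def loss(W):
    Z = X @ W + b
    return 0.5 * np.sum(Z ** 2)

Z = X @ W + b
dL_dZ = Z  # gradient of the chosen quadratic loss w.r.t. Z

# Per-position contributions X[i]^T (dL/dZ)[i] and their sum over i
contribs = [np.outer(X[i], dL_dZ[i]) for i in range(n)]
grad_analytic = sum(contribs)
assert np.allclose(grad_analytic, X.T @ dL_dZ)  # same thing in matrix form

# Central finite differences over each entry of W
eps = 1e-6
grad_fd = np.zeros_like(W)
for p in range(d_model):
    for q in range(d_model):
        Wp = W.copy(); Wp[p, q] += eps
        Wm = W.copy(); Wm[p, q] -= eps
        grad_fd[p, q] = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.allclose(grad_analytic, grad_fd, atol=1e-5)

# Compare two per-position contributions directly
print(np.allclose(contribs[0], contribs[1]))
```

The finite-difference check only confirms that the *summed* contributions give $\partial L/\partial \mathbf{W}$; the final `print` compares two individual contributions for this particular random input.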