Backward propagation math: where does -1 come from?


I am trying to understand the math of CS231n as discussed here and implemented here.

To start, I don't get where the -1 operation comes from, as seen in Softmax.softmax_loss_vectorized() at line 87:

86      dS = softmax_output.copy()
87      dS[range(num_train), list(y)] += -1
88      dW = (X.T).dot(dS)
89      dW = dW / num_train + reg * W
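For the indexing line alone, here is a minimal standalone sketch (tiny made-up data, not the course code) showing that the fancy-indexed `+= -1` subtracts 1 exactly at each row's correct-class probability, turning the softmax output `p` into `p - 1{k = y_i}`:

```python
import numpy as np

# Hypothetical example: 3 training examples, 4 classes.
softmax_output = np.array([[0.10, 0.20, 0.30, 0.40],
                           [0.25, 0.25, 0.25, 0.25],
                           [0.70, 0.10, 0.10, 0.10]])
y = np.array([3, 0, 0])                  # correct class index per example
num_train = softmax_output.shape[0]

dS = softmax_output.copy()
# Integer (fancy) indexing pairs row i with column y[i], so 1 is
# subtracted only at the correct-class entry of each row.
dS[range(num_train), list(y)] += -1
```

Every other entry of `dS` is left equal to the corresponding probability; only the three (row, correct-class) entries are shifted down by 1.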

And in TwoLayerNet.softmax_loss_vectorized() line 125:

124     dscores = softmax_output.copy()
125     dscores[range(N), list(y)] -= 1
126     dscores /= N
127     grads['W2'] = h_output.T.dot(dscores) + reg * W2
128     grads['b2'] = np.sum(dscores, axis = 0)
129     
130     dh = dscores.dot(W2.T)
131     dh_ReLu = (h_output > 0) * dh
132     grads['W1'] = X.T.dot(dh_ReLu) + reg * W1
133     grads['b1'] = np.sum(dh_ReLu, axis = 0)
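For context, here is a hedged sketch of the forward pass that this backward code appears to assume (shapes and the random data are my own illustration; the variable names `h_output`, `scores`, and `softmax_output` follow the snippet above):

```python
import numpy as np

# Hypothetical sizes: N examples, D input features, H hidden units, C classes.
N, D, H, C = 5, 4, 10, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
W1, b1 = rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = rng.standard_normal((H, C)), np.zeros(C)

h_output = np.maximum(0, X.dot(W1) + b1)   # ReLU hidden layer ("h" node)
scores = h_output.dot(W2) + b2             # class scores ("scores" node)

# Row-wise softmax, shifted by the row max for numerical stability.
shifted = scores - scores.max(axis=1, keepdims=True)
softmax_output = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
```

The backward lines then retrace this graph in reverse: `dscores` flows into `W2`/`b2`, through `W2.T` into the hidden layer, through the ReLU mask `(h_output > 0)`, and finally into `W1`/`b1`.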

If an explanation of the subsequent operations could be given as well, that would be wonderful. I could not make a one-to-one match with the aforementioned discussion. Thank you so much in advance.

PS: the code is in Python.

========================================================================

More info:

I think the answer to the -1 is in the section Computing the Analytic Gradient with Backpropagation of this link.

This link provides the following graph for the two-layer case:

# W1
# ----+                      W2         b2
# dW1  \                     ----+      ----+
#      (*)----+              dW2  \     db2  \
# X    /       \                   \          \  
# ----+         \               h   \          \   scores           softmaxes
#               (+)------(ReLU)-----(*)--------(+)--------(softmax)----------
# b1            /  drelu        dh      dscores    dscores 
# -------------+
# db1

This link and this link offer much better visualizations, like the one below.

[image: computational-graph visualization of the two-layer network]

I still don't get how that graph comes about, and what the equations are that lead to the code implemented above.


BEST ANSWER

where does -1 come from?

From the first link, we're given $\mathcal{L}_i = - \log\left(p_{y_i}\right)$ and $p_{y_i} = \frac{e^{f_{y_i}}}{\sum_{j=1}^K e^{f_j}}.$ It follows (by the chain rule$^\dagger$) that $\frac{\partial\mathcal{L}_i}{\partial f_k} = p_k - \mathbb{1}_{y_i=k}.$ The lines of NumPy code that you're asking about simply appear to be implementing the latter equation.
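A quick finite-difference check (my own sketch, not from the course materials) confirms this formula for a single example: the analytic gradient $p_k - \mathbb{1}_{y_i=k}$ matches the numerical derivative of $-\log p_{y_i}$ with respect to each score.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(4)   # raw class scores for K = 4 classes (made up)
y = 2                        # true class index (arbitrary)

def loss(f, y):
    # L = -log(softmax(f)[y]), shifted by max(f) for numerical stability.
    p = np.exp(f - f.max()) / np.exp(f - f.max()).sum()
    return -np.log(p[y])

# Analytic gradient: p_k - 1{k = y}.
p = np.exp(f - f.max()) / np.exp(f - f.max()).sum()
analytic = p.copy()
analytic[y] -= 1

# Central-difference numerical gradient, one score at a time.
numeric = np.zeros_like(f)
eps = 1e-6
for k in range(f.size):
    fp, fm = f.copy(), f.copy()
    fp[k] += eps
    fm[k] -= eps
    numeric[k] = (loss(fp, y) - loss(fm, y)) / (2 * eps)
```

Note also that the analytic gradient sums to zero across classes, since the probabilities sum to 1 and exactly one indicator fires.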


$^\dagger$ The chain rule gives

$$\begin{align}\frac{\partial\mathcal{L}_i}{\partial f_k} &=\sum_l\left( \frac{\partial\mathcal{L}_i}{\partial p_l}\frac{\partial p_l}{\partial f_k}\right)\\[5mm] &=\sum_l\left( \left(-p_l^{-1}\mathbb{1}_{l=y_i}\right)\left({e^{f_l}\mathbb{1}_{k=l}\over\sum_{j=1}^K e^{f_j}}-{e^{f_l}e^{f_k}\over\left(\sum_{j=1}^K e^{f_j}\right)^2}\right)\right)\\[5mm] &=\sum_l\left( \left(-p_l^{-1}\mathbb{1}_{l=y_i}\right)\left({p_l\mathbb{1}_{k=l}}-{p_l p_k}\right)\right)\\[5mm] &=\sum_l\left( \left(-\mathbb{1}_{l=y_i}\right)\left({\mathbb{1}_{k=l}}-{ p_k}\right)\right)\\[5mm] &=p_k-\mathbb{1}_{y_i=k} \end{align}$$