I am trying to understand the math of CS231n as discussed here and implemented here.
To start, I don't get where the -1 operation comes from, as seen in Softmax.softmax_loss_vectorized() at line 87:
86 dS = softmax_output.copy()
87 dS[range(num_train), list(y)] += -1
88 dW = (X.T).dot(dS)
89 dW = dW / num_train + reg * W
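For what it's worth, here is a small self-contained sketch (variable names and shapes are my own, not the course starter code) that reproduces the `dS[range(num_train), list(y)] += -1` step and checks it against a numerical gradient of the cross-entropy loss, which is one way to convince yourself the -1 is correct:

```python
import numpy as np

np.random.seed(0)
num_train, num_classes = 4, 3
scores = np.random.randn(num_train, num_classes)   # raw class scores f
y = np.array([0, 2, 1, 2])                         # true class labels

def loss(s):
    # average cross-entropy loss over the batch
    s = s - s.max(axis=1, keepdims=True)           # shift for numerical stability
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return -np.log(p[range(num_train), y]).mean()

# Analytic gradient: softmax probabilities, minus 1 at the true class.
shifted = scores - scores.max(axis=1, keepdims=True)
softmax_output = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
dS = softmax_output.copy()
dS[range(num_train), y] -= 1                       # the step in question
dS /= num_train

# Numerical gradient via central differences.
num = np.zeros_like(scores)
h = 1e-5
for i in range(num_train):
    for k in range(num_classes):
        sp, sm = scores.copy(), scores.copy()
        sp[i, k] += h
        sm[i, k] -= h
        num[i, k] = (loss(sp) - loss(sm)) / (2 * h)

print(np.abs(dS - num).max())   # tiny, so dS really is dL/dscores
```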
And in TwoLayerNet.softmax_loss_vectorized() line 125:
124 dscores = softmax_output.copy()
125 dscores[range(N), list(y)] -= 1
126 dscores /= N
127 grads['W2'] = h_output.T.dot(dscores) + reg * W2
128 grads['b2'] = np.sum(dscores, axis = 0)
129
130 dh = dscores.dot(W2.T)
131 dh_ReLu = (h_output > 0) * dh
132 grads['W1'] = X.T.dot(dh_ReLu) + reg * W1
133 grads['b1'] = np.sum(dh_ReLu, axis = 0)
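To see all of lines 124-133 working together, here is a minimal runnable sketch of the two-layer net (my own toy shapes and a loss with a 0.5 * reg * ||W||^2 term, which is an assumption consistent with the `+ reg * W1` gradient above), with a finite-difference check on one entry of `grads['W1']`:

```python
import numpy as np

np.random.seed(1)
N, D, H, C = 5, 4, 6, 3            # batch, input dim, hidden dim, classes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W1, b1 = np.random.randn(D, H), np.zeros(H)
W2, b2 = np.random.randn(H, C), np.zeros(C)
reg = 0.1

def forward_loss(W1):
    # forward pass: affine -> ReLU -> affine -> softmax cross-entropy
    h = np.maximum(0, X.dot(W1) + b1)
    scores = h.dot(W2) + b2
    shifted = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    data_loss = -np.log(p[range(N), y]).mean()
    reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    return data_loss + reg_loss, h, p

loss, h_output, softmax_output = forward_loss(W1)

# Backward pass, exactly as in the snippet above.
dscores = softmax_output.copy()
dscores[range(N), y] -= 1
dscores /= N
dh = dscores.dot(W2.T)
dh_ReLu = (h_output > 0) * dh       # ReLU gate: pass gradient only where h > 0
dW1 = X.T.dot(dh_ReLu) + reg * W1

# Numerical check on one entry of W1.
eps = 1e-5
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num = (forward_loss(Wp)[0] - forward_loss(Wm)[0]) / (2 * eps)
print(abs(dW1[0, 0] - num))         # tiny, so the backward pass matches
```

Each backward line undoes one node of the forward graph, in reverse order: `dscores` is the softmax gradient, `dh` backpropagates through the second affine layer, and `dh_ReLu` zeroes the gradient wherever the ReLU was inactive.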
If an explanation of the subsequent operations could be given as well, that would be wonderful; I could not make a one-to-one match with the aforementioned discussion. Thank you so much in advance.
PS: the code is in Python.
========================================================================
More info:
I think the answer to the -1 is in the section Computing the Analytic Gradient with Backpropagation of this link.
This link provides the following graph for the two-layer case:
# W1
# ----+ W2 b2
# dW1 \ ----+ ----+
# (*)----+ dW2 \ db2 \
# X / \ \ \
# ----+ \ h \ \ scores softmaxes
# (+)------(ReLU)-----(*)--------(+)--------(softmax)----------
# b1 / drelu dh dscores dscores
# -------------+
# db1
This link and this link offer much better visualizations.
I still don't get how that graph comes about, or what equations lead to the code implemented above.

From the first link, we're given $\mathcal{L}_i = - \log\left(p_{y_i}\right)$ and $p_{y_i} = \frac{e^{f_{y_i}}}{\sum_{j=1}^K e^{f_j}}.$ It follows (by the chain rule$^\dagger$) that $\frac{\partial\mathcal{L}_i}{\partial f_k} = p_k - \mathbb{1}_{y_i=k}.$ The lines of NumPy code that you're asking about simply appear to be implementing the latter equation.
$^\dagger$ The chain rule gives
$$\begin{align}\frac{\partial\mathcal{L}_i}{\partial f_k} &=\sum_l\left( \frac{\partial\mathcal{L}_i}{\partial p_l}\frac{\partial p_l}{\partial f_k}\right)\\[5mm] &=\sum_l\left( \left(-p_l^{-1}\mathbb{1}_{l=y_i}\right)\left({e^{f_l}\mathbb{1}_{k=l}\over\sum_{j=1}^K e^{f_j}}-{e^{f_l}e^{f_k}\over\left(\sum_{j=1}^K e^{f_j}\right)^2}\right)\right)\\[5mm] &=\sum_l\left( \left(-p_l^{-1}\mathbb{1}_{l=y_i}\right)\left({p_l\mathbb{1}_{k=l}}-{p_l p_k}\right)\right)\\[5mm] &=\sum_l\left( \left(-\mathbb{1}_{l=y_i}\right)\left({\mathbb{1}_{k=l}}-{ p_k}\right)\right)\\[5mm] &=p_k-\mathbb{1}_{y_i=k} \end{align}$$