Cross-entropy of softmax: derivative explanation


Following these calculations: https://sebastianraschka.com/faq/docs/softmax_regression.html , I am a bit confused about the last equation. Suppose I have an X of shape (300, 7), a y (one-hot encoded) of shape (300, 3), and a weight matrix W of shape (7, 3). How would I compute W', that is, the gradients with respect to all the w_j? Also, I don't understand what x_i is in the equation — is it the i-th example?
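To make the question concrete, here is a minimal NumPy sketch of the setup being asked about, assuming softmax regression with a mean cross-entropy loss (as in the linked article); the data here is random and purely illustrative:

```python
import numpy as np

# Hypothetical data matching the shapes in the question:
# X (300, 7), one-hot y (300, 3), weights W (7, 3)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = np.eye(3)[rng.integers(0, 3, size=300)]  # one-hot labels, shape (300, 3)
W = rng.normal(size=(7, 3))

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(X @ W)                 # (300, 3) predicted class probabilities
# Gradient of the mean cross-entropy loss w.r.t. W: each column j is the
# gradient for w_j; x_i here is the i-th row of X (the i-th example).
grad = X.T @ (probs - y) / X.shape[0]  # shape (7, 3), same as W
```

Here the per-example gradient contribution is x_i (p_i - y_i)^T, and stacking all 300 examples and averaging gives the single matrix product above, so no explicit loop over j is needed.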