Gradient of softmax composed with cross-entropy


I am unsure whether this is more appropriate here or on CV, but since it is mostly a question about calculus, I figured posting it here was reasonable.

More specifically, I am interested in the gradient of $$CE(softmax(\vec \beta), \vec x)$$ with respect to $\vec y$, where $\vec \beta = A^T \vec y$, so that $\beta_i = \vec a_i^T \vec y$ (with $\vec a_i$ the $i$-th column of $A$).

Also, $softmax$ is defined componentwise as $softmax(\vec x)_i = \exp(x_i) / \sum_j \exp(x_j)$.
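As a sanity check, that definition can be implemented directly; a minimal sketch (the max-subtraction is only for numerical stability and does not change the value):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; softmax is invariant to this shift.
    z = np.exp(x - np.max(x))
    return z / z.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # the components sum to 1
```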

By the chain rule, I get $$\frac{dCE}{dy} = \frac{dCE}{d softmax(\beta)} \frac{dsoftmax(\beta)}{d\beta} \odot \frac{d\beta}{d y}$$

Here, the elementwise product comes from applying the chain rule to an elementwise function.

Combining the first two factors, I calculate that $$\frac{dCE}{d\vec \beta} = softmax(\vec \beta) - \vec x$$
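For what it's worth, this expression is easy to confirm with a central-difference check; a sketch, assuming the usual convention $CE(\vec p, \vec x) = -\sum_i x_i \log p_i$ with $\vec x$ a one-hot target (all names below are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cross_entropy(p, x):
    # CE(p, x) = -sum_i x_i log p_i, with x the target distribution.
    return -np.sum(x * np.log(p))

rng = np.random.default_rng(0)
beta = rng.normal(size=5)
x = np.zeros(5)
x[2] = 1.0                        # one-hot target

analytic = softmax(beta) - x      # claimed gradient w.r.t. beta

eps = 1e-6
numeric = np.empty(5)
for i in range(5):
    e = np.zeros(5)
    e[i] = eps
    numeric[i] = (cross_entropy(softmax(beta + e), x)
                  - cross_entropy(softmax(beta - e), x)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (finite-difference error)
```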

So far this makes sense (and I have managed to verify the above against various sources online). However, when I try to calculate $d\vec \beta/d\vec y$, I get $A^T$, which doesn't make sense: the elementwise product isn't defined for operands of differing dimensions, i.e. my vector $softmax(\vec \beta) - \vec x$ and my matrix $A^T$.
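To make the dimension clash concrete, here is a tiny NumPy sketch with hypothetical sizes ($\vec y \in \mathbb{R}^3$, $\vec \beta \in \mathbb{R}^5$); broadcasting a length-5 vector against a $5 \times 3$ matrix fails exactly as described:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5                      # hypothetical sizes: y in R^n, beta in R^m
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

beta = A.T @ y                   # shape (m,)
v = rng.normal(size=m)           # stands in for softmax(beta) - x, shape (m,)

print(v.shape, A.T.shape)        # (5,) vs (5, 3): Jacobian d(beta)/dy = A^T
try:
    _ = v * A.T                  # elementwise product: shapes don't broadcast
except ValueError as err:
    print("elementwise product fails:", err)
```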

Can anyone shed some light on where I am going wrong?