I am unsure if this is more appropriate for here or CV, but since it is mostly a question about calculus, I figured posting it here would be a reasonable idea.
More specifically, I am interested in obtaining the gradient, with respect to $\vec y$, of $$CE(softmax(\vec \beta), \vec x)$$ where $\vec \beta = A^T \vec y$, so that $\beta_i = \vec a_i^T \vec y$ with $\vec a_i$ the $i$-th column of $A$.
Here, $softmax$ is defined as $softmax(x)_i = \exp(x_i) / \sum_j \exp(x_j)$.
By the chain rule, I get $$\frac{dCE}{dy} = \frac{dCE}{d softmax(\beta)} \frac{dsoftmax(\beta)}{d\beta} \odot \frac{d\beta}{d y}$$
Here, the elementwise product comes from the derivative of an elementwise function (chain rule applied to elementwise functions).
I calculate that $$\frac{dCE}{d \beta} = softmax(\beta) - \vec x$$
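As a sanity check on this intermediate result, here is a minimal sketch that compares $softmax(\vec\beta) - \vec x$ against a central finite-difference gradient. It assumes $CE(p, x) = -\sum_i x_i \log p_i$ and that the entries of $\vec x$ sum to 1 (for example, a one-hot target); neither is stated above, so treat both as assumptions. The helper names and the test values are made up for illustration.

```python
import math

def softmax(beta):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, x):
    # Assumed definition: CE(p, x) = -sum_i x_i * log(p_i)
    return -sum(xi * math.log(pi) for pi, xi in zip(p, x))

def loss(beta, x):
    return cross_entropy(softmax(beta), x)

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(v)):
        hi = list(v)
        lo = list(v)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

beta = [0.5, -1.2, 2.0]   # arbitrary test point (illustrative)
x = [0.0, 1.0, 0.0]       # one-hot target, entries sum to 1 (assumption)

analytic = [p - xi for p, xi in zip(softmax(beta), x)]
numeric = numeric_grad(lambda b: loss(b, x), beta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)  # should be tiny if the two gradients agree
```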
So far this makes sense (and I have verified the above against various sources online). However, when I try to calculate $d\vec\beta/d\vec y$, I get $A^T$, which doesn't make sense, as the elementwise product isn't defined for arrays of differing dimensions (namely my vector $softmax(\vec\beta) - \vec x$ and the matrix $A^T$).
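To make the dimension mismatch concrete, here is a small sketch that builds a toy instance and computes a finite-difference gradient with respect to $\vec y$, which any candidate chain-rule expression could be compared against. The sizes ($A$ with 2 rows and 3 columns), the numbers, and the helper names are all made up, and $CE(p, x) = -\sum_i x_i \log p_i$ is again an assumption.

```python
import math

def softmax(beta):
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, x):
    # Assumed definition: CE(p, x) = -sum_i x_i * log(p_i)
    return -sum(xi * math.log(pi) for pi, xi in zip(p, x))

def beta_of(A, y):
    # beta_i = a_i^T y, where a_i is the i-th column of A (A has m rows, n columns).
    m, n = len(A), len(A[0])
    return [sum(A[k][i] * y[k] for k in range(m)) for i in range(n)]

def loss_y(A, y, x):
    return cross_entropy(softmax(beta_of(A, y)), x)

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(v)):
        hi = list(v)
        lo = list(v)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

A = [[1.0, 0.5, -0.3],
     [0.2, -1.0, 0.8]]    # m = 2 rows, n = 3 columns (illustrative values)
y = [0.7, -0.4]           # length m
x = [0.0, 0.0, 1.0]       # length n, one-hot (assumption)

beta = beta_of(A, y)
residual = [p - xi for p, xi in zip(softmax(beta), x)]   # softmax(beta) - x
grad_y = numeric_grad(lambda v: loss_y(A, v, x), y)

print(len(residual))        # number of entries in softmax(beta) - x, i.e. n
print(len(A[0]), len(A))    # shape of A^T: n rows, m columns
print(len(grad_y))          # number of entries in the gradient w.r.t. y, i.e. m
```

In this toy case the vector $softmax(\vec\beta) - \vec x$ has $n$ entries, $A^T$ is $n \times m$, and the gradient with respect to $\vec y$ has $m$ entries, which is exactly the mismatch I am running into.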
Can anyone shed some light on where I am going wrong?