Can you please help me check whether this is correct?
$\hat{y} = \operatorname{softmax}(hU + b_2)$, $J = \operatorname{CrossEntropy}(y, \hat{y})$.
where $\hat{y} \in \Bbb R^{1\times 5}$, $y \in \Bbb R^{1\times 5}$, $h \in \Bbb R^{1\times 30}$, $U \in \Bbb R^{30\times 5}$, $b_2 \in \Bbb R^{1\times 5}$, and $y$ is a one-hot vector (exactly one entry is 1, while the other entries are 0).
I want to compute the gradient of $J$ w.r.t. $U$ and $b_2$.
Attempt:
$\frac{dJ}{dU} = (\operatorname{softmax}(hU+b_2) - y)h$
$\frac{dJ}{db_2} =(\operatorname{softmax}(hU+b_2) - y) $
But $dJ/dU$ should have dimensions $30 \times 5$, and this doesn't work out: $(\operatorname{softmax}(hU+b_2) - y)$ is $1 \times 5$ and $h$ is $1 \times 30$, so the product doesn't even conform. Am I missing anything?
For the derivative of cross-entropy with softmax, I follow the derivation here: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
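For reference, here is a quick NumPy sketch I used to check the shapes (the variable names and random values are mine, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1, 30))    # hidden layer, 1 x 30
U = rng.standard_normal((30, 5))    # weights, 30 x 5
b2 = rng.standard_normal((1, 5))    # bias, 1 x 5
y = np.zeros((1, 5))
y[0, 2] = 1.0                       # one-hot target, 1 x 5

z = h @ U + b2                      # logits, 1 x 5
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                # softmax over the single row, 1 x 5

delta = y_hat - y                   # 1 x 5
# delta @ h does not conform: (1, 5) @ (1, 30) raises a ValueError,
# but h.T @ delta has the expected 30 x 5 shape:
grad_U = h.T @ delta
grad_b2 = delta

print(grad_U.shape)   # (30, 5)
print(grad_b2.shape)  # (1, 5)
```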