Consider the cross-entropy loss $CE=-\sum_i y_i \log(\hat{y}_i)$ with softmax output $\hat{y}_i=\frac{e^{\theta_i}}{\sum_j e^{\theta_j}}$, where $y$ and $\theta$ are vectors. The question asks to compute $\frac{\partial CE}{\partial \theta}$.
(Hint from the original question: $y$ is the one-hot label vector. You might want to use the fact that many elements of $y$ are zero, and assume that only the $k$-th dimension of $y$ is one.)
My solution: $$\frac{\partial CE(y,\hat{y})}{\partial \theta_i}= -\sum_i y_i \frac{1}{\hat{y}_i} \frac{e^{\theta_i}\sum_j e^{\theta_j} -e^{2\theta_i}}{(\sum_j e^{\theta_j})^2}= -\sum_i y_i \frac{\sum_j e^{\theta_j}}{e^{\theta_i}} \frac{e^{\theta_i}\sum_j e^{\theta_j} -e^{2\theta_i}}{(\sum_j e^{\theta_j})^2}=\sum_i y_i (\hat{y}_i-y_i)$$
Thus, $$\frac{\partial CE(y,\hat{y})}{\partial \theta_i}=\begin{cases} \hat{y}_i-1, & i=k\\ 0, & \text{otherwise} \end{cases}$$
Solution given:
$\frac{\partial CE(y,\hat{y})}{\partial \theta}=\hat{y}-y$
or equivalently, $$\frac{\partial CE(y,\hat{y})}{\partial \theta_i}=\begin{cases} \hat{y}_i-1, & i=k\\ \hat{y}_i, & \text{otherwise} \end{cases}$$
The two results differ only in the "otherwise" case ($i \neq k$). Did I miss anything in my solution?
Update (Solved): every $\hat{y}_i$ depends on $\theta_k$ through the softmax denominator, so the chain rule has to sum over all $i$. Here $k$ is the index of the component of $\theta$ being differentiated, not the one-hot index from the hint.
\begin{align*}
\frac{\partial CE(y,\hat{y})}{\partial \theta_k} &= \sum_i \frac{\partial CE(y,\hat{y})}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial \theta_k}\\
&= -y_k \frac{1}{\hat{y}_k} \frac{\partial \hat{y}_k}{\partial \theta_k}- \sum_{i \neq k} y_i \frac{1}{\hat{y}_i} \frac{\partial \hat{y}_i}{\partial \theta_k}\\
&= - y_k \frac{\sum_j e^{\theta_j}}{e^{\theta_k}} \frac{e^{\theta_k}\sum_j e^{\theta_j} -e^{2\theta_k}}{(\sum_j e^{\theta_j})^2}+\sum_{i\neq k}y_i \frac{\sum_j e^{\theta_j}}{e^{\theta_i}} \frac{e^{\theta_i} e^{\theta_k}}{(\sum_j e^{\theta_j})^2}\\
&=-y_k(1-\hat{y}_k)+ \sum_{i\neq k}y_i \hat{y}_k\\
&= -y_k + \hat{y}_k\sum_i y_i\\
&= \hat{y}_k-y_k
\end{align*}
The last step uses $\sum_i y_i = 1$, which holds for a one-hot $y$.
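As a sanity check on the result $\frac{\partial CE}{\partial \theta}=\hat{y}-y$, here is a minimal numerical sketch that compares it against central finite differences. The values of `theta` and `y`, the step size `eps`, and the helper names are arbitrary choices for illustration and are not part of the original problem.

```python
import numpy as np

def softmax(theta):
    # Subtracting the max is a standard stability trick; it leaves the output unchanged.
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def cross_entropy(theta, y):
    # CE = -sum_i y_i * log(y_hat_i), with y_hat = softmax(theta)
    return -np.sum(y * np.log(softmax(theta)))

def numerical_grad(theta, y, eps=1e-6):
    # Central finite difference in each coordinate theta_k
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (cross_entropy(theta + step, y)
                   - cross_entropy(theta - step, y)) / (2 * eps)
    return grad

theta = np.array([0.5, -1.2, 2.0, 0.1])  # arbitrary example values (assumption)
y = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot label, the third entry is one

analytic = softmax(theta) - y            # the claimed gradient, y_hat - y
numeric = numerical_grad(theta, y)

print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```

The entries of `analytic` at the non-label positions equal $\hat{y}_i$, which is strictly positive for a softmax output, so the "otherwise" case cannot be zero.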