Consider the following functions:
\begin{equation} y = \operatorname{softmax}(z) \end{equation} \begin{equation} z = h\cdot W + b \end{equation}
where $y, h, W$ and $b$ are $1 \times n$, $1 \times m$, $m \times n$ and $1 \times n$ matrices, respectively. Compute $\frac{\partial{y_i}}{\partial{W}}$.
My attempt:
\begin{equation} \frac{\partial{y_i}}{\partial{W}} = \frac{\partial{y_i}}{\partial{z}} \times \frac{\partial{z}}{\partial{W}} \end{equation}
Here $z$ is a vector and $W$ is a matrix, so $\frac{\partial{z}}{\partial{W}}$ will be a 3D tensor.
But $y_i$ is a scalar and $W$ is an $m \times n$ matrix, so $\frac{\partial{y_i}}{\partial{W}}$ should be of size $m \times n$.
Where am I going wrong?
Given
$$\eqalign{ z &= hW+b \cr y &= \operatorname{softmax}(z) \cr Y &= \operatorname{Diag}(y) \cr }$$
find the differential and gradient of $y$:
$$\eqalign{ dy &= dz\,(Y-y^Ty) \cr &= h\,dW\,(Y-y^Ty) \cr &= h\,{\mathbb E}\,(Y-y^Ty):dW \cr\cr \frac{\partial y}{\partial W} &= h\,{\mathbb E}\,(Y-y^Ty) \cr }$$
where the colon denotes the double-dot (aka Frobenius) product, and ${\mathbb E}$ is a fourth-order isotropic tensor with components
$${\mathbb E}_{ijkl} = \delta_{ik}\,\delta_{jl}$$
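Written out in components, this appears to answer the size question directly. The following is a sketch for the single output $y_i$, where $J = Y - y^Ty$ and the row basis vector $e_i$ are names introduced here for brevity and are not part of the notation above:
$$\eqalign{ dy_i &= \sum_{k,l} h_k\,dW_{kl}\,J_{li} \cr \frac{\partial y_i}{\partial W_{kl}} &= h_k\,J_{li} = h_k\,y_i\,(\delta_{li} - y_l) \cr \frac{\partial y_i}{\partial W} &= h^T\,(e_iJ) = y_i\,h^T(e_i - y) \cr }$$
So the chain rule is fine: $\frac{\partial z}{\partial W}$ is indeed a 3D tensor, but it is contracted over the index of $z$ with the vector $\frac{\partial y_i}{\partial z}$, and that contraction leaves an $m \times n$ matrix for each fixed $i$.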
Also recall that we are working with row vectors, so $(y^Ty)$ is an $n \times n$ matrix (an outer product), not a scalar product.
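As a sanity check, the component form $\frac{\partial y_i}{\partial W_{kl}} = h_k\,y_i\,(\delta_{il} - y_l)$ implied by the differential above can be compared against central finite differences. This is only a minimal sketch in plain Python: the function names and the test values of $h$, $W$ and $b$ are made up for illustration, and row vectors are stored as flat lists.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    zmax = max(z)
    exps = [math.exp(v - zmax) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(h, W, b):
    """y = softmax(h W + b); h has length m, W is m x n (nested lists), b has length n."""
    m, n = len(W), len(W[0])
    z = [sum(h[k] * W[k][l] for k in range(m)) + b[l] for l in range(n)]
    return softmax(z)

def grad_yi_wrt_W(h, W, b, i):
    """Analytic m x n gradient of y_i: entry (k, l) is h_k * y_i * (delta_il - y_l)."""
    y = forward(h, W, b)
    row = [y[i] * ((1.0 if l == i else 0.0) - y[l]) for l in range(len(y))]
    return [[hk * r for r in row] for hk in h]

def numeric_grad_yi_wrt_W(h, W, b, i, eps=1e-6):
    """Central finite differences, perturbing one entry of W at a time."""
    m, n = len(W), len(W[0])
    G = [[0.0] * n for _ in range(m)]
    for k in range(m):
        for l in range(n):
            Wp = [r[:] for r in W]
            Wm = [r[:] for r in W]
            Wp[k][l] += eps
            Wm[k][l] -= eps
            G[k][l] = (forward(h, Wp, b)[i] - forward(h, Wm, b)[i]) / (2 * eps)
    return G

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two equally sized matrices."""
    return max(abs(a - c) for ra, rb in zip(A, B) for a, c in zip(ra, rb))

# Arbitrary illustrative values with m = 3, n = 4 (not taken from the question).
h = [0.5, -1.0, 2.0]
W = [[0.1, -0.2, 0.3, 0.0],
     [0.4, 0.0, -0.1, 0.2],
     [-0.3, 0.1, 0.2, -0.4]]
b = [0.0, 0.1, -0.1, 0.2]

for i in range(len(b)):
    A = grad_yi_wrt_W(h, W, b, i)
    N = numeric_grad_yi_wrt_W(h, W, b, i)
    # Each gradient is m x n; the printed difference should be tiny.
    print(i, (len(A), len(A[0])), max_abs_diff(A, N))
```

Each `grad_yi_wrt_W(h, W, b, i)` is a single $m \times n$ matrix; stacking them over $i$ gives the full third-order gradient $\frac{\partial y}{\partial W}$.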