I have some scalar function $S: \mathbb R^{m \times n} \to \mathbb R$
I also know some other function $h:\mathbb R^{m \times n} \to \mathbb R^{m \times n}$.
I know the derivative $\frac{\partial S(h(z))}{\partial h(z)}$. It is of size $m$ by $n$, as expected.
What I want to find is $\frac{\partial S(h(z))}{\partial z}$. It should also be of size $m$ by $n$ since $z \in \mathbb R^{m \times n}$.
If we use the chain rule, we get:
$\frac{\partial S(h(z))}{\partial z} = \frac{\partial S(h(z))}{\partial h(z)}\frac{\partial h(z)}{\partial z}$
Since we expect $\frac{\partial S(h(z))}{\partial z}$ to be $m$ by $n$, and we know that $\frac{\partial S(h(z))}{\partial h(z)}$ is $m$ by $n$, it follows that $\frac{\partial h(z)}{\partial z}$ must be $n$ by $n$.
The problem is that it isn't. According to my calculations $\frac{\partial h(z)}{\partial z}$ is $m^2$ by $n^2$. So we have a dimension mismatch and multiplication is not possible.
Earlier today I asked a question that resembles this one, and was told that the chain rule does not always work in matrix calculus. Fine, this must be one of those cases, but I still need to find that derivative. How do I do it?
More in-depth details
In reality, I have a matrix $z \in \mathbb R^{m \times n}$. I then compute the matrix $h$ entrywise as $h_{ij} = g(z_{ij}) = \frac{1}{1+e^{-z_{ij}}}$, where $g$ is the sigmoid function. It has the special property that $g'(x) = g(x)(1-g(x))$.
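As a quick sanity check (a NumPy sketch; the function names `g` and `g_prime` are my own), the identity $g'(x) = g(x)(1-g(x))$ can be verified against a central finite difference:

```python
import numpy as np

def g(x):
    # sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    # analytic derivative via the identity g'(x) = g(x) * (1 - g(x))
    s = g(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
eps = 1e-6
fd = (g(x + eps) - g(x - eps)) / (2 * eps)  # central finite difference
print(np.max(np.abs(fd - g_prime(x))))      # should be tiny (finite-difference error)
```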
My original function $S$ is a function of this new matrix $h$. But I need to find the derivative with respect to $z$.
More clearly:
If $z = \begin{pmatrix} z_{11} & z_{12} & \dots & z_{1n} \\ \vdots & \dots & \dots & \vdots \\ z_{m1} & z_{m2} & \dots & z_{mn}\end{pmatrix}$ then $h = \begin{pmatrix} g(z_{11}) & g(z_{12}) & \dots & g(z_{1n}) \\ \vdots & \dots & \dots & \vdots \\ g(z_{m1}) & g(z_{m2}) & \dots & g(z_{mn})\end{pmatrix}$
Now if for example we look at the derivative with respect to $z_{11}$, we will simply get the matrix $\begin{pmatrix} g(z_{11})(1-g(z_{11})) & 0& \dots & 0 \\ \vdots & \dots & \dots & \vdots \\ 0 & 0 & \dots & 0\end{pmatrix}$ which is $m$ by $n$.
So overall, the derivative of $h$ with respect to the entire $z$ is an $m^2$ by $n^2$ matrix, which again does not make much sense and does not help us find the derivative of $S$ with respect to $z$, which is what we originally wanted.
$\frac{\partial h(z)}{ \partial z} = \nabla h(z)$ will be a linear operator from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^{m \times n}$, so if we vectorize $z$ we will need a matrix of size $mn \times mn$ to represent it. This matrix will have entries defined by partial derivatives $(\nabla h(z))_{(i,j),(k,l)} = \frac{\partial h(z)_{i,j}}{ \partial z_{k,l}}$. However, in your case this matrix will be diagonal with diagonal entries $(\nabla h(z))_{(i,j),(i,j)} = g(z_{i,j})(1 - g(z_{i,j}))$.
How $\nabla h(z)$ interacts with $\nabla S$ is more a composition than a multiplication, if you think of differentials as linear operators, i.e.
$$ \nabla (S \circ h) (z) = \mathrm{vec}^{-1}(\nabla h(z)(\mathrm{vec} \nabla S(h(z)))) \in \mathbb{R}^{m \times n} $$ where $\mathrm{vec} : \mathbb{R}^{m \times n} \to \mathbb{R}^{mn} $ associates a matrix with a vector of its entries.
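To make the formula concrete, here is a NumPy sketch (my own illustration, not part of the answer) that picks a simple scalar function $S(h) = \sum_{ij} h_{ij}^2$, so that $\nabla S(h) = 2h$, builds the diagonal $mn \times mn$ matrix representing $\nabla h(z)$, applies the $\mathrm{vec}$-based formula, and checks one entry against a finite difference:

```python
import numpy as np

def g(x):
    # sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

# Assumed example scalar function: S(h) = sum of squared entries, so grad S(h) = 2h.
def S(h):
    return np.sum(h ** 2)

def grad_S(h):
    return 2 * h

m, n = 3, 4
rng = np.random.default_rng(0)
z = rng.standard_normal((m, n))
h = g(z)

# Diagonal mn x mn matrix representing nabla h(z) under vectorization;
# ravel() plays the role of vec, reshape(m, n) the role of vec^{-1}.
D = np.diag((h * (1 - h)).ravel())
grad_z = (D @ grad_S(h).ravel()).reshape(m, n)

# Finite-difference check of the (0, 0) entry of the gradient
eps = 1e-6
zp = z.copy(); zp[0, 0] += eps
zm = z.copy(); zm[0, 0] -= eps
fd = (S(g(zp)) - S(g(zm))) / (2 * eps)
print(abs(fd - grad_z[0, 0]))  # should be tiny
```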
The main idea here is to think of differentials as linear transformations instead of plain matrices. Then you can forget about $\mathrm{vec}$ entirely and bring it back only for explicit computations.
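In fact, because $\nabla h(z)$ is diagonal, the whole $\mathrm{vec}$ computation collapses to an entrywise (Hadamard) product of two $m \times n$ matrices, which is how this is usually implemented in practice. A sketch (assuming $\nabla S(h(z))$ is available as an $m \times n$ array; the helper name is hypothetical):

```python
import numpy as np

def g(x):
    # sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

def backprop_through_sigmoid(grad_h, z):
    # Since nabla h(z) is diagonal, the chain rule reduces to the
    # entrywise product: grad_z = grad_h * g(z) * (1 - g(z))
    s = g(z)
    return grad_h * s * (1.0 - s)

# Check equivalence with the explicit diagonal-matrix form
rng = np.random.default_rng(1)
m, n = 3, 4
z = rng.standard_normal((m, n))
grad_h = rng.standard_normal((m, n))  # stand-in for grad of S wrt h(z)
D = np.diag((g(z) * (1 - g(z))).ravel())
via_vec = (D @ grad_h.ravel()).reshape(m, n)
print(np.allclose(via_vec, backprop_through_sigmoid(grad_h, z)))  # True
```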
(Sorry for change of notation)