In matrix calculus, we can compute scalar-by-matrix derivatives or matrix-by-scalar derivatives, according to here.
Where do these formulae come from? What confuses me is how this relates to ordinary multivariable differentiation. For example, given a matrix-to-scalar function $f: \mathbb{R}^{n \times m} \to \mathbb{R}$, in what sense is the result an $m \times n$ matrix?
Is this related to treating matrices as vectors of dimension $n \times m$, using a matrix norm to measure distance on the vector space of matrices? However, shouldn't the resulting derivative be a linear map of type $\mathbb{R}^{n \times m} \to \mathbb{R}$? Why is it $\mathbb{R}^m \to \mathbb{R}^n$?
A detailed derivation of the formulae would be helpful!
OK, let's look at the general definition of a differential $df$ on normed spaces. For $f:\mathbb{S}\to\mathbb{P}$, $$f(x+h)=f(x)+df(x,h)+o(\|h\|_{\mathbb{S}})$$ where the differential $df(x,h)$ is linear and continuous in $h$.
Let's look at $\mathbb{S}=\mathbb{R}^m,\mathbb{P}=\mathbb{R}^n$. Then any linear operator $\mathbb{R}^m\to\mathbb{R}^n$ can be represented by some matrix in $\mathbb{R}^{n\times m}$ (i.e. by matrix-vector multiplication). Let's denote this matrix $A$. So our differential in this case becomes $$df(x,h)=A(x)h$$ We call $A(x)$ a derivative (remember, $df$ is a differential).
Now we take $\mathbb{S}=\mathbb{R}^{n\times m},\mathbb{P}=\mathbb{R}$. But we also equip this $\mathbb{S}$ with the scalar product $B\cdot C=\sum_{i,j}B_{i,j}C_{i,j}$. This is basically the same as taking $\mathbb{S}=\mathbb{R}^{nm}$ with the standard dot product. Then any linear operator $L: \mathbb{S}\to\mathbb{P}$ can be represented as $L(h)=D\cdot h$ where $D$ is some matrix $\in \mathbb{S}$.
So in this case the differential $df$ can be represented as $$df(x,h)=A(x)\cdot h$$ We call $A(x)$ a derivative. In both cases $A(x)$ can even be the same table of numbers, but it is used differently: in the first case through a matrix-vector product, in the second through the entrywise scalar product.
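A side note on the $m\times n$ shape asked about in the question. This is an addition based on the standard identity $B\cdot C=\operatorname{tr}(B^{T}C)$, not something taken from the linked page: $$df(x,h)=A(x)\cdot h=\sum_{i,j}A_{i,j}(x)\,h_{i,j}=\operatorname{tr}\!\left(A(x)^{T}h\right)$$ So a table that writes the differential as $\operatorname{tr}(M\,h)$ is reporting $M=A(x)^{T}\in\mathbb{R}^{m\times n}$. It is the same linear functional $\mathbb{R}^{n\times m}\to\mathbb{R}$, only stored transposed. Whether the linked formulae use this layout is an assumption here.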
Small example: $$f(x)=Ax,\ A\in\mathbb{R}^{n\times m},\ x\in\mathbb{R}^m,\ f:\mathbb{R}^m\to\mathbb{R}^n$$ $$df(x,dx)=Adx$$ $$\frac{df(x)}{dx}=A$$
$$g(x)=A\cdot x,\ A\in\mathbb{R}^{n\times m},\ x\in\mathbb{R}^{n\times m},\ g:\mathbb{R}^{n\times m}\to\mathbb{R}$$ $$dg(x,dx)=A\cdot dx$$ $$\frac{dg(x)}{dx}=A$$ As you see the derivatives of $f$ and $g$ are the same, but they represent different things.
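The two small examples can be checked numerically. The following is a minimal sketch in plain Python, not part of the original derivation; the helper names `frobenius`, `matvec` and `numerical_gradient`, the sizes $n=2$, $m=3$, and the step `eps` are all illustrative assumptions. It compares a finite-difference gradient of $g$ with $A$, and checks that $f(x+h)-f(x)=Ah$.

```python
import random


def frobenius(B, C):
    """Scalar product B . C = sum over i, j of B[i][j] * C[i][j]."""
    return sum(b * c for row_b, row_c in zip(B, C) for b, c in zip(row_b, row_c))


def matvec(A, x):
    """Ordinary matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def numerical_gradient(func, X, eps=1e-6):
    """Central finite differences of a scalar function of a matrix, entry by entry."""
    n, m = len(X), len(X[0])
    G = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            X_plus = [row[:] for row in X]
            X_minus = [row[:] for row in X]
            X_plus[i][j] += eps
            X_minus[i][j] -= eps
            G[i][j] = (func(X_plus) - func(X_minus)) / (2 * eps)
    return G


random.seed(0)
n, m = 2, 3  # illustrative sizes
A = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

# g(X) = A . X maps R^{n x m} -> R; its derivative should be the n x m table A.
X = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
G = numerical_gradient(lambda M: frobenius(A, M), X)
max_err_g = max(abs(G[i][j] - A[i][j]) for i in range(n) for j in range(m))

# f(x) = A x maps R^m -> R^n; since f is linear, f(x + h) - f(x) = A h.
x = [random.uniform(-1, 1) for _ in range(m)]
h = [random.uniform(-1, 1) for _ in range(m)]
x_plus_h = [xi + hi for xi, hi in zip(x, h)]
lhs = [a - b for a, b in zip(matvec(A, x_plus_h), matvec(A, x))]
rhs = matvec(A, h)
max_err_f = max(abs(l - r) for l, r in zip(lhs, rhs))

print("max |grad g - A|          :", max_err_g)
print("max |f(x+h) - f(x) - A h| :", max_err_f)
```

Both printed errors should be at the level of floating-point round-off. The same table `A` is used in both checks, once through `matvec` and once through `frobenius`.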
In general, we can define a derivative whenever linear operators $\mathbb{S}\to\mathbb{P}$ can be represented with some product on $\mathbb{S}$. When $\mathbb{S}$ is a Hilbert space and $\mathbb{P}$ is its scalar field, we can always use the Riesz representation theorem to get such a representation. In this case the derivative is always an element of the input space $\mathbb{S}$.
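A nonlinear example in the same notation, added as an illustration (the function is chosen here and does not come from the linked formulae): $$f(x)=x\cdot x,\ x\in\mathbb{R}^{n\times m},\ f:\mathbb{R}^{n\times m}\to\mathbb{R}$$ $$f(x+h)=x\cdot x+2\,x\cdot h+h\cdot h=f(x)+2\,x\cdot h+o(\|h\|)$$ $$df(x,h)=2x\cdot h,\qquad \frac{df(x)}{dx}=2x\in\mathbb{R}^{n\times m}$$ Here $h\cdot h=\|h\|^2$ for the norm induced by the scalar product, so it is $o(\|h\|)$. The derivative $2x$ is again an element of the input space, with the same $n\times m$ shape as $x$.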