I understand the mnemonic of the chain rule. But, for example, if I have some error function E and I want to find its first derivative with respect to some matrix W, or if I have some vector-valued function V and I also want to find its derivative with respect to some matrix W, what kind of entities would these be?
$$\frac{\partial E}{\partial W} = ?$$ $$\frac{\partial V}{\partial W} = ?$$
Are these tensors? Or multi-dimensional arrays? What operation connects the entities in a derivative chain: matrix multiplication, or a dot product? What should I study to understand these entities, tensor algebra or differential geometry?
I know there are plenty of materials on the web that avoid this question completely, or introduce clumsy notation like $$\partial w_{ij}$$ instead of $$\partial W$$, but I am tired of following that. I want to see the general form, and to operate on and see the entities as they are.
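To make the shapes concrete, here is a small numerical sketch (my own illustration, not from any particular textbook): for a scalar error $E(W)$, the derivative with respect to a matrix $W$ has the same shape as $W$; for a vector-valued $V(W)$, it is a higher-rank array. The functions `E`, `V`, and `num_grad` below are hypothetical names chosen for this example.

```python
import numpy as np

# Hypothetical examples: a scalar "error" E(W) and a vector-valued V(W),
# both functions of a 2x3 matrix W.
def E(W):
    return np.sum(W ** 2)                   # scalar output

def V(W):
    return W @ np.array([1.0, 2.0, 3.0])    # output in R^2

def num_grad(f, W, eps=1e-6):
    """Finite-difference derivative of f with respect to every entry of W."""
    base = np.asarray(f(W), dtype=float)
    out = np.empty(base.shape + W.shape)    # one slot per (output, input) pair
    for idx in np.ndindex(W.shape):
        Wp = W.copy()
        Wp[idx] += eps
        out[(...,) + idx] = (np.asarray(f(Wp)) - base) / eps
    return out

W = np.arange(6, dtype=float).reshape(2, 3)
print(num_grad(E, W).shape)  # (2, 3): dE/dW has the same shape as W
print(num_grad(V, W).shape)  # (2, 2, 3): dV/dW is a rank-3 array
```

So the scalar case stays a matrix, while the vector case is genuinely a multi-dimensional array (a tensor in the loose machine-learning sense).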
I recommend reading the relevant parts of Tom Mitchell's book *Machine Learning*.
I made a short summary in my own words in my bachelor's thesis.
Things to understand for gradient descent in neural networks:
The Gradient
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a function:
$$f(x_1, x_2, \dots, x_n) = (F_1(x_1, x_2, \dots, x_n), F_2(x_1, x_2, \dots, x_n), \dots, F_m(x_1, x_2, \dots, x_n))$$
Then the gradient of $f$ (for $m > 1$, more precisely the Jacobian matrix of $f$) is denoted by $\nabla f$ and $$\nabla f = \begin{pmatrix} \frac{\partial F_1}{\partial x_1} & \frac{\partial F_1}{\partial x_2} & \dots & \frac{\partial F_1}{\partial x_n}\\ \frac{\partial F_2}{\partial x_1} & \frac{\partial F_2}{\partial x_2} & \dots & \frac{\partial F_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial F_m}{\partial x_1} & \frac{\partial F_m}{\partial x_2} & \dots & \frac{\partial F_m}{\partial x_n} \end{pmatrix}$$
You can see that the problem decomposes by output neuron: row $i$ of this matrix is the gradient of the single scalar component $F_i$.
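The definition above can be sketched numerically (my own example, not from the thesis): take $f: \mathbb{R}^3 \rightarrow \mathbb{R}^2$, $f(x) = (x_1 x_2,\; x_2 + x_3)$, and build the matrix column by column via finite differences. The function names here are assumptions for illustration.

```python
import numpy as np

# Example function f: R^3 -> R^2, f(x) = (x1*x2, x2 + x3).
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def jacobian(f, x, eps=1e-6):
    """Numerical m x n matrix of partials: row i is the gradient of F_i."""
    fx = f(x)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - fx) / eps   # column j: partials w.r.t. x_j
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))
# Analytically: [[x2, x1, 0], [0, 1, 1]] = [[2, 1, 0], [0, 1, 1]]
```

Each row can be computed independently, which is exactly the decomposition by output neuron mentioned above.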
TODO: Explain $\frac{\partial f}{\partial x_1}$ notation
http://www.markusengelhardt.com/skripte/grad-div-rot.pdf (German)
The chain rule
(To be continued; I need to go to work now.)