Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$ is a smooth function. I know that the Jacobian of $f$ is the linear operator $\mathbf{F} \in \mathbb{R}^{m \times n}$ such that for any (small) $\epsilon \in \mathbb{R}_+$ and $\delta \in \mathbb{R}^n$,
$$ f(x + \epsilon\delta) = f(x) + \mathbf{F} \epsilon\delta + O(\epsilon^2).$$ In other words, $f(x) + \mathbf{F}\epsilon\delta$ is an affine approximation to $f(x + \epsilon\delta)$, accurate up to first-order terms.
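As a quick numerical sanity check of this definition, here is a sketch in NumPy (the particular $f$ and its hand-computed Jacobian are made-up illustrations, not anything from the question): the gap between $f(x+\epsilon\delta)$ and the affine approximation should shrink like $\epsilon^2$.

```python
import numpy as np

# Illustrative f : R^2 -> R^2 with its hand-computed Jacobian
def f(x):
    return np.array([np.sin(x[0]) * x[1], x[0] ** 2 + x[1]])

def jacobian_f(x):
    return np.array([
        [np.cos(x[0]) * x[1], np.sin(x[0])],
        [2 * x[0],            1.0],
    ])

x = np.array([0.3, -1.2])
delta = np.array([0.7, 0.4])

for eps in [1e-1, 1e-2, 1e-3]:
    exact = f(x + eps * delta)
    affine = f(x) + jacobian_f(x) @ (eps * delta)
    # The error falls off quadratically in eps (second-order remainder)
    print(eps, np.linalg.norm(exact - affine))
```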
Now suppose that $g : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is a smooth function. Then, the derivative of $g$ will be a linear operator $\mathbf{G}$ from $\mathbb{R}^{n \times n}$ to $\mathbb{R}^m$.
I just know that $\mathbf{G}$ will be a "3rd order tensor" (i.e., a matrix with an extra dimension), but I am not comfortable using or thinking about it. Does this mean I should think of it as an $m \times n \times n$ object? Can you describe what $\mathbf{G}$ is exactly and how I should think of it?
Suppose $\delta \in \mathbb{R}^{n \times n}$. Is it correct that any such $\mathbf{G}$ can be expressed as $$ \mathbf{G}(\delta) = \begin{bmatrix} \mathrm{trace}(A_1\delta) \\ \vdots \\ \mathrm{trace}(A_m\delta) \end{bmatrix}$$ where the $A_i$'s are constant (for a given $\mathbf{G}$) $n \times n$ matrices? [If so, can you please prove it?] Is there a significance to this representation? (Does $\{A_i\}$ have a name? Is this representation useful for computation?)
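A numerical sketch of this representation (NumPy, with a generic random tensor standing in for $\mathbf{G}$): since $\mathrm{trace}(A\delta) = \sum_{j,k} A_{jk}\,\delta_{kj}$, taking $A_i$ to be the transpose of the $i$-th slice of the tensor reproduces the contraction $\sum_{j,k}\mathbf{G}_{ijk}\,\delta_{jk}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
G = rng.standard_normal((m, n, n))   # a generic third-order tensor
delta = rng.standard_normal((n, n))

# Direct contraction: G(delta)_i = sum_{j,k} G_{ijk} * delta_{jk}
direct = np.einsum('ijk,jk->i', G, delta)

# Trace representation: A_i = G[i].T makes trace(A_i @ delta) match
A = G.transpose(0, 2, 1)
via_trace = np.array([np.trace(A[i] @ delta) for i in range(m)])

print(np.allclose(direct, via_trace))
```

Each $A_i$ is (up to transpose) the matrix of partial derivatives $\partial g_i/\partial X_{jk}$, i.e., the gradient of the scalar component $g_i$ with respect to the matrix argument.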
How do you "multiply" such an $m \times n \times n$ object by an $n \times n$ matrix? This is needed to approximate $g(x + \epsilon\delta) \approx g(x) + \mathbf{G}(\epsilon\delta)$ where now $\delta \in \mathbb{R}^{n \times n}$.
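The "multiplication" is a contraction over the two trailing indices. Here is a sketch with a concrete (illustrative, hand-differentiated) $g : \mathbb{R}^{n\times n} \to \mathbb{R}^2$, checking that $g(X) + \mathbf{G}(\epsilon\delta)$ approximates $g(X+\epsilon\delta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

def g(X):
    # Illustrative g : R^{n x n} -> R^2
    return np.array([np.trace(X @ X), np.trace(X) ** 2])

def G_tensor(X):
    # Hand-computed derivative tensor, shape (2, n, n): G[i, j, k] = d g_i / d X_{jk}
    return np.array([2 * X.T, 2 * np.trace(X) * np.eye(n)])

X = rng.standard_normal((n, n))
delta = rng.standard_normal((n, n))
eps = 1e-4

# "Multiplying" the tensor by a matrix = contracting over the last two axes
approx = g(X) + np.einsum('ijk,jk->i', G_tensor(X), eps * delta)
exact = g(X + eps * delta)
print(np.linalg.norm(exact - approx))  # second-order small in eps
```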
Now suppose $h : \mathbb{R}^p \to \mathbb{R}^{n \times n}$. Similar to (1), what is the form of the derivative operator $\mathbf{H}$ for $h$? I can guess $\mathbf{H}(\delta) = \sum_{i} B_i \delta_i$ where $\delta \in \mathbb{R}^p$ and the $B_i$'s are $n \times n$ matrices associated with $h$, but I don't know how to prove this.
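That guess can be checked numerically. In this sketch (the particular $h$ is an illustrative assumption), $B_i = \partial h/\partial w_i$ is an $n \times n$ matrix, and $h(w) + \sum_i B_i\,\epsilon\delta_i$ approximates $h(w+\epsilon\delta)$ to first order:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 2
M1 = rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n))

def h(w):
    # Illustrative h : R^2 -> R^{3x3}
    return w[0] * M1 + np.sin(w[1]) * M2

def H_matrices(w):
    # B_i = partial h / partial w_i, each an n x n matrix
    return [M1, np.cos(w[1]) * M2]

w = np.array([0.5, -0.8])
delta = np.array([0.3, 0.1])
eps = 1e-4

B = H_matrices(w)
approx = h(w) + sum(B[i] * (eps * delta[i]) for i in range(p))
exact = h(w + eps * delta)
print(np.linalg.norm(exact - approx))  # second-order small in eps
```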
How do I apply the chain rule to compute the Jacobian matrix of $g \circ h$, which is an $m \times p$ matrix? This question is related to (1) and (2). I guess the answer involves multiplying an $m \times n \times n$ tensor by an $n \times n \times p$ tensor to get an $m \times p$ Jacobian matrix, but I don't know how to actually compute this.
I have heard that I can vectorize the input of $g$ and the output of $h$ (so that they map from/to $\mathbb{R}^{n^2}$) and then compute this with the usual chain rule based on Jacobian matrices of the stacked functions. If this is correct, can you please prove it? Is there a better way to think about and/or compute the chain rule?
Thanks for your help!
PS. I have seen related questions - but none was as comprehensive as this question.
Let $$\eqalign{ &a,b\in{\mathbb R}^{m} , \quad &x,y\in{\mathbb R}^{n}, \quad &X,Y\in{\mathbb R}^{n\times n},\quad w\in{\mathbb R}^{p} \\ &f = f(x),\quad \;\,&a = f(y),\quad &dx= x-y\\ &g = g(X),\quad \;\,&b = g(Y),\quad &dX = X-Y \\ }$$ The affine approximations are most clearly expressed using index notation (Einstein convention). $$\eqalign{ &f_i = a_i + F_{ij}\,dx_j \\ &g_i = b_i + {\cal G}_{ijk}\,dX_{jk} \\ }$$ So $F$ is a second-order tensor and ${\cal G}$ is a third-order tensor, as you anticipated.
The approximation of a matrix-valued function is quite similar; here $\ell$ indexes the components of $w$ (the letter $m$ is already in use as a dimension). $$\eqalign{ X_{jk} &= Y_{jk} + {\cal H}_{jk\ell}\,dw_{\ell} \\ }$$ Substituting this into the previous formula yields an expression for the requested Jacobian. $$\eqalign{ dX_{jk} &= X_{jk} - Y_{jk} \\ &= {\cal H}_{jk\ell}\,dw_{\ell} \\ dg_i &= g_i(X)-g_i(Y) \\ &= {\cal G}_{ijk}\,dX_{jk} \\ &= {\cal G}_{ijk}{\cal H}_{jk\ell}\,dw_{\ell} \\ \frac{\partial g_i}{\partial w_\ell} &= {\cal G}_{ijk}{\cal H}_{jk\ell} \\ }$$
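This contraction, and its equivalence to the vectorized chain rule, can be verified numerically. The sketch below uses illustrative choices of $g$ and $h$ (with hand-computed derivative tensors); it contracts ${\cal G}$ with ${\cal H}$ over the shared matrix indices, checks that flattening those indices and multiplying ordinary Jacobian matrices gives the same answer, and compares against finite differences of $g \circ h$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 3, 2
M1 = rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n))

def g(X):                      # illustrative g : R^{n x n} -> R^2
    return np.array([np.trace(X @ X), np.trace(X) ** 2])

def h(w):                      # illustrative h : R^p -> R^{n x n}
    return w[0] * M1 + np.sin(w[1]) * M2

w = np.array([0.4, 0.9])
X = h(w)

# Derivative tensors: G has shape (2, n, n), H has shape (n, n, p)
G = np.array([2 * X.T, 2 * np.trace(X) * np.eye(n)])
H = np.stack([M1, np.cos(w[1]) * M2], axis=-1)

# Chain rule by contracting over the shared matrix indices (j, k)
J_tensor = np.einsum('ijk,jkl->il', G, H)          # (2, p) Jacobian of g∘h

# Same result via vectorization: flatten the (j, k) pair into one index
J_vec = G.reshape(2, n * n) @ H.reshape(n * n, p)
print(np.allclose(J_tensor, J_vec))

# Sanity check against finite differences of g∘h
eps = 1e-6
J_fd = np.column_stack([
    (g(h(w + eps * e)) - g(h(w))) / eps
    for e in np.eye(p)
])
print(np.max(np.abs(J_tensor - J_fd)))  # small
```

The vectorized form works because both `reshape` calls flatten the $(j,k)$ pair in the same (row-major) order, so the matrix product sums over exactly the same index pairs as the einsum contraction.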