Chain rule and vector-matrix calculus


I'm trying to understand how the chain rule works for vector-matrix calculations. I compute derivatives of several vector functions:

$q_1=x^Tx$, $q_2=x \cdot x$, $q_3=xx^T$, $q_4=xx^Tx$, $q_5=(xx^T)(xx^T)$

We differentiate with respect to the vector $x$; the functions $q_{1\dots5}$ are various combinations of $x$, and they produce the following kinds of objects:

$q_1,q_2 \rightarrow$ scalars

$q_3 \rightarrow$ matrix

$q_4 \rightarrow$ vector

$q_5 \rightarrow$ matrix

The derivative of a vector with respect to itself is the identity matrix, i.e. $\frac{dx}{dx}=\boldsymbol{1}$.
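As a quick numerical illustration (not part of the original question; the helper `jacobian_fd` is my own), the Jacobian of the identity map computed by finite differences is indeed the identity matrix:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Forward-difference Jacobian of a vector function f at the point x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([1.0, -2.0, 3.0])
J = jacobian_fd(lambda v: v, x)   # derivative of the identity map f(x) = x
print(np.allclose(J, np.eye(3), atol=1e-5))  # prints True
```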

Now let's see the results obtained through the chain rule:

  1. $\frac{dq_1}{dx}=\frac{dx}{dx}^Tx+x^T\frac{dx}{dx}=\boldsymbol{1}^Tx+x^T\boldsymbol{1}$

  2. $\frac{dq_2}{dx}=\boldsymbol{1}x+x\boldsymbol{1}$

  3. $\frac{dq_3}{dx}=\boldsymbol{1}x^T+x\boldsymbol{1}^T$

  4. $\frac{dq_4}{dx}=\boldsymbol{1}x^Tx+x\boldsymbol{1}^Tx+xx^T\boldsymbol{1}$

  5. $\frac{dq_5}{dx}=\boldsymbol{1}x^T(xx^T)+x\boldsymbol{1}^T(xx^T)+(xx^T)\boldsymbol{1}x^T+(xx^T)x\boldsymbol{1}^T$

Now let's briefly analyze the results:

  1. the sum of a row vector and a column vector; to make it computable, one of the two must be transposed

  2. a similar situation, except that in one of the terms we must swap $x$ and $\boldsymbol{1}$ by hand

  3. neither term is computable as written; logically the derivative should be a tensor, so the ordinary products must be replaced by Kronecker products

  4. the first and third terms are matrices, which matches the expected result, but the second term has a non-computable structure, and it is not clear how to convert it into a computable one

  5. logically a tensor should result, but the pattern of factor orderings in the terms is also hard to make out

My question: there must be rules for transforming the "chain" expressions produced by differentiating complex vector-matrix expressions into computable results. Are such rules known? I would be grateful for help in understanding this.


Accepted answer:

$ \newcommand\DD[2]{\frac{\mathrm d#1}{\mathrm d#2}} \newcommand\tDD[2]{\mathrm d#1/\mathrm d#2} \newcommand\diff{\mathrm D} \newcommand\R{\mathbb R} $

Let's change perspectives. Your rule $\tDD xx = \mathbf 1$ tells me that what you want is the total derivative; this rule is equivalent to saying that the total derivative $\diff f_x$ at any point $x \in \R^n$ of the function $f(x) = x$ is the identity, i.e. $\diff f_x(v) = v$ for all $v \in \R^n$. Your transposes are essentially stand-ins for inner products. Let $\cdot$ be the standard inner product on $\mathbb R^n$. Then we may write each of your $q$'s as $$ q_1(x) = q_2(x) = x\cdot x,\quad q_3(x; w) = x(x\cdot w),\quad q_4(x) = (x\cdot x)x,\quad q_5(x; w) = x(x\cdot x)(x\cdot w). $$ I've interpreted the outer products $xx^T$ as functions $w \mapsto x(x\cdot w)$, and in $q_5$ I've used the associativity of matrix multiplication to get $$ (xx^T)(xx^T) = x(x^Tx)x^T. $$ When taking a total derivative $\diff f_x$, we may leave the point of evaluation $x$ implicit and write e.g. $\diff[f(x)]$ or even just $\diff f$ if $f$ is implicitly a function of $x$. If we want to differentiate a variable other than $x$, e.g. $y$, we will write e.g. $\diff_y[x + 2y](v) = 2v$. The total derivative has three fundamental properties:

  1. The derivative of the whole is the sum of the derivative of the parts. For example, $$ \diff[f(x,x)] = \dot\diff[f(\dot x,x)] + \dot\diff[f(x,\dot x)]. $$ The overdots specify precisely what is being differentiated, and anything without a dot is held constant. A more verbose notation would be $$ \diff_x[f(x,x)] = \diff_y[f(y,x)]_x + \diff_y[f(x,y)]_x, $$ or even more verbose $$ \diff_x[f(x,x)] = \bigl[\diff_y[f(y,x)]\bigr]_{y=x} = \bigl[\diff_y[f(x,y)]\bigr]_{y=x}. $$
  2. The chain rule says the derivative of a composition is the composition of derivatives: $$ \diff[f\circ g]_x = (\diff f_{g(x)})\circ(\diff g_x). $$ We don't need to use the chain rule directly for any of the $q$'s, but property 1 above is actually a consequence of the chain rule.
  3. The derivative of a linear function is itself. If $f(x)$ is linear, then $$ \diff f_x(v) = f(v). $$ To make it clear, if say $f(x, y)$ is a function linear in $x$ then the above means that $$ \diff[f(x,x)](v) = \dot\diff[f(\dot x,x)](v) + \dot\diff[f(x,\dot x)] = f(v,x) + \dot\diff[f(x,\dot x)], $$ and if $f(x, y)$ is additionally linear in $y$ then we can continue in the same fashion to get $$ \diff[f(x,x)](v) = f(v,x) + f(x,v). $$
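Property 3 is easy to check numerically: for a linear map $f(x) = Ax$, the directional derivative at any point is just $Av$. A small numpy sketch (my own, not part of the original answer; the random inputs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
v = rng.standard_normal(3)
h = 1e-6

f = lambda x: A @ x  # a linear function of x

# Directional (total) derivative of f at x in direction v, by finite differences
num = (f(x + h * v) - f(x)) / h

# Property 3: the derivative of a linear function is itself, so D f_x(v) = A v
print(np.allclose(num, A @ v, atol=1e-4))  # prints True
```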

Let's apply this to each $q$:

$$ \diff[q_1](v) = \diff[x\cdot x](v) = \dot\diff[\dot x\cdot x](v) + \dot\diff[x\cdot\dot x](v) = 2\dot\diff[\dot x\cdot x](v) = 2v\cdot x, $$$$ \diff[q_3](v) = \diff[x(x\cdot w)](v) = \dot\diff[\dot x(x\cdot w)](v) + \dot\diff[x(\dot x\cdot w)](v) = v(x\cdot w) + x(v\cdot w), $$$$ \diff[q_4](v) = 2(v\cdot x)x + (x\cdot x)v, $$$$ \diff[q_5](v) = v(x\cdot x)(x\cdot w) + 2x(v\cdot x)(x\cdot w) + x(x\cdot x)(v\cdot w), $$ in summary $$ \diff[q_1](v) = 2v\cdot x,\quad \diff[q_3(x; w)](v) = v(x\cdot w) + x(v\cdot w),\quad \diff[q_4](v) = 2(v\cdot x)x + (x\cdot x)v, $$$$ \diff[q_5(x; w)](v) = v(x\cdot x)(x\cdot w) + 2x(v\cdot x)(x\cdot w) + x(x\cdot x)(v\cdot w). $$ Note how $\diff[q_3]$ and $\diff[q_5]$ end up with two extra vector parameters $v, w$; this indicates that these derivatives are higher-order tensors (where by "tensor" we mean a multilinear map). The tensor types of each of the above are
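These closed forms can be sanity-checked against finite differences. An illustrative numpy sketch (not from the original answer; the random inputs and tolerances are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
v = rng.standard_normal(4)
h = 1e-6

q1 = lambda x: x @ x          # q_1(x) = x . x  (scalar)
q4 = lambda x: (x @ x) * x    # q_4(x) = (x . x) x  (vector)

# Numerical directional derivatives: (q(x + h v) - q(x)) / h
num1 = (q1(x + h * v) - q1(x)) / h
num4 = (q4(x + h * v) - q4(x)) / h

# Closed forms derived above: D[q1](v) = 2 v.x, D[q4](v) = 2(v.x)x + (x.x)v
an1 = 2 * (v @ x)
an4 = 2 * (v @ x) * x + (x @ x) * v

print(np.allclose(num1, an1, atol=1e-4), np.allclose(num4, an4, atol=1e-4))
```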

| Derivative | Tensor type |
| --- | --- |
| $\diff[q_1]$ | $(0, 1)$ |
| $\diff[q_3]$ | $(1, 2)$ |
| $\diff[q_4]$ | $(1, 1)$ |
| $\diff[q_5]$ | $(1, 2)$ |

In this case, $(p, q)$ says that $q$ vectors are inputs and $p$ vectors are outputs. We call $p + q$ the degree of the tensor. We can translate these back into index/tensor notation as follows: $$ (\diff[q_1])_i = 2x_i \sim 2x^T, $$$$ (\diff[q_3])_{ij}^k = \delta^k_ix_j + \delta_{ij}x^k \sim \mathbf1\otimes x^T + x\otimes\mathbf g, $$$$ (\diff[q_4])_i^j = 2x_ix^j + x_kx^k\delta_i^j \sim 2x\otimes x^T + |x|^2\mathbf1, $$$$ (\diff[q_5])_{ij}^k = \delta_i^kx_lx^lx_j + 2x^kx_ix_j + x^kx_lx^l\delta_{ij} \sim |x|^2\mathbf1\otimes x^T + 2x\otimes x^T\otimes x^T + |x|^2x\otimes\mathbf g. $$ In this context, $x^T$ is best thought of as the $(0,1)$ tensor dual to $x$. $\mathbf1$ is the $(1,1)$-identity tensor, which can be thought of as the identity matrix. Closely related is the metric tensor $\mathbf g(v, w) = v\cdot w$. Only $\diff[q_1]$ and $\diff[q_4]$ can be written in matrix notation, since they are the only tensors of degree $\leq2$; for $\diff[q_4]$ we could write $$ \diff[q_4] \sim 2xx^T + |x|^2\mathbf1. $$ We can see from the above precisely where your equations fail:

  1. The total derivative always takes a $(p,q)$-tensor and produces a $(p,q+1)$-tensor. Moreover, this means that when using matrix derivatives, position matters, and it only makes sense to matrix-differentiate scalar and vector expressions. Allow $\tDD{}x$ to act in both directions; we may treat it as if it were a row vector. Then there are both left and right derivatives: $$ \DD{}xx = 1,\quad x\DD{}x = \mathbf 1. $$ In the first equation, $1$ is a scalar; in the second equation, $\mathbf 1$ is a matrix. The correct derivation of your equation (1) would look like $$ (x^Tx)\DD{}x = (\dot x^Tx)\DD{}{\dot x} + (x^T\dot x)\DD{}{\dot x} = x^T\left(\dot x\DD{}{\dot x}\right) + x^T\left(\dot x\DD{}{\dot x}\right) = 2x^T\mathbf1 = 2x^T. $$ Note that $\DD{}x(x^Tx)$ doesn't make sense, being row vector $\times$ row vector $\times$ vector. If we interpret $x^Tx$ as the scalar $x\cdot x$, then we simply reproduce the derivation above.
  2. There needs to be a distinction between e.g. the $(1,1)$-tensor $\mathbf1$ and the $(0,2)$-tensor $\mathbf g$. These have the same components $$ (\mathbf1)_i^j = \delta_i^j,\quad (\mathbf g)_{ij} = \delta_{ij} $$ but act very differently: $\mathbf1$ is a function $\R^n \to \R^n$ with $\mathbf1(v) = v$, and $\mathbf g$ is a function $\R^n\times\R^n \to \R$ with $\mathbf g(v, w) = v\cdot w$.
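The distinction in point 2 is easy to see in code: viewed as functions, $\mathbf 1$ maps a vector to a vector while $\mathbf g$ maps a pair of vectors to a scalar, even though both "look like" the identity matrix componentwise. A toy numpy sketch (my own, not from the original answer):

```python
import numpy as np

# (1,1)-tensor: the identity map R^n -> R^n, one vector in, one vector out
identity = lambda v: v

# (0,2)-tensor: the metric g: R^n x R^n -> R, two vectors in, a scalar out
g = lambda v, w: float(v @ w)

v = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])
print(identity(v))  # prints [1. 2.]  (a vector)
print(g(v, w))      # prints 1.0     (a scalar: 1*3 + 2*(-1))
```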
Another answer:

The issue here is that $\frac{dx^T}{dx}$ is a $(0,2)$-tensor and not a $(1,1)$-tensor. This is because $x^T$ is already a $(0,1)$-tensor and taking derivatives adds one order in the second component (you can eat one more vector). If $\frac{dx^T}{dx}$ eats the vector $x$, it becomes a $(0,1)$-tensor just like $x^T\boldsymbol{1}$ and so you can add them with no issues. Similarly for the other expressions, you just need to be careful with the tensor orders.
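Tying the two answers together: since $\diff[q_4]$ has degree 2, it has an honest matrix representation, $2xx^T + |x|^2\mathbf 1$, which can be verified against a finite-difference Jacobian. An illustrative numpy sketch (my own; the test point and tolerance are arbitrary):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
n = x.size
eps = 1e-6

q4 = lambda x: (x @ x) * x  # q_4(x) = (x . x) x

# Finite-difference Jacobian of q4, one coordinate direction per column
J = np.column_stack([(q4(x + eps * e) - q4(x)) / eps for e in np.eye(n)])

# Matrix form read off from the index expression: 2 x x^T + |x|^2 I
J_closed = 2 * np.outer(x, x) + (x @ x) * np.eye(n)

print(np.allclose(J, J_closed, atol=1e-4))  # prints True
```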