How to calculate $\nabla_{\mathbf{x}}(\mathbf{c}\mathbf{x}-A)\mathbf{x}^t$?


How to calculate $\nabla_{\mathbf{x}}(\mathbf{c}\mathbf{x}-A)\mathbf{x}^t$ directly?

$\mathbf{x}\in\mathbb{R}^{1\times n}$ , $\mathbf{c}\in\mathbb{R}^{m\times 1}$, $A\in\mathbb{R}^{m\times n}$.

My attempt:

Denote the expression $(\mathbf{c}\mathbf{x}-A)\mathbf{x}^t$ by $f$. Then find $\frac{\partial f_i}{\partial x_j}$:

$f_i = ((\mathbf{c}\mathbf{x}-A)\mathbf{x}^t)_i = \sum_j c_ix_j^2-\sum_jA_{ij}x_j$

$\frac{\partial f_i}{\partial x_j} = 2c_ix_j - A_{ij}$

So, $\nabla_\mathbf{x}(\mathbf{c}\mathbf{x}-A)\mathbf{x}^t = 2\mathbf{c}\mathbf{x}-A.$
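The component-wise result can be sanity-checked numerically. Below is a small sketch (assuming NumPy; the sizes and random data are illustrative) that compares the claimed Jacobian $2\mathbf{c}\mathbf{x}-A$ against central finite differences of $f$:

```python
import numpy as np

# Check df_i/dx_j = 2 c_i x_j - A_ij, i.e. the Jacobian of
# f(x) = (c x - A) x^T is 2 c x - A.
# Shapes follow the question: x is 1×n, c is m×1, A is m×n.
rng = np.random.default_rng(0)
m, n = 3, 4
x = rng.standard_normal((1, n))
c = rng.standard_normal((m, 1))
A = rng.standard_normal((m, n))

def f(x):
    return (c @ x - A) @ x.T          # m×1 vector

# Central finite differences: one column of the Jacobian per coordinate of x.
eps = 1e-6
J_num = np.zeros((m, n))
for j in range(n):
    dx = np.zeros((1, n)); dx[0, j] = eps
    J_num[:, j] = ((f(x + dx) - f(x - dx)) / (2 * eps)).ravel()

J_formula = 2 * c @ x - A             # the claimed gradient/Jacobian
print(np.allclose(J_num, J_formula, atol=1e-6))   # True
```

Since $f$ is quadratic in $\mathbf{x}$, central differences agree with the formula up to rounding error.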


Instead of calculating the $ij$-th component, how can the gradient be computed directly?


There are 3 solutions below.

Solution 1.

The map $h : x \mapsto (cx-A)x^T$ is defined on $\mathbb R^n$ and takes its values in $\mathbb R^m$. Hence you can't really speak of the gradient as $h$ is not a scalar-valued function.

However, you can speak of the Fréchet (or total) derivative of $h$ at a point $x$.

$f : x \mapsto cx-A$ is an affine map, therefore its Fréchet derivative is constant and equal at each $x \in \mathbb R^n$ to the linear map $u \mapsto cu$. $B : (C,x) \mapsto Cx^T$ is a bilinear map, whose Fréchet derivative at $(C,x)$ is the map $(U,v) \mapsto Ux^T +Cv^T$.

Based on the chain rule, you get

$$h^\prime(x)(u) = (cu)x^T + (cx-A)u^T$$ where $h^\prime(x)$ is the Fréchet derivative of $h$ at $x$.

Moreover, because the matrix product is associative and $u x^T$ is a scalar (so $u x^T = x u^T$), you have $$(cu)x^T= c(u x^T) = c(xu^T) = (cx)u^T,$$ which allows us to conclude that

$$h^\prime(x)(u) = (2cx-A)u^T.$$

In terms of Matrix calculus, it means that $$\frac{\partial h}{\partial x} = 2cx - A.$$
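The Fréchet-derivative formula can be checked against a directional finite difference. This is a sketch (assuming NumPy; the random direction $u$ and sizes are illustrative):

```python
import numpy as np

# Check the Fréchet-derivative formula h'(x)(u) = (2 c x - A) u^T
# by comparing it with a central difference of h along a random direction u.
rng = np.random.default_rng(1)
m, n = 3, 4
x = rng.standard_normal((1, n))   # row vector, as in the question
u = rng.standard_normal((1, n))   # direction of differentiation
c = rng.standard_normal((m, 1))
A = rng.standard_normal((m, n))

h = lambda x: (c @ x - A) @ x.T

t = 1e-6
directional = (h(x + t * u) - h(x - t * u)) / (2 * t)   # ≈ h'(x)(u)
frechet = (2 * c @ x - A) @ u.T                          # the closed form
print(np.allclose(directional, frechet))   # True
```

Because $h$ is quadratic, the central difference matches the closed form up to rounding.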

Solution 2.

My understanding is that many people take the derivative to be defined by $\nabla_x f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$ (usually written as a column vector).

For me, what made derivatives click (and I can't promise this will be helpful for everyone) is to think a bit more abstractly. I also can't promise that you'll walk away from reading this with 100% understanding, but if you are interested in learning more, this is the subject of "differential geometry"/"differential topology", so you can look into that if you're curious.

Given a function $f : N^n \to M^m$ between two manifolds (e.g. vector spaces), the derivative $(df)_x : T_xN \to T_{f(x)}M$ is the "best linear approximation to $f$ at $x$." In concrete terms, for $f : \mathbf{R}^n \to \mathbf{R}^m$, the derivative is the $m \times n$ matrix $(df)_x$ such that

$$f(x + h) = f(x) + (df)_xh + o(h)$$

which is essentially Taylor's theorem, but now taken as a definition. The reason for this is that the usual definition:

$$ \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = f'(x)$$

doesn't make sense when $f, x, h$ are vectors. Instead we rearrange it like so:

$$ 0 = \lim_{h \to 0} \left( \frac{f(x + h) - f(x)}{h} - f'(x) \right) \leadsto \lim_{h \to 0} \frac{f(x + h) - f(x) - f'(x)h}{h} = 0$$

then replace the denominator by $|h|$ (the norm). You can see exactly that this is saying that

$$f(x + h) = f(x) + f'(x)h + o(h).$$

Given this definition, the form of the gradient I wrote at the very beginning is now a theorem (when $m = 1$). This definition has the nice property that it is now "obvious" that a linear function is its own derivative. That is, if $f(x) = Ax$ then $f'(x) = f = A$ (without the $x$ on the right). Note that

$$f(x + h) = f(x) + f(h)$$

so this has the form $f(x + h) = f(x) + (df)_xh + o(h)$ with $(df)_x = f$ and $o(h) = 0$.

Likewise, for the quadratic map $x \mapsto x^Tx$ (built from the bilinear map $(x, y) \mapsto x^Ty$), you can compute $(x + h)^T(x + h)$ and write it as

$$\underbrace{(x + h)^T(x + h)}_{f(x + h)} = \underbrace{x^Tx}_{f(x)} + \underbrace{2x^T}_{(df)_x}h + \underbrace{h^Th}_{o(h)}. $$
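The expansion above can be verified numerically. This sketch (assuming NumPy, with random data) subtracts the exact $o(h)$ term $h^Th$ and checks that what remains is the linear part $2x^Th$:

```python
import numpy as np

# Check the expansion f(x+h) = x^T x + 2 x^T h + h^T h for f(x) = x^T x,
# which identifies (df)_x = 2 x^T (column-vector convention).
rng = np.random.default_rng(2)
n = 5
x = rng.standard_normal((n, 1))
h = 1e-6 * rng.standard_normal((n, 1))   # small perturbation

f = lambda v: (v.T @ v).item()

# Subtract the o(h) term exactly; only the linear part 2 x^T h remains.
linear_part = f(x + h) - f(x) - (h.T @ h).item()
print(np.isclose(linear_part, (2 * x.T @ h).item()))   # True
```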


More abstractly, one has the chain rule:

$$(d(f\circ g))_x = (df)_y \circ (dg)_x$$

and the Leibniz rule:

$$(d(f \otimes g))_{x\otimes y} = (df)_x \otimes g(y) + f(x) \otimes (dg)_y.$$

And we have $x^Tx = \operatorname{tr}(x^* \otimes x)$ which is a composition of the linear function $\operatorname{tr}$ and the function $x^* \otimes x = (I^* \otimes I)(x)$. Thus, using the chain rule and Leibniz rule, we have

$$(dx^Tx)_x = (d\operatorname{tr})_{x^* \otimes x} \circ [(dx^*)_x \otimes x + x^* \otimes (dx)_x] = \operatorname{tr} \circ [I^* \otimes x + x^* \otimes I]$$

As a function of $h$, this is

$$\operatorname{tr} \left( [I^* \otimes x + x^* \otimes I] h \right) = \operatorname{tr}(h^* \otimes x + x^* \otimes h) = h^Tx + x^Th = 2x^Th.$$

That gives us the derivative $(dx^Tx)_x = 2x^T$.

Solution 3.

Elaborating on previous answers, you can use a full differential approach to compute the Jacobian of $ \mathbf{u}= [\mathbf{c}\mathbf{x}^T-\mathbf{A}]\mathbf{x} $.

Note: I think it is better to represent vectors as column vectors (so I swapped the transpose)

From here, by the product rule, $$ d\mathbf{u}= [\mathbf{c}\mathbf{x}^T-\mathbf{A}] d\mathbf{x}+ [\mathbf{c}(d\mathbf{x})^T]\mathbf{x}. $$ Since $(d\mathbf{x})^T\mathbf{x}$ is a scalar, the second term can be rewritten as $\mathbf{c} \mathbf{x}^T d\mathbf{x}$.

Finally the Jacobian is the matrix $$ \frac{\partial \mathbf{u}}{\partial \mathbf{x}} = [\mathbf{c}\mathbf{x}^T-\mathbf{A}]+\mathbf{c} \mathbf{x}^T = 2\mathbf{c} \mathbf{x}^T - \mathbf{A} $$
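This Jacobian can also be checked numerically. A sketch (assuming NumPy; random sizes and data are illustrative), using this answer's column-vector convention:

```python
import numpy as np

# Verify the differential-based result du/dx = 2 c x^T - A for
# u(x) = (c x^T - A) x, with x an n×1 column vector.
rng = np.random.default_rng(3)
m, n = 3, 4
x = rng.standard_normal((n, 1))
c = rng.standard_normal((m, 1))
A = rng.standard_normal((m, n))

u = lambda x: (c @ x.T - A) @ x

# Central differences along each standard basis vector e_j.
eps = 1e-6
J_num = np.column_stack([
    ((u(x + eps * e) - u(x - eps * e)) / (2 * eps)).ravel()
    for e in np.eye(n).reshape(n, n, 1)
])
J = 2 * c @ x.T - A                   # the Jacobian from the differential
print(np.allclose(J_num, J))   # True
```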