Why do the terms get swapped in vector differentiation when calculating the gradient for a Lagrangian?


For context, I'm taking a convex optimization class and studying Lagrangians. While working through an example problem, I had trouble understanding one step.

The optimization problem in question is a simple least-squares solution of linear equations:

$\text{minimize}\quad x^Tx$

$\text{subject to}\quad Ax = b$

The Lagrangian of this problem is

$$L(x, \nu) = x^Tx + \nu^T(Ax - b)$$

To find the dual function $g(\nu)$, we set the gradient of the Lagrangian with respect to $x$ to zero:

$$\nabla_xL(x, \nu) = 2x + A^T\nu=0$$

Solving this gives $x = -\frac{1}{2}A^T\nu$. Substituting it back into the Lagrangian, I get:

$$ \begin{align} g(\nu)& = L(-\frac{1}{2}A^T\nu,\ \nu)\\ & = (-\frac{1}{2}A^T\nu)^T(-\frac{1}{2}A^T\nu) + \nu^T(A(-\frac{1}{2}A^T\nu) - b) \\ & = \frac{1}{4}\nu^TAA^T\nu - \frac{1}{2}\nu^TAA^T\nu - \nu^Tb \\ & = -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb\end{align}$$
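As a numerical sanity check on the algebra (a minimal NumPy sketch; $A$, $b$, $\nu$, and the dimensions are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                      # hypothetical problem dimensions
A = rng.standard_normal((m, n))  # random stand-ins for the problem data
b = rng.standard_normal(m)
nu = rng.standard_normal(m)      # an arbitrary dual point

# Minimizer of the Lagrangian over x: x* = -(1/2) A^T nu
x_star = -0.5 * A.T @ nu

# Lagrangian evaluated at x*
L = x_star @ x_star + nu @ (A @ x_star - b)

# My simplified expression: -(1/4) nu^T A A^T nu - nu^T b
g = -0.25 * nu @ A @ A.T @ nu - nu @ b

print(np.isclose(L, g))  # True
```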

However, the dual function that's in the textbook is:

$$g(\nu) = -\frac{1}{4}\nu^TAA^T\nu - b^T\nu$$

I have two questions regarding this derivation:

  1. When we take the gradient of the Lagrangian, I initially thought the second term would be $\nu^TA$, since we are differentiating $\nu^TAx$. This probably stems from my lack of understanding of vector differentiation, but why is it $A^T\nu$?
  2. This is similar to question 1: in my derivation of the dual function I got $\nu^Tb$, but the textbook has $b^T\nu$. When we expand the parentheses, do the vectors usually swap places like this?

There are 2 answers below.

Best answer:

The second is easily disposed of: $\nu^Tb$ is a scalar—the dot product of $\nu$ and $b$—therefore $$\nu^Tb=(\nu^Tb)^T=b^T\nu.$$ It doesn’t matter which one comes first.
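A quick numerical illustration, if it helps (a minimal NumPy sketch with arbitrary example vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
nu = rng.standard_normal(4)  # arbitrary example vectors
b = rng.standard_normal(4)

# Both orderings give the same scalar: a 1x1 quantity equals its transpose
print(np.isclose(nu @ b, b @ nu))  # True
```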

As for $\nu^TAx$, its differential is indeed $\nu^TA$, but since you’re working with column vectors, $\nu^TA$ has the wrong shape: the gradient of a scalar-valued function is also a column vector. Expanding by coordinates, we have $\nu^TAx = \sum_{i,j} a_{ij} \nu_i x_j$ so that $\partial_j(\nu^TAx) = \sum_i a_{ij}\nu_i$, that is, the $j$th element of the column vector $\nabla(\nu^TAx)$ is the dot product of $\nu$ with the $j$th column of $A$, from which $\nabla(\nu^TAx) = A^T\nu$.
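If you want to confirm this numerically, here is a minimal sketch (assuming NumPy, with random placeholder values for $A$, $\nu$, and $x$) that compares a central finite-difference gradient of $f(x) = \nu^TAx$ against $A^T\nu$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))  # random placeholders
nu = rng.standard_normal(m)
x = rng.standard_normal(n)

f = lambda x: nu @ A @ x  # the scalar-valued function x -> nu^T A x

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_fd, A.T @ nu))  # True: the gradient is A^T nu
```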

Second answer:

$\newcommand{\grad}{\nabla_x}$ 1) Remember that $\grad c^Tx = c$ if $c$ is a constant (column) vector. Equivalently, if $r$ is a row vector (and $x$ a column vector), then $\grad rx = r\color{blue}{^T}$; that is, we must remember to transpose the row vector multiplying $x$. (To see this, set $r = c^T$ in the identity above, so that $c = r^T$.)

In your case, the row vector $r$ is $\nu^T A$, so $$\grad \left(\nu^T A x\right)=r^T = A^T \nu.$$
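A small shape-aware check of the same rule (a minimal NumPy sketch; the row vector $r$ and the finite-difference setup are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
r = rng.standard_normal((1, n))  # a row vector
x = rng.standard_normal((n, 1))  # a column vector

f = lambda x: (r @ x).item()     # f(x) = r x, a scalar

# Central finite differences, collected as a column vector
eps = 1e-6
I = np.eye(n)
grad_fd = np.array([[(f(x + eps * I[:, [j]]) - f(x - eps * I[:, [j]])) / (2 * eps)]
                    for j in range(n)])

print(np.allclose(grad_fd, r.T))  # True: the gradient of r x is the column vector r^T
```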

2) Recall that $\color{blue}{v^T w = w^T v}$ for all $v,w \in \Bbb{R}^n$ (since both equal $\sum\limits_{i=1}^{n} v_i w_i$). Hence $\nu^T b$ and $b^T \nu$ are both correct.