For context: I'm taking a convex optimization class and studying Lagrangians. While working through an example problem, I ran into something I don't understand.
The optimization problem in question is a simple least-squares solution of linear equations:
$\text{minimize}\quad \ x^Tx$
$\text{subject to}\quad Ax = b$
The Lagrangian of this problem is
$$L(x, \nu) = x^Tx + \nu^T(Ax - b)$$
To find the dual function $g(\nu)$, we minimize the Lagrangian over $x$ by setting its gradient to zero:
$$\nabla_xL(x, \nu) = 2x + A^T\nu=0$$
Solving this gives $x = -\frac{1}{2}A^T\nu$. Plugging this back into the Lagrangian, I get:
$$ \begin{align} g(\nu)& = L(-\frac{1}{2}A^T\nu,\ \nu)\\ & = (-\frac{1}{2}A^T\nu)^T(-\frac{1}{2}A^T\nu) + \nu^T(A(-\frac{1}{2}A^T\nu) - b) \\ & = \frac{1}{4}\nu^TAA^T\nu - \frac{1}{2}\nu^TAA^T\nu - \nu^Tb \\ & = -\frac{1}{4}\nu^TAA^T\nu - \nu^Tb\end{align}$$
However, the dual function that's in the textbook is:
$$g(\nu) = -\frac{1}{4}\nu^TAA^T\nu - b^T\nu$$
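As a quick numerical sanity check of the algebra (a sketch, using a randomly generated $A$, $b$, and $\nu$), we can compare the Lagrangian evaluated at the minimizer $x^\star = -\frac{1}{2}A^T\nu$ against the closed-form dual:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # wide matrix, so Ax = b is underdetermined
b = rng.standard_normal(3)
nu = rng.standard_normal(3)       # an arbitrary dual point

# Minimizer of the Lagrangian over x: x* = -(1/2) A^T nu
x_star = -0.5 * A.T @ nu

# Dual function evaluated directly: g(nu) = L(x*, nu)
g_direct = x_star @ x_star + nu @ (A @ x_star - b)

# Closed form from the derivation: g(nu) = -(1/4) nu^T A A^T nu - b^T nu
g_closed = -0.25 * nu @ A @ A.T @ nu - b @ nu

assert np.isclose(g_direct, g_closed)
```

The two values agree, so the derivation itself is fine; the only question is the notational one about $\nu^Tb$ versus $b^T\nu$.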
I have two questions regarding this derivation:
- When we find the gradient of the Lagrangian, I initially thought the second term would be $\nu^TA$, since we're differentiating $\nu^TAx$ with respect to $x$. This probably stems from my weak grasp of vector differentiation, but why is it $A^T\nu$?
- This is similar to question 1, but in my derivation of the dual function I got $\nu^Tb$, whereas apparently the correct term is $b^T\nu$. When we expand the parentheses, is it normal for the vectors to swap places like this?
The second is easily disposed of: $\nu^Tb$ is a scalar—the dot product of $\nu$ and $b$—therefore $$\nu^Tb=(\nu^Tb)^T=b^T\nu.$$ It doesn’t matter which one comes first.
As for $\nu^TAx$, its differential is indeed $\nu^TA$, but since you’re working with column vectors, $\nu^TA$ has the wrong shape: the gradient of a scalar-valued function is also a column vector. Expanding by coordinates, we have $\nu^TAx = \sum_{i,j} a_{ij} \nu_i x_j$ so that $\partial_j(\nu^TAx) = \sum_i a_{ij}\nu_i$, that is, the $j$th element of the column vector $\nabla(\nu^TAx)$ is the dot product of $\nu$ with the $j$th column of $A$, from which $\nabla(\nu^TAx) = A^T\nu$.
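The coordinate expansion above can be checked numerically with finite differences (a sketch with randomly generated data): the analytic gradient $A^T\nu$ should match the numerically estimated partial derivatives of $f(x)=\nu^TAx$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
nu = rng.standard_normal(4)
x = rng.standard_normal(3)

# Scalar-valued function f(x) = nu^T A x
f = lambda x: nu @ A @ x

# Analytic gradient from the coordinate expansion: a column vector A^T nu
grad_analytic = A.T @ nu

# Central finite-difference estimate of each partial derivative d f / d x_j
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(grad_analytic, grad_fd, atol=1e-6)
```

Note that $\nu^TA$ (a row vector) contains the same numbers; the transpose $A^T\nu$ is just the convention that makes the gradient a column vector like $x$, so that $\nabla_x L = 2x + A^T\nu$ is shape-consistent.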