Finding the gradient in least squares


In a least squares optimization problem I have
$$A=\begin{pmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^k \\ 1 & t_2 & t_2^2 & \cdots & t_2^k \\ \vdots & \vdots& \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^k \end{pmatrix}$$
and
$$X=\begin{pmatrix} x_0 \\ x_1 \\ \vdots\\ x_k \end{pmatrix}$$

I want to find the gradient of $(AX)^TAX$.

Since $(AX)^TAX$ is a scalar, I let the $i$-th element of $AX$ be $V_i$, so that $(AX)^TAX=\sum_{i=1}^{n}V_i^2$, where $V_i=\sum_{j=1}^{k+1}a_{ij}X_j$.

So
$$\frac{\partial \bigl((AX)^TAX\bigr)}{\partial X_j}=\sum_{i=1}^n 2V_i\frac{\partial V_i}{\partial X_j}=2\sum_{i=1}^n\Bigl(\sum_{j=1}^{k+1}a_{ij}X_j\Bigr)a_{ij}=2\sum_{i=1}^n\sum_{j=1}^{k+1}a_{ij}^2X_j.$$

But I think this is wrong, since the gradient of $(AX)^TAX$ should be $2A^TAX$.
Can someone please tell me whether my approach is wrong and what I should do?

Best answer

(Expanding a comment)

More generally, you can observe that for any matrix $M\in\mathbb R^{n\times n}$, the function $$ \begin{array}{rrcl} f: &\hspace{-5pt}\mathbb R^n &\hspace{-5pt} \longrightarrow &\hspace{-5pt} \mathbb R \\ &\hspace{-5pt} x &\hspace{-5pt} \longmapsto &\hspace{-5pt} x^TMx \end{array} $$ is differentiable with gradient $$\tag{1} \nabla f(x) = (M+M^T)x. $$ Once you verify this, your claim follows immediately: $(AX)^T(AX) = X^T(A^TA)X$, and since $A^TA$ is symmetric, $(A^TA)^T+(A^TA)=2A^TA$.

We are left to verify the general statement $(1)$. Writing $$ f(x) = x^TMx = \sum_{i,j=1}^n M_{i,j}x_ix_j, $$ for $\alpha=1,\ldots,n$ the $\alpha$-th component of the gradient is $$ \bigl(\nabla f(x)\bigr)_\alpha = \frac{\partial}{\partial x_\alpha}\sum_{i,j=1}^n M_{i,j}x_ix_j = \sum_{i,j=1}^n M_{i,j}\tfrac{\partial}{\partial x_\alpha}(x_ix_j) = \sum_{i,j=1}^n M_{i,j} \left( \tfrac{\partial}{\partial x_\alpha}(x_i) x_j + x_i\tfrac{\partial}{\partial x_\alpha}(x_j) \right). $$ Now, since $\tfrac{\partial}{\partial x_\alpha}(x_i) = 1$ if $\alpha=i$ and $0$ otherwise, splitting the sum into the terms with $i=\alpha$ and the terms with $j=\alpha$, it reduces to $$ \bigl(\nabla f(x)\bigr)_\alpha = \underbrace{ \sum_{j=1}^n M_{\alpha,j} x_j }_{i=\alpha} + \underbrace{ \sum_{i=1}^n M_{i,\alpha} x_i }_{j=\alpha} \stackrel{\text{renaming indices}}{=} \sum_{\beta=1}^n \left( M_{\alpha,\beta} + M_{\beta,\alpha} \right) x_\beta = \bigl( Mx+M^Tx \bigr)_\alpha = \bigl( (M+M^T)x \bigr)_\alpha $$ (the subscript $\alpha$ still denotes the $\alpha$-th component of the vector). This proves the claim.
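As a sanity check (not part of the original answer), formula $(1)$ is easy to verify numerically, e.g. with NumPy and central finite differences; the matrix size and random seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))   # a generic, non-symmetric matrix
x = rng.standard_normal(n)

def f(v):
    return v @ M @ v              # f(x) = x^T M x

# Gradient predicted by formula (1)
grad_formula = (M + M.T) @ x

# Central finite differences along each coordinate direction
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(n)])

# f is quadratic, so central differences agree up to rounding error
print(np.max(np.abs(grad_formula - grad_fd)))
```

Note that because $f$ is a quadratic polynomial, the central-difference approximation has no truncation error here, only floating-point rounding.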

A faster way to see all this is to use the bilinearity of the scalar product: write $x^TMx = \langle x, Mx\rangle$ and apply the product rule when taking the gradient of the last expression.
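Applied to the original question, the same numerical check confirms $\nabla\bigl((AX)^TAX\bigr)=2A^TAX$ for a Vandermonde matrix $A$ as in the question; the sample points $t_i$ and the degree $k$ below are arbitrary assumptions for illustration:

```python
import numpy as np

# Vandermonde design matrix as in the question: row i is (1, t_i, t_i^2, ..., t_i^k)
t = np.linspace(0.0, 1.0, 6)              # n = 6 sample points (assumed)
k = 3
A = np.vander(t, k + 1, increasing=True)  # shape (n, k+1)

rng = np.random.default_rng(1)
X = rng.standard_normal(k + 1)

def f(v):
    r = A @ v
    return r @ r                          # (AX)^T (AX)

# Gradient claimed in the question
grad_expected = 2 * A.T @ A @ X

# Central finite differences
h = 1e-6
grad_fd = np.array([(f(X + h * e) - f(X - h * e)) / (2 * h)
                    for e in np.eye(k + 1)])

print(np.max(np.abs(grad_expected - grad_fd)))
```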