How is the gradient of this function calculated?


From *Understanding Machine Learning: From Theory to Algorithms*:

How is the equation in the red box below derived?

Isn't the gradient of the objective function

$$J = \lambda w^T w + \frac{1}{m}\sum_{i=1}^m \frac{1}{2}(w^Tx_i - y_i)^2$$ equal to

$$\nabla J = 2\lambda w + \frac{1}{m}\sum_{i=1}^m (w^Tx_i - y_i)x_i$$

And $$\sum_{i=1}^m (w^Tx_i - y_i)x_i = \sum_{i=1}^m(w^Tx_i)x_i - \sum_{i=1}^my_ix_i$$

How does the author get $A$ out of this gradient?


[Image: the red-boxed equation from the book]

1 Answer
From your initial expression:

$$ J = \lambda w^T w + \frac{1}{m}\sum_{i=1}^m \frac{1}{2}(w^Tx_i - y_i)^2 $$

Taking the gradient with respect to the vector $w$ yields

$$ \nabla J = \lambda \nabla (w^T w) + \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \nabla \left(w^Tx_i-y_i \right)^2. $$ Let's compute the gradient of each term separately. First term: $$ \nabla (w^T w) = 2w. $$ Second term, by the chain rule: $$ \nabla(w^T x_i - y_i)^2 = \frac{d(w^T x_i - y_i)^2}{d(w^T x_i - y_i)} \nabla(w^T x_i - y_i) = 2(w^T x_i - y_i) x_i. $$
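As a quick sanity check (not from the book), we can compare both analytic gradients against central finite differences on random data; the dimensions and values below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = rng.normal(size=d)
x = rng.normal(size=d)
y = 0.7

def num_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# First term: gradient of w^T w should be 2w.
g1 = num_grad(lambda v: v @ v, w)
assert np.allclose(g1, 2 * w, atol=1e-5)

# Second term: gradient of (w^T x - y)^2 should be 2 (w^T x - y) x.
g2 = num_grad(lambda v: (v @ x - y) ** 2, w)
assert np.allclose(g2, 2 * (w @ x - y) * x, atol=1e-5)
```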

Let's back substitute

$$ \nabla J = 2\lambda w + \frac{1}{m} \sum_{i=1}^m (w^T x_i - y_i) x_i = 2\lambda w + \frac{1}{m} \left( \sum_{i=1}^m (w^T x_i) x_i - \sum_{i=1}^m y_i x_i \right) $$
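The full gradient can also be verified numerically. The sketch below (my own check, with arbitrary dimensions $m$, $d$ and regularization $\lambda$) stores the samples as columns of a matrix so the sum $\frac{1}{m}\sum_i (w^Tx_i - y_i)x_i$ becomes the matrix-vector product $\frac{1}{m}X(X^Tw - y)$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 10, 3, 0.5
X = rng.normal(size=(d, m))   # columns are the samples x_i
y = rng.normal(size=m)
w = rng.normal(size=d)

def J(v):
    # lambda w^T w + (1/m) sum_i (1/2)(w^T x_i - y_i)^2
    return lam * v @ v + 0.5 / m * np.sum((v @ X - y) ** 2)

# Analytic gradient: 2*lam*w + (1/m) sum_i (w^T x_i - y_i) x_i
grad = 2 * lam * w + X @ (w @ X - y) / m

# Central finite differences along each coordinate direction.
eps = 1e-6
num = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                for e in np.eye(d)])
assert np.allclose(grad, num, atol=1e-5)
```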

Now let's analyze the summation

$$ \sum_{i=1}^m (w^T x_i) x_i = \sum_{i=1}^m (x_i^T w) x_i = X X^T w = A w. $$ Here $X$ is the $d \times m$ matrix whose columns are the data points, $$ X = \left[x_1 , \ldots, x_m \right]. $$ Observe that $$ X X^T = \sum_{i=1}^m x_i x_i^T. $$ I suppose, apart from dimension checking, the most rigorous way to prove the equality is to show that the two matrices $A = XX^T$ and $B = \sum_{i=1}^m x_i x_i^T$ define the same bilinear form, which means you need to check that
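Both identities above are easy to confirm numerically. Here is a minimal NumPy check on random data (sizes are arbitrary): the summation form of the gradient term equals $XX^Tw$, and $XX^T$ equals the sum of outer products.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 7, 4
X = rng.normal(size=(d, m))   # X = [x_1, ..., x_m], columns are the x_i
w = rng.normal(size=d)

# sum_i (w^T x_i) x_i  versus  X X^T w  (i.e. A w)
lhs = sum((w @ X[:, i]) * X[:, i] for i in range(m))
rhs = X @ X.T @ w
assert np.allclose(lhs, rhs)

# X X^T as a sum of rank-one outer products x_i x_i^T
A = sum(np.outer(X[:, i], X[:, i]) for i in range(m))
assert np.allclose(A, X @ X.T)
```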

$$ e_i^T X X^T e_j = e_i^T \left( \sum_{k=1}^m x_k x_k^T \right) e_j $$ for all $i,j$, where $e_i$ is the $i$-th vector of the canonical basis in $\mathbb{R}^d$ (the summation index is renamed $k$ here so it doesn't clash with the basis index $i$). If you keep the indices straight, the equality follows directly.
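The bilinear-form argument can be sketched numerically as well: evaluating $e_i^T A e_j$ simply picks out the $(i,j)$ entry of $A$, so checking the form on all pairs of canonical basis vectors is the same as checking the matrices entry by entry. A small NumPy illustration (arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 5, 3
X = rng.normal(size=(d, m))

A = X @ X.T
B = sum(np.outer(X[:, k], X[:, k]) for k in range(m))

# e_i^T A e_j vs e_i^T B e_j for every pair of canonical basis vectors in R^d
E = np.eye(d)
for i in range(d):
    for j in range(d):
        assert np.isclose(E[i] @ A @ E[j], E[i] @ B @ E[j])
```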