Derivation of gradient for SVM function.


Referring to this page: http://cs231n.github.io/optimization-1/. Given this equation:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

To find the gradient with respect to $W$, which is a matrix, the text jumps straight to the solution:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

and

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

It's been quite a while since I've done multivariable calculus, and I have no idea how they arrived at this solution. Can someone provide a detailed step-by-step procedure showing how this was done? An example with a small matrix (say 2x2) would help me understand the derivation.

Also, why does the summation disappear in the last equation? Is that a mistake?


BEST ANSWER

Fix $j \ne y_i$. The gradient of $w_j^\top x_i - w_{y_i}^\top x_i + \Delta$ with respect to $w_{y_i}$ is $-x_i$. (If this is not obvious to you, consider the function $f(w_{y_i}) := w_j^\top x_i - w_{y_i}^\top x_i + \Delta$, take all partial derivatives with respect to the components of $w_{y_i}$, and stack them into a vector to form the gradient.)


Next, let's consider $g(w_{y_i}) := \max(0, w_j^\top x_i - w_{y_i}^\top x_i + \Delta)$. There are a few cases:

  • when $w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0$, then $g(w_{y_i}) = f(w_{y_i})$ (see above) so the gradient is $-x_i$.
  • when $w_j^\top x_i - w_{y_i}^\top x_i + \Delta < 0$, then $g(w_{y_i})$ is zero, so the gradient is $0$.
  • the site you linked is probably glossing over the non-differentiability in the case $w_j^\top x_i - w_{y_i}^\top x_i + \Delta=0$, so I will ignore it.

This yields the gradient $- \mathbb{1}(w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0) x_i$.


Next, just sum the above over the cases $j \ne y_i$ to get the first gradient $\nabla_{w_{y_i}} L_i$.



The third expression is the gradient with respect to $w_j$ for $j \ne y_i$. There is only one term in $L_i$ that involves $w_j$, which is why there is no sum. The reason there is a sum when we take the gradient with respect to $w_{y_i}$ is that $w_{y_i}$ appears in every term of the sum.
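The derivation above can be checked numerically. Here is a minimal NumPy sketch (the toy matrix, sample, and variable names are my own, not from the course notes) that implements the two gradient formulas for a 3-class, 2-feature example and verifies them against central finite differences:

```python
import numpy as np

# Tiny example: 3 classes, 2 features (W is 3x2), one sample x with label y.
W = np.array([[0.5, -0.3],
              [0.2,  0.8],
              [-0.4, 0.1]])
x = np.array([1.0, 2.0])
y = 0          # correct class index y_i
delta = 1.0    # margin

def loss(W):
    scores = W @ x                                  # s_j = w_j^T x
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                                # exclude j = y_i from the sum
    return margins.sum()

# Analytic gradient, following the two formulas above.
scores = W @ x
active = (scores - scores[y] + delta > 0)           # the indicator 1(...)
active[y] = False
dW = np.zeros_like(W)
dW[active] = x                                      # rows j != y_i: 1(...) * x_i
dW[y] = -active.sum() * x                           # row y_i: -(sum of indicators) * x_i

# Numerical check via central differences.
num = np.zeros_like(W)
h = 1e-6
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.allclose(dW, num, atol=1e-5))  # True
```

Since the loss is piecewise linear, the finite-difference check is essentially exact as long as no margin sits precisely at the kink of the max.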

ANSWER

In response to "Also, why does the summation disappear in the last equation? Is that a mistake?"

The confusion here arises from the max function. To keep things simple, let's ignore the max for a moment and write

$L_i = \sum_{j \ne y_i} \left[ w_j^T x_i - w_{y_i}^T x_i + \Delta \right]$

Ignoring the constant $\Delta$ terms, this expands to

$w_0^T x_i + w_1^T x_i + \cdots + w_j^T x_i + \cdots - w_{y_i}^T x_i - w_{y_i}^T x_i - w_{y_i}^T x_i - \cdots$

with one $-w_{y_i}^T x_i$ for each $j \ne y_i$.

But you are taking the gradient with respect to a single row: $j = 0$, or $j = 1$, or $j = 2$, and so on. So if you take the gradient with respect to, say, $w_2$ (with $y_i \ne 2$):

$\frac{\partial(w_0^T x_i)}{\partial w_2} + \frac{\partial(w_1^T x_i)}{\partial w_2} + \frac{\partial(w_2^T x_i)}{\partial w_2} + \cdots - \frac{\partial(w_{y_i}^T x_i)}{\partial w_2} - \frac{\partial(w_{y_i}^T x_i)}{\partial w_2} - \cdots$

All terms vanish except $\frac{\partial(w_2^T x_i)}{\partial w_2}$, which equals $x_i$.

For $\nabla_{w_{y_i}} L_i$ you get

$\frac{\partial(w_0^T x_i)}{\partial w_{y_i}} + \frac{\partial(w_1^T x_i)}{\partial w_{y_i}} + \frac{\partial(w_2^T x_i)}{\partial w_{y_i}} + \cdots - \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \cdots$

which reduces to

$- \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \frac{\partial(w_{y_i}^T x_i)}{\partial w_{y_i}} - \cdots = -x_i - x_i - x_i - \cdots$

Hence the sum survives for $w_{y_i}$: one $-x_i$ contribution per term $j \ne y_i$. With the max restored, only the terms whose margin is positive contribute, which is exactly what the indicator function $\mathbb{1}(\cdot)$ records.
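The counting argument above can be checked directly. This is a small NumPy sketch of the simplified (max-free) loss (the class count, seed, and names are my own for illustration): with $C$ classes, $w_{y_i}$ appears with a minus sign in all $C-1$ terms, so its gradient is $-(C-1)\,x_i$, while each $w_j$ with $j \ne y_i$ contributes a single $x_i$.

```python
import numpy as np

# Simplified L_i without the max: sum over j != y of (w_j^T x - w_y^T x + delta).
C, D = 4, 2                    # 4 classes, 2 features
rng = np.random.default_rng(0)
W = rng.standard_normal((C, D))
x = np.array([1.0, 2.0])
y = 1                          # correct class index

def simple_loss(W):
    return sum(W[j] @ x - W[y] @ x + 1.0 for j in range(C) if j != y)

# Numerical gradient via central differences.
grad = np.zeros_like(W)
h = 1e-6
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    grad[idx] = (simple_loss(Wp) - simple_loss(Wm)) / (2 * h)

# Row y_i accumulates -(C-1) * x; every other row gets a single x.
print(np.allclose(grad[y], -(C - 1) * x, atol=1e-4))  # True
print(np.allclose(grad[2], x, atol=1e-4))             # True
```

Because the simplified loss is linear in $W$, the finite-difference gradient recovers these coefficients exactly up to floating-point error.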