Referring to the site here: http://cs231n.github.io/optimization-1/, and given this equation:
$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$
To find the gradient with respect to $W$, which is a matrix, the text jumps straight to the solution:
$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
and
$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$
It's been quite a while since I've done multivariable calculus, and I have no idea how they arrived at this solution. Can someone provide a detailed step-by-step derivation? A worked example with a 2x2 matrix would help me understand how all of this was derived.
Also, why does the summation disappear in the last equation? Is that a mistake?
Fix $j \ne y_i$. The gradient of $w_j^\top x_i - w_{y_i}^\top x_i + \Delta$ with respect to $w_{y_i}$ is $-x_i$. (If this is not obvious, consider the function $f(w_{y_i}) := w_j^\top x_i - w_{y_i}^\top x_i + \Delta$, take all partial derivatives with respect to the components of $w_{y_i}$, and stack them into a vector to form the gradient.)
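To make the "stack the partials" step concrete, here is the two-component case the question asks about, writing $w_{y_i} = (w_{y_i,1}, w_{y_i,2})^\top$ and $x_i = (x_{i,1}, x_{i,2})^\top$:

$$f(w_{y_i}) = w_j^\top x_i - (w_{y_i,1} x_{i,1} + w_{y_i,2} x_{i,2}) + \Delta, \qquad \frac{\partial f}{\partial w_{y_i,1}} = -x_{i,1}, \quad \frac{\partial f}{\partial w_{y_i,2}} = -x_{i,2},$$

so $\nabla_{w_{y_i}} f = (-x_{i,1}, -x_{i,2})^\top = -x_i$.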
Next, let's consider $g(w_{y_i}) := \max(0, w_j^\top x_i - w_{y_i}^\top x_i + \Delta)$. There are two cases:

- If $w_j^\top x_i - w_{y_i}^\top x_i + \Delta < 0$, then $g$ is identically zero near that point, so its gradient is $0$.
- If $w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0$, then $g$ agrees locally with the linear function inside the max, so its gradient is $-x_i$.

(When the argument of the max is exactly $0$, $g$ is not differentiable; one conventionally picks a subgradient there.) Both cases are summarized by the gradient $- \mathbb{1}(w_j^\top x_i - w_{y_i}^\top x_i + \Delta > 0)\, x_i$.
Finally, sum the above over all $j \ne y_i$ to obtain the first gradient, $\nabla_{w_{y_i}} L_i$.
The last expression is the gradient with respect to $w_j$ for a fixed $j \ne y_i$. That $w_j$ appears in exactly one term of the sum defining $L_i$, so only that term survives and there is nothing left to sum; it is not a mistake. By contrast, $w_{y_i}$ appears in every term of the sum, which is why the gradient with respect to $w_{y_i}$ retains the summation.
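As a concrete check, the two gradient formulas can be verified numerically for a 2x2 weight matrix. This is a minimal sketch (the numbers and names are illustrative, not from the notes), comparing the analytic gradient against central finite differences:

```python
import numpy as np

# Illustrative 2x2 example: the columns of W are the class weight vectors w_j,
# x is a single input with true class index y. All values here are made up.
W = np.array([[1.0, 2.0],
              [0.5, 1.0]])
x = np.array([1.0, 2.0])
y = 0
delta = 1.0

def loss(W):
    scores = W.T @ x                               # scores[j] = w_j^T x
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                                 # the sum skips j == y_i
    return margins.sum()

# Analytic gradient from the two formulas above.
scores = W.T @ x
indicator = (scores - scores[y] + delta > 0).astype(float)
indicator[y] = 0.0
dW = np.outer(x, indicator)                        # column j: 1(...) * x_i
dW[:, y] = -indicator.sum() * x                    # column y_i: -(sum of 1(...)) * x_i

# Numerical gradient by central differences, entry by entry.
h = 1e-5
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        dW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.allclose(dW, dW_num, atol=1e-6))          # True: the formulas agree
```

Note that the check is only valid away from the kink of the max (margin exactly $0$), where the loss is not differentiable.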