I have the following loss I want to minimize:
$L=(y-(\max(0,w^Tx_1) + \max(0,w^Tx_2) + \max(0,w^Tx_3)))^2$
Now I want the gradient w.r.t. my weight vector $w$:
if $w^Tx_1>0$ & $w^Tx_2>0$ & $w^Tx_3>0$ $\rightarrow$ $2(y-(w^Tx_1 + w^Tx_2 + w^Tx_3))(-(x_1 + x_2 + x_3))$
if $w^Tx_1<0$ & $w^Tx_2>0$ & $w^Tx_3>0$ $\rightarrow$ $2(y-(w^Tx_2 + w^Tx_3))(-(x_2 + x_3))$
So the terms are removed if $w^Tx_i < 0$. This means that if all terms are $<0$:
$w^Tx_1<0$ & $w^Tx_2<0$ & $w^Tx_3<0$ $\rightarrow$ $2y$
So in that case I get a scalar as gradient, while in the other cases I get a vector. Are my derivatives correct? If so, does it make sense to use a vector of length len(x) filled with $y$'s as the gradient?
Edit: After checking again, I think the derivative w.r.t. $w$ in the case that all $w^Tx_i<0$ is $0$? So the minimum would be to always predict negative?
Note that the gradient w.r.t. $w$ is actually a vector consisting of $\frac{\partial L}{\partial w_i}$, where $w_i$ is the $i$th element of $w$. If all $w^Tx_i<0$, then $\frac{\partial L}{\partial w_i} = 0$ for all $i$, i.e. the gradient is a vector of length $\text{len}(w)$ with each entry equal to $0$.
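You can verify both the piecewise gradient and the all-negative case with a finite-difference check. Here is a small sketch (the values of $y$, the $x_i$, and $w$ are made up for illustration, not taken from the question):

```python
import numpy as np

# Toy example: three 2-dimensional inputs and a target.
x = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
y = 1.5

def loss(w):
    return (y - sum(max(0.0, w @ xi) for xi in x)) ** 2

def analytic_grad(w):
    # Only the "active" terms with w^T x_i > 0 contribute.
    active = [xi for xi in x if w @ xi > 0]
    pred = sum(w @ xi for xi in active)
    return 2.0 * (y - pred) * (-sum(active, np.zeros(2)))

def numeric_grad(w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = np.array([0.5, -0.3])       # two terms active, one dropped
print(analytic_grad(w))         # agrees with numeric_grad(w)

w_neg = np.array([-1.0, -1.0])  # all w^T x_i < 0
print(analytic_grad(w_neg))     # the zero vector
```

Away from the kinks $w^Tx_i = 0$ (where the loss is not differentiable), the analytic and numeric gradients agree, and in the all-negative region the gradient is identically zero.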
But this does not mean that the minimum is achieved when all $w^Tx_i<0$; the function is non-convex. As an example, consider scalars $y = 2, x = 1$. Then $L = (2-\max(0, w))^2$. If you plot this function, you'll notice that there is a flat region with gradient $0$, but the minimum occurs at $w=2$.