I'm learning about gradient descent, and I think I've got the general gist of the partial differentiation behind it, but I'm a bit confused by one thing.
When doing:
$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial a} \frac{\partial a}{\partial w} $$
to compute a new weight, should you be differentiating the loss $L$ with respect to $Y$ or with respect to $\hat{Y}$ - i.e. with respect to the true $Y$ value or the predicted one?
Online I've seen it done both ways, producing a partial derivative of MSE that is either negative or positive.
Which should it be?
You need to apply the chain rule. A function with multiple inputs, such as a loss function, is effectively a function of a single tuple-valued input:
$$\ell : \mathbb{R}^n \oplus \mathbb{R}^n \longrightarrow \mathbb{R}, \quad (y, \hat{y}) \longmapsto \ell(y, \hat{y})$$
Now, here, $\hat{y} = \hat{y}(x,w)$ is the model, which depends both on the inputs $x$ and the parameters $w$. Therefore, applying the chain rule strictly by definition, we have:
$$\begin{aligned} \frac{\partial \ell}{\partial w} &= \frac{\partial \ell}{\partial (y,\hat{y})} \cdot \frac{\partial (y,\hat{y})}{\partial w} \\ &= \begin{bmatrix}\frac{\partial \ell}{\partial y} & \frac{\partial \ell}{\partial \hat{y}}\end{bmatrix} \cdot \begin{bmatrix}\frac{\partial y}{\partial w} \\ \frac{\partial \hat{y}}{\partial w}\end{bmatrix} \\ &= \frac{\partial \ell}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial \ell}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w} \end{aligned}$$
But since $y$ does not depend on $w$, the derivative $\frac{\partial y}{\partial w}$ is zero and the first term vanishes, leaving only $\frac{\partial \ell}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}$. So you differentiate the loss with respect to the prediction $\hat{y}$; the true values $y$ are constants as far as the weights are concerned.
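A quick numerical sanity check can make this concrete. The sketch below (a hypothetical setup, assuming a one-parameter linear model $\hat{y} = wx$ with MSE loss) computes the gradient via the chain rule, treating $y$ as a constant and differentiating only through $\hat{y}$, and compares it against a finite-difference estimate:

```python
import numpy as np

# Hypothetical setup: 10 random inputs and targets, one scalar weight.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
y = rng.standard_normal(10)  # true targets; they do NOT depend on w
w = 0.5

def loss(w):
    """MSE loss of the linear model y_hat = w * x."""
    y_hat = w * x
    return np.mean((y - y_hat) ** 2)

# Chain rule: dL/dy_hat = -2 (y - y_hat) / n, and dy_hat/dw = x,
# so dL/dw = sum over samples of dL/dy_hat * dy_hat/dw.
y_hat = w * x
grad_analytic = np.sum(-2 * (y - y_hat) / len(y) * x)

# Central finite-difference estimate of dL/dw for comparison.
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(grad_analytic - grad_numeric))  # should be tiny
```

If you instead differentiated with respect to $y$ you would get the opposite sign and the finite-difference check would fail, which is exactly the discrepancy between the two versions seen online.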