Backpropagation loss function differentiation


I'm trying to understand the math behind a simple neural network example for PyTorch.

Specifically, when calculating the loss function, I would like to minimize (with respect to $y$): $$\mathrm{SUM}\big((y - y^*)^2\big),$$ where $y$ and $y^*$ are vectors of length $n$, the squaring is applied elementwise, and $\mathrm{SUM}$ is a function that sums all the elements of the resulting vector.

My question is: how do you take the derivative of this, given that the $\mathrm{SUM}$ function is present? The algorithm seems to just state that the derivative is $2(y-y^*)$, which suggests the $\mathrm{SUM}$ function didn't make any difference, but I don't intuitively understand why.

It looks like this would also be the derivative of the elementwise function $f(y) = (y-y^*)^2$. As a related question, how does one interpret the derivative with respect to a vector of a function that also outputs a vector?
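For concreteness, here is a minimal PyTorch sketch of the situation I mean (the specific tensor values are made up by me):

```python
import torch

# Hypothetical target vector y* and prediction vector y, with n = 3.
y_star = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.5, 1.0, 4.0], requires_grad=True)

# The loss: sum over all elements of the elementwise squared difference.
loss = torch.sum((y - y_star) ** 2)
loss.backward()

# Autograd reports the gradient as 2 * (y - y_star) -- the SUM seems
# to have vanished from the formula, which is what puzzles me.
print(y.grad)  # tensor([ 1., -2.,  2.])
```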

Best Answer

Let's write $y_i$ for the $i$-th component of the vector $y$. Then the loss function can be written like so, using more common notation: $$L=\sum_i(y_i-y^*_i)^2.$$

The partial derivative with respect to $y_j$ is $$\frac{\partial L}{\partial y_j}=\frac{\partial}{\partial y_j}\sum_i(y_i-y^*_i)^2=\sum_i \frac{\partial}{\partial y_j}(y_i-y^*_i)^2.$$

I switched the order of the $\sum$ symbol with the derivative because differentiation is linear! That's the sense in which the $\mathrm{SUM}$ function doesn't make a difference.

Now note that if $i\neq j$, then $(y_i-y^*_i)^2$ is not a function of $y_j$, so $\frac{\partial}{\partial y_j}(y_i-y^*_i)^2=0$. So we are left with only one term in the sum, where $i=j$: $$\frac{\partial L}{\partial y_j}=\frac{\partial}{\partial y_j}(y_j-y^*_j)^2=2(y_j-y^*_j).$$

Finally, the gradient $dL/dy$ is the vector whose $j$-th component is $\partial L/\partial y_j$, so we have $dL/dy=2(y-y^*)$.
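The formula $\partial L/\partial y_j = 2(y_j - y^*_j)$ can be checked numerically with a central finite difference; this is a standalone sketch with made-up values, not code from the original example:

```python
# Verify dL/dy_j = 2 * (y_j - y_star_j) by central finite differences.
y = [1.5, 1.0, 4.0]
y_star = [1.0, 2.0, 3.0]

def L(v):
    # L = sum_i (v_i - y_star_i)^2
    return sum((vi - si) ** 2 for vi, si in zip(v, y_star))

eps = 1e-6
for j in range(len(y)):
    y_plus = list(y); y_plus[j] += eps
    y_minus = list(y); y_minus[j] -= eps
    numeric = (L(y_plus) - L(y_minus)) / (2 * eps)
    analytic = 2 * (y[j] - y_star[j])
    print(j, numeric, analytic)  # the two values agree for every j
```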

As for $f$, the derivative of a vector-valued function with respect to a vector is a matrix called the Jacobian. At the component level, you want to calculate $\partial f_i/\partial y_j$, where the $i$ and $j$ indices vary independently. Here this works out to $\partial f_i/\partial y_j = 2(y_i-y^*_i)\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. The presence of the delta symbol is the clearest difference between the derivatives of $L$ and $f$.
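To see the Kronecker delta concretely, one can tabulate $\partial f_i/\partial y_j$ numerically and observe that the off-diagonal entries vanish; again a sketch with made-up values:

```python
# f maps a vector to a vector: f_i(y) = (y_i - y_star_i)^2.
y = [1.5, 1.0, 4.0]
y_star = [1.0, 2.0, 3.0]
n = len(y)

def f(v):
    return [(vi - si) ** 2 for vi, si in zip(v, y_star)]

# Build the Jacobian J[i][j] = d f_i / d y_j by central finite differences.
eps = 1e-6
J = [[0.0] * n for _ in range(n)]
for j in range(n):
    y_plus = list(y); y_plus[j] += eps
    y_minus = list(y); y_minus[j] -= eps
    fp, fm = f(y_plus), f(y_minus)
    for i in range(n):
        J[i][j] = (fp[i] - fm[i]) / (2 * eps)

# J is diagonal: J[i][j] is approximately 2*(y_i - y_star_i) when i == j,
# and 0 otherwise -- the Kronecker delta in the formula above.
for row in J:
    print([round(v, 4) for v in row])
```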