I've been trying to understand how backpropagation works, so I came up with a very simple model that I want to optimize:
$f_{p}(x) = p x$
For some parameter $p$.
My toy training data look like the following:
$X = \{(1, 1), (2, 2), (3,3)\}$
Therefore $p$ should obviously end up being $1$.
I'm using a very simple loss function:
$E(X, p) = \frac 12 (\hat{y} - f_p(x))^2$
The update of $p$ at training step $t$ with learning rate $\alpha$ is defined as:
$p^{t+1} = p^t - \alpha \frac{\partial E (X, p^t)}{\partial p}$
I'm unsure how I should compute the above; here's my attempt at the first step so far:
$$\begin{eqnarray} p^{0} &=& -1 \\ p^{1} &=& -1 - \alpha \frac{\partial E(X,p^0)}{\partial p} \\ \frac{\partial E(X,p^0)}{\partial p} &=& \frac{\partial}{\partial p} \frac 12 (\hat y - f_{p}(x))^2 \end{eqnarray}$$
But I'm kind of stuck here. Any help would be much appreciated!
\begin{align*} \nabla &= \frac{\partial{E(X, p)}}{\partial p} = \frac{\partial ~0.5(\hat y - f_p(x))^2}{\partial p} = \frac{\partial ~0.5(\hat y - px)^2}{\partial p} \end{align*}
Here, let's set $q \equiv (\hat y - px)$, so that we can write
$$ \nabla = \frac{\partial ~0.5(\hat y - px)^2}{\partial p} = \frac{\partial 0.5q^2}{\partial p} = 0.5 \frac{\partial{q^2}}{{\partial p}} $$
We now invoke the chain rule to get:
$$ \nabla = 0.5 \frac{\partial{q^2}}{{\partial p}} = 0.5 \frac{\partial{q^2}}{{\partial q}} \frac{\partial{q}}{{\partial p}} = 0.5 \cdot 2q \cdot \frac{dq}{dp} = q \frac{dq}{dp} $$
We now evaluate $\frac{dq}{dp}$ as:
\begin{align*} \frac{dq}{dp} &= \frac{\partial (\hat y - px)}{\partial p} \\ &= \frac{\partial \hat y}{\partial p} - \frac{\partial (px)}{\partial p}\\ &= 0- x\frac{\partial p}{\partial p} \\ &= - x \cdot 1 = -x \end{align*}
This gives us the full expression:
$$ \nabla = q \frac{dq}{dp} = -qx = (\hat y - px) \cdot (-x) $$
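With this per-sample gradient in hand, your first update step can be completed. Assuming the loss over the data set is the sum of the per-sample losses, plugging in $p^0 = -1$ and your three data points gives:

$$ \frac{\partial E(X, p^0)}{\partial p} = \sum_{(x, \hat y) \in X} (\hat y - p^0 x)(-x) = (1 + 1)(-1) + (2 + 2)(-2) + (3 + 3)(-3) = -28 $$

so $p^1 = -1 - \alpha \cdot (-28) = -1 + 28\alpha$.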
The important thing to remember is that we assume that $x, \hat y$ are independent of the value of $p$ (the parameter), since that's the data. Therefore:
$$ \frac{\partial x}{\partial p} = 0 \qquad \frac{\partial \hat y}{\partial p} = 0 $$
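To see the whole update loop in action, here is a minimal sketch in plain Python using the gradient derived above, summed over the data set. The learning rate of $0.1$ and the 20-step budget are arbitrary choices for this toy problem:

```python
# Gradient descent for f_p(x) = p*x on the toy data,
# using the analytic gradient dE/dp = (y_hat - p*x) * (-x) derived above.

X = [(1, 1), (2, 2), (3, 3)]  # (x, y_hat) pairs
p = -1.0                      # initial parameter p^0
alpha = 0.1                   # learning rate (arbitrary choice)

for step in range(20):
    # Sum the per-sample gradients over the whole data set.
    grad = sum((y_hat - p * x) * (-x) for x, y_hat in X)
    p = p - alpha * grad

print(p)  # converges toward 1
```

Because the gradient here is linear in $p$, each step multiplies the error $|p - 1|$ by a constant factor, so $p$ converges to $1$ quickly.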
Also note that in "real world" implementations of automatic differentiation, one does not compute the derivatives symbolically. Rather, techniques such as reverse-mode automatic differentiation record the intermediate values of the forward computation and then apply the chain rule to numeric values, propagating derivatives backward from the output.
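To make that concrete, here is a minimal hand-written sketch (not a real autodiff library) of how reverse-mode differentiation would evaluate $\frac{dE}{dp}$ for this model: the forward pass stores intermediates, and the backward pass multiplies local derivatives via the chain rule, without ever forming a symbolic expression for the gradient:

```python
# Hand-rolled reverse-mode differentiation for E = 0.5*(y_hat - p*x)^2.
# This mirrors the derivation above: dE/dp = dE/dq * dq/dp = q * (-x).

def grad_E(p, x, y_hat):
    # Forward pass: compute and store intermediate values.
    q = y_hat - p * x      # q = y_hat - p*x
    E = 0.5 * q * q        # loss value (computed, but not needed below)

    # Backward pass: accumulate numeric derivatives from the output inward.
    dE_dq = q              # d(0.5*q^2)/dq = q
    dq_dp = -x             # d(y_hat - p*x)/dp = -x
    return dE_dq * dq_dp   # chain rule: dE/dp = q * (-x)

print(grad_E(-1.0, 2, 2))  # q = 2 - (-1*2) = 4, so gradient = 4 * (-2) = -8.0
```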
Here is a good reference on the topic.