I have been trying to learn about neural networks, and I don't understand the backpropagation algorithm. I will state what I understood so far in a formal manner and point out where I get stuck. Please correct me if I'm wrong!
Let $S$ be a positive integer. Let $I$ be a subset of $[0,1]^S$ (the inputs) and $O$ be a finite totally ordered set (the outputs). Let $f : I \rightarrow O$ be a map (the one that we want the neural network to "learn").
Define $resp : [0,1]^O \rightarrow O$ by $\forall x = (x_o)_{o \in O}$, $resp(x) := \min\{o \ \vert \ x_o = \max\{x_{o'} \ \vert \ o' \in O\}\}$ (which chooses the most activated output). If $o \in O$, let $\delta_{o}$ be the map sending $o$ to $1$ and all other $o'$ to $0$.
Let $\sigma$ be a sigmoid-like function.
The goal is to find $M \in L(\mathbb{R}^S,\mathbb{R}^O)$ such that for "most" (if possible, for all) $i \in I$, $f(i) = (resp \circ (\sigma \times \cdots \times \sigma) \circ M)(i)$. Let $I’ \subseteq I$ be a finite subset. The elements of $I’$ are called example inputs.
To that purpose, define $c : [0,1]^O \times [0,1]^O \rightarrow \mathbb{R}$ by $c(a,b) = \sum_{o \in O} (a_o - b_o)^2$ for all $(a,b) \in [0,1]^O \times [0,1]^O$ (in general, we can take other $c$, but I restrict my attention to this function).
The goal is to find some $M$ minimizing $\phi_i : M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$ globally for all $i$.
If $\psi : \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable function, and $\mu$ is a real number, and $x \in \mathbb{R}^n$, we call the action of replacing $x$ by $x - \mu \nabla_x \psi$ a step of gradient descent of rate $\mu$.
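As a toy illustration of this definition (my example, not part of the question), here is one step of gradient descent of rate $\mu$ on $\psi(x_1,x_2) = x_1^2 + x_2^2$, whose exact gradient is $(2x_1, 2x_2)$:

```python
# One step of gradient descent of rate mu on psi(x) = x_1^2 + x_2^2,
# whose exact gradient is (2*x_1, 2*x_2).
def gradient_step(x, mu):
    grad = [2 * x[0], 2 * x[1]]
    return [x[0] - mu * grad[0], x[1] - mu * grad[1]]

x = [1.0, -2.0]
x = gradient_step(x, mu=0.1)  # moves toward the minimum at the origin
```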
I will now describe a few algorithms. Some of them may fail to achieve the goal stated above. Can you tell me which of these work, and what their names are? I must say that, for the moment, I am not interested in questions of local/global minima.
1) « Optimizing after each example »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Take $i_1 \in I’$. Replace $M_0$ by $M_1$, obtained by performing a lot of steps of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
- Take $i_2 \in I’$ and repeat last step with $i_2$ instead of $i_1$.
- Take $i_3 \in I’$ and repeat last step with $i_3$ instead of $i_2$.
- …
- Take $i_q$ to be the last element of $I’$ and repeat last step with $i_q$ instead of $i_{q-1}$.
I expect this algorithm not to do what we want: indeed, optimizing at step $2$ (for the example $i_2$) might « de-optimize » what was done at step $1$ (for the example $i_1$).
2) « Taking one step after each example »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Take $i_1 \in I’$. Replace $M_0$ by $M_1$, obtained by performing one step of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
- Take $i_2 \in I’$ and repeat last step with $i_2$ instead of $i_1$.
- Take $i_3 \in I’$ and repeat last step with $i_3$ instead of $i_2$.
- …
- Take $i_q$ to be the last element of $I’$ and repeat last step with $i_q$ instead of $i_{q-1}$.
I do not know if this algorithm does what we want.
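For concreteness, this one-step-per-example scheme (known as stochastic gradient descent) can be sketched on a toy one-dimensional least-squares problem; the linear model and quadratic loss below are my stand-ins for $\sigma \circ M$ and $c$:

```python
# Algorithm 2 on a toy problem: fit y = w * x on example pairs (x, y)
# by minimizing (w*x - y)^2, whose derivative in w is 2*(w*x - y)*x.
def sgd_pass(examples, w, mu):
    for x, y in examples:              # visit i_1, ..., i_q in turn
        grad = 2 * (w * x - y) * x
        w = w - mu * grad              # ONE gradient step per example
    return w

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # consistent with w = 2
w = 0.0
for _ in range(50):                    # repeat the pass several times
    w = sgd_pass(examples, w, mu=0.05)
# w is now very close to 2
```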
3) « Studying a whole bunch of examples »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Make a lot of steps of gradient descent of rate $\mu$ for the function $E : M \mapsto \sum_{i \in I’} c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$.
I expect this algorithm to do what we want, but if it does, I don’t know yet how to perform it in an efficient manner.
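Descending on the summed error $E$ (usually called batch gradient descent) can be sketched on the same kind of toy problem; the difference from the previous scheme is that each step sums the gradient over all examples:

```python
# Algorithm 3 on a toy problem: descend on the summed loss
# E(w) = sum_i (w*x_i - y_i)^2, with E'(w) = sum_i 2*(w*x_i - y_i)*x_i.
def batch_descent(examples, w, mu, steps):
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in examples)
        w = w - mu * grad              # one step uses ALL examples
    return w

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = batch_descent(examples, w=0.0, mu=0.01, steps=200)
# w is now very close to 2
```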
I am also looking for any reference on neural networks which is written for mathematicians.
Allow me to use slightly different notation (I think the notation is causing some of the confusion) and abstract the problem slightly. If you want a more general mathematical idea of what gradient descent and backpropagation are, here goes (I will answer your question about which algorithms work by the end).
Let's suppose our inputs are vectors $\mathbf{x} \in \mathbb{R}^n$ and our outputs are vectors $\mathbf{y} \in \mathbb{R}^N$. (Note that you may want to restrict the codomain of your function to $[0,1]^N$ if you are doing logistic regression; that is a special case of what I describe here.)
Abstractly, in machine learning, we choose a space of functions $\mathbb{R}^n \to \mathbb{R}^N$ mapping inputs to outputs, and then find the best function in that space to meet our goal (minimizing a cost function). The space of functions we are considering is always determined by an assumption about the structure of the functions (e.g. we can choose a neural network with so many layers and so many nodes per layer etc.).
Let's abstract this and say that our function space is parameterized by a vector $\mathbf{w} \in \mathbb{R}^{W}$. i.e. there is a function $$f(-,-) \colon \mathbb{R}^n\times \mathbb{R}^W \to \mathbb{R}^N.$$ Fixing $\mathbf{w}$ is the same as picking out a function $$f(-,\mathbf{w}) \colon \mathbb{R}^n \to \mathbb{R}^N$$ in our function space. If we are considering a certain neural network architecture, the $w$'s will be the weights in all the hidden layers.
Now how do we choose the best $\mathbf{w}$? First, you have to choose a loss function and a sample set, which is a list of $m$ examples of inputs and desired outputs $(\mathbf{x}_i,\mathbf{y}_i)$. Abstractly, the loss function is a function $$ L \colon \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$$ and we want to solve the minimization problem $$\mathbf{w}^0 = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} E(\mathbf{w}) = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} \sum_{i=1}^m L(f(\mathbf{x}_i,\mathbf{w}),\mathbf{y}_i).$$ That is, we want to find $\mathbf{w}^0$ so that the function $f(-,\mathbf{w}^0)$ minimizes the loss on the example set. (Here $E$ is just notation for the sum of the loss over the examples.)
Now, there is a general algorithm for trying to find $\mathbf{w}^{0}$ called gradient descent. It goes like this: pick an initial guess $\mathbf{w}^1$ and a small learning rate $\varepsilon > 0$, then iterate $$\mathbf{w}^{j+1} = \mathbf{w}^j - \varepsilon \nabla E(\mathbf{w}^j)$$ until the updates become negligible, hoping that $\mathbf{w}^j$ converges to $\mathbf{w}^0$.
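A minimal sketch of this iteration in code (with the gradient of $E$ approximated by central finite differences, and a toy quadratic objective of my own choosing standing in for $E$; for a real network, backpropagation supplies the exact gradient instead):

```python
# Gradient descent w^{j+1} = w^j - eps * grad E(w^j), with the
# gradient approximated by central finite differences.
def num_grad(E, w, h=1e-6):
    g = []
    for k in range(len(w)):
        wp, wm = list(w), list(w)
        wp[k] += h
        wm[k] -= h
        g.append((E(wp) - E(wm)) / (2 * h))
    return g

def gradient_descent(E, w, eps, steps):
    for _ in range(steps):
        w = [wk - eps * gk for wk, gk in zip(w, num_grad(E, w))]
    return w

# Toy objective with minimum at (1, -3).
E = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 3.0) ** 2
w = gradient_descent(E, [0.0, 0.0], eps=0.1, steps=100)
```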
So what is Backpropagation?
Well, in the algorithm above we wanted to compute $$\nabla E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_1}(\mathbf{w}), \ldots , \frac{\partial E}{\partial w_W}(\mathbf{w})\right)$$ so that we can update each weight $w_i^j$ according to the rule $$w_i^{j+1} = w^j_i - \varepsilon \frac{\partial E}{\partial w_i}(\mathbf{w}^j).$$
How do we compute these derivatives? It depends on the function $f$, but in general, when $f$ has the structure of a neural network, you have to use the chain rule. More generally, if you know the "computational graph" of $f$ you can compute these derivatives using the chain rule as follows.
Example: Suppose we are considering a function space of functions of the form
$$f(\mathbf{x},\mathbf{w}= (\mathbf{u},\mathbf{v})) = g_2(g_1(\mathbf{x},\mathbf{u}), \mathbf{v})$$ (this could be a neural network with one hidden layer, where $g_i$ is linear + activation). Here $\mathbb{R}^W = \mathbb{R}^U \times \mathbb{R}^V$ and $\mathbf{w}= (\mathbf{u},\mathbf{v})$. Thus
$$ E(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m L(g_2(g_1(\mathbf{x}_i,\mathbf{u}), \mathbf{v}), \mathbf{y}_i)$$
By the chain rule,
$$\frac{\partial E}{\partial v_j}(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial v_j}(\mathbf{u},\mathbf{v},\mathbf{x}_i)$$ $$\frac{\partial E}{\partial u_j}(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial g_1}(\mathbf{u},\mathbf{v},\mathbf{x}_i)\frac{\partial g_1}{\partial u_j}(\mathbf{u},\mathbf{x}_i)$$ Computing these derivatives is what backpropagation does, but backpropagation is a bit more than just the chain rule: it is an algorithmic way of implementing the chain rule (and, importantly, it is optimized for computational speed, even when $m$ is very large). This blog post sheds some light on it: https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/
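To make the chain-rule computation above concrete, here is a hand-rolled scalar version (my simplification; real layers are vector-valued) with $g_1(x,u) = ux$, $g_2(z,v) = vz$ and squared loss $L(p,y) = (p-y)^2$:

```python
# Scalar backprop example: f(x, (u, v)) = g2(g1(x, u), v) with
# g1(x, u) = u*x, g2(z, v) = v*z and squared loss L(p, y) = (p - y)^2.
def grads(u, v, samples):
    dE_du = dE_dv = 0.0
    for x, y in samples:
        z = u * x                  # forward pass: z = g1(x, u)
        p = v * z                  #               p = g2(z, v)
        dL_dp = 2 * (p - y)        # dL/dg2, computed once per sample
        dE_dv += dL_dp * z         # dL/dg2 * dg2/dv
        dE_du += dL_dp * v * x     # dL/dg2 * dg2/dg1 * dg1/du
    return dE_du, dE_dv

samples = [(1.0, 3.0), (2.0, 6.0)]
du, dv = grads(u=1.0, v=2.0, samples=samples)  # du = -20.0, dv = -10.0
```

Note how `dL_dp` is computed once in the backward direction and then reused by both partial derivatives; that systematic reuse of intermediate quantities is precisely what makes backpropagation faster than applying the chain rule naively to each weight.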
Tuning gradient descent
There are several caveats to be aware of when working with gradient descent: