Backpropagation algorithm


I have been trying to learn about neural networks, and I don't understand the backpropagation algorithm. I will state what I have understood so far in a formal manner and point out where I get stuck. Please correct me if I'm wrong!

Let $S$ be a positive integer. Let $I$ be a subset of $[0,1]^S$ (the inputs) and let $O$ be a finite totally ordered set (the outputs). Let $f : I \rightarrow O$ be a map (the one that we want the neural network to "learn").

Define $resp : [0,1]^O \rightarrow O$ by, for all $x = (x_o)_{o \in O}$, $resp(x) := \min\{o \ \vert \ x_o = \max\{x_{o'} \ \vert \ o' \in O\}\}$ (which chooses the most activated output, with ties broken by the order on $O$). For $o \in O$, let $\delta_{o} \in [0,1]^O$ be the map sending $o$ to $1$ and every other $o'$ to $0$.
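
Concretely, identifying $O$ with the indices $0,\dots,|O|-1$, the maps $resp$ and $\delta_o$ can be sketched as follows (a minimal illustration; the function names are mine, not standard):

```python
def resp(x):
    """Index of the most activated output; ties broken by the smallest index."""
    m = max(x)
    return min(i for i, xi in enumerate(x) if xi == m)

def delta(o, size):
    """The map delta_o as a vector: 1 at position o, 0 elsewhere."""
    return [1.0 if i == o else 0.0 for i in range(size)]
```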

Let $\sigma$ be a sigmoid-like function.

The goal is to find $M \in L(\mathbb{R}^S,\mathbb{R}^O)$ such that for "most" (if possible, for all) $i \in I$, $f(i) = (resp \circ (\sigma \times \cdots \times \sigma) \circ M)(i)$. Let $I’ \subseteq I$ be a finite subset. The elements of $I’$ are called example inputs.

To that purpose, define $c : [0,1]^O \times [0,1]^O \rightarrow \mathbb{R}$ by $\forall (a,b) \in [0,1]^O \times [0,1]^O$, $c(a,b) = \sum_{o \in O} (a_o - b_o)^2$ (in general, we can take other $c$, but I restrict my attention to this function).
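
With outputs again identified with indices, this particular $c$ is just a squared error; a one-line sketch:

```python
def cost(a, b):
    """c(a, b) = sum over outputs o of (a_o - b_o)^2."""
    return sum((ao - bo) ** 2 for ao, bo in zip(a, b))
```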

The goal is to find some $M$ minimizing $\phi_i : M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$ globally, i.e. for all $i$ at once.

If $\psi : \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable function, and $\mu$ is a real number, and $x \in \mathbb{R}^n$, we call the action of replacing $x$ by $x - \mu \nabla_x \psi$ a step of gradient descent of rate $\mu$.
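
A step of gradient descent, as defined above, might be sketched like this (taking the gradient $\nabla\psi$ as a function supplied by the caller):

```python
def gradient_step(x, grad, mu):
    """One step of gradient descent of rate mu: replace x by x - mu * grad(x)."""
    g = grad(x)
    return [xi - mu * gi for xi, gi in zip(x, g)]
```

For $\psi(x) = x^2$ the gradient is $2x$, so one step of rate $\mu = 0.1$ from $x = 1$ gives $0.8$.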

I will now describe a few algorithms. Some of them may fail to achieve the goal stated above. Can you tell me which of them work, and what their names are? I should say that, for the moment, I am not interested in questions of local versus global minima.

1) « Optimizing after each example »

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Take $i_1 \in I’$. Replace $M_0$ by $M_1$, obtained by performing many steps of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
  • Take $i_2 \in I’$ and repeat last step with $i_2$ instead of $i_1$.
  • Take $i_3 \in I’$ and repeat last step with $i_3$ instead of $i_2$.
  • Continue likewise with $i_4, i_5, \dots$, until $i_q$, the last element of $I’$, has been used.

I expect this algorithm not to do what we want: indeed, optimizing at step $2$ (for example $i_2$) might « de-optimize » what was done at step $1$ (for example $i_1$).
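
A sketch of algorithm 1, with `grad_for(i)` standing in for the gradient of $M \mapsto c(\dots)$ for a single example $i$ (these names and the one-dimensional setting are mine, for illustration). In one dimension one can check the de-optimization you describe: the final parameter essentially only fits the last example.

```python
def per_example_full_opt(w0, examples, grad_for, mu, inner_steps=200):
    """Algorithm 1: for each example in turn, run many gradient steps
    on that example's cost alone; later examples can undo earlier work."""
    w = list(w0)
    for ex in examples:
        g = grad_for(ex)  # gradient of the cost for this single example
        for _ in range(inner_steps):
            w = [wi - mu * gi for wi, gi in zip(w, g(w))]
    return w
```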

2) « Taking one step after each example »

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Take $i_1 \in I’$. Replace $M_0$ by $M_1$, obtained by performing one step of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
  • Take $i_2 \in I’$ and repeat last step with $i_2$ instead of $i_1$.
  • Take $i_3 \in I’$ and repeat last step with $i_3$ instead of $i_2$.
  • Continue likewise with $i_4, i_5, \dots$, until $i_q$, the last element of $I’$, has been used.

I do not know if this algorithm does what we want.
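
A sketch of algorithm 2 under the same illustrative conventions (one gradient step per example, one pass over $I’$); this is a single pass of what is usually called stochastic (or online) gradient descent:

```python
def one_step_per_example(w0, examples, grad_for, mu):
    """Algorithm 2: a single gradient step on each example's cost, in sequence."""
    w = list(w0)
    for ex in examples:
        g = grad_for(ex)
        w = [wi - mu * gi for wi, gi in zip(w, g(w))]
    return w
```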

3) « Studying a whole bunch of examples »

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Perform many steps of gradient descent of rate $\mu$ for the function $E : M \mapsto \sum_{i \in I’} c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$.

I expect this algorithm to do what we want, but if so, I don't know yet how to perform it efficiently.
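
A sketch of algorithm 3 (full-batch gradient descent), again with my illustrative `grad_for` convention: by linearity, the gradient of $E$ is the sum of the per-example gradients, so every step touches every example.

```python
def batch_gradient_descent(w0, examples, grad_for, mu, steps=200):
    """Algorithm 3: gradient steps on E(M) = sum over I' of the per-example
    costs; grad E is computed as the sum of per-example gradients."""
    w = list(w0)
    for _ in range(steps):
        total = [0.0] * len(w)
        for ex in examples:
            g = grad_for(ex)(w)
            total = [ti + gi for ti, gi in zip(total, g)]
        w = [wi - mu * gi for wi, gi in zip(w, total)]
    return w
```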

I am also looking for any reference on neural networks which is written for mathematicians.


Accepted answer:

Allow me to use slightly different notation (I think you are having some notational confusion) and to abstract the problem slightly. If you want a more general mathematical idea of what gradient descent and backpropagation are, here goes (I will answer your bold question by the end).

Let's suppose our inputs are vectors $\mathbf{x} \in \mathbb{R}^n$ and our outputs are vectors $\mathbf{y} \in \mathbb{R}^N$. (Note that you may want to restrict the codomain of your function to $[0,1]^N$ if you are doing logistic regression; that is a special case of what I describe here.)

Abstractly, in machine learning, we choose a space of functions $\mathbb{R}^n \to \mathbb{R}^N$ mapping inputs to outputs, and then find the best function in that space to meet our goal (minimizing a cost function). The space of functions we are considering is always determined by an assumption about the structure of the functions (e.g. we can choose a neural network with so many layers and so many nodes per layer etc.).

Let's abstract this and say that our function space is parameterized by a vector $\mathbf{w} \in \mathbb{R}^{W}$. i.e. there is a function $$f(-,-) \colon \mathbb{R}^n\times \mathbb{R}^W \to \mathbb{R}^N.$$ Fixing $\mathbf{w}$ is the same as picking out a function $$f(-,\mathbf{w}) \colon \mathbb{R}^n \to \mathbb{R}^N$$ in our function space. If we are considering a certain neural network architecture, the $w$'s will be the weights in all the hidden layers.

Now how do we choose the best $\mathbf{w}$? First, you have to choose a loss function and a sample set, which is a list of $m$ examples of inputs and desired outputs $(\mathbf{x}_i,\mathbf{y}_i)$. Abstractly, the loss function is a function $$ L \colon \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$$ and we want to solve the minimization problem $$\mathbf{w}^0 = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} E(\mathbf{w}) = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} \sum_{i=1}^m L(f(\mathbf{x}_i,\mathbf{w}),\mathbf{y}_i).$$ That is, we want to find $\mathbf{w}^0$ so that the function $f(-,\mathbf{w}^0)$ minimizes the loss on the example set (here $E$ is just notation for the sum of the loss over the examples).

Now, there is a general algorithm for trying to find $\mathbf{w}^{0}$ called gradient descent. It goes like this:

  • fix a starting point $\mathbf{w}^1 \in \mathbb{R}^W$.
  • At each step, update with the rule $$ \mathbf{w}^{j+1} = \mathbf{w}^j - \varepsilon \nabla E(\mathbf{w}^j)$$ (where $\varepsilon$ is some constant you choose beforehand).
  • Stop after a stopping condition is reached (e.g. some specified number $j=1000$).
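
The three bullet points above can be sketched as follows (the stopping condition here is simply a fixed step count, one of the options discussed below):

```python
def gradient_descent(w1, grad_E, eps, max_steps=1000):
    """Iterate w_{j+1} = w_j - eps * grad_E(w_j), stopping after max_steps."""
    w = list(w1)
    for _ in range(max_steps):
        w = [wi - eps * gi for wi, gi in zip(w, grad_E(w))]
    return w
```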

So what is Backpropagation?

Well, in the algorithm above we wanted to compute $$\nabla E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_1}(\mathbf{w}), \ldots , \frac{\partial E}{\partial w_W}(\mathbf{w})\right)$$ so that we can update each weight $w_i^j$ according to the rule $$w_i^{j+1} = w^j_i - \varepsilon \frac{\partial E}{\partial w_i}(\mathbf{w}^j).$$

How do we compute these derivatives? It depends on the function $f$, but in general, when $f$ has the structure of a neural network, you have to use the chain rule. More generally, if you know the "computational graph" of $f$ you can compute these derivatives using the chain rule as follows.

Example: Suppose we are considering a function space of functions of the form

$$f(\mathbf{x},\mathbf{w}) = g_2(g_1(\mathbf{x},\mathbf{u}), \mathbf{v}), \qquad \mathbf{w} = (\mathbf{u},\mathbf{v})$$ (this could be a neural network with one hidden layer, where each $g_i$ is a linear map followed by an activation). Here $\mathbb{R}^W = \mathbb{R}^U \times \mathbb{R}^V$. Thus

$$ E(\mathbf{u},\mathbf{v}) = \sum_1^m L(g_2(g_1(\mathbf{x}_i,\mathbf{u}), \mathbf{v}), \mathbf{y}_i)$$

By the chain rule,

$$\frac{\partial E}{\partial v_j}(\mathbf{u},\mathbf{v},\mathbf{x},\mathbf{y}) = \sum_1^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial v_j}(\mathbf{u},\mathbf{v},\mathbf{x}_i)$$ $$\frac{\partial E}{\partial u_j}(\mathbf{u},\mathbf{v},\mathbf{x},\mathbf{y}) = \sum_1^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial g_1}(\mathbf{u},\mathbf{v},\mathbf{x}_i)\frac{\partial g_1}{\partial u_j}(\mathbf{u},\mathbf{x}_i)$$ Computing these derivatives is what backpropagation does, but backpropagation is a bit more than just the chain rule: it is an algorithmic way of implementing the chain rule (and, importantly, it is optimized for computational speed even when $m$ is very large). This blog post sheds some light on it: https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/
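
A one-dimensional instance of this example, with $g_i(a, w) = \sigma(wa)$ for the logistic $\sigma$ and $L(p, y) = (p - y)^2$ (my own choices, purely for illustration): the backward pass reuses the forward values $h$ and $p$ to evaluate exactly the chain-rule products above.

```python
import math

def sigma(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y, u, v):
    """Return (loss, dL/du, dL/dv) for f(x,(u,v)) = g2(g1(x,u), v)
    with g_i(a, w) = sigma(w*a) and L(p, y) = (p - y)^2."""
    # forward pass
    h = sigma(u * x)              # g1(x, u)
    p = sigma(v * h)              # g2(h, v)
    loss = (p - y) ** 2
    # backward pass (chain rule, using sigma' = sigma * (1 - sigma))
    dL_dp = 2.0 * (p - y)         # dL/dg2
    dp_dv = p * (1.0 - p) * h     # dg2/dv
    dp_dh = p * (1.0 - p) * v     # dg2/dg1
    dh_du = h * (1.0 - h) * x     # dg1/du
    return loss, dL_dp * dp_dh * dh_du, dL_dp * dp_dv
```

The two returned gradients can be checked against finite differences of the loss, which is a standard sanity check for any hand-written backward pass.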

Tuning gradient descent

There are several caveats to be aware of when working with gradient descent:

  • To answer your question in bold above: there is no guarantee that the function $E$ decreases with each step of gradient descent. Oftentimes it won't, especially near a local minimum.
  • In general, the loss functions used in state-of-the-art machine learning algorithms are non-convex, which means there is no theory guaranteeing that gradient descent converges to a global minimum. For applications this is acceptable, but getting close to the global minimum with gradient descent becomes something of an engineering problem.
  • Moreover, a priori, there is no guarantee that a global minimum even exists.
  • The factor $\varepsilon$ chosen above controls to some extent the magnitude of the steps that gradient descent takes. If it is very big, there is a chance your algorithm will approach the global minimum and then shoot past it. If it is too small, your convergence will be very slow, or you might get stuck in a region where the loss function is relatively flat (people call these plateaus). There are various sophisticated versions of gradient descent which mitigate some of these issues. A simple one is to decay the constant $\varepsilon$ as you iterate, so that near the minimum you don't overshoot. Others like "momentum" keep you moving near plateaus so you don't get stuck.
  • Weird things can happen in very deep neural networks that cause your gradients to explode or go to zero, which means that gradient descent will go wild, or stop. People put a lot of work into fiddling with their algorithms to avoid this.
  • I deliberately left open what an ideal "stopping condition" for gradient descent is. How many steps should you take? Should you use a cut-off error threshold instead?
  • Above, I summed the loss function over all the examples in the training set. Versions of gradient-descent called mini-batch gradient descent and stochastic gradient descent choose only a subset of the training set for computing $\nabla E$ at each iteration. This has advantages for computational time and convergence.
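
One mini-batch step can be sketched under the same illustrative conventions as before (`grad_for(i)` is a per-example gradient function, a name of my choosing; the rescaling keeps the step comparable in size to a full-batch gradient):

```python
import random

def minibatch_step(w, examples, grad_for, mu, batch_size, rng=random):
    """One step of mini-batch gradient descent: estimate grad E from a
    random subset of the examples, rescaled to the size of the full sum."""
    batch = rng.sample(examples, batch_size)
    total = [0.0] * len(w)
    for ex in batch:
        g = grad_for(ex)(w)
        total = [ti + gi for ti, gi in zip(total, g)]
    scale = len(examples) / batch_size
    return [wi - mu * gi * scale for wi, gi in zip(w, total)]
```

With `batch_size = 1` this is stochastic gradient descent; with `batch_size = len(examples)` it reduces to an ordinary full-batch step.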