Two different ways to train a neural network


Gradient descent (GD) is a common algorithm designed to find a local minimum of an assigned cost function. Simple feedforward neural networks, as far as my understanding goes, try to estimate a minimum of an unknown function by minimizing a set of supposedly similar training functions, $f_1,\dots,f_k$. Each training function $f_i$ is a function of the network's parameters (weights and biases), and its shape/form depends on the training input $x_i$.

If I have $k$ training inputs and perform for each input $n$ GD steps, I could define my local minimum in two different ways:

1) For each input $x_i$, start at the same point $p_0$ in parameter space and then perform GD for $n$ steps. This gives $k$ "possible minima", $p_1,\dots,p_k$, one for each input; then take their average as the "common minimum": $\hat{p}:=\frac{1}{k}\sum p_i$.

2) Starting at the point $p_0$ in parameter space, perform $1$ GD step for each input $x_i$, resulting in new parameters $p_1^1,\dots,p_k^1$; then take their average as the next starting point: $p_1:=\frac{1}{k}\sum p_i^1$. Take another GD step for each input, this time starting from $p_1$. Repeat $n$ times.
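The two procedures can be sketched numerically. Here is a minimal NumPy illustration with made-up per-input losses $f_i(p) = a_i\,\|p - c_i\|^2$ (the centers $c_i$ and curvatures $a_i$ are hypothetical values chosen for the example, not from the question); with different curvatures the two "common minima" land in different places, while for identical quadratics they would coincide by linearity of the updates:

```python
import numpy as np

# Per-input losses f_i(p) = a_i * ||p - c_i||^2 (illustration values only)
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
curvs = [1.0, 4.0]
p0 = np.zeros(2)
lr, n = 0.1, 5

def grad(p, c, a):
    # gradient of a * ||p - c||^2 with respect to p
    return 2.0 * a * (p - c)

# Procedure 1: run n GD steps per input from p0, then average the k endpoints
endpoints = []
for c, a in zip(centers, curvs):
    p = p0.copy()
    for _ in range(n):
        p = p - lr * grad(p, c, a)
    endpoints.append(p)
p_hat_1 = np.mean(endpoints, axis=0)

# Procedure 2: at each step, take one GD step per input and average the results
p = p0.copy()
for _ in range(n):
    p = np.mean([p - lr * grad(p, c, a) for c, a in zip(centers, curvs)], axis=0)
p_hat_2 = p

print(p_hat_1, p_hat_2)  # the two "common minima" differ
```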

The two processes are represented below for a $2$-dimensional parameter space, $2$ inputs and $5$ GD steps:

[Figure: Two ways of optimizing a NN.]

The resulting "common minimum" is different in the two processes. This consideration leads to the study of a sort of "commutator" of many functions, which I'll denote "$((f,g))$" and which for two functions looks like this: \begin{equation} \left(\left(\ f,g\right)\right):= \frac{f(f(x))+g(g(x))}{2} - \frac{1}{2}\left[ f\left(\frac{f(x)+g(x)}{2}\right)+ g\left(\frac{f(x)+g(x)}{2}\right) \right] \end{equation} The actual "commutator" for the GD is more complicated and involves the derivatives of the functions, but it all boils down to the previous definition.
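As a sanity check, the defining formula is easy to evaluate directly. The sketch below (the two maps are hypothetical one-step GD updates on quadratics with different curvatures, chosen just for illustration) shows the "commutator" is generally nonzero; for two affine maps with the same slope it vanishes, which matches the observation that the discrepancy comes from differing curvature:

```python
def commutator(f, g, x):
    """((f, g)): "iterate each map, then average" minus
    "average first, then apply each map and average"."""
    avg = (f(x) + g(x)) / 2
    return (f(f(x)) + g(g(x))) / 2 - (f(avg) + g(avg)) / 2

# One-step GD maps on two quadratics (illustration values only):
f = lambda x: x - 0.1 * 2 * (x - 1.0)  # GD step on (x - 1)^2,  slope 0.8
g = lambda x: x - 0.1 * 8 * (x + 1.0)  # GD step on 4(x + 1)^2, slope 0.2

print(commutator(f, g, 0.0))  # nonzero: the two procedures disagree here
```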

My question is: which of the two procedures should be used to optimize a NN, and why? And is this "commutator" something that is actually studied somewhere?

1 Answer
I am not sure I understand what you mean by "supposedly similar functions".

At a very basic level, training a neural net $f_\theta$ with parameters $\theta$ is done by minimisation of the empirical risk, i.e. the best approximation we have to the true risk (of misclassification, say, for MNIST), since we do not know the distribution of the data $(X,Y)$. If $\mathbf{x}$ are the inputs and $\mathbf{y}$ their labels, this is simply

$$ L_\theta(\mathbf{x}, \mathbf{y}) = \sum_i l(f_\theta(x_i), y_i) $$

with e.g. $l(\xi,y) = \| \xi - y \|^2$ or any other loss function. Instead of performing gradient descent, i.e. optimising (over $\theta$) this sum over all training samples, one typically does stochastic GD, i.e. over one or a few samples. The expected value of this stochastic gradient is the full gradient, and as long as the learning rate fulfils some conditions (e.g. the classical "$\sum \alpha_t = \infty$ but $\sum \alpha_t^2 < \infty$"), this converges and is not only computationally cheaper but tends to exhibit better behaviour than vanilla GD.

If I understand you correctly, this is basically your 2., since "performing one GD step for each point and averaging" is, by linearity, the same as computing the gradient of the whole sum and then updating the parameters once, as long as you don't update the parameters after each point. If you do, then that is SGD, but no average should be taken at the end (why would that make sense?)
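The linearity claim is easy to verify numerically. A small sketch on a made-up least-squares problem (random data, squared loss, all illustration values): averaging the $k$ one-step updates from a common $\theta$ gives exactly the same parameters as one step on the mean loss $\frac{1}{k}\sum_i l_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features (illustration values)
y = rng.normal(size=8)
theta = rng.normal(size=3)
lr = 0.05

def grad_sample(theta, x, yi):
    # gradient of l(theta) = (x . theta - yi)^2 with respect to theta
    return 2.0 * (x @ theta - yi) * x

# Average of the per-sample one-step updates, all starting from the same theta
avg_of_steps = np.mean([theta - lr * grad_sample(theta, x, yi)
                        for x, yi in zip(X, y)], axis=0)

# One GD step on the mean loss (1/k) * sum_i l_i
mean_grad = np.mean([grad_sample(theta, x, yi) for x, yi in zip(X, y)], axis=0)
batch_step = theta - lr * mean_grad

print(np.allclose(avg_of_steps, batch_step))  # the two coincide
```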

Your 1. is a bit strange to me, since it seems it would basically try to overfit to each sample, i.e. to learn $\theta_i$ such that $f_{\theta_i}(x_i)$ is as close to $y_i$ as possible, and this is unlikely to produce a good global solution, let alone generalise beyond the training set.