I've been researching optimisation methods used to minimise the cost function of a neural network, such as:
Backpropagation, which computes the error at each node in order to obtain the partial derivatives of the cost function, and then uses gradient descent to drive the cost function to a local or global minimum.
Continuous genetic algorithm, which starts with a population of random solutions; the ones that produce lower costs are more likely to survive to the next generation, where the solutions pair up, swap some of their parameters, and mutate, until a satisfactory solution is found.
Particle swarm optimisation, which has particles moving around an n-dimensional solution space, with each dimension representing a parameter. Each particle moves both towards the best solution it has personally found so far and towards the best solution any particle has found.
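For concreteness, the particle swarm update described above can be sketched like this (a minimal NumPy toy that minimises a simple quadratic rather than a real network cost; the swarm size, inertia weight `w`, and attraction weights `c1`, `c2` are illustrative choices, not canonical values):

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(x):
    # Toy cost: sphere function. A real use would plug in the network's loss.
    return np.sum(x ** 2, axis=-1)

n_particles, n_dims, n_iters = 30, 5, 200
w, c1, c2 = 0.7, 1.5, 1.5             # inertia and attraction weights

pos = rng.uniform(-5, 5, (n_particles, n_dims))
vel = np.zeros_like(pos)
pbest = pos.copy()                     # each particle's best position so far
pbest_cost = cost(pbest)
gbest = pbest[np.argmin(pbest_cost)]   # best position any particle has found

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_dims))
    # Pull each particle towards its personal best and the global best.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    c = cost(pos)
    improved = c < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], c[improved]
    gbest = pbest[np.argmin(pbest_cost)]
```

On this convex toy problem the swarm collapses onto the minimum quickly; the interesting behaviour, of course, only shows up on multimodal costs.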
I have heard that the last two are better at finding the global minimum, but are there any advantages of one over the other?
Also, why does an analytical solution not exist? For example, why could you not calculate all of the partial derivatives, find where they all equal zero, compare those points, and use the lowest?
Surely this last method would work every time, so why does no one do it like this?
Thanks!
Simple answer to your last question: a layered neural network is a highly non-linear function, and so are its squared sum of error terms and the gradient of that sum with respect to the parameters. Setting the gradient to zero gives you a non-linear system of equations that, in general, can only be solved iteratively. So you are back to iterative methods, of which gradient descent is the simplest one that uses derivatives.
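To make that concrete: even a single tanh "neuron" with one weight gives a transcendental equation when you set dE/dw = 0, so there is nothing to solve in closed form and one iterates instead. A minimal gradient-descent sketch on such a toy (the data, initial guess, and learning rate are all made up for illustration):

```python
import numpy as np

# Fit y = tanh(w * x): even this one-parameter "network" yields a
# transcendental equation when its error gradient is set to zero,
# so we minimise iteratively instead.
x = np.linspace(-2, 2, 50)
y = np.tanh(1.5 * x)                 # data generated with true w = 1.5

def grad(w):
    # dE/dw for E = mean squared error, via the chain rule.
    p = np.tanh(w * x)
    return np.mean(2 * (p - y) * (1 - p ** 2) * x)

w, lr = 0.1, 0.1                     # initial guess and learning rate
for _ in range(1000):
    w -= lr * grad(w)
```

After the loop, `w` has crept towards the generating value 1.5, but only at the slow linear rate typical of plain gradient descent.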
But of course, one can do better than gradient descent. Even in the linear case, gradient descent is known to be sub-optimal whenever the system matrix does not have a very small condition number. That is why the conjugate gradient method exists, along with a whole host of ideas on preconditioners. For non-linear systems, the limited-memory BFGS method fills a similar role, and indeed fast modern backprop implementations use some variant of it.
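For illustration, here is L-BFGS applied to a toy non-linear least-squares fit through SciPy's optimiser (this assumes SciPy is available; the tanh model and data are invented for the example and merely stand in for a network's cost):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -0.5, 0.25])
y = np.tanh(X @ true_w)               # realisable toy "network" outputs

def loss(w):
    return 0.5 * np.sum((np.tanh(X @ w) - y) ** 2)

def grad(w):
    p = np.tanh(X @ w)
    return X.T @ ((p - y) * (1 - p ** 2))

# L-BFGS builds a low-memory approximation to second-order information
# from successive gradients, instead of taking raw gradient steps.
res = minimize(loss, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
```

On problems like this it typically needs orders of magnitude fewer iterations than plain gradient descent to reach the same loss.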
Another point is that the objective function, the squared sum of the error terms over the training vectors, has, aside from its inner non-linear structure, the outer structure of a sum of squares. This relates it to Gauss-Newton methods, which give a slightly different view of second-order derivative information via the QR decomposition of the gradients of the unsquared errors of the individual training points.
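A sketch of that Gauss-Newton view on an invented toy model (the QR factorisation acts on the Jacobian of the unsquared per-sample errors, and each step solves the linearised least-squares problem J·δ ≈ -r):

```python
import numpy as np

# Gauss-Newton on a toy model y = tanh(X w): training point i contributes
# the unsquared error r_i(w) = tanh(x_i . w) - y_i, and each step solves
# the linearised least-squares problem J d = -r via a QR decomposition.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -0.5, 0.25])
y = np.tanh(X @ true_w)

def residuals(w):
    return np.tanh(X @ w) - y

def jacobian(w):
    p = np.tanh(X @ w)
    return (1 - p ** 2)[:, None] * X   # row i is dr_i/dw

w = np.zeros(3)
for _ in range(20):
    J, r = jacobian(w), residuals(w)
    Q, R = np.linalg.qr(J)             # thin QR of the residual Jacobian
    w = w + np.linalg.solve(R, -Q.T @ r)
```

Because J-transpose-J approximates the Hessian of the summed squares, convergence near the solution is far faster than for plain gradient steps on this kind of small-residual problem.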
This outer sum also lends itself to a stochastic interpretation of the objective function, and thus to implementing the learning algorithm as a stochastic process, as in stochastic gradient descent.
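In a sketch, that stochastic view means each update uses the gradient of a random minibatch of training points as a noisy estimate of the full sum's gradient (again a made-up toy model; the batch size and learning rate are arbitrary choices):

```python
import numpy as np

# Stochastic gradient descent: because the objective is a sum over
# training points, each step may use a random minibatch's gradient
# as an unbiased but noisy estimate of the full gradient.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -0.5, 0.25])
y = np.tanh(X @ true_w)

def minibatch_grad(w, idx):
    p = np.tanh(X[idx] @ w)
    return X[idx].T @ ((p - y[idx]) * (1 - p ** 2)) / len(idx)

w, lr, batch = np.zeros(3), 0.3, 32
for _ in range(3000):
    idx = rng.integers(0, len(X), size=batch)
    w -= lr * minibatch_grad(w, idx)
```

Each step touches only 32 of the 1000 training points, which is exactly why this scales to large training sets.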