I was reading about the EM algorithm (https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) - an algorithm used to maximize likelihood functions of statistical models, particularly models with latent variables.
I have heard that in the context of optimizing the likelihood function for mixture models in statistics (https://en.wikipedia.org/wiki/Mixture_model), the EM algorithm is preferred to more common algorithms such as gradient descent. Apparently, this is because the likelihood function of a mixture model is usually "multi-modal" (for example, a mixture model combines several normal distributions, each with its own mode, so the resulting likelihood is almost guaranteed to be multi-modal).
What I am having difficulty understanding is the following point: why should the EM algorithm be any better suited to optimizing multi-modal functions than gradient descent?
That is, considering the mathematical properties of multi-modal functions, of the EM algorithm, and of gradient descent, how can we use those properties to rationalize why EM is more suited to optimizing multi-modal functions than gradient descent?
Thanks!
Note: My guess is that perhaps the EM algorithm is less computationally expensive than gradient descent? Do either of these algorithms have theoretical convergence properties that could explain the traditional preference for EM over gradient descent on multi-modal functions?
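For concreteness, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture (the kind of model the question is about). The data, initialization, and all function names are illustrative assumptions, not from any particular reference; the point is that each EM iteration has a closed-form M-step, with no step size to tune.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # crude initialization: extreme points as means, overall spread as sigmas
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    w = 0.5  # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each point
        p1 = w * normal_pdf(x, mu1, s1)
        p2 = (1 - w) * normal_pdf(x, mu2, s2)
        r = p1 / (p1 + p2)
        # M-step: closed-form weighted maximum-likelihood updates
        w = r.mean()
        mu1 = np.sum(r * x) / r.sum()
        mu2 = np.sum((1 - r) * x) / (1 - r).sum()
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / r.sum())
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / (1 - r).sum())
    return w, mu1, s1, mu2, s2

# synthetic data: equal mixture of N(-3, 1) and N(3, 1)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu1, s1, mu2, s2 = em_gmm_1d(x)
```

Note that EM is still a local method: like gradient descent, it climbs to whichever stationary point its initialization leads to, so this sketch does not by itself resolve the multi-modality issue.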
You're fixated on gradient descent… Gradient descent is by far the worst possible optimization algorithm - unless you qualify it better by saying which "improved" variant of gradient descent we are talking about, whatever that means.
It’s no surprise that EM is superior. Is BFGS superior to gradient descent? Most likely it is. Is SLSQP superior to gradient descent? Most likely it is. Is GRG superior to gradient descent? Most likely it is.
Do you have a single computational experiment - excluding trivially simple quadratic objective functions - that shows gradient descent outperforming all other methods?
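To illustrate one concrete failure mode that such an experiment would expose: on a multi-modal objective, plain gradient descent simply converges to whichever local minimum its starting point is nearest. Below is a hypothetical toy example (not from the thread): descending the negative density of an equal-weight two-component Gaussian mixture, using a dependency-free finite-difference gradient; all names and parameters are illustrative.

```python
import math

def neg_mix(x):
    """Negative density of 0.5*N(-2,1) + 0.5*N(2,1): two local minima, at ~-2 and ~+2."""
    return -(0.5 * math.exp(-0.5 * (x + 2) ** 2)
             + 0.5 * math.exp(-0.5 * (x - 2) ** 2)) / math.sqrt(2 * math.pi)

def grad(f, x, h=1e-6):
    # central finite difference, to keep the sketch dependency-free
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient_descent(f, x0, lr=0.5, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(f, x)
    return x

left = gradient_descent(neg_mix, -1.0)   # starting left of 0: lands near -2
right = gradient_descent(neg_mix, 1.0)   # starting right of 0: lands near +2
```

The two runs end in different minima purely because of the starting point - which is exactly why a fair comparison has to specify the variant (restarts, momentum, line search, BFGS, etc.) and the test objective.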