I was reading about the EM algorithm (https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) - an algorithm used to maximize likelihood functions of statistical models, particularly models with latent variables.
I have heard that in the context of optimizing the likelihood function for mixture models in statistics (https://en.wikipedia.org/wiki/Mixture_model), the EM algorithm is preferred to more common algorithms such as gradient descent. Apparently, this is because the likelihood function of a mixture model is usually "multi-modal" (for example, a mixture model combines several normal distributions, each with its own mode, so the resulting likelihood is almost guaranteed to be multi-modal).
What I am having difficulty understanding is the following point: why should the EM algorithm be any better suited to optimizing multi-modal functions than gradient descent?
That is, considering the mathematical properties of multi-modal functions, of the EM algorithm, and of gradient descent, how can we use those properties to rationalize why EM is more suited to optimizing multi-modal functions than gradient descent?
Thanks!
Note: My guess is that perhaps the EM algorithm is less computationally expensive than gradient descent? Do either of these algorithms have theoretical convergence properties that could explain the traditional preference for EM over gradient descent on multi-modal functions?
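For concreteness, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture (the kind of model the question is about). The data, initialization, and all function names are illustrative assumptions, not from any particular reference; the point is that each EM iteration has a closed-form M-step, with no step size to tune.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # crude initialization: extreme points as means, overall spread as sigmas
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    w = 0.5  # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each point
        p1 = w * normal_pdf(x, mu1, s1)
        p2 = (1 - w) * normal_pdf(x, mu2, s2)
        r = p1 / (p1 + p2)
        # M-step: closed-form weighted maximum-likelihood updates
        w = r.mean()
        mu1 = np.sum(r * x) / r.sum()
        mu2 = np.sum((1 - r) * x) / (1 - r).sum()
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / r.sum())
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / (1 - r).sum())
    return w, mu1, s1, mu2, s2

# synthetic data: equal mixture of N(-3, 1) and N(3, 1)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu1, s1, mu2, s2 = em_gmm_1d(x)
```

Note that EM is still a local method: like gradient descent, it climbs to whichever stationary point its initialization leads to, so this sketch does not by itself resolve the multi-modality issue.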
You're fixated on gradient descent… Gradient descent is by far the worst possible optimization algorithm - unless you qualify it better by saying which "improved" variant of gradient descent we are talking about, whatever that means.
It’s no surprise that EM is superior. Is BFGS superior to gradient descent? Most likely it is. Is SLSQP superior to gradient descent? Most likely it is. Is GRG superior to gradient descent? Most likely it is.
Do you have a single computational experiment - excluding trivially simple quadratic objective functions - that shows gradient descent outperforming all other methods?
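To illustrate one concrete failure mode that such an experiment would expose: on a multi-modal objective, plain gradient descent simply converges to whichever local minimum its starting point is nearest. Below is a hypothetical toy example (not from the thread): descending the negative density of an equal-weight two-component Gaussian mixture, using a dependency-free finite-difference gradient; all names and parameters are illustrative.

```python
import math

def neg_mix(x):
    """Negative density of 0.5*N(-2,1) + 0.5*N(2,1): two local minima, at ~-2 and ~+2."""
    return -(0.5 * math.exp(-0.5 * (x + 2) ** 2)
             + 0.5 * math.exp(-0.5 * (x - 2) ** 2)) / math.sqrt(2 * math.pi)

def grad(f, x, h=1e-6):
    # central finite difference, to keep the sketch dependency-free
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient_descent(f, x0, lr=0.5, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(f, x)
    return x

left = gradient_descent(neg_mix, -1.0)   # starting left of 0: lands near -2
right = gradient_descent(neg_mix, 1.0)   # starting right of 0: lands near +2
```

The two runs end in different minima purely because of the starting point - which is exactly why a fair comparison has to specify the variant (restarts, momentum, line search, BFGS, etc.) and the test objective.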