The Expectation-Maximization Algorithm (https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) is a well-known optimization algorithm that is useful in situations where the function being optimized depends on some "unobservable information" (latent variables).
As an example, consider Gaussian Mixture Model (GMM) Clustering: given some data points, we assume that the underlying data-generating process is a convex combination (i.e. a weighted mixture, with non-negative weights summing to 1) of Gaussian Distributions. We fix some discrete number of Gaussian components - the problem then becomes estimating the "mean and standard deviation" parameters of each component, as well as the "weight" each component contributes to the overall mixture. In practice, this estimation is often handled iteratively using the EM algorithm.
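For concreteness, here is a minimal sketch of the data-generating process I have in mind (a hypothetical two-component 1-D mixture; all weights, means, and standard deviations below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component 1-D mixture parameters (illustration values only).
weights = np.array([0.3, 0.7])   # non-negative, sum to 1 (convex combination)
means   = np.array([-2.0, 3.0])
stds    = np.array([0.5, 1.0])

n = 1000
# The "unobservable information": which component generated each point.
components = rng.choice(len(weights), size=n, p=weights)
# Given the (latent) component, the point is an ordinary Gaussian draw.
data = rng.normal(means[components], stds[components])
```

The component labels `components` are exactly the latent variables that EM marginalizes over - in real data we only observe `data`.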
In this problem, we first write down the expected (complete-data) log-likelihood of the Gaussian Mixture Model - we then "maximize" this likelihood, and then update the expectation again. We repeat this process until some convergence criterion is met. This being said, I have the following question:
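To make sure I am describing the iteration correctly, here is a minimal 1-D GMM EM sketch I put together (initialization scheme and iteration count are my own arbitrary choices; the M-step updates are the weighted-average formulas from the Wikipedia page, which look closed-form to me - this is exactly what my question below is about):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(data, k, n_iter=50):
    # Crude deterministic initialization: spread means over quantiles,
    # start with the overall standard deviation and uniform weights.
    mu = np.quantile(data, (np.arange(k) + 1.0) / (k + 1.0))
    sigma = np.full(k, data.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | data point i).
        dens = w * gaussian_pdf(data[:, None], mu, sigma)      # shape (n, k)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted-average updates (apparently closed-form?).
        nk = r.sum(axis=0)
        w = nk / len(data)
        mu = (r * data[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma
```

On well-separated synthetic data this loop recovers the component parameters without any inner call to a gradient-based optimizer, which is what prompted my question.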
How exactly is the maximization step performed? On the Wikipedia page, the maximization step appears to have an analytical solution, and a note indicates that "Q(θ | θ(t)) being quadratic in form means that determining the maximizing values of θ is relatively straightforward". Does "relatively straightforward" mean that, even within the EM algorithm itself, another optimization algorithm such as Gradient Descent or BFGS is being used to perform the maximization step?
A follow-up question: if the above is true, i.e. the EM algorithm itself uses a gradient-based optimization algorithm to perform the maximization step, then what is stopping us from applying Gradient Descent directly to the original optimization problem? Is this just a matter of personal preference - for such problems, does it simply depend on my "mood" whether I use EM or Gradient Descent? Or are there some optimization problems where algorithms like Gradient Descent fundamentally won't work and EM is thus required?
Thank you!