Why is Gradient Descent always chosen for Neural Networks?


I am trying to understand why Gradient Descent is the algorithm of choice for optimizing the Loss Function in Neural Networks - and why other algorithms (e.g. the EM Algorithm, https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) are not.

Part 1: Recently, I posted a question (Why is this optimization problem difficult?) on understanding why the EM algorithm is a good choice for a difficult optimization problem involving estimating the parameters for a mixture of Normal Probability Distributions:

\begin{align*} p(x|\theta) &= \sum_{i=1}^{k} \pi_i \mathcal{N}(x|\mu_i, \sigma_i^2) \end{align*}

$$\sum_{i=1}^{k} \pi_i = 1$$

Optimization Problem: \begin{align*} L(\theta|X) &= \prod_{j=1}^{n} p(x_j|\theta) = \prod_{j=1}^{n} \sum_{i=1}^{k} \pi_i \mathcal{N}(x_j|\mu_i, \sigma_i^2) \end{align*}
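To make the objective concrete, here is a minimal sketch (my own illustrative helper, not from the question) that evaluates the log of this likelihood for a 1-D Gaussian mixture:

```python
import math

def gmm_log_likelihood(x, pis, mus, sigmas):
    """Log-likelihood of data x under a 1-D Gaussian mixture.

    pis, mus, sigmas are the mixture weights, means, and standard
    deviations (theta in the question's notation).
    """
    total = 0.0
    for xj in x:
        # p(x_j | theta) = sum_i pi_i * N(x_j | mu_i, sigma_i^2)
        p = sum(
            pi / (sigma * math.sqrt(2 * math.pi))
            * math.exp(-((xj - mu) ** 2) / (2 * sigma ** 2))
            for pi, mu, sigma in zip(pis, mus, sigmas)
        )
        total += math.log(p)
    return total
```

Working with the log of $L(\theta|X)$ turns the product over $j$ into a sum, which is numerically far more stable; the maximizer is unchanged.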

We are interested in estimating the parameter vector $\theta = (\pi_i, \mu_i, \sigma_i)$ of this function (i.e. all $\pi_i$, $\mu_i$, $\sigma_i$) such that the resulting estimates produce the largest value of $L(\theta|X)$.

For this type of optimization problem, the EM algorithm is the standard choice (note: $Z$ denotes the latent variable indicating which mixture component generated each observation).

  • Step 1: First, choose initial values for the parameters $\theta^{(0)} = \{\pi_i, \mu_i, \sigma_i^2\}$ for all $i = 1, ..., k$.

  • (E Step) Step 2: Next, compute the expected value of the log-likelihood function, with respect to the conditional distribution of $Z$ given $X$ under the current estimate of the parameters $\theta^{(i)}$ : $Q(\theta|\theta^{(i)}) = E_{Z|X,\theta^{(i)}}[\log L(\theta; X, Z)]$.

  • (M Step) Step 3: Find the parameter that maximizes this quantity: $\theta^{(i+1)} = \arg\max_{\theta} Q(\theta|\theta^{(i)})$.

  • Step 4: Repeat the E-step and M-step until convergence.
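The four steps above can be sketched as follows for the 1-D Gaussian mixture case. This is an illustrative, unhardened implementation of my own (the closed-form M-step updates are standard for Gaussian mixtures):

```python
import math

def em_gmm(x, pis, mus, sigmas, n_iter=50):
    """EM for a 1-D Gaussian mixture (illustrative sketch).

    pis, mus, sigmas serve as the initial guess theta^(0) (Step 1)
    and are updated in place over n_iter E/M iterations (Step 4).
    """
    k = len(pis)
    for _ in range(n_iter):
        # E step (Step 2): responsibilities r[j][i] = P(z_j = i | x_j, theta)
        resp = []
        for xj in x:
            w = [pis[i] / (sigmas[i] * math.sqrt(2 * math.pi))
                 * math.exp(-((xj - mus[i]) ** 2) / (2 * sigmas[i] ** 2))
                 for i in range(k)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M step (Step 3): closed-form updates maximizing Q(theta | theta^(i))
        for i in range(k):
            ni = sum(r[i] for r in resp)
            pis[i] = ni / len(x)
            mus[i] = sum(r[i] * xj for r, xj in zip(resp, x)) / ni
            var = sum(r[i] * (xj - mus[i]) ** 2 for r, xj in zip(resp, x)) / ni
            sigmas[i] = math.sqrt(max(var, 1e-12))  # guard against collapse
    return pis, mus, sigmas
```

Note that the M step here requires no step size and no gradient: for this model, maximizing $Q(\theta|\theta^{(i)})$ has a closed-form solution, which is a large part of EM's appeal for mixture models.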

This optimization problem is considered to be very difficult, as it is multi-modal (i.e. many local minima/maxima) and, in addition, permuting the component labels (swapping the indices of the $\pi_i$, $\mu_i$, $\sigma_i$ together) results in the same value of the function being optimized.

Essentially, the EM algorithm breaks the original optimization problem into two smaller sub-problems that are easier to solve, and it can be mathematically proven that this approach guarantees that the likelihood does not decrease with each successive iteration.

Part 2: For a generic Neural Network, the loss function being optimized can be written as:

\begin{equation} L(w_1, w_2, ..., w_n) = \frac{1}{N} \sum_{i=1}^{N} L_i(y_i, f(x_i; w_1, w_2, ..., w_n)) \end{equation}

\begin{equation} \nabla L(\mathbf{w}) = \left[ \frac{\partial L(\mathbf{w})}{\partial w_1}, \frac{\partial L(\mathbf{w})}{\partial w_2}, ..., \frac{\partial L(\mathbf{w})}{\partial w_n} \right]^T \end{equation}

\begin{equation} w_{i}^{(j+1)} = w_{i}^{(j)} - \eta \frac{\partial L(w_1^{(j)}, w_2^{(j)}, ..., w_n^{(j)})}{\partial w_{i}} \end{equation}
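The update rule above can be written generically: given any function returning the gradient vector $\nabla L(\mathbf{w})$, repeatedly step against it. A minimal sketch (the quadratic toy loss below is my own stand-in for a network loss, not from the question):

```python
def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Generic gradient descent: w^(j+1) = w^(j) - eta * grad L(w^(j))."""
    w = list(w0)
    for _ in range(n_steps):
        g = grad(w)  # gradient of the loss at the current weights
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy stand-in loss: L(w) = (w1 - 3)^2 + (w2 + 1)^2,
# so grad L(w) = (2(w1 - 3), 2(w2 + 1)) and the minimizer is (3, -1).
grad = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
w_star = gradient_descent(grad, [0.0, 0.0])
```

Unlike EM's closed-form M step, gradient descent only needs $\nabla L$, which backpropagation supplies cheaply for any differentiable network; in practice the gradient is also estimated on mini-batches of the $N$ training examples (stochastic gradient descent).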

My Question: As we can see from the above equations, optimizing a Neural Network is also a high-dimensional and multimodal problem (i.e. based on the number of weights $w_i$): the corresponding loss functions are said to be highly non-convex, with multiple local minima/maxima and saddle points. Yet Gradient Descent is used here instead of the EM algorithm. Can anyone please explain why this is the case? Why is EM so well suited to mixture models and Gradient Descent so well suited to Neural Networks? How do these two optimization problems differ in structure and complexity such that one is usually solved with EM and the other with Gradient Descent?

Thanks!