From Goodfellow book: why can one rescale argmax of conditional probability into an expectation?


I don't understand why the two equations below are equivalent.

\begin{align} \boldsymbol{\theta}_{\rm ML} &= \mathop{\rm argmax}_\boldsymbol{\theta} \sum_{i=1}^{m} \log p_{\rm model}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \tag{5.58}\label{5.58} \\ \boldsymbol{\theta}_{\rm ML} &= \mathop{\rm argmax}_\boldsymbol{\theta} \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\rm data}} \log p_{\rm model}(\boldsymbol{x}; \boldsymbol{\theta}) \tag{5.59}\label{5.59} \end{align}

Quoted from chapter 5 of Deep Learning:

Because the $\mathop{\rm argmax}$ does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat{p}_{\rm data}$ defined by the training data.

Best answer:

It just takes a bit of time to find out where the symbols are defined; no computation is involved in this question. \begin{align} \boldsymbol{\theta}_{\rm ML} &= \mathop{\rm argmax}_\boldsymbol{\theta} \sum_{i=1}^{m} \log p_{\rm model}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \tag{5.58}\label{558} \\ &= \mathop{\rm argmax}_\boldsymbol{\theta} \underbrace{\frac1m}_\text{const.} \sum_{i=1}^{m} \log p_{\rm model}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \tag{divide by $m$}\label{frac1m} \end{align}
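To see numerically that dividing by the constant $1/m$ leaves the $\mathop{\rm argmax}$ unchanged, here is a minimal sketch (my own toy example, not from the book) with a hypothetical Bernoulli model and a grid of candidate parameters:

```python
import numpy as np

# Toy check: dividing the log-likelihood by m does not move the argmax.
# Model: Bernoulli(theta); data are m = 100 simulated coin flips.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=100)        # hypothetical training set
thetas = np.linspace(0.01, 0.99, 981)     # candidate parameters, step 0.001

# log p_model(x_i; theta) for every (sample, theta) pair, shape (100, 981)
log_lik = x[:, None] * np.log(thetas) + (1 - x[:, None]) * np.log(1 - thetas)

total = log_lik.sum(axis=0)   # eq. (5.58): sum over i
mean = log_lik.mean(axis=0)   # eq. (5.58) divided by the constant m

assert np.argmax(total) == np.argmax(mean)   # same maximiser either way
print(thetas[np.argmax(total)])              # close to the sample mean of x
```

The maximiser of both curves lands on the grid point nearest the sample mean of the data, which is the familiar Bernoulli MLE.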

In \eqref{559}, the $\boldsymbol{x}^{(i)}$'s are replaced by $\boldsymbol{x}$ and a new symbol $\hat{p}_{\rm data}$ is introduced, so it is worth scrolling up the page to see where they are defined.

Consider a set of $m$ examples $\mathbb{X} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}\}$ drawn independently from the true but unknown data-generating distribution $p_{\rm data}(\mathbf{x})$.

Observe the difference between the two boldface styles and their corresponding meanings.

\begin{array}{|c|c|c|} \hline \text{$\rm \LaTeX$ code} & \texttt{\textbackslash boldsymbol\{x\}} & \texttt{\textbackslash mathbf\{x\}} \\ \hline \text{output} & \boldsymbol{x} & \mathbf{x} \\ \hline \text{meaning} & \text{realized value} & \text{random variable} \\ \hline \text{usage} & \boldsymbol{x}^{(i)} & \hat{p}_{\rm data}(\mathbf{x}) \\ \hline \end{array}

Let's take a closer look at the smaller part of \eqref{559}, namely the subscript.

\begin{equation} \boldsymbol{\theta}_{\rm ML} = \mathop{\rm argmax}_\boldsymbol{\theta} \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\rm data}} \log p_{\rm model}(\boldsymbol{x}; \boldsymbol{\theta}). \tag{5.59}\label{559} \end{equation}

It reads

$${\huge \mathbf{x} \sim \hat{p}_{\rm data}}. \tag{subscript} \label{sub}$$

From the previous quoted text, it's clear that the $\boldsymbol{x}^{(i)}$'s are i.i.d. draws from the (unknown) data-generating distribution. In \eqref{558}, we evaluate $\log p_{\rm model}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})$ at these $m$ realisations $\boldsymbol{x}^{(i)}$, $i = 1,\dots, m$, then take the simple average in \eqref{frac1m}. This average is exactly an expectation with respect to the empirical distribution $\hat{p}_{\rm data}$, which places probability mass $1/m$ on each training point $\boldsymbol{x}^{(i)}$. The symbol $\mathbb{E}$ captures the idea of "average", and the \eqref{sub} indicates the underlying probability distribution.
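The identity between the sample average and the expectation under $\hat{p}_{\rm data}$ can be checked directly. Below is a small sketch (again my own hypothetical example) that computes the same quantity both ways: once as a plain mean, once as an explicit sum over the empirical probability mass function:

```python
import numpy as np

# The empirical distribution puts mass 1/m on each training point, so
# E_{x ~ p_hat}[f(x)] is literally the sample average of f over the data.
x = np.array([0, 1, 1, 0, 1])   # hypothetical data set, m = 5

def f(v):
    # log p_model(v; theta) for a Bernoulli model with theta = 0.6
    return np.log(0.6) * v + np.log(0.4) * (1 - v)

sample_average = f(x).mean()    # (1/m) * sum_i f(x_i), as in eq. (5.58)/m

# The same quantity written as an expectation: distinct values weighted
# by their empirical probabilities (here P(0) = 2/5, P(1) = 3/5).
values, counts = np.unique(x, return_counts=True)
p_hat = counts / counts.sum()
expectation = np.sum(p_hat * f(values))

assert np.isclose(sample_average, expectation)
```

Because the two expressions agree exactly (not just approximately), passing from \eqref{frac1m} to \eqref{559} is purely a change of notation.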