Maximum Likelihood derivation


In the "Deep Learning" book by Goodfellow et al., section 5.5 (Maximum Likelihood Estimation), there is the following derivation step:

$\theta_{ML} = \arg\max_{\theta} \sum^m_{i=1}\log p_{model}(x^{(i)};\theta)$

Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat{p}_{data}$ defined by the training data:

$\theta_{ML} = \arg\max_{\theta} \mathop{\mathbb{E}}_{x\sim\hat{p}_{data}} \log p_{model}(x;\theta)$

I understand the rescaling by $\frac{1}{m}$ doesn't change argmax.

But how did it result in the expectation appearing there?

Could you write out the step(s) in between?

1 Answer
$\hat{p}_{data}$ is the empirical distribution that puts mass $\frac 1m$ on each training sample: $\hat{p}_{data} = \sum_{i=1}^m \frac 1m\delta_{x^{(i)}}$.

Thus $\mathop{\mathbb{E}}_{x\sim\hat{p}_{data}} \log p_{model}(x;\theta) = \sum_{i=1}^m \frac 1m\log p_{model}(x^{(i)};\theta)$, which is exactly the original sum rescaled by $\frac 1m$. Since rescaling by a positive constant does not change the arg max, the two criteria are equivalent.
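A quick numerical sanity check of this identity, as a sketch: the model below is a hypothetical unit-variance Gaussian $p_{model}(x;\theta) = \mathcal{N}(x;\theta,1)$ standing in for a generic $p_{model}$, and the data are arbitrary samples. The expectation under $\hat{p}_{data}$ (a plain sample mean, since each point carries mass $\frac 1m$) matches the rescaled log-likelihood sum, and both objectives pick the same $\theta$ on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
x = rng.normal(loc=2.0, scale=1.0, size=m)  # training samples x^(1..m)

def log_p_model(x, theta):
    # log density of N(theta, 1) -- a stand-in for log p_model(x; theta)
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (x - theta) ** 2

theta = 1.5

# (1/m) * sum_i log p_model(x^(i); theta)
rescaled_sum = np.sum(log_p_model(x, theta)) / m

# E_{x ~ p_hat_data}[log p_model(x; theta)]: each sample has mass 1/m,
# so the expectation is just the sample mean
empirical_expectation = np.mean(log_p_model(x, theta))

assert np.isclose(rescaled_sum, empirical_expectation)

# The arg max is also unchanged by the 1/m rescaling:
thetas = np.linspace(-5.0, 5.0, 1001)
sums = np.array([np.sum(log_p_model(x, t)) for t in thetas])
means = np.array([np.mean(log_p_model(x, t)) for t in thetas])
assert np.argmax(sums) == np.argmax(means)
```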