Why is the KL divergence the number of bits required to represent the error of an estimator?


I am familiar with several interpretations of the KL divergence. Last week I heard a new one, mentioned in a lecture on probabilistic graphical models. It was stated somewhat offhandedly, so I hope I'm getting the gist, but I seem to remember something like:

"The KL divergence between a distribution $\mathcal{D}$ and an empirical distribution $\mathcal{D}_{emp}$ based on a sample $\mathcal{X}\sim\mathcal{D}$ is the number of bits required to represent the error of an MLE based on $\mathcal{X}$"

(I know this sounds vague: an MLE for which parameter? Maybe for any parameter? I'm honestly not sure.)
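To make the quantity I'm asking about concrete, here is a quick sketch of how I understand $D_{KL}(\mathcal{D}_{emp} \,\|\, \mathcal{D})$ measured in bits (log base 2); the specific distribution and sample size are just my own toy example, not from the lecture:

```python
import numpy as np

# Toy example (my own, not from the lecture): a true categorical
# distribution D, and the empirical distribution D_emp of a sample
# drawn from it.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                # true distribution D
sample = rng.choice(len(p), size=1000, p=p)  # X ~ D
counts = np.bincount(sample, minlength=len(p))
p_emp = counts / counts.sum()                # empirical distribution D_emp

# KL(D_emp || D) in bits (log base 2); outcomes with zero empirical
# mass contribute 0 to the sum by convention.
mask = p_emp > 0
kl_bits = np.sum(p_emp[mask] * np.log2(p_emp[mask] / p[mask]))
print(kl_bits)  # small and nonnegative; shrinks as the sample grows
```

As the sample size grows, $\mathcal{D}_{emp}$ concentrates around $\mathcal{D}$ and this divergence goes to zero, so if the quoted statement is right, the "number of bits of error" should shrink accordingly.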

Is anyone familiar with such a result?