(Log) Likelihood as a Loss Function?


I'm trying to understand the relation between the theory of statistical decision problems and the theory of regression of distributions.

Recall that a statistical decision problem consists of

  • a measure space $(\Omega,\mathcal{A})$,
  • a set $\Theta$ of parameters,
  • a family $(P_\theta)_{\theta \in \Theta}$ of distributions on $(\Omega,\mathcal{A})$,
  • a decision space $(E,\mathcal{E})$,

and finally

  • a (measurable) loss function $L: \Theta \times E \rightarrow [0,\infty]$.

From this data one derives the risk function $R: \Theta \times \mathcal{S}(\Omega,E) \rightarrow [0,\infty]$ defined as the integral $$R(\theta,f) = \int_{\Omega} dP_\theta(\omega) \int_E df(\omega)(e) \ L(\theta,e),$$ where $\mathcal{S}(\Omega,E)$ denotes the space of stochastic functions (Markov kernels) from $\Omega$ to $E$.
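When all the spaces involved are finite, this risk is just a weighted sum, and a Markov kernel is a row-stochastic matrix. Here is a minimal sketch in Python; all the numbers (the distributions, the 0-1 loss, the kernel) are made up purely for illustration:

```python
import numpy as np

# Toy finite decision problem: Omega = {0,1}, Theta = {0,1}, E = {0,1}.
# (All numbers below are illustrative, not from any particular model.)
P = np.array([[0.7, 0.3],    # row theta: the distribution P_theta on Omega
              [0.2, 0.8]])
L = np.array([[0.0, 1.0],    # L(theta, e): 0-1 loss on the decision
              [1.0, 0.0]])
f = np.array([[0.9, 0.1],    # a Markov kernel f: Omega -> E,
              [0.1, 0.9]])   # row omega: the distribution f(omega) on E

def risk(theta):
    # R(theta, f) = sum_omega P_theta(omega) sum_e f(omega)(e) L(theta, e)
    return float(P[theta] @ f @ L[theta])
```

The double integral collapses to the matrix product `P[theta] @ f @ L[theta]`, which makes it easy to experiment with how randomizing the kernel `f` trades off the risks at different values of $\theta$.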

It seems to me that this framework should be general enough to capture the notion of regression of distributions from samples, in particular it should be possible to express the notion of maximum likelihood estimation. However, I am not quite sure how.

If on the other hand one allows the loss function to depend explicitly on $\Omega$, i.e. if one assumes that $$L: \Theta \times \Omega \times E \rightarrow [0,\infty]$$ and accordingly defines $R$ as $$R(\theta,f) = \int_{\Omega} dP_\theta(\omega) \int_E df(\omega)(e) \ L(\theta,\omega,e),$$ then one can view maximum likelihood estimation in this more general framework in the following way, at least in the example I consider.

Given a statistical model $(X,\mathcal{X},(P'_\theta)_{\theta \in \Theta})$ and sample size $N \geq 1$, let

  • $(\Omega,\mathcal{A}) = (X,\mathcal{X})^{\times N}$ be the $N$-fold product,
  • $P_\theta = {P'_\theta}^{\otimes N}$ the product measure, and let
  • $\omega = (\omega_1,\ldots,\omega_N) \in \Omega$ be a sample.

Now assume that $(X,\mathcal{X}) = (\{0,1\},\mathcal{P}(\{0,1\}))$ is the Bernoulli space and $P'_\theta = \text{Bern}_\theta$ the Bernoulli distribution, i.e. $\text{Bern}_\theta(\{1\}) = \theta$.

Then we consider the decision space

  • $E = [0,1]$ (with the Borel algebra), and the loss function
  • $L(\theta,\omega,e) = -\log(P_e(\{\omega\})) = L(\omega,e)$ (independent of $\theta$).

The risk $$R(\theta,f) = \int_{\Omega} d\text{Bern}_\theta^{\otimes N}(\omega) \int_{[0,1]} df(\omega)(e)\ L(\omega,e)$$ is minimal when the decision function $f$ is chosen to be $$f(\omega) = \delta_{\operatorname{argmin}_{e \in [0,1]} L(\omega,e)}\quad \text{(non-randomized)},$$ which is to say that for each sample $\omega$ we choose $e$ so as to maximize the (log) likelihood of $\omega$. One can also compute that for this choice of $f$ the loss at a sample $\omega$ equals the Shannon entropy $\mathcal{H}\big((\Omega,\mathcal{A},P_{\overline{\theta}})\big)$ of the product space at the empirical parameter $$\overline{\theta} = \frac{1}{N} \sum_{i = 1}^N \omega_i,$$ so that (since $\overline{\theta}$ depends on $\omega$) the risk is the expected entropy $$R(\theta,f) = \mathbb{E}_{P_\theta}\left[\mathcal{H}\big((\Omega,\mathcal{A},P_{\overline{\theta}})\big)\right].$$
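For what it's worth, this Bernoulli computation can be checked numerically. The sketch below (names like `p_prod` are my own, nothing canonical) enumerates all $2^N$ samples, confirms that the loss at the MLE $\overline{\theta}$ equals $N\,\mathcal{H}(\text{Bern}_{\overline{\theta}})$ pointwise, and accumulates the risk as the expected entropy:

```python
import math
from itertools import product

N, theta = 4, 0.3  # illustrative sample size and true parameter

def p_prod(omega, e):
    # the product likelihood Bern_e^{otimes N}({omega})
    k = sum(omega)
    return e**k * (1 - e)**(N - k)   # note: 0**0 == 1 handles e in {0, 1}

def H(p):
    # Shannon entropy (in nats) of Bern_p, with the convention 0 log 0 = 0
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

risk = 0.0
for omega in product([0, 1], repeat=N):
    theta_bar = sum(omega) / N                  # the MLE for this sample
    loss = -math.log(p_prod(omega, theta_bar))  # L(omega, theta_bar)
    # pointwise, the loss at the MLE is N times the entropy of Bern_{theta_bar}
    assert abs(loss - N * H(theta_bar)) < 1e-9
    risk += p_prod(omega, theta) * loss         # weight by P_theta^{otimes N}
```

The inner assertion is exactly the identity used above: since the empirical parameter matches the sample, $-\log {P'_{\overline{\theta}}}^{\otimes N}(\{\omega\}) = N\,\mathcal{H}(\text{Bern}_{\overline{\theta}})$.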

Now my question is: is there a natural way to view maximum likelihood estimation in the framework of decision theory, and in particular, is there a natural way to view the (log) likelihood as a loss function?