Maximum likelihood as correspondence (or, How I hate the poor usage of Mathematics in Statistical textbooks)


Maybe the title is a bit much, but it describes both my question and my sentiment towards (what I perceive to be) the neglect of mathematical rigor in statistical textbooks. The preamble to my question is this:

Consider a statistical experiment $(\Omega, \mathcal{F}, \mathscr{P})$ over the parameter space $\Theta = (0, \infty)$.

(By a statistical experiment I mean a measurable space $(\Omega, \mathcal{F})$ together with a collection of probability measures indexed by $\Theta$, i.e., $\mathscr{P} = \{ P_{\theta} \, | \, \theta \in \Theta \}$).

Assume that you have a collection of random variables $X_1, ..., X_n \overset{iid}{\sim} U(0, \theta)$, for every $\theta \in \Theta$.

With this information, a natural question one could ask is "what is the Maximum Likelihood Estimator (MLE) of $\theta$?" To answer it, first we set up the likelihood function, which in this case will be $\mathcal{L} : \Theta \times \mathbb{R}^n \rightarrow [0, \infty)$, defined as \begin{equation*} \mathcal{L}(\theta, x_1, ..., x_n) = \left[ \prod_{i = 1}^n \boldsymbol{1}_{[0, \, \theta]} (x_i) \right] \cdot \cfrac{1}{\theta^n}. \end{equation*}
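Numerically, this likelihood is straightforward to evaluate (a small sketch; the function name and structure are illustrative, not from any library):

```python
import numpy as np

def likelihood(theta, x):
    """Likelihood L(theta, x) for X_1, ..., X_n iid U(0, theta):
    the product of indicators 1_{[0, theta]}(x_i), times 1/theta^n."""
    x = np.asarray(x, dtype=float)
    if np.all((0.0 <= x) & (x <= theta)):
        return theta ** (-x.size)
    return 0.0

# The likelihood vanishes as soon as theta drops below max(x_i):
sample = [0.3, 1.2, 0.7]
print(likelihood(2.0, sample))  # 0.125, i.e. 1 / 2^3
print(likelihood(1.0, sample))  # 0.0, since 1.2 > 1.0
```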

Here is where my troubles begin: usually textbooks don't give you any of the specifics about the MLE as a function. A common definition of the MLE goes like this: "For a sample point $\boldsymbol{x}$, let $\hat{\theta}(\boldsymbol{x})$ be a parameter value at which $\mathcal{L}(\theta, \boldsymbol{x})$ attains its maximum as a function of $\theta$ with $\boldsymbol{x}$ held fixed." In my opinion this is uninformative and a bad definition, but it contains a hint of formality: the MLE should be defined for every sample point (in our case the sample space is $\mathbb{R}^n$, since we are operating on the space induced by the random vector $(X_1, \dots, X_n)$, so the domain of the MLE should be $\mathbb{R}^n$).

Other times, authors are more explicit and define the MLE as \begin{equation*} \hat{\theta}_{MLE}(x) = \underset{\theta \, \in \, \Theta}{\operatorname{arg\,max}} \; \mathcal{L}(\theta, x). \end{equation*} Notice that this cannot be an ordinary function $\hat{\theta}_{MLE}: \mathbb{R}^n \rightarrow \Theta$, simply because for some values $(x_1, \dots, x_n) \in \mathbb{R}^n$ the maximization problem either has multiple solutions $\theta \in \Theta$ or has no solution at all. In the problem I described above (finding the MLE of the uniform), both cases actually occur, which leads me to believe that the MLE is actually a correspondence, that is, $\hat{\theta}_{MLE}: \mathbb{R}^n \rightarrow \mathcal{P}(\Theta)$ (where $\mathcal{P}(\Theta)$ is the power set of $\Theta$), and it would be

\begin{equation*} \hat{\theta}_{MLE}(x_1, \dots, x_n) = \begin{cases} \Theta, & \text{if } \exists \, i \in \{ 1, \dots, n \} \; \text{such that} \; x_i < 0, \\[3pt] \emptyset, & \text{if } x_i = 0 \; \text{for all} \; i \in \{ 1, \dots, n \}, \\[3pt] \{ \max \{ x_1, \dots, x_n \} \}, & \text{if } (x_1, \dots, x_n) \in \mathbb{R}^n_{+} \; \text{and} \; \exists \, i \; \text{such that} \; x_i > 0. \end{cases} \end{equation*}
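The three cases can be encoded directly as a set-valued map (again a sketch, with the string `'Theta'` standing in symbolically for the whole parameter space, since $\Theta = (0, \infty)$ is not a finite set):

```python
import numpy as np

def mle_correspondence(x):
    """MLE of theta for U(0, theta), as a set-valued map R^n -> P(Theta).

    Returns the sentinel string 'Theta' when every theta attains the
    maximum, the empty set when no maximizer exists, and the singleton
    {max x_i} otherwise.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        # L is identically 0, so every theta in Theta attains the maximum.
        return 'Theta'
    if np.all(x == 0):
        # L(theta) = 1/theta^n is unbounded as theta -> 0+: no maximizer.
        return set()
    # All x_i >= 0 with at least one positive: unique maximizer max x_i.
    return {float(np.max(x))}

print(mle_correspondence([-1.0, 2.0]))  # Theta
print(mle_correspondence([0.0, 0.0]))   # set()
print(mle_correspondence([0.3, 1.2]))   # {1.2}
```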

My question amounts to this: is this interpretation correct? If yes, how do we take the expectation (or any moment) of this statistic? Do we ignore the set-valued aspect and stick to the function? If not, how is the MLE defined?
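To make the expectation question concrete: if we simply identify $\hat{\theta}_{MLE}$ with the single-valued statistic $M = \max\{X_1, \dots, X_n\}$ (which agrees with the correspondence $P_\theta$-almost surely, since $P_\theta(X_i \leq 0) = 0$ for every $\theta$), the standard computation would be \begin{equation*} F_M(m) = P_\theta(M \leq m) = \left( \frac{m}{\theta} \right)^n, \qquad f_M(m) = \frac{n \, m^{n-1}}{\theta^n}, \qquad 0 \leq m \leq \theta, \end{equation*} so that \begin{equation*} \mathbb{E}_\theta[M] = \int_0^\theta m \, \frac{n \, m^{n-1}}{\theta^n} \, dm = \frac{n}{n+1} \, \theta. \end{equation*} But it is not clear to me whether passing to this almost-sure single-valued version is the formally correct way to deal with the set-valued cases.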