Maximum likelihood estimator formula


Let $\mathbf{X}$ be an i.i.d. sample from the parametric family of distributions $\mathcal{P} = \{P_\theta: \theta \in \Theta \subset \mathbb{R}^k\}$, and let $\mathbf{x} \in \mathbb{R}^n$ be a realization of $\mathbf{X}$. There is a widely known formula for the ML estimate: $$\hat\theta(\mathbf{x}) = \underset{\theta \in \Theta}{\operatorname{argmax}} L(\mathbf{x}; \theta)$$ where $\mathbf{x}$ is fixed and does not depend on $\theta$.

But could we use the following formula $$\hat\theta(\mathbf{X}) = \underset{\theta \in \Theta}{\operatorname{argmax}} L(\mathbf{X}; \theta) $$ to find the ML estimator $\hat\theta(\mathbf{X})$? I am concerned by the fact that the random vector $\mathbf{X} = (X_1, \ldots, X_n)$ implicitly depends on $\theta$, because $X_i \sim P_\theta$, whereas in the first formula the vector $\mathbf{x}$ does not depend on $\theta$.
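For concreteness (this example is illustrative and not part of the original question), suppose the $X_i$ are i.i.d. $N(\theta, 1)$ with $\theta \in \Theta = \mathbb{R}$. Then

$$L(\mathbf{x}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \theta)^2}{2}\right), \qquad \hat\theta(\mathbf{x}) = \underset{\theta \in \mathbb{R}}{\operatorname{argmax}}\, L(\mathbf{x}; \theta) = \bar{x},$$

and substituting the random vector for the fixed data gives $\hat\theta(\mathbf{X}) = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, a random variable whose distribution depends on $\theta$ even though the formula $\bar{X}$ itself makes no reference to $\theta$.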

Accepted answer:

Remember that a random vector is defined as a mapping from the sample space to Euclidean space: your random vector $\mathbf{X}: \Omega \rightarrow \mathbb{R}^n$ maps each outcome $\omega \in \Omega$ to a real vector $\mathbf{x} \in \mathbb{R}^n$. This object does not depend on $\theta$, but its probability distribution does. (This is an important distinction: it means that the random vector is well-defined independently of the parameter $\theta$. It is merely a real-vector description of each outcome in the sample space.)

The equation for the MLE takes an argument $\mathbf{x} \in \mathbb{R}^n$, which is a data vector, and maps it to an output $\hat{\theta}(\mathbf{x}) \in \Theta$, which is the corresponding estimate formed by maximising the likelihood.$^\dagger$ The maximum-likelihood estimator $\hat{\theta}(\mathbf{X})$ is obtained by substituting the random vector $\mathbf{X}$ into the argument of the MLE function, which implicitly forms a random variable by function composition (a function of a random variable, which is itself a function). Putting these together, we get the composition $\hat{\theta}(\mathbf{X}) = \hat{\theta} \circ \mathbf{X}$, which means that we now have a mapping (showing the intermediate set):

$$\hat{\theta}(\mathbf{X}): \Omega \rightarrow \mathbb{R}^n \rightarrow \Theta.$$

This means that the maximum-likelihood estimator $\hat{\theta}(\mathbf{X})$ is a mapping from the sample space to the parameter space, which means that it is a random variable on the parameter space. This random variable also does not depend on $\theta$ in a functional sense, since it is a mapping from the sample space. As with the observed data, its probability distribution depends on $\theta$, through the distribution of the data vector.
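The distinction above can be seen numerically. The sketch below (a minimal illustration, assuming the $N(\theta, 1)$ model, where the MLE of $\theta$ is the sample mean) evaluates $\hat\theta(\mathbf{x})$ on one fixed dataset, yielding a single number, and then simulates many realizations of $\hat\theta(\mathbf{X})$, whose spread reflects its distribution under $P_\theta$:

```python
import random
import statistics

def mle_normal_mean(x):
    # For i.i.d. N(theta, 1) data, maximising the likelihood over theta
    # gives the sample mean as the MLE.
    return statistics.fmean(x)

random.seed(0)
theta_true = 2.0  # the "unknown" parameter, fixed for the simulation
n = 50

# theta_hat(x): applied to one observed dataset x, it is just a number.
x = [random.gauss(theta_true, 1.0) for _ in range(n)]
estimate = mle_normal_mean(x)

# theta_hat(X): across repeated samples it is a random variable; its
# realizations vary, and their distribution depends on theta only through
# the distribution of the data.
estimates = [
    mle_normal_mean([random.gauss(theta_true, 1.0) for _ in range(n)])
    for _ in range(2000)
]
print(estimate)                    # one fixed estimate
print(statistics.fmean(estimates)) # average of many realizations, near theta_true
```

Note that `mle_normal_mean` never takes $\theta$ as an input: the estimator is a fixed function of the data, and $\theta$ enters only through the sampling distribution.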


$^\dagger$ Even this involves a slight abuse of terminology, since $\arg \max$ technically refers to the set of maximising values in the optimisation. We assume that there is a unique maximising parameter for any data input in the problem at hand, and we take the $\arg \max$ to be this unique parameter rather than the singleton set containing it. This is a common abuse of notation when working with $\arg \max$ definitions.