I have the following definition of the likelihood principle:
Def: (Likelihood Principle). The information brought by an observation $x$ about $\theta$ is entirely contained in the likelihood function $L(\theta|x)$. Moreover, if $x$ and $x'$ are two observations depending on the same parameter (possibly in different experiments), such that there exists a constant $c$ satisfying $L(\theta|x) =cL'(\theta|x')$ for every $\theta$, they bring the same information about $\theta$ and must lead to identical inferences.
In my book further explanation is given:
The likelihood principle says this: the likelihood function, long known to be a minimal sufficient statistic, is much more than merely a sufficient statistic, for given the likelihood function in which an experiment has resulted, everything else about the experiment (what its plan was, what different data might have resulted from it, the conditional distributions of statistics under given parameter values, and so on) is irrelevant.
Consider the following exercise:
Exercise: Consider an experiment with outcomes $\{1,2,3\}$ and probability mass functions $f(\cdot|\theta)$, $\theta\in\{0,1\}$ given by $$\begin{array}{|c|c|c|} \hline x & 1 & 2 & 3\\ \hline f(\cdot|0) & 0.9 & 0.05 & 0.05 \\ \hline f(\cdot|1) & 0.1 & 0.05 & 0.85 \\ \hline \end{array}$$
Show that the procedure that rejects the hypothesis $H_0:\theta = 0$ in favor of $H_1:\theta =1$ when $X\in\{2,3\}$ has probability $0.9$ of being correct (both under $H_0$ and under $H_1$). What is the implication of the likelihood principle?
What I've tried: I was able to show that the procedure has probability $0.9$ of being correct (both under $H_0$ and under $H_1$). However, I'm not sure what the likelihood principle implies when $X = 2$. The likelihood principle says that, given the likelihood, everything else is irrelevant. I don't really understand what "irrelevant" even means in this context. Shouldn't the likelihood principle imply something in general, that is, for whatever value of $X$?
Question: What is the implication of the likelihood principle when $X = 2$?
Thanks in advance!
I think the question is trying to show the somewhat combative relationship between classical statistics and the likelihood principle, and how assuming "all of the information is in the likelihood" can give less-than-optimal results.
By constructing the hypothesis test as specified, you are conditioning on unobserved, and potentially never-to-be-observed, data, whereas the likelihood principle says that all the information from the experiment is contained in the likelihood function as a sufficient statistic of the data. If you saw $X=2$, then the situations $X=1$ and $X=3$ only "exist in the abstract."
You can see, however, that conditioning on the unobserved data allows you to construct a test with desirable properties. The specified test, by putting $X=2$ in the rejection region, is a UMP test for $\alpha = .1$. If you put $X=2$ in the acceptance region instead, you would have a UMP test for $\alpha = .05$.
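As a quick sanity check, here is a minimal Python sketch (the function name `size_and_power` is my own, not from the question) that computes the type I error and power of both rejection regions from the exercise's table:

```python
# pmfs from the exercise's table
f0 = {1: 0.90, 2: 0.05, 3: 0.05}  # f(.|theta=0)
f1 = {1: 0.10, 2: 0.05, 3: 0.85}  # f(.|theta=1)

def size_and_power(rejection_region):
    """Type I error under H0 and power under H1 for a given rejection region."""
    alpha = sum(f0[x] for x in rejection_region)  # P_0(reject H0)
    power = sum(f1[x] for x in rejection_region)  # P_1(reject H0)
    return alpha, power

# Test in the question (X=2 in the rejection region): alpha = .1, power = .9
test1 = size_and_power({2, 3})
# X=2 in the acceptance region instead: alpha = .05, power = .85
test2 = size_and_power({3})
```

Both pairs match the UMP tests described above: $(\alpha, \text{power}) = (.1, .9)$ and $(.05, .85)$.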
The key point is this: if you just considered that you saw $X=2$, then under the likelihood principle you wouldn't have any information to distinguish between the two parameter values, since $f(2|0) = f(2|1) = 0.05$ and the likelihood is constant in $\theta$. However, if you take into account the full distribution, including data you didn't see, and the degree of type one error you are willing to accept ($\alpha = .05$ or $.1$), then the testing framework gives you an informed strategy for what decision to make when $X=2$.
Contrast the previous two tests, one with significance .05 and power .85 and the other with significance .1 and power .9, with the corresponding performance criteria for deciding based on the likelihood function.
We have that the type one error of the maximum likelihood decision is $P_0[f(x|0)<f(x|1)] = .05$, and that the "power" is $P_1[f(x|1)>f(x|0)]=.85$. You could also do: $P_0[f(x|0)>f(x|1)] = .9$ and $P_1[f(x|1)<f(x|0)]=.1$. It's important to note that doing these calculations on the likelihood is "using the unobserved data" in a way that is contrary to the likelihood principle; however, you aren't making decisions with it, just quantifying the performance.
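These four probabilities can be checked directly from the table; a small sketch (variable names are mine, for illustration) summing the relevant outcomes under each parameter value:

```python
# pmfs from the exercise's table
f0 = {1: 0.90, 2: 0.05, 3: 0.05}  # f(.|theta=0)
f1 = {1: 0.10, 2: 0.05, 3: 0.85}  # f(.|theta=1)

# P_0[f(x|0) < f(x|1)]: type one error of the ML decision (only x=3 qualifies)
p0_ml_error = sum(p for x, p in f0.items() if f0[x] < f1[x])   # = .05
# P_1[f(x|1) > f(x|0)]: "power" of the ML decision (only x=3 qualifies)
p1_ml_power = sum(p for x, p in f1.items() if f1[x] > f0[x])   # = .85
# P_0[f(x|0) > f(x|1)]: only x=1 qualifies
p0_favors_0 = sum(p for x, p in f0.items() if f0[x] > f1[x])   # = .9
# P_1[f(x|1) < f(x|0)]: only x=1 qualifies
p1_favors_0 = sum(p for x, p in f1.items() if f1[x] < f0[x])   # = .1
```

Note that $x=2$ contributes to none of these sums, since the two likelihood values tie there; that tie is exactly the case the question asks about.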
So you can see that the testing framework does at least as well as, or better than, the likelihood approach in terms of correctly determining the parameter: the test in the question classifies correctly with probability .9 under both parameter values, while the maximum likelihood rule is correct with probability .9 and .85, respectively.
Edit: You could always construct a test from the likelihood by choosing the acceptance region to be $\{x \mid f(x|0)>f(x|1)\}$ and specifying a deterministic or random decision rule for the case $f(x|0)=f(x|1)$. This just ends up giving you the original test, and it does not follow the likelihood principle (in the randomized case because you need the additional information of the random event; in the deterministic case because something outside the likelihood dictates what the decision is).
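A sketch of this construction (the function `decide` and its `tie_break` parameter are hypothetical names of mine): compare the two likelihood values and resolve the tie at $x=2$ by something outside the likelihood, either a fixed convention or a coin flip.

```python
import random

# pmfs from the exercise's table
f0 = {1: 0.90, 2: 0.05, 3: 0.05}  # f(.|theta=0)
f1 = {1: 0.10, 2: 0.05, 3: 0.85}  # f(.|theta=1)

def decide(x, tie_break="reject"):
    """Accept H0 where f(x|0) > f(x|1), reject where f(x|0) < f(x|1).

    The tie f(x|0) == f(x|1) (here, x=2) is resolved by tie_break,
    i.e. by information outside the likelihood function.
    """
    if f0[x] > f1[x]:
        return "accept H0"
    if f0[x] < f1[x]:
        return "reject H0"
    if tie_break == "random":
        return random.choice(["accept H0", "reject H0"])
    return "reject H0" if tie_break == "reject" else "accept H0"
```

With `tie_break="reject"` this reproduces the original test (rejection region $\{2,3\}$); with `tie_break="accept"` it reproduces the $\alpha=.05$ test.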