In hypothesis testing, the definition of p value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
My question is why the "at least as extreme" part? Why is it not enough to consider only the probability of obtaining the test result?
For example:
A hypothesis test on the fairness of a coin.
H0: P(Heads) = 0.5
HA: P(Heads) > 0.5
We carry out a test on the coin with the result being 8 heads out of 10 coin flips.
The p-value is P(8 Heads|H0 is True) + P(9 Heads|H0 is True) + P(10 Heads|H0 is True).
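For concreteness, that tail sum can be computed exactly from the Binomial(10, 0.5) pmf (a minimal Python sketch; the helper name `binom_pmf` is just for illustration):

```python
from math import comb

def binom_pmf(k, n=10, p=0.5):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One-sided p-value: P(X >= 8) = P(8) + P(9) + P(10) under H0: p = 0.5
p_value = sum(binom_pmf(k) for k in range(8, 11))
print(p_value)  # 0.0546875, i.e. 56/1024
```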
My question is: why is the p-value not just P(8 Heads|H0 is True)? Why care about the probability of 9 heads and 10 heads when the test gave exactly 8 heads?
This is explained in lots of different places, but let's recap the idea.
The first thing is to understand the word "extreme" here. In hypothesis testing, we define a certain critical region for the test statistic, such that when the test statistic falls into that critical region, we reject the null hypothesis.
That means there is a certain "direction" for the test statistic: the closer it gets to one extreme, the more it favors the alternative hypothesis - i.e., the stronger the evidence for the alternative.
In your example, by a likelihood ratio argument (or simply by intuition), a larger test statistic favors the alternative. With a fixed sample size, choosing the actual boundary/cutoff of the critical region is always a trade-off between the Type-I and Type-II errors. Usually we use the significance level of the test to control the Type-I error.
Back to your example: if you observe an $8$ and decide to reject the null, then you should also reject the null when you observe more extreme test statistics - larger values like $9$ or $10$ - since they favor the alternative even more strongly. Duality plays a role here: imagine the critical region is defined as $\{8, 9, 10\}$ and compute the Type-I error of this test - it is exactly the p-value.
Instead of fixing the critical region in advance, one can specify the significance level of the test and compare it with the p-value. Either way, we are controlling the Type-I error.
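The equivalence of the two decision rules can be checked directly: at a given significance level $\alpha$ (here $\alpha = 0.10$, an illustrative choice), "reject when the statistic is at or beyond the cutoff" and "reject when the p-value is at most $\alpha$" give the same decision for every possible observation. A sketch:

```python
from math import comb

def pmf(k, n=10, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(x, n=10):
    # One-sided p-value: P(X >= x) under H0
    return sum(pmf(k, n) for k in range(x, n + 1))

alpha = 0.10
# Rule 1: fixed critical region - smallest cutoff c with P(X >= c | H0) <= alpha
cutoff = next(c for c in range(11) if p_value(c) <= alpha)
# Rule 2: reject whenever the p-value is <= alpha
for x in range(11):
    assert (x >= cutoff) == (p_value(x) <= alpha)
print(cutoff)  # 8
```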
The bottom line: the critical region of this test should not be $\{8\}$ alone - it makes no sense to reject when the test statistic is $8$ but fail to reject when it is $9$ or $10$. Such a test is always sub-optimal compared to the one with critical region $\{8, 9, 10\}$.
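The sub-optimality is easy to see by comparing power under an alternative (the value $p = 0.7$ below is just an illustrative choice): rejecting only on $\{8\}$ throws away the probability mass at $9$ and $10$, so the test with region $\{8, 9, 10\}$ rejects strictly more often when the alternative is true.

```python
from math import comb

def pmf(k, p, n=10):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(region, p, n=10):
    # Probability of rejecting H0 when the true heads probability is p
    return sum(pmf(k, p, n) for k in region)

p_alt = 0.7  # illustrative alternative: a biased coin
print(power({8}, p_alt))          # rejects only on exactly 8 heads
print(power({8, 9, 10}, p_alt))   # rejects on 8 or more heads - strictly larger
```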