Rigorous criteria for the choice of an area of rejection in statistical hypothesis tests


I have problems understanding rigorously how p-values can be used to reject a null hypothesis. I get the basic idea of p-values: they are based on a reductio-ad-absurdum argument, and I have no problem performing statistical tests. My problem, however, is deeper. In my opinion, one can always construct such a reductio-ad-absurdum argument, even if the null hypothesis is definitely true, by adapting the area of rejection.

To make my problem clearer, let's look at an example where we have a test statistic $X$ and a null hypothesis $H_0$. Assume that under the null hypothesis $X$ follows a normal distribution with $\mu = 0$ and $\sigma = 1$, and that the observed value of $X$ is $\tilde{x}$. For a significance level $\alpha \in (0,1)$ the area of rejection of the null hypothesis is usually defined as $$ A_\alpha := \{ x \in \mathbb{R} \ \vert \ \vert x \vert > c_\alpha \} $$ with $c_\alpha$ such that $P( X \in A_\alpha) = \alpha$. If we fix $\alpha$, then we reject $H_0$ if and only if $\tilde{x} \in A_\alpha$. The logic behind this is that under $H_0$ the observed value $\tilde{x}$ is highly improbable and therefore "close to impossible", and via a reductio-ad-absurdum argument we conclude that $H_0$ cannot be true.
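As a concrete sketch of this setup (the observed value $\tilde{x} = 2.3$ is a hypothetical choice, not from the question), the cutoff $c_\alpha$ for the two-sided region can be computed from the normal quantile function:

```python
# Two-sided rejection region A_alpha = { x : |x| > c_alpha } under H0: X ~ N(0, 1).
from scipy.stats import norm

alpha = 0.05
c_alpha = norm.ppf(1 - alpha / 2)   # P(|X| > c_alpha) = alpha under H0

x_obs = 2.3                          # hypothetical observed value x-tilde
reject = abs(x_obs) > c_alpha        # reject H0 iff x_obs lands in A_alpha
print(round(c_alpha, 4))             # ~1.96
print(reject)                        # True: 2.3 > 1.96
```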

But now my question:

Why would I choose this specific $A_\alpha$? What is keeping me from choosing a completely different area of rejection $B_\alpha$ with $P( X \in B_\alpha) = \alpha$? I could, for example, choose an area around the mean: $$B_\alpha = \{ x \in \mathbb{R} \ \vert \ \vert x \vert < d_\alpha \}$$ with $d_\alpha$ such that $P( X \in B_\alpha) = \alpha$. Then suddenly my observation $\tilde{x}$ of $X$ might not be so absurd/improbable anymore, and I might not be able to reject my null hypothesis.
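The alternative region can be made just as concrete (again with the hypothetical $\tilde{x} = 2.3$): the central region $B_\alpha$ has the same probability $\alpha$ under $H_0$, yet the same observation no longer leads to rejection.

```python
# "Central" rejection region B_alpha = { x : |x| < d_alpha } under H0: X ~ N(0, 1).
from scipy.stats import norm

alpha = 0.05
# P(|X| < d_alpha) = 2*Phi(d_alpha) - 1 = alpha  =>  Phi(d_alpha) = (1 + alpha)/2
d_alpha = norm.ppf(0.5 + alpha / 2)

x_obs = 2.3                          # same hypothetical observation as before
reject_B = abs(x_obs) < d_alpha
print(round(d_alpha, 4))             # a narrow band around 0
print(reject_B)                      # False: x_obs does not fall in B_alpha
```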

It is obviously counterintuitive to choose an area of rejection around the mean of a normal distribution. But I am not satisfied with the mere fact that it is counterintuitive. It all comes down to one question:

What are the rigorous criteria for choosing a reasonable area of rejection $A_\alpha$ besides $P( X \in A_\alpha) = \alpha$?

If $P( X \in A_\alpha) = \alpha$ is accepted as the only criterion for the choice of an area of rejection, bad things happen. In this case, for example, I can always construct an interval of rejection $I_\alpha$ around my observed value $\tilde{x}$ to suddenly make the observation "improbable" and be able to reject $H_0$. This is obviously nonsense. As an additional reasonable criterion to solve this problem, one could demand that the area of rejection be chosen independently of the observed value of $X$. But then I am still allowed to choose strange rejection intervals around the mean of the normal distribution of $X$.
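To see that such a data-dependent interval always exists, one can solve numerically for a half-width $\varepsilon$ so that $I_\alpha = (\tilde{x} - \varepsilon, \tilde{x} + \varepsilon)$ carries exactly probability $\alpha$ under $H_0$ (the value $\tilde{x} = 0.1$ below is hypothetical, deliberately chosen to be a very "typical" observation):

```python
# For any observed value x_obs, find eps so that the interval
# I = (x_obs - eps, x_obs + eps) has probability exactly alpha under
# H0: X ~ N(0, 1). Such a region always contains x_obs, so it always "rejects".
from scipy.stats import norm
from scipy.optimize import brentq

alpha = 0.05
x_obs = 0.1                            # hypothetical, perfectly ordinary value

mass_gap = lambda eps: norm.cdf(x_obs + eps) - norm.cdf(x_obs - eps) - alpha
eps = brentq(mass_gap, 1e-9, 10.0)     # root: interval mass equals alpha
print(round(eps, 4))
```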

I hope I could make my problem clear. Thanks in advance for any advice!


There are 2 best solutions below


I'm a bit rusty on some aspects of this or I'd have answered sooner. For now, I'll recommend three topics:

The Neyman–Pearson lemma applies only to situations in which the null and alternative hypotheses are both "simple" rather than "compound", i.e. for each of those hypotheses there is only one probability distribution of the data that is consistent with it. But one can use it along with some other assumptions to find a uniformly most powerful test in some situations involving compound hypotheses.
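For a concrete instance of the lemma (an illustration assumed here, not taken from the answer): with simple hypotheses $H_0: X \sim N(0,1)$ versus $H_1: X \sim N(1,1)$, the likelihood ratio $f_1(x)/f_0(x) = e^{x - 1/2}$ is increasing in $x$, so the most powerful level-$\alpha$ test rejects precisely in the upper tail.

```python
# Neyman-Pearson illustration for simple H0: N(0,1) vs H1: N(1,1).
from scipy.stats import norm

def likelihood_ratio(x):
    # f1(x)/f0(x) = exp(x - 1/2) for unit-variance normals with means 1 and 0
    return norm.pdf(x, loc=1) / norm.pdf(x, loc=0)

xs = [-2, -1, 0, 1, 2]
ratios = [likelihood_ratio(x) for x in xs]
assert ratios == sorted(ratios)        # ratio is monotone increasing in x

# Hence "reject for large likelihood ratio" = "reject for large x":
alpha = 0.05
c_alpha = norm.ppf(1 - alpha)          # one-sided cutoff, ~1.645
power = 1 - norm.cdf(c_alpha - 1)      # P(X > c_alpha) when X ~ N(1,1)
print(round(power, 3))
```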


> What are the rigorous criteria for choosing a reasonable area of rejection $A_\alpha$ besides $P( X \in A_\alpha) = \alpha$?

The logic of statistical testing is to draw the line between "plausible" and "implausible" realized values (relative to the distribution implied by the null hypothesis being true).

"Plausible/implausible" realized values under the null are inextricably linked to how the distribution of the statistic used for the test allocates probability mass.

So if, say, we construct a statistic that under the null follows a distribution with a lot of mass at the extreme values but little mass in the middle (see, for example, the arcsine distribution), it would follow that a "reasonable" rejection region should include the middle values of the support, not the extreme ones.
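This can be checked numerically for the arcsine distribution, i.e. $\mathrm{Beta}(1/2, 1/2)$ on $[0,1]$: the central fifth of the support carries far less probability mass than the two outer fifths combined, so here the middle is the "rare" region.

```python
# The arcsine distribution Beta(1/2, 1/2) piles its mass near 0 and 1,
# so the low-probability ("implausible") values sit in the MIDDLE of [0, 1].
from scipy.stats import beta

arcsine = beta(0.5, 0.5)
central = arcsine.cdf(0.6) - arcsine.cdf(0.4)          # mass of the middle fifth
outer = arcsine.cdf(0.2) + (1 - arcsine.cdf(0.8))      # mass of the two outer fifths
print(round(central, 3), round(outer, 3))
assert central < outer    # a sensible rejection region here is the middle
```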

As it happens, nearly all (if not all) statistical tests use statistics that have a unimodal distribution with mass concentrated in the middle of the support (normal, Student's $t$, chi-square, and the like). To be consistent with the founding logic of the test, we then place the rejection regions at the extreme values of the support, because that is where the probability mass is low.

Given the above, note that there are no "rigorous criteria" for the exact length of the rejection region or for how much probability mass it should carry. As we widen the (one-sided or two-sided) rejection region, we increase the probability of falsely rejecting a true null ("Type I error"); hence our choice reflects our attitude towards the risk and the consequences of making this mistake (i.e. rejecting a true null).
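The trade-off can be tabulated for the two-sided normal test (the true mean $\mu = 1$ below is an assumed alternative, chosen only for illustration): widening the rejection region raises both the Type I error rate and the power against that alternative.

```python
# Size/power trade-off for the two-sided test of H0: mu = 0 with X ~ N(mu, 1).
from scipy.stats import norm

mu_true = 1.0                                # hypothetical true mean under H1
powers = {}
for alpha in (0.01, 0.05, 0.10):
    c = norm.ppf(1 - alpha / 2)              # two-sided cutoff for this alpha
    # Power: probability that X lands in { |x| > c } when X ~ N(mu_true, 1)
    powers[alpha] = norm.cdf(-c - mu_true) + 1 - norm.cdf(c - mu_true)
    print(alpha, round(c, 3), round(powers[alpha], 3))
```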

The nearly universal use of the $1\%$, $5\%$, and $10\%$ significance levels (probabilities of Type I error) finds its explanation in the history and sociology of science. To make a very long story short, it reflects a "conservative" attitude towards "new findings": if I claim to have discovered "effect A", then scientific ethics require that I take the "no effect" hypothesis as my null; and then, in order to persuade the world that "effect A" nevertheless exists, the data should reject the null, while I accept a "small" probability that the rejection is false.