I am currently teaching an introductory-level statistics course, and I have two quick questions about hypothesis tests. Just to be clear, I have a fairly good background in probability and analysis, but I have not done much higher-level statistics.
My first question: is it possible to perform a hypothesis test like the following?
$H_0: \ \mu \in (-1,1)$
$H_1: \ \mu \not\in(-1,1)$
Suppose we have a sample mean $\overline{x}=2.3,$ a sample standard deviation $s=1.3,$ and a significance level $\alpha=.2.$ Since the null hypothesis is composite (an interval rather than a single point), we can't use the usual approach. I was thinking you could decompose it into the two hypothesis tests below and add the p-values, but I fear this is overly simplistic/optimistic thinking.
$H_0: \ \mu=1\hspace{1in}$ $\hspace{1.31in} H_0: \ \mu =-1$
$H_1: \ \mu>1\hspace{1in}$ and $\hspace{1in} H_1: \ \mu <-1$
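To make the proposed decomposition concrete, here is a quick sketch that computes the two one-sided p-values from the summary statistics above. Note the assumptions: the sample size is not given in the question, so $n=25$ is a made-up value, and I use a normal ($z$) approximation in place of a $t$ test.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Summary statistics from the question.
xbar, s, alpha = 2.3, 1.3, 0.2
n = 25                      # hypothetical sample size -- not given in the question
se = s / sqrt(n)

# Right-tailed test of H0: mu = 1 vs H1: mu > 1 (z approximation).
z_right = (xbar - 1) / se
p_right = 1 - phi(z_right)

# Left-tailed test of H0: mu = -1 vs H1: mu < -1.
z_left = (xbar - (-1)) / se
p_left = phi(z_left)

print(p_right, p_left, p_right + p_left)
```

One thing this sketch already suggests: because $\overline{x}=2.3$ lies far above $-1$, the left-tailed p-value is essentially $1$, so the *sum* of the two p-values is near $1$ and the combined test would fail to reject even though $\overline{x}$ is well outside $(-1,1)$ — i.e., the naive combination needs care.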
I've asked several other people who have stronger backgrounds in statistics, but the answers have been mixed at best.
My second question has to do with determining whether a test is left-, right-, or two-tailed. My boss has us teach the students that the tail is determined by where your "more convincing data is" (a combination of the alternative hypothesis and the test statistic), instead of the more common way of using the inequality in the alternative hypothesis. He does this because he wants the approach to stay consistent when students move on to harder hypothesis tests, e.g. tests for normality. I was curious whether there are any examples using simple test statistics that could illustrate why you need to be careful when determining the tail of a test, or would I have to introduce more complicated material?
I think a good way to look at the first question is to think about the traditional hypothesis test, and how it changes when the null hypothesis is made to cover a very small interval rather than a point. What I mean is the following: suppose we have the hypotheses $$H_{0a} : \mu = 0 \quad \text{vs.} \quad H_{1a} : \mu \ne 0,$$ which is how we would test whether the mean $\mu$ of a random variable $X$, from which IID samples are drawn, is nonzero. Under such a test, we could construct, say, a $100(1-\alpha)\%$ confidence interval, and if this CI does not contain $0$, we would reject $H_{0a}$ at significance level $\alpha$.
Now suppose instead we consider the hypothesis $$H_{0b} : \mu \in (-\epsilon, \epsilon) \quad \text{vs.} \quad H_{1b} : \mu \not \in (-\epsilon, \epsilon),$$ for some very small $\epsilon > 0$. Intuitively, if this hypothesis were tested on the exact same data, we should rightly expect that approximately the same proportion of properly constructed CIs would lead to rejection. But if the rejection criterion is that the CI is not a subset of $(-\epsilon, \epsilon)$, we can see that this is immediately problematic: for if $\epsilon$ is very small, much smaller than the standard error of the data, we would reject $H_{0b}$ with high probability even if $\mu = 0$.
Now, if one were to argue that such tests are not relevant to the case where $\epsilon$ is "large," note that we could always rescale our units of measurement.
Let's look at a simple example. Say I have $n = 16$ observations from a normal distribution with unknown mean and variance, and I want to test hypothesis A. My observations were:
$$\{-0.12361, -1.12843, 0.605598, 0.941108, -0.295106, -0.606082, 0.415842, -0.771654, -1.1774, 0.327114, 0.754497, 0.201877, 1.40783, 0.485589, 0.712437, -0.668724\}.$$
The sample mean is $\bar x = 0.0675554$ and the sample standard deviation is $s = 0.773162$. A $t$-based $95\%$ confidence interval for the mean is $$[-0.344434, 0.479545].$$ We would fail to reject $H_{0a}$ under hypothesis A. Now, if hypothesis B were $H_{0b} : \mu \in (-0.1, 0.1)$, would I reject? The CI is not a subset of this interval, so under the subset criterion I would, even though the same data gave no evidence against $\mu = 0$.
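These numbers are easy to check with stdlib-only Python; the endpoints above correspond to the usual $t$ interval, so the $t_{0.975}$ quantile with $15$ degrees of freedom is hardcoded below to avoid a SciPy dependency:

```python
from math import sqrt
from statistics import mean, stdev

data = [-0.12361, -1.12843, 0.605598, 0.941108, -0.295106, -0.606082,
        0.415842, -0.771654, -1.1774, 0.327114, 0.754497, 0.201877,
        1.40783, 0.485589, 0.712437, -0.668724]

n = len(data)
xbar = mean(data)        # sample mean, about 0.0675554
s = stdev(data)          # sample standard deviation, about 0.773162

t_crit = 2.131449546     # t quantile at 0.975 with df = 15 (hardcoded)
half = t_crit * s / sqrt(n)
lo, hi = xbar - half, xbar + half
print(lo, hi)            # approximately (-0.344434, 0.479545)

# The CI fits inside (-1, 1) but not inside (-0.1, 0.1).
print(-1 < lo and hi < 1)
print(-0.1 < lo and hi < 0.1)
```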
In fact, the data were generated from a standard normal distribution. When I performed $N = 10^5$ simulations, $94924$ of the CIs generated contained $0$, as expected for a nominal $95\%$ interval. But only $92134$ of the CIs were contained in $[-1,1]$, despite the mean being $0$ and the standard deviation being $1$: under the subset criterion, the Type I error is about $7.9\%$, exceeding the nominal $5\%$. It gets even worse: only $8509$ CIs were contained in $[-1/2,1/2]$, and there the Type I error is huge, about $91.5\%$.
What if we apply the decision rule to reject $H_{0b}$ only if the CI lies entirely outside of $(-\epsilon, \epsilon)$? Then for $\epsilon = 1$, none of the simulations produced such a CI (the probability of this event is too small for the number of simulations conducted). For $\epsilon = 1/2$, there were $18$ such CIs; for $\epsilon = 1/4$, $402$; for $\epsilon = 1/20$, $3172$; and for $\epsilon = 1/1000$, $5022$. As $\epsilon$ shrinks, the Type I error increases gradually from $0$ toward $\alpha$, consistent with expectations.
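A sketch of that simulation in stdlib-only Python. Assumptions: fewer replications than the $10^5$ above (for speed), a fixed seed, and the hardcoded $t$ critical value for $15$ degrees of freedom, so the counts will differ slightly from those quoted:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
N, n, t_crit = 20000, 16, 2.131449546   # N is smaller than the post's 10^5

contains_0 = inside_1 = inside_half = outside_milli = 0
for _ in range(N):
    x = [random.gauss(0, 1) for _ in range(n)]
    m, half = mean(x), t_crit * stdev(x) / sqrt(n)
    lo, hi = m - half, m + half
    contains_0 += lo < 0 < hi                      # CI contains the true mean 0
    inside_1 += -1 < lo and hi < 1                 # CI is a subset of (-1, 1)
    inside_half += -0.5 < lo and hi < 0.5          # CI is a subset of (-1/2, 1/2)
    outside_milli += hi < -0.001 or lo > 0.001     # CI entirely outside (-1/1000, 1/1000)

print(contains_0 / N, inside_1 / N, inside_half / N, outside_milli / N)
```

The four proportions land near $0.95$, $0.92$, $0.085$, and $0.05$ respectively, matching the counts above up to simulation noise.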