Does the confidence interval that maximizes the parameter estimate have nominal coverage?


Assume there are two confidence intervals $C_1$ and $C_2$ for parameters $\theta_1$ and $\theta_2$, with lower and upper bounds $[l_1,u_1]$ and $[l_2,u_2]$, respectively. Assume both intervals have nominal coverage $1-\alpha$.

The parameter estimates are $\hat{\theta}_1$ and $\hat{\theta}_2$. Now I always (in replications of the experiment) choose the interval with the higher estimate. Does this interval still have nominal coverage?

Formally define $$g:=\begin{cases} 1 \quad \text{if} \ \hat{\theta}_1 > \hat{\theta}_2 \\ 2 \quad \text{if} \ \hat{\theta}_1 \le \hat{\theta}_2 \end{cases}$$

with interval $C_g=[l_g,u_g]$. What is $P(\theta_g \in [l_g,u_g])$?
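One way to write the target probability is to split it over the selection event:

$$P(\theta_g \in C_g) = P(\theta_1 \in C_1, \ \hat{\theta}_1 > \hat{\theta}_2) + P(\theta_2 \in C_2, \ \hat{\theta}_1 \le \hat{\theta}_2).$$

Each term couples the coverage event of one interval with the selection event, so the answer hinges on how strongly selection and coverage are dependent.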

Remark 1: Nominal coverage seems to hold for two intervals but to fail for generalizations to more than two. I am looking for a proof of this result, or at least a starting point.

Remark 2: I would also be interested to know whether this problem is discussed anywhere in the literature, and under which keyword / topic.

Remark 3: The problem can be understood in two ways. (1) The one-population case, where data $X_n$ are sampled from a single population and two different parameters are estimated on $X_n$ (e.g. mean and median; see BruceET's answer). (2) The two-population case, where $X_{n_1}$ and $X_{n_2}$ are sampled from two populations and $\theta_1$ and $\theta_2$ are the same type of parameter (e.g. both means, of potentially different value; this was my original intention with this question).

Edit: Based on BruceET's answer I programmed a simulation. It shows close-to-nominal coverage for two-sided intervals. This simulation concerns scenario (1) of remark 3.

n         = 15
reps      = 10^5
true_mean = 100  # equals the true median for a normal distribution

set.seed(12345)
# Preallocate coverage indicators instead of growing vectors in the loop
cover_m = cover_med = cover_g = logical(reps)
for (i in 1:reps) {
  x   = rnorm(n, true_mean, 10)
  m   = mean(x)
  med = median(x)
  g   = med > m                      # TRUE: median is the larger estimate
  CI_m   = t.test(x)$conf.int[1:2]
  CI_med = wilcox.test(x, conf.int = TRUE)$conf.int[1:2]
  CI_g   = if (g) CI_med else CI_m   # CI belonging to the larger estimate
  cover_m[i]   = CI_m[1]   <= true_mean & CI_m[2]   >= true_mean
  cover_med[i] = CI_med[1] <= true_mean & CI_med[2] >= true_mean
  cover_g[i]   = CI_g[1]   <= true_mean & CI_g[2]   >= true_mean
}

colMeans(cbind(cover_m, cover_med, cover_g))

  cover_m cover_med   cover_g 
  0.95278   0.95097   0.95197
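For the two-population reading of remark 3, and for remark 1's generalization to more than two intervals, the analogous check can be run with $k$ independent samples sharing the same true mean, always reporting the t interval of the group with the largest sample mean. A sketch in Python (my translation, not part of the original R session; any coverage shortfall here is a pure selection effect):

```python
import numpy as np
from scipy import stats

def coverage_of_selected(k, n=15, mu=100.0, sigma=10.0, reps=20_000, seed=12345):
    """Coverage of the two-sided 95% t interval of the group whose
    sample mean is largest, out of k groups with the same true mean."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(reps, k, n))
    means = x.mean(axis=2)                       # (reps, k) sample means
    sds = x.std(axis=2, ddof=1)                  # (reps, k) sample SDs
    g = means.argmax(axis=1)                     # index of the largest mean
    r = np.arange(reps)
    half = stats.t.ppf(0.975, n - 1) * sds[r, g] / np.sqrt(n)
    lo, hi = means[r, g] - half, means[r, g] + half
    return np.mean((lo <= mu) & (mu <= hi))

for k in (2, 3, 5):
    print(k, coverage_of_selected(k))
```

Comparing the printed coverages for $k=2$ against $k>2$ is a direct empirical probe of remark 1.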

Best answer (BruceET):

This is not a proof of what I said in my comment, but perhaps it gives some insight into what might go wrong when picking the most favorable estimate and its corresponding CI.

By trial and error using R statistical software, I found a sample of size $n=15$ from $\mathsf{Norm}(\mu=100, \sigma=10)$ for which a one-sided t CI incorrectly excludes $\mu = 100,$ while a one-sided Wilcoxon CI is correct. This is a case in which the sample mean is a little larger than the sample median, so (according to your suggested method) I would have chosen the incorrect t interval. [The same kind of error can happen with the more common two-sided confidence intervals, but it is easier to give an example with one-sided CIs.]

I believe choosing the larger of the two estimates will yield non-covering CIs more often than sticking with t CIs alone or with Wilcoxon CIs alone.

x = rnorm(15, 100, 10)
wilcox.test(x, conf.int = TRUE, alternative = "greater")$conf.int[1:2]
t.test(x, alternative = "greater")$conf.int[1:2]
[1] 99.6146     Inf      # Wilcoxon CI correct
[1] 100.4227    Inf      # t CI incorrect
mean(x); median(x)
[1] 103.9438             # mean larger than median ...
[1] 102.9209             #   ... so choose t CI
sort(round(x,2))
 [1]  93.94  94.40  94.55  96.36  97.15 101.69 102.08 102.92 104.06
[10] 109.13 109.39 109.96 111.00 115.79 116.74

I haven't seen this particular issue discussed in a published paper. Presumably a large-scale simulation could settle the issue for sure.
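The conjecture above can be probed without waiting for a published treatment. BruceET's example uses the mean and median of a single sample; as a simpler proxy (my substitution, not his setup), here is the two-population analogue with one-sided 95% lower t bounds, checking whether always reporting the interval of the sample with the larger mean misses more often than either interval on its own:

```python
import numpy as np
from scipy import stats

def onesided_miss_rates(n=15, mu=100.0, sigma=10.0, reps=20_000, seed=1):
    """Miss rates (true mean below the lower bound) of one-sided 95%
    t intervals: each sample's interval alone vs. the interval of
    whichever of the two samples has the larger mean."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(reps, 2, n))     # two independent samples
    means = x.mean(axis=2)
    sds = x.std(axis=2, ddof=1)
    lower = means - stats.t.ppf(0.95, n - 1) * sds / np.sqrt(n)
    miss_each = (lower > mu).mean(axis=0)            # ~0.05 for each alone
    g = means.argmax(axis=1)                         # always pick larger mean
    miss_sel = (lower[np.arange(reps), g] > mu).mean()
    return miss_each, miss_sel

print(onesided_miss_rates())
```

A miss is more likely exactly when a sample mean is unusually high, which is also when that sample tends to be selected, so the selected interval's miss rate should exceed the individual rates.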