Find an estimate for $p=P(X\geqq 20)$


I have this problem I'm stuck with. I would really appreciate some help.

"The following is a sample from a normal distribution.

$7.6 , 9.6, 10.4, 10.7, 11.9, 14.1, 14.6, 18.5$

a) Let $X$ have this normal distribution and let $p=P(X\geqq 20)$. If we estimate $p$ by the relative frequency, we just get 0. Suggest another estimate.

b) Let $x$ be the 95th percentile, i.e. a value such that $P(X \leqq x) \geqq 0.95$. Suggest an estimate for $x$."

This seems fairly straightforward. However, I don't get the same answer as my textbook. I'm thinking that since $X\sim N(\mu,\sigma ^2)$ we can estimate $\mu$ by $\bar{X} = 12.175$ and $\sigma ^2$ by $s^2 =\frac{1}{n-1}\sum_{k=1}^{n}(X_k-\bar{X})^2 \approx 11.79$, so $s \approx 3.43$. So far so good.

Now $p = P(X\geqq 20) = 1- P(\frac{X-\bar{X}}{s}\leqq\frac{20-\bar{X}}{s})$ where $\frac{X-\bar{X}}{s} \sim F_{t_6}(0,1)$, i.e. a T-distribution of $n-2 = 8-2 = 6$ degrees of freedom. (This is where I'm uncertain whether this is correct).

This leads to $p = 1-F_{t_6}(2.28)$, which gives the wrong answer according to the book. Part (b) goes wrong in the same way. The main question is how to use the T-distribution correctly. Is the T-distribution even the right choice in this problem?

I would very much appreciate some help because my textbook doesn't explain these types of problems very well.

Cheers!

On BEST ANSWER

Estimating a probability. My guess is that you are supposed to estimate $\mu$ as $\hat\mu = \bar X = 12.175$ and $\sigma$ as $\hat \sigma = 3.434.$ Then use the estimated distribution $X \sim \mathsf{Norm}(\hat \mu, \hat\sigma)$ to estimate $P(X \ge 20) \approx 0.01135.$ In R statistical software (where pnorm is a normal CDF), it looks like this:

x=c(7.6,9.6,10.4,10.7,11.9,14.1,14.6,18.5)
mean(x); sd(x)
[1] 12.175
[1] 3.434177
1 - pnorm(20, mean(x), sd(x))
[1] 0.01134643
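
Written out by hand, the same plug-in calculation standardizes 20 with the two estimates and reads the upper tail from the standard normal CDF. The symbol $\Phi$ for that CDF is my notation here, and the numbers are rounded:

$$\hat p = 1 - \Phi\!\left(\frac{20 - \hat\mu}{\hat\sigma}\right) = 1 - \Phi\!\left(\frac{20 - 12.175}{3.434}\right) \approx 1 - \Phi(2.279) \approx 0.0113.$$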

Here is a look at the quantile method suggested in part (b). According to R, the 95th percentile of your data is about 17.1. (Various texts and software programs have various rules for finding quantiles of small datasets. The 'percentile rule' in your book may give a somewhat different answer--presumably somewhere between 14.6 and 18.5.)

sort(x)
[1]  7.6  9.6 10.4 10.7 11.9 14.1 14.6 18.5
quantile(x, .95)
   95% 
17.135 
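
To see how much the choice of rule can matter for a sample this small, here is a minimal sketch that runs through the nine rules R offers via the `type` argument of `quantile` (type 7 is R's default). I don't know which rule, if any, matches your book; the vector name `q` is just for illustration.

```r
x <- c(7.6, 9.6, 10.4, 10.7, 11.9, 14.1, 14.6, 18.5)

# quantile() accepts type = 1, ..., 9; type = 7 is the default used above
q <- sapply(1:9, function(k) quantile(x, 0.95, type = k, names = FALSE))
names(q) <- paste0("type", 1:9)

round(q, 3)   # one candidate "95th percentile" per rule
range(q)      # spread across the rules
```

Every rule returns a value between the two largest observations, 14.6 and 18.5, so the disagreement among them is a small-sample effect rather than an error in any one of them.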

If we look for the 95th percentile of the estimated normal distribution, $\mathsf{Norm}(\mu=12.175,\sigma=3.434)$ from above, the answer is not a lot different: about 17.82. (Percentiles of continuous distributions are precisely defined, so there is no quibbling here, except for rounding.)

qnorm(.95, mean(x), sd(x))
[1] 17.82372
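
In case it helps to see what `qnorm` does with the extra arguments, the following sketch computes the same percentile by hand as $\hat\mu + z_{0.95}\,\hat\sigma$, where $z_{0.95} \approx 1.645$ is the 95th percentile of the standard normal. The names `z95` and `by_hand` are mine.

```r
x <- c(7.6, 9.6, 10.4, 10.7, 11.9, 14.1, 14.6, 18.5)

z95 <- qnorm(0.95)                 # 95th percentile of Norm(0, 1), about 1.645
by_hand <- mean(x) + z95 * sd(x)   # shift and scale to the estimated distribution
by_hand

# Agrees with the direct call up to floating-point error
all.equal(by_hand, qnorm(0.95, mean(x), sd(x)))
```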

Having shown you this much, the following may be beside the point. But it was reasonable for you to think of using a t distribution for something in part (a), and I want to finish that part of the story. Read as much as interests you.

Also, I have to say that using only $n = 8$ observations in this way to 'find' $P(X \ge 20)$ from a normal distribution with unknown $\mu$ and $\sigma$ is simply not a workable idea in practice. As you will see from the confidence intervals (CIs) below, there is a lot of room for random error in estimating $\mu$ and $\sigma$ from only $n=8$ observations.

Estimating the population mean: The t distribution would be appropriate if you were finding a confidence interval based on $\bar X$ for the normal population mean $\mu$ or testing a null hypothesis about $\mu.$

Specifically, the t.test procedure in R uses Student's t distribution with $n - 1 = 8 - 1 = 7$ degrees of freedom to find the 95% CI $(9.30, 15.05).$

t.test(x)

        One Sample t-test

data:  x
t = 10.027, df = 7, p-value = 2.101e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.303956 15.046044
sample estimates:
mean of x 
   12.175 

The CI does not contain 0, and the null hypothesis $H_0: \mu = 0$ is overwhelmingly rejected because of the tiny p-value.
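
If you want to see where the t distribution enters, this sketch rebuilds the same interval from the formula $\bar X \pm t^* S/\sqrt{n},$ where $t^*$ cuts 2.5% from the upper tail of Student's t distribution with 7 degrees of freedom. The variable names are mine.

```r
x <- c(7.6, 9.6, 10.4, 10.7, 11.9, 14.1, 14.6, 18.5)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # estimated standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # upper 2.5% point of t with 7 df
ci     <- mean(x) + c(-1, 1) * t_crit * se
ci

# Same endpoints as reported by t.test
all.equal(ci, as.numeric(t.test(x)$conf.int))
```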

Estimating the population variance and SD: You could also get a 95% CI for $\sigma^2,$ using the chi-squared distribution: $(n-1)S^2/\sigma^2 \sim \mathsf{Chisq}(7).$ A 95% CI for $\sigma^2$ is $(7S^2/U,\, 7S^2/L),$ where $L$ and $U$ cut 2.5% of the area from the lower and upper tails, respectively, of $\mathsf{Chisq}(7).$ Then take square roots of the endpoints to get a 95% CI for $\sigma,$ which is $(2.27, 6.99).$

sqrt(7*var(x)/qchisq(c(.975,.025), 7))
[1] 2.270589 6.989485
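
For the record, the interval comes from rearranging the probability statement for the chi-squared pivot, and the numerical values below are rounded:

$$0.95 = P\!\left(L \le \frac{(n-1)S^2}{\sigma^2} \le U\right) = P\!\left(\frac{(n-1)S^2}{U} \le \sigma^2 \le \frac{(n-1)S^2}{L}\right).$$

Here $n - 1 = 7,$ $7S^2 \approx 82.55,$ $L \approx 1.690,$ and $U \approx 16.01,$ so the interval for $\sigma^2$ is about $(5.16, 48.85),$ and taking square roots gives $(2.27, 6.99)$ for $\sigma.$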

The following session from Minitab software (somewhat edited for relevance) gives essentially the same CI for $\sigma:$

MTB > OneVariance 'x';
SUBC>   Confidence 95.0.

Test and CI for One Variance: x 

Method

The chi-square method is only for the normal distribution.

Statistics

Variable  N  StDev  Variance
x         8   3.43      11.8

95% Confidence Interval

                         CI for        CI for
Variable  Method          StDev       Variance
x         Chi-Square  (2.27, 6.99)  (5.2, 48.9)

Neither Student's t distribution nor the chi-squared distribution is useful for finding $P(X \ge 20).$ Each is relevant for making a confidence interval for one of the parameters $\mu$ and $\sigma.$


Addendum: Everything works better for large samples. Consider the distribution $\mathsf{Norm}(\mu = 12, \sigma = 3.5),$ for which $P(Y \ge 20) = 0.0111.$

1 - pnorm(20, 12, 3.5)
[1] 0.01113549

If I generate a random sample of size $n = 800$ from this distribution, the proportion of observations that are 20 or greater is 0.02, not far from 0.0111.

y = rnorm(800, 12, 3.5)
mean(y >= 20)
[1] 0.02

If I pretend I don't know $\mu$ and $\sigma,$ I get respective estimates $\bar Y = 12.22$ and $S = 3.53,$ which are reasonably close to the truth.

mean(y);  sd(y)
[1] 12.22147
[1] 3.525766

The estimated normal distribution gives $P(Y \ge 20) \approx 0.014.$

1 - pnorm(20, mean(y), sd(y))
[1] 0.01368513

The 95th percentile of the true distribution is $17.76.$

qnorm(.95, 12, 3.5)
[1] 17.75699

The 95th percentile of the estimated distribution is $18.02.$

qnorm(.95, mean(y), sd(y))
[1] 18.02084

The 95th percentile of the points sampled from the true distribution is $18.15.$

quantile(y, .95)
     95% 
18.15345 

A 95% t confidence interval for $\mu$ is $(11.98, 12.47)$ [no output shown, but trust me on this], which closely brackets the true mean $\mu = 12$ of the distribution that produced the data.
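
For anyone who would rather not take that on trust, here is a sketch of how such an interval is obtained. It uses a fresh simulated sample with an arbitrary seed of my choosing, not the sample used in this addendum, so the endpoints will differ somewhat from the ones quoted.

```r
set.seed(2024)                # arbitrary seed, only so the sketch is reproducible
y <- rnorm(800, 12, 3.5)      # a new sample, NOT the one analyzed in the addendum

ci <- t.test(y)$conf.int      # t.test gives a 95% CI for the mean by default
ci
```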

And finally, the density function of $\mathsf{Norm}(\mu = 12, \sigma = 3.5)$ is pretty well matched by a histogram of the $n = 800$ sampled observations.

[Figure: histogram of the $n = 800$ sampled observations with the normal density curve overlaid.]
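
A figure of this kind can be drawn along the following lines. This is only a sketch: it uses a fresh sample from the same `rnorm(800, 12, 3.5)` call with an arbitrary seed, and the color and title are my choices, so it will not reproduce the original picture exactly.

```r
set.seed(2024)                 # arbitrary seed, only for reproducibility
y <- rnorm(800, 12, 3.5)       # a new sample of size 800

# freq = FALSE puts the histogram on the density scale (total bar area is 1),
# so that the normal density curve can be overlaid on the same axes
h <- hist(y, freq = FALSE, col = "skyblue2",
          main = "n = 800 observations from Norm(12, 3.5)")
curve(dnorm(x, 12, 3.5), add = TRUE, lwd = 2)
```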

None of these matches has been perfect, but I hope you can see that the ideas of your problem work a lot better for sample size $n = 800$ than for $n = 8.$
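
One rough way to quantify that difference is to repeat the whole plug-in calculation many times and look at how much the estimate of $P(Y \ge 20)$ moves from sample to sample. The sketch below assumes the true distribution is $\mathsf{Norm}(12, 3.5)$ as in this addendum; the helper name `plug_in`, the seed, and the 2000 replications are my own arbitrary choices.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Plug-in estimate of P(Y >= 20) from one simulated sample of size n
plug_in <- function(n, mu = 12, sigma = 3.5) {
  y <- rnorm(n, mu, sigma)
  1 - pnorm(20, mean(y), sd(y))
}

p8   <- replicate(2000, plug_in(8))     # estimates based on n = 8
p800 <- replicate(2000, plug_in(800))   # estimates based on n = 800

# Middle 95% of the estimates for each sample size
quantile(p8,   c(0.025, 0.5, 0.975))
quantile(p800, c(0.025, 0.5, 0.975))
```

The estimates from samples of size 8 are spread over a much wider range than those from samples of size 800, which is the practical objection to using $n = 8$ for a tail probability like this one.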

Note: I could probably have shown an example with better matches if I had generated a dozen samples of size $n = 800$ and picked the 'best' one. But what you see here is the first sample that came up. Also, I guess it's obvious that an example with $n=80,000$ would have worked better yet.