Applying Chebychev's inequality on samples


I came across the following question recently.

A sample of 68 books has a mean cost of \$ $96.01$ and a standard deviation of \$ $3.33$. Using this information and the special cases of Chebychev's rule, at least 51 of the 68 books cost between __ and __.

Now, $51/68 = 0.75$. Using Chebychev's inequality,

$P(| {X} - \mu | \le k\sigma ) > 1 - \dfrac{1}{k^2} = 0.75 \implies k = 2$

I found that at least $75\%$ of the values are within (\$ $89.35$, \$ $102.67$), but does this bound on a probability correspond to a bound on the sample? Is it possible that only 50 sample values turn out to be in this interval?
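
As a quick check of the arithmetic, here is a minimal R sketch. The variable names are illustrative and not part of the original problem; the numbers are the ones quoted above.

```r
n    <- 68      # number of books in the sample
m    <- 51      # number of books the problem asks about
xbar <- 96.01   # sample mean cost
s    <- 3.33    # sample standard deviation

p <- m / n                 # required proportion: 0.75
k <- sqrt(1 / (1 - p))     # solve 1 - 1/k^2 = p for k

lower <- xbar - k * s      # lower end of the k-SD interval
upper <- xbar + k * s      # upper end of the k-SD interval
c(k = k, lower = lower, upper = upper)
```

This gives $k = 2$ and the interval $(89.35, 102.67)$ used above.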

In general, is it possible that fewer than $(1-1/k^2)\times 100\%$ of the sample values are within $k$ sample standard deviations of the sample mean?


There are 2 best solutions below

0
On

Chebyshev's Inequality applies to the (empirical) distribution of a sample as well as to the distribution of a population. In your case, you can consider the costs of the 68 books to form the distribution; the mean and SD are the sample mean and SD.

The Empirical Rule (not a theorem) suggests that in a sample from a (nearly) normal population about 95% of observations will ordinarily lie in the interval $\bar X \pm 2S.$ Chebyshev's Inequality guarantees that at least 75% lie in that interval, whether or not the population from which the sample was taken is close to normal.
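
To make the two figures concrete, the following short R sketch compares the exact normal probability within two standard deviations with Chebyshev's guaranteed minimum for $k=2$. The variable names are illustrative only.

```r
# Share of a normal distribution within 2 SDs of its mean (basis of the Empirical Rule)
normal_2sd <- pnorm(2) - pnorm(-2)

# Chebyshev's guaranteed minimum for k = 2, valid for any distribution
# that has a mean and a standard deviation
cheb_2sd <- 1 - 1 / 2^2

round(c(normal = normal_2sd, chebyshev = cheb_2sd), 4)
```

The normal value is about 0.9545, which is where the "about 95%" comes from, while the Chebyshev value is exactly 0.75.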

Because Chebyshev's Inequality applies to all distributions (and samples) that have means and standard deviations, its bounds are sometimes quite 'loose' so that the actual percentage of observations within the Chebyshev bounds may be considerably larger than the guaranteed percentage (but never smaller). Here are four examples, based on samples of size 100 from four different distributions. (Sampling and computations in R statistical software.)

(1) A sample from $\mathsf{Norm}(\mu=50, \sigma=10).$ Guaranteed 75%; actual 97%.

x = rnorm(100, 50, 10);  a = mean(x); s = sd(x)
a;  s;  mean(x >= a-2*s & x <= a+2*s)
[1] 50.06139   # sample mean
[1] 10.97697   # sample SD
[1] 0.97       # proportion in Chebyshev 2SD interval

(2) A sample from $\mathsf{Unif}(50, 70).$ Guaranteed 75%; actual 100%.

x = runif(100, 50, 70);  a = mean(x); s = sd(x)
a;  s;  mean(x >= a-2*s & x <= a+2*s)
[1] 60.01553
[1] 5.738045
[1] 1

(3) A sample from $\mathsf{Gamma}(\text{shape}=3,\text{rate}=1/2).$ Guaranteed 75%; actual 95%.

x = rgamma(100, 3, 1/2);  a = mean(x); s = sd(x)
a;  s;  mean(x >= a-2*s & x <= a+2*s)
[1] 5.721303
[1] 3.006212
[1] 0.95

(4) A sample from $\mathsf{Pois}(\lambda=5).$ Guaranteed 75%; actual 97%.

x = rpois(100, 5);  a = mean(x); s = sd(x)
a;  s;  mean(x >= a-2*s & x <= a+2*s)
[1] 4.87
[1] 2.227718
[1] 0.97
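
The four single-sample checks above can be repeated many times. The sketch below is only an illustration: `prop_within` is a hypothetical helper introduced here, and the seed and the number of repetitions are arbitrary assumptions. It draws many samples of size 100 from the same four distributions and records the smallest proportion ever seen inside the 2 SD interval.

```r
# Hypothetical helper (not part of the computations above): proportion of x
# lying within k sample SDs of the sample mean.
prop_within <- function(x, k = 2) {
  a <- mean(x)
  s <- sd(x)
  mean(x >= a - k * s & x <= a + k * s)
}

set.seed(2024)  # arbitrary seed, chosen only for reproducibility
n_rep <- 1000   # number of simulated samples per distribution (an assumption)

props <- list(
  normal  = replicate(n_rep, prop_within(rnorm(100, 50, 10))),
  uniform = replicate(n_rep, prop_within(runif(100, 50, 70))),
  gamma   = replicate(n_rep, prop_within(rgamma(100, 3, 1/2))),
  poisson = replicate(n_rep, prop_within(rpois(100, 5)))
)

# Smallest observed proportion for each distribution
sapply(props, min)
```

Whatever the seed, none of these minima can fall below 0.75, because the Chebyshev bound holds for every individual sample, not just on average.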

Addendum: Here is a sample that very closely matches the specifications in your Question: $n=68,\, \bar Y = 96.01,\, S = 3.33.$ I am not sure what @User49582934 means by his/her Answer, but I did not want to leave room for confusion. My example also has over 94% $(64 > 51)$ of its observations in $(89.35,102.67).$ [I have edited brackets [ and ] into the sorted listing to indicate these boundaries. Otherwise, the output is precisely from R.]

sort(y)
 [1]  87.57  89.23 [89.95  90.32  90.83  91.38  91.43  91.72  92.53  92.54
[11]  92.75  92.75  93.14  93.16  93.32  93.48  93.70  93.73  93.82  93.84
[21]  94.07  94.11  94.31  94.61  94.72  94.74  94.81  95.01  95.01  95.26
[31]  95.64  95.85  96.28  96.36  96.40  96.45  96.63  97.04  97.10  97.23
[41]  97.29  97.38  97.39  97.46  97.60  97.63  97.64  97.88  98.02  98.10
[51]  98.17  98.23  98.54  98.86  98.89  98.99  99.06  99.15  99.28  99.36
[61]  99.71 100.43 100.56 100.83 101.60 101.70] 102.78 103.30

length(y); mean(y); sd(y)
[1] 68
[1] 96.00956
[1] 3.330302
mean(y>=89.35 & y<=102.67)
[1] 0.9411765
0
On

No. A sample with this mean and standard deviation cannot have only $50$ books in this range.

Chebychev's inequality extends to finite samples. Let $X$ be the cost of a book selected at random from this sample, so that each of the 68 books is equally likely to be selected.

The mean of $X$ is $96.01$, and its standard deviation is about $3.33$*. By Chebychev's inequality, there is at least a $75\%$ probability that $X$ lies in $(89.35, 102.67)$. Because each of the 68 points has probability $1/68$, this probability bound is exactly why at least 51 books must be in this range.

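* The standard deviation of $X$ is actually slightly less than $3.33$, because the sample standard deviation formula divides by $n-1$ rather than $n$. The interval built from $3.33$ is therefore slightly wider than Chebychev's inequality requires, so at least $75\%$ of the costs still lie in the above interval.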
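
The same conclusion can be reached by direct counting, without any probability language. As a sketch, write $x_1,\dots,x_n$ for the costs and $\bar x$, $S$ for the sample mean and sample standard deviation (taken as exact here, ignoring rounding in the reported values), and let $m$ (a symbol introduced only for this sketch) be the number of observations with $|x_i-\bar x| > kS$. Then

$$(n-1)S^2=\sum_{i=1}^{n}(x_i-\bar x)^2 \;\ge \sum_{i:\,|x_i-\bar x|>kS}(x_i-\bar x)^2 \;\ge\; m\,k^2S^2 \quad\Longrightarrow\quad m \le \frac{n-1}{k^2} < \frac{n}{k^2}.$$

With $n=68$ and $k=2$ this gives $m \le 67/4 = 16.75$, so at most 16 books can fall outside the interval and at least 52 (hence at least 51) lie inside it.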