sampling distribution question


Need clarification on a binomial sample example:

We drew a sample of size $100$ from a Binomial($m = 2$, $p = 0.2$) distribution and observed $76$ of the $x_i = 0$, $20$ of the $x_i = 1$, and $4$ of the $x_i = 2$.

Now as $n \to \infty$ the empirical distribution will look more and more like the Binomial$(2, 0.2)$ it was drawn from (why?).

I am guessing this is because of the Law of Large Numbers (though I am not sure whether the WLLN or the SLLN applies). Will the empirical distribution's mean and variance approach the Binomial distribution's mean and variance?

My class notes give the following explanation:

For example, if $X_i ∼ Binom(2, 0.2)$ then for $n = 100$,

$$ P \bigg( \frac{1}{n} \sum_{i=1}^{n} 1_{[X_i=0]} \geq 0.76 \bigg) \simeq 0.007 $$ but for $n = 1000$ $$ P \bigg( \frac{1}{n} \sum_{i=1}^{n} 1_{[X_i=0]} \geq 0.76 \bigg) \simeq 2.3\times(10^{-16}) $$

I understand the LHS inside the probability: it is the total proportion of $x_i = 0$. But I don't get why the RHS is $0.76$; shouldn't it be $0.64$ (the probability that $x_i = 0$, from the R code dbinom(0, 2, 0.2))?

There is 1 best solution below.

On BEST ANSWER

In the initial experiment with $n=100$ realizations of $X \sim Binom(2, .2)$ you got $X = 0$ a total of 76 times, which is larger than the expected $100 \times 0.64 = 64$ occurrences.

 dbinom(0, 2, .2)
 ## 0.64

This is a somewhat surprising outcome because the probability of getting 76 or more 0's is only about 0.007, as your first displayed equation shows. (The count of 0's among the 100 draws is itself Binomial$(100, 0.64)$, so a binomial tail computation verifies this claim.)

 1 - pbinom(75, 100, .64)
 ## 0.007013119

The question is whether experiments with ever larger numbers $n$ of realizations of $X \sim Binom(2, .2)$ will continue to produce such excessive proportions of 0's. So we check what we could expect for $n = 1000$ realizations. The answer is that this degree of departure from what we expected is then very unlikely indeed, as claimed in the second displayed equation.

 1 - pbinom(759, 1000, .64)
 ## 2.220446e-16
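As a cross-check, the same two binomial tails can be computed in Python. (This sketch assumes scipy is available; `scipy.stats.binom` is not part of the original answer, which uses R's pbinom.)

```python
# Cross-check of the two tail probabilities from the answer,
# using scipy's binomial distribution instead of R's pbinom.
from scipy.stats import binom

# P(X = 0) for a single Binom(2, 0.2) draw: 0.8^2 = 0.64
p0 = binom.pmf(0, 2, 0.2)

# P(at least 76 zeros in 100 draws); sf(k, n, p) = P(count > k)
tail_100 = binom.sf(75, 100, p0)

# Same relative departure at n = 1000 (at least 760 zeros)
tail_1000 = binom.sf(759, 1000, p0)

print(p0, tail_100, tail_1000)
```

The survival function `sf` computes the upper tail directly, which sidesteps the floating-point cancellation that `1 - pbinom(...)` can suffer when the tail is near machine precision.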

The WLLN prohibits this kind of 'bad behavior' from persisting as $n \rightarrow \infty.$
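Concretely, writing $\hat p_n = \frac{1}{n} \sum_{i=1}^{n} 1_{[X_i=0]}$ for the running proportion of 0's, the WLLN states that

$$ P \big( \left| \hat p_n - 0.64 \right| \geq \varepsilon \big) \to 0 \quad \text{as } n \to \infty, \quad \text{for every } \varepsilon > 0, $$

so outcomes like $\hat p_n \geq 0.76$ (a departure of $\varepsilon = 0.12$) become vanishingly unlikely, exactly as the two tail probabilities above illustrate.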

Perhaps a picture will help show the WLLN in action. For $n = 1, 2, \dots, 10{,}000$ the boundary between the light and dark blue regions shows the running proportion of 0's (ending very near 0.64); the boundary between light blue and white shows the running proportion of 0's or 1's (ending very near 0.96).

[Figure: running proportions of 0's (blue) and of 0's or 1's (light blue), converging to 0.64 and 0.96.]

The R code below (inelegant, but simple and correct) shows how the plot was made.

 n = 10^4;  x = numeric(n)
 for(i in 1:n) { x[i] = rbinom(1, 2, .2) }  # n draws from Binom(2, .2)
 trace.0 = cumsum(x==0)/(1:n)               # running proportion of 0's
 trace.01 = cumsum(x<=1)/(1:n)              # running proportion of 0's or 1's
 plot(trace.01, type="h", ylim=c(0,1), col="skyblue", ylab="Running Avg.") 
 lines(trace.0, type="h", col="blue")
 abline(h = .64, col="red");  abline(h = .96, col="red")  # limiting proportions