Statistic of choice?


I am not good at maths, so please don't mind if this is silly. Suppose we are given the transport mode choices of a population, that is, the share of people who use each mode:

  • bike - 1%
  • car - 45%
  • walk - 54%

This is the representation of the population of 20,000 people.

Now suppose I want to translate these choices to 20 people. Will the proportions be the same (1%, 45%, 54%)? And how do I check whether that is right if there are 20 new people each time, for 100 iterations?

Best answer

Sampling. Suppose that the true proportions in categories B, C, W of a population are $0.01, 0.45, 0.54,$ respectively.

Then if you take a huge random sample of size $n = 20,000$ from the population, you might get the counts below. (Sampling and computations in R.)

set.seed(810)
huge = sample(1:3, 20000, rep=T, c(.01,.45,.54))
table(huge)
huge
    1     2     3 
  198  9066 10736 

The corresponding proportions are $0.0099, 0.4533, 0.5368,$ which are very close to the population proportions $0.01, 0.45, 0.54.$ (The discrepancies are sampling error, but here they are about as small as rounding errors.)

table(huge)/20000
huge
     1      2      3 
0.0099 0.4533 0.5368 

However, if I take a tiny sample of only size 20, then I will not get proportions so close to the true population proportions.

set.seed(811)
tiny = sample(1:3, 20, rep=T, c(.01,.45,.54))
table(tiny)
tiny
 1  2  3 
 1  7 12 
table(tiny)/20
tiny
   1    2    3 
0.05 0.35 0.60 
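
To see how much samples of 20 vary, one can repeat the sampling 100 times, as the question suggests. The following is a minimal sketch and not part of the computations above; the seed 2024 and the names `counts`, `props`, `avg`, and `rng` are arbitrary choices made for illustration. It uses `rmultinom` to draw the category counts for all 100 samples at once.

```r
set.seed(2024)                          # arbitrary seed, only for reproducibility
p <- c(B = 0.01, C = 0.45, W = 0.54)    # population proportions from the question

# Each column is one sample of 20 people; rows hold the counts of B, C, W.
counts <- rmultinom(100, size = 20, prob = p)
rownames(counts) <- names(p)

props <- counts / 20                    # sample proportions in each iteration
avg   <- rowMeans(props)                # average proportion over the 100 iterations
rng   <- apply(props, 1, range)         # smallest and largest proportion per mode

round(avg, 3)
rng
```

Averaged over many iterations the sample proportions settle near the population values, but any single sample of 20 can be far off. In particular, with a 1% share the chance that a sample of 20 contains no cyclist at all is $0.99^{20} \approx 0.82.$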

Testing. By contrast, a question can arise in research about the validity of a hypothetical population proportion, perhaps arising from theory about human behavior or from a supposition that behavior has not changed since the last large survey was done ten years ago.

From whatever source, suppose our null hypothesis is that the population proportions are $0.01, 0.45, 0.54.$ I take a moderate-sized random sample of size $n = 200$ and get counts $5, 100, 95.$

The proportions don't agree exactly with the hypothesis. The question is whether the disagreement is sufficiently large to reject the null hypothesis as untrue, or whether random sampling error can account for the discrepancy.

Observed and expected counts. I will compare my observed counts with the counts expected under the null hypothesis. I get the expected counts by multiplying the sample size 200 by the hypothetical population proportions. (I happen to get integers here, but expected counts should not be rounded to integers if they're not integers.)

       B   C   W   Tot
Obs    5 100  95   200
Exp    2  90 108   200 

Test statistic. In a chi-squared test, the chi-squared statistic is $$Q = \sum_{i=1}^K \frac{(X_i - E_i)^2}{E_i},$$ where $K$ is the number of categories, $X_i$ are the observed counts and $E_i$ are the corresponding expected counts. For our data $Q = 7.18.$

X = c(5, 100, 95)
E = 200 * c(.01, .45, .54)
Q = sum((X-E)^2/E);  Q
[1] 7.175926
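
It can help to look at the individual terms of the sum to see which category drives the result. This is a small sketch using the same counts; the category names and the vector name `contrib` are labels added here for illustration.

```r
X <- c(B = 5, C = 100, W = 95)               # observed counts
E <- 200 * c(B = 0.01, C = 0.45, W = 0.54)   # expected counts under the null

contrib <- (X - E)^2 / E    # one term of the sum per category
round(contrib, 3)
sum(contrib)                # the chi-squared statistic Q
```

The bike category, with 5 observed against 2 expected, contributes $4.5$ of the total, more than the other two categories combined.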

Distribution of test statistic. Provided that all of the $E_i > 5$ and the null hypothesis is true, we have approximately $Q \sim \mathsf{Chisq}(2),$ the chi-squared distribution with $K-1 = 2$ degrees of freedom.

Critical value. The critical value $c = 5.991$ for a test at the 5% level is the value that cuts 5% of the probability from the upper tail of this distribution. [You can find this value in printed tables of chi-squared distributions, or by using software, as below.]

qchisq(.95, 2)
[1] 5.991465

Because we have $Q = 7.18 > 5.99,$ we reject the null hypothesis. We say that the counts we observed are not consistent with the null hypothesis.

P-value. Another way to test the null hypothesis is to get the P-value. It is the probability, assuming the null hypothesis is true, of a result at least as extreme as the one observed. Specifically, it is $P(Q \ge 7.18),$ computed using $Q \sim \mathsf{Chisq}(2).$ For our test, the P-value is $0.0276 < 0.05 = 5\%,$ so we can use the P-value to reject the null hypothesis.

1 - pchisq(7.18, 2)
[1] 0.02759833
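
For the special case of 2 degrees of freedom the upper tail has a simple closed form, $P(Q \ge q) = e^{-q/2},$ which gives a quick check on the software. The sketch below compares it with `pchisq` using the unrounded statistic; the names `p_exact` and `p_pchisq` are chosen here for illustration.

```r
Q <- 7.175926                                       # unrounded statistic from above
p_exact  <- exp(-Q / 2)                             # closed form, valid only for df = 2
p_pchisq <- pchisq(Q, df = 2, lower.tail = FALSE)   # general-purpose upper tail
c(p_exact, p_pchisq)
```

Using `lower.tail = FALSE` is equivalent to `1 - pchisq(...)` but avoids loss of precision when the P-value is very small.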

You usually can't get exact P-values from printed chi-squared tables. But you may be able to see from tables that the P-value is between 0.01 and 0.05. Statistical software usually gives a P-value as part of the output from the test procedure.

The plot below shows the density function of $\mathsf{Chisq}(2).$ The vertical red dotted line shows the critical value, and the vertical black line shows the value of the test statistic. The area under the density curve to the right of the red line is 5%; the area to the right of the black line is the P-value.

[Figure: density of $\mathsf{Chisq}(2)$ with the critical value (red dotted line) and the test statistic (black line) marked.]
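
A figure of this kind can be redrawn with base R graphics. This is only a sketch of one way to do it, not necessarily the code behind the original figure; the plotting range of 0 to 12 and the axis labels are arbitrary choices.

```r
Q    <- 7.175926            # test statistic from above
crit <- qchisq(0.95, 2)     # 5% critical value, about 5.991

# Density of Chisq(2) on an arbitrary plotting range of 0 to 12
curve(dchisq(x, df = 2), from = 0, to = 12,
      xlab = "q", ylab = "density", main = "Density of Chisq(2)")
abline(v = crit, col = "red", lty = "dotted")   # critical value
abline(v = Q, col = "black")                    # observed test statistic
```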

Chi-squared test in R. Below is output from the procedure chisq.test in R. (It differs slightly from results above because of differences in rounding.)

chisq.test(X, p=c(.01,.45,.54))

        Chi-squared test for given probabilities

data:  X

X-squared = 7.1759, df = 2, p-value = 0.02765

There is a warning message that the P-value may not be exactly correct. One of our expected counts is $2,$ which is not $> 5,$ so $Q$ might not have exactly the distribution $\mathsf{Chisq}(2).$ Many textbooks say it is OK if most of the $E_i > 5$ and all $E_i > 3.$ So we should have used a larger sample.
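
The expected-count condition can be checked before collecting data. The sketch below uses a small helper named `expected`, a hypothetical name introduced here, and a few illustrative sample sizes. Since the smallest hypothesized proportion is $0.01,$ the smallest expected count is $0.01\,n,$ so it exceeds 3 only for $n > 300$ and exceeds 5 only for $n > 500.$

```r
p <- c(B = 0.01, C = 0.45, W = 0.54)   # hypothesized proportions

expected <- function(n) n * p          # expected counts for sample size n

expected(200)                          # the bike cell is only 2
min(expected(200)) > 5                 # FALSE: the usual condition fails

# Smallest expected count for a few candidate sample sizes (illustrative choices)
sapply(c(200, 300, 600, 1000), function(n) min(expected(n)))
```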

In such cases, chisq.test in R can simulate the P-value. Abbreviated output is shown below:

chisq.test(X, p=c(.01,.45,.54), simulate=T)$p.val
[1] 0.03448276
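
The idea behind a simulated P-value can be sketched by hand: draw many samples of size 200 from the hypothesized proportions, compute the statistic for each, and see how often it is at least as large as the observed one. This is an illustration of the principle, not a claim about how chisq.test is implemented internally; the seed 812, the choice of 2000 simulated samples, and the names `Q_obs`, `sims`, `Q_sim`, and `p_sim` are assumptions made here.

```r
set.seed(812)                               # arbitrary seed for reproducibility
X <- c(5, 100, 95)                          # observed counts
p <- c(0.01, 0.45, 0.54)                    # null proportions
n <- sum(X)
E <- n * p                                  # expected counts
Q_obs <- sum((X - E)^2 / E)                 # observed statistic, about 7.18

B <- 2000                                   # number of simulated samples (assumed)
sims  <- rmultinom(B, size = n, prob = p)   # each column: counts drawn under the null
Q_sim <- colSums((sims - E)^2 / E)          # statistic for each simulated sample

# Share of simulated statistics at least as large as the observed one,
# with a +1 in numerator and denominator so the estimate is never exactly zero.
p_sim <- (1 + sum(Q_sim >= Q_obs)) / (B + 1)
p_sim
```

The result changes a little from run to run, but with a few thousand simulated samples it should land near the simulated value reported above and stay below 5%.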

There seems no doubt we can reject at the 5% level.