Choosing samples that satisfy the Central Limit Theorem


My informal understanding of the CLT is this: if we draw a number of samples from a population, the distribution of the sample means approaches a normal distribution as the size of each sample grows. I have trouble applying this in practice, in particular how to choose the samples and how to define the random variables so that the distribution is normal.
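To make that informal statement concrete, here is a minimal simulation sketch using only the Python standard library. The Exp(1) population, the sample sizes, and the helper names `sample_means` and `skewness` are assumptions chosen for illustration, not anything from the book. It draws many samples of size $n$ and reports the skewness of their means, which should shrink toward 0 (the value for a normal distribution) as $n$ grows.

```python
import random
import statistics


def sample_means(n, num_samples, rng):
    """Draw num_samples samples of size n from Exp(1); return the list of sample means."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]


def skewness(values):
    """Population skewness; 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (2, 10, 50):
    means = sample_means(n, 5000, rng)
    print(f"sample size n = {n:2d}: skewness of the sample means = {skewness(means):.3f}")
```

Note that what grows here is the size $n$ of each sample; the 5000 repetitions only serve to make the shape of the distribution of the mean visible.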

For example, I encountered this problem in the book Introduction to Probability and Statistics for Engineers and Scientists by Sheldon M. Ross.

[Image: problem statement from the book, not reproduced here]

The solution applies the two-sample t-test with $H_0: \mu_p \le \mu_c$ and $H_1: \mu_p > \mu_c$, but does not specify why the samples satisfy the normality assumption that the two-sample t-test requires. The sample sizes used to calculate the degrees of freedom are 12 and 10. So if the CLT is to be applied in this case, there is only one sample from each population, with sizes 12 and 10 respectively. Is that right?
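For reference, here is a minimal sketch of how the test statistic and degrees of freedom could be computed from one sample of size 12 and one of size 10. It assumes the pooled (equal-variance) form of the two-sample t-test, which I have not confirmed is the one the book's solution uses; under that assumption the degrees of freedom are $12 + 10 - 2 = 20$. The function name `pooled_t` and the measurements are hypothetical and are not the data from the book.

```python
import math
import statistics


def pooled_t(x, y):
    """Return (t, df) for the equal-variance two-sample t statistic of mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance, weighted by each sample's degrees of freedom.
    sp2 = ((n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)) / df
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical measurements, NOT the data from the book problem.
p = [7.2, 6.8, 7.9, 7.1, 6.5, 7.4, 8.0, 6.9, 7.3, 7.6, 6.7, 7.0]  # n = 12
c = [6.4, 6.9, 6.1, 7.0, 6.6, 6.3, 6.8, 6.2, 6.7, 6.5]            # n = 10

t, df = pooled_t(p, c)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

For the one-sided alternative $H_1: \mu_p > \mu_c$, the null hypothesis would be rejected when $t$ exceeds the upper critical value of the t distribution with `df` degrees of freedom.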

So is it true that with just 1, 2 or 3 samples, each of a large size ($n > 30$), we can apply the CLT and the sample means will follow a normal distribution? I tested this in R with 3 samples of size 52 each using the Anderson-Darling test, but the test requires at least 7 samples to run.

Sorry for the long post, but I need to add one more specific example that I am working on to clarify my problem. I have sales data for 2 brands by period, by store and by week. I can think of the following ways to define my random variables:

(a) Grouping the sales data by store (there are 3 stores), I would have 3 samples, each of size 52. So would my random variables according to the CLT definition be $X_1, X_2, \dots, X_{52}$, where $X_i$ is the sales in week $i$?

(b) Grouping the sales data by week (there are 52 weeks), I would have 52 samples, each of size 3. My random variables would be $X_1, X_2, X_3$, where $X_i$ is the sales at store $i$.

(c) Grouping the sales data by period (there are 13 periods), I would have 13 samples, each of size 12. My random variables would be $X_1, X_2, \dots, X_{12}$, where $X_i$ is the sales in one week at one store (4 weeks * 3 stores).

(d) Pooling the sales at each store in each week, I would have 1 sample for each population (2 samples for the 2 populations, brand 1 and brand 2), each of size 157. My random variables would be $X_1, X_2, \dots, X_{157}$, where $X_i$ is the sales in one week at one store in one period.
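To compare the four groupings side by side, here is a small sketch that builds synthetic records for one brand and counts how many samples each scheme produces and how large each sample is. The record layout `(store, week, period, sales)`, the made-up sales values, and the helper `group_sizes` are assumptions for illustration only. With this synthetic layout the pooled sample has 3 × 52 = 156 observations, which differs from the 157 in my actual data.

```python
import random
from collections import defaultdict

rng = random.Random(0)

# Synthetic records for one brand: (store, week, period, sales). All values are made up.
# Weeks 1..52 are split into 13 periods of 4 weeks each.
records = [
    (store, week, (week - 1) // 4 + 1, rng.gauss(100, 15))
    for store in range(1, 4)
    for week in range(1, 53)
]


def group_sizes(records, key):
    """Group records by key(record) and return {group label: number of observations}."""
    counts = defaultdict(int)
    for rec in records:
        counts[key(rec)] += 1
    return dict(counts)


schemes = {
    "(a) by store": lambda r: r[0],
    "(b) by week": lambda r: r[1],
    "(c) by period": lambda r: r[2],
    "(d) pooled": lambda r: "all",
}

for name, key in schemes.items():
    sizes = group_sizes(records, key)
    print(f"{name}: {len(sizes)} sample(s), sizes {sorted(set(sizes.values()))}")
```

The sketch only counts observations; it says nothing about whether the observations within a group are independent or identically distributed, which is the part I am unsure about.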

To summarize this long post, my questions are:

  1. Does the number of samples affect the application of the CLT in any way? (The AD test in R requires at least 7 samples to run.)

  2. How can we show that the random variables are i.i.d., each with the same mean $\mu$ and standard deviation $\sigma$?

  3. Which of the 4 methods (a), (b), (c), (d) above is the best way to define my samples so that they satisfy the normality assumption under the CLT and I can use the two-sample t-test to compare the 2 population means?

Thank you so much for taking the time to read my question and share your insight!