Hypothesis Testing in Python: One sample T Test without Standard deviation


Update

I solved the question by taking one sample of the weights array, like so:

sample = np.random.permutation(weights)

Then calculate the sample standard deviation

std = np.std(sample, ddof=1)  # ddof=1 gives the sample (n - 1) standard deviation

Finally, plug it all into the t-score equation.

The result is 1.8742476583604653, which is used to find the p-value in a t-distribution table, leading to p ≈ 0.075. Since this exceeds the 0.05 significance level, we fail to reject the null hypothesis that the mean is 72.
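The table lookup can also be done numerically. The following is a minimal sketch, assuming a two-sided test and 49 degrees of freedom (n - 1 for the 50 weights); the variable names are illustrative, and the t score is simply the value quoted above.

```python
from scipy import stats

t_score = 1.8742476583604653  # t score quoted in the update above
df = 50 - 1                   # assumed degrees of freedom: n - 1 for 50 weights
alpha = 0.05                  # significance level from the question

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_score), df)
reject_h0 = p_value < alpha

print(f"p = {p_value:.3f}, reject H0: {reject_h0}")
```

A p-value computed this way need not match a table reading exactly, since printed tables only list a few critical values.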


I am brand new to hypothesis testing and I want to know the correct way to answer this question.

The question asks:

  • If the weights are normally distributed
  • Use a one sample t-test to test hypothesis that the mean is 72

Significance level is 0.05

First, you are given a set of weights

weights = [94.93428306,  82.23471398, 97.95377076, 115.46059713, 80.31693251,  80.31726086, 116.58425631, 
           100.34869458,  75.61051228, 95.85120087, 75.73164614, 75.68540493, 89.83924543,  46.73439511,  
           50.50164335,  73.75424942,  64.74337759,  91.28494665, 66.83951849, 56.75392597, 114.31297538, 
           80.48447399,  86.35056409,  56.50503628, 74.11234551,  66.1092259 ,  53.49006423,  68.75698018,
           58.9936131 ,  62.0830625 ,  58.98293388,  83.52278185, 64.86502775,  54.42289071,  73.22544912,  
           52.7915635 ,67.08863595,  45.40329876,  51.71813951,  66.96861236, 72.3846658 ,  66.71368281,  
           63.84351718,  61.98896304, 50.2147801 ,  57.80155792,  60.39361229,  75.57122226, 68.4361829 , 47.36959845]

Using the Shapiro-Wilk test (from the scipy library) to check whether the weights are normally distributed:

  • H0 = weights are normally distributed
  • HA = weights are not normally distributed

from scipy import stats
shapiro = stats.shapiro(weights)

# ShapiroResult(statistic=0.9404902458190918, pvalue=0.014088480733335018)
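The decision rule for this test can be written out explicitly. The sketch below reruns the same `stats.shapiro` call on the weights and compares the p-value with the 0.05 significance level; the names `alpha` and `normality_rejected` are illustrative only.

```python
from scipy import stats

weights = [94.93428306, 82.23471398, 97.95377076, 115.46059713, 80.31693251, 80.31726086, 116.58425631,
           100.34869458, 75.61051228, 95.85120087, 75.73164614, 75.68540493, 89.83924543, 46.73439511,
           50.50164335, 73.75424942, 64.74337759, 91.28494665, 66.83951849, 56.75392597, 114.31297538,
           80.48447399, 86.35056409, 56.50503628, 74.11234551, 66.1092259, 53.49006423, 68.75698018,
           58.9936131, 62.0830625, 58.98293388, 83.52278185, 64.86502775, 54.42289071, 73.22544912,
           52.7915635, 67.08863595, 45.40329876, 51.71813951, 66.96861236, 72.3846658, 66.71368281,
           63.84351718, 61.98896304, 50.2147801, 57.80155792, 60.39361229, 75.57122226, 68.4361829, 47.36959845]

alpha = 0.05  # significance level from the question

# H0: the weights come from a normal distribution
statistic, p_value = stats.shapiro(weights)

# Reject H0 when the p-value falls below the significance level
normality_rejected = p_value < alpha
print(f"W = {statistic:.4f}, p = {p_value:.4f}, reject normality: {normality_rejected}")
```

A small p-value here is the probability of seeing a W statistic this extreme if the data really were normal; it is not the probability that the data are non-normal.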

So from what I understand, because the p-value (0.014) is below the 0.05 significance level, the null hypothesis that the weights are normally distributed is rejected.

This leads to the second issue. If the weights are not normally distributed, how can you use a one-sample t-test?

My initial thought was to permute the data and then sample it randomly, like so:

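import numpy as np

perm_repl_means = []

for i in range(1000):
    # Shuffle the weights, then split them into two halves
    weights_perm = np.random.permutation(weights)
    sample_a = weights_perm[:len(weights) // 2]
    sample_b = weights_perm[len(weights) // 2:]
    # Record the difference between the two half-sample means
    mean_diff = sample_a.mean() - sample_b.mean()
    perm_repl_means.append(mean_diff)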

Then get the standard deviation:

std = np.std(perm_repl_means)
# Standard deviation = 4.918230395520542

And from there try to use the equation

t = (sample mean - population mean) / (standard deviation / sqrt(sample size))

But I get strange values like -50.41615468880369. I know it's incorrect because taking the mean of the weights array yields 71.9277206544, which is already very close to 72.

Can anyone let me know what I am doing wrong and how to correctly approach this question?

Best answer

I put your data into R and did a Shapiro-Wilk test, with the following results:

shapiro.test(weights)

        Shapiro-Wilk normality test

data:  weights
W = 0.94049, p-value = 0.01409

The P-value $0.014$ indicates that the probability a truly normal sample of size $n=50$ would give such results is small. So it seems prudent not to use the data in procedures that require normal data.

However, it is possible that an inconsequential quirk in your data could account for such a small P-value. Many statisticians prefer to look at a Q-Q plot of the data to judge normality. Data points in a Q-Q plot of normal data should fall roughly along a straight line (except perhaps for relatively few points in the tails).

Here is a Q-Q plot of your weight data made using R. It is not satisfactorily linear.

qqnorm(weights)
qqline(weights, col="blue")

[Figure: normal Q-Q plot of the weights with a reference line]
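Since the question is in Python, a rough equivalent of the Q-Q computation is sketched here using `scipy.stats.probplot`. It only computes the plotted points and the least-squares line; the variable names are illustrative, and the correlation `r` is just one numeric summary of how straight the points are, not a substitute for looking at the plot.

```python
from scipy import stats

weights = [94.93428306, 82.23471398, 97.95377076, 115.46059713, 80.31693251, 80.31726086, 116.58425631,
           100.34869458, 75.61051228, 95.85120087, 75.73164614, 75.68540493, 89.83924543, 46.73439511,
           50.50164335, 73.75424942, 64.74337759, 91.28494665, 66.83951849, 56.75392597, 114.31297538,
           80.48447399, 86.35056409, 56.50503628, 74.11234551, 66.1092259, 53.49006423, 68.75698018,
           58.9936131, 62.0830625, 58.98293388, 83.52278185, 64.86502775, 54.42289071, 73.22544912,
           52.7915635, 67.08863595, 45.40329876, 51.71813951, 66.96861236, 72.3846658, 66.71368281,
           63.84351718, 61.98896304, 50.2147801, 57.80155792, 60.39361229, 75.57122226, 68.4361829, 47.36959845]

# Theoretical normal quantiles paired with the sorted weights,
# plus the least-squares line through those points and its correlation r
(theoretical_q, ordered_weights), (slope, intercept, r) = stats.probplot(weights, dist="norm")

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}, r = {r:.4f}")
```

To draw the plot itself, `probplot` accepts a `plot` argument, for example `plot=plt` after `import matplotlib.pyplot as plt`.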

Even a histogram of the weights shows marked skewness toward the right.

hist(weights, prob=T, col="skyblue2")

[Figure: histogram of the weights]

I don't know what kind of t test you have in mind. There is some controversy about using t tests on non-normal data, even with sample sizes as large as $n = 50.$ I would not trust results of a t test from such a markedly skewed distribution.

The 'rule' in some elementary texts that normality does not matter in t tests with sample sizes above thirty is clearly wrong for strongly skewed data. In particular, the numerator and denominator of the t statistic are not independent for such non-normal data, so the so-called 't statistic' does not have a t distribution.
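For reference, a one-sample t test of the kind under discussion could be run in Python roughly as follows. This is a minimal sketch, assuming a two-sided test of $H_0: \mu = 72$; the names `mu0`, `t_manual`, and `t_scipy` are illustrative, and the caveats above about skewed data still apply to whatever P-value it prints.

```python
import numpy as np
from scipy import stats

weights = np.array([94.93428306, 82.23471398, 97.95377076, 115.46059713, 80.31693251, 80.31726086, 116.58425631,
                    100.34869458, 75.61051228, 95.85120087, 75.73164614, 75.68540493, 89.83924543, 46.73439511,
                    50.50164335, 73.75424942, 64.74337759, 91.28494665, 66.83951849, 56.75392597, 114.31297538,
                    80.48447399, 86.35056409, 56.50503628, 74.11234551, 66.1092259, 53.49006423, 68.75698018,
                    58.9936131, 62.0830625, 58.98293388, 83.52278185, 64.86502775, 54.42289071, 73.22544912,
                    52.7915635, 67.08863595, 45.40329876, 51.71813951, 66.96861236, 72.3846658, 66.71368281,
                    63.84351718, 61.98896304, 50.2147801, 57.80155792, 60.39361229, 75.57122226, 68.4361829, 47.36959845])

mu0 = 72           # hypothesized population mean
n = weights.size   # sample size (50)

# The standard deviation in the t formula is that of the data themselves,
# with the n - 1 divisor (ddof=1)
s = weights.std(ddof=1)
t_manual = (weights.mean() - mu0) / (s / np.sqrt(n))

# The same test via scipy (two-sided by default)
t_scipy, p_value = stats.ttest_1samp(weights, popmean=mu0)

print(f"t (manual) = {t_manual:.4f}, t (scipy) = {t_scipy:.4f}, p = {p_value:.4f}")
```

Note that the denominator uses the standard deviation of the 50 observed weights, not the standard deviation of a set of resampled mean differences.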

Addendum per Comment: A 95% nonparametric bootstrap CI for the mean $\mu$ of population weights is $(67.0,76.7).$

set.seed(2022)
a.obs = mean(weights);  a.obs
[1] 71.92772
d = replicate(2000, mean(sample(weights,50,rep=T))-a.obs)
UL = quantile(d, c(.975,.025))
a.obs - UL
   97.5%     2.5% 
66.97763 76.72769
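A Python version of the same bootstrap might look like the sketch below. It assumes NumPy's `default_rng` generator; the seed is arbitrary and will not reproduce R's random draws, so the endpoints should be close to, but not identical with, the interval above.

```python
import numpy as np

weights = np.array([94.93428306, 82.23471398, 97.95377076, 115.46059713, 80.31693251, 80.31726086, 116.58425631,
                    100.34869458, 75.61051228, 95.85120087, 75.73164614, 75.68540493, 89.83924543, 46.73439511,
                    50.50164335, 73.75424942, 64.74337759, 91.28494665, 66.83951849, 56.75392597, 114.31297538,
                    80.48447399, 86.35056409, 56.50503628, 74.11234551, 66.1092259, 53.49006423, 68.75698018,
                    58.9936131, 62.0830625, 58.98293388, 83.52278185, 64.86502775, 54.42289071, 73.22544912,
                    52.7915635, 67.08863595, 45.40329876, 51.71813951, 66.96861236, 72.3846658, 66.71368281,
                    63.84351718, 61.98896304, 50.2147801, 57.80155792, 60.39361229, 75.57122226, 68.4361829, 47.36959845])

rng = np.random.default_rng(2022)  # arbitrary seed; does not match R's set.seed(2022)
n_boot = 2000

a_obs = weights.mean()  # observed sample mean

# Resample the 50 weights with replacement, n_boot times, and take each mean
boot_means = rng.choice(weights, size=(n_boot, weights.size), replace=True).mean(axis=1)

# Deviations of the resampled means from the observed mean
d = boot_means - a_obs

# Subtract the upper and lower quantiles of d from the observed mean
q_hi, q_lo = np.quantile(d, [0.975, 0.025])
ci_lower, ci_upper = a_obs - q_hi, a_obs - q_lo

print(f"95% bootstrap CI for the mean: ({ci_lower:.2f}, {ci_upper:.2f})")
```

Whether the hypothesized mean of 72 falls inside the resulting interval gives a check that does not rely on the normality assumption.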