I want to calculate the pooled z-test in python, for two-samples. I generated 2D Pareto distributed data, I calculated their probabilities px and py. Now I need to calculate the p-pooled and the z test.
That would be the formula:
z = (px - py) / sqrt(p-value * ( 1 - p-value)((1 / nx) + (1 / ny)))
My python code:
z = (px - py) / np.sqrt(p_value * (1-p_value) * ((1 / nx) + (1 / ny)))
I already calculated px and py. But when I calculate my p-value I get a number which is over 1, which is not possible if I understand that right, it should be between 0 and 1.
p-value = (nxpx + nypy) / (nx + ny)
That is the formula I used in order to calculate the p-pooled. nx and ny are the lenghts of my 2d arrays which I needed for calculating px and py.
My python code:
p_pooled = (nx*px + ny*py) / (nx + ny)
My test data:
nx = 39
ny = 36
px = 2.1167384708788747e-31
py = 4.9852596177390274e-29
p_pooled = 2.4039316565633032e-29
z_test = -4.3805809079192486e-14
The z-test should be between 0 and 1.
What am I doing wrong? Is there a better way for calculating the z-test and p-value in python? I found the library statsmodels but it only accepts 1D arrays, and I have 2D data-sets.
I get the probabilities calculating the kernel density with gaussian and then calculating the exp of the KDE score (because it gives me back a log result).
kde = KernelDensity(kernel='gaussian', bandwidth=0.98).fit(data)
px = np.exp(kde.score(data))
Thank you!
I think you're confusing the $p$-value of the test, and the sample proportions of your test statistic. You use "p-value" interchangeably and this is not correct.
From the description of your test statistic, it seems the type of data you have are observed proportions or frequencies from two groups, and you are interested in testing the hypothesis that the true proportions from these two groups are unequal. That is to say, your data are binomially distributed, with observed sample sizes and proportions $$(n_x, \hat p_x), (n_y, \hat p_y),$$ or equivalently observed sample sizes and frequencies $$(n_x, x), (n_y, y)$$ with $\hat p_x = x/n_x$ and $\hat p_y = y/n_y$; and your hypothesis is $$H_0 : p_x = p_y \quad \text{vs.} \quad H_a : p_x \ne p_y$$ where $p_x$ and $p_y$ are the true proportions from the two groups. Your choice of test statistic, under the assumption of the null hypothesis, is $$Z \mid H_0 = \frac{\hat p_x - \hat p_y}{\sqrt{\hat p_{\text{pooled}}(1-\hat p_{\text{pooled}}) (n_x^{-1} + n_y^{-1})}} \sim \operatorname{Normal}(0,1)$$ where $$\hat p_{\text{pooled}} = \frac{n_x \hat p_x + n_y \hat p_y}{n_x + n_y} = \frac{x + y}{n_x + n_y}$$ is the pooled sample proportion. Under $H_0$, this test statistic is asymptotically normally distributed with mean $0$ and variance $1$. Consequently, if the test is performed at a significance level of $\alpha$, we can either compare $Z$ against the critical value corresponding to a level $\alpha$ test, or we can compute a $p$-value for the test statistic: $$p \text{-value} = 2(1-\Phi(|Z|)),$$ where $\Phi$ is the cumulative distribution function for the standard normal distribution.
For example, suppose our data is $$(n_x, x) = (22, 14), \\ (n_y, y) = (35, 19).$$ Then $$\hat p_x = 14/22 = 0.636363, \quad \hat p_y = 19/35 = 0.542857,$$ and our test statistic is $$Z = \frac{\frac{14}{22} - \frac{19}{35}}{\sqrt{\frac{11}{19} \cdot \frac{8}{19} (\frac{1}{22} + \frac{1}{35})}} \approx 0.696085.$$ Then the $p$-value is $$p \text{-value} = 2(1 - \Phi(|0.696085|)) = 2(1 - 0.756812) \approx 0.486376.$$ If the test were conducted at a significance level of $\alpha = 0.05$, we would fail to reject $H_0$ in favor of $H_a$, meaning the data do not provide sufficient evidence to suggest the true group means are unequal. In Python, the function $\Phi$ is included in the library
scipy.statsasnorm.cdf(x).