I got contradictory results between the p-values of single trials, obtained with the Wilcoxon rank-sum test, and the combined p-value obtained by applying Fisher's method to a simulation of 1000 trials.
I am confused about how to correctly interpret the null hypothesis and the p-values obtained from a simulation. Can you please explain how to interpret the results, and correct any wrong ideas? I do not yet have a solid background in statistics, so I would be grateful for help in grasping the key concepts and in designing an experiment that actually tests what I mean to test.
Consider this experiment:
I have a list of sequences associated with two different contexts:
- context_1 - [[33, 33], [33], [23, 36, 35], ...]
- context_2 - [[356, 122, 45], ... ]
Those numbers represent how common tokens are in a sequence observed for a given context. You may think of a sequence as a sentence that encodes the frequencies of the observed words: e.g. if "apple" occurs 33 times in a vocabulary, then [33, 33] encodes "apple, apple". I am working with animal sounds.
Null hypothesis:
- the sequences follow a similar distribution
Alternative:
- the sequences do not follow a similar distribution
I am using the Wilcoxon rank-sum test. As a control experiment, I compare samples belonging to the same context (say, context_1) and repeat the comparison n times (say, n=1000).
- Example:
rank_sum([33, 33], [33]), rank_sum([33], [23, 36, 35]) , ... - collect the 1000 p_values
I see that the majority of the p-values (97%) are above 0.05. I interpret this as: in the single trials I cannot reject the null hypothesis, so the samples follow the same distribution.
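To make the control experiment concrete, here is a minimal sketch of it. It assumes `scipy.stats.ranksums` as the rank-sum implementation (`rank_sum` above is shorthand) and uses randomly generated toy sequences in place of the real data, so the names `context_1`, `pairs` and `fraction_not_rejected` and all the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical toy data standing in for the sequences of one context.
rng = np.random.default_rng(0)
context_1 = [rng.integers(1, 400, size=rng.integers(5, 15)) for _ in range(50)]

# Draw n random pairs of distinct sequences from the same context.
n = 1000
pairs = [rng.choice(len(context_1), size=2, replace=False) for _ in range(n)]

# One rank-sum test per pair; keep only the p-value of each trial.
p_values = np.array(
    [ranksums(context_1[i], context_1[j]).pvalue for i, j in pairs]
)

# Fraction of single trials that do not reject at the 5% level.
fraction_not_rejected = np.mean(p_values > 0.05)
print(len(p_values), fraction_not_rejected)
```

Note that pairs drawn this way reuse the same sequences, so the p-values in this sketch are not independent of each other.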
Now I want to aggregate the p-values of the simulation, so I use Fisher's method.
The p-value I get from the 1000 "aggregated" p-values is 0.
That is, I should reject the null hypothesis and conclude that the samples do not come from the same distribution.
But in fact they do.
This is confusing:
- Why do I get contradictory results between the majority of the single Wilcoxon trials and the final p-value from Fisher's method?
- Is the null hypothesis I am testing with Fisher equivalent to the one I am testing with Wilcoxon, or is it another thing?
FYI: I am computing the Fisher p-value like this:
import numpy as np
from scipy.stats import chi2

n = 1000  # number of simulated trials
p_values = [ rank_sum([33, 33], [33]), rank_sum([33], [23, 36, 35]) , ... ]  # 1000 p-values
chi2_statistic = -2 * np.sum(np.log(p_values))  # Fisher's statistic
degrees_of_freedom = 2 * n
fisher_p_value = 1 - chi2.cdf(chi2_statistic, degrees_of_freedom)
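As a cross-check of the formula, the sketch below compares the manual computation with SciPy's `combine_pvalues(..., method="fisher")`. The p-values here are hypothetical draws from a uniform distribution, not the ones from my experiment, and `chi2.sf` is used in place of `1 - chi2.cdf` only because it is more precise in the upper tail.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

# Hypothetical p-values, uniform on [0, 1), standing in for the 1000 trials.
rng = np.random.default_rng(0)
p_values = rng.uniform(size=1000)

# Manual computation of Fisher's statistic and its chi-squared p-value.
chi2_statistic = -2 * np.sum(np.log(p_values))
degrees_of_freedom = 2 * len(p_values)
manual_p = chi2.sf(chi2_statistic, degrees_of_freedom)

# SciPy's built-in implementation of Fisher's method.
statistic, scipy_p = combine_pvalues(p_values, method="fisher")

print(chi2_statistic, statistic)
print(manual_p, scipy_p)
```

If the two computations agree on the real p-values as well, the formula itself is probably not where the discrepancy comes from.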
Ref:
- https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/
- https://en.wikipedia.org/wiki/Fisher%27s_method
Edited
If I run the simulation 1000 times against the same sequence, I get a list of 1000 p-values that are all 1.0 (the samples likely come from the same distribution, and in fact they do, since I am comparing the same sequence over and over).
But when I apply Fisher's method to that list of p-values, I get 0.
- Why is that?
- I am confused about how to interpret the p-value from Fisher's method.
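For reference, here is a small sketch that applies the same formula to two edge cases: a list of 1000 p-values all equal to 1.0, and a list in which one p-value is exactly 0. The helper name `fisher_p_value` and both input lists are hypothetical; the second one is not something I observed, it is only there to show how the formula reacts to a zero.

```python
import numpy as np
from scipy.stats import chi2


def fisher_p_value(p_values):
    """Combine p-values with the formula from the FYI snippet."""
    p_values = np.asarray(p_values, dtype=float)
    # log(0) is -inf; silence the warning so the edge case can be inspected.
    with np.errstate(divide="ignore"):
        chi2_statistic = -2 * np.sum(np.log(p_values))
    degrees_of_freedom = 2 * len(p_values)
    return 1 - chi2.cdf(chi2_statistic, degrees_of_freedom)


# All p-values equal to 1.0: log(1) = 0, so the statistic is 0.
p_all_ones = fisher_p_value(np.ones(1000))

# One p-value exactly 0: the statistic becomes +inf.
p_with_zero = fisher_p_value([0.0] + [1.0] * 999)

print(p_all_ones)   # 1.0
print(p_with_zero)  # 0.0
```

In this sketch the all-ones list gives a combined p-value of 1.0, not 0, so a result of 0 on such a list may point to a difference between this sketch and what my actual code computes.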