Fisher's method to aggregate p-values from a simulation of Wilcoxon rank-sum tests: how to interpret the null hypothesis correctly

I got contradictory results between the p-values of single trials, obtained with the Wilcoxon rank-sum test, and the combined p-value from Fisher's method, computed over a simulation of 1000 trials.

I am confused about how to interpret the null hypothesis and the p-values obtained from a simulation. Can you please help me interpret the results and correct any wrong ideas? I do not yet have a solid background in statistics, so I would be grateful for help in grasping the key concepts and in designing an experiment that actually tests what I mean to test.

Consider this experiment:

I have a list of sequences associated with two different contexts:

  • context_1 - [[33, 33], [33], [23, 36, 35], ...]
  • context_2 - [[356, 122, 45], ... ]

These numbers represent how common each token is in a sequence observed for a given context. You can think of a sequence as a sentence that encodes the frequencies of its words: for example, if "apple" occurs 33 times in the vocabulary, then [33, 33] encodes "apple, apple". (I am actually working with animal sounds.)

Null hypothesis:

  • the sequences follow a similar distribution

Alternative hypothesis:

  • the sequences do not follow a similar distribution

I am using the Wilcoxon rank-sum test. As a control experiment, I compare samples belonging to the same context (say, context_1) and repeat the comparison n times (say, n = 1000):

  • Example: rank_sum([33, 33], [33]), rank_sum([33], [23, 36, 35]) , ...
  • collect the 1000 p_values

I see that the majority of the p-values (97%) are above 0.05. I interpret this as failing to reject the null hypothesis, so the samples in the single trials appear to follow the same distribution.
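A minimal sketch of the control experiment described above, for readers who want to reproduce it. It assumes `scipy.stats.ranksums` as the rank-sum implementation and uses synthetic integer sequences as a stand-in for context_1; the helper name `control_p_values` and the random pairing scheme are illustrative assumptions, not the exact code behind the numbers quoted in this question.

```python
import numpy as np
from scipy.stats import ranksums


def control_p_values(sequences, n_trials=1000, seed=0):
    """Compare randomly chosen pairs of sequences from ONE context.

    Hypothetical helper: returns one rank-sum p-value per trial.
    """
    rng = np.random.default_rng(seed)
    p_values = []
    for _ in range(n_trials):
        # Pick two different sequences from the same context.
        i, j = rng.choice(len(sequences), size=2, replace=False)
        p_values.append(ranksums(sequences[i], sequences[j]).pvalue)
    return np.array(p_values)


# Synthetic stand-in for context_1 (assumption: the real data are not shown here).
data_rng = np.random.default_rng(42)
context_1 = [
    data_rng.integers(1, 400, size=data_rng.integers(5, 15)).tolist()
    for _ in range(200)
]

p_values = control_p_values(context_1)
fraction_above = float(np.mean(p_values > 0.05))
print(f"fraction of p-values above 0.05: {fraction_above:.3f}")
```

Note that in this sketch the same sequences are reused across trials, so the 1000 p-values are not independent of each other.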

Now I want to aggregate the p-values from the simulation, so I use Fisher's method.

I found that the p-value of the "aggregated" 1000 p-values is 0.

That is, I should reject the null hypothesis and conclude that the samples do not come from the same distribution.

But in fact they do.

That is confusing:

  • Why do I get contradictory results between the majority of single trials using Wilcoxon and the final p-value using Fisher's method?
  • Is the null hypothesis I am testing with Fisher's method equivalent to the one I am testing with Wilcoxon, or is it something else?

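FYI: I am computing the Fisher p-value like this:

import numpy as np
from scipy.stats import chi2

n = 1000  # number of simulated trials
# rank_sum(a, b) stands for the p-value of one Wilcoxon rank-sum test
p_values = [rank_sum([33, 33], [33]), rank_sum([33], [23, 36, 35]), ...]  # 1000 p-values
chi2_statistic = -2 * np.sum(np.log(p_values))  # Fisher's statistic
degrees_of_freedom = 2 * n
fisher_p_value = 1 - chi2.cdf(chi2_statistic, degrees_of_freedom)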
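One way to check the formula above is to run it on p-values whose behaviour is known and compare it with SciPy's own implementation. The sketch below assumes independent p-values drawn uniformly from (0, 1], which is the situation Fisher's method is designed for under the null hypothesis; the helper name `fisher_combined` is illustrative. It uses `chi2.sf`, which is equivalent to `1 - chi2.cdf` but numerically more accurate in the upper tail.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues


def fisher_combined(p_values):
    """Fisher's method: X = -2 * sum(ln p_i), compared to chi-square with 2n df."""
    p = np.asarray(p_values, dtype=float)
    statistic = -2.0 * np.sum(np.log(p))
    # Survival function = upper-tail probability of the chi-square distribution.
    return statistic, chi2.sf(statistic, 2 * p.size)


# Assumption: 1000 independent p-values, uniform on (0, 1].
rng = np.random.default_rng(0)
uniform_p = 1.0 - rng.random(1000)

stat, p_combined = fisher_combined(uniform_p)
stat_ref, p_ref = combine_pvalues(uniform_p, method="fisher")

print(f"manual: statistic={stat:.2f}, p={p_combined:.4f}")
print(f"scipy:  statistic={stat_ref:.2f}, p={p_ref:.4f}")
```

If the manual formula and `combine_pvalues` agree on such a list, any surprising result on the real data is more likely to come from the p-values themselves (for example, dependence between trials or ties in very short sequences) than from the arithmetic.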


Edited

If I run the simulation 1000 times on the same sequences, I get a list of 1000 p-values that are all 1.0 (the samples likely come from the same distribution, and in fact they do, since I am comparing the same sequence over and over).

But when I apply Fisher's method to that list of p-values, I get 0.

  • Why is that?
  • I am confused about how to interpret the p-value from Fisher's method.
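For reference, here is a minimal sketch of what the formula returns on a list that is exactly 1.0 everywhere. It assumes the list contains 1000 floating-point values equal to 1.0 and nothing else. Since ln(1) = 0, the statistic is 0 and the upper-tail probability is 1, so getting 0 would suggest that the list actually passed to the formula differs from this assumption.

```python
import numpy as np
from scipy.stats import chi2

# Assumption: every one of the 1000 p-values is exactly 1.0.
p_values = np.ones(1000)

statistic = -2.0 * np.sum(np.log(p_values))  # ln(1) = 0, so the sum is 0
df = 2 * p_values.size

p_formula = 1 - chi2.cdf(statistic, df)  # same expression as in the question
p_sf = chi2.sf(statistic, df)            # equivalent upper-tail form

print(statistic, p_formula, p_sf)
```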