I'm currently running an experiment with a relatively large time-based data set. The data are not normally distributed, so a Wilcoxon test is required. I recently compared two parameters for which the p-value was incredibly small (< 0.00001), yet the actual difference in means between the two data sets was almost nothing (0.23 on a scale from 1 to 100).
What does this mean? From what I understand, the test matches up values of the same 'rank' and compares them. Does this mean that the values output by group 2 are simply on a slightly higher scale than those output by group 1? It just doesn't make sense to me that you can have statistical significance with a difference that small. If the answer is obvious, please forgive me; I'm a beginner.
Here is a summary of the test from Wikipedia for those who are unfamiliar:
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples.[1] The one-sample version serves a purpose similar to that of the one-sample Student's t-test.[2] For two matched samples, it is a paired difference test like the paired Student's t-test (also known as the "t-test for matched pairs" or "t-test for dependent samples"). The Wilcoxon test can be a good alternative to the t-test when population means are not of interest; for example, when one wishes to test whether a population's median is nonzero, or whether there is a better than 50% chance that a sample from one population is greater than a sample from another population.
Statistical significance is related to, but distinct from, effect size. Let's illustrate with an example.
Suppose I have a coin, and I know its exact probability $\theta$ of landing heads when tossed. I let you borrow this coin, but I do not tell you the value of $\theta$. Naturally, you would like to test whether the coin is fair; i.e., $\theta = 0.5$.
Unfortunately, I give you only a minute to experiment with the coin. You inspect it, and you can see nothing unusual about it. You proceed to toss the coin $n = 10$ times. The statistical hypothesis you are testing is $$H_0 : \theta = \theta_0 = 0.5 \quad \text{vs.} \quad H_a : \theta \ne 0.5.$$ You happen to observe $X = 8$ heads and $n - X = 2$ tails. Under the assumption that the null hypothesis is true--i.e., assuming the coin is fair--the probability of seeing a result as extreme as this is $$\begin{align} p = \Pr[(X \ge 8) \cup (X \le 2) \mid H_0] &= \frac{1}{2^{10}} \left(\binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{8} + \binom{10}{9} + \binom{10}{10} \right) \\ &= \frac{112}{1024} = \frac{7}{64} = 0.109375. \end{align}$$ This is the exact $p$-value of the hypothesis test, because it is the probability that, assuming the coin is fair, you would see a result at least as extreme as $8$ heads or $8$ tails. In other words, the limited number of coin tosses you are able to make prevents you from concluding with a high degree of statistical confidence that the coin is unfair, except perhaps in very extreme circumstances. If your rejection criterion were that you had to see all heads or all tails, then the significance level of such a test would be $\alpha = \frac{1}{512} \approx 0.00195313$. Put another way, if you had seen all heads or all tails in $10$ tosses, the probability that a fair coin would have produced such a result by random chance is less than $0.2\%$, so most people would be reasonably assured (although still not absolutely certain) that the coin is unfair.
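If you want to verify these tail probabilities yourself, they can be computed directly with the standard library (this is just a numeric check of the figures above, nothing more):

```python
from math import comb

n = 10

# Two-sided tail probability of seeing 8 or more heads, or 8 or more tails,
# out of 10 tosses of a fair coin: P(X >= 8 or X <= 2) under Binomial(10, 0.5).
p = sum(comb(n, k) for k in [0, 1, 2, 8, 9, 10]) / 2**n
print(p)  # 0.109375, i.e. 7/64

# Significance level of the stricter "all heads or all tails" rejection rule.
alpha = (comb(n, 0) + comb(n, n)) / 2**n
print(alpha)  # 0.001953125, i.e. 1/512
```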
But the problem with such a stringent rejection criterion is that you'd almost never be able to conclude the coin is unfair, even in cases of moderate bias. In other words, such a test has low statistical power. The effect size in this case is the estimated extent of the coin's deviation from fairness; e.g., if you saw $X = 8$, this represents a point estimate of $$\hat \theta = \frac{X}{n} = 0.8$$ as your "best guess" for the probability of heads $\theta$, and $|0.8 - 0.5| = 0.3$ is in some sense an "effect size" (though strictly speaking, it is unstandardized) of the coin's observed deviation from fairness.
To reframe the issue of low power in terms of effect size: when the sample size is small, you have limited ability to detect small effect sizes; that is, rejecting the null hypothesis on the basis of a sufficiently small $p$-value requires the true effect size to be very large. Your coin-toss experiment has to give you really extreme results for you to say the coin is unfair with any meaningful degree of confidence.
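To make the low power concrete, here is a small sketch (my own illustration, not part of the original example) of how rarely the "all heads or all tails" rule rejects, even when the coin is quite biased:

```python
# Power of the "all heads or all tails" rejection rule on n = 10 tosses:
# P(reject) = P(X = 0) + P(X = n) under Binomial(n, theta).
n = 10
for theta in (0.5, 0.6, 0.7, 0.8):
    power = (1 - theta) ** n + theta ** n
    print(theta, round(power, 5))
```

Even a heavily biased coin with $\theta = 0.8$ is detected only about 11% of the time under this rule.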
Yet the coin could be just a "little bit" unfair. Suppose, in fact, $\theta = 0.51$. In order to detect such a small deviation from fairness, you'd have to flip the coin many, many more times. For instance, if you had the whole day to flip the coin, you might manage $n = 1000$ flips, and if you observe, say, $X = 550$ heads, the $p$-value of your test would be $p \approx 0.00173054$, which is even smaller than the $p$-value for all heads or all tails out of $n = 10$ tosses. Yet your point estimate for $\theta$ is $\hat \theta = 0.55$, and the (Wald) $95\%$ confidence interval for $\theta$ would be $$[0.519166, 0.580834].$$
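These $n = 1000$ figures can also be checked numerically with the standard library (again, just a verification of the numbers above):

```python
from math import comb, sqrt

n, x = 1000, 550

# Exact two-sided p-value under H0: theta = 0.5. The null distribution is
# symmetric, so twice the upper tail P(X >= 550) gives the two-sided value.
p = 2 * sum(comb(n, k) for k in range(x, n + 1)) / 2**n
print(p)  # ~0.0017

# Wald 95% confidence interval: theta_hat +/- 1.96 * SE(theta_hat).
theta_hat = x / n
half_width = 1.96 * sqrt(theta_hat * (1 - theta_hat) / n)
print(theta_hat - half_width, theta_hat + half_width)  # ~[0.5192, 0.5808]
```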
So this is what is happening in your experiment. When the sample size is large, the test's ability to reject the null hypothesis is enhanced--it becomes sensitive to even small effect sizes. Your effect size is small, but the test is still able to show statistical significance because your sample size is so large.
One consequence of this issue is that experimenters often need to interpret the significance of a hypothesis test in the context of the observed or measured effect size, rather than simply report the $p$-value that was obtained. As a result, the experimenter may prespecify that a certain effect size must be observed in order for the results to have practical value.

For instance, a clinical trial to test the effectiveness of a cancer treatment might require that the treatment improve the median progression-free survival by at least 3 months in order to have clinical relevance. That means at least half of the patients given such a treatment would live without disease progression for at least 3 months longer than if they had not been treated. The desired effect size is this "3 months" improvement. However, if the study enrolled tens of thousands of patients, it could still conclude with statistical significance that the treatment is effective, but only slightly--say, it demonstrated that the improvement in median progression-free survival is between $1.2$ and $1.7$ months ($95\%$ confidence interval). Maybe such patients would say they'd still take a $1.2$-month improvement, but maybe not: the treatment might be highly toxic, or cause side effects that decrease quality of life so much that it might not be worth the small increase in progression-free survival.

So this is the kind of consideration we must make when interpreting the results of a hypothesis test, and in your situation, you need to decide whether that small effect size is meaningful.