Union over two sets within which a pattern is statistically significant

Suppose I have two variables $f_{a}$ and $f_{b}$ (let's say they are sampled from a Poisson point process and are thus independent) that form a pattern $C$ via some association rule. If this pattern is statistically significant in both set $S_{1}$ and set $S_{2}$, where $S_{1}$ and $S_{2}$ are disjoint, is it possible to prove that the pattern is also statistically significant in the union $S_{1} \cup S_{2}$?

Thanks for your inputs!

There are 1 best solutions below

No. Consider the following pattern $C$:

The dataset includes an increasing sequence of length at least $3$.

For example, $(1,2,5,3)$ matches the pattern, but $(3,6,2,4)$ does not.
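
To make the pattern concrete, here is a minimal Python sketch of a checker for $C$. It assumes that "sequence" means a not necessarily contiguous subsequence and that "increasing" is strict by default; both example datasets are consistent with that reading, but the answer does not spell it out. The function name and the `strict` switch are hypothetical, introduced only for illustration.

```python
from bisect import bisect_left, bisect_right


def has_increasing_subsequence(data, length=3, strict=True):
    """Return True if `data` contains an increasing subsequence of at least
    `length` terms (strictly increasing by default, non-decreasing otherwise)."""
    # tails[i] holds the smallest possible last value of an increasing
    # subsequence with i + 1 terms seen so far (patience-sorting idea).
    tails = []
    # bisect_left forbids repeated values; bisect_right allows them.
    locate = bisect_left if strict else bisect_right
    for x in data:
        pos = locate(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails) >= length


print(has_increasing_subsequence((1, 2, 5, 3)))  # matches, e.g. via 1, 2, 5
print(has_increasing_subsequence((3, 6, 2, 4)))  # no increasing triple
```

The `strict` flag matters only when the data contain repeated values, which can happen with Poisson draws.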

For any $S$, let $\{X_s\}_{s\in S}$ be a sequence of iid $\mathrm{Poisson}(1)$ random variables indexed by $S$. This is our model for production of the dataset associated with $S$.

The criterion for "statistical significance" in this hypothetical is quite simple: the pattern is statistically significant if the probability of its appearance in a dataset is not more than 5%.

Suppose $S_1=\{1\}$ and $S_2=\{2,3\}$. Then neither $\{X_s\}_{s\in S_1}$ nor $\{X_s\}_{s\in S_2}$ can support $C$; they are not long enough to contain any sequences of length $3$, much less increasing ones. So the appearance of $C$ in those datasets is an event of probability $0$, and thus statistically significant.

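But $\{X_s\}_{s\in S_1\cup S_2}$ is large enough to support $C$. If ties had probability zero, rearrangement symmetry would give exactly $\frac{1}{6}$, since each of the $3!$ orderings would be equally likely. Poisson variables do tie with positive probability, so the value depends on how "increasing" is read. Reading it as non-decreasing and writing $q_m=\sum_{k\ge 0}\mathbb{P}[X_1=k]^m$, the same symmetry argument gives $$\mathbb{P}[C\text{ in }\{X_s\}_{s\in S_1\cup S_2}]=\mathbb{P}[X_1\le X_2\le X_3]=\frac{1+3q_2+2q_3}{6}\approx 0.356>5\%$$ Under the strict reading the probability is $\frac{1-3q_2+2q_3}{6}\approx 0.048$, which falls just under the threshold, so the example relies on the non-strict reading (or on replacing $\mathrm{Poisson}(1)$ with a continuous distribution, for which the value is exactly $\frac{1}{6}$).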
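
As a numerical check on this probability, the following Python sketch sums the $\mathrm{Poisson}(1)$ mass function directly over ordered triples. Because Poisson variables are discrete, ties occur with positive probability, so the sketch reports the strict case $X_1<X_2<X_3$ and the non-strict case $X_1\le X_2\le X_3$ separately; the tie-free value $\frac{1}{6}$ lies between them. The helper names and the truncation point `cutoff` are assumptions made for illustration; the neglected tail mass beyond 40 is far below floating-point precision.

```python
from math import exp, factorial


def poisson_pmf(k, lam=1.0):
    """Probability that a Poisson(lam) variable equals k."""
    return exp(-lam) * lam**k / factorial(k)


def ordered_triple_probs(lam=1.0, cutoff=40):
    """Return (P[X1 < X2 < X3], P[X1 <= X2 <= X3]) for iid Poisson(lam),
    truncating the support to the values 0, ..., cutoff - 1."""
    p = [poisson_pmf(k, lam) for k in range(cutoff)]
    strict = 0.0
    weak = 0.0
    # Enumerate all value triples i <= j <= k in order.
    for i in range(cutoff):
        for j in range(i, cutoff):
            for k in range(j, cutoff):
                mass = p[i] * p[j] * p[k]
                weak += mass
                if i < j < k:
                    strict += mass
    return strict, weak


strict, weak = ordered_triple_probs()
print(f"strict: {strict:.4f}, non-strict: {weak:.4f}, tie-free: {1/6:.4f}")
```

Comparing the two printed values against the 5% threshold shows how much the conclusion depends on the treatment of ties.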

This example might strike you as contrived, because the statistical significance in $\{X_s\}_{s\in S_1}$ and $\{X_s\}_{s\in S_2}$ is vacuous. But the problem remains in non-vacuous examples: the pattern $C$ might frequently appear "split across the boundary" between $S_1$ and $S_2$, and thus fail to be significant in the combined dataset. The numbers are just harder to compute in that case, so it doesn't make as good an example.