Justifying that data samples are from different distributions


Let $x \in \{0,1\}^N$, and

\begin{align} D &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{M} \end{bmatrix} \end{align}

So that $D \in \{0,1\}^{M \times N}$, with one sample $x_i$ per row.

This is the original dataset. A 0 indicates the presence of a trait and a 1 indicates its absence. The order of the 0s and 1s within each $x$ matters. A second dataset $D'$ was generated by a different process (for example, a Boltzmann machine).

I am looking for a test statistic to decide whether $D$ and $D'$ come from the same distribution or from different ones. For example, the Kolmogorov-Smirnov test could be used, but I am not certain it is appropriate for this kind of data. Another candidate is a kernel two-sample test. Again, while this might work, I am wondering whether there are any caveats.

Or is there any other statistical test that might be more relevant?
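To make the kernel two-sample option concrete, here is a minimal sketch of what I have in mind: a permutation test on a (biased) MMD$^2$ statistic. The kernel $k(a,b)=\exp(-\gamma \cdot \text{Hamming}(a,b)/N)$, the bandwidth `gamma`, the number of permutations, and all function names are my own assumptions for illustration, not an established recipe. It also assumes the rows of $D$ and $D'$ are i.i.d. samples.

```python
import numpy as np

def hamming_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * Hamming(a, b) / N) for 0/1 row vectors (assumed kernel)."""
    A = A.astype(float)
    B = B.astype(float)
    # For binary vectors: Hamming(a, b) = sum(a) + sum(b) - 2 * a.b
    dist = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * dist / A.shape[1])

def mmd2_biased(K, m):
    """Biased MMD^2 estimate from the pooled kernel matrix; first m rows belong to D."""
    return K[:m, :m].mean() + K[m:, m:].mean() - 2.0 * K[:m, m:].mean()

def mmd_permutation_test(D, D_prime, n_perm=500, gamma=1.0, seed=0):
    """Return (observed MMD^2, permutation p-value) for H0: same distribution."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([D, D_prime])      # pooled sample, one binary vector per row
    m = D.shape[0]
    K = hamming_kernel(Z, Z, gamma)  # computed once, then re-indexed per permutation
    observed = mmd2_biased(K, m)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(Z.shape[0])  # shuffle the sample labels
        if mmd2_biased(K[np.ix_(idx, idx)], m) >= observed:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero
    return observed, (exceed + 1) / (n_perm + 1)
```

Calling `mmd_permutation_test(D, D_prime)` with both arrays shaped (samples, $N$) returns the statistic and a p-value. Because this kernel compares vectors coordinate by coordinate, the position of each 0 and 1 is taken into account, but the result will depend on the choice of `gamma`.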

References:

https://stats.stackexchange.com/questions/88764/test-for-difference-between-2-empirical-discrete-distributions

https://stats.stackexchange.com/questions/204359/method-to-justify-claim-that-two-samples-come-from-the-same-distribution

https://stats.stackexchange.com/questions/1047/is-kolmogorov-smirnov-test-valid-with-discrete-distributions