As part of my master's thesis I am 'Examining the Reliability of Markov Chains and the Kalman Filter as Stock Market Forecasters'. I will use the daily returns of the S&P 500 over a five-year period as a benchmark and compare each model's forecasts against it to establish which is more accurate. However, I am looking for a formal test that will allow me to establish which forecast series is more similar to the S&P 500.
I considered a simple correlation between the data sets, but I would prefer something more conclusive.
Any ideas?
I really would appreciate any help.
Thank You
I would keep this one simple, as follows: Models $A$ and $B$ each produce predictions $X_A^{(i)}$ and $X_B^{(i)}$ for each of $N$ historical data situations, where the actual results are $X^{(i)}$. I will discuss below how to choose these data situations given a time-sequence of states.
Then it is easy to form squared-error ensembles $\Delta_A^{(i)} = (X_A^{(i)}-X^{(i)})^2$ for $i < N$, and similarly $\Delta_B^{(i)}$. Now if those values were statistically uncorrelated, you could estimate (separately for ensemble $A$ and ensemble $B$) the mean and variance in the usual way, where the mean $\mu_A$ is the average of the $\Delta_A^{(i)}$, and the variance is $$ \hat{\sigma}_A^2 = \frac{1}{N-1} \left( \sum_i (\Delta_A^{(i)})^2 - N\mu_A^2 \right). $$ Now you have two (assumedly normal) distributions with means separated by $\mu_A - \mu_B$ and variances $\hat{\sigma}_A^2, \hat{\sigma}_B^2$. The difference of the two is a normally distributed random variable $\delta$ with mean $\mu_A - \mu_B$ and standard deviation $\hat{\sigma} = \sqrt{\hat{\sigma}_A^2 +\hat{\sigma}_B^2}$. You assume the standard deviation of $\delta$ is really your estimated value $\hat{\sigma}$, and apply the two-sided $z$-test of the hypothesis that the distribution mean is actually zero. For example, if your significance criterion is a 95% confidence level, the effect seen is deemed significant if $|\mu_A - \mu_B| > 1.96 \hat{\sigma}$.
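A minimal sketch of this recipe in Python (the function name and the 1.96 cutoff for ~95% confidence are illustrative; it assumes the per-point squared errors are statistically uncorrelated, which is what the thinning step below is for):

```python
import numpy as np

def compare_models(pred_a, pred_b, actual, z_crit=1.96):
    """z-test on the difference of mean squared errors of two models.

    Follows the recipe above: form the squared-error ensembles,
    estimate their means and sample variances, and test whether the
    mean difference is significantly non-zero.
    Returns the z statistic and whether |z| exceeds z_crit.
    """
    actual = np.asarray(actual, dtype=float)
    d_a = (np.asarray(pred_a, dtype=float) - actual) ** 2
    d_b = (np.asarray(pred_b, dtype=float) - actual) ** 2
    mu_a, mu_b = d_a.mean(), d_b.mean()
    # sigma-hat = sqrt(sigma_A^2 + sigma_B^2), using sample variances
    sigma = np.sqrt(d_a.var(ddof=1) + d_b.var(ddof=1))
    z = (mu_a - mu_b) / sigma
    return z, abs(z) > z_crit
```

A negative $z$ favours model $A$ (smaller mean squared error), a positive $z$ favours model $B$.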
You could instead compare Pearson's coefficient of determination for the two models; the trouble with that is that you still need to figure out how big a difference is needed to consider it significant.
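For reference, the coefficient of determination here is just the squared Pearson correlation between forecasts and realized values; a one-liner sketch (function name is illustrative):

```python
import numpy as np

def coefficient_of_determination(pred, actual):
    """Squared Pearson correlation between forecasts and realized values."""
    r = np.corrcoef(pred, actual)[0, 1]
    return r ** 2
```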
The remaining issue is how to form the ensemble of historical data situations such that the accuracy of method $A$ at point $i$ is not correlated with the accuracy at point $i+1$ (and similarly for method $B$). The simple technique is to first modify method $A$ by subtracting off its mean prediction error (if that happens to be non-zero; a decent model will already have done that). Then consider the time sequence $\Delta_A^{(t)}$ and create, for $k=1$, an ensemble of products $C_{m,k} = \Delta_A^{(mk)}\,\Delta_A^{(mk+k)}$ for all $m$ such that $mk+k$ is less than the total time available. The mean value of this ensemble is related to the mean squared $\Delta_A^{(i)}$; their ratio is closely related to the time correlation coefficient for $k$ steps. This will likely be close to $1$ for $k=1$. Try again for $k=2$, $k=4$ and so forth until you come to a step size that reduces the correlation to (say) 10% for both methods $A$ and $B$, and use that step size to form your ensembles for doing the $z$-test analysis.
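A sketch of this thinning procedure (names and the 10% threshold are illustrative; I demean the squared-error series before taking the lag products, so the ratio is the usual autocorrelation coefficient):

```python
import numpy as np

def lag_correlation(delta, k):
    """Ratio of the mean lag-k product to the mean square of the
    demeaned squared-error series, i.e. the lag-k autocorrelation."""
    d = np.asarray(delta, dtype=float)
    d = d - d.mean()
    return np.mean(d[:-k] * d[k:]) / np.mean(d ** 2)

def decorrelation_step(delta, threshold=0.10):
    """Double the step size k (1, 2, 4, ...) until the lag-k
    correlation drops below the threshold."""
    k = 1
    while 2 * k < len(delta) and abs(lag_correlation(delta, k)) > threshold:
        k *= 2
    return k
```

Run this on both models' $\Delta$ series, take the larger of the two step sizes, and subsample every $k$-th squared error to form the ensembles for the $z$-test.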