I'm simulating a simple Ornstein-Uhlenbeck process
$dx=-x dt+\sqrt{2}dW$
which is well-known to have a steady state distribution of
$p_s(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$
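For reference, this steady state can be verified from the stationary Fokker–Planck equation of the SDE above (integrating once and using decay at infinity):
$$0=\partial_x\!\left(x\,p_s\right)+\partial_x^2 p_s \quad\Longrightarrow\quad x\,p_s+p_s'=0 \quad\Longrightarrow\quad p_s(x)\propto e^{-x^2/2}.$$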
Here's my MATLAB code for the simulation:
dt = 0.01;
T = 5000;
N = numel(0:dt:T);
X = zeros(1, N);
for i = 1:N-1
    % Euler-Maruyama scheme
    X(i+1) = X(i) - X(i)*dt + sqrt(2*dt)*randn;
    % Exact update (alternative):
    % X(i+1) = X(i)*exp(-dt) + sqrt(1-exp(-2*dt))*randn;
end
Then I run the Kolmogorov–Smirnov test to check the normality of the resulting samples:
[h,p] = kstest(X);
but I always get h = 1, which rejects the null hypothesis that the data follow the standard normal distribution.
To find out the reason, I generate normally distributed random numbers
x = randn(size(X));
and compare the empirical cumulative distribution functions of both X and x with the standard normal CDF:
pd = makedist('normal',0,1);
[fX,t] = ecdf(X);
y = cdf(pd,t);
plot(t,fX-y)
[fx,t] = ecdf(x);
hold on;
plot(t,fx-y);
The curves for the simulation-generated X and the MATLAB-generated x have very similar shapes.
Then I do the two-sample KS test,
[h,p] = kstest2(X,x)
It returns h = 0 (the test does not reject that X and x come from the same distribution). So I'm really confused: why can't the simulated X pass the normality test, with a p-value far below the significance level 0.05?
For $s<t$ and $x(0)=0$, the process $x(t)$ can be written as $$x(t)=x(s)e^{-(t-s)}+\sqrt{2}\int_{s}^{t}e^{-(t-u)}\,dW_u.$$
Let $t_0<t_1<\dots<t_n$ be a subdivision of time; then each $x(t_k) \sim \mathcal{N}(0,\,1-e^{-2t_k})$.
So, for large $t_n$, one would expect $x(t_n) \sim \mathcal{N}(0,1)$ approximately.
This can be tested by building a sample $U=\{x(t_n,\omega_1),\dots,x(t_n,\omega_N)\}$, where $x(t_k,\omega_l)$ is the value of $x(t_k)$ in the $l^{\text{th}}$ scenario.
We can pick $t_n=5000$ as you did, and a decent number of scenarios, say $N=1000$. MATLAB's kstest applied to $U$ should not reject the null hypothesis (h = 0), because every element is drawn from one and the same distribution: that of $x(t_n)$.
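This ensemble test can be sketched as follows (in Python/NumPy rather than MATLAB, with scipy.stats.kstest standing in for MATLAB's kstest; the horizon $t_n=10$, the seed, and $N=1000$ are illustrative choices, $t_n=10$ being already enough since $1-e^{-20}\approx 1$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dt, t_n, N = 0.01, 10.0, 1000          # step, horizon, and number of scenarios
n_steps = int(round(t_n / dt))

# Exact one-step OU update, vectorized over N independent scenarios
x = np.zeros(N)
a = np.exp(-dt)
b = np.sqrt(1.0 - np.exp(-2.0 * dt))
for _ in range(n_steps):
    x = a * x + b * rng.standard_normal(N)

# One-sample KS test of the ensemble U = {x(t_n, w_1), ..., x(t_n, w_N)} against N(0, 1)
ks_stat, p_value = stats.kstest(x, 'norm')
print(ks_stat, p_value)
```

Because the $N$ values are independent draws from (approximately) the same $\mathcal{N}(0,1)$ law, the test statistic stays small and the null hypothesis is not rejected.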
What you actually tested is the sample $U'=\{x(t_1,\omega_1),\dots,x(t_n,\omega_1)\}$, which is one observation from each of the $n$ distributions $x(t_k)$: its elements are not drawn from a single distribution. For example, $\mathrm{var}(x(t_1)) \approx 0.0198$ while $\mathrm{var}(x(t_{50})) \approx 0.6321$; the variance only reaches $0.999$ around $t \approx 3.45$ (step $345$ with $dt=0.01$).
With $T=5000$, though, this transient part is only a few hundred of the $500{,}001$ points, well under $1\%$ of the sample.
Let us assume this transient part of the sample is negligible. We still have an issue with the covariance.
Indeed, by definition, a sample must come from i.i.d. random variables. In your case, for $s<t$, $$\mathrm{cov}(x(t),x(s))=\mathrm{cov}\!\left(\sqrt{2}\int_{0}^{t}e^{-(t-u)}\,dW_u,\ \sqrt{2}\int_{0}^{s}e^{-(s-u)}\,dW_u\right)$$ $$=2\int_{0}^{s}e^{-(t-u)}e^{-(s-u)}\,du=e^{-(t-s)}-e^{-(t+s)}\approx e^{-(t-s)}\quad\text{for large }s.$$
By construction of $U'$, there is nonzero correlation between its elements, even at large time points. Take $x(1000)$ and $x(1001)$: both variances are essentially one, but the correlation is $e^{-1} \approx 0.3679$.
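This correlation is easy to check numerically. The following Python sketch (the seed, lag, and sample size are my own choices) draws stationary pairs $(x(t), x(t+1))$ using the exact OU transition:

```python
import numpy as np

rng = np.random.default_rng(1)
N, delta = 200_000, 1.0                # number of pairs and the time lag

# Stationary pairs (x(t), x(t + delta)): start from the stationary law N(0, 1),
# then apply the exact OU transition over the lag delta
x_t = rng.standard_normal(N)
x_lag = x_t * np.exp(-delta) + np.sqrt(1.0 - np.exp(-2.0 * delta)) * rng.standard_normal(N)

corr = np.corrcoef(x_t, x_lag)[0, 1]
print(corr)                            # close to e^{-1} ~ 0.3679
```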
The conclusion is that one should not expect kstest to accept the null hypothesis on $U'$.
You can try taking a subset of your initial sample, starting from $t=50$ for example and keeping one value every $30$ time units: $U''=\{x(50,\omega_1),x(80,\omega_1),\dots,x(4970,\omega_1),x(5000,\omega_1)\}$. It is just a suggestion; I do not guarantee results.
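A sketch of this thinning idea in Python (again with scipy.stats.kstest in place of MATLAB's kstest; the seed is an arbitrary choice, the other parameters follow the question and the suggestion above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
dt, T = 0.01, 5000.0
n = int(round(T / dt))

# One OU path with x(0) = 0, using the exact one-step update
a = np.exp(-dt)
b = np.sqrt(1.0 - np.exp(-2.0 * dt))
x = np.empty(n + 1)
x[0] = 0.0
noise = rng.standard_normal(n)
for k in range(n):
    x[k + 1] = a * x[k] + b * noise[k]

# Thin the path: drop the burn-in before t = 50, then keep one point every 30 time units
start, step = int(50 / dt), int(30 / dt)
u2 = x[start::step]                    # successive kept points have correlation ~ e^{-30}

ks_stat, p_value = stats.kstest(u2, 'norm')
print(len(u2), ks_stat, p_value)
```

After thinning, the kept points are essentially independent draws from $\mathcal{N}(0,1)$, so the KS statistic stays small, at the price of shrinking the sample to a couple of hundred points.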