Test for inferring distribution of random variable


Suppose a random variable $Z$ is known to come from one of two distributions: $X$ or $Y$. Given a set of observations $\{z_i\}_{i=1}^{N}$ of $Z$, what would be the best statistical test(s) to use to infer which distribution $Z$ follows?


Choices. Suppose $X \sim \mathsf{Gamma}(5, 5),$ where the second parameter is the rate, and $Y \sim \mathsf{Norm}(1, 1/\sqrt{5}),$ where the second parameter is the standard deviation. Both distributions have $\mu=1$ and $\sigma= 1/\sqrt{5}.$
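As a quick check that the moments really match, here are the standard formulas for a gamma distribution with shape $\alpha$ and rate $\beta$ (the symbols $\alpha$ and $\beta$ are notation introduced only for this check):

$$\mu_X = \frac{\alpha}{\beta} = \frac{5}{5} = 1, \qquad \sigma_X^2 = \frac{\alpha}{\beta^2} = \frac{5}{25} = \frac{1}{5}, \qquad \sigma_X = \frac{1}{\sqrt{5}}.$$

So the two candidates agree in mean and standard deviation and differ only in shape: the gamma is right-skewed, the normal is symmetric.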

I have 100 observations in a vector z, sampled from one of these two distributions and rounded to 2 decimal places. Which distribution did they come from?

K-S test statistic. A natural choice is the Kolmogorov-Smirnov goodness-of-fit test, which compares the empirical CDF (ECDF) of the sample with the CDF of a hypothesized distribution and rejects when the largest discrepancy between the two is too large.

Data summary. Summary statistics for z:

summary(z);  sd(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2800  0.6575  0.8950  0.9581  1.2125  2.0300 
[1] 0.3815934

You might suspect that the data are from the right-skewed gamma distribution because the sample mean is noticeably larger than the sample median, which sometimes indicates right-skewness.
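A rough way to put a number on that impression is the sample skewness. The sketch below is not part of the original analysis; `sample_skewness` is a hypothetical helper written for illustration (it is not a base R function), and the data are regenerated with the same seed given in the note at the end of this answer. For reference, a normal distribution has skewness 0 and a gamma distribution with shape 5 has skewness $2/\sqrt{5} \approx 0.89$.

```r
# Regenerate the data exactly as in the note at the end of this answer.
set.seed(1023)
z <- round(rgamma(100, 5, 5), 2)

# Moment-based sample skewness (hypothetical helper, not a base R function).
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mean(z) - median(z)   # positive when the mean exceeds the median
sample_skewness(z)    # compare with 0 (normal) and 2/sqrt(5) (gamma, shape 5)
```

With only 100 observations the sample skewness is noisy, so this is a hint rather than a decision rule.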

K-S distances from R software. The K-S test statistics $D$ are shown below. (A few ties were induced by rounding. The K-S test as implemented in R gives a warning message if any ties are present; I have deleted these warning messages.)

The smaller statistic is for the gamma distribution, so I choose the gamma distribution as the better fit. (The K-S test rejected neither model, but our criterion here is to use the smallest $D$ for identification.)

ks.test(z, "pgamma", 5, 5)$statistic
         D 
0.06906961 

ks.test(z, "pnorm", 1, 1/sqrt(5))$statistic
       D 
0.114528 
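The "smallest $D$ wins" rule can be sketched in a few lines. This is only an illustration of the criterion described above: the vector name `d` and its labels are my own, and `suppressWarnings()` is used to hide the ties warning caused by rounding.

```r
# Regenerate the data exactly as in the note at the end of this answer.
set.seed(1023)
z <- round(rgamma(100, 5, 5), 2)

# K-S distance of the sample from each candidate CDF.
d <- c(
  gamma  = unname(suppressWarnings(ks.test(z, "pgamma", 5, 5))$statistic),
  normal = unname(suppressWarnings(ks.test(z, "pnorm", 1, 1/sqrt(5)))$statistic)
)

d                       # both distances side by side
names(which.min(d))     # label of the candidate with the smaller distance
```

The same pattern extends to more than two candidates by adding entries to `d`.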

ECDF plots. In each case, $D$ is the largest absolute vertical distance between the sample ECDF and the target CDF. The ECDF plot of the sample (black dots) is the same in each panel. The red curve at left is the CDF of $\mathsf{Gamma}(5, 5)$ and the red curve at right is the CDF of $\mathsf{Norm}(1, 1/\sqrt{5}).$ Visually, it is clear that the ECDF fits the gamma CDF better.
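To make the "largest vertical distance" concrete, $D$ can be computed by hand. The function `ks_distance` below is a hypothetical helper written for this illustration. Because the ECDF is a step function, the largest gap must occur at a data point, just before or just after the jump there, so it is enough to check both sides of each jump.

```r
# Regenerate the data exactly as in the note at the end of this answer.
set.seed(1023)
z <- round(rgamma(100, 5, 5), 2)

# Largest vertical gap between the ECDF of x and a target CDF
# (hypothetical helper; extra arguments are passed on to the CDF).
ks_distance <- function(x, cdf, ...) {
  x  <- sort(x)
  n  <- length(x)
  Fx <- cdf(x, ...)
  i  <- seq_len(n)
  # At the i-th order statistic the ECDF jumps from (i-1)/n to i/n.
  max(pmax(i / n - Fx, Fx - (i - 1) / n))
}

ks_distance(z, pgamma, 5, 5)          # distance to Gamma(5, 5)
ks_distance(z, pnorm, 1, 1/sqrt(5))   # distance to Norm(1, 1/sqrt(5))
```

These values should agree with the statistic returned by `ks.test` for the same sample and target distribution.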

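[Figure: ECDF of z (black dots) in two panels, with the $\mathsf{Gamma}(5, 5)$ CDF in red at left and the $\mathsf{Norm}(1, 1/\sqrt{5})$ CDF in red at right.]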
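The plotting code was not posted. A minimal sketch that should give panels like those described, assuming base R graphics (the panel titles and line width are my own choices), is:

```r
# Regenerate the data exactly as in the note at the end of this answer.
set.seed(1023)
z <- round(rgamma(100, 5, 5), 2)

# Two panels side by side; restore the old layout afterwards.
old_par <- par(mfrow = c(1, 2))

plot(ecdf(z), main = "ECDF with Gamma(5, 5) CDF")
curve(pgamma(x, 5, 5), add = TRUE, col = "red", lwd = 2)

plot(ecdf(z), main = "ECDF with Norm(1, 1/sqrt(5)) CDF")
curve(pnorm(x, 1, 1/sqrt(5)), add = TRUE, col = "red", lwd = 2)

par(old_par)
```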


Note: Data were sampled using R statistical software as shown below. If you use the same set.seed statement, you should get exactly the same data used above.

set.seed(1023)
z = round(rgamma(100, 5, 5), 2)