Relation between Empirical CDF and ordering


I have $n$ realizations of an iid random variable $X$. If I order these realizations:

$$x_1 \leq x_2 \leq \dots \leq x_n $$

This looks like an empirical CDF if I rescale it correctly. Is there a relation between this ordering and the empirical CDF? It seems that I have one realization per bin in a histogram of the CDF.

Best answer

@Henry's description of an ECDF is essentially correct. If there are ties in the data at value $v$ then the upward jump at $v$ is by $k/n,$ where $k$ is the number of observations tied at value $v.$
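To make the link between ordering and the ECDF concrete, here is a minimal sketch in R. It assumes the usual definition $F_n(t) = \frac{1}{n}\,\#\{i : x_i \le t\}$; the five data values are hypothetical and chosen only so that one tie appears. Evaluated at the sorted data, the ECDF is just rank divided by $n$ (taking the largest rank among tied values).

```r
# Hypothetical small sample with one tie (1.4 appears twice); illustration only
x <- c(3.1, 1.4, 2.7, 1.4, 5.0)
n <- length(x)

xs <- sort(x)    # order statistics x_(1) <= ... <= x_(n)
Fn <- ecdf(x)    # base R empirical CDF, returned as a step function

# ECDF at each sorted value = (number of observations <= that value) / n,
# which is the "max" rank of the value divided by n
ecdf_at_sorted <- Fn(xs)
rank_over_n    <- rank(xs, ties.method = "max") / n
print(data.frame(sorted = xs, ecdf = ecdf_at_sorted, rank_over_n = rank_over_n))

# Jump size at each distinct value v is k/n, where k = number of observations equal to v
jumps <- table(x) / n
print(jumps)
```

Without ties every jump is $1/n$, so plotting $i/n$ against the $i$th sorted value traces the same staircase that `plot(ecdf(x))` draws.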

The ECDF of a sufficiently large random sample approximates the CDF of the population from which the sample was taken.

For example, consider $n = 80$ observations from $\mathsf{Norm}(\mu = 100, \sigma=15).$

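set.seed(729)
x = rnorm(80, 100, 15)   # 80 draws from Norm(100, 15)
plot(ecdf(x))            # step function that jumps at each sorted observation
curve(pnorm(x, 100, 15), add=TRUE, col="orange", lwd=2)  # population CDF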

[Figure: ECDF of the sample of 80 with the population CDF in orange]

The test statistic $D$ of a one-sample Kolmogorov-Smirnov test, with the null hypothesis that $X_i \stackrel{iid}{\sim} \mathsf{Norm}(\mu = 100, \sigma=15),$ is the maximum vertical discrepancy between the hypothetical CDF and the sample ECDF.

ks.test(x, pnorm, 100, 15)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.12758, p-value = 0.1355
alternative hypothesis: two-sided

Although the fit of the ECDF to the CDF in this instance is not excellent, this is to be expected for a sample as small as $n = 80.$ The ECDF moves in jumps of size $1/80 = 0.0125,$ and the observed $D = 0.12758$ is not unusually large for a sample of this size: the P-value is $0.1355 > 0.05,$ so $H_0$ is not rejected at the 5% level.
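The statistic $D$ can also be computed directly from the ordered sample, which shows where the ordering enters. The sketch below assumes no ties (reasonable for continuous data) and uses my own variable names; it checks the hypothesized CDF against both the top and the bottom of each ECDF jump, then compares the result with what `ks.test` reports.

```r
set.seed(729)
x <- rnorm(80, 100, 15)        # same call as in the example above
n <- length(x)

F0 <- pnorm(sort(x), 100, 15)  # hypothesized CDF evaluated at the order statistics
i  <- seq_len(n)

# The ECDF jumps from (i-1)/n to i/n at the i-th order statistic,
# so the largest gap can occur on either side of a jump
D_plus   <- max(i / n - F0)        # ECDF above the hypothesized CDF
D_minus  <- max(F0 - (i - 1) / n)  # ECDF below the hypothesized CDF
D_manual <- max(D_plus, D_minus)

D_ks <- unname(ks.test(x, pnorm, 100, 15)$statistic)
print(c(manual = D_manual, ks.test = D_ks))
```

The two values should agree up to floating-point error, since the two-sided statistic is defined as this maximum.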

Typically, the fit of the ECDF to the CDF is "better" than the fit of the density function (orange curve) to a histogram of the data, partly because binning for the histogram is somewhat arbitrary. The default kernel density estimator (KDE) of the sample is often a better representation of the data than a histogram. (The KDE of this sample is shown as a dotted blue line.)

hdr = "Histogram of Sample with Population Density"
hist(x, prob=TRUE, col="skyblue2", main=hdr);  rug(x)     # density-scale histogram with tick marks at the data
curve(dnorm(x, 100, 15), add=TRUE, col="orange", lwd=2)   # population density
lines(density(x), type="l", col="blue", lwd=2, lty="dotted")  # default KDE

[Figure: histogram of the sample of 80 with rug, population density (orange), and KDE (dotted blue)]
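As a rough illustration of why histogram binning is somewhat arbitrary while the ECDF is not, the following sketch bins the same 80 observations two ways. The `breaks` values 5 and 20 are arbitrary choices of mine; when given as a single number, `breaks` is only a suggestion to `hist`.

```r
set.seed(729)
x <- rnorm(80, 100, 15)   # same call as in the example above

# Two histograms of the same data; plot = FALSE returns the bins without drawing
h_coarse <- hist(x, breaks = 5,  plot = FALSE)
h_fine   <- hist(x, breaks = 20, plot = FALSE)
print(length(h_coarse$counts))   # number of bins actually used (coarse)
print(length(h_fine$counts))     # number of bins actually used (fine)

# The ECDF has no tuning parameter: its jump points are the distinct data values
Fn <- ecdf(x)
jump_points <- knots(Fn)
print(isTRUE(all.equal(jump_points, sort(unique(x)))))
```

Both histograms account for all 80 observations, but their shapes depend on the bins chosen, whereas `ecdf(x)` is determined by the data alone.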

A sample of size $n = 2000$ gives a more accurate view of the population.

set.seed(2020)
x = rnorm(2000, 100, 15)

summary(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47.98   89.84   99.19   99.81  109.87  155.54 
[1] 15.32257  # sample SD

For this larger sample, within the resolution of our plots, the ECDF is hardly distinguishable from the CDF, so that plot is not shown.
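Even without a plot, the closeness of the ECDF to the CDF can be summarized by the maximum vertical gap $D.$ The sketch below uses a hypothetical helper, `max_gap`, to compute it for one sample of each size. Both samples here are drawn after a single `set.seed(2020)` call, so they are not the same samples as in the examples above.

```r
# Hypothetical helper: maximum vertical gap between the ECDF of x
# and the Norm(mu, sigma) CDF, taken from ks.test's statistic
max_gap <- function(x, mu = 100, sigma = 15) {
  unname(ks.test(x, pnorm, mu, sigma)$statistic)
}

set.seed(2020)
d_small <- max_gap(rnorm(80,   100, 15))   # one sample of size 80
d_large <- max_gap(rnorm(2000, 100, 15))   # one sample of size 2000
print(c(n80 = d_small, n2000 = d_large))
```

For samples that really come from the hypothesized distribution, $D$ tends to shrink at roughly the rate $1/\sqrt{n},$ which is why the two curves are hard to tell apart at $n = 2000.$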

The histogram, density curve and KDE are shown below. (The rug, which shows locations of individual sampled values, is omitted here because there are too many of them for a useful view.)

[Figure: histogram of the sample of 2000 with population density (orange) and KDE (dotted blue)]