What curve represents random sampling sorted by frequency?

87 Views Asked by At

I perform a discrete random sample with a uniform distribution (such as drawing colored balls from a bag with replacement). I then plot the frequency that each color has been selected, ordered by frequency (where the mode is the first plotted point). This will produce a curve with a negative slope.

Obviously the exact curve you get depends on the random sample, but I'm looking for the theoretical expected/average curve.

As an example, I got $10^5$ different numbers between $1\sim 100$ generated here, and wrote a small script to count the frequency of those numbers and to sort by frequency. This is the graph I got:

enter image description here

What equation describes the curve produced by sorting a random sample?

1

There are 1 best solutions below

0
On BEST ANSWER

It simplifies analysis a bit if we sort the counts of the random numbers in ascending order, rather than descending as in the OP. With that change, to a good approximation the curve has a value at $x=k$ which is the mean of the $k$-th order statistic of a sample of size $100$ taken from a Normal distribution with mean $100$ and standard deviation $9.95$. (I don't know of a simpler description.) Theoretical vs Observed Values In the plot above, the dots represent counts resulting from random number draws from a uniform distribution, and the solid line shows the theoretical value.

Since we do not have access to the numbers used in the OP, we used R to generate a set of values, using R's built-in pseudo-random number generator. The code we used is

set.seed(1234)
x <- sample(1:100, 10000, replace=TRUE)

Because of the set.seed statement, anyone running the same code should be able to replicate these results exactly.

The chance that any one random number is $k$ is $p=1/100$ for $1 \le k \le 100$, so the total number of random numbers with value $k$ follows a Binomial distribution with $p=1/100$ and $n = 10^5$. We can approximate the Binomial with a Normal distribution with mean $n p = 100$ and standard deviation $\sqrt{n p (1-p)} \approx 9.95$.

In general, when we take a random sample of size $n$ from any distribution, the results of the sample when sorted in ascending order are called the "order statistics" of the distribution, usually written $X_{(1)}, X_{(2)}, X_{(3)}, \dots X_{(n)}$, where $X_{(1)} \le X_{(2)} \le X_{(3)} \dots \le X_{(n)}$. (I've using $n$ in two different ways here, I hope that isn't too confusing. Now I'm using $n$ as the size of the sample.) In order to qualify as a "random sample", our sample values should be independent and identically distributed. We are fudging a bit on the "independent" part, because the counts of the random numbers are not independent. However, they are "nearly independent", so the lack of independence should not effect our results very much.

Still thinking about the order statistics of an arbitrary continuous distribution, suppose our original distribution has a density $f(x)$ and a cumulative density function $F(x)$. To simplify notation a bit, let's say $X_{(k)} = y_k$. Then the probability density function of $y_k$ is $$g_k(y_k) = \frac{n!}{(k-1)!(n-k)!} [F(y_k)]^{k-1} [1-F(y_k)]^{n-k} f(y_k)$$ If we want the mean of this distribution, it is $$\mu_k = \int_{-\infty}^{\infty} y_k \;g_k(y_k) \; dy_k \tag{*}$$

In our particular case, we have a Normal distribution with mean $\mu=100$ and standard deviation $\sigma = 9.95$, i.e. $$f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( \frac{(x-\mu)^2}{2 \sigma^2} \right)$$ and $$F(x) = \int_{-\infty}^x f(x) \; dx$$ The "theoretical line" in the plot was generated using $(*)$ and numerical integration in Mathematica.