What is the cost of reconstructing a discrete probability distribution within a given error?


I'm aware that there are several methods to estimate a probability distribution from samples (e.g. density estimation methods). However, I haven't been able to find the actual asymptotic costs (sample complexities) of these methods, and I'm not sure whether they can be applied to the case of discrete probability distributions.

Given a discrete probability distribution $p$ on $[n]\equiv \{1,...,n\}$, I'm looking for the number of samples required to construct a distribution $q$ on $[n]$ such that $\|p-q\|_1\le\epsilon$ ($\|\cdot\|_1$ denoting here the $\ell_1$ distance). In other words, how many samples are needed to estimate the correct probability distribution within a given error $\epsilon$?

I've recently read the statement (in this paper, second paragraph of introduction) that the cost of doing this is $\Theta(n/\epsilon^2)$, but if this is a well-known result I haven't been able to find a good explanation for it.

I suppose the naive reconstruction method would be to build the histogram corresponding to the samples (that is, count the fraction of samples corresponding to each $i=1,...,n$). However, I'm not sure whether this method is optimal for the purpose, or how many samples would be required to achieve a given estimation error with it.
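For concreteness, here is how I would implement the naive histogram estimator (a minimal Python sketch; the function name and the test distribution are my own choices, not from any reference):

```python
import numpy as np

def empirical_estimate(samples, n):
    """Histogram (empirical) estimator: the fraction of samples
    equal to each outcome i in {0, ..., n-1}."""
    counts = np.bincount(samples, minlength=n)
    return counts / len(samples)

rng = np.random.default_rng(0)
n = 10
p = rng.dirichlet(np.ones(n))              # an arbitrary true distribution on [n]
samples = rng.choice(n, size=100_000, p=p) # draw N i.i.d. samples from p
q = empirical_estimate(samples, n)
print(np.abs(p - q).sum())                 # l1 error; shrinks as N grows
```

The printed $\ell_1$ error is what I would like to control as a function of the number of samples.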

Estimates using distances other than the $\ell_1$ one are of course also welcome.



Upon some reflection, here is what I think might be a possible solution.

Suppose the true distribution is $\newcommand{\bs}[1]{{\boldsymbol{#1}}}\bs p\equiv(p_1,...,p_n)$, and we are drawing $N>0$ samples from it. This amounts to drawing a sample from the multinomial distribution, call it $\mathrm{Multinomial}(N,\bs p)$.

If $X\sim \mathrm{Multinomial}(N,\bs p)$, then the possible outcomes of $X$ are (representable by) sequences of $N$ integers $x_i\in[n]\equiv \{1,..., n\}$, with $i=1,...,N$. Because our goal is to estimate $\bs p$, we are interested in the number of times each $k\in [n]$ occurs in a given instance of $X$; denote the corresponding count by $\hat N_k$. The joint distribution of these counts is $$\mathrm{Pr}(\hat N_1 = N_1,\dots,\hat N_n = N_n) = \binom{N}{\bs N}\bs p^{\bs N} \equiv \binom{N}{N_1\cdots N_n} p_1^{N_1}\cdots p_n^{N_n},$$ where $\binom{N}{\bs N}\equiv \frac{N!}{N_1!\cdots N_n!}$ denotes the associated multinomial coefficient.
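As a quick numerical sanity check of this pmf, one can evaluate it directly (a Python sketch using only the standard library and numpy; the counts and probabilities below are arbitrary):

```python
from math import factorial
import numpy as np

def multinomial_pmf(counts, p):
    """Pr(N_1 = counts[0], ..., N_n = counts[-1]) for Multinomial(N, p)."""
    N = sum(counts)
    coeff = factorial(N)
    for c in counts:
        coeff //= factorial(c)   # multinomial coefficient N! / (N_1! ... N_n!)
    return coeff * np.prod(np.array(p, dtype=float) ** np.array(counts))

# e.g. N = 3 draws from (0.5, 0.3, 0.2): coefficient is 3, so pmf = 3 * 0.25 * 0.3
print(multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2]))   # 0.225
```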

The problem at hand can be formalised as the task of looking for a good estimator of the parameter $\bs p$. The intuition of using the histogram of the samples can be formalised in the choice of estimators $$\hat p_i \equiv \frac{\hat N_i}{N}.$$ We can then compute the first moments, to find: $$\mathbb E[\hat p_i] = \frac{\mathbb E[\hat N_i]}{N} = p_i \qquad \text{(the estimator is unbiased)}, \\ \mathrm{Var}[\hat p_i] = \frac{\mathrm{Var}[\hat N_i]}{N^2} = \frac{p_i(1-p_i)}{N}.$$ We can now bound the $\ell_1$ distance between estimated and true distribution. For $n=2$, this reads $$\|\hat{\bs p}-\bs p\|_1 = |\hat p_1 - p_1 | + |\hat p_2-p_2| = 2|\hat p_1 - p_1|,$$ and thus, applying Chebyshev's inequality to $\hat p_1$ with deviation $\epsilon/2$, we have $$\mathrm{Pr}(\|\hat{\bs p}-\bs p\|_1 \ge \epsilon) = \mathrm{Pr}(|\hat p_1 - p_1| \ge \epsilon/2) \le \frac{4\,\mathrm{Var}[\hat p_1]}{\epsilon^2} = \frac{4p_1 (1-p_1)}{N \epsilon^2}.$$ Without making more assumptions about the probability distribution, we should take the worst-case scenario and treat this bound as tight. Shuffling terms around, we can then say that, in the worst case, to estimate a binary probability distribution $(p_1,p_2)$ with probability at most $t$ of committing an additive error larger than $\epsilon$, it suffices to take a number of samples $$N \ge \frac{4p_1(1-p_1)}{t\epsilon^2}.$$ If $p_1$ is unknown, as it would be in this context, we again take the worst case $p_1=1/2$, and get the bound $N\ge \frac{1}{t \epsilon^2}$.
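To double-check the moment computations above, here is a small Monte Carlo sketch in Python (the parameter values are illustrative choices, not part of the derivation):

```python
import numpy as np

# Monte Carlo check that E[p_hat] = p1 and Var[p_hat] = p1*(1 - p1)/N
# for the binary (n = 2) histogram estimator p_hat = N_1 / N.
rng = np.random.default_rng(1)
p1, N, trials = 0.3, 200, 20_000
p_hat = rng.binomial(N, p1, size=trials) / N   # one estimate per trial

print(p_hat.mean())   # should be close to p1 = 0.3 (unbiased)
print(p_hat.var())    # should be close to p1*(1 - p1)/N = 0.00105
```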

More generally, we can consider the total variation distance $\delta(P,Q)$. This equals the maximum difference between the probabilities that $P$ and $Q$ assign to any given event (and also equals half the $\ell_1$ distance). We can then define the corresponding estimator, call it $\hat\delta=\hat\delta(\bs p)$, in the naive way: $$\hat\delta(\bs p) = \max_{S\subseteq[n]}\sum_{i\in S} (\hat p_i-p_i)\equiv \max_{S\subseteq [n]}[\hat p(S) - p(S)],$$ where the maximum is taken over all subsets of $[n]\equiv\{1,...,n\}$. For each fixed $S\subseteq [n]$, Chebyshev's inequality gives $$\mathrm{Pr}(|\hat p(S) - p(S)|\ge \epsilon) \le \frac{\mathrm{Var}[\hat p(S)]}{\epsilon^2} = \frac{C(S,\bs p)}{N\epsilon^2},$$ where $C(S,\bs p) \equiv \sum_{i\in S} p_i \Big(1-p_i -2 \sum_{j\in S,\, j>i} p_j\Big) = p(S)\big(1-p(S)\big)$, since $N\hat p(S)\sim\mathrm{Binomial}(N,p(S))$. (Note that controlling $\hat\delta$ itself, i.e. the maximum over all $2^n$ subsets simultaneously, would require a union bound over the subsets rather than the worst single bound.)
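The identity behind this definition, that the total variation distance equals both the maximum over subsets and half the $\ell_1$ distance, can be verified by brute force for small $n$ (a Python sketch with made-up distributions):

```python
import numpy as np
from itertools import chain, combinations

def tv_by_subsets(p, q):
    """Total variation as max over all subsets S of [n] of |p(S) - q(S)|."""
    n = len(p)
    subsets = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
    return max(abs(sum(p[i] - q[i] for i in S)) for S in subsets)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(tv_by_subsets(p, q))         # maximum over subsets
print(0.5 * np.abs(p - q).sum())   # half the l1 distance; the two agree
```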

Maximising $C(S, \bs p)$ over $S$ and $\bs p$, we again get a bound of the form $N\ge C/(t\epsilon^2)$, as in the binary case.


As the sample size increases, density estimators do tend to come ever closer to the density of the population from which the sample was randomly drawn. (However, depending on the kernel and bandwidth used, KDEs can be computationally intricate.)

Here is an illustration using samples of sizes 50, 500, and 5000 from a gamma distribution with shape 5 and rate 0.1, so that the population mean is 50. The default kernel density estimator (KDE) in R was used to compute the density estimates (red). However, I am not sure how you propose to measure the fit of the KDE to the (possibly unknown) population density.

set.seed(2021)
x1 = rgamma(50, 5, .1)
x2 = rgamma(500, 5, .1)
x3 = rgamma(5000, 5, .1)


par(mfrow=c(3,1))
 hist(x1, prob=T, ylim=c(0,.022), col="skyblue2", main="n=50, GAMMA(5,.1)")
  lines(density(x1), lwd=2, col="red")
  curve(dgamma(x,5,.1), add=T, lwd=2, lty="dotted")
 hist(x2, prob=T, ylim=c(0,.022), col="skyblue2", main="n=500, GAMMA(5,.1)")
  lines(density(x2), lwd=2, col="red")
  curve(dgamma(x,5,.1), add=T, lwd=2, lty="dotted")
 hist(x3, prob=T, ylim=c(0,.022), col="skyblue2", main="n=5000, GAMMA(5,.1)")
  lines(density(x3), lwd=2, col="red")
  curve(dgamma(x,5,.1), add=T, lwd=2, lty="dotted")
par(mfrow=c(1,1))

I don't know your goals, but it may be worth mentioning that empirical CDFs (ECDFs) of samples often show a better fit to population CDFs than histograms or density estimators show to population PDFs. Additionally, the Kolmogorov-Smirnov test statistic $D$ measures the maximum vertical distance between the ECDF and the CDF. As for computational cost, the ECDF essentially involves only sorting the observations (and possibly plotting).

ks.test(x1, pgamma, 5, .1)

        One-sample Kolmogorov-Smirnov test

data:  x1
D = 0.089923, p-value = 0.7801
alternative hypothesis: two-sided

ks.test(x2, pgamma, 5, .1)

        One-sample Kolmogorov-Smirnov test

data:  x2
D = 0.037263, p-value = 0.4912
alternative hypothesis: two-sided


par(mfrow=c(1,2))
 plot(ecdf(x1), col="skyblue", main="n=50")
  curve(pgamma(x,5,.1), add=T, lwd=2)
 plot(ecdf(x2), col="skyblue", main="n=500")
  curve(pgamma(x,5,.1), add=T, lwd=2)
par(mfrow=c(1,1))