Probability: Estimating Database Size with N Smallest Random Values


Note: This was first posted on StackOverflow, but I did not have much luck there.

I have a very large database. Each item in the database is assigned an independent, uniformly distributed random value in $[0, 1)$. This database is so large that performing a COUNT (i.e., determining the total size of a query's result) does not yield acceptable performance, so I'd like to use the random values to estimate the total size of a query (COUNT) instead.

The idea is this:

  1. Run the query and order the results by the random value. The random values are indexed, so this is fast.
  2. Grab the lowest $N$ values and note how big the largest among them is, say $R$. Estimate COUNT as $N/R$.
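The two steps above can be sketched with an in-memory SQLite table standing in for the large database (the table and column names here are illustrative, not from the original post):

```python
import random
import sqlite3

random.seed(1)

# In-memory stand-in for the large table: each row gets a uniform
# random value in [0, 1), and that column is indexed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, rnd REAL)")
conn.executemany("INSERT INTO items (rnd) VALUES (?)",
                 [(random.random(),) for _ in range(100_000)])
conn.execute("CREATE INDEX idx_items_rnd ON items (rnd)")

# Steps 1-2: take the N smallest random values (fast via the index),
# let R be the largest of them, and estimate COUNT as N / R.
N = 1000
rows = conn.execute(
    "SELECT rnd FROM items ORDER BY rnd LIMIT ?", (N,)).fetchall()
R = rows[-1][0]
estimate = N / R
print(f"estimated COUNT: {estimate:.0f} (true: 100000)")
```

With $N = 1000$ the estimate typically lands within a few percent of the true 100,000 here.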

The question is two-part:

  1. Is $N / R$ the best way to estimate COUNT? Maybe it should be $(N+1)/R$? Could we use the other values (their average, variance, etc.), and not just the largest, to get a better estimate?
  2. What is the error margin on this estimated value of COUNT?
Best Answer

Let's restate the problem:
There are $n$ draws from a uniform distribution, with $n$ unknown. Given only the $k$ smallest values, find a good estimator $\hat{n}$ of $n$.

First of all, as your method implicitly assumes, the only thing that matters among the given numbers is the largest of them: $x$, the biggest of the $k$ smallest values, is a sufficient statistic for our sample. To prove it, note that conditional on $x$, the remaining $k-1$ values are uniformly distributed on $[0,x]$, so no value other than $x$ provides any additional information about $n$. Also note that (according to https://en.wikipedia.org/wiki/Order_statistic#Order_statistics_sampled_from_a_uniform_distribution) $x$ follows a $Beta(k,n-k+1)$ distribution. There are four estimators we can consider, three of which are very similar:
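A quick simulation supports the $Beta(k,n-k+1)$ claim: the empirical mean of $x$ should match the Beta mean $k/(n+1)$ (the values of $n$ and $k$ below are arbitrary):

```python
import random

random.seed(0)

n, k, trials = 500, 20, 4000

# x = k-th smallest of n uniform draws, sampled many times
xs = []
for _ in range(trials):
    draws = sorted(random.random() for _ in range(n))
    xs.append(draws[k - 1])

emp_mean = sum(xs) / trials
beta_mean = k / (n + 1)  # mean of Beta(k, n - k + 1)
print(f"empirical mean {emp_mean:.5f} vs Beta mean {beta_mean:.5f}")
```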

1) Your estimator: $$\hat{n}=\frac{k}{x}$$ Using a simple beta integral, one easily sees that the bias and variance of this estimator are: $$B[\hat{n}]=\frac{n}{k-1}$$ $$Var[\hat{n}]=\frac{k^2n(n-k+1)}{(k-2)(k-1)^2}$$
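As a sanity check, a short Monte Carlo run (with arbitrary $n$ and $k$) reproduces these bias and variance formulas:

```python
import heapq
import random

random.seed(2)

n, k, trials = 1000, 50, 5000

# n_hat = k/x, where x is the largest of the k smallest of n uniforms
ests = []
for _ in range(trials):
    x = heapq.nsmallest(k, (random.random() for _ in range(n)))[-1]
    ests.append(k / x)

mean_est = sum(ests) / trials
var_est = sum((e - mean_est) ** 2 for e in ests) / trials

bias_theory = n / (k - 1)
var_theory = k**2 * n * (n - k + 1) / ((k - 2) * (k - 1) ** 2)
print(f"bias: empirical {mean_est - n:.1f}, theory {bias_theory:.1f}")
print(f"var:  empirical {var_est:.0f}, theory {var_theory:.0f}")
```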
2) Method of moments: The expected value of $x$ is (by the above): $$\frac{k}{n+1}$$ So the MM estimator satisfies: $$x=\frac{k}{\hat{n}+1}$$ which is equivalent to: $$\hat{n}=\frac{k}{x}-1$$ This is just a shifted version of your estimator, so it has the same variance and a bias smaller by one (slightly better).

3) MVUE: Since our sample of size one is also a complete sufficient statistic, there is essentially only one unbiased estimator based on it, which therefore must have minimum variance among unbiased estimators. It is precisely: $$\hat{n}=\frac{k-1}{x}$$ Its bias is zero by construction, and its variance equals: $$\frac{n(n-k+1)}{k-2}$$ which is smaller than both of the above.
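Empirically comparing the three closed-form estimators (again with arbitrary $n=1000$, $k=50$), only $(k-1)/x$ should show a bias near zero:

```python
import heapq
import random

random.seed(3)

n, k, trials = 1000, 50, 5000

totals = {"k/x": 0.0, "k/x - 1": 0.0, "(k-1)/x": 0.0}
for _ in range(trials):
    # x: the largest of the k smallest of n uniform draws
    x = heapq.nsmallest(k, (random.random() for _ in range(n)))[-1]
    totals["k/x"] += k / x
    totals["k/x - 1"] += k / x - 1
    totals["(k-1)/x"] += (k - 1) / x

bias = {name: t / trials - n for name, t in totals.items()}
for name, b in bias.items():
    print(f"{name:9s} empirical bias {b:+8.1f}")
```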

4) Maximum likelihood estimator: The likelihood of observing $x$ is: $$\frac{x^{k-1}(1-x)^{n-k}}{Beta(k,n-k+1)}$$ Treating $n$ as a continuous parameter, the derivative of this with respect to $n$ equals: $$\frac{x^{k-1}(1-x)^{n-k}\left(H_n-H_{n-k}+\log(1-x)\right)}{Beta(k,n-k+1)}$$ so it vanishes precisely for those $n$ for which: $$H_n-H_{n-k}=\frac{1}{n-k+1}+\frac{1}{n-k+2}+\dots+\frac{1}{n}=-\log(1-x)$$ Since $n$ has to be a natural number, the answer is one of the two integers between which $H_n-H_{n-k}+\log(1-x)$ changes sign (the one for which the likelihood is bigger). There is no closed formula for $\hat{n}$ (or at least I don't know one), but because the maximizer is an integer, it can be computed exactly. On the other hand, the bias and variance are hard, perhaps impossible, to compute in closed form in this case.
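The integer maximizer can be found by direct search: the discrete log-likelihood increments $\log(1-x)+\log\frac{n+1}{n-k+1}$ are decreasing in $n$, so the likelihood is unimodal and the first $n$ at which it stops growing is the maximizer. A sketch (the function name is mine):

```python
import math

def mle_count(x, k):
    """Integer n maximizing the Beta(k, n-k+1) likelihood of x,
    the largest of the k smallest of n uniform draws."""
    def loglik(n):
        # log of x^(k-1) (1-x)^(n-k) / B(k, n-k+1), up to terms free of n
        return ((n - k) * math.log(1 - x)
                + math.lgamma(n + 1) - math.lgamma(n - k + 1))

    n = k  # smallest admissible value
    while loglik(n + 1) > loglik(n):
        n += 1
    return n

print(mle_count(0.01, 10))  # lands near k/x = 1000
```

For $x=0.01$, $k=10$ the search returns $999$ or $1000$ (the likelihood is exactly tied between them), close to both $k/x=1000$ and the method-of-moments value $k/x-1=999$.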

Personally, I would choose estimator $3$, because it is simple to compute and yet unbiased.