Frequency of the most frequent element using random sampling

98 Views Asked by At

Problem statement
I want to find the frequency $f_{max}$ of the most frequent element (mode) in a list $L$. The element itself is not needed. For example, for $L=[5,3,5,7,7,5,0]$, we have: $$f_{max}=\frac{|\{l \in L| l=max(L)\}|}{|L|}=\frac{3}{7}=0.43$$
For various reasons, I want to only observe a sample $S \subset L$ of size $|S|=s$ and extrapolate to the full list $L$. This sample $S$ is made by uniformly sampling the list $L$ with or without replacement. For this sample $S$, we compute an estimate $\hat{f}_{max}$. I now want to assess the veracity of the result in regard to the true frequency $f_{max}$ (probably in terms of error and/or confidence).

Litterature
In a paper I have been reading (Estimating the Confidence of Conditional Functional Dependencies, Cormode 2009), the authors propose the following Lemma (without much explanation actually):

Applying standard sampling arguments yields the following result:
Lemma 1. Given L, drawing sample $S \subset L$ such that $|S|=O(\frac{1}{\epsilon^2} log(\frac{1}{\delta}))$ allows us to find $\hat{f}_{max}$ so that:
$P(|\hat{f}_{max}-f_{max}|>\epsilon)<\delta$

I have taken the liberty to reformulate their original Lemma for simplicity. I have trouble understanding their process. I feel like this is very close to Hoeffding's inequality but, if this is the case, it seems to me that their result is wrong. Indeed, if we consider Bernouilli variables $X_i$ for each element $l_i \in L$ such that $X_i=1$ if $l_i$ is the most frequent element and $X_i=0$ otherwise, we have, indeed: $$f_{max}=\frac{1}{|L|}\sum_{i=0}^{|L|}X_i=E[\frac{1}{|S|}\sum_{i=0}^{|S|}X_i]=E[\hat{f}_{max}]$$

However, it is to my eye impossible to assign correctly the value of $X_i$ in $S$ as the most frequent element is $S$ in not necessarily the most frequent in $L$. There is maybe something I am not understanding correctly or maybe just an error in their proposition.

Question
I am looking for insights about the problem in general (your brain, some literature...) and/or the solution proposed above which, for now, seems wrong to me. Thank you!

1

There are 1 best solutions below

0
On

Something seems wrong. Maybe there needs to be more conditions?

If I start with your example $L=\{5,3,5,7,7,5,0\}$ and take samples of size $5$, then the mean of the sample estimate is not $3/7$. This is true whether it is sampling with or without replacement.

The mean and variance of samples of size 5 without replacement are 0.4571 and 0.0082 The mean and variance of samples of size 5 with replacement are 0.5430 and 0.0220

Of course, I can sample without replacement and take a sample of size 7, then find the correct $f_{max}$. Or, I can sample with replacement with samples of greater size and the estimate will converge to the true value. But, even if I sample with replacement with samples of size 7, the estimate is biased.

Sample size 7, the mean is 0.5116 and the variance is 0.0171.
Sample size 20, the mean is 0.454 and the variance is 0.0077.
Sample size 50, mean=0.4351, variance=0.00388

I used 1 million replications for each scenario.

Note: if you happen to know the most frequent element is, for example $5$. Then the estimate from the sample $\hat{f}_{max}=\frac{\text{number of 5's}}{\text{sample size}}$ is unbiased. For example, sampling with replacement with sample size 7, the mean is 0.42859 and the variance is 0.04886. So, your concern about the most frequent value in $S$ is not necessarily the most frequent value in $L$ is valid and that causes the estimate to be biased.