Computing the false positives for repeated Bloom filter queries

16 Views Asked by At

Suppose I have two sets, $A$ and $B$. I wish to compute the containment of $A$ in $B$ by checking what proportion of $A$'s elements are in $B$ (i.e $\epsilon = {|A \cap B| \over |B|}$). I do this by inserting the elements of $B$ into a Bloom filter and querying the filter for the all the elements of $A$.

Given that the Bloom filter has a probability $p$ of returning a false positive for a query and there are $L = |A|$ elements in $A$ (assume all elements in $A$ and $B$ are independent and unique) what is the probability that I will falsely report that $A$ is at least $90\%$ contained in $B$? That is, $\epsilon \ge 0.90$.

I figured out that in the $100\%$ containment case ($\epsilon = 1$), the probability would just be $p^L$ but I'm having trouble extending it to the $90\%$ case. Would it just be $\sum_{k=\text{floor}(0.90*L)}^{L} \text{Binomial}(n=L, k=k, p=p)$?

And as an extension to this problem, what would the probability be if I decided to sample uniformly from $A$ and $B$ to make the smaller sets $A', B'$? Then I insert $B'$ into a Bloom filter and query the elements of $A'$ as an approximate measure of containment for $A$ in $B$. My intuition says that there should be an additional source of false positive error due to the sampling, but I don't know how to express it mathematically.

Any help would be greatly appreciated.