You are monitoring a data stream that delivers $32$-bit values at a rate of $10$ megabytes per second. You know that either:
$A$: All values occur equally often, or
$B$: Half of the values occur $2^{10}$ times more often than the other half (but you don't know which values are the more common ones).
You are allowed to read the values off the data stream, but you only have $2^{20}$ bytes of memory.
Describe a method for determining which of the two situations, $A$ or $B$, occurs.
Roughly how many data values do you need to read to be confident of your result with a probability of $0.999$? [This is about the $3$ sigma level – $3$ standard deviations of a normal distribution.]
Please could I have some thoughts on the solution? It is a difficult problem.
In situation $A$, we're drawing from $2^{32}$ possible values. In situation $B$, the common half accounts for all but about a $2^{-10}$ fraction of the stream, so to a first approximation we're drawing uniformly from $2^{31}$ possible values.
The only reasonable thing to do is to count collisions.
Option A: Fill memory with samples, then count
Read the first $2^{18}$ values and store them in memory; at $4$ bytes each, they fill our $2^{20}$ bytes. To a first approximation, they are all different.
Now read $N$ more values, and increment a counter whenever a value collides with one of the stored values. The final count $c$ is approximately Poisson-distributed: in situation $A$ its mean and variance are $2^{-14}N$, and in situation $B$ they are $2^{-13}N$. Let us guess $A$ if $c<x$ and $B$ if $c>x$, for some decision boundary $x$.
We would like a wrong guess to be equally unlikely in both cases, so the boundary $x$ should sit a bit closer to the mean of $A$ (which has the smaller variance); but since everything here is "rough", let's put it halfway between the means, at a distance of $2^{-15}N$ from each. Now if we want that distance to be at least three standard deviations, using the worst-case variance of $2^{-13}N$, then $$2^{-15}N\geq3\sqrt{2^{-13}N}$$ $$2^{-30}N^2\geq9\times2^{-13}N$$ $$N \geq 9\times2^{17}= 1179648,$$ so we need about 1.2 million values. Adding the original $2^{18}$, we need about 1.5 million values in all. You could get away with fewer by calculating a better decision boundary, but since we're getting 10 MB/sec, it's not really worth the added effort.
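The store-then-count procedure can be sketched in a scaled-down simulation. Everything below is my own illustration: a universe of $2^{20}$ values stands in for $2^{32}$, we store $2^{12}$ samples instead of $2^{18}$, and the same algebra gives $9\times2^{11}$ extra reads with the boundary halfway between the expected collision counts, $72$ (situation $A$) and $144$ (situation $B$).

```python
import random

def draw_a(rng, bits=20):
    # Situation A: all 2^bits values equally likely.
    return rng.getrandbits(bits)

def draw_b(rng, bits=20):
    # Situation B, to first approximation: the common half dominates,
    # so draw uniformly from 2^(bits-1) values.
    return rng.getrandbits(bits - 1)

def guess(draw, rng, stored_count=2**12, n_extra=9 * 2**11, boundary=108):
    # Phase 1: store the first samples. Phase 2: count collisions.
    stored = {draw(rng) for _ in range(stored_count)}
    collisions = sum(1 for _ in range(n_extra) if draw(rng) in stored)
    return "A" if collisions < boundary else "B"

rng = random.Random(0)
print(guess(draw_a, rng))  # "A" with probability about 0.999
print(guess(draw_b, rng))  # "B" with probability about 0.999
```

At this scale a single run already answers correctly with roughly the target confidence, since the boundary sits $3$ standard deviations from the $B$ mean and a bit more than that from the $A$ mean.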
Actually, I doubt that this algorithm can keep up with the data stream: at $10$ MB/sec we'd have to look up $2.5$ million values per second, so it's probably limited by the time it takes to search the storage for the incoming values.
Option B: Bloom filter
A Bloom filter would be both faster and more space-efficient, allowing for a better balance between the storage phase and the test phase. It would also allow us to combine the two phases, testing for collisions with every addition. You could probably get under half a million values if you're clever about it. You'd want to tune the filter error rate carefully.
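A minimal sketch of the combined-phase idea, assuming a simple salted-hash Bloom filter. The filter size ($2^{23}$ bits, i.e. our $2^{20}$ bytes), the hash count, and the scaled-down universes in the demo are my own illustrative choices, not part of the answer above. Each incoming value is tested before insertion, so a hit is either a true repeat or a false positive; the false positives inflate both means roughly equally, which is why the error rate needs tuning but doesn't wreck the comparison.

```python
import random

class BloomFilter:
    """Minimal Bloom filter over integers: k salted hashes into m bits."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.salts = range(k)
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        return (hash((salt, value)) % self.m for salt in self.salts)

    def __contains__(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

def count_collisions(bits, rng, n, m_bits=2**23, k=7):
    """Read n values uniform over 2**bits, testing each against the filter
    before inserting it, so collisions accumulate as we go."""
    bf = BloomFilter(m_bits, k)
    hits = 0
    for _ in range(n):
        v = rng.getrandbits(bits)
        if v in bf:
            hits += 1   # true repeat, or (rarely) a false positive
        else:
            bf.add(v)
    return hits

# Scaled-down demo: a 2^20 universe for situation A, 2^19 for situation B
# (the rare half of B contributes almost nothing). B should show roughly
# twice as many collisions as A.
rng = random.Random(0)
print(count_collisions(20, rng, 20000))  # situation A
print(count_collisions(19, rng, 20000))  # situation B: noticeably larger
```

Because every value contributes to the collision count here, the expected counts grow quadratically in the number of values read rather than linearly, which is why this variant can reach a decision with fewer total reads than the two-phase version.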