Where does the bias come from in LogLog and HyperLoglog?

209 Views Asked by At

In the original LogLog paper from P. Flajolet, the cardinality estimation function is presented as

$$E := \alpha_m m2^{1 \over m} {^{\sum_{j} M^{(j)}}}$$

Where $\alpha$ is needed to correct the systematic bias in the asymptotic limit which cause overestimates.

However I am having trouble identifying where does this bias come from.

My intuition tells me this is related to the fact we retain the $max$ count of leading zeros, which is probably skewed to higher values.

Wikipedia states this is due to the risk of hash collisions, but that seems wrong as hash collision would underestimate not overestimates (this is actually an improvement in the HyperLogLog paper)

Has anyone a more explicit definition (ideally in layman terms) or reason explaining this bias?

Thank you!