Problem understanding the math in HyperLogLog


I have trouble understanding how the math works in the HyperLogLog algorithm. More specifically, I cannot see how the authors get formula 5 from formula 4 in the HyperLogLog paper (page 132).

Formula 4 says

$\mathbb{E}_n(Z) = \sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\sum_{n_1+...+n_m=n}\begin{pmatrix}n\\n_1,...,n_m\end{pmatrix}\frac{1}{m^n}\prod_{j=1}^m\gamma_{n_j,k_j}$, where $\gamma_{n_j,k_j} = (1-\frac{1}{2^{k_j}})^{n_j}-(1-\frac{1}{2^{k_j-1}})^{n_j}$
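As a sanity check on the $\gamma_{n_j,k_j}$ appearing here: for fixed $n \geq 1$, summing $\gamma_{n,k}$ over $k \geq 1$ telescopes to $1$, consistent with reading $\gamma_{n,k}$ as the probability that the largest of $n$ independent Geometric(1/2) values equals $k$. A short Python check (my own, not from the paper):

```python
def gamma(n, k):
    # gamma_{n,k} = (1 - 2^{-k})^n - (1 - 2^{-(k-1)})^n
    return (1 - 2.0**-k)**n - (1 - 2.0**-(k - 1))**n

# For each fixed n >= 1 the sum over k telescopes:
#   sum_{k=1}^{K} gamma_{n,k} = (1 - 2^{-K})^n -> 1 as K -> infinity.
for n in (1, 5, 100):
    total = sum(gamma(n, k) for k in range(1, 200))
    assert abs(total - 1.0) < 1e-12, (n, total)
print("sum_k gamma_{n,k} = 1 for n in (1, 5, 100)")
```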

The paper then assumes $n$ follows a Poisson distribution with rate $\lambda$, i.e. $\mathbb{P}(N=n) = e^{-\lambda}\frac{\lambda^n}{n!}$, and obtains formula 5

$\mathbb{E}_{P(\lambda)}(Z) = \sum_{n\geq 0}\mathbb{E}_n(Z)e^{-\lambda}\frac{\lambda^n}{n!}=\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^mg(\frac{\lambda}{m2^{k_j}})$, where $g(x) = e^{-x}-e^{-2x}$.
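Before diving into the algebra, the claimed equality can be checked numerically. The sketch below (my own verification, not from the paper) instantiates both sides for $m = 2$ registers, truncating the sums over $n$ and over the $k_j$ at ad hoc limits; since the identity holds term by term in $(k_1, k_2)$, the two truncations should match closely:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def gamma(n, k):
    # gamma_{n,k} from formula 4
    return (1 - 2.0**-k)**n - (1 - 2.0**-(k - 1))**n

def g(x):
    # g(x) = e^{-x} - e^{-2x} from formula 5
    return math.exp(-x) - math.exp(-2.0 * x)

m, lam = 2, 5.0   # ad hoc choices: 2 registers, Poisson rate 5
K, N = 30, 50     # ad hoc truncation limits for the k- and n-sums

def E_n(n):
    # Formula 4, written out for m = 2 registers, k-sums truncated at K.
    total = 0.0
    for k1 in range(1, K):
        for k2 in range(1, K):
            inner = sum(math.comb(n, n1) * gamma(n1, k1) * gamma(n - n1, k2)
                        for n1 in range(n + 1)) / m**n
            total += inner / (2.0**-k1 + 2.0**-k2)
    return total

# Left side of formula 5: Poisson mixture of formula 4.
lhs = sum(E_n(n) * math.exp(-lam) * lam**n / math.factorial(n) for n in range(N))

# Right side of formula 5, truncated at the same K.
rhs = sum(g(lam / (m * 2.0**k1)) * g(lam / (m * 2.0**k2)) / (2.0**-k1 + 2.0**-k2)
          for k1 in range(1, K) for k2 in range(1, K))

print(lhs, rhs)  # the two truncated sums should agree to float precision
```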

The paper says this is obtained by simple series rearrangements, but I fail to see how. Any help is greatly appreciated!


Best Answer

Here is the derivation; it repeatedly uses the exponential series identity $e^x = \sum_{n\geq 0} \frac{x^n}{n!}$, once per register:

\begin{aligned}
\mathbb{E}_{P(\lambda)}(Z) &= \sum_{n\geq 0}e^{-\lambda}\frac{\lambda^n}{n!}\mathbb{E}_n(Z) \\
&= \sum_{n\geq 0}e^{-\lambda}\frac{\lambda^n}{n!}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\sum_{n_1+...+n_m=n}\begin{pmatrix}n\\n_1,...,n_m\end{pmatrix}\frac{1}{m^n}\prod_{j=1}^m\gamma_{n_j,k_j} \\
&= e^{-\lambda}\sum_{n\geq 0}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\sum_{n_1+...+n_m=n}\frac{\lambda^{n_1+n_2+\cdots+n_m}}{n_1!n_2!\cdots n_m!}\frac{1}{m^n}\prod_{j=1}^m\gamma_{n_j,k_j} \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\sum_{n\geq 0}\sum_{n_1+...+n_m=n}\frac{\left(\frac{\lambda}{m}\right)^{n_1+n_2+\cdots+n_m}}{n_1!n_2!\cdots n_m!}\prod_{j=1}^m\gamma_{n_j,k_j} \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\sum_{n_1\geq 0}\sum_{n_2\geq 0}\cdots\sum_{n_m\geq 0}\frac{\left(\frac{\lambda}{m}\right)^{n_1+n_2+\cdots+n_m}}{n_1!n_2!\cdots n_m!}\prod_{j=1}^m\gamma_{n_j,k_j} \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m\sum_{n_j\geq 0}\frac{\left(\frac{\lambda}{m}\right)^{n_j}}{n_j!}\gamma_{n_j,k_j} \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m\sum_{n_j\geq 0}\frac{\left(\frac{\lambda}{m}\right)^{n_j}}{n_j!}\left(\left(1-\frac{1}{2^{k_j}}\right)^{n_j}-\left(1-\frac{1}{2^{k_j-1}}\right)^{n_j}\right) \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m\left(\sum_{n_j\geq 0}\frac{\left(\frac{\lambda}{m}\left(1-\frac{1}{2^{k_j}}\right)\right)^{n_j}}{n_j!}-\sum_{n_j\geq 0}\frac{\left(\frac{\lambda}{m}\left(1-\frac{1}{2^{k_j-1}}\right)\right)^{n_j}}{n_j!}\right) \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m\left(e^{\frac{\lambda}{m}\left(1-\frac{1}{2^{k_j}}\right)}-e^{\frac{\lambda}{m}\left(1-\frac{1}{2^{k_j-1}}\right)}\right) \\
&= e^{-\lambda}\sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m e^{\frac{\lambda}{m}}\left(e^{-\frac{\lambda}{m2^{k_j}}}-e^{-\frac{\lambda}{m2^{k_j-1}}}\right) \\
&= \sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m\left(e^{-\frac{\lambda}{m2^{k_j}}}-e^{-\frac{\lambda}{m2^{k_j-1}}}\right) \\
&= \sum_{k_1,...,k_m\geq 1}\frac{1}{\sum_{j=1}^m2^{-k_j}}\prod_{j=1}^m g\!\left(\frac{\lambda}{m2^{k_j}}\right)
\end{aligned}

In the second-to-last step, the factor $\prod_{j=1}^m e^{\frac{\lambda}{m}} = e^{\lambda}$ cancels the $e^{-\lambda}$ in front.

By the way, a simpler derivation of the HyperLogLog algorithm that does not involve complex analysis can be found in my paper https://arxiv.org/pdf/1702.01284.pdf.