How to find the Expected height of a randomly built binary tree

408 Views Asked by At

I would like to find out the Expected height of a binary tree where the insertions are based on a random function. I.e. for each node I visit, there is a $\frac{1}{2}$ probability of choosing right or left. I know that the following property holds for height $h$, but it's difficult to add the probability:$$h_{tree}= 1+max(h_{left}, h_{right})$$I think that this version/random tree differs from the random (search) tree mentioned in CLRS chapter 12.4, where you pick a random element from a sorted list $\{1,\dots, n\}$ and insert based on whether the visited node is greater or less than the inserted element. Because, here we choose each path on each visited node to be random.

Note: the binary tree has all its elements at the leafs and internal nodes are only used for routing.

//pseudocode
insert(i, tree):
    if at leaf v:
        split(); //Create a parent u and set its children to be the leaf v and element i.
    else:
        int left = random()
        int right = random()
        if (left > right):
            insert(i, left-subtree):
        else:
            insert(i, right-subtree)

See this figure (also in the link on CS stackexhange - https://cs.stackexchange.com/questions/136582/how-to-find-the-expected-height-of-a-randomly-built-binary-tree)

enter image description here

1

There are 1 best solutions below

5
On

Assuming that I have understood the question correctly, let $H_n$ be a random variable which equals the height of the tree with $n\ge 1$ elements. Then $H_1=0$, $H_2=1$, $H_3=2$ and, more generally, for all integers $m\ge 0$, $$ H_{m+2}=_d1+\max(H_{k+1}, J_{m-k+1}), \qquad \hbox{where}\ \ \ k \sim {\rm Binomial}(m, \frac 1 2), $$ $$ \hbox{$J_{m-k+1}=_d H_{m-k+1},\ \ $ and $H_{k+1}$ and $J_{m-k+1}$ are independent conditioned on $k$. $\ $ (*)} $$ This is because, after adding the first two elements, which must go on the left and right sides of the root, each of the next $m$ elements added will each go, independently, on either the left or right sides of the root, with probability $\frac 1 2$ for each case. This gives two subtrees with $k+1$ and $m-k+1$ elements, where $k$ is the number of the $m$ elements which went left. Observe that we always have $H_n\le n-1$.

For integers $i\ge 0$, let $p_{ni}$ be the probability that $H_n$ is $i$ or less. Then $H_1=0$ together with (*) immediately gives the recurrence $$ p_{ni}=\left\{\begin{array}{ccc} 1,&\ \ &\hbox{if $n=1$,}\\ 0,&\ \ &\hbox{if $n>1$ and $i=0$,}\\ 2^{-(n-2)} \sum_{0\le k\le n-2} \binom{n-2}{k} p_{(k+1)(i-1)} p_{(n-k-1)(i-1)},&\ \ &\hbox{otherwise.} \end{array}\right\}\ \ (!) $$ This can be used to compute the expected value of $H_n$ by using $$ {\Bbb E}H_n=\sum_{i\ge 0} {\Bbb P}(H_n>i) = \sum_{i\ge 0} (1 - p_{ni}).\qquad(@) $$ What is the approximate value of ${\Bbb E} H_n$ for large $n$? I will prove that $$ {\Bbb E} H_{n+1}=\log_2 n + O((\log n)^{2/3}),\qquad n\ge 1, $$ although the error term could surely be improved.

It's clear that $H_{n+1}\ge \log_2 (n+1)\ge \log_2 n$ always. To prove an upper bound on $H_{n+1}$, I will prove by induction on $i$ that $$ p_{(n+1)i}\ge 1-e^{-A (i-B)} n^c, \qquad \hbox{for all integers $n\ge 0$, $i\ge 0$},\ \ (+) $$ where $c:=A(1+\epsilon)/\log 2$, for appropriately chosen $0<\epsilon\le 1$, $A>0$ and $B>0$. If $n=0$, (+) is clear as the left-hand side is 1. Otherwise let $n=m+1$, $m\ge 0$. If $i=0$, (+) is true since $-A(i-B)=AB\ge 0$, so the right-hand side of (+) is nonpositive; if $m=0$ and $i>0$, (+) is true since the left-hand side is 1. Otherwise, by (*) or (!), $$ p_{(m+2)i}={\Bbb E}(p_{(k+1)(i-1)}p_{(m-k+1)(i-1)}) $$ where the expectation value is taken over the binomially distributed $k$, so by the induction hypothesis \begin{eqnarray*} p_{(m+2)i}&\ge& {\Bbb E}[\max(1-e^{-A ((i-1)-B)} k^c, 0)\\ &&\ \ \ \ \max(1-e^{-A ((i-1)-B)} (m-k)^c, 0)]\\ &\ge& 1-e^{-A(i-1-B)} {\Bbb E}[k^c+(m-k)^c]\\ &=& 1-2 e^{-A(i-1-B)} {\Bbb E}(k^c). \end{eqnarray*} Then, by Corollary 1 here, $$ {\Bbb E}(k^c)\le (\frac m 2)^c e^{c^2/m}, $$ so $$ p_{(m+2)i} \ge 1-2 e^{-A(i-1-B)} (\frac m 2)^c e^{c^2/m}, $$ which will complete the induction if $$ 2 e^A 2^{-c} e^{c^2/m} \le (1+\frac 1 m)^c, $$ or certainly if $$ 2 e^{-A \epsilon} e^{c^2/m} \le 1. $$ We can satisfy this by setting $A:=\epsilon^{-1}$ as long as $$ m\ge m_0:=\frac {c^2}{1 - \log 2} = \frac 1 {1 -\log 2} (\frac {1 + \epsilon^{-1}}{\log 2})^2. $$ In order to complete the induction for $m<m_0$, we set $B:=m_0+1$; then if $m<m_0$, $H_{m+2}\le m+1< m_0+1=B$, so if $i\ge B$, $p_{(m+2)i}=1$, automatically satisfying (+); if $i<B$, $-A(i-B)>0$, so the right-hand side of (+) is nonpositive. This completes the proof of (+).

Now from (@) and (+), \begin{eqnarray*} {\Bbb E}H_{n+1}&=& \sum_{i\ge 0} (1 - p_{(n+1)i})\\ &\le& I_0 + \sum_{i\ge I_0} e^{-A(i-B)} n^c, \end{eqnarray*} where $I_0:=\lceil m_0+(1+\epsilon)\log_2 n + 2\rceil$; then $$ e^{-A(I_0-B)} n^c \le e^{-1/\epsilon} $$ so \begin{eqnarray*} {\Bbb E}H_{n+1}&\le& I_0 + e^{-1/\epsilon} + e^{-2/\epsilon} + \cdots\\ &\le& I_0 + \epsilon \\ &\le& \frac 1 {1 -\log 2} (\frac {1 + \epsilon^{-1}}{\log 2})^2+(1+\epsilon)\log_2 n + 3 + \epsilon\\ &\le& K \epsilon^{-2} + (1 + \epsilon) \log_2 n, \end{eqnarray*} for some constant $K>0$. Setting $\epsilon:=(\log (n+2))^{-1/3}$ now gives the desired bound on ${\Bbb E}H_{n+1}$.