Proof that "LazySelect" algorithm finds the $k$-th smallest element of a set with probability $1-O(n^{\frac{1}{4}})$ on the first pass through.

88 Views Asked by At

I was reading a book about randomized algorithms and stumbled upon the following claim, which I would like to prove in detail but didn't succeed:

LazySelect Algorithm

Input: A set of $n$ elements form a totally ordered universe, and a integer $k\in\{1,\ldots,n\}$.

Output: The $k$-th smallest element of $S$, denoted as $S_{(k)}$.

  1. Pick $n^{\frac{3}{4}}$ elements form S, chosen independently and uniformly at random with replacement; call this multiset of elements $R$.
  2. Sort R in $O(n^{\frac{3}{4}}\log{n})$ steps.
  3. Set $x:=kn^{-\frac{1}{4}}$ and $l:=\max\{x-\sqrt n,1\}$ and $h:=\min\{x+\sqrt n, n^{\frac{3}{4}}\}$.
  4. By comparing $a:=R_{(l)}$ and $b:=R_{(h)}$ to every element of $S$, we determine $rank_S(a)$ and $rank_S(b)$.
  5. $$Q:=\begin{cases} & S\cap(-\infty,R_{(h)}]\text{ ,if } k<n^{\frac{1}{4}}\log(n) \\ & S\cap[R_{(l)},R_{(h)}]\text{ ,if } > k\in[n^{\frac{1}{4}}\log(n),n-n^{\frac{1}{4}}\log(n)]\\ & S\cap[R_{(l)},\infty)\text{ ,if } k>n-n^{\frac{1}{4}}\log(n) \end{cases}$$
  6. Check whether $S_{(k)}\in Q$ (by comparing $k$ with $rank_S(a)$ and $rank_S(b)$) and $|Q|\leq 4n^{\frac{3}{4}}+2$. If not repeat steps 1-5 until such a set $Q$ is found.
  7. By sorting $Q$ in $O(|Q|\log(|Q|))$ steps, identify $Q_{k-rank_S(a)+1}$, which is $S_{(k)}$.

Now i try to proof the following claim:

The algorithm finds $S_k$ with probability $1-O(n^{\frac{1}{4}})$ on the first pass through 1-6.

First i try to proof that the probability is $1-O(n^{\frac{1}{4}})$. Set $r:=|R|=n^{\frac{3}{4}}$. Lets denote $R_1,R_2,\ldots,R_r$ as the individual drawings of $R$. Now we define the counter-r.v. ... $$Y_n=\sum^r_{i=1}\mathbb{I}_{\{R_i\leq S_{(k)}\}}$$

... which counts the drawings of $R$ smaler than $S_{(k)}$. Obviously $Y_n\sim B(r,\frac{k}{n})$ is Bernoulli distributed. Therefore $E(Y_n)=r\frac{k}{n}=x$ and $Var(Y_n)=r\frac{k}{n}\left(1-\frac{k}{n}\right)\leq \frac{n^{\frac{3}{4}}}{4}$ applies.

Lets take a look at the case $k\in[n^{\frac{1}{4}}\log(n),n-n^{\frac{1}{4}}\log(n)]$. The algorithm will fail in step 6, if $S_{(k)}<R_{(l)}$ or $R_{(h)}<S_{(k)}$ occurs, or if $Q$ is to large. The sum of the probabilities of these three events must be less than $n^{\frac{1}{4}}$. Lets calculate those probabilitys by using the Tschebychev-inequality:

\begin{align*} P(S_{(k)}<R_{(l)})&=P(Y_n<l)=P(Y_n<x-\sqrt{n})\\ &\leq P(-(Y_n-x)>\sqrt{n})\\ &\leq P(|Y_n-E(Y_n)|>\sqrt{n})\\ &\leq\frac{1}{\sqrt{n}^2}Var(Y_n)=\frac{1}{n}\frac{n^{\frac{3}{4}}}{4}=\frac{1}{4\sqrt[4]{n}} \end{align*}

\begin{align*} P(R_{(h)}<S_{(k)})&=P(Y_n>h)=P(Y_n>x+\sqrt{n})\\ &\leq P(Y_n-x>\sqrt{n})\\ &\leq P(|Y_n-E(Y_n)|>\sqrt{n})\\ &\leq\frac{1}{\sqrt{n}^2}Var(Y_n)=\frac{1}{n}\frac{n^{\frac{3}{4}}}{4}=\frac{1}{4\sqrt[4]{n}} \end{align*}

In case $l=1$ or $h=r$ we get:

\begin{align*} P(S_{(k)}<R_{(l)})&=P(Y_n=0)=\left(1-\frac{k}{n}\right)^r=\left(1-\frac{k}{n}\right)^{n^{\frac{3}{4}}}\\ &\leq\left(1-\frac{\sqrt[4]{n}\log{n}}{n}\right)^{n^{\frac{3}{4}}}=\left(1-\frac{\log{n}}{\sqrt[4]{n}}\right)^{n^{\frac{3}{4}}}\\ &\leq e^{(-\log(n))}=e^{(\log(\frac{1}{n}))}=\frac{1}{n}\leq\frac{1}{\sqrt[4]{n}} \end{align*}

$$P(R_{(h)}<S_{(k)})=P(Y_n=r)=\left(\frac{k}{n}\right)^r=\left(\frac{k}{n}\right)^{n^{\frac{3}{4}}}\leq\frac{1}{n^{n^{\frac{3}{4}}}}\leq\frac{1}{n}\leq\frac{1}{\sqrt[4]{n}}$$

The other two cases ins step 5 are analogous. So it remains to calculate the probability that $Q$ is too large.

Now another lecture script is claiming that:

For all $i$ and $j$ is evident that: $$ \{|Q|>i+2\}=\{|S\cap[R_{(l)},R_{(h)}]| >i+2\}\subset\{R_{(l)}\leq S_{(j)}\}\cup\{S_{(j+(i+2))}\leq R_{(h)}\}$$

And for $i=4n^{\frac{3}{4}}$ and $j=k-2n^{\frac{3}{4}}$ this leads us to:

$$P(|Q|>4n^{\frac{3}{4}}+2)=P(R_{(l)}\leq S_{(k_l)})+P(S_{(k_h)}\leq R_{(h)}))$$

... where $k_l:=\min\{k-2n^{\frac{3}{4}},1\}$ and $k_h:=\min\{k+2n^{\frac{3}{4}},n\}$. And for this, the analysis is very similar to that above, with $K_l$ and $K_h$ playing the role of $k$.

But is that really true? If i recalculate $i+j+2$ I get $k+2n^{\frac{3}{4}}+2$ and not $k+2n^{\frac{3}{4}}$! What happened to the $+2$? Beside that, i would try to calculate the probability (for $k_l\neq1$ and $k_h\neq n$) as follows

\begin{align*} P(|Q|>4n^{\frac{3}{4}}+2)&=P(R_{(l)}\leq S_{(k_l)})+P(S_{(k_h)}\leq R_{(h)})) \\ &=1-P(R_{(l)}> S_{(k_l)})+1-P(S_{(k_h)}> R_{(h)}))\\ &=1-\frac{1}{4\sqrt[4]{n}}+1-\frac{1}{4\sqrt[4]{n}}\\ &=2-\frac{1}{2\sqrt[4]{n}} \end{align*}

But this cant be true. What am i doing wrong?

I would guess that i can show that $P(|Q|>4n^{\frac{3}{4}}+2)\leq\frac{1}{2\sqrt[4]{n}}$ such that ...

$$P(S_{(k)}<R_{(l)})+P(R_{(h)}<S_{(k)})+P(|Q|>4n^{\frac{3}{4}}+2)\leq \frac{1}{4\sqrt[4]{n}}+\frac{1}{4\sqrt[4]{n}}+\frac{1}{2\sqrt[4]{n}}=\frac{1}{\sqrt[4]{n}}$$

... which would proof the claim.

Any assistance would be much appreciated!