classification of rational and irrational via one-nearest neighbor

417 Views Asked by At

Consider the binary classification setting, where the input space is $[0,1]$ and the labels are $\{0,1\}$ Assume the distribution on the input-label $(X,Y)$ pair is as follows. Let $\Pr(Y=0)=\Pr(Y=1)=\frac{1}{2}$. If $Y=0$, then the distribution of $X$ is uniform on $[0,1]$. Then, if $Y=1$, the support of the distribution of $X$ is rational numbers in $[0,1]$ such that every rational number has a positive probability. Let $S_n=\{(X_1,Y_1),\dots,(X_n,Y_n)\}$ denotes $n$-iid samples from this distribution.

Fix an irrational $x \in [0,1]$. Consider the event $E_n$ which is the nearest point ( in usual distance) in $S_n$ to $x$ is a rational. We want to show that $\lim_{n\to \infty} \Pr(E_n)=0$.

Note that the distribution on the input is fixed and does not change with $n$. This is Problem 5.38 from Probabilistic Theory of Pattern Recognition by Devroye, Györfi, and Lugosi.

3

There are 3 best solutions below

2
On BEST ANSWER

I will prove the result under an additional hypothesis that $\mathbb E|Q-x|^{-2}<\infty$ where $Q$ has the same distribution as $X$ given $Y=1$ and $x$ is the fixed irrational number being considered. I believe this is a reasonable assumption since the authors themselves seem to have a specific distribution in mind which does satisfy this property, namely $I,J$ independent geometric random variables and $$ Q=\frac{\min(I,J)}{\max(I,J)}, $$ in which case all the negative moments $\mathbb E|Q-x|^{-k}<\infty$ exist, not just $k=2$.

Formally let $\{(X_i,Y_i)\}_{i=1}^{\infty}$ be an infinite iid sequence with the given distribution. This allows us to consider the events $\{E_n\}_{n=1}^{\infty}$ as being defined on the same probability space. Thus by Borel-Cantelli it suffices to show that $$ \sum_{n=1}^{\infty}\mathbb P(E_n)<\infty. $$ Now we conclude using the same technique as Mike Earnest in the other partial answer by observing that \begin{align*} \mathbb P(E_n)&\leq n\ \mathbb P\left(Y_1=1\text{ and } |X_1-x|=\min_{1\leq i\leq n} |X_i-x|\right)\\ &\leq\sum_{q\in \mathbb Q}\mathbb P(X_1=q)\ n\left(1-\frac{|q-x|}{2}\right)^{n-1}, \end{align*} thus by Fubini-Tonelli we have that \begin{align*} \sum_{n=1}^{\infty}\mathbb P(E_n)&\leq \sum_{n=1}^{\infty}\sum_{q\in\mathbb Q}\mathbb P(X_1=q)\ n\left(1-\frac{|q-x|}{2}\right)^{n-1}\\ &=\sum_{q\in \mathbb Q}\mathbb P(X_1=q)\sum_{n=1}^{\infty}n\left(1-\frac{|q-x|}{2}\right)^{n-1}\\ &=\sum_{q\in \mathbb Q}\mathbb P(X_1=q)\left(\frac{2}{|q-x|}\right)^2=4\mathbb E|Q-x|^{-2}. \end{align*}


To justify my claim that something like this may have been what the authors had in mind, I share the following short snippets from page 80 of the text which sets the problem in context: geometric rational and further down on the page problem 5.38


Lastly, while I am not sure if the claim is true without some negative moment condition on $Q$, I will point out that in general the Hewitt-Savage zero-one law applies to the event $\{E_n\textrm{ occurs infinitely often}\}$, so even if Borel-Cantelli is insufficiently strong to show the bound, all that one needs to establish is that $\mathbb P(E_n\textrm{ occurs infinitely often})<1$ to conclude.

If one wanted to try to construct a counterexample to show that some moment condition is needed on $Q$, I suggest to start by looking at Marcus M's answer here: classification of rational and irrational via one-nearest neighbor

16
On

Not a complete solution

Let $q_1,q_2,\dots,$ be a complete list of the rational nubmers in $[0,1]$, and let $p_k=P(X=q_k\mid Y=1)$. We have $$ \begin{align} P(E_n) &=P\left(\bigcup_{i=1}^n \{\text{$X_i\in \mathbb Q$ and $X_i$ is closest point to $x$}\}\right) \\&\le nP( \text{$X_1\in \mathbb Q$ and $X_1$ is closest point to $x$}) \\&= n\sum_{k=1}^\infty P(X_1=q_k)P\left(\bigcap_{j=2}^n\{|X_j-x|>|q_k-x|\}\right) \\&= \sum_{k=1}^\infty \tfrac12\cdot p_k\cdot n\cdot P\left(|X_2-x|>|q_k-x|\right)^{n-1} \end{align} $$ We can write $$ P(|X_2-x|>|q_k-x|)\le 1-\tfrac12|q_k-x| $$ because for the complementary event that $X_2$ is closer to $x$ than $q_k$, a lower bound is the stricter event that $X_2$ is irrational and between $q_k$ and $x$, which occurs with probability $\frac12|q_k-x|$.

So far, we have the bound $$ P(E_n)\le \sum_{k=1}^\infty \tfrac12\cdot p_k\cdot n\cdot (1-\tfrac12 |q_k-x|)^{n-1} $$ We would be done if we could use the Domintated Convergence THeorem to switch the $\lim_{n\to\infty}$ and $\sum_{k=1}^\infty$. To do this, we need a dominating function, which is where I am stuck.

0
On

I’m not entirely sure if this is true in this exact wording.

This is not very formalized and thought trough, but consider a sequence $q_n\to x$ (monotonic) and $p_q$ so that $p_{q_{n+1}} \geq 2|q_n-q_{n+1}|$.

For example choose $p_{q_{n+1}} = 3|q_n-q_{n+1}| = \pm3(q_n-q_{n+1})$. The total propability is then $$ p_{q_0}+\pm 3(q_0-x) $$ which can be chosen to be smaller than $1$. So this leads to a possible configuration.

If $X$ in uniformly distributed on $[0,1]$ this means that $$\mathbb P(|q_n-x| > |X-x| > |q_{n+1}-x|) < p_{n+1} $$ (in our case $=2/3 p_{n+1}$).

Under independence this means that $$ \mathbb P(|X-x|<|q_n-x|) < \sum_{k=n+1}^\infty p_{q_k} $$

But this means that the expected time for the first occurance of something closer to $x$ than $q_n$ is expected to be smaller if $Y=1$ than if $Y=0$.

This should mean that it is highly probable that if the current closest hit is between the distance of $q_n$ and $q_{n+1}$ the next closer hit will most likely be with $Y=1$, so a rational number.

Of course I could be messing up something, but I do have some doubt that this is true. What can be said very easily though is that the set of $x$ for which this is not true is at most a set of measure $0$.