About setwise convergence on open sets of nearest neighbor algorithm.

104 Views Asked by At

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $(\mathcal{X},d)$ be a complete, separable, locally compact metric space. Suppose that $X,X_1,X_2,X_3,... : \Omega\to\mathcal{X}$ are $\mathbb{P}$-i.i.d. random variables.

Define: $$\forall m\in\mathbb{N}, \pi_m: \mathcal{X}\times\mathcal{X}^m\to\{1,...,m\}, (x,x_1,...,x_m)\mapsto \min\left(\operatorname{argmin}_{k\in\{1,...,m\}}\left(d\left(x,x_1\right),...,d\left(x,x_m\right)\right)\right).$$ Define: $$\forall m\in\mathbb{N}, Z_m:\Omega\to\mathcal{X}, \omega\mapsto X_{\pi_m\left((X(\omega),X_1(\omega),...,X_m(\omega)\right)}(\omega).$$

If $A$ is a open set of $(\mathcal{X},d)$, is it true that: $$\limsup_{m\to+\infty}\mathbb{P}_{Z_m}(A)\le\mathbb{P}_{X}(A)?$$

Edit 1: or maybe that there exists a constant $C>0$ independent of $A$ such that: $$\limsup_{m\to+\infty}\mathbb{P}_{Z_m}(A)\le C\cdot \mathbb{P}_{X}(A)?$$

If it is false in general, what if we add the hypothesis that $\mathcal{X}=\mathbb{R}^n$, $d$ is the Euclidean distance and $\mathbb{P}_X$ is absolutely continuous w.r.t. Lebesgue measure in $\mathbb{R}^n$?

Edit 2: in this last case, we have that if $B$ is a ball of $\mathbb{R}^n$, then $\mathbb{P}_X(\partial B)=0$ so, since $\mathbb{P}_{Z_m}\to \mathbb{P}_{X}$ in distribution (as explained below by WoolierThanThou), we have that $\mathbb{P}_{Z_m}(B)\to\mathbb{P}_{X}(B), m\to \infty$. Now, since by Besicovitch covering theorem there exist an universal constant $C_n\in\mathbb{N}$ such that every open subset $A$ is the union of at most $C_n$ open sets $A_1,...,A_{C_n}$ each of them is a disjoint countable union of open balls, say $A_i = \cup_{j\in I_i} B_{i,j}$, we have that: $$\mathbb{P}_{Z_m}(A)\le \sum_{i=1}^{C_n}\mathbb{P}_{Z_m}(A_i)= \sum_{i=1}^{C_n}\sum_{j\in I_i}\mathbb{P}_{Z_m}(B_{i,j}) = (*)$$ Now, if only we could exchange the limit and the series, we obtain that $$(*)\to \sum_{i=1}^{C_n}\sum_{j\in I_i}\mathbb{P}_{X}(B_{i,j})= \sum_{i=1}^{C_n}\mathbb{P}_{X}(A_i)\le \sum_{i=1}^{C_n}\mathbb{P}_{X}(A) = C_n \mathbb{P}_{X}(A) $$ Can anyone see a reason why could we exchange the limit an the series?

2

There are 2 best solutions below

3
On

Well, $Z_m$ converges to $X$ in probability! For any, $\varepsilon>0,$ we have, by independence and identical distribution,

$\mathbb{P}(d(Z_m,X)\geq \varepsilon)=\mathbb{P}( \cap_{k=1}^m (d(X_k,X)\geq \varepsilon))=\mathbb{P}( d(X_1,X)\geq \varepsilon)^m$,

which goes to $0$, so long as we can show that $\mathbb{P}( d(X_1,X)\geq \varepsilon)\neq 1$. Clearly, the universe is broken if this isn't true, but you can note that if $(x_n)_{n\in \mathbb{N}}$ is a dense, countable subset of $\mathcal{X}$, then $\mathcal{X}=\cup_{n=1}^{\infty}B(x_n,\frac{\varepsilon}{2}),$ and hence, there must be some $n_0\in\mathbb{N}$ such that $\mathbb{P}(X\in B(x_{n_0},\frac{\varepsilon}{2}))\neq 0.$ Hence, by the fact that they're i.i.d., $\mathbb{P}(X,X_1\in B(x_{n_0},\frac{\varepsilon}{2}))\neq 0$, establishing the desired.

Of course, one might ask, does convergen in probability still imply weak convergence of the underlying measures? Well, assume $Y_n\to Y$ in probability with $Y_n$ and $Y$ $\mathcal{X}$-valued and fix $C\subseteq \mathcal{X}$ closed. Then, for any $\varepsilon>0$, denote $C_{\varepsilon}=\cup_{x\in C} B(x,\varepsilon)$

$$\mathbb{P}(Y_n\in C)\leq \mathbb{P}(d(Y_n,Y)\geq \varepsilon)+\mathbb{P}(Y\in C_{\varepsilon})$$

By continuity of measures (and the fact that $C$ is closed), we in particular find that

$$\limsup_{n\to\infty}\mathbb{P}(Y_n\in C)\leq \limsup_{n\to\infty} \mathbb{P}(Y\in C_{1/n})=\mathbb{P}(Y\in C),$$

which is equivalent to weak convergence of distributions in a complete, separable space.

Adding these things up, we get for $A\subseteq \mathcal{X}$ open, by weak convergence, that $\liminf_{m\to\infty} \mathbb{P}(Z_m\in A)\geq \mathbb{P}(X\in A)$, so your hypothesis is only true if $\lim_{m\to\infty}\mathbb{P}(Z_m\in A)=\mathbb{P}(X\in A)$.

6
On

Here's a proof for $\mathbb{R}^n$ with Euclidean distance and Lebesgue measure restricted to disk of radius $\frac{1}{\sqrt{\pi}}$.

The idea of the proof (specifically to contrast the flawed proof above) is to use the symmetry/averaging in $x$. In fact, averaging in $x$ allows for a simpler expression for $P(Z_m \in A)$.

Note $P(Z_m \in A) = \int_{x \in \Omega} \int_{x_1,\dots,x_m \in \Omega} 1_{\text{closest $x_i$ to $x$ is in $A$}} = \int_{x \in \Omega} m\int_{x_1,\dots,x_m \in \Omega} 1_{\text{$x_1$ is closest to $x$ and $x_1 \in A$}} = \int_{x \in \Omega} m \int_{x_1 \in A} \mu(B_{d(x,x_1)}(x)^c)^{m-1}d\mu(x_1)d\mu(x) = m\int_{x_1 \in A} \int_{x \in \Omega} (1-|x-x_1|^2)^{m-1}d\mu(x)d\mu(x_1) \le m\int_{x_1 \in A}\int_0^{2\pi}\int_0^1 (1-r^2)^{m-1}r dr d\theta d\mu(x_1) = \mu(A)$.


The following answer is (seriously) flawed. It proves that, in some metric spaces, there is some $x$ so that if you always do nearest neighbor algorithm relative to $x$, the $Z^x_m$ measure of a set $A$ can be arbitrarily (multiplicatively) larger than the $X$-measure of $A$. However, as Bob pointed out, in this problem, you do nearest neighbor algorithm relative to an $X$-randomly chosen point of $\mathcal{X}$.


All of it is false.

Let $\mathcal{X} = \{0 \le |z| < 1\} \subseteq \mathbb{R}^2$, $d$ the standard Euclidean metric, and $P$ pushforward to the Lebesgue measure (i.e. we choose points randomly). Let $A_N = \cup_{k \ge N} \{\frac{1}{2^{10^{2k+1}}} < |z| < \frac{1}{2^{10^{2k}}}\}$. Clearly $A_N$ is open for each $N$. Also, $\mu(A_N) \to 0$ as $N \to \infty$. But, for each $N \ge 1$, we have that $\limsup_{m \to \infty} P_{Z_m}(A) = 1$. Indeed, with $x$ the origin, $$P(|Z^x_m| \in [\frac{1}{2^{10^{2k+1}}},\frac{1}{2^{10^{2k}}}]) \ge 1-P(|Z^x_m| < \frac{1}{2^{10^{2k+1}}})-P(|Z^x_m| > \frac{1}{2^{10^{2k}}})$$ $$\ge 1-m\frac{\pi}{(2^{10^{2k+1}})^2}-(1-\frac{\pi}{(2^{10^{2k}})^2})^m,$$ so we can take $m = 2^{3\cdot 10^{2k}}$ to see that $P(Z^x_m)$ is exponentially close to $1$. $\square$