What I want to achieve?
Let's say, I have a set of $N$ samples of a uniformly distributed random variable. For simplicity, they fall into the range $[0,1)$. If I picked a random point $a \in N$ and a set of $K\subset N\backslash \{a\}$ other points in that interval, what is the probability that a given set $K$ holds all the closest points to point $a$?
In my concrete case I'm interested in $|K| = 20$
What (I think) I found out
- The closest points to $a$ follow beta distributions (order statistics)
- The probability that we find $x$ peers in a certain interval follows the poisson distribution
Idea
In the case of $|K| = 20$. I thought I could take the average distance to $a$ of all $K$ keys and compare it to the beta distribution of the 10.5th closest key.
I did the following simulation to familiarize myself:
import numpy as np
ROUNDS = 10_000
N = 15_000
nodes = np.random.uniform(0, 1, size=(ROUNDS,N))
nodes.sort()
averages = []
for n in nodes:
averages += [np.average(n[:20])]
I plot the CDF of the averages together with the beta distribution with $\alpha=k$ and $\beta=N-k+1$ (according to Wikipedia).
This gives me the following graph:
My idea was that if I have a set of $K$ keys, I calculate their average distance to $a$ and just look up the corresponding CDF value. For example, if the average distance is $0.001$ the probability that the set contains the 20 closest keys is 1-CDF. In that case ~10%.
However, as you can see above, the distributions are similar but not the same.
Some Context
In the context of the Kademlia Distributed Hash Table there is a process that tries to find the $k=20$ closest nodes to a certain key $a$ according to the XOR metric. We can assume that node identities in such a hash table are uniformly distributed and live in the same key space as $a$. The process to find the closest nodes iteratively yields ever closer nodes and eventually returns the set of the 20 closest ones. However, this process can take quite long and I want to check if the process can be terminated earlier by calculating the probability that the set of 20 closest nodes I know (in the middle of the process) is already the closest one we can get.
Questions
- What would be the correct distribution there?
- Is my line of thought correct?
- Is there a simpler method to achieve my goal?

Average distant doesn't matter that much: of you by luck got one element very close to $a$, this doesn't mean you can relax requirements on next.
Note that $K$ consists of points closest to $a$ iff the $\max_{x \in K} d(x, a) < \min_{x \in N \setminus K} d(x, a)$.
In other words, probability of $K$ been the correct set is exactly probability of rest $N \setminus K$ points been further away from $a$ than $\max_{x \in K} d(x, a)$.
In uniform case, it's simply $(1 - 2 \max_{x \in K} d(x, a))^{N - K}$ (assuming this $\max$ is less than $a$ and $1 - a$, otherwise modify correspondingly).