Probability a hypercube is non-empty in a uniformly distributed point set in higher dimensions

120 Views Asked by At

Before I start, I just want to say I'm actually a computer science student and thus, my maths is not very strong so please feel free to make your explanations more detailed via adding additional links or comments and such.

The Scenario

I am randomly generating $N$ points in $D$ dimensional space (let's call this the data set) and I also generate a test set of points. The values are generated with uniform probability distribution and range from $[0,1]$. So let's say $D$ = 5, then an example point would be $(0.5, 0.2, 0.1, 0.9, 0.7)$.

I want to determine the probability $p$ that there exists at least one point in the data set which lies within a hypercube of side length $2e$ centred at $Q$, where $Q$ is a point in the test set. In other words, I want an equation relating $p$ and $e$.

My thoughts

For a point $Z$ to lie within a hypercube of side $2e$ centred at $Q$, each of its coordinates would need to lie within ${\pm}e$ of each corresponding coordinate of $Q$. For example, if $Q$ =$(Q_1, Q_2, Q_3,...,Q_D)$ and $Z$ = $(Z_1, Z_2, Z_3,...,Z_D)$, for $Z$ to lie within this hypercube, we need:

$Q_1-e{\leq}Z_1{\leq}Q_1+e$

$Q_2-e{\leq}Z_2{\leq}Q_2+e$

... and so on.

Let's rephrase $P(Q_i-e \leq Z_i \leq Q_1+e)$ as $P(|Z_i-Q_i|\leq{e})$. Since $Z_i$ and $Q_i$ are 2 independent uniform variates in $[0,1]$, then:

$P(|Z_i-Q_i|\geq{e}) = (1-e)^2$ (visual depiction of why) and thus:

$P(|Z_i-Q_i|\leq{e}) = 1-(1-e)^2 = 2e-e^2$

Then the probability that $Z$ lies within the hypercube of $Q$ with coordinates bounded by $[0,1]$ in $D$ dimensions is

= (2e-e^2)^D$.

Of course, the probability will vary with each point since this assumes $Z$ and $Q$ are independently uniformly random (e.g. if your test point $Q$ is $(0, 0)$ at the very corner, then the probability of $Z$ being in the hypercube will be much smaller since much of the hypercube is "cut off" example of what I mean in 2 dimensional space. The non-shaded area has no points inside it) but over a large number of test points, it will average out* (I think this is the problem, please read the bottom).

After testing this, it seems to be accurate. For example with, $D = 2, e = 0.2$, we get $(2(0.2)-0.2^2)^2=0.1296$. If we have $N=100 000$ and a test set of 10 000 points and for each test point, we counted the number of data set points which lied within its hypercube, we'd expect a total count of $0.1296...{\times}10 000{\times}100 000=129 600 000$ points. My test gives me 129 393 108 points.

But what is the probability that the hypercube is non-empty? If $P(k)$ = the probability that $k$ points lie inside the hypercube of side $2e$ centred at a test point $Q$, then a simple binomial gives us:

$P(k)=((2e-e^2)^D)^k(1-(2e-e^2)^D)^{N-k}\binom{N}{k}$

Since $p$ is the probability that there is at least one point in this hypercube:

$p=1-P(0)$

$=1-(1-(2e-e^2)^D)^N$

Subjecting $e$ and we get:

$e=1-\sqrt{1-(1-(1-p)^{\frac{1}{N}})^\frac{1}{D}}$.

My Problem

However, this doesn't seem to be correct. As the dimensions go higher, it gets more and more inaccurate. Let's take a data set of $N = 100 000$ and a test set of 10 000 and $p=0.99$. I should be generating around 100 empty hypercubes (1% of 10 000 test points should have no points in the data set within a hypercube of side $2e$ centred at the test point).

For $D=2$, I get $e=0.003399$. This generates 113 empty hypercubes.

For $D=5$, I get $e=0.070334$. This generates 209 empty hypercubes.

For $D=10$, I get $e=0.205269$. This generates 644 empty hypercubes.

What I think is wrong

As pointed out below by David K, maybe the events are not independent because the each test point $Q$ is fixed when testing against all N points in the data set. Therefore, binomial probability would not apply to this analysis since we assumed 2 independently random points Z and Q but in this case, each test point Q would be fixed for multiple Z.

Intuitively, I think I need to also analyse the distribution of the probability that Z lies in the hypercube of Q for a fixed Q or to analyse something about the "cut-off" areas for when Q is closer to the corners/edges.

Any insights are appreciated.

1

There are 1 best solutions below

0
On

I think the error in the calculated probability is that the cases where $Q$ is in a corner do not "average out" over a large number of test points.

Let $d(A,B)$ be the $L_\infty$ distance between points $A$ and $B$, so that $d(A,B)<e$ means $A$ is in the hypercube of side $2e$ centered at $B$. Then for a large number of test points $Z,$ $Z',$ $Z'', \ldots,$ you are dealing with the events $d(Z,Q)<e,$ $d(Z',Q)<e,$ $d(Z'',Q)<e,$ and so forth. The probability of each of these events is $(2e-e^2)^D,$ as you found, but the events are not independent. So the rest of your calculations are based on a false assumption.