Sampling from a uniform Dirichlet distribution

784 Views Asked by At

I'm trying to get a sample of points distributed uniformly in the 26-dimensional simplex (otherwise put, uniformly distributed 27-dimensional probability vectors). As I have been told, for example, here, the right way to do this is to sample from the uniform Dirichlet distribution.

The issue I am finding is that a largish sample from this distribution contains no points near the vertices of the simplex. In numpy:

from numpy.random import dirichlet
points = dirichlet([1] * 27, size=1000000)

After doing this repeatedly, numpy.max(points) is typically around .5. In that million samples nothing gets closer to the vertex.

I assume this is correct (and my 27-dimensional intuitions are wildly unreliable anyway). My question is: how much more should I sample in order to have, say a .9 probability of getting a point with a value of a coordinate of .9 or higher? If this turns out to be an unworkable sample, is there any other way I could coax the distribution to give me more extreme points?

1

There are 1 best solutions below

3
On BEST ANSWER

Say you have $n=27$ coordinates. The probability that one of them is greater than $q=0.9$ is $n$ times the probability that a particular one of them is greater than $q$, as long as $q\ge\frac12$, since in this case these $n$ events are mutually exclusive.

Now the probability density for one coordinate $x_1$ is proportional to $\left(1-x_1\right)^{n-2}$ (since $1-x_1$ is left for the remaining $n-1$ coordinates to sum to in an $(n-2)$-dimensional simplex). The probability for $x_1\gt q$ is thus

\begin{eqnarray*} \mathsf P\left(x_1\gt q\right)&=&\frac{\int_q^1\left(1-x_1\right)^{n-2}\;\mathrm dx_1}{\int_0^1\left(1-x_1\right)^{n-2}\;\mathrm dx_1} \\ &=& \frac{\int_0^{1-q}x_1^{n-2}\;\mathrm dx_1}{\int_0^1x_1^{n-2}\;\mathrm dx_1} \\ &=& \left(1-q\right)^{n-1}\;. \end{eqnarray*}

In your case, with $n=27$ and $q=0.9$, this is $10^{-26}$, so the overall probability is $27\cdot10^{-26}$. To have a $.9$ probability for this to happen, you'd need to draw $k$ samples such that

$$ \left(1-27\cdot10^{-26}\right)^k\le1-0.9\;, $$

and solving for $k$ yields

\begin{eqnarray*} k&\ge&\frac{\log(1-0.9)}{\log\left(1-27\cdot10^{-26}\right)} \\&\simeq&-\frac1{27}10^{26}\log0.1 \\&\simeq&8.5\cdot10^{24}\;. \end{eqnarray*}

So you'd better start drawing samples right away; the universe might end any time soon.