Probability distribution on $X \in \{0, 1\}^N$ given that $\|X\|_1$ is distributed uniformly


I've recently been diving into a bit of machine learning, and lots of probability theory has been coming up as a result. I took one course on probability and statistics in my first year of university and never touched it again, so, needless to say, I'm a bit rusty.

To train a certain neural network, I first need to generate training data. The inputs take the form of binary strings $x$ of length $N$, so the sample space is just $\{0, 1\}^N$. I would like to take a random sample with roughly equal numbers of strings for each value of the taxicab norm: $$\|x\|_1 = \sum_{i=1}^N x_i, \qquad \text{ where } x = (x_1, \dots, x_N) \in \{0, 1\}^N.$$

It's not too hard to work out that the number of binary strings of length $N$ with $\|x\|_1 = k$ is $\binom{N}{k}$, and that there are $2^N$ binary strings of length $N$ in total. So the problem could be rephrased as finding the distribution such that the probability of selecting some string $X \in \{0, 1\}^N$, given that $\|X\|_1 = k$, is $$P(X \mid \|X\|_1 = k) = \frac{1}{\binom{N}{k}}.$$ From Bayes' Theorem, $$P(X) = \frac{P(\|X\|_1 = k)}{\binom{N}{k}\, P(\|X\|_1 = k \mid X)}.$$ By design, $P(\|X\|_1 = k) = \frac{1}{N+1}$. Moreover, $P(\|X\|_1 = k \mid X) = 1$, so that $$P(X) = \frac{1}{(N+1)\binom{N}{k}}.$$

Is this argument correct? Does this distribution have some particular name?


Edit: By "binary string of length $N$," I really mean a vector in $\mathbb{R}^N$ whose components are all either $0$ or $1$. For $N = 4$, there are $2^4 = 16$ possible binary strings of length $4$, some examples of which are: $$x_1 = (0, 0, 1, 0), \ x_2 = (1, 1, 0, 0), \ x_3 = (0, 1, 1, 1).$$ The taxicab norm is introduced to count the number of $1$s that appear in the string. In the above examples, we see that $$\|x_1\|_1 = 1, \ \|x_2\|_1 = 2, \ \| x_3\|_1 = 3.$$ An example of my goal in the case $N = 4$ would be to sample, say, $100$ points in such a way that there are $20$ strings containing no $1$s, $20$ strings containing one $1$, $20$ strings containing two $1$s, etc. (allowing for repetition).
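The goal above can also be met directly with a two-step sampler: first draw the norm $k$ uniformly from $\{0, \dots, N\}$, then place $k$ ones in uniformly chosen positions. A minimal sketch in Python (the function name `sample_uniform_norm` is my own, and NumPy is assumed):

```python
import numpy as np

def sample_uniform_norm(n, rng=None):
    """Draw one binary string of length n: first pick the taxicab norm k
    uniformly from {0, ..., n}, then pick uniformly among the C(n, k)
    strings with exactly k ones."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.integers(0, n + 1)                    # uniform norm
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=k, replace=False)] = 1   # k ones in random positions
    return x
```

Drawing many samples and tallying `x.sum()` should then give roughly equal counts for each norm $0, \dots, N$.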

Best Answer

NOTE. This answer pertains to the original version and first revision of OP's question, in which OP required that both $\|X\|_1$ and $(X \mid \|X\|_1 = k)$ be uniformly distributed on their respective ranges.


Your original argument is correct, modulo some minor corrections:

Result. Let $X_N$ be a random binary string of length $N$ such that

  1. $\|X_N\|_1$ is equally likely to be any of the values $0, 1, \ldots, N$, and
  2. Given that $\{\|X_N\|_1 = k\}$, $X_N$ is equally likely to be any of the binary strings with taxicab norm $k$.

Then the distribution of $X_N$ satisfies $$ \mathbf{P}(X_N = x) = \frac{1}{(N+1)\binom{N}{\|x\|_1}}. \tag{*} $$

This formula, although mathematically correct, does not directly tell us how to sample binary strings from the distribution of $X_N$ efficiently (without needing to tabulate the values of $\mathbf{P}(X_N = x)$ for all the $2^N$ possible values of $x$).
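For small $N$, formula (*) can nevertheless be sanity-checked by brute force: enumerate all $2^N$ strings and confirm that the assigned probabilities sum to $1$, with each norm class $k$ (containing $\binom{N}{k}$ strings) receiving total mass $\frac{1}{N+1}$. A quick sketch using only the standard library:

```python
from itertools import product
from math import comb

def total_mass(n):
    """Sum P(X = x) = 1 / ((n + 1) * C(n, ||x||_1)) over all 2^n strings."""
    return sum(1 / ((n + 1) * comb(n, sum(x)))
               for x in product((0, 1), repeat=n))
```

Each of the $N + 1$ norm classes contributes $\binom{N}{k} \cdot \frac{1}{(N+1)\binom{N}{k}} = \frac{1}{N+1}$, so the total is $1$.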

Thankfully, there is an efficient way of sampling from the distribution of $X_N$, obtained by connecting it to a well-known stochastic process.

Pólya's Urn. Suppose we have an urn containing $1$ black ball ● and $1$ white ball ○. We proceed to draw balls at random from the urn. On the $i$-th draw, we define

$$ C_i = \begin{cases} 1, & \text{if a black ball ● has been drawn}, \\ 0, & \text{if a white ball ○ has been drawn}. \end{cases} $$ Also, after the color of the drawn ball has been observed, we return the ball to the urn, with an additional ball of the same color. Then:

Theorem. $\tilde{X}_N = (C_1, \ldots, C_N)$ and $X_N$ have the same distribution.

Proof. See this Wikipedia article.
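A short sketch of why the theorem holds: before the $i$-th draw the urn contains $i + 1$ balls, so for a fixed sequence $(c_1, \dots, c_N)$ with $k$ ones, multiplying the draw probabilities in order gives denominator $2 \cdot 3 \cdots (N+1) = (N+1)!$, while the black draws contribute $1 \cdot 2 \cdots k = k!$ and the white draws contribute $(N-k)!$ to the numerator. Hence $$\mathbf{P}(C_1 = c_1, \ldots, C_N = c_N) = \frac{k!\,(N-k)!}{(N+1)!} = \frac{1}{(N+1)\binom{N}{k}},$$ which depends on the sequence only through $k$ and matches (*).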

Below is Python code for simulating this process (assuming you have imported the NumPy library as np, as per the sacred tradition):

def polya_urn(n):
    """Simulate n draws from a Pólya urn starting with one black
    and one white ball; return the 0/1 sequence of draws."""
    b = 1  # black balls currently in the urn
    w = 1  # white balls currently in the urn
    res = [0] * n
    for i in range(n):
        # Draw a ball: black (1) with probability b / (b + w).
        res[i] = np.random.binomial(1, b / (b + w))
        # Return the ball plus one more of the same color.
        if res[i] == 1:
            b += 1
        else:
            w += 1
    return res
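As an empirical check of the theorem, one can simulate the urn many times and tally the norm $\|\tilde{X}_N\|_1$ of each run; the counts should come out roughly uniform over $\{0, \dots, N\}$. A sketch (re-simulating the urn inline so the snippet stands alone):

```python
import numpy as np
from collections import Counter

# Simulate the Pólya urn many times and tally the taxicab norm
# (number of 1s) of each simulated string.
np.random.seed(0)
N, trials = 4, 5000
counts = Counter()
for _ in range(trials):
    b = w = 1
    k = 0
    for _ in range(N):
        if np.random.binomial(1, b / (b + w)) == 1:
            b += 1
            k += 1
        else:
            w += 1
    counts[k] += 1
# Each norm 0..N should appear roughly trials / (N + 1) = 1000 times.
```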