Probability of an unordered sample under weighted sampling without replacement


Imagine the following situation: an urn contains $K$ balls of different colours $\mathcal{U}=\{1,\ldots,K\}$ and with different weights $\mathbf{w}=(w_1,\ldots,w_K)$ (where $\sum_i w_i = 1$). You draw from the urn $m \leq K$ times without replacement, with the probability of selecting each of the remaining balls proportional to its weight, and observe a sample $\mathcal{K} \subset \mathcal{U}$, $|\mathcal{K}|=m$.

I want to compute the probability of such a (unordered) sample $\mathcal{K}$. For a particular ordered tuple $\mathbf{k}=(k_1,\ldots,k_m)$ of draws from the urn, the total remaining weight after $i$ draws is $1-w_{k_1}-\ldots-w_{k_i}$, and it follows that $$ \mathbb{P}(\mathbf{k}) = \frac{w_{k_1}}{1}\frac{w_{k_2}}{1-w_{k_1}}\cdots\frac{w_{k_m}}{1-w_{k_1}-\ldots-w_{k_{m-1}}}. $$ For an unordered sample $\mathcal{K}=\{k_1,\ldots,k_m\}$, we must sum over all possible permutations $S_m$ of the elements of $\mathcal{K}$, and the probability of observation is therefore $$ \mathbb{P}(\mathcal{K}) = \sum_{\pi\in S_m} \prod_{i=1}^{m} \frac{w_{k_{\pi(i)}}}{1-w_{k_{\pi(1)}}-\ldots-w_{k_{\pi(i-1)}}}, $$ which can be (slightly) simplified to $$ \mathbb{P}(\mathcal{K}) = w_{k_1}\cdots w_{k_m}\sum_{\pi\in S_m}\Bigg(\prod_{i=1}^{m-1} \big(1-w_{k_{\pi(1)}}-\ldots-w_{k_{\pi(i)}}\big)\Bigg)^{-1}. $$
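For small $m$, the unordered-sample probability can be computed directly from this formula by brute force. A minimal sketch (function names are ours):

```python
from itertools import permutations

def prob_unordered(weights, sample):
    """P(K) for an unordered sample: sum the ordered-draw probabilities
    over all m! orderings of the drawn indices (feasible only for small m)."""
    total = 0.0
    for perm in permutations(sample):
        p, remaining = 1.0, 1.0
        for idx in perm:
            p *= weights[idx] / remaining  # w_{k_i} / (1 - sum of earlier w)
            remaining -= weights[idx]
        total += p
    return total
```

As a sanity check, the probabilities of all $\binom{K}{m}$ size-$m$ subsets sum to one.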

The last expression is, unfortunately, still computationally intractable even for samples containing about 100 elements -- yet I'd need this for samples of tens of thousands of elements, selected from millions of colours initially present in the urn. This is where I'm stuck: I can't see how to simplify the expression further.

Thus my question(s): does anyone have an idea how to simplify the denominator and/or how to replace it by an approximation? Or can anyone point me at literature that deals with such a situation?

My eventual goal is to estimate the weights $\mathbf{w}$ from many such (independently taken) samples $\mathcal{K}_1,\ldots,\mathcal{K}_N$ with different sizes $|\mathcal{K}_1|$, $\ldots$, $|\mathcal{K}_N|$, either by an ML or (preferably) a Bayesian approach.



BEST ANSWER

This is an expanded version of a comment I posted previously.

The Conditional Bernoulli distribution is a distribution on a binary vector $X = (X_1,\ldots,X_m)\in\{0,1\}^m$. The $X_i$ are distributed $X_i \sim \textrm{Bern}(p_i)$, with the total number of non-zero entries restricted to be $\sum_i X_i = r$. All of the following mostly parrots the results of the linked paper.

In your proposed framework, we have a class (color) $c_i$ associated with each of the balls in the urn, and the number of observed balls pulled out without replacement is the number of binary variables associated with that class which are non-zero.

In math that would be, with $Q$ different colors in the urn:

\begin{equation} X | \{p_q\}_{q=1}^Q,\{c_i\}_{i=1}^m, r \sim \textrm{CB}(\{p_q\}_{q=1}^Q,\{c_i\}_{i=1}^m, r) \end{equation}

if and only if $X_i \sim \textrm{Bern}(p_{c_i})$ and $\sum_i X_i = r$, with each $c_i \in \{1,...,Q\}$.

Writing down the density is somewhat involved. Let $\mathcal{X}^r = \{X \in \{0,1\}^m : \sum_i X_i = r\}$, and let $w_i = \frac{p_i}{1-p_i}$ be the weight associated (via a 1-1 mapping) with each probability $p_i$.

Then the density of the CB distribution can be written down in a not-very-useful form as:

\begin{equation} P[X | \{w_i\}, r] =\frac{(\prod_{i \in \{1,...,m\}} w_i^{X_i})}{\sum_{Y \in \mathcal{X}^r}(\prod_{i \in \{1,...,m\}} w_i^{Y_i})} I_{(X \in \mathcal{X}^r)} \end{equation}

There are some more convenient representations involving recursive equations which make it easier to understand the algorithm presented for obtaining the MLE estimate.

They discuss an $R$ function $R(k,A) := \sum_{B \subseteq A,\, |B| = k}\prod_{i \in B} w_i$. There is a recursive equation for the $R$ function:

\begin{equation} R(k,A) = \frac{1}{k}\sum_{i=1}^k(-1)^{i+1}T(i,A)R(k-i,A) \end{equation}

where $A \subseteq \{1,...,m\}$ and $T(i,A) = \sum_{j\in A} w_j^i$.
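The recursion can be checked against the subset-sum definition directly. A small sketch (function names are ours), with the recursion written for general $k$; a practical implementation would memoize the recursion:

```python
from itertools import combinations
from math import prod

def R_direct(k, A, w):
    """R(k, A) = sum over size-k subsets B of A of prod_{i in B} w[i]."""
    return sum(prod(w[i] for i in B) for B in combinations(A, k))

def R_recursive(k, A, w):
    """Same quantity via the Newton-identity-style recursion
    R(k, A) = (1/k) * sum_{i=1}^{k} (-1)^{i+1} T(i, A) R(k-i, A),
    with T(i, A) = sum_{j in A} w[j]**i."""
    if k == 0:
        return 1.0
    T = lambda i: sum(w[j] ** i for j in A)
    return sum((-1) ** (i + 1) * T(i) * R_recursive(k - i, A, w)
               for i in range(1, k + 1)) / k
```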

Then the density has the expression:

\begin{equation} P[X | \{w_i\}_{i=1}^m, r] =\frac{\prod_{i \in \{1,...,m\}} w_i^{X_i}}{R(r,\{1,...,m\})} \end{equation}

Given this definition of the $R$ function, and multiple observations $\{X^n\}_{n=1}^N$ of the random binary vector, write $X_{(i)} = \sum_{n=1}^N X^n_i$ for the number of times variable $i$ was observed as 1 across all samples. Then the log likelihood has the form:

\begin{equation} \mathcal{L}(\{w_i\} | \{X^n\}_{n=1}^N) = \sum_{i=1}^m X_{(i)} \log w_i - N\log R(r,\{1,...,m\}) \end{equation}

The approach used to calculate the MLE under this formulation is an iterative algorithm (Algorithm 1 in that paper) which repeats the following steps until convergence:

  1. For all $i \in \{1,...,m-1\}$: $w_i^{(t+1)} = \frac{X_{(i)}R(r-1,\{1,...,m\}\setminus m)}{NR(r-1,\{1,...,m\}\setminus i)}$
  2. $w_m^{(t+1)} = \frac{X_{(m)}}{N}$
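A direct transcription of this iteration, assuming the updates use $R(r-1,\cdot)$ evaluated at the current weights (a hedged sketch with our own helper names, computing $R$ by plain enumeration rather than the recursion; not verified against the paper's Algorithm 1):

```python
from itertools import combinations
from math import prod

def R(k, A, w):
    # R(k, A): sum over size-k subsets B of A of the product of w[i], i in B
    return sum(prod(w[i] for i in B) for B in combinations(A, k))

def cb_mle(counts, N, r, n_iter=100):
    """Iterate the two update steps on counts[i] = X_(i)."""
    m = len(counts)
    w = [1.0] * m                       # arbitrary positive initialisation
    for _ in range(n_iter):
        ref = R(r - 1, [j for j in range(m) if j != m - 1], w)
        new = [counts[i] * ref / (N * R(r - 1, [j for j in range(m) if j != i], w))
               for i in range(m - 1)]
        new.append(counts[m - 1] / N)   # w_m is pinned to X_(m)/N
        w = new
    return w
```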

This will solve your problem, but could take an extremely long time. We can probably simplify things a bit by departing from the Chen paper at this point. Noting that there are only $Q$ distinct colors among the $X_i$, the log-likelihood can be reduced to:

\begin{equation} \mathcal{L}(\{w_q\}_{q=1}^Q | \{X^n\}_{n=1}^N) = \sum_{q=1}^Q \Big(\sum_{i \,:\, c_i = q}X_{(i)}\Big) \log w_q - N\log R(r,\{1,...,m\}) \end{equation}

I think you can also do some reductions on the $R$ function so it is not as difficult to calculate (requiring so many recursions), but I need some more time to think about it; I'm posting this for now. Even with such reductions, assuming someone can figure them out, having millions of colors present in the urn might simply mean the problem is computationally intractable.

EDIT There is also a result in that paper stating that, in the limit as $m \rightarrow \infty$, the distribution of a Conditional Bernoulli converges in total variation to a Multinomial distribution.

The explicit statement from that paper is that:

Assumption: There exists $c \in \mathbb{R}^+$ such that $\forall i,j \in \{1,...,m\}$: $\frac{w_i}{w_j}\leq c$.

This assumption amounts to none of the probabilities being very small or very large. For instance, if $p_i = \frac{i}{m}$ the condition would not be satisfied as $m\rightarrow \infty$.

Theorem (Chen 2000): If the previous assumption holds, a Conditional Bernoulli model with probabilities $\pi_i$ can be approximated by a $\textrm{Multinomial}(m,\lambda_1,...,\lambda_m)$ where $\lambda_i = \frac{\pi_i}{m}$ for each $i \in \{1,...,m\}$ when $\frac{N}{m^2}$ is large.

The theorem is only useful if you have a lot more samples than colors present, which according to your question seems pretty unlikely.

ANOTHER ANSWER

The accepted answer is somewhat confusing in that it proposes to use a different model. It is very useful in that the model is relevant for the problem and has an MLE algorithm with tractable steps. Yet I believe that sampling from a categorical distribution without replacement (as specified in the OP) defines a different probability distribution than the Conditional Bernoulli distribution. There seems to be no connection between $p_i$ -- the probabilities of the categorical variables in the OP model -- and the weights $w_i$, Bernoulli probabilities $p_i$, or coverage probabilities $\pi_i$ in the CB model. I found the following information addressing the OP more directly; I would be very interested if someone could complement it.

Equivalent Sampling Methods

The sampling in the OP is known as weighted random sampling [2], and the distribution is also known as the Plackett-Luce distribution. The sampling process described in the OP is draw-by-draw sampling, but the same model is also obtained via top-$m$ Gumbel sampling. Let $\phi_i = \log p_i$ be model logits (scores), and let $G_i$ be independent $\text{Gumbel}(0,1)$ variables. Then the following are equivalent in distribution:

  • draw-by-draw $m$ samples without replacement from $\text{Cat}(p)$
  • $\text{top-m}(\phi + G)$, i.e. select indices of $m$ largest noisy scores
  • $\text{top-m}(\phi_i - \log(-\log U_i))$, $U_i\sim \text{Uniform}(0,1)$ i.i.d.
  • $\text{top-m}(U_i^{1/p_i})$

The last variant in the chain (obtained by performing monotone transforms: $\exp$, $-1/x$, $\exp$ under the top-m) is the Efraimidis algorithm [2].
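The equivalence is easy to verify empirically. A quick simulation (helper names are ours) comparing draw-by-draw sampling with Gumbel top-$m$:

```python
import math
import random
from collections import Counter

def draw_by_draw(p, m, rng):
    # sample m indices without replacement, proportional to remaining weight
    idx, w, out = list(range(len(p))), list(p), []
    for _ in range(m):
        j = rng.choices(range(len(idx)), weights=w)[0]
        out.append(idx.pop(j))
        w.pop(j)
    return frozenset(out)

def gumbel_top_m(p, m, rng):
    # indices of the m largest values of log p_i + Gumbel(0,1) noise
    keys = [math.log(pi) - math.log(-math.log(rng.random())) for pi in p]
    return frozenset(sorted(range(len(p)), key=keys.__getitem__)[-m:])

rng = random.Random(0)
p = [0.5, 0.3, 0.2]
n = 20000
seq = Counter(draw_by_draw(p, 2, rng) for _ in range(n))
gum = Counter(gumbel_top_m(p, 2, rng) for _ in range(n))
# the two empirical subset distributions should agree up to sampling noise
```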

Practical Considerations

The OP mentions the case of interest:

samples of tens of thousands elements, selected from millions of colours

The ratio of the sample size to the total number of choices is thus about $1/100$. I think it would be safe to ignore the without-replacement effect and just estimate the probabilities using MLE or Bayesian MAP estimates for a plain Categorical distribution: with such a small sampling fraction, the probability of drawing the same element twice with replacement is low, so the probabilities of a sample with and without replacement are close to each other. At least this would be a good starting point to initialize any further optimization.

Probability of Unordered Draw and MLE

Calculating the probability of an unordered draw indeed appears hard. Kool et al. [3] propose an exact $O(2^m)$ algorithm, eq. (33) (it needs to enumerate all subsets), which is exponential but still a large saving over the naive $O(m!)$ (all orderings). They also propose a numerical solution via a 1D integral, eq. (32), as follows. Let $S$ be an unordered sample, represented as a subset of all indices $D$.

$$p(S) = \int_{0}^1 \prod_{i \in S} (1 - F_{\phi_i} (F^{-1}_{\phi_{D\backslash S}}(u) ) ) du,$$

where $\phi_i = \log p_i$, $\phi_{D\backslash S} = \log(1-\sum_{i\in S}p_i)$, and $F_\mu$ is the CDF of the $\text{Gumbel}(\mu,1)$ distribution: $F_\mu(g) = \exp(-\exp(-(g - \mu)))$.
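This integral is cheap to evaluate numerically. A sketch (the quadrature choice and function names are ours, not the paper's):

```python
import math

def gumbel_cdf(g, mu):
    return math.exp(-math.exp(-(g - mu)))

def gumbel_icdf(u, mu):
    return mu - math.log(-math.log(u))

def p_unordered(p, S, n_grid=50000):
    """Kool et al. eq. (32): p(S) as a 1D integral over u in (0, 1),
    approximated here with a simple midpoint rule."""
    phi = [math.log(pi) for pi in p]
    phi_b = math.log(1.0 - sum(p[i] for i in S))  # log of leftover mass
    total = 0.0
    for k in range(n_grid):
        u = (k + 0.5) / n_grid
        g = gumbel_icdf(u, phi_b)
        total += math.prod(1.0 - gumbel_cdf(g, phi[i]) for i in S)
    return total / n_grid
```

For a small example the result can be checked against the exact sum over orderings.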

If I am not mistaken, the integrand is continuous, and when computing its gradient in the parameters one can differentiate under the integral sign. However, in the log-likelihood we are going to have

$$\sum_{S \sim \text{Data}} \log \int_{0}^{1} f(S,\phi, u) d u , \ \ \text{where}\ \ f(S, \phi, u) = \prod_{i \in S} (1 - F_{\phi_i} (F^{-1}_{\phi_{D\backslash S}}(u) ) ) $$

which still makes it necessary to evaluate those integrals when differentiating. I think this nevertheless allows for stochastic optimization with a meaningful approximation:

  • Draw $S \sim \text{Data}$ at random
  • draw $M$ samples $u_1 \dots u_M \sim \text{Uniform}$;
  • estimate stochastic gradient as

$$ \nabla_\phi \approx \frac{ \sum_{l=1}^M \frac{d f(S, \phi, u_l)}{d \phi} }{ \sum_{l=1}^M f(S, \phi, u_l) }$$

  • make a gradient-ascent step on the parameters: $\phi \leftarrow \phi + \alpha \nabla_\phi$
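The estimator above can be sketched as follows, using the closed-form Gumbel CDFs and a finite-difference derivative in place of an analytic $\frac{df}{d\phi}$ (all names and the finite-difference shortcut are ours; note that in this parametrization $f$ depends only on the logits of items in $S$):

```python
import math
import random

def f(S, p, u):
    # integrand: prod_{i in S} (1 - F_{phi_i}(F^{-1}_{phi_{D\S}}(u)))
    pb = 1.0 - sum(p[i] for i in S)               # leftover probability mass
    g = math.log(pb) - math.log(-math.log(u))     # F^{-1}_{phi_{D\S}}(u)
    return math.prod(1.0 - math.exp(-math.exp(-(g - math.log(p[i])))) for i in S)

def grad_log_pS(p, S, M=2000, eps=1e-5, seed=0):
    """Monte Carlo gradient of log p(S) w.r.t. the logits phi_i = log p_i:
    (sum_l df/dphi_i) / (sum_l f), with df/dphi_i by finite differences."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(M)]
    base = sum(f(S, p, u) for u in us)
    grad = []
    for i in range(len(p)):
        pp = list(p)
        pp[i] = p[i] * math.exp(eps)              # perturb phi_i by eps
        grad.append((sum(f(S, pp, u) for u in us) - base) / (eps * base))
    return grad
```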

[2]: Efraimidis: Weighted Random Sampling over Data Streams

[3]: Kool (2020): Estimating Gradients for Discrete Random Variables by Sampling without Replacement, https://openreview.net/forum?id=rklEj2EFvB