Bagging a random sample with duplicates


This is a question posed by an engineer friend, who ran into this probability problem at work. It goes like this:

Say we have 5000 marbles. Of them, we know that 50 are duplicates. They are not necessarily all the same duplicates, and not necessarily perfect distinct pairs of duplicates. So, we know that there are 4950 distinct marbles and 50 marbles that are duplicates.

Consider randomly bagging these marbles into 100 bags of 50 marbles each. What is the probability that at least one bag contains at least one pair of duplicate marbles? Another way of phrasing it: what is the probability that, after bagging, one or more bags contain two (or more) of the same marble? (You can simplify this if you would like, since it is a bit vague, though an answer to the general question is preferred. For instance, you may assume that the duplicates are all the same, or that they all show up in perfect pairs.)

I gave this a shot myself, but it is more complicated than I can solve with certainty. The engineers working on this problem seem to believe the probability is very low; I disagree and believe it is considerably high. Any solution, including a simplification of the problem that gives a rough approximation (or what we hope is an approximation), is great. I'd prefer to see the steps and reasoning, both to justify your answer and to teach me how to approach such a problem. Thank you.


There are 2 best solutions below


Here is a simulation with $10^4$ replications in R. It makes no assumption about whether the extra $50$ balls are all the same, all different, or something else; each run chooses them with replacement from the distinct set.

# TRUE if the vector x contains any repeated value
anymatches <- function(x){
  sum(duplicated(x)) > 0
  }

# number of bags containing at least one matched pair
bagswithmatches <- function(bags, ballsperbag, extra){
  distinct <- bags * ballsperbag - extra
  # the extra balls are drawn with replacement from the distinct ones
  full <- c(1:distinct, sample(1:distinct, extra, replace=TRUE))
  # shuffle all the balls; each column of the matrix is one bag
  mat <- matrix(sample(full), ncol=bags)
  sum(apply(mat, 2, anymatches)) 
  }

set.seed(2023)
sims <- replicate(10^4, bagswithmatches(bags=100, ballsperbag=50, extra=50))
table(sims)
# sims
#    0    1    2    3    4 
# 6129 3033  712  113   13

which seems to suggest that the probability that at least one bag has matches is about $39\%$, and quite often more than one bag does.

I would say this is not very low.
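For comparison, here is a rough Python analogue of the same simulation (a sketch, not the answer's R code: bags are formed from consecutive blocks of the shuffled IDs, and the function name, seed, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2023)
N, DISTINCT, BAGS = 5000, 4950, 100  # 100 bags of 50 marbles

def any_bag_has_match(rng):
    # 50 extra marbles drawn with replacement from the 4950 distinct IDs
    ids = np.concatenate([np.arange(DISTINCT),
                          rng.integers(0, DISTINCT, N - DISTINCT)])
    rng.shuffle(ids)
    # after shuffling, consecutive blocks of 50 form the bags; encoding
    # (bag, id) pairs makes a within-bag repeat into a global repeat
    coded = ids + DISTINCT * np.repeat(np.arange(BAGS), N // BAGS)
    return np.unique(coded).size < N

reps = 10_000
p_hat = sum(any_bag_has_match(rng) for _ in range(reps)) / reps
print(p_hat)  # roughly 0.39
```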


Alternatively, here is a crude approximation to the same result, taking short-cuts which are not quite correct:

  • Suppose all the duplicates among the $5000$ are simple pairs, so there are $50$ distinct pairs (this is only true about $78\%$ of the time, but in most of the other cases there are $48$ pairs and one triplet, which behaves similarly)
  • Consider the probability that a particular pair both appear in the first bag: this has probability $\frac{50}{5000}\times \frac{49}{4999}\approx 0.000098$
  • So the probability they appear together in any of the $100$ bags is $100$ times this, about $0.0098$, and the probability they do not is about $0.9902$
  • Slightly wrongly assuming independence, the probability none of the $50$ pairs appear together in any of the bags might be about $0.9902^{50}\approx 0.611$, so the probability that at least one bag has matches might be about $0.389$.
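The arithmetic in these bullets can be checked in a few lines of Python (a sketch of the back-of-envelope chain above, with hypothetical variable names, not the exact probability):

```python
# probability that both members of one particular pair land in the first bag
p_pair_one_bag = (50 / 5000) * (49 / 4999)   # ≈ 0.000098

# probability the pair shares some bag (100 bags, nearly disjoint events)
p_pair = 100 * p_pair_one_bag                # ≈ 0.0098

# pretending the 50 pairs are independent: no pair is ever bagged together
p_none = (1 - p_pair) ** 50                  # ≈ 0.611

# probability at least one bag contains a matched pair
p_any = 1 - p_none                           # ≈ 0.389
```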

That is suspiciously close to the simulation results.

If you made a further wrong assumption that the number of bags with matches had a Poisson distribution with mean about $-\log_e(0.611)\approx 0.493$ then you might find the probabilities of $0$ bags with a match about $0.611$, $1$ bag with a match about $0.301$, $2$ bags about $0.074$, $3$ bags about $0.012$, and $4$ bags about $0.0015$, still close to the simulation values.
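That Poisson sketch can likewise be verified numerically (a quick Python check under the same wrong assumption; $\lambda$ is taken as $-\log_e(0.611)$ per the text):

```python
import math

lam = -math.log(0.611)  # ≈ 0.493, expected number of bags with a match

# Poisson probabilities of 0..4 bags having a match
pmf = [math.exp(-lam) * lam**m / math.factorial(m) for m in range(5)]
# pmf ≈ [0.611, 0.301, 0.074, 0.012, 0.0015], close to the simulated table
```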


For 4950 distinct marbles among 5000, a very good approximation can be made by treating the bagging processes for the different duplicate sets as independent (which they are not). Start by randomly bagging only the marbles that have a duplicate.

If the $i^{\text{th}}$ set of duplicates has $k$ identical marbles that are placed uniformly at random into $b$ bags of size $n/b$, then, placing the copies one at a time, the $(j+1)^{\text{st}}$ copy must avoid the $j(n/b)-j$ remaining slots in the $j$ bags already occupied, out of $n-j$ remaining slots in all. So the probability that the copies each land in a different bag is

$$q_i=\prod_{j=1}^{k-1}\frac{n-jn/b}{n-j}$$

The probability of finding a duplicate in one or more bags is $p\approx1-\prod_iq_i$.

there are 4950 distinct marbles and 50 marbles that are duplicates

There are 204226 distinct ways to form the duplicate sets, as given by the integer partitions of 50. Each will have a different $p$. A few possibilities are: 50 sets of duplicates, each containing two marbles ($p\approx0.3889$); a single set of duplicates containing 51 marbles ($p\approx0.9999998$); or 14 sets of duplicates containing $k=\{2,2,2,2,4,4,5,5,5,5,6,6,7,9\}$ marbles.
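As a cross-check of these examples, the product can be evaluated directly (a Python sketch with hypothetical names: `survival(k)` is the $q_i$ product for one duplicate set of size $k$, using bag size $n/b = 50$):

```python
import math

N, BAG_SIZE = 5000, 50  # 100 bags of 50 marbles

def survival(k):
    # probability that k identical marbles all land in different bags
    return math.prod((N - BAG_SIZE * j) / (N - j) for j in range(1, k))

def p_any_match(sizes):
    # approximate P(some bag holds a duplicate), treating sets as independent
    return 1 - math.prod(survival(k) for k in sizes)

print(p_any_match([2] * 50))   # ≈ 0.3889
print(p_any_match([51]))       # ≈ 0.9999998
print(p_any_match([2, 2, 2, 2, 4, 4, 5, 5, 5, 5, 6, 6, 7, 9]))
```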

The following R code shows the distribution of the estimates of $p$ for the distinct integer partitions of 50.

# each column of parts(50) is a partition giving the extra-copy counts
k <- partitions::parts(50)
# transpose, and add 1 to convert extra-copy counts into set sizes
k <- matrix(k + 1L, ncol(k), nrow(k), TRUE)
# calculate log(q) for each possible duplicate set size (q = 1 for k = 1)
q <- cumsum(log((5000 - 50*(0:50))/(5000 - (0:50))))
# calculate p for each distinct partition
p <- 1 - exp(rowSums(array(q[k], dim(k))))
range(p)
#> [1] 0.3889129 0.9999998
k[which.min(p),]
#>  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [39] 2 2 2 2 2 2 2 2 2 2 2 2
k[which.max(p),]
#>  [1] 51  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
#> [26]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
hist(p)

[Histogram of the approximate $p$ across the integer partitions of 50]


Simulation

The R code below performs 1M replications of the bagging simulation for each of 30 different possible ways to form duplicate sets based on the integer partitions of 50.

library(parallel)
nc <- detectCores() - 1L
nsim <- 1e6L
cl <- makeCluster(nc, type = "PSOCK")
phat <- unlist(
  parLapply(
    cl, data.frame(t(k[i <- sort(sample(nrow(k), 2*nc)),] - 1L)),
    function(k, nsim) {
      # 1:4950 belong to the first bag, 4951:9900 belong to the second, etc.
      b <- 4950L*(0:99)
      mean(
        replicate(
          nsim, anyDuplicated(sample(c(1:4950, rep.int(1:50, k))) + b) != 0L
        )
      )
    }, nsim = nsim
  )
)
stopCluster(cl)

Compare the estimates $p$ with some 90% credible intervals of $p$ from the simulation. The formula-based approximations and the simulation-based estimates differ by less than 0.0005 for each simulated value.

df <- data.frame(i, p = p[i], phat,
                 LCL = qbeta(0.05, nsim*phat + 0.5, nsim*(1 - phat) + 0.5),
                 UCL = qbeta(0.95, nsim*phat + 0.5, nsim*(1 - phat) + 0.5))
max(abs(df$p - df$phat))
#> [1] 0.0004736181
df

         i         p     phat       LCL       UCL
X1   14182 0.9703393 0.969956 0.9696742 0.9702358
X2   17514 0.9713908 0.971325 0.9710495 0.9715985
X3   20894 0.9680982 0.968259 0.9679696 0.9685464
X4   24681 0.9638628 0.963938 0.9636303 0.9642437
X5   26758 0.9535365 0.953500 0.9531527 0.9538454
X6   28356 0.9323432 0.931965 0.9315499 0.9323783
X7   29376 0.9682808 0.968351 0.9680620 0.9686380
X8   34904 0.9230245 0.923162 0.9227230 0.9235992
X9   38984 0.9456270 0.945766 0.9453925 0.9461376
X10  43983 0.9543882 0.954349 0.9540047 0.9546914
X11  64979 0.9303090 0.930465 0.9300457 0.9308825
X12  67211 0.9424809 0.942514 0.9421302 0.9428959
X13  71954 0.8952376 0.895368 0.8948637 0.8958706
X14  80893 0.9003469 0.900811 0.9003185 0.9013018
X15 103811 0.8302738 0.830116 0.8294976 0.8307330
X16 112028 0.8764857 0.876683 0.8761414 0.8772230
X17 118864 0.7915026 0.791976 0.7913077 0.7926430
X18 120295 0.8471327 0.847133 0.8465403 0.8477242
X19 121476 0.8111211 0.811461 0.8108170 0.8121037
X20 123101 0.8163041 0.816122 0.8154841 0.8167585
X21 127133 0.8858111 0.885781 0.8852570 0.8863034
X22 136584 0.7986852 0.798614 0.7979537 0.7992730
X23 145497 0.8659411 0.865491 0.8649290 0.8660514
X24 147172 0.8336017 0.833610 0.8329967 0.8342219
X25 162866 0.8666869 0.866797 0.8662373 0.8673551
X26 170388 0.7684281 0.768812 0.7681180 0.7695049
X27 193224 0.6948132 0.694549 0.6937910 0.6953062
X28 194190 0.6412669 0.641463 0.6406739 0.6422515
X29 200422 0.5653060 0.564918 0.5641024 0.5657333
X30 202663 0.5652984 0.565772 0.5649566 0.5665871

Plot where the formula-based approximations of $p$ land relative to the credible intervals (0 = LCL, 1 = UCL).

df <- within(df, x <- (p - LCL)/(UCL - LCL))
plot(df$x)
abline(h = 0:1, lty = 2)

[Plot of the scaled positions $x$, with dashed reference lines at 0 and 1]

The agreement is very good.

The following snippet demonstrates one way to efficiently simulate $p$ when the non-distinct marbles are sampled uniformly from the distinct marbles.

system.time({
  b <- 4950L*(0:99)
  p <- mean(replicate(1e5, anyDuplicated(sample(c(1:4950, sample(4950, 50, replace = TRUE))) + b) != 0L))
})
#>    user  system elapsed 
#>   34.31    4.14   38.49
p
#> [1] 0.39008