Sampling subsets of positive numbers with sum close to a given number


Given a set of $N$ positive numbers $a_1, \ldots, a_N$, I am trying to generate random subsets of this set such that the sum of the numbers in each subset is close to a given number $M$.

  1. Suppose I can only select each number with some common probability $p$, or not select it with probability $(1-p)$. How should I choose $p$, given $a_1, \ldots, a_N$ and $M$, so that the samples have sums close to $M$?
  2. If I can set an individual probability $p_i$ for selecting or not selecting each number $a_i$, how should I set the $p_i$, given $a_1, \ldots, a_N$, $i$, and $M$, so that the sum of the selected $a_i$ is close to $M$?

Update: I need to define "close to $M$". A definition like "80% of the samples fall into the $[M-\epsilon,M+\epsilon]$ range" would be good for me.

To give a bit of context, I am trying to find a good way to initialize chromosomes for a genetic-algorithm solution of the $0$-$1$ knapsack problem. That said, I don't mind if some small proportion of the samples have sums that aren't close to the requested $M$, but I want the majority of them around it.

My own thoughts: I can get lower and upper bounds on $p$ for variant 1 of the problem formulation.

Let $a_{(1)} \le a_{(2)} \le \dots \le a_{(N)}$ be the $a_i$ sorted in ascending order. $$p_{\text{upper}} = \frac{1}{N}\max\left\{k : \sum_{i=1}^{k} a_{(i)} \leq M\right\}$$ $$p_{\text{lower}} = \frac{1}{N}\left(N - \max\left\{k : \sum_{i=k}^{N} a_{(i)} \geq M\right\}\right)$$
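As a sketch, both bounds can be computed with a sort and two running sums. The helper `p_bounds` below is a hypothetical name, and it assumes all $a_i$ are positive:

```python
def p_bounds(a, M):
    # Sketch of the prefix/suffix-sum bounds: p_upper counts how many of the
    # smallest elements fit under M; p_lower counts how many of the largest
    # elements are needed to reach M. Assumes all a_i > 0.
    a_sorted = sorted(a)                   # ascending order
    N = len(a_sorted)
    # k_upper: largest k such that the k smallest elements sum to <= M
    total, k_upper = 0.0, 0
    for x in a_sorted:
        total += x
        if total > M:
            break
        k_upper += 1
    # k_lower: largest k such that a_(k) + ... + a_(N) >= M
    suffix, k_lower = 0.0, 0
    for k in range(N, 0, -1):              # add the largest elements first
        suffix += a_sorted[k - 1]
        if suffix >= M:
            k_lower = k
            break
    return (N - k_lower) / N, k_upper / N

print(p_bounds([1, 2, 3, 4], 6))  # -> (0.25, 0.75)
```

For example, with $a = (1, 2, 3, 4)$ and $M = 6$: the three smallest elements sum to exactly $6$, so $p_{\text{upper}} = 3/4$, while the two largest elements already reach $M$, so $p_{\text{lower}} = 1/4$.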

I also can hypothesize that desired $p$ is around:

$$p_{\text{mean-based}} = \frac{M \cdot N^2}{\sum_{i=1}^N a_i}$$

or

$$p_{\text{median-based}} = \frac{M \cdot N}{\operatorname{median}(a_i)}$$

There is 1 answer below.

The best approach may be hard to implement, given that it essentially involves solving another knapsack problem with lower precision: finding a sum within $\epsilon$ of $M$ is approximately equivalent to finding a sum of $\lfloor \frac{a_1}{2\epsilon} \rfloor, \dots, \lfloor \frac{a_N}{2\epsilon} \rfloor$ equal to $\lfloor \frac{M}{2\epsilon} \rfloor$. But choosing $p$ so that the mean of the random sum is $M$ seems like a reasonable approach, and we can prove pretty good bounds on the deviation from the mean.
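That reduction fits in a couple of lines. The helper name `scaled_instance` is hypothetical, and rounding effects at bucket boundaries are ignored:

```python
def scaled_instance(a, M, eps):
    # Reduced-precision instance: divide everything into buckets of width
    # 2*eps and keep only the bucket indices. A subset hitting the scaled
    # target exactly roughly corresponds to a sum within eps of M.
    # (Sketch: rounding error at bucket boundaries is ignored.)
    scale = 2 * eps
    return [int(x // scale) for x in a], int(M // scale)

print(scaled_instance([10, 21, 35], 30, 5))  # -> ([1, 2, 3], 3)
```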

Say we choose $p$ so that $p \sum_{i=1}^N a_i = M$ (your mean-based approach, except that you have an extra factor of $N^2$ I don't understand). Then the expected value of the sum is $M$. More generally, say we choose any $p_1, \dots, p_N$ so that $\sum_{i=1}^N p_i a_i = M$, making the expected value $M$ again. Then we can use McDiarmid's inequality to bound the probability of a large deviation: if $\mathbf X$ is the random sum, then $$ \Pr[|\mathbf X-M| \ge \epsilon] \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^N a_i^2}\right). $$ With $\epsilon = \sqrt{\frac32 \sum_{i=1}^N a_i^2}$, for example, this guarantees a probability of $1 - 2e^{-3}$ or about $90\%$ that we fall in the range $[M-\epsilon, M+\epsilon]$.
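A quick Monte Carlo check of the mean-matching choice $p = M / \sum_{i=1}^N a_i$ against the McDiarmid bound. The instance is hypothetical and `sample_subset_sum` is an illustrative helper:

```python
import random

def sample_subset_sum(a, p, rng):
    # Include each a_i independently with probability p and return the sum.
    return sum(x for x in a if rng.random() < p)

rng = random.Random(0)
a = [rng.uniform(1, 10) for _ in range(200)]   # hypothetical instance
M = 300.0
p = M / sum(a)                                 # mean-matching: E[sum] = M

eps = (1.5 * sum(x * x for x in a)) ** 0.5     # eps = sqrt(3/2 * sum a_i^2)
trials = 10_000
hits = sum(abs(sample_subset_sum(a, p, rng) - M) <= eps
           for _ in range(trials))
print(hits / trials)  # McDiarmid guarantees at least 1 - 2e^-3, about 0.90
```

In practice the empirical fraction lands well above the guaranteed $90\%$, since McDiarmid's inequality is conservative when the sum consists of many comparably small terms.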

Moreover, the same bound holds for any sum of independently selected terms, whatever its mean. So we must choose $p$ (or $p_1, \dots, p_N$) so that $\sum_{i=1}^N p_i a_i$ is close to $M$; otherwise the same inequality shows that we almost never land in the right range.

A cheap way to tighten this bound, when the items are selected independently, is to exclude some or all of the larger $a_i$ entirely, reducing the sum $\sum_{i=1}^N a_i^2$. However, this seems to run counter to your intended goal: if we never use the largest $a_i$, then even if this lets us get closer to $M$ more often, the genetic algorithm won't be as good, because it can never learn to use those items when it needs to.

Still, if some of the $a_i$ are very large outliers, it may make more sense to handle them differently. Suppose $a_1$ is much, much bigger than any of $a_2, \dots, a_N$. Then we can seed the genetic algorithm with two kinds of approximate solutions:

  1. A bunch of subsets chosen with $p_1 = 0$ (so $a_1$ is never included) and $$p_2 = \dots = p_N = \frac{M}{\sum_{i=2}^N a_i}.$$
  2. A bunch of subsets chosen with $p_1 = 1$ (so $a_1$ is always included) and $$p_2 = \dots = p_N = \frac{M-a_1}{\sum_{i=2}^N a_i}.$$

Then we get much tighter concentration in each case than we would if we chose the subsets independently, while still getting solutions both with and without $a_1$. We can do the same thing to handle more than one outlier, though no more than a few, because the number of cases grows exponentially.
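The two-population seeding might look like the sketch below. `seed_population` is a hypothetical helper, and the inclusion probability is clamped to $[0, 1]$ in case one of the two targets is unreachable:

```python
import random

def seed_population(a, M, n_each, rng):
    # Treat a[0] as the outlier; build two sub-populations, one that always
    # excludes it and one that always includes it, each with its own p
    # chosen so the expected sum of the remaining items matches the target.
    rest, population = a[1:], []
    total_rest = sum(rest)
    for include_outlier in (False, True):
        target = M - (a[0] if include_outlier else 0.0)
        p = min(max(target / total_rest, 0.0), 1.0)  # clamp to [0, 1]
        for _ in range(n_each):
            chrom = [include_outlier] + [rng.random() < p for _ in rest]
            population.append(chrom)
    return population

rng = random.Random(1)
a = [500.0] + [rng.uniform(1, 10) for _ in range(100)]  # one huge outlier
population = seed_population(a, M=500.0, n_each=50, rng=rng)
```

Each chromosome is a boolean inclusion mask, so the output drops straight into a $0$-$1$ knapsack genetic algorithm as its initial generation.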