Given a set of $N$ positive numbers $a_1 \ldots a_N$, I am trying to generate random subsets of this set such that the sum of the numbers in each subset is close to a given number $M$.
- Suppose I can only select each number with some common probability $p$ or skip it with probability $(1-p)$. How should I choose $p$, given $a_1 \ldots a_N$ and $M$, to get samples whose sum is close to $M$?
- If I can set an individual probability $p_i$ for selecting or not selecting each number $a_i$, how should I set $p_i$, given $a_1 \ldots a_N, i, M$, so that the sum of the selected $a_i$ is close to $M$?
Update: I need to define "close to $M$". I suppose a definition like "80% of samples fall into the $[M-\epsilon,M+\epsilon]$ range" will be good for me.
To give a bit of context, I am trying to find a good way to initialize chromosomes for a genetic-algorithm solution of the $0$-$1$ knapsack problem. That said, I don't care if some small proportion of samples have sums that aren't close to the requested $M$, but I want the majority of them around it.
My own thoughts: I can get lower and upper bounds for $p$ for variant 1 of the problem formulation.
Let $a_{\text{sorted}}$ be the $a_i$ sorted in ascending order. $$p_{\text{upper}} = \frac{\max\left\{k : \sum_{i=1}^{k} a_{\text{sorted},i} \leq M\right\}}{N}$$ $$p_{\text{lower}} = \frac{N-\max\left\{k : \sum_{i=k}^{N} a_{\text{sorted},i} \geq M\right\}}{N}$$
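A direct way to compute these two bounds is to scan the prefix sums of the sorted values for $p_{\text{upper}}$ and the suffix sums for $p_{\text{lower}}$. A minimal sketch (the function name is mine; it assumes $0 < M \leq \sum_i a_i$):

```python
def p_bounds(a, M):
    """Lower/upper bounds on p from the sorted a_i, per the formulas above.

    p_upper comes from the largest k whose k smallest elements still sum
    to at most M; p_lower from the largest k whose suffix a_k..a_N still
    sums to at least M. Assumes 0 < M <= sum(a).
    """
    N = len(a)
    s = sorted(a)
    # largest k with sum of the k smallest elements <= M
    k_up, total = 0, 0
    for x in s:
        if total + x > M:
            break
        total += x
        k_up += 1
    # largest 1-indexed k with sum of elements s[k-1..N-1] >= M
    k_lo, total = 0, 0
    for k in range(N, 0, -1):
        total += s[k - 1]
        if total >= M:
            k_lo = k
            break
    return (N - k_lo) / N, k_up / N
```

For example, `p_bounds([1, 2, 3, 4, 5], 7)` scans prefixes $1, 3, 6, 10$ (so $k=3$) and suffixes $5, 9$ (so $k=4$), giving the pair $(0.2, 0.6)$.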
I also can hypothesize that desired $p$ is around:
$$p_{\text{mean-based}} = \frac{M \cdot N^2}{\sum_{i=1}^N a_i}$$
or
$$p_{\text{median-based}} = \frac{M \cdot N}{\operatorname{median}_i a_i}$$
The best approach may be hard to implement, given that it essentially involves solving another knapsack problem with lower precision: finding a sum within $\epsilon$ of $M$ is approximately equivalent to finding a sum of $\lfloor \frac{a_1}{2\epsilon} \rfloor, \dots, \lfloor \frac{a_N}{2\epsilon} \rfloor$ equal to $\lfloor \frac{M}{2\epsilon} \rfloor$. But choosing $p$ so that the mean of the random sum is $M$ seems like a reasonable approach, and we can prove pretty good bounds on the deviation from the mean.
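As a tiny worked instance of that coarsening (the numbers here are hypothetical, just to show the floor divisions):

```python
# With precision epsilon, the approximate problem reduces to an exact
# subset-sum over the floor-divided values.
a = [137, 249, 512, 96, 311]
M, eps = 300, 10
scaled = [x // (2 * eps) for x in a]   # floor(a_i / (2*eps))
target = M // (2 * eps)                # floor(M / (2*eps))
```

Here `scaled` is `[6, 12, 25, 4, 15]` and `target` is `15`, so any subset of the scaled values hitting `target` exactly corresponds to an original subset sum near $M$.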
Say we choose $p$ so that $p \sum_{i=1}^N a_i = M$ (your mean-based approach, except that you have an extra factor of $N^2$ I don't understand). Then the expected value of the sum is $M$. More generally, say we choose any $p_1, \dots, p_N$ so that $\sum_{i=1}^N p_i a_i = M$, making the expected value $M$ again. Then we can use McDiarmid's inequality to bound the probability of a large deviation: if $\mathbf X$ is the random sum, then $$ \Pr[|\mathbf X-M| \ge \epsilon] \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^N a_i^2}\right). $$ With $\epsilon = \sqrt{\frac32 \sum_{i=1}^N a_i^2}$, for example, this guarantees a probability of at least $1 - 2e^{-3}$, or about $90\%$, that we fall in the range $[M-\epsilon, M+\epsilon]$.
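A quick Monte Carlo check of this bound, on hypothetical data: choose $p = M / \sum_i a_i$ so the mean is $M$, then count how often the sum lands within $\epsilon = \sqrt{\frac32 \sum_i a_i^2}$ of $M$. (In practice the empirical hit rate is much better than the $90\%$ the inequality guarantees.)

```python
import random

def sample_subset_sum(a, p, rng):
    """Include each a_i independently with probability p; return the sum."""
    return sum(x for x in a if rng.random() < p)

rng = random.Random(0)
a = [rng.randint(1, 100) for _ in range(50)]   # hypothetical instance
M = 0.4 * sum(a)
p = M / sum(a)                                 # makes E[sum] = M

# McDiarmid: Pr[|X - M| >= eps] <= 2*exp(-2*eps^2 / sum(a_i^2)).
eps = (1.5 * sum(x * x for x in a)) ** 0.5
trials = 10_000
hits = sum(abs(sample_subset_sum(a, p, rng) - M) <= eps
           for _ in range(trials))
# The bound promises hits/trials >= 1 - 2*e^-3, roughly 0.9.
```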
Moreover, the same bound holds around the mean $\sum_{i=1}^N p_i a_i$ for any random sum like this one. So we must choose $p$ (or $p_1, \dots, p_N$) so that $\sum_{i=1}^N p_i a_i$ is close to $M$; otherwise the same inequality shows that we are almost never in the right range.
A cheap way to improve this bound in cases where $p_1, \dots, p_N$ are chosen independently is to avoid some or all of the larger $a_i$ entirely, reducing the value of the sum $\sum_{i=1}^N a_i^2$. However, this seems to run counter to your intended goal: if we never use the largest $a_i$, then even if this lets us get closer to $M$ more often, the genetic algorithm won't be as good, because it will never learn to use those $a_i$ when it needs to.
Still, if some $a_i$'s are very large outliers, it may make more sense to handle them differently. Suppose that $a_1$ is much, much bigger than any of $a_2, \dots, a_N$. Then we can seed your genetic algorithm with two kinds of approximate solutions:

- subsets that include $a_1$, with the remaining elements chosen so that their expected sum is $M - a_1$;
- subsets that exclude $a_1$, with the remaining elements chosen so that their expected sum is $M$.
Then we get much tighter concentration in each case than we would if we chose the subsets fully independently, while still getting solutions both with and without $a_1$. We can do the same thing to handle more than one outlier, though only a few, because the number of cases grows exponentially.
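One way to realize this two-case seeding is a sketch like the following, where `a[0]` is the outlier and `q` is a free mixing parameter (both the function name and `q` are mine, not part of any standard recipe):

```python
import random

def seed_chromosome(a, M, q, rng):
    """Two-case seeding when a[0] is a large outlier (a sketch).

    Case 1 (prob. q, if feasible): include a[0], sample the rest so
    that the expected sum of the rest is M - a[0].
    Case 2 (otherwise): exclude a[0], sample the rest with mean M.
    """
    rest = a[1:]
    total = sum(rest)
    if rng.random() < q and a[0] <= M:
        target, chrom = M - a[0], [1]
    else:
        target, chrom = M, [0]
    # Per-item inclusion probability for the rest, clamped to [0, 1].
    p = max(0.0, min(1.0, target / total))
    chrom += [1 if rng.random() < p else 0 for _ in rest]
    return chrom

# Usage on a hypothetical instance with one dominant item:
rng = random.Random(0)
chrom = seed_chromosome([1000, 30, 45, 12, 80, 55, 20], 1100, 0.5, rng)
```

Each returned bit vector is one seeded chromosome; drawing many of them gives the population mix of with-$a_1$ and without-$a_1$ solutions described above.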