Let $l_1,\cdots,l_n \sim U(0,1)$ be i.i.d uniform variables. Given $L>0$ a natural number, let us define the "uniform" subset sum problem as: Find $I \subset \{1,\cdots,n\}=:[n]$ such that
$$\sum_{i \in I} l_i \le L$$
$$I = \operatorname{argmax}_{I \subset [n]} \left( \sum_{i \in I} l_i \right )$$
I want to propose an algorithm to solve this problem:
Observe that $E(l_i) = 1/2$ and so by linearity of expectation:
$$E(\sum_{i \in I} l_i ) = \frac{|I|}{2}$$
If we make the Ansatz:
$$|I| \approx 2 L $$
then we could consider two cases:
$$n \le 2L \rightarrow \text{ return } [n]$$
$$n > 2L$$ Set $I = \emptyset, k = 1, s = 0$.
sl = sort(l1,...ln)
while s <= L:
s <- s+sl[k]+sl[n-k+1]
I <- I.union( {k, n-k+1} )
k <- k+1
k <- k-1
if s-sl[k]<=L:
return I.difference({k})
if s-sl[n-k+1]<=L:
return I.difference({n-k+1})
return I.difference({k,n-k+1})
Heuristic explanation of the idea behind the algorithm.
Sorting the list of numbers $l_{(1)} \le l_{(2)} \le \cdots \le l_{(n)}$ we observe, that the order statistic is Beta distributed:
$$l_{(k)} \sim \operatorname{Beta}(k,n-k+1)$$
and thus has expected value:
$$E(l_{(k)}) = \frac{k}{n+1}$$
hence:
$$E(l_{(k)}+l_{(n-k+1)}) = 1$$
Hence each of the pair $l_{(k)},l_{(n-k+1)}$ is expected to contribute $1$ to the sum. So we keep adding pairs $l_{(k)}+l_{(n-k+1)}$ to the sum $s$ and increase $k \leftarrow k+1$ by one, until we arrive at $s>L$.
We go one step back with $k$:
$k \leftarrow k-1$
In this case we first remove the smaller $l_{(k)}$ from the sum and check if it already satsifies $s \le L$. If this is the case we return $I-\{k\}$.
If this is not the case, we remove the larger of the pair, $l_{(n-k+1)}$ from $s$ and check if $s \le L$. If this is the case we return $I-\{n-k+1\}$.
If neither of the above is the case, we return $I-\{k,n-k+1\}$ which is guaranteed to have $s \le L$.
I have not been able to analyze this algorithm without using heuristics, but I could try it out empirically and it seems "to work quite well":
def solve_uniform_ssp(ll,L):
if all([0<=l<=1 for l in ll]) and L>0:
n = len(ll)
if n <= 2*L:
return list(range(n)),sum(ll)
sll = sorted(ll)
I = set([])
k = 0
s = 0
while s <= L:
#print(k,n-1,n-k-1-1+1)
s = s+sll[k]+sll[n-k+1-2]
k = k+1
I = I.union(set([k,n-k+1]))
k = k-1
if s-sll[k]<=L:
s = s-sll[k]
return sorted(list(I.difference(set([k])))),s
if s-sll[n-k+1-2]<=L:
s = s-sll[n-k+1-2]
return sorted(list(I.difference(set([n-k+1-1])))),s
s = s-sll[k]-sll[n-k+1-2]
return sorted(list(I.difference(set([k,n-k+1-1])))), s
print("S, s, L")
for experiment in range(1,1000):
n = randint(1,1000)
L = randint(1,100)
ll = [randint(0,100)/100.0 for k in range(n)]
#print(ll)
I,s=(solve_uniform_ssp(ll,L))
print(sum(ll),s,L)
S, s, L
468.420000000000 97.3100000000000 98
300.380000000000 75.7300000000000 76
96.5200000000000 1.00000000000000 1
116.620000000000 96.9899999999999 97
334.450000000000 40.1800000000000 41
93.6600000000000 23.5500000000000 24
238.540000000000 53.4300000000000 54
225.220000000000 2.00000000000000 2
122.960000000000 58.6600000000000 59
416.870000000000 31.9700000000000 32
381.590000000000 3.02000000000000 4
445.520000000000 18.9800000000000 19
70.3700000000000 55.8900000000000 56
314.080000000000 4.00000000000000 4
479.199999999999 89.8899999999999 90
56.8600000000000 56.8600000000000 61
74.1700000000000 74.1700000000000 83
369.610000000000 17.9900000000000 18
478.030000000001 44.1500000000000 45
67.4700000000000 7.84000000000000 8
466.509999999999 43.3800000000000 44
493.579999999999 84.4300000000001 85
472.899999999999 9.03000000000000 10
156.450000000000 15.1000000000000 16
My question is if it is possible to formulate what it means to "work quite well" and if possible to prove that the probability that "it works quite well" is so to say "high", if it is the case?
One thing we can prove about this algorithm is that if we increase a sum by trying to add numbers in $[0,1]$ to it one at a time until we exceed $L$, then the sum right before that will be at least $L-1$. So we can guarantee that the algorithm always does at least this well, except in cases where the sum of all $n$ numbers is less than $L-1$ and doing this well is impossible.
(Another exception is cases where the initial heuristic suggests taking all the numbers. I think including it is completely unnecessary, and there are cases where it can lead you astray; say your target is $L=10$, there are only $19$ numbers with expected sum $9.5$, but actually by chance their sum is $10.2>L$ and you should have taken only the largest $18$ numbers.)
The algorithm will perform at its best when $L \approx \frac n3$. In this case, when the sum $s$ first exceeds $L$, the quantities $$ s - l_{(k)} - l_{(n-k)} \le s - l_{(n-k)} \le s - l_{(k)} \le s $$ should be about evenly spaced, since $k \approx \frac n3$ leads us to expect $l_{(k)} \approx \frac13$ and $l_{(n-k)} \approx \frac23$. So the first sum that is larger than $L$ and the last sum that is smaller than $L$ are about $\frac13$ apart. Under that assumption (which becomes very likely when $n$ is large) the value of the final sum will be approximately uniformly distributed on $[L - \frac13, L]$, and on average it will only be off by $\frac16$.
(In trials such as
381.59 3.02 4we see that the algorithm does worse - about as badly as our provable guarantee - when $L$ is very small compared to $n$.)On the other hand, if our criterion for doing well were the unrealistic goal of achieving the best possible value, then the algorithm almost certainly never does well. It is very unlikely that the sum closest to $L$ but not less than $L$ is obtained by taking the $k$ smallest and $k$ largest terms for any $k$ - simply because it is very unlikely that it has any particular form. There are $2^n$ possible sums to look at, and this algorithm is limited to around $2n$ of them.
A better strategy along the same lines would be to:
The advantage here is that there are $\frac{n(n+1)}{2}$ possible sums of the form $S_{(k)}$ or $l_{(j)} + S_{(k)}$, and we go through them in increasing order. On average, these sums will be between $0$ and $\frac n2$ and close to evenly spaced, so the gap between consecutive sums will be on the order of $\frac 1n$, and we expect our final answer to be in the range $[L - \frac1n, L]$.