Imagine you wish to encode a non-negative integer across N buckets, where each bucket can hold a unary number. The probability of an integer being given as input to the encoder is specified by a function P(i) which sums to 1.
If
N = 1, you more or less have to encode the number in unary. So if you want to express 42, you will need to use 42 symbols, all in the first (and only) bucket.If
N = 2, you start getting options. Example: you might say that the second bucket is the tens place. So expressing 42 could be done with 2 symbols in the first bucket, and 4 in the second, for a total of 6.
P is fixed, but you don't a-priori know what N is. What's desired is a system of encoding which for a given N will use (on average) as few unary "digits" as possible to encode a number from P.
I don't want to prescribe what kind of function P is--it could be uniform within a range (so long as it included 0), or gaussian or whatever. But having it specified seems necessary to make the problem one that has "a solution".
I'm curious if this maps to an existing problem, or if someone here would have an approach to addressing it...
Just to make things a bit easier notationally, we can represent each encoding as an $N$-tuple where the $i$th component is the number of items in the $i$th bucket. Additionally, I'll use the notation $e_N(n)$ to represent the encoding of $n$ using $N$ buckets. So in your examples, $e_1(42) = (42)$ and $e_2(42) = (4, 2)$.
In order to optimize our encoding, we need to see which integers occur the most frequently. So first, sort the integers as $n_1, n_2, \dots$ such that $P(n_i) \geq P(n_{i+1})$ for all $i$.
Since $n_1$ occurs most frequently, we want to represent it with as few symbols as possible. Thus, we'll represent $n_1$ as, $e_N(n_1) = (0, \dots ,0)$. Thus, $n_1$ is encoded with zero symbols.
Next, look at how many encodings have one symbol. There are exactly $N$ such encodings. Thus, we'll encode the next $N$ integers using those encodings:
$e_N(n_2) = (0, 0, \dots, 0, 0, 1)$
$e_N(n_3) = (0,0, \dots, 0, 1, 0)$
$\dots$
$e_N(n_N) = (0,1, \dots, 0, 0, 0)$
$e_N(n_{N+1}) = (1,0, \dots, 0, 0, 0)$
Next, we look at the number of encodings with exactly $2$ symbols and we assign all the currently unassigned integers at the front of our list to those encodings. We then look at the encodings with exactly $3$ symbols and repeat the same process. Then do the same for $4$ symbols, $5$ symbols, and so on and so forth.
As an example suppose I had $2$ buckets and after ordering my list of integers, I get $15, 3, 4, 7, 2, 8, 5, 1, \dots$. Then, my encodings will be,
$e_2(15) = (0,0)$
$e_2(3) = (0,1)$
$e_2(4) = (1,0)$
$e_2(7) = (0,2)$
$e_2(2) = (1,1)$
$e_2(8) = (2,0)$
$e_2(5) = (0,3)$
$e_2(1) = (1,2)$
$\dots$