In the answers, DanaJ demonstrates encoding natural numbers as a set of prime-factor positions, as described in the question below, taking about $1.2$ to $1.5$ times as many bits as a more straightforward single-binary-number encoding.
A natural number can be stored as its prime factors, for example:
$10 = 2 \times 5 = \text{product}(2, 5)\\12 = 2 \times 2 \times 3 = \text{product}(2, 2, 3)\\13 = 13 = \text{product}(13)$
And its prime factors, being prime numbers, can each be stored as the "position" of that prime in the sequence of primes.
$p(1) = 2\\p(2) = 3\\p(3) = 5\\p(4) = 7\\p(5) = 11\\p(6) = 13$
Therefore (let $C$ be the name of a new function):
$10 = 2 \times 5 = p(1) \times p(3) = \text{product}(p(1), p(3)) = C(1, 3)\\12 = 2 \times 2 \times 3 = p(1) \times p(1) \times p(2) = \text{product}(p(1), p(1), p(2)) = C(1, 1, 2)\\\text{etc}\ldots$
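This mapping is easy to sketch in code. Below is a minimal Python illustration (the helper names `primes_up_to`, `to_indices`, and `from_indices` are mine, not from the question):

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit, in order."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def to_indices(n, primes):
    """Factor n and return the 1-based prime indices, e.g. 12 -> [1, 1, 2]."""
    out = []
    for idx, p in enumerate(primes, start=1):
        while n % p == 0:
            out.append(idx)
            n //= p
        if n == 1:
            break
    return out

def from_indices(indices, primes):
    """Inverse: multiply the primes at the given 1-based indices."""
    result = 1
    for i in indices:
        result *= primes[i - 1]
    return result

primes = primes_up_to(100)
print(to_indices(10, primes))        # [1, 3]    i.e. C(1, 3)
print(to_indices(12, primes))        # [1, 1, 2] i.e. C(1, 1, 2)
print(from_indices([1, 3], primes))  # 10
```

Because the factors are emitted in ascending order, the index list is always non-decreasing, which the gap encoding below relies on.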
The density of primes decreases with magnitude; adjacent larger primes are spaced further apart.
For example, the primes between $10$ and $20$ are $11, 13, 17$, and $19$, but $1000$ to $1010$ contains just one prime, $1009$.
However, larger numbers tend to have more factors.
Question:
Typically, roughly how many times more data should it take to store numbers as described than in the usual single-binary-number way?
Is this approach more efficient for larger numbers?
Following is an example with a 2-digit and a 4-digit number.
I've included in square brackets roughly how many binary bits this could require. This is probably an underestimate, as it doesn't count everything, such as storage for the lengths of the numbers or for the length of the list of numbers. I'll use $n$ to represent this additional cost.
I'll represent, for example, $C(2, 3, 3, 6)$ as $C(+1, +1, +0, +3)$, storing each index as its gap from the previous one (starting from $1$); this works because consecutive values never decrease.
$35\ [5\text{ bits}] = C(+2, +1)\ [2+1+n\text{ bits}]\\1822\ [11\text{ bits}] = C(+0, +155)\ [0+8+n\text{ bits}]$
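The gap form can be sketched as follows (a Python illustration; `to_deltas` and `from_deltas` are made-up names, not from the question):

```python
def to_deltas(indices):
    """Non-decreasing 1-based indices -> gaps from the previous index,
    with the first gap measured from 1: [2, 3, 3, 6] -> [1, 1, 0, 3],
    i.e. C(2, 3, 3, 6) becomes C(+1, +1, +0, +3)."""
    out, prev = [], 1
    for i in indices:
        out.append(i - prev)
        prev = i
    return out

def from_deltas(deltas):
    """Inverse: accumulate the gaps back into absolute indices."""
    out, prev = [], 1
    for d in deltas:
        prev += d
        out.append(prev)
    return out

print(to_deltas([3, 4]))          # 35 = p(3)*p(4)    -> [2, 1],   C(+2, +1)
print(to_deltas([1, 156]))        # 1822 = p(1)*p(156) -> [0, 155], C(+0, +155)
print(from_deltas([1, 1, 0, 3]))  # [2, 3, 3, 6]
```

Since the gaps are small for numbers with many small factors, they are cheaper to store with a variable-length code than the absolute indices would be.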
I wrote some Perl modules that relate to this, so I thought I'd try them out. Install them with `cpan ntheory Data::BitStream Data::BitStream::XS`. I use a terminator between values so we know when the factors for this number have stopped.

We can use `get/put_binword(<bits>, $n)` to store data in a fixed number of bits. There are lots of variable-length codes to choose from. I compare binary (simple, but it either relies on knowing an upper limit or wastes space), adaptive Rice encoding of the number, and Elias Delta encoding of the factor indices. Fibonacci and Boldi-Vigna-2 codes seem to be a bit more efficient (e.g. 322M bits vs. 339M bits for 1-10M). There are other tricks that can reduce the number of bits as well (e.g. encoding the number of '2' factors in unary).

In bits, for the first 100k numbers I get:
For 1-1,000,000:
For 1-10,000,000:
For 2^32-1M to 2^32-1:
For 2^36-1M to 2^36-1:
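For reference, the Elias codes mentioned above can be sketched in a few lines of Python (an illustration of the standard gamma and delta codes, not of the `Data::BitStream` implementation):

```python
def elias_gamma(n):
    """Elias gamma code of n >= 1: (bit-length - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def elias_delta(n):
    """Elias delta code of n >= 1: gamma-code the bit length of n,
    then append n's binary digits without the leading 1."""
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

for n in (1, 2, 8, 155):
    code = elias_delta(n)
    print(n, code, len(code))  # e.g. 155 fits in 14 bits
```

Since the index gaps can be $0$ (the $+0$ in the example above), a real encoder would code gap$+1$, as these codes only represent integers $\ge 1$.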
I believe we could lower this to ~1.34x using Fibonacci codes. Perhaps a taboo or start/stop code could be made to lower it further. As mentioned, there are more complicated methods that can shave off even more bits.
Of course, a big limitation here is the factoring, prime counting, and nth-prime computation. Factoring 64-bit numbers is pretty easy, so this isn't taking much time. Exact prime counts aren't cheap, however -- past 2^36 or so they get pretty expensive (the prime count for 2^32 takes about 3 milliseconds; 2^36, 14 ms; 2^40, 75 ms). The nth prime has a similar cost. It's really not a viable method past 2^50 or so.
This storage compares well to rounded-up binary storage (e.g. using 64 bits to store numbers between 1 and 10M), but that's not a fair comparison. We should compare against an efficient encoding of the raw numbers, such as Gamma/Delta/Omega or adaptive Rice coding, or a tight bit bound.
How many times more data: somewhere between 1.2x and 1.5x a tight bit bound. It starts to look much better if you are rounding up to 32 or 64 bits per number, where a lot of those bits are wasted.
Is it more efficient for larger numbers? It looks like it might be, but not by much (clarification: more efficient than the same method on smaller numbers, not more efficient than tightly storing the original number). The limitation of storing the prime index means "large" can't be very large.
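A comparison in this spirit can be reproduced at small scale. The sketch below is my own simplified scheme, not the answer's exact one: instead of a terminator, it Elias-delta-codes the factor count, then each index gap plus one, and compares the total against the tight bit bound of the raw numbers.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit, in order."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def delta_len(n):
    """Bit length of the Elias delta code for n >= 1:
    (bits(n) - 1) payload bits plus a gamma code of bits(n)."""
    nb = n.bit_length()
    return (nb - 1) + 2 * nb.bit_length() - 1

def encoded_bits(n, primes):
    """Bits to code n >= 2: delta-code the factor count,
    then each 1-based index gap + 1 (gaps can be 0)."""
    idxs, m = [], n
    for i, p in enumerate(primes, start=1):
        while m % p == 0:
            idxs.append(i)
            m //= p
        if m == 1:
            break
    bits, prev = delta_len(len(idxs)), 1
    for i in idxs:
        bits += delta_len(i - prev + 1)
        prev = i
    return bits

N = 10000
primes = primes_up_to(N)
enc = sum(encoded_bits(n, primes) for n in range(2, N + 1))
raw = sum(n.bit_length() for n in range(2, N + 1))
print(f"index encoding: {enc} bits, tight binary: {raw} bits, "
      f"ratio {enc / raw:.2f}")
```

The exact ratio depends on the chosen code and range; this simplified scheme lands in the same general region as the figures quoted above, but it is a sketch, not a reproduction of the answer's encoder.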