How to number the natural numbers lexicographically with minimal overhead (and provide a lower bound for the overhead)?

49 Views Asked by At

Working in binary, note that the number 100 is lexicographically smaller than the number 11 even though $100 > 11$. How can we devise a function $f$ such that $f(a)$ is lexicographically smaller than $f(b)$ exactly when $a < b$ for all integers $a, b > 0$, and such that $string\_length(f(x))$ is "small" relative to $string\_length(x)$?

For example, prepending $1^n0$ to a number of string length $n$ is one such scheme that requires $2 * n + 1$ characters to represent a number of length $n$.

A better solution is the following:

$f(x) = 1^{string\_length(g(x))}0g(x)00x$,

where $g(x)$ is an encoding of the length of $x$ that avoids the number pattern $00$. Such a $g$ can be constructed as a bitset encoding of set of Fibonacci numbers that is not missing more than one consecutive Fibonacci number where the sum of the numbers in the set is equal to the string length of $x$. With this definition, $g(x)$ will have at most $log_\phi(string\_length(x))$ characters, so the total encoding of a number of length $n$ will require no more than $n + 2 * \log_\phi n + 3$ characters.

This process can be iterated one or more times to achieve a bound of $n + \log_\phi n + o(\log_\phi n)$.

As a further optimization, we can avoid the intervening zeros and increase the base of the $\log$ by deducing the length of the prepended padding. Consider the following scheme, which has two iterations of "length-prepending" compared to just one iteration above and uses $|x|$ to represent $string\_length(x)$ (ceiling operations omitted):

$f(x) = 1^{\lg \lg |x|} 0 (\lg \lg |x|) (\lg |x|) x.$

Again the process can be iterated to shrink coefficient for the low order iterated log term (and we can remove one character by omitting the most-significant bit of the $\lg \lg |x|$ portion).

How close is this to optimal? Is there a better solution that requires significantly less overhead? Is there a lower-bound to show that you need an overhead of at least an additive $\Omega(\log n)$ characters? Has anything like this been studied (I thought of this recreationally as a tangent to a problem that came up at work)?