Numerical conventions which survive lexical sorting

87 Views Asked by At

The numerical strings "5","150" and "23" when sorted lexically do not appear in numerical sequence. The normal approach in system design is to either store these values as integers, or in a fixed-length string, padded on the left with zeroes.

This is not ideal in some systems (for example) file names when the number of items is not known in advance. "Item-5" will still appear after "Item-150" when the items are listed alphabetically.

I thought of a simple transformation where the integer N is represented by a numerical string in two parts. The first part is a single digit which denotes the number of subsequent digits in the string, the subsequent digits are just the string representation of the integer.

So 0 would be represented as "10" ("1" digit, value "0")
9 as "19" ("1" digit, value "9")
10 as "210" ("2" digits, value "10")
950 as "3950" ("3" digits, value "950")

The advantage of this approach is that a lexical sorting of the transformed digits will appear in numerical order, and a developer does not need to know in advance the maximum number of items that need to be sorted.

It would work for collections of up to $10^9$ items, and might be useful in ad-hoc filing systems.

I am sure this is nothing new, and would like to know: Has this method been described more rigorously before?

I don't want to re-invent the wheel, and would prefer to take advantage of the experience of others if possible.

Is there a better / more efficient way to coerce a numerical order from an integer stored as a string?

1

There are 1 best solutions below

1
On

This is a very sensible approach. There are two objections to it, both appealing more to purists than to practical people. The first is that the most common numbers, the single digit ones, get doubled in length. Yes, that is just one more digit. The second is that you are limited to $10^9$ numbers in the scheme. You are unlikely to have more versions of a file than that.

You can avoid both of these by decreeing that no version number will have $9$ as a leading digit. Your versions will start $0,1,2,\ldots 8$, but the ninth will be $10$ and every one up to $88$ will be incremented by $1.\ \ 89$ will get incremented to $100$ and every number up to $888$ will get incremented by $11$ and so on. Then $0$ through $8$ will be represented in the system unchanged. Larger numbers will be prefixed by a string of $9$s one less than the number of digits, so $47$ would get incremented to $48$ and represented as $948$. This has less increase in digits than your scheme for $0-8$, breaks even for $9-88$ and has one more for numbers above $88$. It works for arbitrarily large numbers.

If we worry about very large numbers, we can do better still. If the version number is of order $10^{2000}$ we will have to prepend $2000$ nines. Numbers will almost double in length. Let a single $9$ be followed by a single digit that is the number of digits to follow. Now we add two digits to two digit numbers instead of one, but we also only add two digits to numbers with nine digits. If the number of digits is nine or more, we encode it just like the version numbers to not have $9$ as the leading digit. We then prefix the version with $99$ followed by the two digit encoded version of the number of digits. If the encoded number of digits has three digits, we prefix with $999$ and the number of digits and so on.

You have to decide how many layers of length encoding you want to deal with. If you want very large numbers, more layers will be worthwhile. You can then go to a meta layer, where you encode the number of layers in the first number, the layer in the second, and the number in the third. It goes on....