Faster significant prefix comparison in binary representation


An "integer prefix" here means the first significant digits of a representation, ignoring leading zeros. Some examples:

  • Same decimal prefix 45: $(4512)_{10}$ and $(0045)_{10}$
  • Same binary prefix 101: $(000101)_{2}$ and $(010101)_{2}$

So, for the second example, suppose a function p that checks for a common prefix of size 3: $p(000101, 010101, 3)$ returns true.

... A function p(int x, int y, int size) that returns true when the numbers have the same prefix. Suppose $p(x,y,L)$ for $x,y$ positive 64-bit integers and $0<L<64$.

I have all the usual integer arithmetic (*, +, -, etc.) and bitwise (AND, OR, shift left, etc.) basic operations... How can I express this p function?
Supposing that there are many solutions, I need the fastest one (least computational time).


NOTES

There are some clues here, but I do not see how to use them.

There is also a good example of an application, the Geohash 64-bit representation. See this pg_geohash issue. The analogous approach can be used with Metaphone, SOUNDEX, pHash and other "similarity hashes" (mathematically, Locality-sensitive hashing).

1 Answer


Summary

Perhaps there is an answer, and it is

$$p(y,x,L) := [[(f(y) \oplus f(x)) >> (64-L)] = 0]$$

but it is difficult to prove that it is the "fastest way". The only algorithm that actually looks fastest, with a performance gain over the "common" ones, is the one that uses a cache strategy (pre-processing the y values): that is, enhancing the above formula by replacing $f(y)$ with a pre-computed constant $c$.

Discussion and details

This is not a generic solution; it is valid only in situations where just one parameter needs "real time" speed.

The problem explained in the question does not seem to offer a trivial solution, so I used a brute-force one: caching a pre-processed number... It is an economical cache, and valid as a solution to my problem in a database.

Cache-phase: some examples, supposing 8 bits.

  • original=$00010100$ and cached=$10100000$.
  • original=$01010100$ and cached=$10101000$.
  • original=$11010101$ and cached=$11010101$.

Real-time checking: comparing the cached number c with x.

Suppose a comparison with $L=3$ and $x=01011000$. We need to transform x into $10110000$ (left-align it), then preserve only the first 3 (L) bits, resulting in $y=10100000$; the mask of 3 leading ones is $11100000$. Examples:

  • for $c:=10100000$, compare $c\&11100000$ (which is $10100000$) with y: equal, so true.
  • for $c:=10101000$, compare $c\&11100000$ (which is $10100000$) with y: equal, so true.
  • for $c:=11010101$, compare $c\&11100000$ (which is $11000000$) with y: not equal, so false.

Conclusion

  • Cache-phase is a "while the top bit is zero" loop that does only x << 1 to obtain the cached 64-bit representation. Let us label it $f(x)$: a left-alignment by the number of leading zeros, the quantity returned by primitives in the "BitScanReverse" family.

  • The function $p(c,x,L)$ that checks prefixes of size L also needs a function $d(L)$ that builds a "left padding of ones" ($d(2)=11000000$, $d(3)=11100000$, etc.). It is an integer comparison with a bitwise AND, after the f transformation: $$p(c,x,L) := [[c \& d(L)] = [f(x) \& d(L)]]$$

or, using Ben Voigt's faster solution:

$$p(c,x,L) := [[(c \oplus f(x)) >> (64-L)] = 0]$$

There are two performance gains:

  • replacing the $\&d(L)$ part by $>>(64-L)$ removes a function call and ensures a good implementation for this part of the function;

  • replacing the equality test between two computed values by a comparison with a constant. Using a constant instead of a variable is the main performance gain.
    As this constant is zero, modern compilers will emit something like test eax,eax instead of cmp eax,0, which is a marginally better-performing instruction.

PS: the function f(x) can be implemented as a "left align" using a function g(x) that tells how far to shift, so $f(x) = x << g(x)$. As Ben Voigt explained, this is the fastest way to do it; g amounts to a count of leading zeros, and the related intrinsic in C compilers is known under names such as BitScanReverse64().