Shortcut to finding the square-root of a perfect-square?

I've been trying to speed up an algorithm where the most expensive operation is the square-root. However, I can often guarantee that the input value is a perfect-square. I'm curious to know if there are any algorithms that will find the square-root faster (constant time?) if it is known that the input is a perfect-square?

Thanks, Ryan

There are 5 best solutions below

Answer (score 9)

The sum of the first $k$ odd numbers is $k^2$. Knowing this, you can calculate the square root by summing successive odd numbers (starting from one); once you reach the input value, return the number of summations you made.

For example, $16 = 1 + 3 + 5 + 7$; that's $4$ addends, so $\sqrt{16}=4$. This process will always work, since our input is guaranteed to be of the form $k^2$ with $k \in \mathbb N$.

I think this method would run in $O(\sqrt n)$ time, since it performs $\sqrt n$ additions.
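A minimal sketch of this idea in C (the function name `isqrt_odd_sum` is mine, and the code assumes the input really is a perfect square):

```c
#include <stdint.h>

/* Square root of a perfect square by summing successive odd numbers:
   1 + 3 + ... + (2k-1) = k^2, so count how many addends are needed. */
uint32_t isqrt_odd_sum(uint32_t n)
{
    uint32_t sum = 0, k = 0;
    while (sum < n) {
        sum += 2 * k + 1;  /* add the (k+1)-th odd number */
        k++;
    }
    return k;  /* correct only when n is a perfect square */
}
```

For a non-square input the loop still terminates, but the returned value is merely the ceiling of the true square root.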

Answer (score 4)

I think a binary search type algorithm would be quite efficient for large input values if we know the input is a perfect square.

Let $n$ be the input value.

  1. Begin with two integers $a$ and $b$ such that $a^2 < n$ and $b^2 > n$. For $n > 1$ we could use $a=0$ and $b=n$.

  2. Find the midpoint of $a$ and $b$ and round down to the nearest integer if necessary. Call this $m$.

  3. Find $m^2$. If $m^2=n$ then $m$ is the square root. If $m^2>n$ then $m$ is too high, so we return to step 2 with $a$ and $m$ as our two integers. If $m^2<n$ then $m$ is too low, so we return to step 2 with $m$ and $b$ as our two integers. Repeat until the square root is found.

The squaring of $m$ may be what slows the algorithm down; however, multiplication is implemented in processor hardware and is therefore very efficient. In terms of the number of operations, I believe the binary search runs in logarithmic time and is therefore preferable to $O(\sqrt n)$ for large input values. However, I am no expert on algorithm efficiency...
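The three steps above can be sketched in C as follows (a sketch only: it assumes a 64-bit perfect-square input with a 32-bit root, and the name `isqrt_bsearch` is mine):

```c
#include <stdint.h>

/* Binary search for the square root of a perfect square n.
   Assumes the root fits in 32 bits, so m * m below cannot overflow.
   Invariant: lo^2 <= n and hi^2 >= n, so the answer lies in [lo, hi]. */
uint32_t isqrt_bsearch(uint64_t n)
{
    uint64_t lo = 0;
    uint64_t hi = n < UINT32_MAX ? n : UINT32_MAX;
    while (lo < hi) {
        uint64_t m = lo + (hi - lo) / 2;  /* midpoint, rounded down */
        if (m * m < n)
            lo = m + 1;  /* m is too low */
        else
            hi = m;      /* m is high enough */
    }
    return (uint32_t)lo; /* smallest m with m^2 >= n, i.e. sqrt(n) */
}
```

Each iteration halves the interval, so the loop runs $O(\log n)$ times.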

Answer (score 7)

For integers within sensible bounds of what your CPU can natively compute, it is quite easy to restrict the range of numbers you have to binary search to find the square root of $x$.

(0. Remove pairs of trailing zero bits from your binary number. Each pair you remove contributes one factor of $2$ to the result of the following steps. This can be done in constant time, if I'm not mistaken: observe the structure of "subtract 1 and XOR with the input" for numbers with $t$ trailing zeros, then use the POPCNT (Hamming weight) instruction of most serious CPUs. After removing these zeros, i.e. dividing by $4^n$, you'll end up with an odd number; if you're left with an even number (because the count of trailing zeros was odd), your number is not a perfect square.)

  1. Find $k=\lfloor\log_2 x\rfloor $, see https://graphics.stanford.edu/~seander/bithacks.html
  2. $a=\lfloor k/2\rfloor$
  3. Thus, $2^a$ becomes a lower limit for $\sqrt x$ and $2^{a+1}$ an upper. Both values can be found via bit-shifting 1.
  4. From here, do a binary search¹.
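Putting steps 0 to 4 together, a sketch in C (using the GCC/Clang intrinsics `__builtin_popcountll` and `__builtin_clzll` for the popcount and integer-log steps; the function name is mine):

```c
#include <stdint.h>

uint32_t isqrt_perfect(uint64_t x)
{
    if (x == 0)
        return 0;
    /* Step 0: for x with t trailing zeros, (x - 1) ^ x is a mask of the
       low t+1 bits, so popcount gives t + 1.  Strip the zeros; each
       removed pair contributes a factor of 2 to the final root. */
    int t = __builtin_popcountll(x ^ (x - 1)) - 1;
    x >>= t;  /* odd now, if the input was a perfect square */
    /* Steps 1-3: k = floor(log2 x), a = k/2, giving the bounds
       2^a <= sqrt(x) < 2^(a+1). */
    int k = 63 - __builtin_clzll(x);
    uint64_t lo = 1ull << (k / 2);
    uint64_t hi = lo << 1;
    /* Step 4: binary search for the smallest m with m^2 >= x. */
    while (lo < hi) {
        uint64_t m = lo + (hi - lo) / 2;
        if (m * m < x)
            lo = m + 1;
        else
            hi = m;
    }
    return (uint32_t)(lo << (t / 2));  /* restore the stripped power of 2 */
}
```

Since the search interval is only $[2^a, 2^{a+1}]$, the binary search needs at most about $a \le 32$ iterations rather than searching all of $[0, x]$.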

I doubt you'd be much faster than converting to floating point and letting the FPU do it in hardware, giving you an approximate value, convertible back to integer, from which you only need to search a small range (namely, the lost precision) for the actual integer square root.

Note that in problems such as yours, algorithmic elegance often plays a minor role: the code needs to be fast on actual hardware, so execution that avoids a lot of memory interaction is a good thing, and with SIMD instructions, doing four to sixteen operations of the same type takes about as long as doing one. So if you just need to test a few integers for their square, modifying your algorithm to try four candidates in parallel is far more efficient than merely halving the number of operations.

You have a technological problem, not so much a numerical one.


¹ Binary search assumes that you can do one squaring and one comparison at a time; as hinted at before, you might very well be able to divide your interval into five search chunks by calculating four products at once and comparing four numbers at once using SIMD. This further hints that even if there is no constant-time algorithm (and I'm pretty sure there's none), you can do better than $\mathcal O(n^2\cdot\log_2 x)$; compare Fürer's algorithm.

Answer (score 4)

I think the only advantage gained by having a perfect square in analytic methods is that you know an iterative algorithm will actually terminate. So instead here is a number theoretic solution that'll work for numbers less than $2^{66}$.

Fact 1: If $p$ is a prime with $p \equiv 3 \mod 4$ and $x$ is a perfect square $\mod p$, then $$x \equiv \left(x^{(p+1)/4}\right)^2 \mod p,$$ i.e. you can compute the modular square root by exponentiating by $(p+1)/4$. (See https://crypto.stackexchange.com/a/20994/18112)

Fact 2: The numbers $m_{17}=2^{17}-1$, $m_{19}=2^{19}-1$, and $m_{31}=2^{31}-1$ are (Mersenne) primes whose product is greater than $2^{66}$.

Method: Let $S$ be the square whose root $t$ you'd like to find. Compute the following $$t_{17} \equiv S^{2^{15}} \mod m_{17}$$ $$t_{19} \equiv S^{2^{17}} \mod m_{19}$$ $$t_{31} \equiv S^{2^{29}} \mod m_{31}$$ Then the Chinese Remainder Theorem gives $$t \equiv \pm 31207 t_{17} m_{19} m_{31} \pm 298611 m_{17} t_{19} m_{31} \pm 413071270 m_{17} m_{19} t_{31} \mod m_{17}m_{19}m_{31}$$ Then check these 8 possibilities.

Remarks: I don't know how computationally efficient this is; it's more of a mathematical solution taking advantage of knowing that $S$ is a square. I would venture to guess it's about as "constant time" as you could get, since the number of steps is essentially fixed, but that constant may be larger than the $\sqrt{n}$ of other methods for this range of $n$.
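Fact 1 can be sketched in C with an ordinary square-and-multiply modular exponentiation (the names `powmod` and `sqrt_mod` are mine; since the Mersenne moduli above fit in 32 bits, the 64-bit products below cannot overflow):

```c
#include <stdint.h>

/* Compute b^e mod p by repeated squaring; requires p < 2^32 so that
   the intermediate products fit in a uint64_t. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
    uint64_t r = 1;
    b %= p;
    while (e) {
        if (e & 1)
            r = r * b % p;
        b = b * b % p;
        e >>= 1;
    }
    return r;
}

/* Fact 1: for a prime p ≡ 3 (mod 4) and x a square mod p, a modular
   square root of x is x^((p+1)/4) mod p. */
uint64_t sqrt_mod(uint64_t x, uint64_t p)
{
    return powmod(x, (p + 1) / 4, p);
}
```

For example, with $p = m_{19} = 524287$, `sqrt_mod(123 * 123 % p, p)` returns either $123$ or $p - 123$, matching the $\pm$ signs that must be checked in the CRT step above.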

Answer (score 0)

You can get good results using a 2-adic version of Newton–Raphson. The algorithmic complexity will be no better than with the usual adaptation of Newton–Raphson to the domain of integers, but convergence is easier to establish and computation modulo a power of $2$ is especially well-suited to modern computers.

As an example of the sort of performance possible, in a couple of dozen lines of division-free, almost branch-free C code I can compute the 32-bit square root of any 64-bit perfect square in around 14.9 CPU clock cycles (around 3.3 ns at 4.5 GHz) on my Intel Core i7-8559U laptop. I've included the code below so that you can do timings for yourself.

I think this fits the "known perfect square" criterion described by the questioner, since for non-perfect squares it doesn't give a useful result (outside the presumably highly specialised application of actually wanting a 2-adic approximation to the square root of a non-square integer).

Details: It's enough to be able to compute square roots of odd perfect squares: handling zero is trivial, and for positive even perfect squares we can shift out trailing zero bits (of which there must be an even number) to get an odd perfect square, take the square root, and shift back appropriately.

So suppose that $n$ is an odd perfect square integer and define a function $f$ (on the real numbers for now) by $$f(x) = n - \frac 1 {(2x+1)^2},$$ valid for all real $x$ except $x = -1/2$. The roots of $f$ are $((\pm 1/\sqrt n) - 1) / 2$. Working through the algebra, the Newton–Raphson method applied to $f$ says that if we have a sufficiently good approximation $x$ to one of those roots then $$x - \left((x^2 + x)n + \frac{n-1}4\right) (2x + 1) \tag{1}\label{eq1}$$ should be a better approximation, and moreover that convergence of the repeated iteration should be quadratic once we get close enough.

Now here's the key point: the expression \eqref{eq1} can be evaluated using only integer arithmetic. (Note that since $n$ is an odd square, it must be congruent to $1$ modulo $4$, so $(n - 1)/4$ is an integer.) There are exactly two roots of $f$ in the 2-adic integers, and we can use \eqref{eq1} to compute successive $2$-adic approximations to the roots. Moreover, we continue to get the expected quadratic convergence to the roots; while the normal proof of this goes via calculus, for this particular $f$ we can see the quadratic convergence purely algebraically. Suppose that $x_0$ is a (rational) integer that satisfies $$(x_0^2 + x_0)n + \frac{n-1}4 \equiv 0 \pmod{2^j}$$ for some positive integer $j$. This is an integer-only reformulation of the statement that $x_0$ is congruent to $((\pm 1/\sqrt n) - 1) / 2$ modulo $2^j$ in the 2-adics, where now $\sqrt n$ represents one of the square roots of $n$ in the ring of 2-adic integers.

Let $x_1$ be the next approximation computed using the formula \eqref{eq1}: $$x_1 = x_0 - \left((x_0^2 + x_0)n + \frac{n-1}4\right) (2x_0 + 1).$$ Then simple but tedious algebra (expanding both sides) shows that $$(x_1^2 + x_1)n + \frac{n-1}4 = \left((x_0^2 + x_0)n + \frac{n-1}4\right)^2 ((2x_0+1)^2 n - 4)$$ from which we immediately have $$(x_1^2 + x_1)n + \frac{n-1}4 \equiv 0 \pmod{2^{2j}}$$ So we double the number of good bits in our approximation on every iteration. If $n < 4^j$ for some $j$ then it's enough to iterate until we have an integer $x$ satisfying $$(x^2 + x)n + \frac{n-1}4 \equiv 0 \pmod{2^j}$$ Then $a = (2x+1)n$ is an integer solution to the congruence $$a^2 \equiv n \pmod{2^{j+2}}.$$ Note that we have to be a bit careful: there are four solutions in total to this congruence, but only one of them lies in the interval $(0, 2^j)$ (after reduction modulo $2^{j+2}$), and that's the square root that we're after. Given any one solution, the others are easy to find via negation and via addition of $2^{j+1}$.
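As a readable, unoptimized sketch of this iteration (my own illustration, not the answer's final code): take the special case of an odd perfect square $n < 2^{32}$, so $j = 16$ good bits suffice, and starting from a seed correct modulo $2^1$ we need four doubling steps ($1 \to 2 \to 4 \to 8 \to 16$).

```c
#include <stdint.h>

/* 2-adic Newton square root of an odd perfect square n < 2^32.
   All arithmetic is implicitly modulo 2^32 via uint32_t wraparound. */
uint32_t isqrt32_odd(uint32_t n)
{
    uint32_t k = (n - 1) >> 2;  /* (n-1)/4, an integer since n ≡ 1 mod 4 */
    uint32_t x = 0;             /* odd squares are 1 mod 8, so k is even and
                                   x = 0 satisfies the congruence mod 2^1 */
    /* Each step doubles the number of good bits: 1 -> 2 -> 4 -> 8 -> 16. */
    for (int i = 0; i < 4; i++)
        x -= ((x * x + x) * n + k) * (2 * x + 1);
    /* Now a = (2x+1) n satisfies a^2 ≡ n mod 2^18; reducing mod 2^17
       leaves the two candidates ±a, and the true root is the unique
       one lying in (0, 2^16). */
    uint32_t a = ((2 * x + 1) * n) & 0x1ffff;
    return (a >> 16) ? 0x20000 - a : a;
}
```

The optimized 64-bit code below follows the same outline, but seeds the iteration from a lookup table (so only two doubling steps are needed) and folds the final sign fix into a couple of bit tricks.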

As a final speed trick, while we could start with a solution to the congruence modulo $2^1$ and work our way up from there, on most machines it will make more sense to use a small lookup table to enable us to start with a solution modulo $2^8$, say. That lookup table will typically be small enough to fit into a couple of level-1 cache lines on a modern processor.

Here's some C code that applies the above ideas to the special case of computing the 32-bit square root of a 64-bit perfect square integer. It's mostly portable, but it does make use of the GCC / Clang intrinsic function __builtin_ctzl for counting trailing zero bits in a nonzero integer.

#include <stdint.h>

static const uint8_t lut[128] = {
    0, 85, 83, 102, 71, 2, 36, 126, 15, 37, 28, 22, 87, 50, 107, 46,
    31, 10, 115, 57, 103, 98, 4, 33, 47, 58, 3, 118, 119, 109, 116, 113,
    63, 106, 108, 38, 120, 61, 27, 62, 79, 101, 35, 41, 104, 13, 84, 17,
    95, 53, 76, 121, 88, 34, 59, 97, 111, 5, 67, 54, 72, 82, 52, 78,
    127, 42, 44, 25, 56, 125, 91, 1, 112, 90, 99, 105, 40, 77, 20, 81,
    96, 117, 12, 70, 24, 29, 123, 94, 80, 69, 124, 9, 8, 18, 11, 14,
    64, 21, 19, 89, 7, 66, 100, 65, 48, 26, 92, 86, 23, 114, 43, 110,
    32, 74, 51, 6, 39, 93, 68, 30, 16, 122, 60, 73, 55, 45, 75, 49,
};

uint32_t isqrt64_exact(uint64_t n)
{
    uint32_t m, k, x, b;

    if (n == 0)
        return 0;

    /* Strip trailing zero bits; a perfect square has an even number of
       them, contributing a factor of 2^(j/2) to the root. */
    int j = __builtin_ctzl(n);
    n >>= j;

    m = (uint32_t)n;        /* m = n mod 2^32, with n now odd */
    k = (uint32_t)(n >> 2); /* k = (n-1)/4 mod 2^32; odd squares are 1 mod 8 */

    /* Initial approximation, correct modulo 2^8, from the lookup table. */
    x = lut[k >> 1 & 127];

    /* Two Newton steps modulo 2^32.  Note ~x = -1 - x, so x - ~x = 2x + 1
       and m*x*~x - k = -((x^2 + x) n + (n-1)/4); each step doubles the
       number of correct bits: 8 -> 16 -> 32. */
    x += (m * x * ~x - k) * (x - ~x);
    x += (m * x * ~x - k) * (x - ~x);

    /* b = ((2x+1) n - 1) / 2 mod 2^32; the odd root is 2b+1 up to sign. */
    b = m * x + 2 * k;
    b ^= -(b >> 31);             /* pick the representative with b < 2^31 */
    return (b - ~b) << (j >> 1); /* 2b+1, with the power of two restored */
}

I have a fuller version of this code available, including exhaustive tests and explanations, as a GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40