Is this alternative test faster than the original Lucas-Lehmer test?
Theorem (Lucas-Lehmer)
Let $M_p = 2^p - 1$ with $p \in \mathbb{P}$ and $p \ge 3$, and let $S_i = S_{i-1}^2 - 2$ with $S_0 = 4$. Then
$M_p$ is prime iff $S_{p-2} \equiv 0 \pmod {M_p}$.
Theorem (alternative formulation)
Let $M_p = 2^p - 1$ with $p \in \mathbb{P}$ and $p \ge 3$, and let $S_i = S_{i-1}^4 - 4S_{i-1}^2 + 2$ with $S_0 = 4$. Then
$M_p$ is prime iff $S_{(p-1)/2} \equiv -2 \pmod {M_p}$.
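One way to see how the two formulations line up (a sanity check of the indexing, not a full proof of the equivalence): one step of the alternative recurrence is two steps of the original, since

$$\left(S^2 - 2\right)^2 - 2 = S^4 - 4S^2 + 2,$$

so the alternative $S_k$ equals the original $S_{2k}$, and $(p-1)/2$ alternative steps correspond to $p-1$ original steps. If $M_p$ is prime, the original test gives $S_{p-2} \equiv 0 \pmod{M_p}$, and one more original step yields $S_{p-1} = S_{p-2}^2 - 2 \equiv -2 \pmod{M_p}$, matching the condition above.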
My PARI/GP implementation of the alternative test is slightly faster than my PARI/GP implementation of the original Lucas-Lehmer test, but I don't know whether this comparison is meaningful.
PARI/GP implementations:

MersennePrime1(p)=
{
  my(S = 4, M = 2^p - 1, ctr = 1);  \\ locals, to avoid leaking globals
  while(ctr <= p-2,
    S = Mod(S^2 - 2, M);
    ctr += 1);
  S == 0
}
MersennePrime2(p)=
{
  my(S = 4, M = 2^p - 1, ctr = 1);
  while(ctr <= (p-1)/2,
    S = Mod((S^2 - 2)^2 - 2, M);  \\ two squaring steps per reduction
    ctr += 1);
  S == M-2
}
TL;DR: The comparison isn't theoretically relevant. It may make enough of a difference in your setting to be practically relevant for you.
Since the only difference between your version and the standard algorithm is that you take the modulus only every second $S \mapsto S^2 - 2$ step (and therefore need one step more and a slightly different condition at the end), there is no difference in algorithmic complexity. Both versions need $\Theta(p)$ squarings, subtractions, and remainder operations, and the numbers involved have roughly the same size.
So the question is whether we get different constant factors making one or the other variant consistently faster.
Taking the modulus is typically a rather costly operation (although division can be implemented to have the same algorithmic complexity as multiplication, it is a much slower operation even then), so reducing the required number of such operations is a promising idea.
However, you pay for taking the modulus less often by dealing with larger numbers. In the standard algorithm, one never multiplies numbers with more than $p$ bits, while in your variant, one needs to multiply numbers with up to $2p$ bits, which takes longer. As a result, taking the modulus $S_i \bmod M_p$ will usually also take longer on average in your variant compared to the standard variant.
Which effect wins in the end (fewer modulo operations versus slower multiplication and modulo operations) depends on low-level implementation details of the multiplication and modulo algorithms, probably also on the CPU architecture, and possibly even the OS.
I've implemented both variants in C (using the GMP library), and in that implementation the standard algorithm is consistently faster for the primes $p$ I tried; some examples:
That's always a factor of approximately $1.5$. Since my C code looks quite different from the PARI/GP code, I've also implemented it in Haskell (your variant as an example), which looks a bit more like the PARI/GP code. The Haskell variants take approximately the same time as the corresponding C variants (unsurprisingly, since GHC uses GMP for `Integer` computations too). I don't have PARI/GP installed, so I can't check whether your `MersennePrime2` would also be faster than `MersennePrime1` on my box. But our results together show that neither variant of the algorithm is unconditionally faster than the other.
However, so far only the standard `mod` was used, and for Mersenne numbers we can use a faster custom modulo computation (we know more about the algorithm than the library/compiler does). That's not quite as fast as computing the remainder modulo $2^p$, which would be just one bit-masking, but we can still avoid all divisions and thus get a considerable speed-up: since $2^p \equiv 1 \pmod{M_p}$, writing $S = h \cdot 2^p + \ell$ gives $S \equiv h + \ell \pmod{M_p}$, so a reduction is just one mask-and-shift plus a conditional subtraction. This custom modulo computation plays especially nicely with the standard algorithm, since there $S < M_p^2$ always, and we know that we need at most one iteration of mask-and-shift. We need not do anything in the rare case that $S < 0$, since $S \geqslant -2$ and then $S^2$ is guaranteed to be very small in the next step. For the variant, we may need up to four mask-and-shift operations; it's simplest to then replace the conditional's `if` with a `while`. This reduces the running time quite a bit, showing that the modulo operations were responsible for the bulk of the time. Now the factor between the standard algorithm and your variant is approximately $1.8$ (except for $p = 23209$, where it's $1.7$), so the standard algorithm profited relatively more.
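As a sketch of that reduction (my reconstruction, not the answer's original snippet: the name `mod_mersenne` is made up, and 64-bit words stand in for GMP limbs, so it only covers $p \le 31$):

```c
#include <stdint.h>

/* Reduce x modulo M_p = 2^p - 1 for 0 <= x < M_p^2, p <= 31.
 * Since 2^p ≡ 1 (mod M_p), writing x = hi*2^p + lo gives
 * x ≡ hi + lo (mod M_p): one mask and one shift, no division. */
static uint64_t mod_mersenne(uint64_t x, unsigned p)
{
    const uint64_t M = (1ull << p) - 1;
    x = (x & M) + (x >> p);  /* now x <= 2*M - 1 */
    if (x >= M)              /* the variant would need a while here */
        x -= M;
    return x;
}
```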
For the not-faint-of-heart, here's the C code:
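(The full GMP listing aside, here is a simplified self-contained version of the same mask-and-shift Lucas-Lehmer test; this is my sketch, not the original code, and it uses GCC/Clang's `unsigned __int128` in place of GMP, so it only handles $3 \le p \le 61$.)

```c
#include <stdint.h>

/* Simplified Lucas-Lehmer test for odd primes 3 <= p <= 61
 * (illustration only; the full version uses GMP for large p).
 * Uses only the mask-and-shift reduction: since 2^p ≡ 1 (mod M_p),
 * (t & M) + (t >> p) ≡ t (mod M_p) holds for any t. */
static int lucas_lehmer(unsigned p)
{
    typedef unsigned __int128 u128;
    const u128 M = ((u128)1 << p) - 1;  /* M_p = 2^p - 1 */
    u128 S = 4;
    for (unsigned i = 1; i <= p - 2; i++) {
        u128 t = S * S + (M - 2);       /* S^2 - 2, kept nonnegative */
        S = (t & M) + (t >> p);         /* one mask-and-shift suffices:
                                           t < M * 2^p, so t >> p < M */
        if (S >= M)                     /* at most one subtraction */
            S -= M;
    }
    return S == 0;                      /* M_p prime iff S_{p-2} ≡ 0 */
}
```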