Background: I had previously asked this question and received a comment that it isn't worthwhile analyzing an algorithm that is sometimes slower than trial division. So, I had been working on a different idea and came up with the following method (which seems to do at least as well as Trial Division asymptotically i.e., better than $O(\sqrt n)$ by a constant factor.).
Context: I am seeking a review of this algorithm and suggestions for any potential optimizations to improve it.
Setup: Given $n \in \mathbb{Z}$, an odd composite number, not a perfect square, with $n = ab$ with $a,b$ unknown and not necessarily prime, the basic idea is that we look at $a^2 b^2 \equiv n^2$ modulo $p_1, p_2, \cdots p_k \in \mathbb{P}$ where $M =\prod_{i=1}^k p_i \gt \sqrt n$ and try to construct the points $(a,b)$ modulo $M$ using Chinese Remainder Theorem.
Without loss of generality we take $a \lt \sqrt n \lt b \lt n$ and $\gcd(p_i, n) = 1$. (If $p_i|n$ then, $p_i$ is a trivial cofactor of $n$ and we can stop.)
For $p_i$ prime, we have at most $(p_i + 1) / 2$ quadratic residues modulo $p_i$ (including 0 here). We compute by brute force, the sets of quadratic residues modulo $p_i$ given by $q_i$ as
$$ q_i = \{x^2 \bmod p_i : 0 \le x \lt p\} $$
This can even be precomputed and is not part of the main algorithm.
We then compute the cartesian product
$$ \pi_i = \{(a_i, b_i) : a_i,q_i \in q_i {\backslash}\{0\}, a_i^2b_i^2 \equiv n^2 \bmod p_i\} $$
Note that we exclude $0$ from the set of candidate quadratic residues because we have $\gcd(p_i, n) = 1$. We then combine $\pi_i$ using Chinese Remainder Theorem to get the set of points $(a,b) \bmod M$ given by $\Pi$ as
$$ \begin{align} \Pi = \{ & & \newline & (a \bmod M, b \bmod M) : & \newline & a = ChineseRemainder([a_1, a_2,\cdots, a_k], [p_1, p_2, \cdots, p_k]), a \lt \sqrt n & \newline & b = ChineseRemainder([b_1, b_2,\cdots, b_k], [p_1, p_2, \cdots, p_k]), & \newline & a^2+b^2+2n \equiv (a+b)^2 \bmod M & \newline \} & \end{align} $$
Analysis: The algorithm runs in time
$$ O\bigg(\prod_{i=1}^k \big(\frac {p_i - 1} 2\big) \bigg) = O\bigg({\frac 1 {2^k}} {\prod_{i=1}^k {(p_i - 1)}} \bigg) = O\bigg({\frac 1 {2^k}} \varphi(M)\bigg) $$
where $\varphi(x)$ is the Euler Totient function.
We know $\varphi(M) \sim \sqrt n $. The ${2^{-k}}$ constant factor ensures that the algorithm only runs a fraction of $O(\sqrt n)$.
Empirical data: I have written a program implementing the above algorithm and tested semi-primes with factors $a,b$ taken from the first 25 primes greater than $10000$ (and $a \ne b$) and counted the actual number of iterations for finding the smaller factor $a$.
Factors of 104643199 = 10007x10457 in 1383 iterations. Sqrt(n) = 10229, Rate = 0.13520
...
Factors of 100460333 = 10009x10037 in 960 iterations. Sqrt(n) = 10022, Rate = 0.09579
...
Factors of 104664113 = 10009x10457 in 981 iterations. Sqrt(n) = 10230, Rate = 0.09589
...
Factors of 100761443 = 10037x10039 in 1422 iterations. Sqrt(n) = 10037, Rate = 0.14168
...
Factors of 104956909 = 10037x10457 in 1453 iterations. Sqrt(n) = 10244, Rate = 0.14184
...
Factors of 101002379 = 10039x10061 in 749 iterations. Sqrt(n) = 10049, Rate = 0.07453
...
...
Count = 1225
Min = 0.00902049220511815
Max = 0.18988688001546933
Avg = 0.11078722057561649
$$Rate = {\frac {\text #iterations} {\sqrt n}}$$
In every one of the 1225 cases, the actual number of iterations is less than $\sqrt n$ as calculated by $Rate$. On average, the actual number of iterations is only $0.111 \times \sqrt n$ for semi-primes in this test range. In the best case less than 1% of the values ($0.009 \times \sqrt n$) were searched before finding the factor.
The following chart shows each observation on the X-axis and plots $\sqrt n$ and the $\text{#iterations}$ on the Y-axis.
Question:
1. Are there any other optimizations we can do to improve this algorithm?
Updates: Since posting this question, I have been able to confirm the following optimization.
Optimization-1. We already have a condition $a \lt \sqrt n$ where $a$ is the result of the Chinese Remainder Theorem on residues $a_j \bmod p_j$ . By adding a condition $M_i > \sqrt n \land b \lt \sqrt n$ where $M_i = \prod_{j = 1}^i p_j$, we are able to skip some combinations of $(a,b)$ after Chinese Remaindering if it satisfies this condition. This works since $a_i \bmod M_i \lt \sqrt n \lt M_i \lt b_i \bmod M_i$. This change doesn't impact asymptotic complexity (impacts only the constant factor) as the Chinese Remaindering still runs for all combinations first and the filtering is done after computing the partial Chinese Remainder result modulo $M_i$. I am wondering if we can exploit this further.
$$ \begin{matrix} \text{Metric} & \text{Without optimizations} & \text{With optimization-1} \\ \hline \text{Min $Rate$} & 0.00902049220511815 & 0.005071686335706621 \\ \text{Max $Rate$} & 0.18988688001546933 & 0.13132411067193686 \\ \text{Avg $Rate$} & 0.11078722057561649 & 0.07325461154374134 \end{matrix} $$
The updated chart for the same test dataset using optimization-1 is given below:

Optimization-2. (potential/under evaluation) Instead of considering distinct primes $p_1, p_2, \cdots, p_k$, should we consider prime powers? This brings up a couple of related questions:
- Are there prime powers $p^n$ for which we can easily generate the list of quadratic residues using a simple generating function (similar to quadratic reciprocity rule for a prime congruent to $3 \bmod 4$ or $5 \bmod 8$)?
- For a general $p^n$ how many quadratic residues modulo $p^n$ exist?
André Nicolas (https://math.stackexchange.com/users/6312/andr%c3%a9-nicolas), Number of quadractic residues $\mod p$ and $\mod n$., URL (version: 2014-04-22): https://math.stackexchange.com/q/764920
This MSE answer gives $\varphi(n)/2^k$ where $k$ is the number of distinct prime factors of $n$ as the number of QRs mod $n$.
I am sensing we use Hensel's Lemma to generate the quadratic residues modulo higher powers.
Optimization-3. If we have $(a_i,b_i) \bmod p_i$ and we want to know if it will remain in the solution set $(a_i, b_i) \bmod M_i$, it should never be eliminated when combining pairwise with any $p_j \ne p_i$. If it gets eliminated in any pairwise combination, it cannot appear in the Chinese Remainder solution modulo $M_i$. This opens the possibility of pruning the point set modulo $p_i$ incrementally. Also, we may also continue pruning using primes greater than $p_k$.
Update Nov-27-2022: I have completed evaluation of optimization-3. The algorithm has been modified to prune the point sets after each Chinese Remainder calculation. This is only a pair-wise check, so it is at most $O(p_{k}^2)$ operations for each pruning where $p_k$ is the largest prime in the basis. We do the pruning $O(p_{k})$ times. Therefore, the pruning adds $O(p_{k}^3)$ steps which is still asymptotically much less than the main algorithm. $p_k \approx O(\ln n)$, therefore the pruning adds an $O\big((\ln n)^3\big)$ component to the algorithm complexity taking it to
$$ \begin{align} & O\bigg({\frac 1 {2^k}} \varphi(M) + (\ln n)^3\bigg) \newline & = O\bigg({\frac 1 {2^{\ln n^{3/4}}}} {\sqrt n} + (\ln n)^3\bigg) \newline & = O\bigg({\frac {\sqrt n} {2^{(3 / 4)\ln n}}} + (\ln n)^3\bigg) \end{align} $$
(since the pruning occurs within the main loop).
I ran the pruning with the prime basis $M \gt \sqrt n$. That did not make any difference to the actual number of iterations. On hindsight, this made sense because the filtering happens only when the modulus $M$ is greater than $\sqrt n$. So, I expanded the prime basis to $M \gt n^{\frac 3 4}$. This gave a positive result with reduction in the Minimum, Maximum and Average Rates.
The chart is shown below for optimization-3:

$$ \begin{matrix} \text{Metric} & \text{Without optimizations} & \text{With optimization-1} & \text{With optimizations-1,3}\\ \hline \text{Min $Rate$} & 0.00902049220511815 & 0.005071686335706621 & 0.004681556617575343 \\ \text{Max $Rate$} & 0.18988688001546933 & 0.13132411067193686 & 0.116403162055336 \\ \text{Avg $Rate$} & 0.11078722057561649 & 0.07325461154374134 & 0.06531251513914364 \end{matrix} $$
