I am trying to apply an algorithm that requires significant modular arithmetic, but in such a way as to avoid as many costly mod or division operations as possible.
I found the Montgomery multiplication method and its form of integer residues (via wikipedia and thus the original paper), but I am having difficulty understanding how to apply it practically.
Using the example in the wikipedia writeup, with $n = 17$ and an $R^{-1} = 8$, I can do very simple arithmetic with an implementation of redc that matches the expected value $mod\ n$:
$ (7 * 15) `mod` n
3
$ redc (3 * 4 * r')
3
$ (7 + 15) `mod` n
5
$ redc (3 + 4)
5
However, while the original paper and the wikipedia page say explicitly that most arithmetic operations can be applied over residues as "normally used", I'm not seeing that borne out:
$ (7 * 15 * 15) `mod` n
11
$ redc (3 * 4 * 4 * r')
12
(I know this might be on the edge of this being a stackoverflow question, but I suspect I'm suffering from a formal misunderstanding here, rather than a simple programming error.)
$\def\redc{\mathrm{redc}}$ Throughout this post, the parameters $N$, $R$, and $R'$ will be implicit in function calls.
What is the operation?
The arithmetic primitive $\redc$ is meant to perform the operation:
It's primary advantage over a more direct modular reduction algorithm is that the only divisions this algorithm does is division by $R$. When $R$ is a power of 2, this can be especially convenient for computer implementation. $R = 2^{32}$ and $R=2^{64}$ are popular choices, since the corresponding division operations are essentially free due to how the processor stores and manipulates integers in registers.
Using it for modular multiplication
Using $\redc$ for reductions can be awkward since you have to account for that extra factor of $R^{-1}$ in your calculations. "Montgomery form" is a clever trick that makes $\redc$ more convenient to use. The relevant point is that you have an algebraic identity
$$ \redc(xR \cdot yR) \equiv xyR \pmod N $$
In an actual program, the trick is that you don't bother keeping around the values $x \bmod N$ and $y \bmod N$ — you only store the values (equivalent to) $xR \bmod N$ and $yR \bmod N$. Every time you multiply two values, you accumulate an extra factor of $R$ in the product, which is cancelled out when you invoke $\redc$, so the bookkeeping is nearly trivial.
Note that when you multiply three values, you have to invoke $\redc$ twice. The reason for this is three-fold.
First of all, you need to invoke redc twice to keep the bookkeeping lined up; if you only invoke it once your result has an extra factor of $R$:
$$\begin{align} \redc(xR \cdot yR \cdot zR) &\equiv xyzR^2 &\pmod N \\ \redc(\redc(xR \cdot yR) \cdot zR) &\equiv xyzR &\pmod N \end{align}$$
The second reason is that larger numbers require more work to multiply, since multiprecision integer operations get more expensive as the numbers get larger. It's faster to insert an extra reduction step in the middle to keep the operand sizes down. (If you're doing single precision calculations, you still need to keep operand sizes down in order to avoid overflow!)
The third reason is the amount of reduction that $\redc$ does. With typical parameters, the product of three numbers will be too large for $\redc$ to fully reduce in one invocation — you need to invoke it twice to get the result to be about the same size as $N$.
(Not) using it for modular addition
You don't want to invoke $\redc$ for additions. Addition doesn't make numbers much larger, so its faster just to test if you've exceeded $N$ and subtract $N$ if you have. (Actually, replacing $N$ with a small multiple like $2N$ or $4N$ is often more convenient)
Furthermore, check the algebra: $$\begin{align} \redc(xR + yR) &\equiv x + y &\pmod N \\ xR + yR &\equiv (x + y)R &\pmod N \end{align}$$ If you were planning on storing the result of addition in Montgomery form, then the latter identity is the one you want to implement!
Consider Barrett reduction as an alternative
A complementary algorithm to $\redc$ is that of Barrett reduction, which is based on essentially the same idea1 as $\redc$. Barrett reduction has the advantage that there is no extra factors in the result; the output is precisely what you want: $$ \mathrm{BarrettReduction}(x) \equiv x \pmod N $$ Thus, using it efficiently doesn't require any of this business of alternate forms.
Unfortunately, it tends to be slower than $\redc$.
1: Montgomery reduction and Barrett reduction are performing the same series of steps, but in different arithmetic contexts. Barrett reduction works in the real numbers, whereas Montgomery reduction works in the $2$-adic numbers.