Lets first consider an integer less than p. Say A and B are of size ~339 bits. Now we split this integers into powers of $Z=2^{32}$, say $$A= \sum\limits_{k=0}^{10} a_{k}Z^{k}, B= \sum\limits_{k=0}^{10} b_{k}Z^{k}$$
and we do the same with $p\sim 340$ bit. $$p= \sum\limits_{k=0}^{10} p_{k}Z^{k}$$
where $a_0, b_0$ and $p_0$.
Aiming the product mod p
If we do multiplying $C=A\cdot B$ first and reduce by p then, we need to calculate: $$C=\sum\limits_{k=0}^{22} a_{k}Z^{k}$$
first.
Reducing mod p would then mean,
1. Split C in $C_1 . C_0$, where $C_1= [c_{22}, ..., c_{11}]$ and $C_0=[c_{11},...,c_0]$.
2. Compute $C_1$ mod $p$
3. While C_0 != 0 do
3a) leftshift by multiples of 4-bit C_1
3b) fill C_1 with C_0
3c) leftshift by multiples of 4-bit C_0
3d) C_1 mod p
Where "leftshifting" means, that we fill the leading zeros with previous entries, such that this count is on the end.
"Filling" means, that we fill those zeros in the end of $C_1$, by entries of $C_0$.
Multiplication means, that we need to carry the overflow.
Question
Doing this, leads to the right product mod p. But I'm interested in some other mathematical methods to do this. But I need a practical description, like a "how to", even if this is only a scratch.
Montgomery Multiplication is an excellent algorithm designed to do modular multiplication of large numbers modulo a prime $p$ (large or small) extremely fast. The wikipedia page can tell you far more about it than I can, and it looks like it has detailed descriptions of various algorithms and implementations.