Ultimately, I'm looking to implement arithmetic in GF$(2^{32})$. I have a library that implements arithmetic in GF$(2^{16})$ using look-up tables for log and anti-log to implement multiplication, and addition/subtraction are simply $\oplus$ (xor).
My understanding is that I can implement GF$(2^{32})$ as GF$((2^{16})^2)$. I have been looking at this paper, which describes algorithms for doing that; however, I'm struggling to implement the reduction step.
Following Algorithm 2 on pg 14, attempting to multiply 0xAABBCCDD and 0x99887766 in GF$(2^{32})$, I'm using the following algorithm, where * denotes multiplication in GF$(2^{16})$:
c[0] = a[0] * b[0]
c[1] = (a[1] * b[0]) ^ (a[0] * b[1])
c[2] = a[1] * b[1]
Now my understanding is that I need to reduce this number ($c[0]*2^{32} + c[1]*2^{16} + c[2]$) by the polynomial listed in Table II on pg 13: $x^2 + x + 8192$. This is where I become lost... how do I do this? The paper explains doing this assuming the reduction polynomial is of the form $p(x) = x^m + x^3 + x + v$. However, the polynomial listed in the table is not of this form, or if it is, then I'm completely lost.
My understanding is that if c[0] is not zero, then I need to reduce c[1] and c[2] using the following algorithm which will give a result in GF$(2^{32})$:
result[0] = c[1] ^ (poly[0] * c[0])
result[1] = c[2] ^ (poly[1] * c[0])
result = (result[0] << 16) + result[1]
However, given the polynomial $x^2 + x + 8192$, I'm not sure what values to use for poly[0] and poly[1] in the above algorithm; the algorithm itself might also be wrong.
Any help on this final reduction step is greatly appreciated!
Update: Posting the full Python code, which uses PyFinite:
from pyfinite import ffield
f_16 = ffield.FField(16, gen=0x1100B)
f_32 = ffield.FField(32)
# x^2 + x + 8192
reduce_poly = [1, 1, 8192]
a_32 = 0xAABBCCDD
b_32 = 0x99887766
a_16 = [a_32 >> 16, a_32 & 0xFFFF]
b_16 = [b_32 >> 16, b_32 & 0xFFFF]
c = [0 for _ in range(0, 3)]
c[0] = f_16.Multiply(a_16[0], b_16[0])
c[1] = f_16.Add(f_16.Multiply(a_16[1], b_16[0]), f_16.Multiply(a_16[0], b_16[1]))
c[2] = f_16.Multiply(a_16[1], b_16[1])
print("C[0]: 0x{:04X}".format(c[0]))
print("C[1]: 0x{:04X}".format(c[1]))
print("C[2]: 0x{:04X}".format(c[2]))
print("32: 0x{:08X}".format(f_32.Multiply(a_32, b_32)))
print("16: 0x{:04X}{:04X}".format(c[1] ^ c[0], c[2] ^ f_16.Multiply(reduce_poly[2], c[0])))
For those that find this, the above values will not match because the polynomials used by GF$((2^{16})^2)$ and GF$(2^{32})$ in PyFinite are not the same. I'm not sure how to set the generator for the GF$(2^{32})$ field to something such that they match; however, I believe the above code is correct for GF$((2^{16})^2)$.
Short answer - there is no reduction step from GF((2^16)^2) back to GF(2^32) in that paper. Instead all of the operations are performed in GF((2^16)^2). Since any GF(2^32) could be mapped to the GF((2^16)^2) used in that paper, the particular GF(2^32) field is left unspecified.
In general, for any prime number p, any GF(p^k) can be mapped to any GF((p^n)^m) where k = n · m, but in order for the two representations of the fields to be mathematically compatible, the two fields must be isomorphic in addition and multiplication: map(a + b) = map(a) + map(b) and map(a b) = map(a) map(b).
The parameters to map() are in GF(p^k); the mapped values are in GF((p^n)^m). Typically, the primitive element for GF((p^n)^m) is β(x) = x + 0. An exhaustive search is done for a primitive element α(x) of GF(p^k) that meets the isomorphism requirements. The mapping can be accomplished via matrix multiplication: a k by k matrix with elements in GF(p) times a GF(p^k) value treated as a k by 1 matrix. The inverse matrix can be used to map back.
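As a sketch of that matrix step (Python, using a placeholder identity matrix rather than the real mapping matrix found by the search), multiplying a k by k matrix over GF(2) by a k-bit value amounts to XORing the columns selected by the set bits:

```python
def gf2_matvec(cols, x, k=32):
    """Multiply a k x k matrix over GF(2) by a k-bit column vector.

    cols[i] is the i-th matrix column packed as a k-bit integer;
    column 0 corresponds to bit k-1 (2^(k-1)) of x, matching the
    column ordering described above.
    """
    r = 0
    for i in range(k):
        if (x >> (k - 1 - i)) & 1:  # set bit selects column i
            r ^= cols[i]            # addition in GF(2) is xor
    return r

# Placeholder: with the identity matrix, the map is the identity.
identity = [1 << (31 - i) for i in range(32)]
```

Because the map is GF(2)-linear, map(a + b) = map(a) ^ map(b) holds automatically for any such matrix; the exhaustive search is only needed to make multiplication compatible as well.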
To answer the updated question, I changed the parameters so that the GF((2^16)^2) parameters are the same as the question. For a GF(2^32) example in this answer, I chose the common one used for jerasure since it's mentioned in the article, but any GF(2^32) can be mapped to the GF((2^16)^2) in the question. In this case the primitive element of GF((2^16)^2) = β(x) = x + 0. For mapping purposes, any primitive element α(x) of GF(2^32) that results in the two fields being isomorphic in addition (map(a+b) = map(a)+map(b)) and multiplication (map(a b) = map(a) map(b)) can be used.
I wrote an optimized exhaustive search program to find an isomorphism-compliant primitive element α(x) for GF(2^32), and used it along with β(x) (= x + 0) to generate a 32 row by 32 bit mapping matrix and its inverse to map between GF(2^32) and GF((2^16)^2). The mapping matrix column indexes correspond to bits 31 to 0, or 2^31 to 2^0. Define an array of powers p{...} = log_α{2^31, 2^30, ..., 2, 1}; then the mapping matrix column values are β(x)^p{...}. The two matrices and the mapping code are shown at the bottom of this answer. The first row below is the GF(2^32) multiply; the second row maps the parameters and multiplies in GF((2^16)^2); the third row maps the GF((2^16)^2) product back to GF(2^32), matching the GF(2^32) product:
Note that since the carryless multiply instruction pclmulqdq (which operates on xmm registers) has been available on x86 processors since 2008, a GF(2^32) multiply can be implemented using 3 pclmulqdq and 1 xor, so there is no need to use a composite field for multiplication. For inversion (1/x) in GF(2^32), mapping to a composite field to calculate the inverse and mapping back may be faster than using exponentiation via repeated squaring (30 loops) to calculate x^(2^32-2) in GF(2^32), which equals 1/x.
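A rough pure-Python model of that approach, with clmul standing in for pclmulqdq and a simple shift-and-xor reduction in place of the Barrett-style reduction the real instruction sequence would use. The polynomial x^32 + x^22 + x^2 + x + 1 is an illustrative primitive polynomial for GF(2^32), not necessarily the one any particular library uses:

```python
def clmul(a, b):
    """Carry-less (polynomial) multiply, modeling what pclmulqdq computes."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# x^32 + x^22 + x^2 + x + 1: an illustrative primitive polynomial choice
POLY32 = 0x100400007

def gf32_mul(a, b, poly=POLY32):
    """GF(2^32) multiply: carry-less multiply, then reduce modulo poly.

    Hardware would fold the upper half down with additional carry-less
    multiplies; a shift-and-xor loop keeps the sketch readable.
    """
    p = clmul(a, b)
    for bit in range(62, 31, -1):      # product degree is at most 62
        if (p >> bit) & 1:
            p ^= poly << (bit - 32)    # cancel the high term
    return p
```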
The multiplication process should have been stated as
$c[0] \ x^2 + c[1] \ x + c[2]$ modulo $x^2 + x + 8192$
where the coefficients of the polynomials are elements of GF(2^16), and where
a = a[0] x + a[1]
b = b[0] x + b[1]
The 3 sub-products are:
c[0] = a[0] * b[0]
c[1] = (a[0] * b[1]) ^ (a[1] * b[0])
c[2] = a[1] * b[1]
Then for c[0] x^2 + c[1] x + c[2] modulo x^2 + x + 0x2000, since x^2 = x + 0x2000 modulo the reduction polynomial, c[0] folds into the lower coefficients:
d[0] = c[1] ^ c[0]
d[1] = c[2] ^ (0x2000 * c[0])
result = (d[0] << 16) | d[1]
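Putting the sub-products and the fold together, a minimal self-contained sketch (using a shift-and-xor GF(2^16) multiply with the question's polynomial 0x1100B in place of log/anti-log tables):

```python
GF16_POLY = 0x1100B  # x^16 + x^12 + x^3 + x + 1, as in the question

def gf16_mul(a, b, poly=GF16_POLY):
    """Multiply in GF(2^16): carry-less multiply, then reduce modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for bit in range(30, 15, -1):      # reduce degrees 30..16
        if (r >> bit) & 1:
            r ^= poly << (bit - 16)
    return r

def gf32_composite_mul(a32, b32):
    """Multiply in GF((2^16)^2) with reduction polynomial x^2 + x + 0x2000."""
    a = [a32 >> 16, a32 & 0xFFFF]
    b = [b32 >> 16, b32 & 0xFFFF]
    c0 = gf16_mul(a[0], b[0])
    c1 = gf16_mul(a[0], b[1]) ^ gf16_mul(a[1], b[0])
    c2 = gf16_mul(a[1], b[1])
    # x^2 = x + 0x2000 (mod x^2 + x + 0x2000), so fold c0 into c1 and c2
    d0 = c1 ^ c0
    d1 = c2 ^ gf16_mul(0x2000, c0)
    return (d0 << 16) | d1
```

Multiplying by 1 returns the input unchanged, which is a quick sanity check that the fold is correct.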
The article also mentions an alternative mapping for GF(2^32).
On x86, PSHUFB (xmm or zmm registers) can be used to multiply 16 (SSSE3) or 64 (AVX512) bytes by a constant in parallel in GF(2^8). This can greatly speed up Reed-Solomon code, such as multiplying a matrix by large rows of data in an erasure code used for cloud storage. It would be used when encoding data or correcting erasures. For GF(2^8), two 16 or 64 byte tables per constant are needed; for GF(2^16), eight 16 or 64 byte tables per constant are needed.
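The nibble-table trick can be modeled in Python: multiplication by a constant c is GF(2)-linear, so each byte splits into a low-nibble lookup and a high-nibble lookup, which is exactly the 16-entry lookup PSHUFB performs on 16 (or 64) bytes at once. The polynomial 0x11D below is a common Reed-Solomon choice, used here only for illustration:

```python
GF8_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, common for Reed-Solomon

def gf8_mul(a, b, poly=GF8_POLY):
    """Reference GF(2^8) multiply (shift-and-xor with on-the-fly reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def make_tables(c):
    """Build the two 16-byte tables PSHUFB would index with each nibble."""
    lo = bytes(gf8_mul(c, v) for v in range(16))       # c * low nibble
    hi = bytes(gf8_mul(c, v << 4) for v in range(16))  # c * high nibble
    return lo, hi

def mul_const(tables, data):
    """Multiply every byte of data by the constant baked into tables."""
    lo, hi = tables
    return bytes(lo[b & 0x0F] ^ hi[b >> 4] for b in data)
```

Because c * b = c * (b_low) ^ c * (b_high << 4), the two lookups plus one xor reproduce the full byte multiply.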
Mapping tables (in hex):
Mapping functions: