I have already confirmed that my Cholesky decomposition routine is at least 50% faster than Crout's LU decomposition (for a square, symmetric, positive definite matrix). However, I lose those gains afterward when I compute the actual inverse, so I must be doing something simple wrong.
The LU routine creates a combined lower and upper factor matrix, and then passes it to a solving routine which appears to do both forward and backward substitution.
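For concreteness, here is roughly what I understand the LU path to be doing. This is only an illustrative Python sketch, not the actual code: the names `crout_lu`, `lu_solve`, and `inverse_from_lu` are made up, there is no pivoting, and I am assuming the inverse is built by solving against each column of the identity.

```python
def crout_lu(a):
    """Crout factorisation without pivoting, packed into one matrix.

    L (including its diagonal) sits on and below the diagonal.
    U has an implied unit diagonal; its strict upper part sits above.
    """
    n = len(a)
    lu = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Column j of L.
        for i in range(j, n):
            lu[i][j] = a[i][j] - sum(lu[i][k] * lu[k][j] for k in range(j))
        # Row j of U (strictly right of the diagonal).
        for i in range(j + 1, n):
            s = a[j][i] - sum(lu[j][k] * lu[k][i] for k in range(j))
            lu[j][i] = s / lu[j][j]
    return lu


def lu_solve(lu, b):
    """Solve (L U) x = b using the packed factors."""
    n = len(lu)
    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(lu[i][k] * y[k] for k in range(i))) / lu[i][i]
    # Backward substitution: U x = y (unit diagonal, so no division).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(lu[i][k] * x[k] for k in range(i + 1, n))
    return x


def inverse_from_lu(a):
    """Build the inverse one column at a time by solving A x = e_j."""
    n = len(a)
    lu = crout_lu(a)
    cols = [lu_solve(lu, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    # cols[j] is column j of the inverse; convert to row-major.
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

Under these assumptions, the inverse costs one factorisation plus one forward and one backward substitution per column.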
My Cholesky routine produces the single lower triangular factor $L$. I then compute the inverse $L^{-1}$ using forward substitution only, form the transpose $(L^{-1})^T$, and finally compute the inverse $A^{-1} = (L^{-1})^T L^{-1}$. The problem is that this routine ends up being slower overall than the LU version. What am I doing wrong?
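The Cholesky path described above can be sketched along these lines. Again, this is a minimal Python illustration rather than the actual routine: the function names are hypothetical, and the final step is assumed to be a plain dense matrix product on an explicitly formed transpose.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def invert_lower(L):
    """Invert lower-triangular L by forward substitution, column by column."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            # Rows above j in column j of X are zero, so start the sum at j.
            s = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -s / L[i][i]
    return X


def inverse_from_cholesky(a):
    """A^{-1} = (L^{-1})^T L^{-1}, with an explicit transpose and dense product."""
    n = len(a)
    X = invert_lower(cholesky(a))
    # Explicit transpose of L^{-1} (an extra O(n^2) pass).
    Xt = [[X[j][i] for j in range(n)] for i in range(n)]
    # Dense product; the zeros in both triangular factors are multiplied anyway.
    return [[sum(Xt[i][k] * X[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Written this way, the last step is a full $n^3$ product that ignores both the triangular structure of $L^{-1}$ and the symmetry of $A^{-1}$.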