I have two dense (column-major) PSD matrices $\mathbf A$ and $\mathbf B$, $\mathbf A,\mathbf B\in\mathbb R^{n \times n}$ ($n$ is usually $\sim 1000$) and also $\mathbf A_1=\mathbf A+\eta\mathbf I_n$ and $\mathbf B_1=\mathbf B+\eta\mathbf I_n$ (for some $\eta \in \mathbb R^+$) and I want to write a code to compute the entity $\text{Trace}[\mathbf A\mathbf A_1^{-1}\mathbf B\mathbf B_1^{-1}]$. I've read some threads here and since its generally been suggested that computing/storing an inverse directly is better to be avoided as far as we can, I was wondering whether there were some other ways to solve this problem. I currently thought of using Cholesky decomposition $\mathbf A_1=\mathbf{LL}^\top$ and solve systems $\mathbf A_1\mathbf x_i=\mathbf{LL}^\top \mathbf x_i=\mathbf e_i$ ($\mathbf e_i$ being the $i^{\text{th}}$ column of $\mathbf I_n$) one by one. On the fly I am using the solution vector $\mathbf x_i$ to compute the matrix-matrix product $\mathbf C=\mathbf A\mathbf A_1^{-1}$ using $\mathbf C_{(j,i)}=\mathbf A_j\cdot \mathbf x_i$, where $\mathbf A_j$ is the $j^{\text{th}}$ row of $\mathbf A$ (or column, since it is symmetric) and then discarding the vector. Similarly I compute $\mathbf D=\mathbf B\mathbf B_1^{-1}$ and finally $\text{Trace}[\mathbf C\mathbf D]$.
My questions are:
- Is this approach numerically more stable than first computing the inverse and multiplying with another matrix?
- Is there a better approach for doing this? I read this paper which seems to give a $\frac{1}{2}n^3$ operation for computing the inverse using similar way. But I am using Eigen3/C++ which provides a mature implementation for LLT solver so I think native code won't improve much over that.
- Is there any linear algebra trick to simplify the operations that I seem to miss here?
Edit: Although I am computing the dot-products in parallel using OpenMP I saw in some benchmarks using my current code that storing the inverse that we obtain via LLT solve and then multiplying using Eigen3 is faster. Also, I tested with Eigen3's native inverse() method which is faster than this approach. If memory is not an issue, is storing the inverse such a bad idea in this case?