I have an matrix $X\in R^{m\times n}$ and the matrix is very sparse. Assume the $nnz(X) = D$, which means the number of non-zero elements is $D$, then what is the time complexity to compute
$$A = XX^T$$
Can it be faster than $O(nm^2)$ ? such as linear complexity
Short answer, the operation can be at least as good as $O(m D)$.
Long answer: This all depends on the sparse matrix format. There are three big ones: Compressed sparse column (CSC) format, compressed sparse row (CSR) format, and triplet format.
In compressed sparse row format, for each row $i$, you store a list of column indices $J_i$ and values $V_i$, such that if $X_{ik}$ is the $d$th nonzero in row $i$, then $k = J_i[d]$ and $X_{ik} = V_i[d]$. Then what you actually store is the concatenated lists $J = [J_1,J_2,...,J_m]$ and $V = [V_1,V_2,...,V_m]$, and $I = [0, |J_1|, |J_2|,...]$ tells you where the offsets of the list headers are. In compressed sparse column format, the same thing happens with row and column switched. In triplet form, you just store $I$, $J$, and $V$ the row index, column index, and value of each nonzero sequentially (e.g. by concatenating the columns.) The complexity mostly comes from looping through these lists, and you pick the format that's fastest for your operation.
I believe in your operation, the CSR format will be the best of the three. (I am not familiar if there are even better schemes out there... probably there are!)
Then the code should look something like this:
If you count each single operation as constant time, I believe the computational complexity should be like $O(m\sum_{i=1}^m |V_i|) = O(mD)$.
A valid question to ask is, is it possible that $|I_j\cap I_i| \ll \max{|I_i|, |I_j|}$ most of the time, and if the analysis could be tightened in that scenario? (e.g. the matrix $X$ is such that most of the row nonzero indices don't overlap, for example in a coding matrix.) Here, maybe, a different matrix format might be even better!