I have recently completed a fairly extensive numerical study of the effect of sparse matrix storage formats on the runtime of matrix-vector products. In particular, I considered the following formats: CSR, CSC, COO, BCSR, and MBR (see Kannan, 2013 for definitions of the last two). Matrix-vector product implementations for CSR, CSC, COO, and BCSR are rather straightforward, and readily available ones can be found in MKL and SciPy, among other libraries; both of these libraries refer to BCSR as BSR.

An optimized matrix-vector product for MBR uses de Bruijn sequences to locate the set bits in the binary encoding of each non-zero block's sparsity pattern without iterating over every bit, and I made sure to apply this optimization in my implementation. In Kannan (2013), the author reports speedups of the matrix-vector product when using MBR over CSR.

I tested my implementations on 17 large sparse matrices: the 10 matrices used in Kannan (2013), plus 7 more from circuit-simulation applications, all available in SuiteSparse. To my surprise, and contrary to the results presented in Kannan (2013), there is not a single matrix among the 17 for which the matrix-vector product is accelerated by using MBR over CSR or CSC. Using MBR over BCSR does yield a speedup, but both are slower than CSR, CSC, or even COO.

In another experiment, I computed the matrix-vector product using BCSR for sparse matrices with randomly located dense blocks. Here too, I was unable to show any speedup of BCSR over CSR or CSC. I repeated this experiment with the BSR format from MKL (in C) and SciPy (in Python) and obtained very similar results. Note that I ran my experiments with different matrices and block sizes, and multiplied either a single vector or multiple vectors at a time.
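For concreteness, the de Bruijn bit-scan trick I used is the classic one from Bit Twiddling Hacks: isolate the lowest set bit, multiply by a de Bruijn constant, and use the top bits as an index into a small lookup table. The names below are illustrative, not my actual code; this is just a minimal sketch of how set bits in a 32-bit block bitmap can be enumerated without scanning all 32 positions:

```python
# De Bruijn constant and index table for 32-bit words
# (from the well-known Bit Twiddling Hacks collection).
DEBRUIJN32 = 0x077CB531
INDEX32 = [0, 1, 28, 2, 29, 14, 24, 3,
           30, 22, 20, 15, 25, 17, 4, 8,
           31, 27, 13, 23, 21, 19, 16, 7,
           26, 12, 18, 6, 11, 5, 10, 9]

def iter_set_bits(bitmap):
    """Yield positions of set bits in a 32-bit bitmap, lowest first."""
    while bitmap:
        lsb = bitmap & -bitmap                              # isolate lowest set bit
        yield INDEX32[((lsb * DEBRUIJN32) & 0xFFFFFFFF) >> 27]
        bitmap &= bitmap - 1                                # clear that bit
```

In the MBR kernel, each iteration of this loop would drive the accumulation of one non-zero entry of a block, so the cost scales with the number of set bits rather than the bitmap width.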
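For reference, the SciPy side of my CSR-vs-BSR comparison can be reproduced with a few lines. The matrix below is a random stand-in, not one of the SuiteSparse matrices, and the sizes and block shape are arbitrary:

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n = 20_000

# Random sparse matrix in CSR, then converted to BSR with 4x4 blocks.
A_csr = sp.random(n, n, density=1e-4, format="csr", random_state=0)
A_bsr = A_csr.tobsr(blocksize=(4, 4))
x = rng.standard_normal(n)

def bench(A, x, reps=20):
    """Average wall-clock time of one sparse matrix-vector product."""
    t0 = time.perf_counter()
    for _ in range(reps):
        A @ x
    return (time.perf_counter() - t0) / reps

print(f"CSR: {bench(A_csr, x):.2e} s, BSR: {bench(A_bsr, x):.2e} s")
```

On matrices without exploitable block structure, the BSR conversion pads blocks with explicit zeros, which is one plausible reason BSR loses to CSR in cases like this; the SuiteSparse matrices are of course the more meaningful test.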
Hence, my question: has anyone ever observed speedups of matrix-vector products when using MBR or BCSR (i.e., BSR) over CSR or CSC? If so, would you mind sharing the sparse matrices, block sizes, and numbers of vectors multiplied for which you made these observations?
The names of the sparse matrices used from SuiteSparse are: circuit5M_dc, Freescale1, Freescale2, FullChip, memchip, nxp1, transient, ASIC_680k, atmosmodm, circuit5M, dielFilterV2real, dielFilterV3real, ecology1, G3_circuit, Hamrle3, nlpkkt80, rajat29.
The cache hierarchy of my CPU is as follows: L1d cache (32 KB), L1i cache (32 KB), L2 cache (1024 KB), and L3 cache (25344 KB).
Kannan, Ramaseshan. "Efficient sparse matrix multiple-vector multiplication using a bitmapped format." 20th Annual International Conference on High Performance Computing. IEEE, 2013.