Assume we have a linear transformation from vectors $\mathbf{x} \in \mathbb{R}^n$ to vectors $\mathbf{y} \in \mathbb{R}^m$, denoted $\mathbf{y} = \mathbf{W} \mathbf{x}$, and that the goal is to learn the values of $\mathbf{W}$ (or at least the values that produce minimum loss) from data.
My question is as follows:
Is there any benefit to splitting the linear transformation into two matrices and learning the factors instead of the original matrix, i.e., if $\mathbf{W} = \mathbf{AB}$, then learning $\mathbf{A}$ and $\mathbf{B}$ instead of learning $\mathbf{W}$? Are there any mathematical properties of such a factorization that make it useful?
Unfortunately, I am unable to find a mathematical term for this matrix factorization (if that is the proper term to use), so it would be great if someone could tell me what I should be reading about.
Also, the only benefit I could find is that if $\mathbf{A} \in \mathbb{R}^{m \times q}, \mathbf{B} \in \mathbb{R}^{q \times n}$ and $q$ is small enough that $(n + m) \times q < n \times m$, the number of parameters needed would be $(n + m) \times q$ instead of $n \times m$.
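For illustration, here is a minimal NumPy sketch of what I mean (the sizes are arbitrary example values I picked):

```python
import numpy as np

n, m, q = 1024, 512, 32   # arbitrary example sizes; q is the bottleneck width

rng = np.random.default_rng(0)
A = rng.normal(size=(m, q))   # learned factor A
B = rng.normal(size=(q, n))   # learned factor B

x = rng.normal(size=n)
y = A @ (B @ x)               # same map as W @ x, where W = A @ B has rank <= q

print(n * m)         # 524288 parameters for the full W
print((n + m) * q)   # 49152 parameters for the factored form
```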
Yes, there are many benefits to doing that. Finding good factorizations can lead to more efficient algorithms in the natural sciences and engineering. Perhaps the most famous example, the FFT (Fast Fourier Transform), can indeed be interpreted as a matrix factorization.
The original matrix is dense (it has no entries equal to $0$), but in the FFT factorization each row of each factor matrix has only a small number of non-zero entries, usually $2$, and there are only $\log_2(N)$ such matrix factors.
Now the FFT factorization involves more than one split and more than two matrices, but it can be built step by step in a systematic fashion, where each step splits one matrix into two.
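To make this concrete, here is a minimal NumPy sketch (the function names are mine) that builds the radix-2 Cooley-Tukey factorization of the DFT matrix recursively and checks it against the dense matrix. Note that this version keeps each even/odd permutation as its own, extremely sparse, factor, so the list contains a few more than $\log_2(N)$ matrices; the $\log_2(N)$ count refers to the butterfly stages.

```python
import numpy as np

def dft_matrix(n):
    """Dense DFT matrix: every entry is non-zero."""
    w = np.exp(-2j * np.pi / n)
    j, k = np.meshgrid(np.arange(n), np.arange(n))
    return w ** (j * k)

def fft_factors(n):
    """Radix-2 Cooley-Tukey split F_n = B_n (I_2 kron F_{n/2}) P_n,
    applied recursively (n must be a power of 2).  The butterfly B_n
    has 2 non-zeros per row; the permutation P_n has only 1."""
    if n == 1:
        return [np.ones((1, 1), dtype=complex)]
    half = n // 2
    d = np.diag(np.exp(-2j * np.pi * np.arange(half) / n))  # twiddle factors
    butterfly = np.block([[np.eye(half),  d],
                          [np.eye(half), -d]])
    perm = np.zeros((n, n))
    perm[np.arange(half), np.arange(0, n, 2)] = 1      # even-indexed inputs first
    perm[np.arange(half, n), np.arange(1, n, 2)] = 1   # then odd-indexed inputs
    inner = [np.kron(np.eye(2), f) for f in fft_factors(half)]
    return [butterfly] + inner + [perm]

n = 8
assert np.allclose(np.linalg.multi_dot(fft_factors(n)), dft_matrix(n))
```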