R: From Word Embeddings to Doc Embeddings: Making Sense of the LSA Formulas


My question is more theoretical in nature and concerns how document embeddings are obtained from pre-trained word embeddings in the LSA setting. A solution was offered here, and it amounts to multiplying a document-term (co-occurrence) matrix by a word embedding matrix:

document_vecs = dtm %*% vecs

A similar solution is offered by J. Silge and E. Hvitfeldt (using sparse matrices):

doc_matrix <- word_matrix %*% embedding_matrix

This solution works fine for my (practical) tasks; I just do not quite understand why.
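To make the operation concrete, here is a minimal sketch of the same multiplication in NumPy (Python rather than R, but the linear algebra is identical); the counts and embedding values are made up for illustration:

```python
import numpy as np

# Hypothetical tiny example: 3 documents, 4 terms, 2-dimensional embeddings.
dtm = np.array([[1, 0, 2, 0],   # document-term counts (docs x terms)
                [0, 1, 1, 1],
                [2, 0, 0, 1]], dtype=float)
word_vecs = np.random.default_rng(0).normal(size=(4, 2))  # terms x dims

# Each document vector is the count-weighted sum of its words' vectors.
doc_vecs = dtm @ word_vecs
print(doc_vecs.shape)  # (3, 2): one embedding per document
```

So row i of `doc_vecs` is simply the sum of the word vectors occurring in document i, weighted by their counts.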

Suppose I factorize a term-document matrix M as UΣV^t; then my word embeddings are the rows of UΣ, and my document embeddings are the columns of ΣV^t (I found these formulas in an online data analysis course and would be grateful for literature references).
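These two formulas can be sketched with NumPy's SVD; the term-document counts below are arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.poisson(1.0, size=(5, 3)).astype(float)  # terms x documents

# Thin SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

word_emb = U * s               # rows of U @ diag(s): one embedding per term
doc_emb = (np.diag(s) @ Vt).T  # columns of diag(s) @ Vt: one per document

# Sanity check: the factorization reconstructs M.
assert np.allclose(U @ np.diag(s) @ Vt, M)
```

Here `U * s` broadcasts `s` across the columns of `U`, which is the same as `U @ np.diag(s)`.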

However, adopting the solution above, I can also calculate document embeddings by multiplying M^t (transposed in order to get a document-term matrix) by the word embeddings UΣ.

But (UΣV^t)^t * UΣ = V * Σ^t * U^t * U * Σ = V * Σ^t * Σ (since U^t * U gives an identity matrix) = V * Σ^2 (and not VΣ = (ΣV^t)^t, as expected).
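The identity M^t(UΣ) = VΣ² can be checked numerically; this NumPy sketch (with an arbitrary matrix M) shows that the product gives the SVD document embeddings rescaled by one extra factor of Σ per dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 3))          # terms x documents

U, s, Vt = np.linalg.svd(M, full_matrices=False)
word_emb = U * s                     # U @ diag(s), one row per term
doc_emb = Vt.T * s                   # V @ diag(s), one row per document

# Multiplying the document-term matrix by the word embeddings:
via_product = M.T @ word_emb         # = V @ diag(s) ** 2

# Same embeddings, rescaled dimension-wise by one more sigma:
assert np.allclose(via_product, doc_emb * s)
assert np.allclose(via_product, Vt.T * s**2)
```

So the `dtm %*% vecs` trick recovers the document embeddings up to a per-dimension rescaling by the singular values, not the embeddings themselves.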

I am not a professional mathematician, and there is probably a simple explanation. Why does this multiplication work? I would be most grateful for a solution.