Unobserved values in SVD and matrix factorization problems


In class, in the context of matrix completion (recommender systems, collaborative filtering), we saw that the Eckart-Young theorem says the best rank-$k$ approximation to a matrix $A$ is obtained by keeping only the $k < \mbox{rank}(A)$ largest singular values (and the corresponding singular vectors).
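For reference, the theorem is easy to check numerically; a minimal NumPy sketch (the matrix and $k$ are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

# Full SVD, then keep only the k largest singular values / vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]   # best rank-k approximation (Eckart-Young)

# The Frobenius error equals the root sum of the discarded squared
# singular values, as the theorem predicts:
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```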

Now consider a rating matrix, as in the Netflix data: most of its entries are unobserved, and the task is to infer them. One solution approach we have seen is matrix factorization via Alternating Least Squares (ALS).
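As a concrete illustration of that approach, here is a minimal ALS sketch that fits only the observed entries (the function name, regularization weight, and iteration count are my own illustrative choices, not necessarily what was shown in class):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=50, seed=0):
    """Alternating least squares on the observed entries only.

    R    : (m, n) rating matrix (unobserved entries are ignored)
    mask : (m, n) boolean array, True where a rating is observed
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix V: each row of U solves a small ridge regression over
        # the items this user actually rated.
        for i in range(m):
            Vi = V[mask[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ R[i, mask[i]])
        # Fix U: symmetric update for each item.
        for j in range(n):
            Uj = U[mask[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ R[mask[:, j], j])
    return U, V
```

Because each subproblem is a small ridge regression, every row of $U$ and $V$ has a closed-form update given the other factor, which is what makes the alternation cheap.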

Why can't we use SVD to solve this problem?

Since the Eckart-Young theorem guarantees that we get the best rank-$k$ approximation of the rating matrix, why do unobserved values pose a problem? Why can we not simply set them to $0$ or to the mean of the data?

Does it have to do with the rank-$k$ constraint in the minimization problem

$$ A_k = \arg \min_{\mbox{rank}(B)=k} \left\Vert A - B \right\Vert_F^2 $$

because with the rank constraint the problem is non-convex, so the Eckart-Young theorem no longer guarantees the best rank-$k$ approximation when $A$ has unobserved values?
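A small numerical illustration of exactly this point (a toy example, not a proof): for a matrix that *is* rank-1, a perfect rank-1 completion exists, yet the truncated SVD of either the zero-imputed or the mean-imputed matrix fails to find it.

```python
import numpy as np

rng = np.random.default_rng(1)
R = np.outer(rng.uniform(1, 2, 6), rng.uniform(1, 2, 5))  # true rank-1 ratings
mask = rng.random(R.shape) < 0.5                          # observe ~half the entries

def rank1_of(M):
    """Best rank-1 approximation of M (Eckart-Young via SVD)."""
    U, s, Vt = np.linalg.svd(M)
    return s[0] * np.outer(U[:, 0], Vt[0])

obs_err = lambda B: np.linalg.norm((B - R)[mask])  # error on observed entries

B_zero = rank1_of(np.where(mask, R, 0.0))             # zero imputation
B_mean = rank1_of(np.where(mask, R, R[mask].mean()))  # mean imputation

# A perfect rank-1 completion exists (R itself, observed error 0),
# but neither imputed SVD recovers it:
print(obs_err(B_zero), obs_err(B_mean))   # both clearly > 0
```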

There are 2 answers below.


If you have started getting good at vectorization and Kronecker products, one approach is to solve

$$\min_{U,\Sigma,V} \left\{ \left\Vert C_1 \odot (A-U\Sigma V^T) \right\Vert_F + \left\Vert U^TU-I \right\Vert_F + \left\Vert V^TV-I \right\Vert_F + \left\Vert C_2 \odot \Sigma \right\Vert_F \right\}$$

where $\odot$ denotes elementwise (Hadamard) multiplication.

  1. The weight matrix $C_1$ should reflect which values of $A$ are unknown (e.g. zero weight on unobserved entries).

  2. The weight matrix $C_2$ should reflect the rank you seek: $k$ small values $\epsilon$ on the diagonal and large values everywhere else, so that all but $k$ singular values are pushed towards zero. Should work neatly.

As it probably requires some maturity and skill, it might be suitable as the main project for a PhD student.
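For concreteness, the objective above can be written down directly; this sketch assumes $C_1$ and $C_2$ act elementwise (a 0/1 observation mask and a rank-penalty mask), which is one way to read the answer:

```python
import numpy as np

def penalized_objective(U, S, V, A, C1, C2):
    """Weighted fit + soft orthogonality penalties + soft rank penalty."""
    fit    = np.linalg.norm(C1 * (A - U @ S @ V.T), "fro")
    orth_u = np.linalg.norm(U.T @ U - np.eye(U.shape[1]), "fro")
    orth_v = np.linalg.norm(V.T @ V - np.eye(V.shape[1]), "fro")
    rank   = np.linalg.norm(C2 * S, "fro")
    return fit + orth_u + orth_v + rank
```

At an exact SVD of a fully observed matrix (with $C_2 = 0$) the objective is zero; with a masked $C_1$ it becomes an unconstrained but non-convex problem that one could hand to a generic optimizer.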


As @mathreadler already commented, the problem is the missing values. If we had the full rating matrix, then by Eckart-Young the SVD would give us the optimal low-rank approximation.

However, because most ratings are missing (if they weren't, we wouldn't need a recommender), the problem becomes non-convex. Here is an explanation from [1, page 3], which provides a very readable overview of matrix factorization for recommendation:

> Such a model is closely related to singular value decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix. This often raises difficulties due to the high portion of missing values caused by sparseness in the user-item ratings matrix. Conventional SVD is undefined when knowledge about the matrix is incomplete. Moreover, carelessly addressing only the relatively few known entries is highly prone to overfitting. Earlier systems relied on imputation to fill in missing ratings and make the rating matrix dense. However, imputation can be very expensive as it significantly increases the amount of data.

"Imputation" corresponds to your suggestion of setting the missing values to zero. This will make it work in principle, but most of your optimization will now focus on predicting values you know to be wrong. It's much better to optimize only for the known values. But then, the problem is no longer convex and Eckhart-Young no longer applies.

So, in short: yes, the SVD gives you a good matrix factorization, but it suffers in the presence of missing values, and in recommendation the vast majority of the rating matrix is missing.

[1] Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", IEEE Computer, 2009.