For the Netflix Prize competition on recommendations, one method used stochastic gradient descent, popularized by Simon Funk, who used it to compute an approximate SVD. The math is explained better here, on pg. 152. A rating is predicted by
$$ \hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u $$
The regularized squared error is
$$ \min_{b*,q*,p*} \sum_{u,i} (r_{ui} - \mu - b_i - b_u - q_i^Tp_u)^2 + \lambda_4 (b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2) $$
Taking partial derivatives with respect to each variable, and scaling by a learning-rate constant $\gamma$ for the update rule, we get equations such as (with $e_{ui} = r_{ui} - \hat{r}_{ui}$ denoting the prediction error)
$$ b_u \leftarrow b_u + \gamma (e_{ui} - \lambda_4 \cdot b_u) $$
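As a sketch of the full set of update rules (the toy ratings, dimensions, and hyperparameter values below are my own illustrative choices, not from the papers), the SGD loop can look like this:

```python
import numpy as np

# Hypothetical toy data: (user, item, rating) triples; values are illustrative.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k = 3, 3, 2
gamma, lam = 0.02, 0.02            # learning rate gamma and lambda_4
rng = np.random.default_rng(0)

mu = np.mean([r for _, _, r in ratings])       # global mean
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors p_u
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors q_i

for epoch in range(500):
    for u, i, r in ratings:
        pred = mu + b_u[u] + b_i[i] + Q[i] @ P[u]
        e = r - pred                           # e_ui
        b_u[u] += gamma * (e - lam * b_u[u])
        b_i[i] += gamma * (e - lam * b_i[i])
        # Tuple assignment so both updates use the *old* P[u] and Q[i].
        P[u], Q[i] = (P[u] + gamma * (e * Q[i] - lam * P[u]),
                      Q[i] + gamma * (e * P[u] - lam * Q[i]))
```

Each observed rating nudges the two biases and the two factor vectors along the negative gradient of its own regularized squared-error term, which is exactly the per-sample update scheme described above.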
[The rest is skipped]. My question is this: the paper here (Wayback Machine) states the squared-error function above is non-convex because of the $q_i^Tp_u$ term, where both variables are unknown, which, of course, is correct. Then how do we guarantee the stochastic gradient descent scheme described in all the papers will find a global minimum? I read somewhere that SGD can be used to solve non-convex functions; I'd like to hear some details on how. The SGD SVD method described in the links works well in practice.
Also, alternating least squares can be applied to solve the factorization problem described above, according to the IEEE Koren paper. This method alternates between holding one or the other variable constant, thereby creating convex subproblems in the process. I wonder if SGD, as it goes through the dimensions and data points one by one, also, in a way, creates convex subproblems as it proceeds.
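To make the "convex subproblems" point concrete, here is a hedged sketch of ALS (biases omitted for brevity; the toy matrix, mask, and parameter values are my own, not from the paper). With $Q$ fixed, each user's $p_u$ is the solution of an ordinary ridge regression, and symmetrically for each item:

```python
import numpy as np

# Hypothetical toy rating matrix R with zeros marking missing entries.
rng = np.random.default_rng(1)
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [0., 2., 4.]])
M = R > 0                          # observation mask
n_users, n_items, k, lam = 3, 3, 2, 0.1

P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

for sweep in range(20):
    # Fix Q; each user's p_u solves a ridge regression (a convex problem).
    for u in range(n_users):
        idx = M[u]
        A = Q[idx].T @ Q[idx] + lam * np.eye(k)
        P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
    # Fix P; each item's q_i solves a ridge regression (also convex).
    for i in range(n_items):
        idx = M[:, i]
        A = P[idx].T @ P[idx] + lam * np.eye(k)
        Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
```

Each half-sweep has a closed-form global optimum, which is exactly what makes the alternating scheme attractive even though the joint problem is non-convex.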
Maybe the answer is simply this presentation by LeCun: stochastic gradient on non-convex functions has no theoretical guarantees, but this doesn't mean we should not use it. "When Empirical Evidence suggests a fact for which we don't have theoretical guarantees, it just means the theory is inadequate".
I was able to reach Brandyn Webb (aka S. Funk) and he clarified some of my questions. The ideas that led to SGD SVD can be found in this paper
http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf
where Sanger, building on the work of Oja
http://users.ics.aalto.fi/oja/Oja1982.pdf
talks about finding the top N eigenvectors, with probability 1, through a method called the Generalized Hebbian Algorithm (GHA) using a neural network. That is, training this particular network ends up doing PCA (very cool).
Webb was researching NNs at the time (way before Netflix); later, Gorrell and Webb realized the formulation looked like it was solving an SVD problem (again, before the Netflix Prize). Webb first used the SGD derivation for Netflix; he mentioned that he "went for the direct gradient method on the spur of the moment because it was faster to derive that on a napkin than to find my old paper".
The question can be asked: "how much SVD is Funk SVD?" Some on the Net call it "Funk's heuristic approach to SVD" and say "in missing data form it's a factor model not SVD". Another says "in the case of a partially observed matrix [..] the factorized solution UV' is not truly an SVD – there is no constraint that the matrices U and V be orthogonal".
Webb does not agree with any of these characterizations.
Overall, we can say there are connections between GHA, the so-called "iterative power method", the gradient solution, and SVD; it is because of these connections, perhaps, that Webb could delve into a solution without being scared off by the non-convexity of the problem.
No matter what route was taken during the invention of this method, however, we can look at the error function for recommendations and, through its gradients, see that SGD can minimize it, as empirical tests show. The error function is non-convex, but as Bottou and LeCun suggest, the absence of theory should not stop us from using a method. Plus, people have started to look at stochastic approaches and their theoretical guarantees much more closely now, as seen here
http://arxiv.org/pdf/1308.3509
The paper above also talks about the power method and SGD connection, BTW.
Note: the power method finds a single eigenvalue/eigenvector pair, namely the eigenvector for the largest eigenvalue. However, it can be used to find all the other eigenvectors by "removing" the newly found eigenvector from the matrix A (through a process called deflation) and re-running the power method on A, which finds the second eigenvector, and so on.
http://www.maths.qmul.ac.uk/~wj/MTH5110/notes/MAS235_lecturenotes1.pdf
http://heim.ifi.uio.no/~tom/powerandqrslides.pdf
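The power-method-plus-deflation loop described in the note can be sketched as follows (the symmetric test matrix and iteration count are my own illustrative choices):

```python
import numpy as np

def power_method(A, iters=500):
    """Return the dominant eigenvalue/eigenvector pair of a symmetric A."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)      # keep the iterate normalized
    return v @ A @ v, v             # Rayleigh quotient gives the eigenvalue

# Illustrative symmetric matrix with distinct eigenvalues.
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

eigs = []
B = A.copy()
for _ in range(3):
    lam, v = power_method(B)
    eigs.append((lam, v))
    # Deflation: subtract the found component so the next-largest
    # eigenvalue becomes dominant in B.
    B = B - lam * np.outer(v, v)
```

Each pass recovers the next eigenpair in decreasing order of eigenvalue, which is the deflation scheme the note describes.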