So I understand the proofs behind Singular Value Decomposition but I'm having trouble interpreting it in the context of a real world problem.
Specifically, if I'm given an $m\times n$ data matrix $A$, where we have $m$ training examples and $n$ features collected for each example, I'm having trouble understanding the meaning behind $Av_j = \sigma_j u_j$, where $L_A$ (left multiplication by $A$) is our linear transformation, $\beta = \{v_1, v_2, \ldots, v_n\}$ is an orthonormal basis for $F^n$, and $\gamma = \{u_1, u_2, \ldots, u_m\}$ is an orthonormal basis for $F^m$.
From reading various posts and articles, the idea seems to be that a larger $\sigma_j$ indicates more variation in the data along that vector $u_j$, while a smaller $\sigma_j$ captures smaller variation in a certain direction $u_j$.
However, when we are looking at $Av_j = \sigma_j u_j$, I'm not sure why we care about what $L_A$ is doing. After all, this relationship would be great if I wanted to see what $L_A$ does when it acts on an orthonormal basis $\beta$. But $A$ is just a data matrix, so I'm not sure how to interpret the range of a data matrix, or what kind of transformation $A$ performs when it is given a vector $x$ to left-multiply.
Your data samples are the rows of $A$. Thus you get the $k$th sample by computing $e_k^TA$. Using the SVD, this is also $$ e_k^TA=\sum_{j=1}^n\sigma_j\,(e_k^T{\bf u}_j)\,{\bf v}_j^T. $$ You can reduce this sum to the leading $d$ terms with the $d$ largest singular values and still get a good approximation of the data, which means that your data vectors are all close to the subspace spanned by ${\bf v}_1,\ldots,{\bf v}_d$ for some suitably chosen $d<n$.
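To make this concrete, here is a small NumPy sketch (the matrix sizes and noise level are made up for illustration): it checks the defining relation $Av_j = \sigma_j u_j$, expands one row of $A$ in the ${\bf v}_j$ basis exactly as in the sum above, and then truncates to the leading $d$ terms to see how close the rows are to $\operatorname{span}\{{\bf v}_1,\ldots,{\bf v}_d\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix: m = 200 samples (rows), n = 5 features (columns).
# The data is built to be nearly 2-dimensional plus a little noise,
# so a rank-2 truncation should capture almost everything.
m, n = 200, 5
latent = rng.normal(size=(m, 2))
mixing = rng.normal(size=(2, n))
A = latent @ mixing + 0.01 * rng.normal(size=(m, n))

# Thin SVD: A = U diag(s) Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The defining relation A v_j = sigma_j u_j, for each j.
for j in range(n):
    assert np.allclose(A @ Vt[j], s[j] * U[:, j])

# Row k of A expanded in the v_j basis:
#   e_k^T A = sum_j sigma_j (e_k^T u_j) v_j^T
k = 0
row_k = sum(s[j] * U[k, j] * Vt[j] for j in range(n))
assert np.allclose(row_k, A[k])

# Keep only the leading d = 2 terms: every row of A_d lies in
# span{v_1, v_2}, and A_d is close to A.
d = 2
A_d = (U[:, :d] * s[:d]) @ Vt[:d]
rel_err = np.linalg.norm(A - A_d) / np.linalg.norm(A)
print(f"relative error of rank-{d} approximation: {rel_err:.4f}")
```

The relative error of the truncation is governed by the discarded singular values ($\|A-A_d\|_F^2=\sum_{j>d}\sigma_j^2$), which is why a few large $\sigma_j$ followed by small ones signals that the data effectively lives in a low-dimensional subspace.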