I have a matrix of 1000 rows, each row being an observation of some kind, with values between 0 and 1. Every row has 10 positions as columns, so 1000 rows × 10 columns. The data is outlier-free: every row is valid. There are some patterns, e.g. no more than 3 positions in any one row will have values less than 0.25, and a few other similar patterns. My end goal is to derive some rules from this dataset in order to filter incoming observations, so that I only accept plausible instances. After factorization I ended up with ~55 for the first singular value and between 6 and 8 for the remaining 9. I'm a little confused about how to proceed. I studied SVD for its dimensionality-reduction aspect, so I know I can reconstruct my matrix using only the first few columns and rows corresponding to the largest singular values. Here, though, I don't want to discard anything. I guess my questions are: is it possible to use this information as a filter, and how? Should I include outliers in my matrix so that I end up with two clusters?
Thanks in advance
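For reference, the factorization step can be sketched in NumPy (the matrix below is synthetic, standing in for the real $1000\times 10$ data):

```python
import numpy as np

# Synthetic stand-in for the real 1000 x 10 matrix (entries in [0, 1]).
rng = np.random.default_rng(0)
M = rng.uniform(0.0, 1.0, size=(1000, 10))

# Thin SVD: M = L @ np.diag(s) @ R, with s sorted in decreasing order.
L, s, R = np.linalg.svd(M, full_matrices=False)
print(s)  # typically one dominant singular value, then a cluster of smaller ones
```

On nonnegative data like this, the first singular value dominates the rest, which matches the ~55 vs. 6–8 pattern described above.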
Suppose your matrix is $M$ and its SVD is $M=V^T\Sigma U$, where $V$ is $1000 \times 1000$, $\Sigma$ is $1000\times 10$, and $U$ is $10\times 10$. If you call $\sigma_1,\sigma_2,\dots,\sigma_{10}$ the singular values, then you can rewrite the SVD as $$ M = \sum_i \sigma_i v_i^Tu_i $$ where the $v_i^T$ are the first $10$ columns of $V^T$ and the $u_i$ are the $10$ rows of $U$. Since $M$ has nonnegative entries, the Perron–Frobenius theorem (applied to the nonnegative matrices $MM^T$ and $M^TM$) guarantees that both $u_1$ and $v_1$ can be chosen with nonnegative entries.
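A quick NumPy check of this rank-one expansion and of the sign of the leading pair (synthetic nonnegative data; NumPy returns the factorization as $M = L\,\mathrm{diag}(s)\,R$, so the columns of $L$ play the role of the $v_i$ and the rows of $R$ the role of the $u_i$):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(0.0, 1.0, size=(1000, 10))  # nonnegative stand-in data

L, s, R = np.linalg.svd(M, full_matrices=False)

# Rebuild M as the sum of rank-one terms sigma_i * v_i u_i.
M_sum = sum(s[i] * np.outer(L[:, i], R[i]) for i in range(10))

# The SVD only determines (v_1, u_1) up to a simultaneous sign flip;
# fix the sign so we get the nonnegative Perron-Frobenius pair.
v1, u1 = (L[:, 0], R[0]) if R[0].sum() >= 0 else (-L[:, 0], -R[0])
```

After the sign fix, both `v1` and `u1` come out entrywise nonnegative, as the theorem predicts.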
In your case, $\sigma_1$ is much larger than the others, so you can approximate $$ M \approx \sigma_1 v_1^T u_1, $$ meaning that all the rows are approximately positive multiples of $u_1$. In this sense, every new observation must be close to a multiple of $u_1$.
This can be tested, for example, in the following way: if $x$ is a new data vector, then (since $u_1$ has unit norm) it must hold that $$ x^Tu_1 \approx \|x\|, $$ i.e. $x$ must be nearly parallel to $u_1$; otherwise, it is not valid. In practice you would fix a tolerance $\varepsilon$ and accept $x$ only when $x^Tu_1 \ge (1-\varepsilon)\|x\|$.
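As a sketch, the acceptance test could look like this in NumPy (the tolerance `tol` and the synthetic training matrix are assumptions for illustration, not part of the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(0.25, 1.0, size=(1000, 10))  # stand-in for the training data

# Leading right singular vector, sign-fixed so its entries are nonnegative.
_, _, R = np.linalg.svd(M, full_matrices=False)
u1 = R[0] if R[0].sum() >= 0 else -R[0]

def is_plausible(x, tol=0.05):
    """Accept x when x . u1 is close to ||x||, i.e. when x is nearly a
    positive multiple of the unit vector u1."""
    return float(x @ u1) >= (1.0 - tol) * np.linalg.norm(x)

flat = 0.6 * np.ones(10)             # resembles the bulk of the data
spiky = np.array([1.0] + [0.0] * 9)  # mass concentrated in one position
```

`flat` passes while `spiky` is rejected; the tolerance should be calibrated against held-out rows that are known to be valid.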