How can we show a data set satisfies the manifold assumption?


In machine learning, we often assume that a data set lies on a low-dimensional manifold (the manifold assumption). Is there a formal result showing that, if the data set satisfies certain conditions, it does indeed form (approximately) a low-dimensional manifold?

For example, consider a data sequence $\{\mathbf{X}_1, \ldots, \mathbf{X}_n\}$ where $\mathbf{X}_i \in \mathbb{R}^d$ (say, a sequence of face images taken at different angles) and a corresponding label sequence $\{y_1, \ldots, y_n\}$ where $y_1 \preceq y_2 \preceq \ldots \preceq y_n$ (say, the angles of the faces in the sequence). Suppose that whenever $\mathbf{X}_i$ and $\mathbf{X}_{i+1}$ are very close, their labels $y_i$ and $y_{i+1}$ are also very close. Then it seems likely that $\{\mathbf{X}_1, \ldots, \mathbf{X}_n\}$ lies on a low-dimensional manifold. Is this true? If so, how can we prove it? Otherwise, what conditions does the sequence need to satisfy so that the manifold assumption can be proven to hold?

Thanks in advance!

On BEST ANSWER

I am not aware of any necessary and/or sufficient condition for proving that a finite data set in $\mathbb{R}^d$ is actually contained in a smooth submanifold $\mathcal{M}$ of low dimension.

Instead, recent work tries to identify the topological structure underlying a given data set (yes, we have to move down to the topological level) using algebraic topology and, in particular, information coming from persistent homology. This is a kind of backward reconstruction: the topological information gathered by the method is used for inference or to further characterize clusters in the data. A survey is contained in this nice paper. This machinery is quite powerful: it is more flexible than MDS and PCA, and it lets the user introduce functions that control the construction of the simplicial complex at the core of the method.
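To make the idea of persistence a bit more concrete, here is a minimal sketch of its simplest case: 0-dimensional persistent homology (connected components) of a Vietoris-Rips filtration, computed with a union-find structure. The function name `h0_deaths` and the toy point sets are illustrative assumptions, not taken from the survey, and the scale convention used here is the full edge length (some libraries use half of it).

```python
import itertools
import math


def h0_deaths(points):
    """Scales at which connected components merge as the distance threshold grows.

    Every point is a component born at scale 0. A component "dies" at the length
    of the edge that first connects it to another component.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:      # this edge merges two components
            parent[root_i] = root_j
            deaths.append(length)
    return deaths                  # ascending; one component never dies


# Two well-separated toy clusters (hypothetical data).
cluster_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
cluster_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
deaths = h0_deaths(cluster_a + cluster_b)
print(deaths)
```

In this toy example the merge scales within each cluster are small, and a single much larger value marks the scale at which the two clusters join. That gap is the kind of persistent feature used to characterize clusters.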

In some applications the authors showed that the given data lie on a smooth manifold; the machinery works at the algebraic topology level, though.

If you are interested in this backward reconstruction, I would start with the nice case of the noisy circle introduced here.
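As a rough illustration of what such an experiment can look like, the sketch below samples a noisy circle and computes its 1-dimensional persistence bars with a small, unoptimized boundary-matrix reduction over $\mathbb{Z}/2$. Everything here is an assumption made for illustration (20 points, radial noise of at most 0.05, and the hypothetical helper names `noisy_circle` and `rips_h1_bars`); it is not the construction used in the linked reference, and in practice one would use a dedicated library.

```python
import itertools
import math
import random


def noisy_circle(n_points=20, noise=0.05, seed=0):
    """Evenly spaced angles on the unit circle with small radial noise."""
    rng = random.Random(seed)
    pts = []
    for k in range(n_points):
        theta = 2 * math.pi * k / n_points
        r = 1.0 + rng.uniform(-noise, noise)
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts


def rips_h1_bars(points):
    """(birth, death) bars of H1 for the Vietoris-Rips filtration, longest first."""
    n = len(points)

    def diameter(verts):
        return max(math.dist(points[a], points[b])
                   for a, b in itertools.combinations(verts, 2))

    # Simplices up to dimension 2, each entering at its longest edge length.
    simplices = [(0.0, 0, (i,)) for i in range(n)]
    for dim in (1, 2):
        for verts in itertools.combinations(range(n), dim + 1):
            simplices.append((diameter(verts), dim, verts))
    simplices.sort(key=lambda s: (s[0], s[1]))   # faces come before cofaces
    position = {verts: k for k, (_, _, verts) in enumerate(simplices)}

    reduced = {}   # pivot (lowest row index) -> reduced boundary column
    bars = []
    for value, dim, verts in simplices:
        if dim == 0:
            continue
        # Boundary of the simplex as a set of face indices (coefficients mod 2).
        column = {position[face] for face in itertools.combinations(verts, dim)}
        while column and max(column) in reduced:
            column ^= reduced[max(column)]
        if column:
            pivot = max(column)
            reduced[pivot] = column
            birth, pivot_dim, _ = simplices[pivot]
            # A triangle paired with an edge closes a 1-dimensional hole.
            if pivot_dim == 1 and value > birth:
                bars.append((birth, value))
    return sorted(bars, key=lambda b: b[1] - b[0], reverse=True)


bars = rips_h1_bars(noisy_circle())
for birth, death in bars[:3]:
    print(f"birth={birth:.3f}  death={death:.3f}  persistence={death - birth:.3f}")
```

For a sample like this one, a single bar should stand out as much longer than any others, reflecting the one loop of the circle. Note that this recovers a topological signature of the data, not a proof that the points lie on a smooth manifold.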