First, I apologize if this question sounds naive or does not make any sense at all since I'm not a mathematician or a math major student.
I'm working on a problem related to approximating the manifold of real-world data. As you may know, the Manifold Hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space. This hypothesis makes sense to me for continuous data like images. However, I'm not really sure about discrete data such as text.
Let $V \subset \mathbb{R}^{|V|}$ denote the English vocabulary. Each word $w_i \in V$ is represented as a $|V|$-dimensional one-hot vector whose $i$-th entry equals 1 and whose other entries equal 0. I define a sentence $s$ as an ordered sequence of words $w \in V$. Let $S$ be the set of all possible sentences.
To be precise, my questions are:
- Is $V$ discrete? Is it a closed set?
- Is $S$ discrete? Is it a closed set?
- If $S$ is discrete, is it possible that a manifold can lie inside it?
I'm pretty sure that my questions sound very dumb, so I'm very grateful for your help and patience.
This is from a mathematician's perspective, so I apologize if there are some technical details that are irrelevant to the question you are asking :).
If I understand correctly: $V$ is just the set $$\{(1,0,0,\dots, 0), (0,1,0,\dots, 0),\dots, (0,0,\dots, 0, 1)\}\subset\mathbb R^n,$$ where $n\in\mathbb N$ is the number of words in your vocabulary. Then $V$ is a basis of $\mathbb R^n$, it is discrete (since it is finite), and it is closed (in the canonical topology). $S$ is now defined as what you may write as $$\bigcup_{m\in\mathbb N_0} V^m,$$ i.e. it is the set of all finite sequences of words. Now it becomes a bit more tricky to say whether $S$ is discrete or closed, since these are topological properties, so they depend on what topological space you embed $S$ into.
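To make the discreteness of $V$ concrete, here is a minimal Python sketch. The vocabulary size `n = 5` and the helper names `one_hot` and `euclidean` are assumptions chosen purely for illustration. It checks that any two distinct one-hot vectors are at Euclidean distance $\sqrt 2$, so the open ball of radius $1$ around a word contains no other word.

```python
import itertools
import math

def one_hot(i, n):
    """Return the i-th standard basis vector of R^n as a tuple."""
    return tuple(1.0 if j == i else 0.0 for j in range(n))

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

n = 5  # hypothetical toy vocabulary size
V = [one_hot(i, n) for i in range(n)]

# Set of all pairwise distances between distinct words.
distances = {euclidean(u, v) for u, v in itertools.combinations(V, 2)}

# For every word u, the only element of V within distance < 1 is u itself.
isolated = all([v for v in V if euclidean(u, v) < 1.0] == [u] for u in V)

print(distances, isolated)
```

The set `distances` contains a single value, and `isolated` is `True`, which is the finite-dimensional fact the discreteness of $V$ rests on.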
I suggest that we embed $S$ into the product space $$(\mathbb R^n)^{\mathbb N},$$ associating to a $(w_1,w_2,\dots, w_m)\in V^m\subset S$ the sequence $$(w_1,w_2,\dots, w_m, 0, 0, \dots)\in (\mathbb R^n)^{\mathbb N}.$$ As is usual, the space $$(\mathbb R^n)^{\mathbb N}$$ shall be equipped with the product topology (see H. Schubert, Topologie (1969), pages 30ff. or https://encyclopediaofmath.org/wiki/Topological_product or https://en.wikipedia.org/wiki/Product_topology).
Claim. With the above conventions, $S$ is discrete but not closed.
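Proof. Take any (in the sense of the embedding above) $$(w_1,w_2,\dots, w_m, 0, 0, \dots)\in S.$$ Since $V\cup\{0\}$ is finite and hence discrete, for each $w_i$ there exists a neighborhood $U_i\subset\mathbb R^n$ of $w_i$ such that $(V\cup\{0\})\cap U_i = \{w_i\}$, and there exists a neighborhood $U_0\subset\mathbb R^n$ of $0$ such that $V\cap U_0=\emptyset$. Then $$S\cap \left(U_1\times U_2\times\dots\times U_m \times U_0\times\mathbb R^n\times\mathbb R^n\times\dots\right)=\{(w_1,w_2,\dots, w_m, 0, 0, \dots)\}.$$ Indeed, the factors $U_1,\dots,U_m$ force the first $m$ entries to be $w_1,\dots,w_m$, and the factor $U_0$ forces the $(m+1)$-th entry to be $0$, which rules out every longer sentence that begins with $w_1,\dots,w_m$. Since only finitely many factors differ from $\mathbb R^n$, this product is open in the product topology. This shows that $S$ is discrete. We now come to the non-closedness of $S$: Consider the sequence $(x_k)_{k\in\mathbb N}$ with each $x_k\in S$ given by $$x_k=(\underbrace{w, w,\dots, w}_{k\text{ times}}, 0, 0, 0,\dots),$$ where $w\in V$ is any word. Then the $x_k$ converge to $$(w,w,w,w,w\dots)\in (\mathbb R^n)^{\mathbb N}$$ (exercise; hint: convergence in the product topology is coordinatewise, and the $j$-th coordinate of $x_k$ equals $w$ for all $k\geq j$). However, the limit is not a sentence since it contains infinitely many words. Therefore $S$ is not closed. $\square$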
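The convergence step in the proof can be illustrated with a small Python sketch. It is not a proof, only a finite check, and the names `coordinate`, `eventually_constant`, `W` and `ZERO` are assumptions introduced here. The idea is that convergence in the product topology is coordinatewise, and for a fixed position $j$ the $j$-th coordinate of $x_k$ is $0$ while $k<j$ and equals $w$ for every $k\geq j$.

```python
ZERO = "0"  # stands in for the zero vector of R^n
W = "w"     # stands in for an arbitrary fixed word in V

def coordinate(k, j):
    """j-th coordinate (1-indexed) of x_k = (w, ..., w, 0, 0, ...) with k copies of w."""
    return W if j <= k else ZERO

def eventually_constant(j, horizon=50):
    """Check, up to a finite horizon, that coordinate j of x_k equals w for all k >= j."""
    return all(coordinate(k, j) == W for k in range(j, j + horizon))

for j in (1, 2, 10):
    # Coordinates of x_1, ..., x_{j-1} at position j are still zero.
    before = [coordinate(k, j) for k in range(1, j)]
    print(j, before, eventually_constant(j))
```

Each coordinate stabilises at $w$ after finitely many steps, so the coordinatewise limit is the constant sequence $(w,w,w,\dots)$.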
Now about manifolds: The theory of infinite-dimensional manifolds is very complex and I know little about it, so I will restrict myself to looking at the following set: $$S_k=\{\text{All sentences with at most $k$ words}\}$$ for some $k\in\mathbb N$. A sentence is always an element of the form $$(w_1,w_2,\dots, w_l, 0, 0,\dots, 0)\in(\mathbb R^n)^k$$ for some words $w_1,\dots, w_l\in V$ and $l\in\{0,1,\dots, k\}$. Now $S_k$ is a subset of $(\mathbb R^n)^k$, which is much easier to handle.
Note. In fact, one can think of $S_k$ as a set of matrices; let me know if you are interested in me elaborating on this.
In this new setting, with the canonical topology of $(\mathbb R^n)^k$, one actually has that $S_k$ is discrete and closed (it is a finite set). In particular, $S_k$ can itself be seen as a $0$-dimensional submanifold of $\mathbb R^{n\cdot k}$. However, it cannot have dimension higher than $0$, and no manifold of positive dimension can lie inside it, since every subset of a discrete set is discrete. The way that I understand the manifold hypothesis, though, would be that the sentences contained in $S_k$ which appear in the real world lie on some "nice" submanifold of $\mathbb R^{n\cdot k}$, even though I would first need to read more about the latter hypothesis.
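For small parameters, $S_k$ can be enumerated explicitly. The following Python sketch assumes toy values `n = 3` and `k = 2`, and the helper names `sentences` and `distance` are hypothetical. Each sentence is stored as a tuple of $k$ rows in $\mathbb R^n$, i.e. as a $k\times n$ matrix with zero rows as padding. The sketch counts the elements and computes the smallest distance between two distinct sentences.

```python
import itertools
import math

def one_hot(i, n):
    """Return the i-th standard basis vector of R^n as a tuple."""
    return tuple(1.0 if j == i else 0.0 for j in range(n))

def sentences(n, k):
    """Yield every sentence of at most k words as k rows in R^n, zero-padded."""
    zero = tuple(0.0 for _ in range(n))
    words = [one_hot(i, n) for i in range(n)]
    for length in range(k + 1):
        for choice in itertools.product(words, repeat=length):
            yield choice + (zero,) * (k - length)

def distance(s, t):
    """Euclidean distance in R^(n*k) between two padded sentences."""
    return math.sqrt(sum((a - b) ** 2
                         for row_s, row_t in zip(s, t)
                         for a, b in zip(row_s, row_t)))

n, k = 3, 2  # hypothetical toy sizes
S_k = list(sentences(n, k))
expected_size = sum(n ** l for l in range(k + 1))  # 1 + n + ... + n^k
min_gap = min(distance(s, t) for s, t in itertools.combinations(S_k, 2))

print(len(S_k), expected_size, min_gap)
```

With these toy values there are $1+3+9=13$ sentences, and the smallest gap is $1$ (a word versus a padding zero in the same slot), so every point of $S_k$ is isolated.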