AI Train/Test Split in Set Builder Notation


I'd like to express the splitting of input vectors and their corresponding labels in set builder notation. My question consists of a few smaller parts.

Consider a set of input vectors $X = \{\hat{x}_n \mid n \in \mathbb{Z}, 0 \leq n \lt C \}$, where $C$ is the total number of available input vectors. In AI problems you often want to split the input set into two disjoint parts of unequal size, $X_{\text{train}}$ and $X_{\text{test}}$. Let's assume an $80\%$ train, $20\%$ test scenario. My first question is how best to represent this. I have considered binomial coefficient notation (the set of all subsets of $X$ of a given size) followed by set subtraction:

$$ X_\text{test} \in {X \choose {\lfloor \frac{C}{5} \rfloor}} \qquad X_\text{train} = X \setminus X_\text{test} $$

Is the above valid? This leads to the next question: for labels, let's say we have $y = \{\lambda_n \in \{0, 1\} \mid n \in \mathbb{Z}, 0 \leq n \lt C \}$. Is it enough to say:

$$ y_\text{test} = \{\lambda_n \mid \hat{x}_n \in X_\text{test}\} \qquad y_\text{train} = \{\lambda_n \mid \hat{x}_n \in X_\text{train}\} $$

or do I need to qualify the possible values of $n$ again, or include a $\forall$ somewhere, or am I just generally way off in notation? If I am, would something like the union of sets notation work?

$$ y_\text{test} = \bigcup_{\hat{x}_n \in X_{\text{test}}} \{\lambda_n\} \qquad y_\text{train} = \bigcup_{\hat{x}_n \in X_{\text{train}}} \{\lambda_n\} $$

Or does this just present the same issues as above?
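For concreteness, here is a rough Python sketch of the operation I am trying to write down. It is only an illustration under my own assumptions: the function name `split_indices`, the `seed` parameter, and the toy data are all made up for this example. The split is done on the indices $n$, with $\lfloor C/5 \rfloor$ of them chosen for the test set, and the inputs and labels are then both looked up by index.

```python
import random


def split_indices(C, seed=0):
    """Pick floor(C / 5) indices for the test set; the rest are the train set.

    Hypothetical helper for illustration only; `seed` just makes the
    random choice reproducible.
    """
    rng = random.Random(seed)
    n_test = C // 5                                  # floor(C / 5)
    test_idx = set(rng.sample(range(C), n_test))     # one subset of size n_test
    train_idx = set(range(C)) - test_idx             # set difference
    return train_idx, test_idx


# Toy data (made up): C = 10 input vectors and their binary labels.
C = 10
X = [(0.1 * n, 0.2 * n) for n in range(C)]
labels = [n % 2 for n in range(C)]

train_idx, test_idx = split_indices(C)

# Inputs and labels are both selected through the shared index n.
X_test = [X[n] for n in sorted(test_idx)]
y_test = [labels[n] for n in sorted(test_idx)]
X_train = [X[n] for n in sorted(train_idx)]
y_train = [labels[n] for n in sorted(train_idx)]

# Collapsing the labels into a plain set loses the pairing with the inputs,
# because the labels only take values in {0, 1}.
y_test_as_set = set(y_test)
```

One thing the sketch makes visible is that `y_test_as_set` can never have more than two elements, since every $\lambda_n$ is $0$ or $1$. I suspect the same applies to my set-builder and union definitions of $y_\text{test}$ and $y_\text{train}$, which may be part of what is wrong with them.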