By a tree we mean a graph $G$ in which any two vertices $u, v$ are connected by exactly one path. Equivalently, an undirected, connected, acyclic graph.
So, there's this thing in statistics called a Decision, or Regression, Tree. You feed the algorithm a sample of data, each observation consisting of a pair $(x, y)$: the explanatory variables correspond to a point $x \in \mathbb{R}^n$ and the outcome to $y \in Y$. If $Y = \mathbb{R}$ it's called a regression tree; if $Y$ is discrete it's a decision tree. Once you fit a well-suited tree, you (ideally) have a good model for predicting these outcomes. One can extend this framework to outcomes of higher dimensionality, but let's stay with this toy model for now.
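To fix ideas, here is a minimal sketch of what such a sample might look like, using the income example introduced below. All names and numbers are hypothetical toy data, not real observations.

```python
# Hypothetical toy sample: each observation is a pair (x, y), with
# x in R^3 = (higher degree in {0, 1}, years of experience, height in cm)
# and y the income in thousands - so Y = R and we are in the
# regression-tree case.
sample = [
    ([0, 2, 170], 30.0),
    ([0, 7, 182], 38.0),
    ([1, 3, 175], 72.0),
    ([1, 9, 168], 95.0),
]
```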
We now have to define a measure of inequality - for reasons which will soon become clear. Say you're modelling income using three variables: a binary indicator for holding a higher degree, years of experience, and height. There are different ways to divide the sample into two parts so that, say, the proportions of outcomes $y$ above some threshold $m \in \mathbb{R}$ differ as much as possible between the parts. One could divide the data into people above and below the median height; there might be a difference in income, perhaps even a statistically significant one. We do know, however, that dividing the data into people with and without college degrees likely induces a much larger difference in income across the subsamples. That seems a better division of the data, in the sense that it maximizes inequality of outcomes across subsamples under a certain metric - or, equivalently, minimizes entropy (or variance) within them.
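The split search described above can be sketched in a few lines. This is a minimal illustration, not a production algorithm: it scans a single explanatory variable and scores each candidate threshold by the size-weighted within-group variance (one common stand-in for "entropy within samples" in the regression case). The function names are my own.

```python
def variance(ys):
    """Population variance of a list of outcomes (0 for an empty list)."""
    if not ys:
        return 0.0
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Return (threshold, score) for the split `x <= t` that minimizes
    the size-weighted within-group variance of the outcomes."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = len(left) * variance(left) + len(right) * variance(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

On a toy sample where `xs` is the degree indicator and `ys` are incomes, the search recovers the degree split as the one minimizing within-group variance, matching the intuition in the paragraph above.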
Paths from the root to the leaves characterise classification or prediction rules, such that for any $x \in \mathbb{R}^n$ there is a $y \in Y$ which is the best prediction given the data and the hyperparameters of the tree. Each leaf/rule maps the points that arrive at it to a predicted value using an aggregation such as the mean or median of the outcomes. This sounds like a function $T : \mathbb{R}^n \to Y$.
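One way to make the function reading literal is to represent the tree as nested nodes and define prediction as following the root-to-leaf path. The sketch below is a hand-made toy tree (its thresholds and leaf values are invented, not fitted), but `predict` really is a function $T : \mathbb{R}^n \to Y$.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    value: float  # aggregated outcome, e.g. mean of the y's at this leaf

@dataclass
class Node:
    feature: int      # index into x
    threshold: float
    left: "Tree"      # taken when x[feature] <= threshold
    right: "Tree"     # taken otherwise

Tree = Union[Leaf, Node]

def predict(tree: Tree, x) -> float:
    """Follow the root-to-leaf path determined by x; the leaf reached
    gives the predicted value, so this realises T : R^n -> Y."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.value

# Toy tree: split on "has degree" (feature 0), then on experience (feature 1).
T = Node(0, 0.5,
         Leaf(35.0),
         Node(1, 5.0, Leaf(70.0), Leaf(95.0)))
```

For example, `predict(T, [1, 2, 175])` walks right at the root (degree = 1) then left (experience 2 <= 5) and returns the leaf value 70.0.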
(Little détour: it seems plausible that if you aggregate $k$ decision trees into a forest and average their predictions, and all explanatory variables are continuous, then for a high enough $k$ the resulting forest $F : \mathbb{R}^n \to Y$ gets arbitrarily close to a continuous function.)
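The détour can be sketched directly: treat each tree as a callable and define the forest as the average. The two "trees" below are toy one-dimensional step functions with hypothetical thresholds; note that for any finite $k$ the average of step functions is still a step function, just with smaller jumps, which is why the continuity in the détour is at best a limiting behaviour.

```python
def forest(trees):
    """Average the predictions of a list of tree functions R^n -> R."""
    def F(x):
        return sum(t(x) for t in trees) / len(trees)
    return F

# Two toy "trees" splitting the first coordinate at different thresholds:
t1 = lambda x: 0.0 if x[0] <= 0.3 else 1.0
t2 = lambda x: 0.0 if x[0] <= 0.7 else 1.0

F = forest([t1, t2])
# F takes the values 0, 0.5, 1: averaging halves the jump sizes, but F
# is still piecewise constant for finite k.
```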
My question is: how do I represent a decision tree as a mapping from a vector space to some target set?