Minimizing average depth of a decision tree: when is greedy optimal?


Fix parameters $m$ and $n$. Consider a finite set $A$ and a set of attributes $f_1,\ldots,f_m$ where $f_i: A \rightarrow [n]$. We want to construct a decision tree $T$ with minimum average depth.

We choose an element $a \in A$ uniformly at random. The element $a$ is of course hidden from $T$, but at each node of $T$, we are allowed to query $f_i(a)$ for some $i \in [m]$ of our choice.

We are given that for any $a \neq a' \in A$, there exists at least one $f_i$ such that $f_i(a) \neq f_i(a')$. So we can always distinguish between two elements in $A$.
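The setup can be made concrete with a tiny instance. Below is a sketch (the pair-valued elements and coordinate-reading attributes are illustrative choices, not part of the question) that checks the stated separation condition:

```python
from itertools import combinations

# Hypothetical toy instance: elements of A are pairs, and each
# attribute f_i reads one coordinate (m = 2, n = 2).
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
fs = [lambda a: a[0], lambda a: a[1]]

def all_pairs_distinguishable(A, fs):
    """The stated condition: for every a != a', some f_i separates them."""
    return all(any(f(a) != f(b) for f in fs) for a, b in combinations(A, 2))

print(all_pairs_distinguishable(A, fs))  # True for this instance
```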

Suppose we construct our decision tree in the standard way, maximizing information gain at every stage: at any node, if we have restricted our search space to some $A' \subseteq A$, we split on the attribute whose induced partition of $A'$ has the most entropy. To be precise, fix a function $f_i$ and let $A_j = f_i^{-1}(j) \cap A'$ for $j \in [n]$. Let $J = \{j: A_j \neq \varnothing\}$. Then the entropy is given by $H(f_i) = -\sum_{j \in J} \frac{|A_j|}{|A'|} \log_2\left(\frac{|A_j|}{|A'|}\right)$. We choose the $f_i$ that maximizes $H(f_i)$ at this node. To emphasize: at a particular node, we are only allowed to split based on the result of a single function $f_i$.
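For reference, the greedy procedure can be sketched as follows (a minimal Python sketch; the toy instance at the bottom and all function names are my own, and the recursion relies on the separation condition so that the best split is always proper):

```python
from math import log2

def split(A_prime, f):
    """Partition A' into the nonempty classes A_j = f^{-1}(j) ∩ A'."""
    parts = {}
    for a in A_prime:
        parts.setdefault(f(a), []).append(a)
    return list(parts.values())

def split_entropy(A_prime, f):
    """H(f) = -sum_j (|A_j|/|A'|) log2(|A_j|/|A'|) over nonempty classes."""
    n = len(A_prime)
    return -sum(len(p) / n * log2(len(p) / n) for p in split(A_prime, f))

def greedy_total_depth(A_prime, fs, depth=0):
    """Sum, over a in A', of the depth at which the greedy tree identifies a.
    Dividing by |A| gives the average depth.  Terminates because, by the
    separation assumption, the max-entropy attribute always splits A'."""
    if len(A_prime) == 1:
        return depth
    best = max(fs, key=lambda f: split_entropy(A_prime, f))
    return sum(greedy_total_depth(p, fs, depth + 1) for p in split(A_prime, best))

# Toy instance: pairs as elements, coordinate projections as attributes.
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
fs = [lambda a: a[0], lambda a: a[1]]
print(greedy_total_depth(A, fs) / len(A))  # average depth 2.0
```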

Is there anything known about when this greedy approach produces the optimal decision tree, that is, the decision tree with minimum average depth? Are any nice bounds known under some reasonable assumptions?


Too long for a comment:

If it is optimal: I think that in order to prove optimality you may want to proceed by induction on the symbols of $A$, combining them step by step. To do that you will probably also have to assign probabilities to the symbols of $A$ and adapt your entropy to $-\sum_{j\in J} \frac{p(A_j)}{p(A')}\log_2\left( \frac{p(A_j)}{p(A')} \right)$.

Concretely, for the induction you might take the set $A'$ from your procedure with the lowest probability (it will be one just before the leaves), replace it inside $A$ with a fresh symbol $s$ (something like $B=\{s\} \cup (A\setminus A')$), and give probability $p(A')$ to that symbol $s$ (this is why you may need non-uniform probabilities). By the definition of $A'$, there exists a function $f_i$ that discriminates all of its elements.

Now assume there is a decision tree better than yours. This is the blurry step for me, but that tree has to discriminate the elements of $A'$ at some point, and there may be a way to transform it into a tree that is at least as good and in which those elements are always discriminated by $f_i$. If you manage that, then the new tree is essentially a code for $B$, which by induction is at least as bad as what your procedure would produce on $B$; adding the query of $f_i$ to finish the discrimination then adds the same cost to both trees, and therefore yours is optimal.
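Independently of the induction argument, on tiny instances one can sanity-check the greedy tree against the true optimum by brute force: recurse over every choice of splitting attribute and take the cheapest. This is only a sketch (exponential time; all names and the toy instance are my own), where each split charges 1 to every element of $A'$ still undetermined, which equals the sum of leaf depths:

```python
from math import log2

def split(A_prime, f):
    """Partition A' into the nonempty classes induced by attribute f."""
    parts = {}
    for a in A_prime:
        parts.setdefault(f(a), []).append(a)
    return list(parts.values())

def split_entropy(A_prime, f):
    n = len(A_prime)
    return -sum(len(p) / n * log2(len(p) / n) for p in split(A_prime, f))

def greedy_total_depth(A_prime, fs, depth=0):
    """Total depth of the greedy (max-entropy-split) tree on A'."""
    if len(A_prime) == 1:
        return depth
    best = max(fs, key=lambda f: split_entropy(A_prime, f))
    return sum(greedy_total_depth(p, fs, depth + 1) for p in split(A_prime, best))

def optimal_total_depth(A_prime, fs):
    """Minimum total depth over ALL decision trees: each split charges
    |A'| (one level for every element still in play).  Exponential time."""
    if len(A_prime) <= 1:
        return 0
    best = float('inf')
    for f in fs:
        parts = split(A_prime, f)
        if len(parts) == 1:
            continue  # f does not separate anything in A'
        best = min(best, len(A_prime) + sum(optimal_total_depth(p, fs) for p in parts))
    return best

A = [(0, 0), (0, 1), (1, 0), (1, 1)]
fs = [lambda a: a[0], lambda a: a[1]]
print(greedy_total_depth(A, fs), optimal_total_depth(A, fs))  # 8 8
```

On this particular instance greedy matches the optimum; enumerating small random instances with such a checker is one way to hunt for counterexamples or build intuition before attempting the induction.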