I am trying to find the exponential generating function of directed minimal acyclic graphs (which I now call dag), where every non-leaf node has two outgoing edges.
Context: A simple tree compression algorithm consists of saving repeated subtrees only once.
Further occurrences of repeated trees are simply linked to the first occurrence.
This way one gets a unique minimal directed acyclic graph, and since we started with a tree it's also rooted.
For simplicity I would like to treat binary trees, hence two outgoing edges per non-leaf node.
A natural question is how big dags of binary trees of size $n$ are,
and the question has been answered
here
(the paper is Analytic Variations on the Common Subexpression Problem, by Flajolet et al).
I would like to ask a different question, namely how many different dags of size $n$ there are, or equivalently, how many rooted plane binary trees have a dag of size $n$?
As an example, for $n=3$, we have three such trees, namely
$a(a(a,a),a(a,a))$, $a(a(a,a),a)$ and $a(a,a(a,a))$.
For $n=4$ there are $15$ trees, for $n=5$ there are $111$.
A promising sequence from OEIS is A001063, but I can neither make sense of the differential equation mentioned there, nor do I have a combinatorial explanation for the formula there that calculates $a_{n+1}$, given $a_1,\dots,a_n$:
$$
a_{n+1} = \sum_{k=0..n} \frac{n!}{k!} \cdot \binom{n-1}{k-1}\cdot a_k
$$
If requested, I could add where I got stuck (I mostly tried to make sense of the formula), but I think my post is already too long.
Addendum. This question has generated its own OEIS-series (A254789)! Thanks to everyone involved!
Update: An $\mathcal{O}(n^6)$ dynamic programming solution was found on the Project Euler forums : $T(100)$. A faster recurrence was discovered in [Genetrini 2017] : $T(350)$.
I'm hoping that this post will serve as a full exposition of our ideas thus far - so feel free to edit or add to it. We are trying to count trees by their number of distinct subtrees. The compiled list and OEIS A254789 (offset 1):
$$ \; \\ $$
Motivation
The question of counting such trees is a natural extension of the results in $[1]$, neatly summarized
$$ \; \\ $$
Problem Statement
In this thread we will be concerned with one type of tree, namely unlabeled plane rooted full binary trees. For convenience and clarity we drop the descriptive titles and simply call them "trees". In such trees each internal node has exactly two children (full binary) and we establish the order of children (plane). Instead of approximating the number of distinct subtrees given a tree, we want to count the number of trees with $k$ distinct subtrees, or rather the number of trees with $k$ nodes in its compacted DAG. More precisely, let $\mathcal{T}$ be the set of all trees, and for a given tree $\tau \in \mathcal{T}$, let $S(\tau)$ be the set of subtrees in $\tau$. Then we want to count $\tilde{\mathcal{T}_k} = |\mathcal{T}_k|$ where
$$ \mathcal{T}_k = \left\{ \tau \in \mathcal{T}, \; \left| \, S(\tau) \, \right| = k \right\} $$
Note that $\left| \, S(\tau) \, \right| = \left| \, \text{dag}(\tau) \, \right|$ is the number of nodes in the compacted DAG of $\tau$.
$$ \; \\ $$
Examples
To understand what we are counting and how the compaction works, here is an enumeration of $\mathcal{T}_k$ for $k\leq4$. The trees are colored black, their compacted DAGs blue, and their subtrees red. Note how each node in the DAGs corresponds to a particular subtree. [Higher Quality]
$ \qquad \qquad \qquad \qquad $
You may notice that the OEIS is offset by 1 and that the example trees are not full (some internal nodes have less than two children). By removing the leaves from each of our trees we notice a bijection between full binary trees with $k$ subtrees and binary trees with $k-1$ subtrees, hence the offset. All the methods of counting developed have an analog slightly altered to fit this interpretation (eg. in the canonical form of the DAGs each node may have less than two children). Since this doesn't provide any speedup in computation, we will ignore this interpretation henceforth.
$$ \; \\ $$
Preliminary Observations
If a tree has $n$ internal nodes, then it has $n\!+\!1$ leaves. The number of trees with $n$ internal nodes is the Catalan number $C_n = \frac{1}{n+1}{2n \choose n} \approx \frac{4^n}{\sqrt{\pi n^3}} $.
If a tree $\tau$ has height $h$, then $\left| \, \tau \, \right| \in [2h\!-\!1,2^h\! -\! 1]$ and $\left| \, S (\tau) \, \right| \geq h $. The number of trees with height $h$ satisfies the recurrence $T_{(h)} = T_{(h-1)}^2 + 2 T_{(h-1)} \sum_{k=1}^{h-2} T_{(k)}$. As seen on OEIS A001699, $T_{(h)} \approx 1.5^{2^h}$.
The result of $[1]$ gives us a rough estimate for $\tilde{\mathcal{T}_k}$. It tells us that the expected number of subtrees for a tree with $n$ internal nodes is $$\tilde{K}_n = 2 \sqrt{\frac{\log 4}{\pi}} \frac{n}{\sqrt{\log n}} \left( 1 + \mathcal{O}\left(\tfrac{1}{\log n} \right) \right)$$ Suppose we fix $\tilde{K}_n=k$ and solve for $n$. Then we get the expected number of internal nodes of a tree with $k$ subtrees. Given this estimate of $n$, we can guess that there will be roughly $C_n$ different trees with $k$ subtrees.$$ \; $$ $ \qquad \qquad \qquad \qquad \qquad $
$$ \; $$
The fact that this underestimates the true values can perhaps be attributed to the $\mathcal{O}(\frac{1}{\log n})$ term, which I took to be zero. In any case we see exponential growth, which means that we will need to develop a method to count rather than enumerate trees.
$$ \; \\ $$
A Method of Enumeration
Each tree $\tau = x\,(L,R)$ is uniquely defined by its set of subtrees, which can be written in terms of those of the left and right subtrees $L$ and $R$. $$ S(\tau) = \{\tau\} \cup S(L) \cup S(R) $$ This leads to a natural algorithm to enumerate $\tilde{\mathcal{T}_k}$. We build $\mathcal{T}$ by glueing trees together via an added root node. The resulting set of subtrees is the union of the two glued trees plus a new element that denotes the entire tree. Letting $\imath$ denote the singleton tree, the explicit trees we find after each glueing iteration are
$$ \begin{array} ( T^{(1)} &= \{\imath\} & = \{1\} \\ T^{(2)} &= \{\imath, \imath(\imath,\imath)\} & = \{1,2\}\\ T^{(3)} &= \{\imath, \imath(\imath,\imath), \imath(\imath(\imath,\imath),\imath), \imath(\imath,\imath(\imath,\imath)), \imath(\imath(\imath,\imath),\imath(\imath,\imath))\} & = \{1, 2, 3, 4, 5\} \\ \end{array} $$
$$ \begin{align} T^{(1)} &= \{ \{1\} \} \\ T^{(2)} &= \{ \{1\},\{1,2\} \} \\ T^{(3)} &= \{ \{1\},\{1,2\}, \{1,2,3\}, \{1,2,4\}, \{1,2,5\} \} \\ \end{align} $$
Thus we find $\tilde{\mathcal{T}_1}=1, \; \tilde{\mathcal{T}_2}=1, \; \tilde{\mathcal{T}_3}=3$. It is clear that after the $k$th iteration, you will have enumerated all trees in $\mathcal{T}_k$. Of course, you will also enumerate trees with more than $k$ subtrees, so it is prudent to prune any such trees along the way. The downside to this algorithm is that it enumerates every tree. Since the number of trees grows exponentially, we find that computing $\tilde{\mathcal{T}_k}$ becomes intractable for $k>9$. For larger $k$, we will need to develop a method of counting. An implementation was written in Python: code. User Marko Riedel also posted various implementations in (B1) (B2) (B3).
$$ \; \\ $$
A Method of Counting
This method was developed by @d8d0d65b3f7cf42's in his posts (A1) (A2) and then slightly optimized here. Instead of enumerating trees, we count DAGs. We start by characterizing the DAGs in a unique way. Every node $v$ in the DAG represents a unique subtree. We let the height of $v$ be the height of the tree it represents. Note that this equals the length of the longest chain from $v$ to the node $\imath$ that represents the singleton tree. By grouping nodes of the same height we form layers and the graph takes shape.
$\qquad \qquad \qquad \qquad \qquad \qquad\qquad \qquad$
Noting the natural ordering of the nodes, we can represent each DAG in canonical form:
One interesting observation is that the number of canonical DAGs with shape $(1,1,1,\ldots,.,1)$ is $(2k-3)!!$. There is a correspondence between these DAGs and a restrictive set of trees constructed in the recursive enumeration method. If at each iteration you only glue two trees if one is a subtree of the other, then you end up with these DAGs exactly. However, this doesn't play a role in our counting computation.
To count $\tilde{\mathcal{T}_k}$ we do the following. First we generate the possible shapes that the DAGs can take. The possible shapes are a subset of the compositions of $k$. With a great/little amount of effort you can enumerate these exactly/approximately by pruning any shapes that have layers with too many nodes (eg. a layer cannot be wider than the number of possible children pairs below). Next, the idea is to keep track of the nodes that have parents. So for each shape there will be many different boolean coverings where each node is assigned a value of $\mathtt{True}$ if it has atleast one parent or $\mathtt{False}$ otherwise. We can count the number of DAGs that have a given $\mathtt{shape}$ and $\mathtt{covering}$ by inducting on the height of the graph. This leads to a recursive algorithm in which we attach the top-layer nodes to those below them in every valid way.
It is here that another optimization manifests. If there are $\ell_1$ nodes in the top layer of the DAG, $\ell_2$ in the second layer and $\ell_{3}$ nodes below, then there are ${\ell_2^2 + 2\ell_2\ell_3 \choose \ell_1}$ ways of connecting the top layer to the rest of the DAG. This number grows fast, becoming unwieldy for even modest shapes. An alternative idea is to choose the children and then count the number of ways to assign the top layer to those children. Suppose we connect the top layer to exactly $\alpha$ nodes in the second layer and $\beta$ nodes below. Using inclusion exclusion, we find that the number of ways to connect the top layer in such a manner is $$ M(\ell_1, \alpha, \beta) = \sum_{j=0}^{\alpha + \beta} (-1)^j \sum_{i=0}^{j}{\alpha \choose i} {\beta \choose j-i} {(\alpha-i)^2 + 2(\alpha-i)(\beta-j+i) \choose \ell_1}$$
Each summand can be realized as such: we choose $\ell_1$ of the possible children pairs given that we leave $i$ children uncovered in the second layer and $j-i$ children uncovered in the lower layers. Finally, a shape $S$ and covering $c$ is canonical if the top layer of $S$ has one node (the DAG contains a root) and if $c$ assigns every other node $\mathtt{True}$ (requirement 3). Below is pseudocode to count the number of DAGs for each covering of a given shape.
$ \; \\ $
$\qquad \qquad \qquad \qquad \qquad \qquad \qquad$
$$ \begin{align} S &= (3,2,3,1,2,1,1) \\ \ell_1, \alpha, \beta &= 3,2,2 \\ c &= \;\;\;\;\;0010101011 \\ \varsigma &= 0001100100100 \\ \varsigma \vee c &= 0001110101111 \\ \end{align} $$
Here is a dirty Python implementation. This code confirms @d8d0d65b3f7cf42's values for $k\leq16$ and was used to obtain values for $k\leq 21$ -- though it takes about 14 hours for $k=21$. I fixed the memory issues by removing old values from the memoization table (being careful not to re-add any recomputed canonical shape). Excitingly, I think I have a way to directly count by shape (which would get us ~10 more values). It counts by an equivalence relation: a $k \times k$ matrix for which the $i,j$th entry is the number of trees of size $i$ that would add $j$ subtrees when taking the union.
The two codes can be easily altered to count trees where we don't care about the order of children (no longer "plane" trees). In each you simply need to comment one line of code. In the enumeration method, you only perform one of the two gluings $x(\tau_1,\tau_2), x(\tau_2, \tau_1)$. In the counting method, the only thing that changes is the third binomial coefficient in the inclusion exclusion formula. The list of values of this sister sequence for $k\leq19$ is [1, 1, 2, 6, 25, 137, 945, 7927, 78731, 906705, 11908357, 175978520, 2893866042, 52467157456, 1040596612520, 22425725219277, 522102436965475, 13064892459014192, 349829488635512316].