Let’s define a transducer as a $5$-tuple $(Q, A, B, \phi, \psi)$, where $Q$ is the collection of states, $A$ is the input alphabet, $\phi: Q\times A \to Q$ is the transition function and $\psi: Q \times A \to B$ is the output function.
Any transducer defines a transducer function $\overline{\psi}: Q\times A^* \to B^*$ described by the following recurrence:
$\overline{\psi}(q, \Lambda) = \Lambda$, where $\Lambda$ is the empty word.
$\overline{\psi}(q, a \alpha) = \psi(q, a) \overline{\psi}(\phi(q, a), \alpha)$, where $a \in A$, $\alpha \in A^*$.
We call $q_1, q_2 \in Q$ distinguishable iff $\exists \alpha \in A^*$ such that $\overline{\psi}(q_1, \alpha) \neq \overline{\psi}(q_2, \alpha)$.
We call a transducer reduced iff any two its states are distinguishable.
Suppose $n_t(k, m, n)$ is the number of distinct reduced deterministic transducers with input with $k$ states, $m$-symbol input alphabet and $n$-symbol output alphabet. Is there some sort of explicit formula (or at least asymptotic) for $n_t(k, m, n)$?
I managed to derive only the following three particular cases:
$n_t(1, m, n) = n^m$
Proof:
As there is only one state, the transducer is fully determined by its output function. And it is in this case just a map from the input alphabet to the output one.
$n_t(2, m, n) = (4n)^m(n^m - 1)$
Proof:
Suppose, $q_1$ and $q_2$ are the only two states of our transducer. It is not hard to see, that in this particular case they are not distinguishable iff $\psi(q_1, a) = \psi(q_2, a)$. Thus there are $n^m(n^m - 1)$ possible versions of $\psi$, that result in our transducer being reduced. And there are $4^m$ ways to chose $\phi$.
$n_t(k, m, 1) = 0$ $\forall k \geq 2$
Proof:
If $B$ is one-element, then $\overline{\psi}(q, \alpha)$ is determined only by the length of $\alpha$.
Generalising the argument for $n_t(2, m, n)$, we can say that two states $q_i$ and $q_j$ are similar, denoted $q_i \sim q_j$, if $\forall a \in A: \psi(q_i, a) = \psi(q_j, a)$. If $q_i \not\sim q_j$ then $q_i$ and $q_j$ are distinguishable by a one-symbol word.
For $k\le 2$ one of those two scenarios must occur. For $k > 2$ we can have situations with non-trivial non-universal clusters of similar states. We define
$$\overline\phi(q, w) = \begin{cases}q & \textrm{if } w = \Lambda \\ \overline\phi(\phi(q, a), \alpha) & \textrm{if } w = a\alpha \end{cases}$$
Then two similar states $q_i \sim q_j$ are distinguishable iff there is a word $w$ such that $\overline\phi(q_i, w) \not\sim \overline\phi(q_j, w)$.
From a computational point of view, the following algorithm serves to verify reducibility:
We can apply this fairly straightforwardly for $k=3$.
Total: $27^m n^m(n^m-1)(n^m-2) + 3^{m+1} (9^m - 5^m) n^m(n^m-1)$ $= 3^m n^m(n^m-1)[9^m (n^m+1) - 3 \cdot 5^m]$
For higher $k$ the sum needs to be over all possible ways of refining the partition.
As for asymptotics: there are $(kn)^{km}$ transducers, and $k^{km}(n^m)^\underline{k}$ transducers in which no two states are similar, so for $k \ll n^m$ the reducible transducers are negligible. In that case, it makes more sense to ask about the asymptotics of the reducible transducers, for which the upper bound $k^{km}(n^{km} - (n^m)^\underline{k})$ has a first order approximation of $k^{km} \cdot \tfrac12 k(k-1) n^{k(m-1)}$.