The stochastic block model is a clustering model on a graph $G=(V,E)$, defined by the following data:
- $k$, a number indicating the number of groups the graph nodes are to be clustered into.
- A $k\times k$ matrix $\bf M$ of probabilities, where entry ${\bf M}(i,j)$ is the probability that a node in the $i$th group has a link to a node in the $j$th group.
- $\bf z$, an $n\times 1$ vector indicating the group membership of each graph node.
The observed edge set $E$ is viewed as being generated by the above model in the following way: for every possible edge (which pairs count as possible depends on whether the graph is directed, simple, or loopless) between nodes $u,v$, flip a biased coin with head probability ${\bf M}({\bf z}(u),{\bf z}(v))$; if the coin comes up heads, draw the link between $u$ and $v$. Each link is generated independently. The likelihood function is then
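The generative process above can be sketched in a few lines. This is a minimal illustration, assuming the simplest convention (directed graph, self-loops allowed), so every ordered pair $(u,v)$ is a possible edge; the function name and example values are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sbm_edges(M, z, rng):
    """Sample an edge set from the stochastic block model.

    M : (k, k) array of link probabilities between groups.
    z : length-n array of group labels in {0, ..., k-1}.
    Assumes a directed graph with self-loops allowed; other graph
    conventions just change which pairs are considered possible.
    """
    n = len(z)
    edges = set()
    for u in range(n):
        for v in range(n):
            # Independent Bernoulli trial with head probability M[z[u], z[v]]
            if rng.random() < M[z[u], z[v]]:
                edges.add((u, v))
    return edges

M = np.array([[0.9, 0.1],
              [0.1, 0.8]])
z = np.array([0, 0, 0, 1, 1, 1])
E = generate_sbm_edges(M, z, rng)
```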
$$\mathcal L( E \mid {\bf M},{\bf z}) = \prod_{(u,v) \in E} {\bf M}({\bf z}(u),{\bf z}(v)) \prod \limits_{(u,v) \notin E} \left( 1 - {\bf M}({\bf z}(u),{\bf z}(v)) \right)$$
Then our task is to find
$$({\bf \hat M},{\bf \hat z}) = \operatorname{argmax}_{{\bf M},{\bf z}} \mathcal L(E \mid {\bf M},{\bf z})$$
Now here comes the problem. Let $E_{i,j}$ be the set of edges connecting a node in group $i$ to a node in group $j$, and let $n_{i,j}$ be the maximum possible number of edges between group $i$ and group $j$. Note that $n_{i,j}$ depends on the graph convention (directed, simple, loopless, etc.). Since links are generated independently, the links between group $i$ and group $j$ are $n_{i,j}$ i.i.d. Bernoulli trials; thus if $\bf z$ is given, ${\bf {M}}(i,j)$ can be estimated by ${\bf {\hat M}}(i,j)=\dfrac{|E_{i,j}|}{n_{i,j}}$, which is itself an MLE, i.e. $$\left(\frac{|E_{i,j}|}{n_{i,j}}\right)_{i,j}={\bf\hat M}= \operatorname{argmax}_{\bf M} \mathcal L\left( E \mid {\bf M},{\bf z} \right)$$
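The estimator $\hat{\bf M}(i,j)=|E_{i,j}|/n_{i,j}$ is easy to compute directly. A minimal sketch, again assuming the directed-with-loops convention so that $n_{i,j}$ is just the number of ordered pairs $(u,v)$ with ${\bf z}(u)=i$, ${\bf z}(v)=j$ (the function name is my own):

```python
import numpy as np

def estimate_M(E, z, k):
    """MLE of M for fixed z: Mhat[i, j] = |E_{i,j}| / n_{i,j}.

    Assumes a directed graph with self-loops allowed, so n_{i,j} is
    the number of ordered pairs (u, v) with z[u] = i and z[v] = j.
    Assumes every group is non-empty (otherwise n_{i,j} = 0).
    """
    z = np.asarray(z)
    sizes = np.array([(z == g).sum() for g in range(k)])
    n_pairs = np.outer(sizes, sizes)          # n_{i,j}
    counts = np.zeros((k, k))
    for (u, v) in E:
        counts[z[u], z[v]] += 1               # |E_{i,j}|
    return counts / n_pairs
```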
Then the textbook plugs ${\bf {\hat M}}(i,j)$ directly back into the original likelihood, leading to the simplified likelihood
$$\mathcal L( E \mid {\bf \hat M},{\bf z}) = \prod\limits_{i,j} {\bf \hat M}(i,j)^{|E_{i,j}|} \left(1 - {\bf \hat M}(i,j)\right)^{n_{i,j} - |E_{i,j}|} = \prod\limits_{i,j} \left(\frac{|E_{i,j}|}{n_{i,j}}\right)^{|E_{i,j}|} \left( 1 - \frac{|E_{i,j}|}{n_{i,j}} \right)^{n_{i,j} - |E_{i,j}|} $$
which is a function of $\bf{z}$ alone.
I am puzzled about why $\bf{\hat M}$ can be plugged directly into the original likelihood. More specifically,
- Is $\bf{\hat M}$ also the solution for $\bf{M}$ of the original joint likelihood maximization problem?
- Does the solution for $\bf{z}$ stay the same after this simplification of the likelihood? Why?
This holds whether or not $\bf M$ and $\bf z$ are dependent.
Recall how one maximizes a two-variable function $f(x,y)$: compute $\frac{{\partial f}}{{\partial x}},\frac{{\partial f}}{{\partial y}}$ and solve the system $\left\{ \begin{gathered} \frac{{\partial f}}{{\partial x}} = 0, \hfill \\ \frac{{\partial f}}{{\partial y}} = 0. \hfill \\ \end{gathered} \right.$ Alternatively, you can first solve $\frac{{\partial f}}{{\partial x}} = 0$ to get a function $x=h(y)$, plug it back into $f(x,y)$, find the $y^*$ that maximizes $f(h(y),y)$, and then set $x^*=h(y^*)$. This is the same thing as solving the system of equations.
To see this further, the maximum point $(x^*,y^*)$ has to satisfy $\frac{{\partial f}}{{\partial x}} = 0$, so the problem is equivalent to maximizing $f(x,y)$ under the restriction $\frac{{\partial f}}{{\partial x}} = 0$, i.e. $x=h(y)$. With a restriction of the form $x=h(y)$, you can simply substitute it into $f$ when $h$ has a closed form; otherwise the method of Lagrange multipliers may simplify the calculation.
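A toy numeric check of this equivalence (my own example, not from the question): take $f(x,y) = -(x-y)^2 - (y-1)^2$. Solving $\partial f/\partial x = 0$ gives $x = h(y) = y$, and the profiled function $f(h(y),y) = -(y-1)^2$ is maximized at $y^*=1$, hence $x^*=1$, matching the joint maximum:

```python
import numpy as np

def f(x, y):
    return -(x - y) ** 2 - (y - 1) ** 2

xs = np.linspace(-2, 3, 501)

# Joint maximization over a grid
X, Y = np.meshgrid(xs, xs)
i, j = np.unravel_index(np.argmax(f(X, Y)), X.shape)
joint_opt = (X[i, j], Y[i, j])

# Profile maximization: df/dx = 0 gives x = h(y) = y,
# so maximize the profiled function f(h(y), y) = -(y - 1)^2 over y
profile = f(xs, xs)                 # f(h(y), y)
y_star = xs[np.argmax(profile)]
x_star = y_star                     # x* = h(y*)
```

Both routes land on the same optimum $(x^*, y^*) = (1, 1)$.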
Specifically for your problem, $\mathcal L$ is a function of $\bf M$ and $\bf z$, and ${\bf \hat M}(i,j)=\frac{|E_{i,j}|}{n_{i,j}}$ is exactly the solution of $\frac{{\partial \mathcal L}}{{\partial {\bf M}}} = 0$ for fixed $\bf z$, so by the argument above there is no problem with plugging it back into $\mathcal L$.
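For completeness, here is the stationarity computation (a standard step, spelled out rather than taken from the question). Differentiating the grouped log-likelihood with respect to a single entry ${\bf M}(i,j)$:
$$\log \mathcal L = \sum_{i,j} \Big( |E_{i,j}| \log {\bf M}(i,j) + \big(n_{i,j}-|E_{i,j}|\big)\log\big(1-{\bf M}(i,j)\big) \Big),$$
$$\frac{\partial \log \mathcal L}{\partial {\bf M}(i,j)} = \frac{|E_{i,j}|}{{\bf M}(i,j)} - \frac{n_{i,j}-|E_{i,j}|}{1-{\bf M}(i,j)} = 0 \;\Longrightarrow\; {\bf M}(i,j) = \frac{|E_{i,j}|}{n_{i,j}},$$
which is exactly ${\bf \hat M}(i,j)$, playing the role of $x = h(y)$ with $y$ standing for $\bf z$.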