A question about maximum likelihood


The stochastic block model is a clustering model on a graph $G=(V,E)$, defined by the following data:

  1. $k$, a number indicating the number of groups the graph nodes are to be clustered into.
  2. A $k\times k$ stochastic matrix $\bf M$, where each entry ${\bf M}(i,j)$ means the probability that a node of the $i$th group has a link with a node in the $j$th group.
  3. $\bf z$, an $n\times 1$ vector indicating the group membership of each graph node.

The observed edge set $E$ is viewed as being generated by this model in the following way: for every possible edge (which pairs are possible depends on whether the graph is directed, simple, or loopless) between nodes $u,v$, flip a biased coin with head probability ${\bf M}({\bf z}(u),{\bf z}(v))$; if the coin comes up heads, draw a link between $u$ and $v$. Each link is generated independently. The likelihood function is then

$$\cal L( E \mid \bf M,z) = \prod_{(u,v) \in E} {\bf M}({\bf z}(u),{\bf z}(v)) \prod \limits_{(u,v) \notin E} \left( 1 - {\bf M}({\bf z}(u),{\bf z}(v)) \right)$$

Then our task is to find

$$({\bf M},{\bf z}) = \operatorname{argmax}_{{\bf M},{\bf z}} \cal L(E \mid {\bf M},{\bf z})$$

Now here comes the problem. Let $E_{i,j}$ be the set of edges connecting a node in group $i$ to a node in group $j$, and let $n_{i,j}$ be the maximum possible number of edges between group $i$ and group $j$. Note that $n_{i,j}$ depends on the graph definition (directed, simple, loopless, etc.). Since each link is generated independently, the links between group $i$ and group $j$ are the outcomes of $n_{i,j}$ i.i.d. Bernoulli trials. Thus, if $\bf z$ is given, ${\bf M}(i,j)$ can be estimated by ${\bf\hat M}(i,j)=\dfrac{|E_{i,j}|}{n_{i,j}}$, which is itself a maximum likelihood estimate, i.e. $$\left(\frac{|E_{i,j}|}{n_{i,j}}\right)_{i,j}={\bf\hat M}= \operatorname{argmax}_{\bf M} \cal L\left( E \mid {\bf M},{\bf z} \right)$$
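Given a fixed assignment $\bf z$, the estimate ${\bf\hat M}$ can be computed directly by counting edges per block pair. A minimal Python sketch, assuming a directed graph with self-loops allowed (so $n_{i,j}=n_i n_j$); the function name and the dict/list representations are illustrative choices, not from the source:

```python
import itertools
from collections import Counter

def estimate_M(edges, z, k):
    """MLE of the block matrix M given a fixed assignment z.

    Assumes a directed graph with self-loops allowed, so the maximum
    number of possible edges from group i to group j is n_i * n_j,
    where n_i is the size of group i.
    """
    sizes = Counter(z.values())                    # n_i for each group i
    e = Counter((z[u], z[v]) for (u, v) in edges)  # |E_{i,j}| per block pair
    M = [[0.0] * k for _ in range(k)]
    for i, j in itertools.product(range(k), repeat=2):
        n_ij = sizes[i] * sizes[j]                 # max possible edges i -> j
        if n_ij:
            M[i][j] = e[(i, j)] / n_ij             # |E_{i,j}| / n_{i,j}
    return M
```

For example, with `z = {0: 0, 1: 0, 2: 1, 3: 1}` and edges `[(0, 1), (1, 0), (0, 2)]`, the two within-group-0 edges out of $2\times 2=4$ possible give ${\bf\hat M}(0,0)=0.5$.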

Then the textbook directly plugs ${\bf \hat M}(i,j)$ back into the original likelihood function, leading to a simplified likelihood

$$\cal L( E \mid {\bf M},{\bf z}) = \prod\limits_{i,j} {\bf M}(i,j)^{|E_{i,j}|} \left(1 - {\bf M}(i,j)\right)^{n_{i,j} - |E_{i,j}|}, \qquad \cal L( E \mid {\bf \hat M},{\bf z}) = \prod\limits_{i,j} \left(\frac{|E_{i,j}|}{n_{i,j}}\right)^{|E_{i,j}|} \left( 1 - \frac{|E_{i,j}|}{n_{i,j}} \right)^{n_{i,j} - |E_{i,j}|} $$

which is a function only dependent on $\bf{z}$.
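That simplified (profile) likelihood can be evaluated and maximized over $\bf z$ alone. A sketch, again assuming a directed graph with self-loops (so $n_{i,j}=n_i n_j$); the brute-force search over all $k^n$ assignments is only meant for tiny illustrative graphs:

```python
import itertools
import math
from collections import Counter

def profile_loglik(edges, z, k):
    """Log of the simplified likelihood: M-hat(i,j) = |E_ij| / n_ij has
    been plugged in, so this depends on the assignment z only.

    z is a sequence of group labels indexed by node. Blocks with
    M-hat = 0 or 1 contribute 0 (the 0 * log 0 = 0 convention).
    """
    sizes = Counter(z)
    e = Counter((z[u], z[v]) for (u, v) in edges)
    ll = 0.0
    for i, j in itertools.product(range(k), repeat=2):
        n_ij = sizes[i] * sizes[j]
        m = e[(i, j)]
        if 0 < m < n_ij:
            p = m / n_ij  # the plugged-in M-hat(i, j)
            ll += m * math.log(p) + (n_ij - m) * math.log(1 - p)
    return ll

def best_assignment(edges, n, k):
    """Brute-force argmax over all k^n assignments -- only viable for tiny n."""
    return max(itertools.product(range(k), repeat=n),
               key=lambda z: profile_loglik(edges, z, k))
```

On a toy 4-node graph whose edges run only within the pairs $\{0,1\}$ and $\{2,3\}$, the maximizer groups each pair together, as one would expect.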

I am puzzled about why $\bf{\hat M}$ can be directly plugged into the original likelihood. More specifically,

  1. Is $\bf{\hat M}$ also the solution for $\bf{M}$ of the original likelihood maximization problem?

  2. Does the solution for $\bf{z}$ stay the same after the simplification of the likelihood? Why?


BEST ANSWER

This holds whether or not $M$ and $z$ are dependent.

Recall how to maximize a function of two variables $f(x,y)$: compute $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ and solve the system $\left\{ \begin{aligned} \frac{\partial f}{\partial x} &= 0, \\ \frac{\partial f}{\partial y} &= 0. \end{aligned} \right.$ Alternatively, you can first solve $\frac{\partial f}{\partial x} = 0$ to get a function $x=h(y)$, plug it back into $f(x,y)$, find the $y^*$ that maximizes $f(h(y),y)$, and then set $x^*=h(y^*)$. This is the same as solving the system of equations.
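The two-step procedure can be illustrated numerically. The toy objective below is an assumption chosen for illustration; solving $\partial f/\partial x = 0$ gives $x = h(y) = 2y$, and maximizing the profiled function $f(h(y),y)$ recovers the same point as solving the joint system:

```python
def f(x, y):
    # Toy objective (illustrative assumption): maximized at (x, y) = (2, 1).
    return -(x - 2 * y) ** 2 - (y - 1) ** 2

def h(y):
    # Solves df/dx = -2(x - 2y) = 0 for x, giving x = h(y) = 2y.
    return 2 * y

# Maximize the profiled function g(y) = f(h(y), y) = -(y - 1)^2 over a grid,
# then recover x* = h(y*); the joint system df/dx = df/dy = 0 gives the same point.
y_star = max((t / 100 for t in range(-500, 501)), key=lambda y: f(h(y), y))
x_star = h(y_star)
```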

To see why, note that a maximum point $(x^*,y^*)$ has to satisfy $\frac{\partial f}{\partial x} = 0$, so the problem is equivalent to maximizing $f(x,y)$ subject to the constraint $\frac{\partial f}{\partial x} = 0$, i.e. $x=h(y)$. With a constraint of the form $x=h(y)$, you can simply substitute it into $f$ when $h$ has a closed form; otherwise the method of Lagrange multipliers may simplify the calculation.

Specifically, in your problem $\cal L$ is a function of $M$ and $z$, and ${\hat M}(i,j)=\frac{|E_{i,j}|}{n_{i,j}}$ is exactly the solution of $\frac{\partial \cal L}{\partial M} = 0$, so by the argument above there is no problem with plugging it back into $\cal L$.

ANOTHER ANSWER

If the value of $\bf M$ that maximizes the likelihood with the other parameters treated as known does not depend on those other parameters, then this can be done. It is similar to what happens when one finds the MLEs for $\mu$ and $\sigma$ given $n$ i.i.d. observations $x_1,\ldots,x_n$ from $N(\mu,\sigma^2)$. The MLE for $\mu$ when treating $\sigma$ as known is the sample mean $\bar x = (x_1+\cdots+x_n)/n$. This does not depend on $\sigma$, so one plugs in $\bar x$ in place of $\mu$ in the likelihood function and then finds the value of $\sigma$ that maximizes it.
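The Normal example can be written out as a two-stage computation. This sketch uses the well-known closed forms of both stages rather than a numerical maximization; note the MLE of $\sigma$ divides by $n$, not $n-1$:

```python
import math

def normal_mle(xs):
    """Two-stage MLE for N(mu, sigma^2).

    Stage 1: the maximizer over mu is the sample mean for every fixed
    sigma, so mu-hat does not depend on sigma.
    Stage 2: plugging mu-hat into the likelihood and maximizing over
    sigma gives the root-mean-square deviation (divide by n, not n - 1).
    """
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma
```

For instance, `normal_mle([1, 2, 3, 4])` returns $\bar x = 2.5$ and $\hat\sigma = \sqrt{1.25}$.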

Theorem: Suppose the value $\hat\alpha$ of $\alpha$ that maximizes $L(\alpha,\beta)$ for fixed $\beta$ does not depend on $\beta$. Let $\hat \beta$ be the value of $\beta$ that maximizes $L(\hat\alpha,\beta)$. Then $(\hat\alpha,\hat\beta)$ is the value of $(\alpha,\beta)$ that maximizes $L(\alpha,\beta)$.

Proof: For every $(\alpha,\beta)$, $L(\alpha,\beta) \le L(\hat\alpha,\beta)$ since $\hat\alpha$ maximizes $L(\cdot\,,\beta)$ and does not depend on $\beta$; and $L(\hat\alpha,\beta) \le L(\hat\alpha,\hat\beta)$ by the definition of $\hat\beta$. $\qquad\blacksquare$