I am trying to implement the X-Means clustering algorithm. In it, the authors use the BIC to determine which model fits the data best. It is explained here: http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf (page 4/section 3.2).
I have a couple of points of confusion. First, in the second equation on p4, it has:
$$ \hat{\sigma}^2 = \frac{1}{R-K}\sum_i \left(x_i - \mu_{(i)}\right)^2 $$
$x_i$ and $\mu_{(i)}$ are vectors, so it seems this should be a squared norm rather than just a squared term; i.e., there is an error that should be corrected to
$$ \hat{\sigma}^2 = \frac{1}{R-K}\sum_i ||x_i - \mu_{(i)}||^2 $$
Is this correct?
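To make the two readings concrete, here is a quick numerical check with made-up data and a hypothetical cluster assignment (none of these numbers come from the paper). Summing squared Euclidean norms of the residual vectors is the same as summing the squares of every coordinate of every residual, which is what the norm interpretation gives:

```python
import numpy as np

# Toy data: R = 6 points in M = 2 dimensions, assigned to K = 2 centroids.
# (Made-up numbers purely to illustrate the notation question.)
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
              [4.0, 4.0], [5.0, 4.0], [4.5, 5.0]])
centroids = np.array([[0.5, 0.5], [4.5, 4.3]])
assign = np.array([0, 0, 0, 1, 1, 1])   # mu_(i) = centroid of x_i's cluster
R, K = X.shape[0], centroids.shape[0]

diffs = X - centroids[assign]                        # x_i - mu_(i)
sigma2 = np.sum(np.sum(diffs**2, axis=1)) / (R - K)  # sum of ||x_i - mu_(i)||^2
sigma2_alt = np.sum(diffs**2) / (R - K)              # sum over every coordinate
print(sigma2, sigma2_alt)  # identical
```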
Also, I follow the derivation up to the last equation in the left column of page 4. The log-likelihood of the data is: $$ l(D) = \log \prod_i P(x_i) = \sum_i \left(\log\left(\frac{1}{\sqrt{2\pi}\,\sigma^M}\right) - \frac{1}{2\sigma^2} \|x_i - \mu_{(i)}\|^2 + \log\left(\frac{R_{(i)}}{R}\right)\right) $$
However, they go on to say that if you plug in the MLE, you get:
$$ \hat{l}(D_n) = -\frac{R_n}{2} \log(2\pi) -\frac{R_n M}{2} \log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log(R_n) - R_n \log(R) $$
I don't see how they get that; can anyone explain the steps? In particular, the $x_i$ terms should neither cancel with anything nor be absorbed into any of the sums, so they should still appear in the final expression, but they do not.
Yes, I think the reviewers, the typesetter, or the author, Pelleg, missed these two points, and it left me confused as well.
Your first point is correct. Since the solution applies to an arbitrary number of dimensions (Pelleg uses two in his examples), the norm would be the better notation, while still referring to the Euclidean norm.
After going through the log-likelihood derivation myself, I think a few steps were skipped, leaving the formulas mangled. The correct derivation might look more like this:
The probability that a sample $i$ in cluster $n$ will be at position $x_i$ is given by
\begin{align} p(x_i \mid \mu_n, \sigma_n^2) = \frac{1}{\sqrt{2\pi}\,\sigma_n^M} \exp\left(-\frac{\|x_i-\mu_n\|^2}{2\sigma_n^2}\right) \tag{1} \end{align}
(Why Pelleg uses the factor $\frac{R_{(i)}}{R}$ with this is a mystery. It looks like he is showing the joint pdf, but this probability assumes $x_i$ is part of the set with the given mean and variance parameter values. This is the assumption for a particular model, which is the basis for the subsequent derivation of the log-likelihoods.)
The likelihood for a single cluster $n$ (where $1 \leq n \leq K$) is given by \begin{align} L_n(\mu_n,\sigma_n^2 \mid x_i) &= \prod_{i \in n}\frac{1}{\sqrt{2\pi}\,\sigma_n^M}\exp\left(-\frac{\|x_i-\mu_n\|^2}{2\sigma_n^2}\right) \tag{2} \end{align}
where $i \in n$ are the indices only of the samples in the cluster $n$.
The log-likelihood for this is \begin{align} l_n(\mu_n,\sigma_n^2 \mid x_i) = -\frac{R_n}{2}\log(2\pi) -\frac{MR_n}{2}\log(\sigma_n^2) - \frac{1}{2\sigma_n^2}\sum_{i \in n}\|x_i-\mu_n\|^2 \tag{3} \end{align}
where, using Pelleg's notation, $R_n$ is the number of samples in cluster $n$.
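As a numerical sanity check of the step from (2) to (3) (a sketch with synthetic data; $M$, $R_n$, and the parameter values below are arbitrary choices of mine, not from the paper), the log of the product in (2) agrees with the closed form in (3):

```python
import numpy as np

rng = np.random.default_rng(42)
M, Rn = 3, 40                        # dimensions, samples in cluster n (arbitrary)
mu = rng.normal(size=M)              # cluster mean (arbitrary)
sigma2 = 1.7                         # cluster variance (arbitrary)
X = mu + np.sqrt(sigma2) * rng.normal(size=(Rn, M))

# Equation (2): the product of per-sample densities, evaluated in log space
sq = np.sum((X - mu) ** 2, axis=1)   # ||x_i - mu_n||^2 for each sample
log_p = np.log(1.0 / (np.sqrt(2 * np.pi) * sigma2 ** (M / 2))) - sq / (2 * sigma2)
l2 = log_p.sum()

# Equation (3): the closed form
l3 = -Rn / 2 * np.log(2 * np.pi) - M * Rn / 2 * np.log(sigma2) - sq.sum() / (2 * sigma2)
print(l2, l3)  # equal
```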
This is the easiest place to substitute our unbiased best estimate for $\sigma_n^2$. (Pelleg's paper makes an error in referring to this as the MLE of $\sigma^2$.) Using $\hat{\sigma}_n^2=\frac{1}{R_n-1}\sum_{i \in n}\|x_i-\mu_n\|^2$, then
\begin{align} \hat{l}_n(\mu_n,\hat{\sigma}_n^2 \mid x_i) &= -\frac{R_n}{2}\log(2\pi) -\frac{MR_n}{2}\log(\hat{\sigma}_n^2) - \frac{1}{2\hat{\sigma}_n^2}(R_n-1)\hat{\sigma}_n^2 \\ &= -\frac{R_n}{2}\log(2\pi) -\frac{MR_n}{2}\log(\hat{\sigma}_n^2) - \frac{R_n-1}{2} \tag{4} \end{align}
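The collapse of the last term in (4) can be verified numerically (again with synthetic data; the sizes are arbitrary). Plugging the unbiased $\hat{\sigma}_n^2$ into (3) makes the residual sum cancel down to $(R_n-1)/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, Rn = 2, 30                                 # arbitrary toy dimensions / sample count
X = rng.normal(size=(Rn, M))
mu_hat = X.mean(axis=0)
sq = np.sum((X - mu_hat) ** 2, axis=1)        # ||x_i - mu_n||^2
sigma2_hat = sq.sum() / (Rn - 1)              # the unbiased estimate used above

# Equation (3) with sigma2_hat plugged in ...
l3 = -Rn/2*np.log(2*np.pi) - M*Rn/2*np.log(sigma2_hat) - sq.sum()/(2*sigma2_hat)
# ... equals equation (4), where the last term has collapsed to (Rn - 1)/2
l4 = -Rn/2*np.log(2*np.pi) - M*Rn/2*np.log(sigma2_hat) - (Rn - 1)/2
print(l3, l4)  # equal
```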
The log-likelihood for K clusters becomes
\begin{align} \hat{l}(\mu_1,\ldots,\mu_K,\sigma_1^2,\ldots,\sigma_K^2 \mid x_i) &= \sum_{1 \leq n \leq K}\hat{l}_n \\ &= -\frac{\log(2\pi)}{2}\sum_{1 \leq n \leq K}R_n -\sum_{1 \leq n \leq K}\frac{MR_n}{2}\log(\hat{\sigma}_n^2) - \frac{1}{2}\sum_{1 \leq n \leq K}(R_n-1) \\ &= -\frac{R}{2}\log(2\pi) -\frac{M}{2}\sum_{1 \leq n \leq K}R_n\log(\hat{\sigma}_n^2)-\frac{R-K}{2} \tag{5} \end{align}
where $R=\sum_n R_n$ is the total number of samples.
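The step from (4) to (5) can also be checked numerically (a sketch; the cluster sizes and locations below are made up). Summing the per-cluster closed form (4) over $K$ clusters reproduces the combined form (5) with its $-(R-K)/2$ term:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 2, 3
sizes = [25, 40, 35]                     # R_n per cluster (arbitrary)
R = sum(sizes)

per_cluster_sum = 0.0
weighted_log_var = 0.0                   # accumulates R_n * log(sigma_n^2)
for Rn in sizes:
    # Synthetic cluster with a random center
    X = rng.normal(loc=rng.uniform(-5, 5, size=M), size=(Rn, M))
    mu_hat = X.mean(axis=0)
    sq = np.sum((X - mu_hat) ** 2, axis=1)
    s2 = sq.sum() / (Rn - 1)             # unbiased variance estimate
    # Per-cluster closed form, equation (4)
    per_cluster_sum += -Rn/2*np.log(2*np.pi) - M*Rn/2*np.log(s2) - (Rn - 1)/2
    weighted_log_var += Rn * np.log(s2)

# Combined closed form, equation (5)
l5 = -R/2*np.log(2*np.pi) - M/2*weighted_log_var - (R - K)/2
print(per_cluster_sum, l5)  # equal
```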
This somewhat resembles Pelleg's formula for $\hat{l}(D_n)$. I believe an oversight explains how the $K$ mysteriously shows up in this partial calculation in Pelleg's paper for just one cluster: $$ \hat{l}(D_n) = -\frac{R_n}{2} \log(2\pi) -\frac{R_n M}{2} \log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log(R_n) - R_n \log(R) $$ I explained earlier, at equation (1), why I think the last two terms should not be present.