Is there a way to identify the points coming from a specific normal distribution in a pool of points drawn from different normal distributions?


Suppose I have a pool of $n$ data points drawn from different normal distributions. I know that $m_1$ of the points were generated from $N(\mu_1,\sigma_1^2)$, $m_2$ from $N(\mu_2,\sigma_2^2)$, and so on ($\sum m_i=n$). However, I don't know which points are which, because the dataset is mixed. Is there a statistical technique that could help me identify all the points coming from $N(\mu_1,\sigma_1^2)$?


Consider the case of two components. Let $x_i$ be the $i^{th}$ data point, and let $z_i=1$ if $x_i$ came from $N(\mu_1,\sigma_1^2)$ and $z_i=0$ if it came from $N(\mu_0, \sigma_0^2)$, where the $z_i$ are unknown.

If we use a uniform prior over all valid assignments $z=(z_1,\dots,z_n)$, i.e. $p(z)\propto \mathbb{1}\{\sum_{i=1}^n z_i = m_1\}$, then we can find the MAP estimate by maximizing the posterior distribution over $z$:

\begin{align*} \log P(z\vert x) &=\log P(x\vert z)+\text{const}\\ &= -\sum_{i=1}^n \frac{1}{2\sigma_{z_i}^2}\left(x_i-\mu_{z_i}\right)^2- \frac{m_0}{2}\log 2\pi \sigma_{0}^2- \frac{m_1}{2}\log 2\pi \sigma_{1}^2+\text{const}\\ &=-\sum_{i=1}^n z_i\left(\frac{1}{2\sigma_{1}^2}\left(x_i-\mu_{1}\right)^2-\frac{1}{2\sigma_{0}^2}\left(x_i-\mu_{0}\right)^2\right)+C \end{align*}

subject to the constraint that $\sum_{i=1}^n z_i=m_1$. Here $m_0=n-m_1$, and $C$ collects every term that does not depend on $z$; the log-variance terms are among them because $m_0$ and $m_1$ are fixed.

To solve, compute $\Delta_i = \frac{1}{2\sigma_{1}^2}\left(x_i-\mu_{1}\right)^2-\frac{1}{2\sigma_{0}^2}\left(x_i-\mu_{0}\right)^2$ for each $x_i$ and sort the points by $\Delta_i$ in ascending order. Since the objective is $-\sum_i z_i\Delta_i + C$, it is maximized by assigning $z_i=1$ to the $m_1$ points with the smallest $\Delta_i$ and $z_i=0$ to the rest.
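
The sorting rule can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available and that the means, standard deviations and $m_1$ are known; the function name `map_assign` and the example parameters are made up for the demonstration. It labels as component 1 the $m_1$ points that score best for that component, which means the smallest $\Delta_i$, or equivalently the largest log-likelihood ratio $\log N(x_i;\mu_1,\sigma_1^2)-\log N(x_i;\mu_0,\sigma_0^2)$, since the two differ only by a sign and a constant.

```python
import numpy as np


def map_assign(x, m1, mu0, sigma0, mu1, sigma1):
    """Return a 0/1 array z with exactly m1 ones (hypothetical helper).

    z[i] = 1 marks x[i] as assigned to N(mu1, sigma1^2).
    """
    x = np.asarray(x, dtype=float)
    # Delta_i from the answer: small values favor component 1.
    delta = (x - mu1) ** 2 / (2 * sigma1 ** 2) - (x - mu0) ** 2 / (2 * sigma0 ** 2)
    # Indices sorted by ascending Delta_i; ties keep their original order.
    order = np.argsort(delta, kind="stable")
    z = np.zeros(x.size, dtype=int)
    z[order[:m1]] = 1  # the m1 smallest Delta_i go to component 1
    return z


# Illustrative data: 70 points from N(0, 1) and 30 points from N(4, 1).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 70), rng.normal(4.0, 1.0, 30)])
z_hat = map_assign(x, m1=30, mu0=0.0, sigma0=1.0, mu1=4.0, sigma1=1.0)
print(z_hat.sum())  # number of points labeled as component 1
```

When the two variances are equal, $\Delta_i$ is linear in $x_i$, so the rule reduces to a simple threshold on $x_i$. When they differ, $\Delta_i$ is quadratic, and the selected points can lie in a central band or in the two tails.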