Measure Theoretic rigorous formulation/definition of the Expectation Maximization algorithm (EM Algorithm)


$\newcommand{\cY}{\mathcal{Y}}\newcommand{\cF}{\mathcal{F}}\newcommand{\cX}{\mathcal{X}}\newcommand{\Rbb}{\mathbb{R}}\newcommand{\Pbb}{\mathbb{P}}\newcommand{\cB}{\mathcal{B}}\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}\newcommand{\vY}{\vect{Y}}\newcommand{\vy}{\vect{y}}\newcommand{\vX}{\vect{X}}\newcommand{\vx}{\vect{x}}\newcommand{\vtheta}{\vect{\theta}}\newcommand{\vz}{\vect{z}}\newcommand{\vZ}{\vect{Z}}$

Aim: Rigorous Formulation of EM Algorithm

I am trying to write a rigorous treatment of the Expectation-Maximization algorithm (EM algorithm), but I couldn't find one anywhere. I also tried reading the paper by Dempster, Laird, and Rubin that introduced it, but it doesn't really use any measure theory.

My Current Formulation for Missing Data Setting

Observed or Incomplete data

Let $(\cY, \cF_{\cY}, \Pbb_{\cY})$ be a probability space and let $(\Rbb,\cB(\Rbb))$ be a measurable space. Let $Y_1, \ldots, Y_n$ be $n$ random variables defined as follows: $$ Y_i:\cY\to\Rbb $$ We define a random vector $\vY:=(Y_1, \ldots, Y_n)^\top$ and we call incomplete or observed data any vector $$ \vy:=(y_1, \ldots, y_n)^\top $$ containing realizations of the $n$ random variables $Y_1, \ldots, Y_n$.

Missing Data

Similarly, we call missing data any vector $$ \vx:=(x_1, \ldots, x_m)^\top $$ of realizations of the $m$ random variables $\vX:=(X_1, \ldots, X_m)^\top$ defined on the same probability space $(\cX, \cF_{\cX}, \Pbb_{\cX})$ and taking values in $\Rbb$.

Problem: How to define a density function on concatenation of random variables from different probability spaces?

Now, what most people do next is to concatenate the missing and the observed data into a vector $\vz:=(\vx^\top, \vy^\top)^\top$, which is assumed to be the realization of a complete-data random vector $\vZ$, and then define $$ p_{\vY}(\vy; \vtheta) := \int_{\Rbb^m} p_{\vZ}(\vy, \vx; \vtheta)\, d\vx $$ How does this make any sense? Since $\vY$ and $\vX$ are defined on different probability spaces, surely we cannot simply concatenate the realizations $\vy$ and $\vx$ to obtain the realization of some new random vector $\vZ$. What is $\vZ$ actually made of? I would assume $$ \vZ:=(X_1, \ldots, X_m, Y_1, \ldots, Y_n)^\top $$ but then $m$ of the elements of $\vZ$ are defined on $(\cX, \cF_{\cX}, \Pbb_{\cX})$ and the remaining $n$ elements are defined on $(\cY, \cF_{\cY}, \Pbb_{\cY})$. Yet the Wikipedia page on random vectors says that all the components of $\vZ$ should be defined on the same probability space. (see here)
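A discrete analogue may make the intended marginalization concrete: the integral runs over the *state space* $\Rbb^m$ of the missing coordinates, not over the $\sigma$-algebra. A minimal sketch, with a made-up $3\times 2$ joint pmf (all numbers illustrative):

```python
import numpy as np

# Hypothetical joint pmf of (X, Y) on a 3x2 grid: rows index the missing
# coordinate x, columns index the observed coordinate y.
p_joint = np.array([[0.10, 0.20],
                    [0.15, 0.25],
                    [0.05, 0.25]])
assert np.isclose(p_joint.sum(), 1.0)

# Marginal of Y: sum the missing coordinate x out over its whole state
# space -- the discrete analogue of p_Y(y) = ∫_{R^m} p_Z(y, x) dx.
p_y = p_joint.sum(axis=0)
print(p_y)  # approximately [0.3 0.7]
```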

How do we actually formulate the EM algorithm in a formal way?

One answer below.

$\newcommand{\cY}{\mathcal{Y}}\newcommand{\cF}{\mathcal{F}}\newcommand{\cX}{\mathcal{X}}\newcommand{\Rbb}{\mathbb{R}}\newcommand{\Pbb}{\mathbb{P}}\newcommand{\cB}{\mathcal{B}}\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}\newcommand{\vY}{\vect{Y}}\newcommand{\vy}{\vect{y}}\newcommand{\vX}{\vect{X}}\newcommand{\vx}{\vect{x}}\newcommand{\vtheta}{\vect{\theta}}\newcommand{\vz}{\vect{z}}\newcommand{\vZ}{\vect{Z}}$ One possible way to write a complete, measure-theoretically rigorous formulation of the EM algorithm is the following.

EM Algorithm Formulation

Complete Data

Let $(\Omega, \cF, \Pbb)$ be a probability space and let $(\Rbb, \cB(\Rbb))$ be a measurable space. Suppose we have $n$ random variables $Z_1, \ldots, Z_n$ defined as $$ Z_i: \Omega \to \Rbb \qquad \forall\,\,i\in\{1, \ldots, n\} $$ We construct the random vector $\vZ:=(Z_1, \ldots, Z_n)^\top$ and we call any vector $\vz:=(z_1, \ldots, z_n)^\top$ of realizations of the random variables $Z_i$ the complete data.

Incomplete or Observed Data

Now let $m<n$ be the number of observed data points. We simply rename the first $m$ random variables in $\vZ$ as follows $$ Y_i := Z_i \qquad \text{for } i=1, \ldots, m $$ and construct the random vector $$ \vY:=(Y_1, \ldots, Y_m)^\top:=(Z_1, \ldots, Z_m)^\top $$ then we denote the realizations of such random variables as $$ \vy:=(y_1, \ldots, y_m)^\top:=(z_1, \ldots, z_m)^\top $$ and we call them observed data.

Missing Data

Similarly, suppose that we have $p:=n-m$ missing data points. Then we rename the remaining $n-m$ random variables in $\vZ$ as follows $$ X_{i-m}:= Z_i \qquad \text{for } i=m+1, \ldots, n $$ and we construct the random vector $$ \vX :=(X_1, \ldots, X_p)^\top := (Z_{m+1}, \ldots, Z_n)^\top $$ then we rename the realizations of such random variables $$ \vx := (x_1, \ldots, x_p)^\top := (z_{m+1}, \ldots, z_n)^\top $$ and we call them missing data.

Definition of complete-data cumulative distribution function

Following Wikipedia's definition of a probability distribution as a pushforward measure, the distribution of $Z_i$, denoted $\mu_{Z_i}$, is the pushforward probability measure $$ \mu_{Z_i}: \cB(\Rbb)\to [0, 1] $$ that satisfies $$ \mu_{Z_i}(B)= \Pbb(Z_i^{-1}(B)) = \Pbb(Z_i\in B) = \Pbb(\left\{\omega\in\Omega\,:\, Z_i(\omega) \in B\right\}) \qquad \forall \, B\in \cB(\Rbb) $$ That is, to any Borel set $B$ it assigns the same measure that $\Pbb$ assigns to the pre-image of $B$ under $Z_i$. In other words, it measures the probability that the data point $z_i$ takes a value in $B$. Next, we can define the cumulative distribution function as the function $$ F_{\mu_{Z_i}}: \Rbb\to [0,1] $$ given by $$ F_{\mu_{Z_i}}(z_i) := \mu_{Z_i}\left((-\infty, z_i]\right) =\Pbb\left(Z_i\in(-\infty, z_i]\right)=\Pbb\left(Z_i\leq z_i\right) $$ Then the joint probability distribution of the random vector $\vZ$ is the pushforward probability measure $$ \mu_{\vZ}:\cB(\Rbb^n)\to [0, 1] $$ which, on product sets $B:=B_1\times\ldots\times B_n\in\cB(\Rbb^n)$, satisfies $$ \begin{align} \mu_{\vZ}(B) :&= \Pbb\left(\left\{\omega\in\Omega\,:\, Z_1(\omega)\in B_1\right\}\cap\ldots\cap\left\{\omega\in\Omega\,:\, Z_n(\omega)\in B_n\right\}\right) \\ &= \Pbb\left(\bigcap_{i=1}^n \left\{\omega\in\Omega\,:\, Z_i(\omega)\in B_i\right\}\right) \end{align} $$ Note the intersection, not the union: all coordinates must land in their respective sets simultaneously. We can therefore define the joint cumulative distribution function $$ F_{\mu_{\vZ}} : \Rbb^n \to [0, 1] $$ as follows $$ \begin{align} F_{\mu_{\vZ}}(\vz) :&= \mu_{\vZ}\left((-\infty, z_1]\times \ldots\times (-\infty, z_n]\right) \\ &= \Pbb\left(\bigcap_{i=1}^n \left\{\omega\in\Omega\,:\,Z_i(\omega)\in(-\infty, z_i]\right\}\right) \\ &= \Pbb\left(\bigcap_{i=1}^n \left\{\omega\in\Omega\,:\, Z_i(\omega)\leq z_i\right\}\right) \\ &= \Pbb(Z_1\leq z_1, \ldots, Z_n\leq z_n) \end{align} $$
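A quick Monte Carlo sanity check that the joint CDF corresponds to the *intersection* of the coordinate events: for two independent standard normals, the empirical probability of $\{Z_1\le a\}\cap\{Z_2\le b\}$ matches $\Phi(a)\Phi(b)$, while the union gives a strictly larger number. A sketch with arbitrary illustrative constants and seed:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
a, b = 0.5, -0.3

# The joint CDF event {Z1 <= a, Z2 <= b} is the INTERSECTION of the
# coordinate events ...
cdf_and = np.mean((z1 <= a) & (z2 <= b))
# ... whereas the union of the coordinate events is a different,
# larger quantity.
cdf_or = np.mean((z1 <= a) | (z2 <= b))

# For independent coordinates the joint CDF factorizes as Phi(a) * Phi(b).
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
print(cdf_and, Phi(a) * Phi(b))  # agree up to Monte Carlo error
print(cdf_or)                    # strictly larger
```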

Definition of Complete-Data Density / Likelihood function

Given a $\sigma$-finite reference measure $$ \mu: \cB(\Rbb) \to [0,\infty] $$ (for instance the Lebesgue measure), a probability density of the random variable $Z_i$ with respect to $\mu$ is any measurable function $f$ $$ f: \Rbb\to [0, \infty) $$ such that $$ \mu_{Z_i}(B) = \int_B f \,d\mu \qquad \forall \, B\in\cB(\Rbb) $$ That is, a measurable function whose integral with respect to $\mu$ over any Borel set $B$ of possible realizations equals the value of the distribution $\mu_{Z_i}$ on $B$. (Such an $f$ is the Radon–Nikodym derivative $d\mu_{Z_i}/d\mu$; it exists whenever $\mu_{Z_i}$ is absolutely continuous with respect to $\mu$.)
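As a numerical sanity check of the defining identity $\mu_{Z_i}(B)=\int_B f\,d\mu$, a sketch for the standard normal with $\mu$ the Lebesgue measure: a midpoint-rule approximation of $\int_B f\,d\mu$ over $B=[a,b]$ is compared against the pushforward measure computed from the normal CDF (the interval endpoints and step count are illustrative):

```python
import math

# Standard normal density f and its CDF via erf; mu is Lebesgue measure,
# so the integral below is an ordinary Riemann integral.
f = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

a, b, steps = -1.0, 2.0, 100_000
h = (b - a) / steps
# Midpoint rule for ∫_B f dmu with B = [a, b].
integral = sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

mu_Z_B = Phi(b) - Phi(a)  # pushforward measure of B
print(integral, mu_Z_B)   # the two agree to many decimal places
```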

Similarly, given a $\sigma$-finite reference measure $\mu$ on $(\Rbb^n, \cB(\Rbb^n))$, we say that a measurable function $f$ $$ f: \Rbb^n \to [0, \infty) $$ is the joint density of $\vZ$ with respect to $\mu$ if $$ \mu_{\vZ}(B) = \int_{B} f \,d\mu \qquad \forall \, B\in\cB(\Rbb^n) $$ Now suppose that $f$ belongs to a family parametrized by a vector of parameters $\vtheta\in\Theta\subseteq \Rbb^d$. We write $$ f(\,\cdot\,; \vtheta): \Rbb^n\to[0, \infty) $$ and, viewed as a function of $\vtheta$ for fixed $\vz$, the map $\vtheta \mapsto f(\vz; \vtheta)$ is the complete-data likelihood.
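To make the parametrized family concrete, a toy sketch (model choice, data, and names are illustrative assumptions): for an i.i.d. $N(\mu, 1)$ model, the joint density viewed as a function of the parameter is the likelihood, and its logarithm is maximized at the sample mean:

```python
import math

# Hypothetical parametric family: i.i.d. N(mu, 1), so the complete-data
# log-likelihood is sum_i log phi(z_i - mu); here theta = mu.
def log_lik(z, mu):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (zi - mu) ** 2
               for zi in z)

z = [1.2, 0.7, 2.1, 1.5]          # toy complete-data realizations
mu_hat = sum(z) / len(z)          # MLE for this family: the sample mean
print(log_lik(z, mu_hat) >= log_lik(z, 0.0))  # True
```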

Writing the EM Algorithm in Full
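For a concrete instance of the resulting iteration, a minimal executable sketch of EM for a two-component Gaussian mixture with unit variances, where the missing data $\vX$ are the component labels (toy data, initial values, and function names are illustrative assumptions, not part of the formal development above):

```python
import math

def em_gmm(y, mu1, mu2, pi=0.5, iters=100):
    """EM for the mixture pi * N(mu1, 1) + (1 - pi) * N(mu2, 1)."""
    phi = lambda t, m: math.exp(-0.5 * (t - m) ** 2) / math.sqrt(2 * math.pi)
    for _ in range(iters):
        # E-step: responsibilities r_i = P(label = 1 | y_i; current params),
        # i.e. the conditional expectation of the missing label given y_i.
        r = [pi * phi(yi, mu1) /
             (pi * phi(yi, mu1) + (1 - pi) * phi(yi, mu2))
             for yi in y]
        # M-step: closed-form maximizers of the expected
        # complete-data log-likelihood.
        s1 = sum(r)
        s2 = sum(1 - ri for ri in r)
        mu1 = sum(ri * yi for ri, yi in zip(r, y)) / s1
        mu2 = sum((1 - ri) * yi for ri, yi in zip(r, y)) / s2
        pi = s1 / len(y)
    return mu1, mu2, pi

# Toy data: two well-separated clusters around -2.05 and 2.95.
y = [-2.1, -1.9, -2.0, -2.2, 3.0, 2.8, 3.1, 2.9]
mu1_hat, mu2_hat, pi_hat = em_gmm(y, mu1=-1.0, mu2=1.0)
print(mu1_hat, mu2_hat, pi_hat)  # roughly -2.05, 2.95, 0.5
```

The E-step computes the conditional expectation of the missing labels given the observed data and the current parameters; the M-step then maximizes the resulting expected complete-data log-likelihood, which is available in closed form for this family.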