I want to derive the formulas for the soft EM algorithm for the following model $P[y_i | x_i, \pi_{1,\dots,m}, a_{1,\dots,m}] = \sum_{j=1}^m \pi_j \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(a_j^T x_i - y_i)^2}{2 \sigma^2}\right)$ over the data $(x_1,y_1),\dots,(x_n,y_n) \in \mathbb{R}^d \times \mathbb{R}$, hidden labels $z_1,\dots, z_n$, and parameters $\pi_{1,\dots,m} \in \mathbb{R}$, $a_{1,\dots,m} \in \mathbb{R}^d$, i.e. a mixture of linear regression models. The E step makes sense to me, but I'm not able to do the ML estimation for the parameter $a$. It is not clear to me how to bring the log-likelihood $$\log L((x_1, y_1),\dots,(x_n, y_n) \mid a) = \log \prod_{i=1}^n P[y_i | x_i, a] = \sum_{i=1}^n \log\left(\sum_{j=1}^m \gamma_j(x_i, y_i) \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(a_j^T x_i - y_i)^2}{2 \sigma^2}\right)\right),$$ where $$\gamma_j(x_i, y_i) = \frac{P[y_i|x_i,z_i=j]P[z_i=j]}{\sum_{k=1}^m P[y_i|x_i,z_i=k]P[z_i=k]},$$ into a form where one can optimize it as a weighted linear regression problem, as is described in this paper.
Soft-EM: E-step for fitting mixed linear regression model
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail), 2026-03-27.
To make the notation clearer, let's say that $i$ runs from $1$ to $I$ and $j$ runs from $1$ to $J$, so that $I$ is the number of datapoints and $J$ is the number of regression models.
The full likelihood function (with indices suppressed on the left-hand side) is $$ P\left(y, z\mid \mathbf x, \pi, \mathbf a \right) = \prod_{i=1}^I \left( \sum_{j=1}^J \mathbf 1_{z_{ij} = 1}\times \pi_{j} \times \mathcal N \left(y_i \mid \mathbf a_j^T \mathbf x_i, \sigma^2 \right) \right),$$ where $$ \mathcal N\left( y \mid \mu, \sigma^2 \right) := \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( - \frac{(y-\mu)^2}{2\sigma^2} \right)$$ and $$ \mathbf 1_{z= 1} := \begin{cases} 1 & {\rm if \ } z= 1 \\ 0 & {\rm if \ } z= 0\end{cases}.$$
Our goal is to find the values of $\pi$ and $\mathbf a$ that maximise the marginal likelihood, $$ P\left(y \mid \mathbf x, \pi, \mathbf a \right) = \sum_{z} P\left(y, z \mid \mathbf x, \pi, \mathbf a \right) . $$
To do this, we use the expectation-maximisation (EM) algorithm, an iterative procedure in which each iteration consists of an E-step and an M-step.
E-step. Given $(\pi^{\rm old}, \mathbf a^{\rm old})$ from the previous iteration, calculate the conditional distribution $$P\left(z \mid y, \mathbf x, \pi^{\rm old}, \mathbf a^{\rm old} \right).$$
This is the bit you know how to do. By Bayes' theorem, the answer is $$ \gamma_{ij} := P \left( z_{ij} = 1 \mid y, \mathbf x, \pi^{\rm old}, \mathbf a^{\rm old} \right) = \frac{\pi_j^{\rm old} \times \mathcal N \left( y_i \mid (\mathbf a_j^{\rm old})^T \mathbf x_i, \sigma^2 \right) }{ \sum_{j' = 1}^J \pi_{j'}^{\rm old} \times \mathcal N \left( y_i \mid (\mathbf a_{j'}^{\rm old})^T \mathbf x_i, \sigma^2 \right) }.$$
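As a sanity check, the E-step can be sketched in a few lines of NumPy. This is a hypothetical sketch, not code from the question or paper: the names `X`, `y`, `pi`, `A` and their shapes are my assumptions. Note that the constant $1/\sqrt{2\pi\sigma^2}$ cancels in the normalisation, so it is omitted.

```python
import numpy as np

def e_step(X, y, pi, A, sigma):
    """Compute responsibilities gamma[i, j] = P(z_ij = 1 | y_i, x_i, pi, A).

    X: (I, d) data matrix, y: (I,) targets,
    pi: (J,) mixture weights, A: (J, d) regression coefficients.
    """
    mu = X @ A.T                                    # mu[i, j] = a_j^T x_i
    # Unnormalised log-responsibilities: log(pi_j) + log N(y_i | mu_ij, sigma^2),
    # dropping the constant -log(sqrt(2 pi sigma^2)), which cancels below
    log_r = np.log(pi) - 0.5 * ((y[:, None] - mu) / sigma) ** 2
    # Normalise over j, using the max-subtraction trick for numerical stability
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)         # each row sums to 1
```

With well-separated components and a small $\sigma$, the rows of the returned matrix become nearly one-hot, which is the soft-EM analogue of hard cluster assignment.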
M-step. Find $(\pi^{\rm new}, \mathbf a^{\rm new})$, defined as the solutions to the maximisation problem, $$ \pi^{\rm new}, \mathbf a^{\rm new} = {\rm argmax}_{\pi, \mathbf a} \left( \mathbb E_{z \sim P\left(z \mid y, \mathbf x, \pi^{\rm old}, \mathbf a^{\rm old} \right)}\left[ \log \left( P\left(y, z \mid \mathbf x, \pi, \mathbf a \right) \right) \right] \right).$$ (These $(\pi^{\rm new}, \mathbf a^{\rm new})$ will be carried forward and become the $(\pi^{\rm old}, \mathbf a^{\rm old})$ for the E-step of the next iteration.)
We can express the expectation in terms of the $\gamma_{ij}$'s computed in the preceding E-step.
\begin{multline} \mathbb E_{z \sim P\left(z \mid y, \mathbf x, \pi^{\rm old}, \mathbf a^{\rm old} \right)}\left[ \log \left( P\left(y, z \mid \mathbf x, \pi, \mathbf a \right) \right) \right] \\ = \sum_{j=1}^J \left( \sum_{i=1}^I \gamma_{ij} \log \pi_j \right) + \sum_{j = 1}^J \left(\sum_{i=1}^I \gamma_{ij} \log \mathcal N \left(y_i \mid \mathbf a_j^T \mathbf x_i, \sigma^2 \right) \right).\end{multline}
(Notice how the $\gamma_{ij}$'s sit outside the logarithms in my expression!)
For each $j$, the optimal $\pi_j$ is given by $$ \pi_j^{\rm new} = \frac{\sum_{i=1}^I \gamma_{ij}}{\sum_{j' = 1}^J\sum_{i=1}^I \gamma_{ij'}} = \frac{1}{I} \sum_{i=1}^I \gamma_{ij},$$ where the second equality uses the fact that $\sum_{j'=1}^J \gamma_{ij'} = 1$ for each $i$.
For each $j$, the optimal $\mathbf a_j$ is given by \begin{align} \mathbf a_j^{\rm new} & = {\rm argmax}_{\mathbf a_j} \left( \sum_{i = 1}^I \gamma_{ij} \log \mathcal N \left( y_i \mid \mathbf a_j^T \mathbf x_i , \sigma^2 \right) \right) \\ &= {\rm argmin}_{\mathbf a_j} \left( \sum_{i = 1}^I \gamma_{ij} \left( y_i - \mathbf a_j^T \mathbf x_i \right)^2 \right).\end{align}
This is a standard weighted regression problem, where the $i$th datapoint is assigned a weighting of $\gamma_{ij}$. Its closed-form solution is $\mathbf a_j^{\rm new} = \left(X^T \Gamma_j X\right)^{-1} X^T \Gamma_j \mathbf y$, where $X$ is the $I \times d$ design matrix, $\mathbf y$ the vector of targets, and $\Gamma_j := {\rm diag}(\gamma_{1j}, \dots, \gamma_{Ij})$.
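The M-step updates above can be sketched as follows (again a hypothetical NumPy sketch; the names and shapes are my assumptions, matching nothing in the original):

```python
import numpy as np

def m_step(X, y, gamma):
    """M-step: update mixture weights pi and regression coefficients A.

    X: (I, d) data matrix, y: (I,) targets,
    gamma: (I, J) responsibilities from the E-step (rows sum to 1).
    """
    I, J = gamma.shape
    # pi_j^new = (1/I) * sum_i gamma_ij, since each row of gamma sums to 1
    pi = gamma.sum(axis=0) / I
    A = np.empty((J, X.shape[1]))
    for j in range(J):
        # Weighted least squares: solve (X^T W_j X) a_j = X^T W_j y,
        # where W_j = diag(gamma[:, j])
        Xw = X * gamma[:, j][:, None]       # this is W_j X
        A[j] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return pi, A
```

In practice one would guard against a near-singular $X^T \Gamma_j X$ (e.g. when a component's responsibilities all collapse to zero) by adding a small ridge term or using a least-squares solver.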
It can be proved that $$ P\left( y \mid \mathbf x, \pi^{\rm new}, \mathbf a^{\rm new} \right) \geq P\left( y \mid \mathbf x, \pi^{\rm old}, \mathbf a^{\rm old} \right).$$ (The proof is by no means obvious, but it's a standard proof that applies to all examples of expectation-maximisation.)
Hence if you follow this iterative procedure for many steps, you generate a sequence, $$ (\pi^{\rm old}, \mathbf a^{\rm old}), \ \ (\pi^{\rm new}, \mathbf a^{\rm new}), \ \ (\pi^{\rm newer}, \mathbf a^{\rm newer}), \ \ (\pi^{\rm even \ newer}, \mathbf a^{\rm even \ newer}), \ \ \dots$$ along which the marginal likelihood $P\left( y \mid \mathbf x, \pi, \mathbf a \right)$ never decreases. This is why the iterative procedure converges to a stationary point (in practice, usually a local maximum) of the marginal likelihood $P\left( y \mid \mathbf x, \pi, \mathbf a \right)$. As with all EM algorithms, there is no guarantee of reaching the global maximum, so it is common to rerun from several initialisations and keep the best result.
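Putting the two steps together, here is a self-contained sketch of the full iteration on synthetic data (everything here, data and names alike, is illustrative and assumed by me, not taken from the question or paper). Tracking the marginal log-likelihood in `lls` lets you observe the monotone increase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a mixture of J = 2 linear models in d = 2 dimensions
I, d, J, sigma = 200, 2, 2, 0.1
X = rng.normal(size=(I, d))
true_A = np.array([[2.0, -1.0], [-1.5, 0.5]])
z = rng.integers(0, J, size=I)                      # true hidden labels
y = np.einsum('ij,ij->i', X, true_A[z]) + sigma * rng.normal(size=I)

def marginal_log_lik(pi, A):
    # log P(y | x, pi, a), computed with the log-sum-exp trick
    lc = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma**2)
          - 0.5 * ((y[:, None] - X @ A.T) / sigma) ** 2)
    m = lc.max(axis=1)
    return float((m + np.log(np.exp(lc - m[:, None]).sum(axis=1))).sum())

pi, A = np.full(J, 1.0 / J), rng.normal(size=(J, d))
lls = []
for _ in range(50):
    # E-step: responsibilities (the 1/sqrt(2 pi sigma^2) factor cancels)
    lg = np.log(pi) - 0.5 * ((y[:, None] - X @ A.T) / sigma) ** 2
    lg -= lg.max(axis=1, keepdims=True)
    gamma = np.exp(lg)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: mixture weights, then one weighted least squares per component
    pi = gamma.sum(axis=0) / I
    for j in range(J):
        Xw = X * gamma[:, j][:, None]
        # small ridge term guards against a collapsed component
        A[j] = np.linalg.solve(Xw.T @ X + 1e-9 * np.eye(d), Xw.T @ y)
    lls.append(marginal_log_lik(pi, A))
```

Because the M-step here is only the approximate argmax (due to the tiny ridge term), the monotonicity holds up to a negligible numerical tolerance rather than exactly.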