My problem, simplified, is as follows: I'm estimating a binary-choice probit model whose error term follows a mixture of two normal distributions. That is:
$y_i = \mathbf{1} \left( \beta_0 + \beta_1 x_i + \epsilon_i > 0 \right)$
Where $\epsilon_i \sim 0.3\, \mathcal{N}(-1,1) + 0.7\, \mathcal{N}(1.2,0.2)$. I naively followed what I would do for other, simpler models. That is, I let the negative log-likelihood be:
$$ \text{NLL} = -\sum_{i=1}^{n} \left[ y_i \log(P(y_i=1|x_i; \Theta)) + (1-y_i) \log(1-P(y_i=1|x_i; \Theta)) \right] $$
Where $ P(y_i=1|x_i; \Theta) = w \Phi\left(\frac{\beta_0 + \beta_1 x_i - \mu_1}{1}\right) + (1-w) \Phi\left(\frac{\beta_0 + \beta_1 x_i - \mu_2}{\sigma_2}\right)$. The parameters to estimate are then $\Theta = \{\beta_0, \beta_1, \mu_1, \mu_2, w, \sigma_1^2, \sigma_2^2\}$.
For the implementation, I simulated the data-generating process. Then, to estimate the model, I imposed the following restrictions (in the hope of achieving identification): 1) $\sigma_1 = 1$; 2) the weights must sum to 1, $w + (1-w) = 1$. With this, I show an implementation in MATLAB below.
Results differ widely from the true parameters. I'm something of a novice in these topics, and I have the feeling this might be an identification problem. I tried changing the initial values (even setting them to the true ones), but the estimates are still far off. Am I wrong in my scale and location normalizations? Is my approach incorrect?
Thanks!
```matlab
% Data Generating Process
x=randn(50000,1);
n=length(x);
mu1 = -1; sigma1 = 1; weight1 = 0.3;
mu2 = 1.2; sigma2 = 0.2; weight2 = 0.7;
beta_0=1;
beta_1=-1;
epsilon_mixture = zeros(n, 1);
% Generate the mixture error term
for i = 1:n
    if rand() < weight1
        epsilon_mixture(i) = mu1 + sigma1 * randn();
    else
        epsilon_mixture(i) = mu2 + sigma2 * randn();
    end
end
y = (beta_0 + beta_1*x + epsilon_mixture) > 0;
function nll = probit_mixture_log_likelihood(params, x, y)
    % Unpack parameters
    beta_0 = params(1);
    beta_1 = params(2);
    mu1 = params(3);         % mean of the first normal component
    mu2 = params(4);         % mean of the second normal component
    p = params(5);           % unconstrained auxiliary for the mixture weight
    sigma2_aux = params(6);  % unconstrained auxiliary for sigma2

    % Transformations enforcing the identifiability constraints
    w = exp(p) / (1 + exp(p));  % logistic transform ensures 0 < w < 1
    sigma2 = exp(sigma2_aux);   % exponential transform ensures sigma2 > 0

    % Linear predictor for all observations
    eta = beta_0 + beta_1 * x;

    % Component probabilities (sigma1 fixed at 1 for the scale normalization)
    prob1 = normcdf(eta, mu1, 1);
    prob2 = normcdf(eta, mu2, sigma2);

    % Mixture probability and vectorized negative log-likelihood
    prob = w * prob1 + (1 - w) * prob2;
    nll = -sum(y .* log(prob + 1e-15) + (1 - y) .* log(1 - prob + 1e-15));
end
```
And then the maximization is done by minimizing the negative log-likelihood:
```matlab
initialParams = [0, 0, 0, 0, log(0.3/0.7), 0]; % Adjusted initial guesses
% Options for the optimization algorithm
options = optimoptions('fminunc', 'Algorithm', 'quasi-newton', 'Display', 'iter');
% Minimize the negative log-likelihood function
[paramEstimates, fval, exitflag, output] = fminunc(@(params) probit_mixture_log_likelihood(params, x, y), initialParams, options);
% Unpack the estimated parameters
beta_0_est = paramEstimates(1);
beta_1_est = paramEstimates(2);
mu1_est = paramEstimates(3);
mu2_est = paramEstimates(4);
p_est = paramEstimates(5);
sigma2_aux_est = paramEstimates(6);
% Recover w and sigma2 from the auxiliary parameters
w_est = exp(p_est) / (1 + exp(p_est)); % Mixture weight for the first component, transformed back from log-odds
sigma2_est = exp(sigma2_aux_est); % Recover sigma2, ensuring it's positive
```
Your derivation and MLE code look good to me (I tried it out, and the code works, apart from the fact that you also need to specify $\beta_0$ and $\beta_1$). It is indeed an identification issue.
Under this model specification with $\beta_0=0$ and $\beta_1=1$, try plotting the histogram of the latent variable $Y_i = \beta_0+\beta_1 x_i+\epsilon_i$. You will notice that, although $\epsilon_i$ has a bimodal shape due to the mixture design, the latent variable $Y_i$ is unimodal: $Y_i$ is the convolution of $\mathcal{N}(0,1)$ and $\epsilon_i$, and the convolution smooths out the bimodal features. But your binary choice variable $y_i$ is generated from this latent $Y_i$, in which the bimodal feature (the one that identifies $\mu_1$ and $\mu_2$) is no longer evident. For the purpose of estimating $\mu_1$ and $\mu_2$, $x_i$ acts as Gaussian noise, and in this case the signal-to-noise ratio is simply too small for the model to be well identified.
You can fix this issue by drawing $x_i\sim\mathcal{N}(0,0.01)$; then, with good initial guesses, your code seems to work. You can also verify this by increasing the sample size of the simulation.
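The smoothing argument above can be checked without any simulation: adding an independent normal to a two-component normal mixture yields another normal mixture, with each component variance inflated by the normal's variance. Below is a small illustrative sketch (in Python rather than MATLAB, purely so it is self-contained) that counts the modes of each density on a grid, using the question's DGP values $\beta_0 = 1$, $\beta_1 = -1$ rather than the $\beta_0=0$, $\beta_1=1$ used above; the qualitative picture is the same:

```python
import math

def norm_pdf(z, mu, sd):
    # Density of N(mu, sd^2) at z
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_pdf(z, comps):
    # comps is a list of (weight, mean, sd) triples
    return sum(w * norm_pdf(z, mu, sd) for (w, mu, sd) in comps)

def count_modes(comps, lo=-5.0, hi=6.0, n=2201):
    # Count strict local maxima of the mixture density on a grid
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    d = [mixture_pdf(z, comps) for z in grid]
    return sum(1 for i in range(1, n - 1) if d[i - 1] < d[i] > d[i + 1])

# epsilon ~ 0.3 N(-1, 1^2) + 0.7 N(1.2, 0.2^2): clearly bimodal
eps = [(0.3, -1.0, 1.0), (0.7, 1.2, 0.2)]

# Latent index beta_0 + beta_1*x + epsilon with beta_0 = 1, beta_1 = -1:
# beta_1*x ~ N(0, 1), so each component mean shifts by 1 and each
# component variance grows by 1
latent = [(0.3, 0.0, math.sqrt(1.0 + 1.0)), (0.7, 2.2, math.sqrt(1.0 + 0.04))]

# Same latent index, but with x ~ N(0, 0.01) as suggested above
latent_small = [(0.3, 0.0, math.sqrt(0.01 + 1.0)),
                (0.7, 2.2, math.sqrt(0.01 + 0.04))]

print(count_modes(eps), count_modes(latent), count_modes(latent_small))  # -> 2 1 2
```

With $x_i \sim \mathcal{N}(0,1)$, the convolution collapses the latent density to a single mode, so the binary outcomes carry almost no information about $\mu_1$ and $\mu_2$; shrinking the variance of $x_i$ to $0.01$ restores the two modes.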