Maximum likelihood using binned data/model, missing measurements in some bins?

My apologies for the rather vague title; I could not find a better way to summarize my issue. I'll start with the overall idea of my problem.

I have simulation software that, given a vector of 3 parameters, returns a number of objects with different shapes and sizes. The main idea is to see whether, given an observed set of objects with some statistic describing their shape, I can recover the corresponding parameter vector using Monte Carlo techniques. Because I have many objects, I average the shape statistic within bins of object size. Here is a concrete example with a list of observed objects, where each object has a size and X_shape is the statistic derived from its shape.

+---------+---+---+----+-----+-----+-----+-----+------+
|  Size   | 1 | 2 | 10 | 500 |  13 |  56 |  80 |  123 |
+---------+---+---+----+-----+-----+-----+-----+------+
| X_shape | 5 | 3 |  5 |  69 |   1 |   2 |  11 |   4  |
+---------+---+---+----+-----+-----+-----+-----+------+

The binned statistic then looks like:

+-------------+----------------+------------------+--------------+
|             | 1 < Size <= 10 | 10 < Size <= 100 |  100 < Size  |
+-------------+----------------+------------------+--------------+
| X_shape_bin |              3 |                4 |          31  |
+-------------+----------------+------------------+--------------+

This is what I put into my likelihood, which assumes a Gaussian distribution:

$\log L = -\frac{1}{2} \sum_{i=1}^{N_{\rm bins}\,=\,3} \frac{\left({\rm X\_shape\_bin_{data}}(i) - {\rm X\_shape\_bin_{model}}(i)\right)^2}{\sigma(i)^2}$

where ${\rm X\_shape\_bin_{data}}$ is the statistic I observed, and ${\rm X\_shape\_bin_{model}}$ is the statistic from the model evaluated at the point in parameter space drawn by the Monte Carlo sampler.
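For concreteness, here is a minimal Python sketch of that Gaussian log-likelihood. The function name `gaussian_log_likelihood`, the model values and the per-bin uncertainties `sigma` are made up for illustration; only the binned data values come from the table above.

```python
import numpy as np

def gaussian_log_likelihood(data, model, sigma):
    """Return -0.5 * sum(((data - model) / sigma)**2) over all bins."""
    data = np.asarray(data, dtype=float)
    model = np.asarray(model, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    residuals = (data - model) / sigma
    return -0.5 * np.sum(residuals ** 2)

# Binned data from the table above.
x_data = [3.0, 4.0, 31.0]
# Hypothetical model output and per-bin uncertainties (assumed values).
x_model = [4.0, 4.0, 29.0]
sigma = [1.0, 1.0, 2.0]

log_l = gaussian_log_likelihood(x_data, x_model, sigma)
print(log_l)
```

Every bin contributes one non-positive term, so the sum can only decrease as more bins are included.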

This works well when both the data and the model have a measurement in each size bin, but sometimes the statistic extracted from a given model looks like this:

+-------------+----------------+------------------+--------------+
|             | 1 < Size <= 10 | 10 < Size <= 100 |  100 < Size  |
+-------------+----------------+------------------+--------------+
| X_shape_bin |              - |                - |          3   |
+-------------+----------------+------------------+--------------+

In that case, the likelihood is computed from the last bin only and is therefore artificially high, whereas this model should actually be penalized because only a single bin contributes.
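The effect can be reproduced with a small Python sketch. It assumes that empty model bins are encoded as NaN and that the sum simply skips them; the helper `log_likelihood_valid_bins`, the model values and `sigma` are all hypothetical, chosen only to show the artifact.

```python
import numpy as np

def log_likelihood_valid_bins(data, model, sigma):
    """Sum the Gaussian terms only over bins where the model is not NaN.

    Returns the log-likelihood and the number of bins that contributed.
    """
    data, model, sigma = (np.asarray(a, dtype=float) for a in (data, model, sigma))
    valid = ~np.isnan(model)  # assumption: empty model bins are stored as NaN
    residuals = (data[valid] - model[valid]) / sigma[valid]
    return -0.5 * np.sum(residuals ** 2), int(valid.sum())

x_data = [3.0, 4.0, 31.0]              # binned data from the first binned table
sigma = [1.0, 1.0, 2.0]                # hypothetical per-bin uncertainties

full_model = [5.0, 6.0, 29.0]          # hypothetical model filling all three bins
sparse_model = [np.nan, np.nan, 29.0]  # hypothetical model filling only the last bin

log_l_full, n_full = log_likelihood_valid_bins(x_data, full_model, sigma)
log_l_sparse, n_sparse = log_likelihood_valid_bins(x_data, sparse_model, sigma)

print(log_l_full, n_full)      # three bins contribute
print(log_l_sparse, n_sparse)  # one bin contributes, and the log-likelihood is higher
```

Both models agree equally well with the data in the last bin, yet the sparse one scores higher simply because its two empty bins add nothing to the sum.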

Unfortunately, there is no way I can predict the missing data, so I am looking for a way to penalize these cases.

I hope this is reasonably clear, and that you might know how to handle these cases.

Thanks in advance