Apologies if the title is vague; I couldn't find a better way to summarize my issue, so let me start with the general idea of the problem.
I have a simulation code that, given a vector of 3 parameters, produces in return a number of objects of different shapes and sizes. The main idea is to see whether, given an observation of many such objects together with statistics of their shapes, I can recover the corresponding parameter vector using Monte Carlo techniques. Because there are many objects, I average the shape statistic in bins of object size. Here is a concrete example with a list of observed objects, where each object has a size and X_shape is the statistic derived from its shape.
+---------+---+---+----+-----+-----+-----+-----+------+
| Size | 1 | 2 | 10 | 500 | 13 | 56 | 80 | 123 |
+---------+---+---+----+-----+-----+-----+-----+------+
| X_shape | 5 | 3 | 5 | 69 | 1 | 2 | 11 | 4 |
+---------+---+---+----+-----+-----+-----+-----+------+
The binned statistic then looks like:
+-------------+----------------+------------------+--------------+
| | 1 < Size <= 10 | 10 < Size <= 100 | 100 < Size |
+-------------+----------------+------------------+--------------+
| X_shape_bin | 3 | 4 | 31 |
+-------------+----------------+------------------+--------------+
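For reference, this binning step can be reproduced with `scipy.stats.binned_statistic` (a minimal sketch; the bin edges are my assumption, and note that SciPy's bins are closed on the left, `[low, high)`, whereas the table above uses `low < Size <= high`, so the boundary objects can land in a different bin):

```python
import numpy as np
from scipy.stats import binned_statistic

# Observed objects: one size and one shape statistic per object
size = np.array([1, 2, 10, 500, 13, 56, 80, 123])
x_shape = np.array([5, 3, 5, 69, 1, 2, 11, 4])

# Assumed bin edges; SciPy's bins are [low, high), unlike the table above
edges = [1, 10, 100, np.inf]

# Average X_shape in each size bin
x_shape_bin, _, _ = binned_statistic(size, x_shape,
                                     statistic="mean", bins=edges)
print(x_shape_bin)
```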
This is what goes into my likelihood, which assumes a Gaussian distribution:
$\log(L) = -\frac{1}{2} \sum_{i=1}^{N_{\rm bins}=3} \frac{\left({\rm X\_shape\_bin_{data}}(i) - {\rm X\_shape\_bin_{model}}(i)\right)^2}{\sigma(i)^2}$
where $\rm X\_shape\_bin_{data}$ is the statistic I observed, and $\rm X\_shape\_bin_{model}$ is the statistic from a model evaluated at the point in parameter space proposed by the Monte Carlo sampler.
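In code, that log-likelihood is simply (a minimal sketch; `sigma` is assumed to hold the per-bin uncertainties, and the numbers below are made up):

```python
import numpy as np

def log_likelihood(data, model, sigma):
    """Gaussian log-likelihood over the size bins (constant terms dropped)."""
    return -0.5 * np.sum(((data - model) / sigma) ** 2)

# Hypothetical binned statistics for the 3 size bins
data = np.array([3.0, 4.0, 31.0])
model = np.array([2.0, 5.0, 30.0])
sigma = np.array([1.0, 1.0, 1.0])
print(log_likelihood(data, model, sigma))  # -1.5
```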
This works well when both the data and the model have a measurement in each size bin, but it sometimes happens that the statistic extracted from a given model looks like this:
+-------------+----------------+------------------+--------------+
| | 1 < Size <= 10 | 10 < Size <= 100 | 100 < Size |
+-------------+----------------+------------------+--------------+
| X_shape_bin | - | - | 3 |
+-------------+----------------+------------------+--------------+
In that case, the likelihood is computed from the last bin only, and is therefore artificially high: with fewer terms in the sum, the log-likelihood is less negative, so a model that populates only a single bin ends up rewarded when it should actually be penalized.
Unfortunately there is no way I can predict the missing values, so I am looking for a way to penalize these cases.
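To make the problem concrete, here is a minimal sketch (with made-up numbers) of what happens when the empty bins are simply masked out of the sum: the incomplete model drops two negative terms and comes out with the higher log-likelihood.

```python
import numpy as np

def log_likelihood(data, model, sigma):
    """Gaussian log-likelihood, silently skipping bins where the model is NaN."""
    mask = ~np.isnan(model)
    return -0.5 * np.sum(((data[mask] - model[mask]) / sigma[mask]) ** 2)

data = np.array([3.0, 4.0, 31.0])
sigma = np.array([1.0, 1.0, 1.0])

full_model = np.array([2.0, 5.0, 30.0])           # populates all 3 bins
partial_model = np.array([np.nan, np.nan, 30.0])  # only the last bin

print(log_likelihood(data, full_model, sigma))     # -1.5
print(log_likelihood(data, partial_model, sigma))  # -0.5 -> ranked "better"
```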
I hope this is somewhat clear, and that you might know how to handle these cases.
Thanks in advance