This question came up while I was doing a manual black-level calibration for an image sensor. To do this, you cover the sensor so that no light hits it and take lots of raw images. Averaging the images gives the black level for each pixel. Out of curiosity I plotted the histogram of one pixel, and it looked like this: histogram (the orange line is the mean).
I wondered if there are better ways to calculate the expectation if the PDF is known. So I came up with the following example.
Let's say we have a random variable. The PDF of the random variable has two equal peaks spaced apart by the distance $d$. The expectation of the RV is obviously in the middle between the two peaks. image of the pdf (please excuse my poor drawing skills).
A number of samples are taken from the RV. The expectation can of course be estimated by calculating the mean of the samples. But since the pdf is known, we can do much better.
First we look at the distances between the samples and split them into two groups, one around each peak. We add $\frac{d}{2}$ to the lower group and subtract $\frac{d}{2}$ from the upper group, then calculate the mean of the shifted samples. The resulting estimate of the expectation is much better than the plain mean.
Of course it is possible that only one group arises. In this case we calculate the mean of the group and add $\frac{d}{2}$. There is a $50\%$ chance that we hit the lower group and get it right.
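The two steps above can be sketched in code. This is a minimal illustration in Python/NumPy rather than the original MATLAB, and the function name and the midpoint-split rule for separating the groups are my own assumptions (any gap-based clustering would do, assuming the peaks are narrow compared to $d$):

```python
import numpy as np

def shifted_mean(samples, d):
    """Estimate the expectation of a two-peak RV whose peaks are d apart.

    Sketch of the grouping idea from the post; assumes the peaks are
    narrow relative to d, so the groups are separated by a clear gap.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.max() - samples.min() > d / 2:
        # Two groups present: split at the midpoint of the sample range,
        # shift the lower group up and the upper group down by d/2.
        threshold = (samples.min() + samples.max()) / 2
        shifted = np.where(samples < threshold,
                           samples + d / 2,
                           samples - d / 2)
    else:
        # Only one group: shift it up by d/2, which is correct whenever
        # the samples all came from the lower peak (a 50% chance).
        shifted = samples + d / 2
    return shifted.mean()
```

After the shift, every sample sits near the true expectation, so the averaging step only has to fight the width of the individual peaks, not the spread $d$ between them.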
I simulated this in MATLAB for 1 to 200 samples taken from the RV. For each number of samples and each method, the results were averaged over 1000 runs. The results clearly show that this method converges much faster than the mean, and that far fewer samples are needed to achieve the same quality of estimate (diagram).
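A small version of that experiment can be reproduced as follows. This is a hedged sketch in Python/NumPy (the original used MATLAB), with illustrative parameter values ($d=1$, $\sigma=0.05$, 20 samples, 1000 runs) that are not taken from the post:

```python
import numpy as np

def shifted_mean(x, d):
    # Two-group shift estimator (midpoint split is an assumption).
    if x.max() - x.min() > d / 2:
        t = (x.min() + x.max()) / 2
        return np.where(x < t, x + d / 2, x - d / 2).mean()
    return x.mean() + d / 2  # single group: assume it was the lower one

rng = np.random.default_rng(1)
d, mu, sigma, n, trials = 1.0, 0.0, 0.05, 20, 1000
err_mean, err_shift = [], []
for _ in range(trials):
    # Draw n samples from the two-peak mixture: pick a peak, add noise.
    x = mu + (rng.integers(0, 2, n) - 0.5) * d + rng.normal(0, sigma, n)
    err_mean.append((x.mean() - mu) ** 2)
    err_shift.append((shifted_mean(x, d) - mu) ** 2)

rmse_mean = np.sqrt(np.mean(err_mean))
rmse_shift = np.sqrt(np.mean(err_shift))
```

With narrow peaks, the plain mean's error is dominated by the $\pm\frac{d}{2}$ spread between the peaks, while the shifted estimator's error scales only with the peak width $\sigma$, so its RMSE comes out far smaller.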
The question is now:
- Is there a general way to do this for an RV with an arbitrary pdf?
- Is there an even better way to estimate the expectation?
Maximum likelihood estimation
I don't know why I didn't think of this in the first place, but you can use maximum likelihood estimation to estimate a parameter of a pdf. For the example stated here we have the random variable $X$ and, assuming the peaks are normally distributed and $d=1$ (so the peaks sit at $\mu \pm 0.5$), the pdf
$ f(x) = \frac{1}{2\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu-0.5}{\sigma}\right)^2} + \frac{1}{2\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu+0.5}{\sigma}\right)^2}$.
The likelihood formula looks like this:
$L(\vartheta)=\prod_{i=1}^{n} f_{\vartheta}(x_i)$
with $\vartheta$ being the unknown parameter, $n$ being the number of samples and $x_i$ being the samples taken from the RV. In this case the expectation $\mu$ is the unknown parameter. Inserting the pdf into the likelihood formula gives:
$L(\mu)=\prod_{i=1}^{n}\frac{1}{2\sqrt{2\pi}\sigma}\left[e^{-\frac{1}{2}\left(\frac{x_i-\mu-0.5}{\sigma}\right)^2}+e^{-\frac{1}{2}\left(\frac{x_i-\mu+0.5}{\sigma}\right)^2}\right]$
The resulting function is the likelihood of a given $\mu$ for the given samples ${x_1,x_2,...,x_n}$. To get the 'best' $\mu$ we need to maximize this function. Normally one would take the derivative and set it to zero, but that is not easy here, since the equation has multiple solutions depending on the samples. Using numerical methods, I compared this to my handcrafted method from the question, and it performs even better.
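One simple numerical approach is to maximize the log-likelihood over a grid of candidate $\mu$ values. The sketch below is in Python/NumPy rather than the original MATLAB, drops the constant factor $\frac{1}{2\sqrt{2\pi}\sigma}$ (it does not affect the argmax), and assumes $\sigma$ is known; the function name and grid settings are illustrative:

```python
import numpy as np

def mle_mu(samples, d, sigma):
    """Grid-search MLE of mu for the two-Gaussian mixture pdf."""
    x = np.asarray(samples, dtype=float)
    # Candidate values for mu around the sample mean.
    mus = np.linspace(x.mean() - d, x.mean() + d, 2001)
    # Standardized distances to each peak, for every (mu, sample) pair.
    z_lo = (x[None, :] - mus[:, None] + d / 2) / sigma  # lower peak at mu - d/2
    z_hi = (x[None, :] - mus[:, None] - d / 2) / sigma  # upper peak at mu + d/2
    # Log of the mixture density, summed over samples (constants dropped);
    # logaddexp avoids underflow when both exponents are very negative.
    loglik = np.logaddexp(-0.5 * z_lo**2, -0.5 * z_hi**2).sum(axis=1)
    return mus[np.argmax(loglik)]
```

Working with the log-likelihood turns the product over samples into a sum, which is numerically far better behaved than multiplying many small densities.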
TL;DR: If the samples and the pdf of a random variable are known, use maximum likelihood estimation to get a much better estimate of the expectation than the plain mean.