Estimate mean and variance for a truncated sample set


Assume there is a normally distributed random variable $X \sim N(\mu, \sigma)$, and I want to estimate $\mu$ and $\sigma$. So far this is the standard setting.

Assume I am given a sample $(X_i)_{i=1}^N$ of $N$ values drawn randomly from this distribution. The challenge is that I can't observe the $X_i$ directly; I can only observe $(Y_i)_{i=1}^N$, where $$Y_i = \begin{cases} X_i, & \text{if } X_i > 50, \\ 0, & \text{otherwise}. \end{cases} $$
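To make the setup concrete, here is a minimal sketch of this censoring mechanism in Python; the parameter values $\mu = 60$, $\sigma = 30$, $N = 1000$ are arbitrary choices for illustration, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 60.0, 30.0, 1000    # hypothetical true parameters

x = rng.normal(mu, sigma, size=N)  # latent sample X_i (unobservable)
y = np.where(x > 50, x, 0.0)       # observed Y_i: values at or below 50 are replaced by 0
```

Every observed $Y_i$ is then either exactly $0$ or a value greater than $50$.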

Can I still form estimators for $\mu$ and $\sigma$? And how?



Best answer:

\begin{align} \Pr(Y=0) = \Phi\left( \frac{50-\mu} \sigma \right) = {} & \int_{-\infty}^{50} \varphi\left(\frac{z-\mu}\sigma\right) \, \frac{dz} \sigma = \int_{-\infty}^{(50-\mu)/\sigma} \varphi(z)\,dz \\[12pt] & \text{ where } \varphi(z) = \frac 1 {\sqrt{2\pi}} e^{-z^2/2}. \end{align}

For $y>50$, $$ \Pr(50 \le Y \le y) = \int_{50}^y \varphi\left( \frac{z-\mu}\sigma \right)\,\frac{dz} \sigma = \int_{(50-\mu)/\sigma}^{(y-\mu)/\sigma} \varphi(z)\,dz. $$

Let $\nu$ be the measure on Borel subsets of the line given by $$ \nu(A) = m(A) + \begin{cases} 1 & \text{if }0\in A, \\ 0 & \text{if }0\not\in A, \end{cases} $$ where $m$ is Lebesgue measure; that is, $\nu$ is Lebesgue measure plus a unit point mass at $0$. (I write $\nu$ rather than $\mu$ since $\mu$ is already in use as the mean.) Then $$ \Pr(Y_i \in A) = \int_A \left. \begin{cases} \frac 1 \sigma \varphi\left( \frac{z-\mu} \sigma \right) & \text{if }z > 50, \\[10pt] \Phi\left( \frac{50-\mu} \sigma \right) & \text{if }z=0, \\[10pt] 0 & \text{otherwise}. \end{cases} \right\} \, d\nu(z) \tag 1 $$ In other words, the piecewise function under the integral sign in $(1)$ is the density of $Y_i$ with respect to the measure $\nu$. The likelihood function given that $Y_i=y_i$ for $i=1,\ldots,n$ is therefore $$ L(\mu,\sigma) = \prod_{i=1}^n \left. \begin{cases} \frac 1 \sigma \varphi\left( \frac{y_i -\mu} \sigma \right) & \text{if }y_i > 50, \\[10pt] \Phi\left( \frac{50-\mu} \sigma \right) & \text{if }y_i=0. \end{cases} \right\} $$ The problem then is to find the values of $\mu$ and $\sigma$ that maximize $L(\mu,\sigma)$. I doubt there's a closed form.

Let $w$ be the number of $0$s observed, so $n-w$ is the number of observations $>50$. Then $$ L(\mu,\sigma) \propto \left( \Phi\left( \frac{50-\mu} \sigma \right) \right)^w \cdot \frac 1 {\sigma^{n-w}} \exp\left( \sum_{i\in\text{a certain set}} \frac{-1} 2 \left(\frac{y_i-\mu} \sigma\right)^2 \right). $$ The "certain set" is of course $\{ i : y_i > 50 \}$.

Let $$ \bar y = \frac{\left(\sum\limits_{i\in\text{the aforementioned set}} y_i\right)}{\left(\sum\limits_{i\in\text{the aforementioned set}} 1\right)} = (\text{the mean of all observations}>50). $$ Then \begin{align} \sum_{i\in\text{the aforementioned set }\mathcal I} (y_i-\mu)^2 & = \sum_{i\in\mathcal I} \Big( (y_i-\bar y) + (\bar y - \mu)\Big)^2 \\[10pt] & = (n-w)(\bar y - \mu)^2 + \sum_{i\in\mathcal I} (y_i-\bar y)^2 + \underbrace{2(\text{a sum of middle terms})}_{\ldots\,\text{but these add up to 0.}} \\[10pt] & = (n-w)(\bar y - \mu)^2 + (\text{something not depending on } \mu). \end{align} That last lack of dependence on $\mu$ means we can write $$ L(\mu,\sigma) \propto \left( \Phi\left( \frac{50-\mu} \sigma \right) \right)^w \cdot \frac 1 {\sigma^{n-w}} \exp\left( \frac{-(n-w)} 2 \left(\frac{\bar y -\mu} \sigma\right)^2 \right). $$
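Since no closed form is expected, the likelihood above can be maximized numerically. The following is a sketch in Python using `scipy.optimize.minimize` on the negative log-likelihood; the synthetic data with $\mu = 60$, $\sigma = 30$ are illustrative values, not anything specified in the question.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, y, cutoff=50.0):
    """Negative log of L(mu, sigma): Phi term for each censored (zero)
    observation, normal density for each observation above the cutoff."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    obs = y[y > cutoff]                              # uncensored observations
    w = np.sum(y == 0)                               # count of censored observations
    ll = w * norm.logcdf((cutoff - mu) / sigma)      # w * log Phi((50 - mu)/sigma)
    ll += np.sum(norm.logpdf(obs, loc=mu, scale=sigma))
    return -ll

# Synthetic data (hypothetical true values mu = 60, sigma = 30):
rng = np.random.default_rng(1)
x = rng.normal(60, 30, size=1000)
y = np.where(x > 50, x, 0.0)

# Start from the mean/std of the uncensored part and refine numerically.
res = minimize(neg_log_likelihood,
               x0=[np.mean(y[y > 0]), np.std(y[y > 0])],
               args=(y,), method="Nelder-Mead")
mu_hat, sigma_hat = res.x
```

Because the likelihood factors into a censored part and an uncensored part, the code just adds $w \log \Phi((50-\mu)/\sigma)$ to the usual sum of normal log-densities; any derivative-free optimizer will do here.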

Second answer:

Sufficient statistics should be $M = \operatorname{card}\{i: Y_i > 0\}$, $S = \sum_i Y_i$ and $T = \sum_i Y_i^2$, with likelihood function \begin{align} L(Y) &= {n \choose M} \Phi\left(\frac{50-\mu}\sigma\right)^{n-M} \prod_{i: Y_i > 0} \frac{\exp\left(-(Y_i-\mu)^2/(2\sigma^2)\right)}{\sqrt{2\pi \sigma^2}} \\ &= {n \choose M} \Phi\left(\frac{50-\mu}\sigma\right)^{n-M} e^{-(T-2\mu S + \mu^2 M)/(2\sigma^2)} (2\pi)^{-M/2} \sigma^{-M} \end{align} where $\Phi$ is the standard normal CDF. Thus for a maximum likelihood estimator we want to maximize $$ G(\mu, \sigma) = (n-M) \log\left(\Phi\left(\frac{50-\mu}{\sigma}\right)\right) - \frac{T - 2 \mu S + \mu^2 M}{2\sigma^2} - M \log(\sigma).$$

Now by taking derivatives with respect to $\mu$ and $\sigma$ and eliminating the $\Phi$ term, I get $$ \mu = \dfrac{M \sigma^2 + 50 S - T}{50 M - S}$$

But the remaining equation for $\sigma$ is rather a mess, and I don't expect a closed form.

I did try a simulation, using a sample of size $1000$ from the normal distribution with $\mu = 60$ and $\sigma = 30$. I obtained $M = 624$, $S = 47814.5251547458$, $T = 3.90701109400397 \times 10^6$. Solving $\partial G/\partial \mu = \partial G/\partial \sigma = 0$ with Maple's fsolve, the results were $\widehat{\mu} = 58.92574364, \widehat{\sigma} = 29.34273559$. So, at least sometimes, this does give reasonably good results.
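The same experiment can be reproduced numerically without Maple; this is a sketch in Python that maximizes $G$ directly via `scipy.optimize.minimize` rather than solving the score equations with fsolve. The seed and sample are new, so $M$, $S$, $T$ and the estimates will differ from the values quoted above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
n, cutoff = 1000, 50.0
x = rng.normal(60, 30, size=n)       # true values mu = 60, sigma = 30
y = np.where(x > cutoff, x, 0.0)

# Sufficient statistics from the answer
M = np.count_nonzero(y)              # number of observations above the cutoff
S = y.sum()                          # zeros contribute nothing to S and T
T = np.square(y).sum()

def neg_G(params):
    """Negative of G(mu, sigma) as defined in the answer."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -((n - M) * norm.logcdf((cutoff - mu) / sigma)
             - (T - 2 * mu * S + mu**2 * M) / (2 * sigma**2)
             - M * np.log(sigma))

res = minimize(neg_G, x0=[60.0, 30.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x
```

At the maximum, the closed-form relation $\widehat\mu = (M\widehat\sigma^2 + 50S - T)/(50M - S)$ derived above should hold up to the optimizer's tolerance, which gives a cheap consistency check on the fit.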