Maximum likelihood estimate of a Gaussian given rounded values


Suppose there is a hidden Gaussian with mean $\mu$ and variance $\sigma^2$, and that $X_i \sim \mathcal{N}(\mu,\sigma^2)$ where the $X_i$ are i.i.d. If I can only observe the rounded value of $X_i$, i.e. $Y_i = \lfloor X_i + 1/2 \rfloor$, is there an effective means of computing the maximum likelihood estimates of $\mu$ and $\sigma$?


Often rounding doesn't make much difference in the sample mean and variance used to estimate the population mean $\mu$ and variance $\sigma^2.$ Rounding conventions such as "round half to even" (a 5 in the third place rounds to give an even digit in the second when rounding to two places) are designed to reduce bias in the mean.
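As an aside, this "round half to even" rule is also what Python's built-in `round` uses, so its bias-reducing effect on exact halves is easy to check (a small illustration in Python, separate from the R analysis below):

```python
# Round-half-to-even ("banker's rounding"): exact halves go to the
# nearest even integer, so upward and downward roundings balance
# and the rounding errors at halves cancel on average.
halves = [0.5, 1.5, 2.5, 3.5]
rounded = [round(h) for h in halves]
errors = [r - h for r, h in zip(rounded, halves)]
print(rounded)       # [0, 2, 2, 4] -- alternately down and up
print(sum(errors))   # 0.0 -- the half-rounding errors cancel
```

Under always-round-half-up (the $\lfloor x + 1/2 \rfloor$ rule in the question), every error at an exact half would be $+0.5$, giving a small upward bias instead.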

Here are $n = 150$ observations sampled from $\mathsf{Norm}(\mu = 100, \sigma=15),$ along with their sample mean, variance and SD.

x = rnorm(150, 100, 15)
mean(x);  var(x);  sd(x)
## 101.3501
## 209.5259
## 14.47501

Upon rounding to the nearest integer, we find that the sample mean, variance and SD have not changed much. Rounding decreased the value in 71 of the 150 observations and increased it in the remaining 79.

rx = round(x)
mean(rx);  var(rx);  sd(rx)
## 101.3467
## 209.6643
## 14.47979
sum(rx < x)
## 71

Trouble can arise if rounding is unreasonably severe. Usually, practitioners are smart enough not to round so severely that important information is lost.

Here is a case in which rounding is arguably too severe. There is a noticeable change in the sample mean, variance, and SD, but perhaps not so severe as to ruin analysis. (Rounding to one decimal place would probably have been OK.)

y = rnorm(30, 10, 2)
mean(y);  var(y);  sd(y)
## 10.47787
## 2.773481
## 1.665377
ry = round(y)
mean(ry);  var(ry);  sd(ry)
## 10.53333
## 2.74023
## 1.655364

Here are the first six observations, before and after (severe) rounding:

head(y)
## 12.501007 10.766729 11.313505  9.660635  9.443169 10.734076
head(ry)
## 13 11 11 10  9 11

Here are 95% t confidence intervals for $\mu$, from t.test. Original data:

t.test(y)
## 95 percent confidence interval:
##  9.856012 11.099736

Rounded data:

t.test(ry)
## 95 percent confidence interval:
##  9.91521 11.15146

However, rounding paired data can lead to disaster. If the differences are relatively small compared with the size of the observations, then rounding before taking differences can "round away" the true difference.
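A tiny numeric illustration (invented values, plain Python for brevity) of how rounding each measurement first can "round away" a consistent paired difference:

```python
# Hypothetical paired measurements: a consistent +0.2 treatment
# effect riding on large baseline values.
before = [100.1, 101.2, 99.1, 98.6]
after  = [100.3, 101.4, 99.3, 98.8]

# Differencing first preserves the effect ...
diffs = [b - a for a, b in zip(before, after)]
# ... but rounding each measurement to an integer first erases it,
# because each pair rounds to the same integer.
rounded_diffs = [round(b) - round(a) for a, b in zip(before, after)]
print(diffs)           # each difference is about +0.2
print(rounded_diffs)   # [0, 0, 0, 0] -- the effect is gone
```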

Perhaps most serious is that carelessly rounded data often produce ties that spoil the distribution theory of rank-based nonparametric tests (Wilcoxon, Kruskal-Wallis, Friedman). Even then, one can sometimes get accurate P-values from a simulated permutation test that serves the same purpose as the rank-based test. However, this is not usually an issue with data known to be normal.
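Such a simulated permutation test is straightforward to sketch. The example below (stdlib Python; the function name and the tie-heavy data are invented for illustration) compares two group means by repeatedly shuffling the pooled observations, a procedure that is untroubled by ties because no ranks are involved:

```python
import random

def perm_test_mean_diff(x, y, reps=5000, seed=1):
    """Two-sided permutation test for a difference in means.
    Unlike rank-based tests, heavy ties from rounding cause no trouble."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        d = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(y))
        if d >= observed:
            hits += 1
    return hits / reps   # estimated P-value

# Hypothetical rounded samples with many ties:
a = [10, 11, 11, 12, 10, 11, 13, 12]
b = [12, 13, 12, 14, 13, 12, 11, 13]
print(perm_test_mean_diff(a, b))   # small P-value: the means differ
```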

Finally, it is worth noting that in practice all continuous data must be rounded to some number of decimal places.
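Returning to the original question: the exact MLE treats each rounded value as interval-censored, since $Y_i = y$ means $X_i \in [y - 1/2,\, y + 1/2)$. The log-likelihood is $\sum_i \log\left\{\Phi\!\left(\frac{y_i + 1/2 - \mu}{\sigma}\right) - \Phi\!\left(\frac{y_i - 1/2 - \mu}{\sigma}\right)\right\}$, which can be maximized numerically (in R one would hand this to optim). Below is a self-contained Python sketch; the crude shrinking-grid optimizer and all function names are my own, chosen only to avoid dependencies:

```python
import math, random

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def neg_log_lik(mu, sigma, y):
    """Interval-censored negative log-likelihood:
    observing Y = y means X fell in [y - 1/2, y + 1/2)."""
    ll = 0.0
    for yi in y:
        p = phi((yi + 0.5 - mu) / sigma) - phi((yi - 0.5 - mu) / sigma)
        ll += math.log(max(p, 1e-300))   # guard against log(0)
    return -ll

def mle_rounded(y, iters=60):
    """Crude but serviceable optimizer: repeatedly refine a shrinking
    grid around the current best (mu, sigma), starting from the naive
    sample estimates. A real analysis would use optim() or similar."""
    mu = sum(y) / len(y)
    var = sum((v - mu) ** 2 for v in y) / len(y)
    sigma = max(math.sqrt(var), 0.5)
    step = 1.0
    for _ in range(iters):
        best = (neg_log_lik(mu, sigma, y), mu, sigma)
        for dm in (-step, 0.0, step):
            for ds in (-step, 0.0, step):
                s = sigma + ds
                if s <= 1e-6:
                    continue
                val = neg_log_lik(mu + dm, s, y)
                if val < best[0]:
                    best = (val, mu + dm, s)
        _, mu, sigma = best
        step *= 0.8
    return mu, sigma

# Simulated check: 500 rounded draws from Norm(mu = 100, sigma = 15).
random.seed(7)
y = [round(random.gauss(100.0, 15.0)) for _ in range(500)]
mu_hat, sigma_hat = mle_rounded(y)
print(mu_hat, sigma_hat)   # close to the true mu = 100 and sigma = 15
```

With integer rounding this mild relative to $\sigma$, the censored-likelihood MLE and the naive sample estimates agree closely, which is consistent with the point of the answer above.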


Addendum (consequences of absurdly severe rounding, per comments):

Suppose samples of size $n = 5$ are taken from $\mathsf{Norm}(\mu = 1, \sigma=.5).$ Below we plot sample SDs $S$ against sample means $A =\bar X$ for $m = 50,000$ such samples. At left, the plot illustrates that sample means and SDs for unrounded data are independent; at right, the sample means and SDs of observations rounded to integers (thus no longer normal) are not independent (even if nearly uncorrelated). [At right, there are only 12 distinct values of the sample mean and 13 of the sample SD, so there are many ties, massively overplotted in the figure.]

[Figure: sample SD $S$ vs. sample mean $A$ for the original data (left) and the severely rounded data (right).]

R code, in case it's of interest:

m = 50000; n = 5
x = rnorm(m*n, 1, .5);  MAT = matrix(x, nrow=m)
a = rowMeans(MAT);  s = apply(MAT, 1, sd)
ar = rowMeans(round(MAT));  sr = apply(round(MAT), 1, sd)
par(mfrow=c(1,2))
  plot(a,s, pch=".", main="Original Data")
  plot(ar, sr, pch=20, main="Severely Rounded")
par(mfrow=c(1,1))