Methods to estimate a probability distribution from truncated data?

I have a large set of values $t = \{t_i\}_{i=1}^N$. In principle, these values (in some set of units) can range between $0$ and an unknown cutoff of the order of $10^7$. They come from a numerical simulation which I have to downsample because of memory limits, so in the course of the simulation I have dropped all $t_i<5.0$.

I would like to calculate the probability $P(t>T)$, i.e. the complementary cumulative distribution. When I count the number of $t_i$ greater than $T$ and plot it versus $T$, I get a nice-looking truncated power-law type curve for the counts $N(t>T)$ as a function of $T$.

However, I cannot simply write $P(t>T) = N(t>T)/N$, because I discarded very many values with $t<5.0$, and I should really be normalizing by the total number of values, including those I discarded, rather than by the size of my downsampled data.

That is, the largest value of $P(t>T)$ should happen at $T=0$, and not at $T=5.0$, which is where it would occur if I did it this way.

How can I handle a truncated dataset of this form? I need to calculate a histogram using the frequency of occurrence of values, but I have no means to normalize the counts, because I don't know how many values should actually exist if I hadn't truncated the data.

Any help is appreciated! Thanks

There are 3 best solutions below

2
On

Why did you choose $5.0$? You are clearly losing critical information, and the data set you end up with is not a good sample. If you must reduce the sample size, don't do it by truncating at an arbitrary value; instead keep a random subsample that is small enough, if possible. That way you can hope to get a representative sample, which you evidently do not have now.
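One way to keep a uniform random subsample under a fixed memory budget, without knowing the total number of values in advance, is reservoir sampling. This is my suggestion rather than something stated in the answer; the function name, the capacity and the toy stream below are purely illustrative.

```python
import random


def reservoir_sample(stream, capacity, rng):
    """Keep a uniform random subset of `capacity` items from a stream of unknown length."""
    reservoir = []
    for i, value in enumerate(stream):
        if i < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(value)
        else:
            # Item i replaces a random slot with probability capacity / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < capacity:
                reservoir[j] = value
    return reservoir


# Toy stream standing in for the simulation output (an assumption for illustration).
rng = random.Random(0)
kept = reservoir_sample(range(100_000), capacity=1_000, rng=rng)
print(len(kept), sum(kept) / len(kept))
```

Because every value has the same chance of being kept, $N(t>T)/n$ computed on the reservoir estimates $P(t>T)$ over the whole range, including $T<5$.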

3
On

This question would be better asked at the Cross Validated Stack Exchange site. However...

If you do have samples from a truncated power law distribution (as opposed to a censored sample, where you would know how many observations fell below 5), then you can certainly estimate the parameter of the non-truncated distribution, provided you really know that the whole distribution follows that particular power law.

Suppose the truncated distribution has probability density

$$f(x)=\frac{(k-1) x^{-k}}{5^{1-k}}$$

for $x\ge 5$ (with $k>1$), and you have samples $x_1, x_2, \ldots, x_n$. The maximum likelihood estimator of $k$ is

$$\hat{k}=\frac{\overline{\log x}-\log 5 +1}{\overline{\log x}-\log 5}$$

where $\overline{\log x}=\frac{1}{n}\sum_{i=1}^n \log x_i$ (i.e., the mean of the logs).
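As a quick sanity check, the estimator can be tried on synthetic data. The sketch below is only an illustration: the true exponent $k=2.5$, the sample size and the helper names are my own choices, and the samples are drawn by inverting the survival function $(x/5)^{1-k}$ of the truncated density.

```python
import math
import random

X_MIN = 5.0  # truncation point from the question


def fit_exponent(samples, x_min=X_MIN):
    """MLE of k for the density (k-1) x^{-k} / x_min^{1-k} on x >= x_min."""
    mean_log = sum(math.log(x) for x in samples) / len(samples)
    d = mean_log - math.log(x_min)
    return (d + 1.0) / d


def draw_truncated_power_law(k, n, x_min=X_MIN, seed=0):
    """Inverse-CDF sampling: P(X > x) = (x / x_min)^(1-k), so x = x_min * u^(-1/(k-1))."""
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], which avoids raising zero to a negative power.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (k - 1.0)) for _ in range(n)]


samples = draw_truncated_power_law(k=2.5, n=100_000)
k_hat = fit_exponent(samples)
print(f"k_hat = {k_hat:.3f}")  # should land close to the true value 2.5
```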

The corresponding un-truncated distribution, assuming the same power law holds below 5, then has density function

$$g(x)=(k-1)x^{-k}$$

for $x\ge 1$, assuming that the lower bound is 1. You mention a lower bound of $0$, but that particular power law density cannot be normalized on $(0,\infty)$: for $k>1$ the integral of $x^{-k}$ diverges at $0$. That is why I asked in my comment above whether you had a particular (and specific) power law in mind.
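To connect this back to the normalization problem in the question: under $g$, the survival function is $P(t>T)=T^{1-k}$ for $T\ge 1$, so the fraction of values that survive the cut at 5 is $5^{1-k}$, and the number of values before truncation can be estimated as $n/5^{1-k}$. The sketch below assumes the lower bound of 1 used above and that the power law really holds below 5, which the data cannot verify; all function names are illustrative.

```python
def survival_full(T, k):
    """P(t > T) under g(x) = (k-1) x^{-k} on x >= 1 (lower bound 1 is an assumption)."""
    return 1.0 if T <= 1.0 else T ** (1.0 - k)


def retained_fraction(k, cut=5.0):
    """Model probability that a value survives the truncation at `cut`."""
    return cut ** (1.0 - k)


def estimate_total_count(n_kept, k, cut=5.0):
    """Estimated number of values before truncation."""
    return n_kept / retained_fraction(k, cut)


def rescaled_survival(samples, T, k, cut=5.0):
    """For T >= cut: empirical P(t > T | t >= cut) times the model P(t >= cut)."""
    frac_above = sum(1 for x in samples if x > T) / len(samples)
    return frac_above * retained_fraction(k, cut)


k = 2.5  # e.g. a fitted exponent; the value here is only an example
print(retained_fraction(k))            # share of the data kept by the cut
print(estimate_total_count(1_000, k))  # implied total count if 1000 values were kept
```

The result is only as good as the extrapolation below 5: a different lower bound or a change of shape at small $t$ would change the normalization.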

An estimate of the standard error of $\hat{k}$ is

$$\sqrt{\frac{(\hat {k}-1)^2}{n}}=\frac{\hat{k}-1}{\sqrt{n}}$$
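For a feel of the numbers, here is a small sketch that turns the standard error into a rough 95% interval. The interval relies on the usual large-sample normal approximation for maximum likelihood estimators, which is my addition and not part of the formula above; the values of $\hat{k}$ and $n$ are made up.

```python
import math


def exponent_standard_error(k_hat, n):
    """Standard error of the MLE: sqrt((k_hat - 1)^2 / n) = (k_hat - 1) / sqrt(n) for k_hat > 1."""
    return (k_hat - 1.0) / math.sqrt(n)


k_hat, n = 2.5, 10_000  # illustrative values only
se = exponent_standard_error(k_hat, n)  # 1.5 / 100 = 0.015
lo, hi = k_hat - 1.96 * se, k_hat + 1.96 * se  # approximate 95% interval
print(f"se = {se:.4f}, interval = ({lo:.4f}, {hi:.4f})")
```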

0
On

As far as I understand, you need to restrict the number of stored transitions. Instead of dropping observations below a threshold, you may store the $k$-th observation with probability $0<p<1$. That is, let $X_k\sim\text{Bern}(p)$ be independent of $t_k$. Then you store $t_k$ if $X_k=1$ and estimate $q_T:=\mathsf{P}(t>T)$ using

$$ \hat{q}_T=n^{-1}\sum_{k=1}^n 1\{t_k>T\}\times 1\{n>0\}, $$

where the sum runs over the stored samples and $n$ is their number (note that $n\sim \text{Bin}(N,p)$, where $N$ is the (unknown) total number of observations). Assuming that each $t_k$ is an independent copy of $t$,

$$ \mathsf{E}\hat{q}_T=\sum_{l=1}^N l^{-1}\sum_{k=1}^l \mathsf{P}(t_k>T)\times \mathsf{P}(n=l)=\mathsf{P}(t>T)\times \mathsf{P}(n>0), $$

which is very close to $\mathsf{P}(t>T)$ when $N$ is large.

On average you will store $pN$ observations, with standard deviation $\sqrt{Np(1-p)}$.
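A minimal simulation of this scheme, just to illustrate it: the stream below is drawn from a power law on $[1,\infty)$ with exponent $2.5$, and $N$, $p$, $T$ and the function names are arbitrary choices of mine, not values from the question.

```python
import random


def thin_stream(stream, p, rng):
    """Store each observation independently with probability p (Bernoulli thinning)."""
    return [t for t in stream if rng.random() < p]


def survival_estimate(stored, T):
    """q_hat_T: fraction of stored values above T, defined as 0 if nothing was stored."""
    if not stored:
        return 0.0
    return sum(1 for t in stored if t > T) / len(stored)


rng = random.Random(0)
N, p, T, k = 200_000, 0.05, 5.0, 2.5
# Synthetic stream with P(t > x) = x^(1-k) for x >= 1 (an assumption for the demo).
stream = ((1.0 - rng.random()) ** (-1.0 / (k - 1.0)) for _ in range(N))

stored = thin_stream(stream, p, rng)
q_hat = survival_estimate(stored, T)
print(len(stored), q_hat, T ** (1.0 - k))  # stored count, estimate, true P(t > T)
```

Unlike the hard cut at 5, the stored values cover the full range of $t$, so the estimate is valid for every $T$, at the price of a random storage size.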