Mean time between failures for exponential distribution.

Let's say I have n independent machines that fail according to independent exponential distributions with a mean of 1000 days. The machines can be shut down by the operator, so some of the samples we observe will be censored. For example, if a machine was started this morning and shut down in the afternoon with no failures in between, we can only say that the time to failure was more than 12 hours, not exactly how much more. So we have some sampled observations $t_i$, where we observe the actual inter-arrival times between failures, and some censored observations $x_j$, where we only observe that the failure took longer than this time. How do we get the best estimate for the rate $\lambda$? For this, we can use maximum likelihood.

$$L(\lambda)=\prod_{i=1}^n f_T(t_i) \prod_{j=1}^m S_T(x_j)$$

where $f_T(t) = \lambda e^{-\lambda t}$ is the PDF of the exponential distribution, $S_T(t) = e^{-\lambda t}$ is its survival function, $n$ is the number of observed failure times and $m$ is the number of censored observations. The log-likelihood then becomes:

$$ll(\lambda) = \sum_{i=1}^n (\log(\lambda)-\lambda t_i) - \lambda\sum_{j=1}^m x_j$$

Differentiating with respect to $\lambda$ and setting to zero we get:

$$\frac{1}{\lambda} = \frac{\sum t_i +\sum x_j}{n}$$

Here $\frac{1}{\lambda}$ is the MTBF and $n$ is the number of sampled (uncensored) observations, in other words the instances where we actually witnessed the time between two failures.
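To make the formula concrete, here is a minimal Python sketch of it, together with a small simulation check. The function name `censored_exponential_mtbf`, the 500-day shutdown cutoff and the sample size are assumptions made up for illustration; they are not part of the question.

```python
import random


def censored_exponential_mtbf(failure_times, censored_times):
    """MLE of the MTBF (1/lambda) for exponential lifetimes with right-censoring.

    failure_times  -- fully observed times to failure (the t_i)
    censored_times -- times at which observation stopped without a failure (the x_j)
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("the estimate is undefined without at least one observed failure")
    total_exposure = sum(failure_times) + sum(censored_times)
    return total_exposure / n


# Simulation check (assumed setup): every machine is shut down after 500 days
# unless it has already failed, so lifetimes longer than 500 days are censored.
rng = random.Random(0)
true_mtbf = 1000.0
cutoff = 500.0

failures, censored = [], []
for _ in range(20000):
    lifetime = rng.expovariate(1.0 / true_mtbf)
    if lifetime <= cutoff:
        failures.append(lifetime)
    else:
        censored.append(cutoff)

mtbf_hat = censored_exponential_mtbf(failures, censored)
print(f"estimated MTBF: {mtbf_hat:.1f} days")
```

Under these assumptions the estimate should land close to the true 1000 days even though more than half of the machines never fail while being watched. Dropping the censored times from the numerator would bias the estimate downwards.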

This is where things seem to fall apart. Let's say there is a single machine that fails every 1000 days. I run it continuously for 3000 days and hence see 3 failures. From the final formula above, $n$ should be 2, since I only saw two complete intervals between consecutive failures (day 1000 to day 2000 and day 2000 to day 3000). The first failure is a censored observation. However, this would give an MTBF of 3000/2 = 1500 days, which is wrong.

Now consider another scenario. Suppose I have 2000 such machines, but can observe them only within the span of a day. If I take any random day, I'll need to sample 2000 machines on average before I observe two failures (and then $n$ = 1). So, the MTBF for this combined system is 1 day / 1 sample between failures = 1 day. Since this is for the system of 2000 machines, the MTBF for a single machine must be $1 \times 2000 = 2000$ days. This is a factor of two off.

Note that in both of these examples, if I take $n$ to simply be the number of failures, everything works out. But that doesn't seem to be its definition. What am I missing here?
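The observation that counting failures "works out" can be checked numerically. The sketch below is only an illustration under stated assumptions: one machine with exponential times between failures (mean 1000 days) is watched over a fixed window starting at time 0, and the wait from time 0 to the first failure is treated as a complete observation, which memorylessness allows for the exponential case. The helper `simulate_window` and the window length are hypothetical.

```python
import random


def simulate_window(rng, mean, window):
    """Watch one machine over [0, window].

    Returns (complete_gaps, censored_tail): the gaps that ended in a failure
    inside the window, and the unfinished stretch after the last failure.
    """
    complete_gaps = []
    clock = 0.0
    while True:
        gap = rng.expovariate(1.0 / mean)
        if clock + gap > window:
            # The next failure falls outside the window: right-censored.
            return complete_gaps, window - clock
        complete_gaps.append(gap)
        clock += gap


rng = random.Random(2)
window = 10_000_000.0  # assumed long window so the estimate is stable
complete_gaps, censored_tail = simulate_window(rng, 1000.0, window)

n_failures = len(complete_gaps)
total_exposure = sum(complete_gaps) + censored_tail  # equals the window length
mtbf_hat = total_exposure / n_failures
print(f"failures: {n_failures}, estimated MTBF: {mtbf_hat:.1f} days")
```

With this bookkeeping the numerator is just the total observation time and $n$ is the number of failures seen, so the estimate reduces to window length divided by failure count.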

There are 1 best solutions below

It actually has nothing to do with having censored data or not.

This is another example of the inspection paradox. Roughly speaking, it comes down to how you treat the zero point $(t=0)$, that is, the moment at which observation starts.
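For readers who have not met the inspection paradox, the following Python sketch may help. It assumes exponential times between failures with mean 1000 days and an inspection at a fixed, late time (day 20000, an arbitrary choice), and compares the average length of the gap that happens to contain the inspection time with the average length of an ordinary gap. The function name and all parameters are made up for illustration.

```python
import random


def covering_gap_length(rng, mean, inspect_at):
    """Length of the between-failure gap that contains the time `inspect_at`."""
    previous_failure = 0.0
    while True:
        next_failure = previous_failure + rng.expovariate(1.0 / mean)
        if next_failure > inspect_at:
            return next_failure - previous_failure
        previous_failure = next_failure


rng = random.Random(1)
mean = 1000.0
runs = 20000

# Average gap as seen by an observer who drops in at day 20000.
avg_covering = sum(covering_gap_length(rng, mean, 20000.0) for _ in range(runs)) / runs

# Average gap when every gap is recorded from its own start.
avg_typical = sum(rng.expovariate(1.0 / mean) for _ in range(runs)) / runs

print(f"gap containing the inspection time: {avg_covering:.0f} days on average")
print(f"ordinary gap:                       {avg_typical:.0f} days on average")
```

For the exponential distribution the gap containing the inspection time is about twice as long on average as an ordinary gap, because longer gaps are more likely to be the one you land in. That is the same factor of two that shows up in the second scenario of the question.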

There are numerous materials out there, and there have been plenty of posts mentioning this exact term that can be easily found via search.

In particular, I find this post or an earlier one to be a good starting point, with both the questions and the answers well formulated. The one from 2014 is quite comprehensive and might be unnecessarily detailed, while the 2011 answer is terse and insightful.