Difference between sufficient and non sufficient statistics


I have a few questions here.

  1. Could someone please give me (in plain English, avoiding as much statistical jargon as possible and resorting to examples one might understand with only a basic knowledge of statistics and/or mathematics whenever possible) an example of a non-sufficient statistic, and go on to explain why it isn't sufficient?

  2. When we say "sufficient," what does that mean? Sufficient for... what exactly? Why is it so important?

  3. What makes a sufficient statistic sufficient?


There are 2 best solutions below

12
On

I will provide an intuitive explanation. As Hans Engler comments, if you want to use sufficient statistics, you will eventually have to get comfortable with the formal definition. But let me do my best to explain sufficiency, as you ask, with minimal jargon.

The core goal of statistical estimation is to gain information about a (usually large) population from a much smaller sample. Specifically, we may want to estimate certain parameters of the population, like the mean (average) or variance. In statistical theory, a sample is given by a bunch of random variables $X_1,\ldots,X_n$, each random variable describing one randomly chosen individual from the population.

A sufficient statistic $U$ is, informally, a lossless compression of your sample data. (Or at least, it loses no information for purposes of estimating the parameter of interest.) Say I have two statisticians A and B. To A I give the entire sample $X_1,\ldots,X_n$, and to B I give just our sufficient statistic $U$ and tell him how $U$ was computed. You would imagine that A would be able to make much better estimates of the population parameters since he has the entire dataset, but in fact, since $U$ is sufficient, both A and B can make equally accurate estimates. In some sense, $U$ represents the entire sample $X_1,\ldots,X_n$ boiled down to one number. (It is important to note that sometimes you need more than one sufficient statistic, but often one will suffice.)

So, to answer your third question directly: what makes a sufficient statistic sufficient, intuitively, is that having the sufficient statistic is just as good as having the whole sample for purposes of estimating the parameters you care about.

It is important to note that sufficiency is always relative to a particular parameter. If I'm trying to estimate the population mean $\mu$ and you're trying to estimate the population variance $\sigma^2$, we may need different sufficient statistics.

As a concrete example, consider a population of individuals whose income can be modelled by an exponential distribution with mean $\mu$. (If this means nothing to you, just know the average income is $\mu$.) We then take a sample $X_1,\ldots,X_n$. What would not be a sufficient statistic is $X_1$, just the first element of our sample. This is because if you only know $X_1$, you lose all of the valuable information contained in the rest of the sample. However, the sample mean

$$ \bar{X} = \frac{\sum_{i=1}^n X_i}{n} $$

is sufficient. If you give statistician A all the sample data and B $\bar{X}$, they can make equally accurate predictions (point estimates) of $\mu$.
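To make the A-versus-B comparison concrete, here is a minimal sketch in Python. It assumes the exponential model above with density $\frac{1}{\mu}e^{-x/\mu}$; the three samples and the candidate values of $\mu$ are made-up numbers, and `exp_log_likelihood` is a name chosen for this illustration only. Two samples with the same total (hence the same $\bar{X}$) produce exactly the same likelihood for every $\mu$, so they support the same conclusions about $\mu$. Two samples that share only $X_1$ do not.

```python
import math

def exp_log_likelihood(sample, mu):
    """Log-likelihood of i.i.d. exponential observations with mean mu.

    Each observation x contributes log((1/mu) * exp(-x/mu)) = -log(mu) - x/mu.
    """
    return sum(-math.log(mu) - x / mu for x in sample)

# Hypothetical incomes, made up for this illustration.
sample_a = [10.0, 20.0, 30.0, 40.0]  # total 100, sample mean 25
sample_b = [25.0, 25.0, 25.0, 25.0]  # different values, same total and mean
sample_c = [10.0, 50.0, 60.0, 80.0]  # same first element as sample_a, total 200

candidate_means = [5.0, 25.0, 50.0, 100.0]

# Same sample mean: the likelihoods agree at every candidate mu.
same_mean_agree = all(
    math.isclose(exp_log_likelihood(sample_a, mu), exp_log_likelihood(sample_b, mu))
    for mu in candidate_means
)

# Same X_1 only: the likelihoods do not agree.
same_first_agree = all(
    math.isclose(exp_log_likelihood(sample_a, mu), exp_log_likelihood(sample_c, mu))
    for mu in candidate_means
)

print(same_mean_agree)   # True
print(same_first_agree)  # False
```

The code never computes $\bar{X}$ explicitly; the agreement between `sample_a` and `sample_b` falls out of the model, because the log-likelihood simplifies to $-n\log\mu - \sum_i x_i/\mu$.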

0
On

Going through your question and comment, the simplest thing I can say is this: a sufficient statistic is a statistic that contains all the information in the sample about the parameter, say $\theta$.

For example, for a Normal distribution (with known variance), the sample mean is a sufficient statistic for the parameter $\mu$, the population mean. So when you're doing a practical experiment and need information about $\mu$, you just need to know the sample mean, which gives you all the information you need; you don't have to go through the whole data set.
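One way to see why the sample mean is enough in the Normal case is to write out the likelihood. This is a sketch under the assumption that the variance $\sigma^2$ is known and the observations $x_1,\ldots,x_n$ are independent; $\bar{x}$ denotes the sample mean. Using the identity $\sum_i (x_i-\mu)^2 = \sum_i (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2$, the likelihood splits into two factors:

$$ \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = \underbrace{(2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{2\sigma^2}\right)}_{\text{does not involve } \mu} \cdot \underbrace{\exp\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right)}_{\text{involves the data only through } \bar{x}} $$

The population mean appears only in the second factor, and that factor sees the data only through the sample mean. This is the factorization criterion for sufficiency applied to this model.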

Also, if you think intuitively about the definition you gave in the comment section, you'll realise that since $T$ has all the information about $\theta$, the conditional distribution of the sample given $T$ can't give you any more information about $\theta$.
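The conditional-distribution idea can be checked by brute force on a small example. The sketch below is not the Normal model from this answer; it assumes three independent coin flips with unknown heads probability `p`, chosen only because every outcome can be enumerated exactly. The function name `conditional_given_statistic` is hypothetical. Conditioning on the number of heads gives the same distribution whatever `p` is, whereas conditioning on the first flip alone does not.

```python
from fractions import Fraction
from itertools import product

def conditional_given_statistic(p, n, statistic, value):
    """Exact conditional distribution of n Bernoulli(p) flips given statistic == value."""
    # Joint probability of each 0/1 sequence whose statistic matches `value`.
    weights = {
        seq: p ** sum(seq) * (1 - p) ** (n - sum(seq))
        for seq in product([0, 1], repeat=n)
        if statistic(seq) == value
    }
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}

p_low, p_high = Fraction(1, 10), Fraction(9, 10)

# T = number of heads (sufficient): condition on T == 2.
by_sum_low = conditional_given_statistic(p_low, 3, sum, 2)
by_sum_high = conditional_given_statistic(p_high, 3, sum, 2)

# T = first flip only (not sufficient): condition on X_1 == 1.
first = lambda seq: seq[0]
by_first_low = conditional_given_statistic(p_low, 3, first, 1)
by_first_high = conditional_given_statistic(p_high, 3, first, 1)

print(by_sum_low == by_sum_high)      # True: no trace of p is left
print(by_sum_low[(1, 1, 0)])          # 1/3
print(by_first_low == by_first_high)  # False: the remaining flips still depend on p
```

Once the number of heads is known, the three possible arrangements are equally likely regardless of `p`, so looking at the full sequence cannot tell you anything more about `p`. After seeing only the first flip, the other two flips still carry information about `p`.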

So you may now see that insufficient (non-sufficient) statistics are those which do not contain all the information about the parameter.

I hope this helps clear up the concept. I've tried to put it in the simplest terms I could.