Help understanding Casella & Berger's explanation of a sufficient statistic


This is from Casella and Berger's Statistical Inference:

Definition: A statistic $T(\mathbf{X})$ is a sufficient statistic for $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given the value of $T(\mathbf{X})$ does not depend on $\theta$.

In the discrete case,

Let $t$ be a possible value of $T(\mathbf{X})$ , that is, a value such that $P_\theta(T(\mathbf{X}) = t) > 0$. We wish to consider the conditional probability $P_\theta(\mathbf{X} = \mathbf{x}|T(\mathbf{X}) = t)$. If $\mathbf{x}$ is a sample point such that $T(\mathbf{x}) \neq t$, then clearly, $P_\theta(\mathbf{X} = \mathbf{x}|T(\mathbf{X}) = t) = 0$. Thus, we are interested in $P(\mathbf{X} = \mathbf{x}|T(\mathbf{X}) = T(\mathbf{x}))$. By the definition, if $T(\mathbf{X})$ is a sufficient statistic, this conditional probability is the same for all values of $\theta$ so we have omitted the subscript.

This is the part I'm having trouble with:

A sufficient statistic captures all the information about $\theta$ in this sense. Consider Experimenter 1, who observes $\mathbf{X} = \mathbf{x}$ and, of course, can compute $T(\mathbf{X}) = T(\mathbf{x})$. To make an inference about $\theta$, he can use the information that $\mathbf{X} = \mathbf{x}$ and $T(\mathbf{X}) = T(\mathbf{x})$. Now consider Experimenter 2, who is not told the value of $\mathbf{X}$ but only that $T(\mathbf{X}) = T(\mathbf{x})$. Experimenter 2 knows $P(\mathbf{X} = \mathbf{y}|T(\mathbf{X}) = T(\mathbf{x}))$, a probability distribution on $A_{T(\mathbf{x})} = \{\mathbf{y}: T(\mathbf{y}) = T(\mathbf{x})\}$, because this can be computed from the model without knowledge of the true value of $\theta$.

So far, so good. But below, what exactly is this random variable $\mathbf{Y}$? I'm having trouble unraveling why exactly this conclusion means that Experimenter 2 has the same information that Experimenter 1 has regarding the parameter $\theta$. I apologize for not framing my question better -- I'm just quite confused by the point the author is trying to make in the paragraph below. I will update with an edit if I can clarify my question further.

Thus, Experimenter 2 can use this distribution and a randomization device, such as a random number table, to generate an observation $\mathbf{Y}$ satisfying $P(\mathbf{Y} = \mathbf{y}|T(\mathbf{X}) = T(\mathbf{x})) = P(\mathbf{X} = \mathbf{y}|T(\mathbf{X}) = T(\mathbf{x}))$. It turns out that, for each value of $\theta$, $\mathbf{X}$ and $\mathbf{Y}$ have the same unconditional probability distribution, as we shall see below. So Experimenter 1, who knows $\mathbf{X}$, and Experimenter 2, who knows $\mathbf{Y}$ have equivalent information about $\theta$, but surely the use of the random number table to generate $\mathbf{Y}$ has not added to Experimenter 2's knowledge of $\theta$. All his knowledge about $\theta$ is contained in the knowledge that $T(\mathbf{X}) = T(\mathbf{x})$. So Experimenter 2, who knows only $T(\mathbf{X}) = T(\mathbf{x})$, has as much information about $\theta$ as does Experimenter 1, who knows the entire sample $\mathbf{X} = \mathbf{x}$.

To complete the above argument, we need to show that $\mathbf{X}$ and $\mathbf{Y}$ have the same unconditional distribution, that is, $P_\theta(\mathbf{X} = \mathbf{x}) = P_\theta(\mathbf{Y} = \mathbf{x})$ for all $\mathbf{x}$ and $\theta$. Note that the events $\{\mathbf{X} = \mathbf{x}\}$ and $\{\mathbf{Y} = \mathbf{x}\}$ are both subsets of the event $\{T(\mathbf{X}) = T(\mathbf{x})\}$.

Also recall that $$ P(\mathbf{X} = \mathbf{x}|T(\mathbf{X}) = T(\mathbf{x})) = P(\mathbf{Y} = \mathbf{x}|T(\mathbf{X}) = T(\mathbf{x})) $$ and these conditional probabilities do not depend on $\theta$. Thus, we have $$ P_\theta(\mathbf{X} = \mathbf{x}) = P_\theta(\mathbf{X} = \mathbf{x} \text{ and } T(\mathbf{X}) = T(\mathbf{x})) \\ = P(\mathbf{X} = \mathbf{x}|T(\mathbf{X}) = T(\mathbf{x}))P_\theta(T(\mathbf{X}) = T(\mathbf{x})) \\ = P(\mathbf{Y} = \mathbf{x}|T(\mathbf{X}) = T(\mathbf{x}))P_\theta(T(\mathbf{X}) = T(\mathbf{x})) \\ = P_\theta(\mathbf{Y} = \mathbf{x} \text{ and } T(\mathbf{X}) = T(\mathbf{x}))\\ = P_\theta(\mathbf{Y} = \mathbf{x})$$
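The randomization-device argument can be checked by simulation. A minimal sketch, using an assumed concrete setting not taken from the text: an i.i.d. Bernoulli($p$) sample of size 3 with $T(\mathbf{X}) = \sum_i X_i$, for which the conditional distribution of the sample given the sum is uniform over arrangements of the ones. Experimenter 2 draws $\mathbf{Y}$ by placing $t$ ones uniformly at random, and the empirical distributions of $\mathbf{X}$ and $\mathbf{Y}$ should agree up to sampling noise:

```python
import random
from collections import Counter

random.seed(0)
p, n, trials = 0.3, 3, 100_000

x_counts, y_counts = Counter(), Counter()
for _ in range(trials):
    # Experimenter 1 observes the full Bernoulli(p) sample X.
    x = tuple(int(random.random() < p) for _ in range(n))
    x_counts[x] += 1

    # Experimenter 2 only learns T(X) = sum(X).  Given the sum t, the
    # conditional distribution is uniform over arrangements of t ones,
    # so Y is drawn by placing t ones into n slots uniformly at random.
    t = sum(x)
    ones = set(random.sample(range(n), t))
    y = tuple(1 if i in ones else 0 for i in range(n))
    y_counts[y] += 1

# Empirical (unconditional) distributions of X and Y should agree.
for sample in sorted(x_counts):
    print(sample, x_counts[sample] / trials, y_counts[sample] / trials)
```

Note that the draw of $\mathbf{Y}$ never uses $p$, only the observed value of $T$, which is exactly the point of the argument.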


There are 3 best solutions below


Let me try to rewrite the paragraph as I understand it:

Experimenter 1 observes a random variable $X$ on a measurable space $(\Omega, \mathcal A)$ with values in any given space (possibly high or even infinite dimensional). The statistical experiment or model is given by the family of possible probability distributions $(P_\theta)_{\theta\in\Theta}$, with a suitable parameter space $\Theta$.

Experimenter 1 gains knowledge about the unknown parameter from a certain event $\{X=x\}$. Their inference is completely based on this event and is described by the probabilities $P_\theta(X=x)$.

Now there is also experimenter 2, who does not know that the event $\{X=x\}$ has occurred. They only know that $\{T(X)=T(x)\}$. Intuitively, experimenter 2 now has less information about $\theta$, since $\{X=x\}\subseteq\{T(X)=T(x)\}$.

But since $T$ is sufficient for $\theta$ we can show that experimenter 2 in fact has the same information: The conditional probabilities $P(X=x\mid T(X)=T(x))$ are independent of $\theta$ and are therefore accessible to both experimenters. These probabilities define a distribution that experimenter 2 might use to sample from. This random sample is defined to be the variable $Y$.

Therefore, $$ P(X=x\mid T(X)=T(x))= P(Y=x\mid T(X)=T(x)) $$

It is now shown that $X$ and $Y$ have the same (unconditional!) distribution: $P_\theta(X=x)=P_\theta(Y=x)$. This is the same as saying that experimenter 2 can draw the same conclusions about $\theta$ as experimenter 1 can, since they can sample from a variable with a distribution that is the same as that of $X$, even with the sole knowledge of $\{T(X)=T(x)\}$.


I have also failed to comprehend this explanation in the textbook.

I wanted to suggest a much simpler explanation of the sufficiency principle from the book

SUFFICIENCY PRINCIPLE: If $T({\bf X})$ is a sufficient statistic for $\theta$, then any inference about $\theta$ should depend on the sample ${\bf X}$ only through the value $T({\bf X})$...

using Bayes' rule. The most straightforward inference on $\theta$ we can make is to compute its distribution conditional either on the information about ${\bf X}$ or only about $T({\bf X})$. But if $T({\bf X})$ is a sufficient statistic, we have

$$P(\theta = \theta_0 \ | \ {\bf X}={\bf x}) = P(\theta = \theta_0 \ | \ {\bf X}={\bf x}, \ T({\bf X})=T({\bf x}))$$ $$ =\frac{P({\bf X}={\bf x}\ | \ \theta=\theta_0, \ T({\bf X})=T({\bf x})) \ P(\theta=\theta_0 \ | \ T({\bf X})=T({\bf x}))}{P({\bf X}={\bf x} \ | \ T({\bf X})=T({\bf x}))}\ (\text{Bayes' rule} \ | \ T({\bf X})=T({\bf x}))$$

$$ = \frac{P({\bf X}={\bf x} \ | \ T({\bf X})=T({\bf x})) \ P(\theta=\theta_0 \ | \ T({\bf X})=T({\bf x}))}{P({\bf X}={\bf x} \ | \ T({\bf X})=T({\bf x}))} \quad (\text{sufficiency})$$ $$=P(\theta=\theta_0 \ | \ T({\bf X})=T({\bf x})),$$so we gain no more information knowing ${\bf X}={\bf x}$ compared to knowing just $T({\bf X})=T({\bf x})={\bf t}$.
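The posterior equality above can be verified numerically. A small sketch under assumed inputs (a hypothetical three-point prior on the Bernoulli parameter $p$ and an assumed observed sample; neither comes from the text), with $T({\bf X}) = \sum_i X_i$: the posterior computed from the full sample and the posterior computed from $T$ alone coincide, because the binomial coefficient cancels in the normalization.

```python
from math import comb

# Hypothetical discrete prior over the Bernoulli parameter p.
prior = {0.2: 1/3, 0.5: 1/3, 0.8: 1/3}
x = (1, 0, 1)           # an assumed observed sample
n, t = len(x), sum(x)   # t is the value of the sufficient statistic T(x)

def lik_full(p):   # P(X = x | p): each coordinate contributes p or 1 - p
    return p**t * (1 - p)**(n - t)

def lik_stat(p):   # P(T = t | p): a Binomial(n, p) probability
    return comb(n, t) * p**t * (1 - p)**(n - t)

def posterior(lik):
    z = sum(lik(q) * w for q, w in prior.items())
    return {q: lik(q) * w / z for q, w in prior.items()}

post_x = posterior(lik_full)   # posterior given the full sample X = x
post_t = posterior(lik_stat)   # posterior given only T(X) = t

# The comb(n, t) factor cancels in the normalization, so the two
# posteriors agree, as the Bayes-rule derivation predicts.
for q in prior:
    print(q, round(post_x[q], 4), round(post_t[q], 4))
```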


Some further intuition on how sufficiency preserves the information contained in the sample about a parameter of interest: suppose we have a random sample $X_1,\dots,X_n$ from a distribution family having densities $f_X(x; \theta)$. Let $\mathcal{X}$ denote the sample space of the random vector $(X_1,\dots,X_n)$. A statistic $$ T= t(X_1,\dots,X_n) $$ can then be viewed as a partitioning of $\mathcal{X}$. To see this, let $\mathcal{T}$ be the sample space of $T$, and consider the sets $$ \mathcal{X}_t = \{ (x_1,\dots,x_n)\in\mathcal{X} : t(x_1,\dots,x_n) = t\} $$ for each $t \in \mathcal{T}$; it is then clear that the collection of sets $\{\mathcal{X}_t\}_{t \in \mathcal{T}}$ forms a partition of $\mathcal{X}$.

Now, a statistic $T$ is useful if it has good data reduction properties. This can be judged by how well the corresponding partition reduces the number of possible values to be considered, as well as the degree to which all relevant information regarding the parameter $\theta$ is retained. If decisions are to be made based on the value of the statistic instead of the observed data, then the decision will be the same for any dataset within the same partition set $\mathcal{X}_t$. Therefore, for a statistic to be sufficient, the information which distinguishes the individual elements of each $\mathcal{X}_t$ should have no bearing on the value of $\theta$ (i.e., if an observed sample is known to lie in a given $\mathcal{X}_t$, the probability that the sample takes any particular value within this member of the partition should be independent of $\theta$).

Consider the example in which we have a sample of size $3$, $X_1,X_2,X_3$ from a Bernoulli distribution with parameter $p = P(X_i=1)$. The sample space for $(X_1,X_2,X_3)$ is $$ \mathcal{X} = \{(0,0,0), (1,0,0), \dots, (1,1,1)\} $$ with $|\mathcal{X}| = 2^3=8$. Consider two statistics $T_1 = X_1X_2+X_3$ and $T_2 = X_1+X_2+X_3$. We can see that $\mathcal{T}_1 = \{0,1,2\}$ and $\mathcal{T}_2 = \{0,1,2,3\}$. $T_1$ induces the sample space partitions: \begin{align*} \mathcal{X}^1_0 &= \{(0,0,0), (0,1,0), (1,0,0) \} \\ \mathcal{X}^1_1 &= \{(0,0,1), (0,1,1), (1,0,1),(1,1,0) \}\\ \mathcal{X}^1_2 &= \{(1,1,1) \}, \end{align*} and similarly, $T_2$ induces the partitions \begin{align*} \mathcal{X}^2_0 &= \{(0,0,0) \} \\ \mathcal{X}^2_1 &= \{(0,0,1), (0,1,0), (1,0,0) \}\\ \mathcal{X}^2_2 &= \{(0,1,1), (1,0,1), (1,1,0) \}\\ \mathcal{X}^2_3 &= \{(1,1,1) \} \end{align*}
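These partitions can be generated mechanically rather than by hand. A short sketch, with $T_1$ and $T_2$ as defined above, that enumerates the 8 sample points and groups them by the value of each statistic:

```python
from itertools import product
from collections import defaultdict

# The 8 points of the sample space for (X1, X2, X3).
samples = list(product([0, 1], repeat=3))

def partition(stat):
    """Group the sample points by the value of the statistic."""
    parts = defaultdict(list)
    for x in samples:
        parts[stat(*x)].append(x)
    return dict(parts)

T1 = lambda x1, x2, x3: x1 * x2 + x3   # T1 = X1*X2 + X3
T2 = lambda x1, x2, x3: x1 + x2 + x3   # T2 = X1 + X2 + X3

for name, stat in [("T1", T1), ("T2", T2)]:
    for t, block in sorted(partition(stat).items()):
        print(f"{name}={t}: {block}")
```

The printed blocks reproduce the partitions $\mathcal{X}^1_t$ and $\mathcal{X}^2_t$ listed above.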

Now, we can consider the distribution of the sample space values within each element of the two partitions. Suppose you are given that $T_1=0$, so the possible values for our original samples are those elements of $\mathcal{X}^1_0$. Then, consider the probability that the actual dataset was all zeroes: \begin{align*} P(X_1=0,X_2=0,X_3=0 | T_1=0) &= \frac{P(X_1=0,X_2=0,X_3=0,T_1=0)}{P(T_1=0)}\\ &= \frac{P(X_1=0,X_2=0,X_3=0)}{P(X=(0,0,0) \text{ or } X=(0,1,0) \text{ or } X=(1,0,0))}\\ &=\frac{(1-p)^3}{(1-p)^3 + 2p(1-p)^2}\\ &=\frac{1-p}{1+p} \end{align*} and so the probability is not independent of $p$; in other words, $T_1$ is not sufficient, as it does not induce an appropriate partition. If you based your decision on the value $T_1=0$, it would have to be the same regardless of whether the actual sample had been $(0,0,0)$ or $(0,1,0)$, yet these samples contain different information about the parameter of interest, $p$.

Now, suppose you are told that $T_2=1$, so that the possible values of the original sample are the elements of $\mathcal{X}^2_1$. You can easily show that $P(X_1=0,X_2=1,X_3=0|T_2=1) = \frac{1}{3}$, and you can actually show that for any value $T_2=t$, the chance that the actual dataset was one of the possible elements of $\mathcal{X}^2_t$ does not depend on $p$. So, $T_2$ is sufficient and basing estimates on its value retains all the relevant information in the sample $(X_1,X_2,X_3)$ regarding $p$.
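The contrast between the two statistics can also be computed directly. A minimal sketch (same Bernoulli setting and statistics as above) that evaluates the conditional probabilities over a grid of $p$ values: the $T_1$ probability traces out $(1-p)/(1+p)$, while the $T_2$ probability stays at $1/3$ for every $p$.

```python
from itertools import product

samples = list(product([0, 1], repeat=3))

def pr(x, p):
    """P(X = x) for i.i.d. Bernoulli(p) coordinates."""
    return p**sum(x) * (1 - p)**(3 - sum(x))

def cond(x, stat, p):
    """P(X = x | stat(X) = stat(x)): restrict to the partition set of x."""
    t = stat(*x)
    denom = sum(pr(y, p) for y in samples if stat(*y) == t)
    return pr(x, p) / denom

T1 = lambda a, b, c: a * b + c
T2 = lambda a, b, c: a + b + c

for p in (0.1, 0.3, 0.5, 0.7):
    c1 = cond((0, 0, 0), T1, p)   # depends on p: equals (1-p)/(1+p)
    c2 = cond((0, 1, 0), T2, p)   # constant 1/3: T2 is sufficient
    print(f"p={p}: P(X=(0,0,0)|T1=0)={c1:.4f}, P(X=(0,1,0)|T2=1)={c2:.4f}")
```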

For a formal definition: Let $X_1,\dots,X_n$ be a random sample from a distribution family with density function $f_X(x;\theta)$ where $\theta \in \mathbb{R}^k$ is a parameter vector. A vector valued statistic $T= t(X_1,\dots,X_n)$ is sufficient for $\theta$ if and only if the conditional distribution of $X_1,\dots,X_n$ given $T$ does not depend on $\theta$.