The definition of a sufficient statistic says that the conditional distribution of the sample, given a sufficient statistic, say $S$, must not depend on the unknown parameter, say $\theta$.
Consider the $Ber(\theta)$ distribution. It can be shown that $S=\sum_{i=1}^n X_i$ is sufficient for $\theta$. I cannot fully understand the intuition. A textbook says that if statistician A knows the entire random sample $X_1,X_2,\ldots,X_n$, and statistician B knows only the value of $S$, both of them will do an equally good job of estimating the unknown parameter, since statistician B (as well as A) has all possible information on $\theta$.
What I can’t understand is this: say I’m statistician B. I have the value of $S$, say $s$. I can generate multiple random samples, each consisting of $(n-s)$ zeros and $s$ ones, so that they add up to $s$. But the conditional probability of any particular random sample, given the value of $S$, is in this case $\frac{(n-s)!\,s!}{n!}$. How are these two related? How am I generating a random sample by myself, from this conditional distribution?
I’m quoting the book here - “Since the conditional distribution of $X_1,\ldots,X_n$ given $\theta$ and $S$ does not depend on $\theta$, statistician B knows this conditional distribution. So he can use his computer to generate a random sample $x_1,\ldots,x_n$ which has this conditional distribution.”
As usual, a concrete example is illustrative. Consider iid $$X_i \sim \operatorname{Bernoulli}(\theta)$$ where $\Pr[X_i = 1] = \theta$ for $i = 1, 2, \ldots, n$. This is our parametric model for the sample. I'm statistician $A$ and you are statistician $B$.
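Now, I am telling you that $n = 7$ and $S = X_1 + \cdots + X_7 = 3$. You know that the sample must contain $3$ ones and $4$ zeroes; i.e., $(X_1, \ldots, X_7)$ is some permutation of the $7$-tuple $(1,1,1,0,0,0,0)$. Intuitively, ask yourself: would knowing which of the seven observations were the ones tell you anything more about $\theta$ than knowing that $S = 3$?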
Why or why not? Take your time to think about this. Have you lost any information about $\theta$ by ignoring the order in which the observations are recorded? Certainly, you have lost some information--in particular, which observations were $1$ and which were not--but is that information relevant to the value of $\theta$ used to generate the true sample that I know but you do not?
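To make the thought experiment tangible, here is a minimal simulation sketch (the function names `sample_given_sum` and `sample_via_bernoulli` are made up for illustration, and the values of $\theta$ are arbitrary choices). The first function is what statistician B can do knowing only $n$ and $s$: place the $s$ ones in uniformly random positions, so each of the $\binom{n}{s}$ arrangements has probability $\frac{s!\,(n-s)!}{n!}$. The second uses $\theta$ explicitly: it draws Bernoulli samples and keeps only those with sum $s$. If $S$ is sufficient, the two should produce arrangements with roughly the same frequencies, whatever $\theta$ is.

```python
import random
from collections import Counter
from math import comb


def sample_given_sum(n, s, rng):
    """Statistician B: draw a 0/1 tuple of length n with exactly s ones,
    uniformly over all such tuples. No theta is needed."""
    ones = set(rng.sample(range(n), s))  # positions that hold a one
    return tuple(1 if i in ones else 0 for i in range(n))


def sample_via_bernoulli(n, s, theta, rng):
    """Draw iid Bernoulli(theta) samples and keep the first one whose sum is s
    (rejection sampling from the conditional distribution given S = s)."""
    while True:
        x = tuple(1 if rng.random() < theta else 0 for _ in range(n))
        if sum(x) == s:
            return x


def relative_frequencies(draws):
    """Map each observed arrangement to its relative frequency."""
    counts = Counter(draws)
    return {x: c / len(draws) for x, c in counts.items()}


rng = random.Random(0)
n, s, reps = 7, 3, 5000
print("number of arrangements:", comb(n, s))  # 35
print("conditional probability of each:", 1 / comb(n, s))

freq_b = relative_frequencies([sample_given_sum(n, s, rng) for _ in range(reps)])
print("B, no theta: min/max frequency",
      min(freq_b.values()), max(freq_b.values()))

for theta in (0.2, 0.7):  # arbitrary illustrative values
    freq = relative_frequencies(
        [sample_via_bernoulli(n, s, theta, rng) for _ in range(reps)]
    )
    print(f"theta={theta}: min/max frequency",
          min(freq.values()), max(freq.values()))
```

In each case the frequencies should hover around $1/35 \approx 0.0286$, up to simulation noise. The order of the ones and zeroes therefore behaves the same way for every $\theta$, which is why it carries no information about $\theta$ beyond $S$.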