I came across an interesting (to me) exercise in Casella & Berger, exercise 7.23.
Suppose we have $S^2$ from a normal population, so that $\frac{(n-1)S^2}{\sigma^2}$ follows $\chi^2_{n-1}$. Our prior for $\sigma^2$ is inverse gamma with parameters $\alpha, \beta$:
$$\pi(\sigma^2) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \frac{1}{(\sigma^2)^{\alpha + 1}}e^{-\frac{1}{\beta\sigma^2}}$$
My question is the following: does it matter whether we use $S^2$ or the sample $X_1, ..., X_n$ for the update, and how do we know which one to use?
The posterior for $\sigma^2$ in this case is inverse gamma too. But how can I be sure that, if I used the original sample instead, it wouldn't be something else?
For example, if I simply took the sample $X_1, ..., X_{n-1}$ without the last data point and did an update using the product of normal densities, I would get a different posterior than from the full sample. Why can this not happen if I update using some statistic that is derived from the original sample but does not quite use all of its information?
Like, what if I computed $S^2$ for a Weibull distribution (picked arbitrarily), took some other conjugate prior, and made an update? Would it carry the same information into the Bayesian update as the original sample?
My intuition tells me that the derived statistic has to be a sufficient statistic, but I cannot find any reference on this.
You will get different answers because the sufficient statistic does not actually capture ALL of the information in the data; it only captures the information needed for the maximum likelihood estimate. Bayes' theorem tells you how likely ANY value of $\sigma^2$ is.
Basically, you can't recapitulate $X_1...X_n$ from $S^2$. Therefore, simply having $S^2$ will give you less to update your posterior with than $X_1...X_n$.
Let's actually walk through the math of Bayes' theorem to explain what I mean. I'm going to show this first with a single $S^2$ and then with a set of i.i.d. normal observations $X_1...X_n$. For the sake of simplicity, I'm going to assume that the mean of each $X_i$ is $0$.
$$f(\sigma^2|S^2) = \frac{f(\sigma^2)f(S^2|\sigma^2)}{\int_{0}^{\infty}f(\sigma^2)f(S^2|\sigma^2)d\sigma^2 }$$
The denominator is just a normalizing constant we'll call $Z$; it's far more cumbersome to calculate than it's worth. Your prior is the inverse gamma prior, so
$$f(\sigma^2|S^2) = \frac{\frac{1}{\Gamma(\alpha)\beta^{\alpha}} \frac{1}{(\sigma^2)^{\alpha + 1}}e^{-\frac{1}{\beta\sigma^2}}f(S^2|\sigma^2)}{Z}$$
Furthermore, $\frac{(n-1)S^2}{\sigma^2} = A(S^2) \sim \chi^2_{n-1}$. Changing variables, $f(S^2|\sigma^2) = f(A(S^2))\left|\frac{dA}{dS^2}\right| = \frac{n-1}{\sigma^2}f(A(S^2))$, so
$$f(\sigma^2|S^2) = \frac{\frac{1}{\Gamma(\alpha)\beta^{\alpha}} \frac{1}{(\sigma^2)^{\alpha + 1}}e^{-\frac{1}{\beta\sigma^2}}\,\frac{(n-1)f(A(S^2))}{\sigma^2}}{Z}$$
$$f(\sigma^2|S^2) = \frac{\frac{1}{\Gamma(\alpha)\beta^{\alpha}} \frac{n-1}{(\sigma^2)^{\alpha+2}}e^{-\frac{1}{\beta\sigma^2}}\,\frac{A^{(n-3)/2}e^{-A/2}}{2^{(n-1)/2}\Gamma(\frac{n-1}{2})}}{Z}$$
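Collecting the powers of $\sigma^2$ here, this collapses to another inverse gamma, with shape $\alpha + \frac{n-1}{2}$ and rate $\frac{1}{\beta} + \frac{(n-1)S^2}{2}$. Here's a quick numerical sanity check of that (just a sketch: the hyperparameters and the observed $S^2$ are arbitrary made-up numbers, and I'm using scipy's `invgamma`, whose `scale` argument plays the role of the rate $r$ in a density $\propto x^{-a-1}e^{-r/x}$):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# made-up demo values, not from the problem
alpha, beta = 3.0, 2.0   # prior IG(alpha, beta): density ∝ (s2)^-(alpha+1) e^{-1/(beta s2)}
n, S2 = 10, 1.7          # sample size and an observed sample variance

def unnorm_post(sig2):
    """Prior times f(S^2 | sig2), dropping factors constant in sig2."""
    A = (n - 1) * S2 / sig2                      # A ~ chi^2_{n-1} given sig2
    prior = sig2 ** (-(alpha + 1)) * np.exp(-1 / (beta * sig2))
    # change of variables: f(S^2 | sig2) = ((n-1)/sig2) * f_chi2(A)
    lik = ((n - 1) / sig2) * A ** ((n - 3) / 2) * np.exp(-A / 2)
    return prior * lik

Z, _ = quad(unnorm_post, 0, np.inf)              # normalizing constant

# claim: the posterior is IG with shape alpha + (n-1)/2 and rate 1/beta + (n-1)*S2/2
post = stats.invgamma(a=alpha + (n - 1) / 2, scale=1 / beta + (n - 1) * S2 / 2)

grid = np.linspace(0.3, 5.0, 50)
assert np.allclose(unnorm_post(grid) / Z, post.pdf(grid), rtol=1e-5)
```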
Now, let's look at $f(\sigma^2|X_1...X_n)$. The probability density of any $X_i$ given $\sigma^2$ is $\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{X_i^2}{2\sigma^2}}$, and since the $X_i$ are assumed independent, the joint density is the product of these.
$$f(\sigma^2|X_1...X_n) = \frac{\frac{1}{\Gamma(\alpha)\beta^{\alpha}} \frac{1}{(\sigma^2)^{\alpha+1}}e^{-\frac{1}{\beta\sigma^2}}\frac{e^{-\frac{\sum{X_i^2}}{2\sigma^2}}}{2^{n/2}\pi^{n/2}(\sigma^2)^{n/2}}}{Z}$$
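Multiplying out the powers of $\sigma^2$ in this one instead gives an inverse gamma with shape $\alpha + \frac{n}{2}$ and rate $\frac{1}{\beta} + \frac{\sum X_i^2}{2}$, which is the standard conjugate update for a normal with known mean. A sketch of a check (simulated data, arbitrary hyperparameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0                 # arbitrary demo hyperparameters
n, sigma2_true = 10, 1.5
x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)   # mean known to be 0

# conjugate update: IG with shape alpha + n/2 and rate 1/beta + sum(x_i^2)/2
post = stats.invgamma(a=alpha + n / 2, scale=1 / beta + np.sum(x**2) / 2)

# check against prior * likelihood on a grid: their ratio should be constant
grid = np.linspace(0.3, 6.0, 200)
prior = grid ** (-(alpha + 1)) * np.exp(-1 / (beta * grid))
lik = np.exp(-np.sum(x**2) / (2 * grid)) / grid ** (n / 2)
ratio = post.pdf(grid) / (prior * lik)
assert np.allclose(ratio, ratio[0])    # constant ratio => same distribution
```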
And these aren't the same thing. They look somewhat similar, but there's no way to reconcile them as they stand. Feel free to try, using the definitions of $S^2$ and $A$.
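To make the mismatch concrete: both updates land in the inverse gamma family, but with different parameters. Updating on $S^2$ gives shape $\alpha + \frac{n-1}{2}$ and rate $\frac{1}{\beta} + \frac{(n-1)S^2}{2}$, while updating on the full sample gives shape $\alpha + \frac{n}{2}$ and rate $\frac{1}{\beta} + \frac{\sum X_i^2}{2}$ (note $(n-1)S^2 = \sum(X_i - \bar X)^2 \le \sum X_i^2$). A quick simulated comparison, with arbitrary demo numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta, n = 3.0, 2.0, 10          # arbitrary demo values
x = rng.normal(0.0, 1.0, size=n)
S2 = np.var(x, ddof=1)                 # the usual sample variance

# posterior after updating on S^2 alone
post_s2 = stats.invgamma(a=alpha + (n - 1) / 2,
                         scale=1 / beta + (n - 1) * S2 / 2)
# posterior after updating on the full sample (mean known to be 0)
post_full = stats.invgamma(a=alpha + n / 2,
                           scale=1 / beta + np.sum(x**2) / 2)

# the shapes differ by 1/2 and the rates differ by n * xbar^2 / 2,
# so the two posteriors are genuinely different distributions
print(post_s2.mean(), post_full.mean())
```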
And as far as I know, there's no way to pick a prior for $S^2$ alone that makes them match: a sufficient statistic doesn't actually give you all the information that the data can. It can't. No matter what single statistic you use, it can't determine what the data were if you have more than one data point. I strongly suspect you'd need $n$ statistics to perfectly recapitulate the information in $n$ data points, and at that point you might as well just use the data.
Now, that doesn't mean that the form for $f(\sigma^2|S^2)$ is useless or wrong by any means. Say you're looking at published data that only includes $S^2$ and not the data. Then the form for $f(\sigma^2|S^2)$ IS correct. Bayes' theorem lets you incorporate ALL the information you know about a system. If all you know is $S^2$, then the form for $f(\sigma^2|S^2)$ is what you want. If you know all the data $X_1...X_n$, then the form for $f(\sigma^2|X_1...X_n)$ is what you want.