I came across the definition of a conjugate distribution: for a particular distribution $f(x|\theta)$, its conjugate distribution $g(\theta)$ is defined by the property that the prior $g(\theta)$ and the posterior $g(\theta|x)$ belong to the same family of functions (or, loosely speaking, have the same form).
Then I wondered: how did statisticians deduce the form of these conjugate distributions?
My first thought was that they, indeed, solved a functional equation
$$g(\theta|x)\propto f(x|\theta)g(\theta)$$
for a given $f(x|\theta)$, but how to do so is not obvious to me.
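For the simplest cases, though, guess-and-check does work. Here is a minimal numerical sketch (my own illustrative example: a Bernoulli likelihood with a Beta prior) verifying that $f(x|\theta)g(\theta)$ is proportional to another Beta density:

```python
import math

# Illustrative check: with likelihood f(x|theta) = theta^x (1-theta)^(1-x)
# and prior g(theta) = Beta(a, b), the product f(x|theta) g(theta) should be
# proportional to a Beta(a + x, b + 1 - x) density.

def beta_pdf(t, a, b):
    # Beta density computed via log-gamma for numerical stability
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

a, b, x = 2.0, 3.0, 1                      # prior Beta(2, 3), one observation x = 1
grid = [i / 1000 for i in range(1, 1000)]  # theta in (0, 1)

unnorm = [(t ** x) * ((1 - t) ** (1 - x)) * beta_pdf(t, a, b) for t in grid]
target = [beta_pdf(t, a + x, b + 1 - x) for t in grid]

# If conjugacy holds, unnorm / target is constant over the whole grid.
ratios = [u / q for u, q in zip(unnorm, target)]
print(max(ratios) / min(ratios))   # ≈ 1 up to floating-point error
```

The "solution" of the functional equation here is just the observation that multiplying by the likelihood shifts the Beta exponents, so the Beta family is closed under the update.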
A second thought is that they computed the parameters of a distribution from its samples and wanted to characterize the uncertainty of that estimate by another distribution; e.g., they studied the sample covariance of a (multivariate) normal distribution and found the inverse Wishart distribution for that purpose. They then concluded that this characterization of the sample estimate of a parameter has an interesting relationship with the original distribution, and hence defined conjugacy. If so, why does such a relationship exist for so many distributions? Is it a coincidence (to me, it currently looks like one), or is there some underlying logic?
I want to know the logic of how the theory of conjugacy develops rather than memorize a big table.
P.S. For convenience, we may restrict our discussion to exponential families.
@Mittens: This is too long to reply in a comment, so I add it here.
Your answer seems very formal and rigorous (as do both articles you referred to). Before I dive into it, I want to ask: was the theory developed all at once? Actually, I'm more interested in whether there was a stage where people simply discovered that the distribution governing the estimated parameter and the original distribution have a nice property under Bayes' rule, and only then defined conjugacy (instead of defining the prior on the parameter to have a conjugate form).
Take again the example of the multivariate normal distribution. Without knowing the theory of conjugate distributions, we could still derive the distribution of the sample covariance matrix, right? (At least, that is my understanding here.)
> The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate normal distribution
So I wonder whether it is a coincidence that this distribution happens to be a conjugate prior, that is to say, that when Bayes' rule is applied, the posterior and prior have the same form.
What I want to ask, exactly, is how you would explain this coincidence. If it is not a coincidence, how could you prove (or just show intuitively, without worrying too much about rigor) that this property holds over the whole exponential family? I have no doubt about the rigor of the theory; I just wonder how it developed.
Conjugate priors are defined mostly for exponential families of distributions. This is due to the fact that for such families, the parameters that define them form a convex set, and the log-partition function is convex, with its Fenchel-Legendre transform playing a central role. The theory relies heavily on convex analysis. The monograph "Brown, L.D., Fundamentals of Statistical Exponential Families, IMS Lecture Notes, Monograph Series 9, 1986" has a detailed account of this. Working knowledge of the basics of convex analysis is needed to really appreciate the technicalities.
For exponential families, finding conjugate priors works as follows.
Suppose $\mu$ is a Borel measure on $(\mathbb{R}^d,\mathscr{B}(\mathbb{R}^d))$. Consider a family of distributions $\{P_\theta:\theta\in\Theta\}$ of the form \begin{align} P_\theta(dx)=e^{\theta\cdot x-\Lambda(\theta)}\,\mu(dx)\tag{1}\label{one} \end{align} where $e^{-\Lambda(\theta)}$ is a normalizing factor, and $$\Theta=\{\theta\in\mathbb{R}^d:\int_{\mathbb{R}^d}e^{\theta\cdot x}\mu(dx)<\infty\}$$ Using Hölder's inequality, it is easy to check that $\Theta$ is a convex subset of $\mathbb{R}^d$.
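The Hölder step can be written out explicitly: for $\theta_1,\theta_2\in\Theta$ and $t\in(0,1)$, apply Hölder's inequality with exponents $p=1/t$ and $q=1/(1-t)$:
\begin{align}
\int_{\mathbb{R}^d} e^{(t\theta_1+(1-t)\theta_2)\cdot x}\,\mu(dx)
&=\int_{\mathbb{R}^d}\left(e^{\theta_1\cdot x}\right)^{t}\left(e^{\theta_2\cdot x}\right)^{1-t}\mu(dx)\\
&\le\left(\int_{\mathbb{R}^d} e^{\theta_1\cdot x}\,\mu(dx)\right)^{t}\left(\int_{\mathbb{R}^d} e^{\theta_2\cdot x}\,\mu(dx)\right)^{1-t}<\infty,
\end{align}
so $t\theta_1+(1-t)\theta_2\in\Theta$; the same computation gives the convexity of $E$ below.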
The conjugate family of prior distributions is the family of probability measures $\{\Pi_{a,\tau}:(a,\tau)\in E\}$ on $(\Theta,\mathscr{B}(\Theta))$ (here $\mathscr{B}(\Theta)$ is the Borel $\sigma$-algebra of subsets of $\Theta$), given by \begin{align} \Pi_{a, \tau}(d\theta)=D(a, \tau)e^{\tau\cdot \theta- a\Lambda(\theta)}\lambda(d\theta)\tag{2}\label{two} \end{align} where $\lambda$ is the Lebesgue measure restricted to $\mathscr{B}(\Theta)$, $D(a,\tau)$ is a normalizing factor, and $$E=\{(a,\tau)\in\mathbb{R}\times\mathbb{R}^d: \int_{\Theta}e^{\tau\cdot \theta- a\Lambda(\theta)}\,\lambda(d\theta)<\infty\}$$ Again, using Hölder's inequality, one checks that $E$ is a convex subset of $\mathbb{R}\times\mathbb{R}^d$.
Under the Bayesian paradigm, if the law of $X$ given $\boldsymbol{\theta}$ is given by \eqref{one}, then under a prior of the form \eqref{two} the posterior distribution of $\boldsymbol{\theta}$ given $X$ has density (w.r.t. the Lebesgue measure $\lambda$) given by \begin{align} \frac{d\mathbb{P}[\boldsymbol{\theta}\in d\theta|X]}{d\lambda}\propto e^{\theta\cdot(x+\tau) -(1+a)\Lambda(\theta)} \end{align} That is, the posterior distribution of $\boldsymbol{\theta}$ given $X=x$ is $\Pi_{1+a,x+\tau}$, which is of the form \eqref{two}. This justifies the use of the term conjugate prior: the posterior distribution belongs to the same family of distributions as the prior.
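As a sanity check on the update $(a,\tau)\mapsto(1+a,\,x+\tau)$, here is a small numerical sketch for one concrete family, the Poisson family in natural parameters ($\Lambda(\theta)=e^{\theta}$, base measure $\mu(\{k\})=1/k!$); the hyperparameters, datum, and grid are arbitrary illustrative choices:

```python
import math

# Poisson family in natural parameters:
#   P_theta({x}) = exp(theta*x - exp(theta)) / x!
# and the prior Pi_{a, tau} has density proportional to exp(tau*theta - a*exp(theta)).

def log_prior(theta, a, tau):
    # unnormalized log-density of Pi_{a, tau}
    return tau * theta - a * math.exp(theta)

def log_lik(x, theta):
    # log P_theta({x}) = theta*x - exp(theta) - log(x!)
    return theta * x - math.exp(theta) - math.lgamma(x + 1)

a, tau, x = 2.0, 3.0, 4                       # arbitrary hyperparameters and datum
grid = [-3.0 + 0.01 * i for i in range(601)]  # theta in [-3, 3]

# (likelihood * prior) divided by the claimed posterior Pi_{1+a, x+tau}:
# if the update rule holds, this ratio is constant in theta.
ratios = [
    math.exp(log_lik(x, th) + log_prior(th, a, tau) - log_prior(th, 1 + a, x + tau))
    for th in grid
]
print(max(ratios) / min(ratios))   # ≈ 1 up to floating-point error
```

Nothing here is specific to the Poisson family: the cancellation in the exponent is exactly the addition $\theta\cdot(x+\tau)-(1+a)\Lambda(\theta)$ from the display above.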
There are some technical considerations, which I now state as a theorem.
The Wishart distribution can be obtained along the lines discussed above; however, since it involves distributions on the space of positive semidefinite matrices, harmonic analysis provides a derivation of this distribution. A detailed analysis appears in Eaton, M.L., Multivariate Statistics: A Vector Space Approach, John Wiley and Sons, 1983, Chapter 8.
Denoting by $W(\Sigma,p,n)$ the Wishart distribution on $\mathcal{S}_p$ (the linear space of $p\times p$ real symmetric matrices), we have that
Notice that $p(S|\Sigma)$ defines an exponential family.
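The sampling-distribution fact quoted in the question can be probed by simulation without any conjugacy theory. The sketch below (my own illustrative choices of $\Sigma$ and $n$) checks the first-moment identity $E[S]=n\Sigma$ for scatter matrices $S=\sum_{i=1}^n X_iX_i^{\top}$ of multivariate normal draws, which the Wishart distribution must satisfy:

```python
import math
import random

# Simulate scatter matrices S = sum_i X_i X_i^T for X_i ~ N(0, Sigma), 2x2 case,
# and check E[S] = n * Sigma (the mean of the Wishart distribution W(Sigma, 2, n)).

random.seed(0)
Sigma = [[2.0, 0.6], [0.6, 1.0]]
# Cholesky factor of Sigma (2x2, by hand): Sigma = L L^T
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21 * l21)

def draw():
    # one draw from N(0, Sigma) via X = L z with z standard normal
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return (l11 * z1, l21 * z1 + l22 * z2)

n, reps = 5, 4000
mean_S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(reps):
    for _ in range(n):
        x1, x2 = draw()
        mean_S[0][0] += x1 * x1 / reps
        mean_S[0][1] += x1 * x2 / reps
        mean_S[1][0] += x2 * x1 / reps
        mean_S[1][1] += x2 * x2 / reps

# relative error of the empirical mean against n * Sigma, entrywise
err = max(abs(mean_S[i][j] - n * Sigma[i][j]) / (n * abs(Sigma[i][j]))
          for i in range(2) for j in range(2))
print(err)   # small Monte Carlo error
```

This is exactly the "discover the distribution of the estimator first" route the question describes; the conjugacy of $W(\Sigma,p,n)$-type priors under $p(S|\Sigma)$ then follows from the exponential-family structure noted above.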