Do 'Sample from' and 'insert parameter' commute?


The setting is as follows:

We are given random variables $X$ and $\Theta$, but we are not so much interested in $X$ itself: its distribution needs a parameter $\theta$, which is produced by $\Theta$. A concrete example:

$X$ is a binary variable mapping to $\{0,1\}$ and is Bernoulli distributed with parameter $\theta$; $\Theta$ is Beta-distributed with fixed parameters $\alpha_0, \beta_0$. This means that we actually model the conditional density

$$f_{X|\Theta}(x, \theta) = \theta^x (1-\theta)^{1-x}$$ and $$f_\Theta(\theta) = \frac{\Gamma(\alpha_0 + \beta_0)}{\Gamma(\alpha_0)\Gamma(\beta_0)} \theta^{\alpha_0 - 1} (1-\theta)^{\beta_0 - 1}$$

Now let us assume that I want to sample from $X$. People do the following:

(A):

1) Sample a value $\theta$ from the Beta distribution

2) Insert this value into the Bernoulli distribution and sample from it with this concrete parameter
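Steps 1) and 2) can be sketched in a few lines of Python (a minimal illustration; the concrete values $\alpha_0=2$, $\beta_0=3$ are my own choice, not from the question):

```python
import random

def sample_x(alpha0, beta0):
    """Scheme A: first draw theta ~ Beta(alpha0, beta0), then X ~ Bernoulli(theta)."""
    theta = random.betavariate(alpha0, beta0)   # step 1: sample the parameter
    return 1 if random.random() < theta else 0  # step 2: sample X with that parameter

random.seed(0)
samples = [sample_x(2.0, 3.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to E[Theta] = 2/(2+3) = 0.4
```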

We could equivalently do the following:

(B):

Compute the distribution of $X$ and then sample from it directly.

QUESTION: DO THESE METHODS GIVE THE SAME RESULT?

I guess that one should ask the question more precisely: do I really get the distribution of $X$ if I repeat (A) often enough? I don't see why an ''equation'' of the form

$$\text{Sample}(X|(\Theta|(\alpha_0, \beta_0))) = \text{Sample}(X|\text{Sample}(\Theta|(\alpha_0, \beta_0)))$$ should be true...

Note (I don't know whether it helps or not): in the example given it is easy to compute the distribution of $X$ as

$$f_X(x) = \int_\theta f_{X|\Theta}(x, \theta) f_\Theta(\theta) d\theta$$

and it turns out that

$$P[X=x] = \frac{B(\alpha_0+x, \beta_0+(1-x))}{B(\alpha_0, \beta_0)}$$

where $B$ is the Beta function.
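This closed form can be checked empirically: for $x=1$ it reads $P[X=1] = B(\alpha_0+1,\beta_0)/B(\alpha_0,\beta_0)$, which simplifies to $\alpha_0/(\alpha_0+\beta_0)$, and it should match the relative frequency of $X=1$ under scheme A. A Python sketch (the parameter values are again my own choice for illustration):

```python
import math
import random

def beta_fn(a, b):
    # Beta function computed via log-gamma for numerical stability
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

alpha0, beta0 = 2.0, 3.0  # hypothetical parameters for illustration

# Closed-form marginal probability P[X=1] = B(alpha0+1, beta0) / B(alpha0, beta0)
p1 = beta_fn(alpha0 + 1, beta0) / beta_fn(alpha0, beta0)

# Scheme A: empirical frequency of X=1 over many two-stage draws
random.seed(1)
n = 200_000
hits = 0
for _ in range(n):
    theta = random.betavariate(alpha0, beta0)  # step 1: sample the parameter
    hits += random.random() < theta            # step 2: sample X given theta
empirical = hits / n

print(p1, empirical)  # the two values should nearly agree
```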


Schemes A and B don't return the exact same samples, but each item returned has the same distribution under A as under B. The reason is precisely the equation $$ f_X(x) = \int_\theta f_{X|\Theta}(x, \theta) f_\Theta(\theta)\, d\theta.\tag1 $$

Under scheme A, the joint density of $(X,\Theta)$ is $$ f_{X,\Theta}(x,\theta)=f_{X|\Theta}(x,\theta)f_\Theta(\theta),\tag2 $$ so the marginal density of $X$ is obtained by integrating out $\Theta$, as in equation (1). What (1) is saying is that if you generate $\Theta$ from $f_\Theta$, then generate $X$ conditional on $\Theta$ from $f_{X|\Theta}$, and then throw away $\Theta$, the resulting $X$ has the same density as if you had generated $X$ from $f_X$ directly.

The reason why scheme A is often used is that there may not be a closed form for the marginal density $f_X$ in (1), in which case scheme B is not possible.

EDIT: An empirical simulation might help. Here's some code in R to apply scheme A to a situation where $\Theta$ has a normal distribution with mean $2$ and SD $1$, and having observed the value $\Theta=\theta$, we generate $X$ as normal with mean $\theta$ and SD $1$.

> set.seed(17)
> m <- 2
> theta <- rnorm(5000, m, 1)
> x <- rnorm(5000, theta, 1)
> mean(x)
[1] 2.014371
> sd(x)
[1] 1.41051
> mean(theta)
[1] 1.985225
> sd(theta)
[1] 0.9878088

When we simulate 5000 times, we find that $X$ has approximately mean $2$ and SD $\sqrt 2$. This agrees with (1), which gives the marginal density of $X$ as normal with mean $2$ and variance $2$.
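The mean and SD observed in the simulation also follow directly from the laws of total expectation and total variance (a standard derivation, added here for completeness):

$$E[X] = E\big[E[X\mid\Theta]\big] = E[\Theta] = 2,$$

$$\operatorname{Var}(X) = E\big[\operatorname{Var}(X\mid\Theta)\big] + \operatorname{Var}\big(E[X\mid\Theta]\big) = 1 + 1 = 2.$$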

You can compute the joint distribution of $(X,\Theta)$ using (2): it is bivariate normal with mean vector $\begin{pmatrix}2\\2\\\end{pmatrix}$ and covariance matrix $\begin{pmatrix}2&1\\1&1\\\end{pmatrix}$ (writing $X=\Theta+Z$ with $Z\sim N(0,1)$ independent of $\Theta$ gives $\operatorname{Cov}(X,\Theta)=\operatorname{Var}(\Theta)=1$). The simulation also agrees with this:

[Scatter plot of the simulated $(X, \Theta)$ pairs.]