The setting is as follows:
We are given random variables $X$ and $\Theta$, but we are not so much interested in $X$ itself; rather, its distribution depends on a parameter $\theta$, which is a realization of $\Theta$. Concrete example:
$X$ is a binary variable mapping to $\{0,1\}$ and is Bernoulli distributed with parameter $\theta$. $\Theta$ is Beta distributed with fixed parameters $\alpha_0, \beta_0$. This means that we actually model the conditional density
$$f_{X|\Theta}(x, \theta) = \theta^x (1-\theta)^{1-x}$$ and $$f_\Theta(\theta) = \frac{\Gamma(\alpha_0 + \beta_0)}{\Gamma(\alpha_0)\Gamma(\beta_0)} \theta^{\alpha_0 - 1} (1-\theta)^{\beta_0 - 1}$$
Now assume that I want to sample from $X$. People do the following:
(A):
1) Sample a value $\theta$ from the Beta distribution
2) Insert this value into the Bernoulli distribution and sample from it with this concrete parameter
We could equivalently do the following:
(B):
Compute the distribution of $X$ and then sample from it directly.
QUESTION: DO THESE METHODS GIVE THE SAME RESULT?
I guess one should ask the question more precisely: do I really get the distribution of $X$ if I repeat (A) often enough? I don't see why an "equation" of the form
$$\text{Sample}(X|(\Theta|(\alpha_0, \beta_0))) = \text{Sample}(X|\text{Sample}(\Theta|(\alpha_0, \beta_0)))$$ should be true...
Note (I don't know whether it helps or not): in the example given it is easy to compute the distribution of $X$ as
$$f_X(x) = \int_\theta f_{X|\Theta}(x, \theta) f_\Theta(\theta) d\theta$$
and it turns out that
$$P[X=x] = \frac{B(\alpha_0+x, \beta_0+(1-x))}{B(\alpha_0, \beta_0)}$$
where $B$ is the Beta function.
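For concreteness, here is a quick Python sketch (the parameter values and sample size are arbitrary choices of mine) comparing scheme A against sampling directly from the closed-form marginal, which for $x=1$ reduces to $B(\alpha_0+1,\beta_0)/B(\alpha_0,\beta_0) = \alpha_0/(\alpha_0+\beta_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, beta0 = 2.0, 5.0
n = 100_000

# Scheme A: sample theta from Beta(alpha0, beta0),
# then X | theta from Bernoulli(theta)
theta = rng.beta(alpha0, beta0, size=n)
x_a = rng.binomial(1, theta)

# Scheme B: sample X directly from its marginal,
# P[X = 1] = B(alpha0 + 1, beta0) / B(alpha0, beta0) = alpha0 / (alpha0 + beta0)
p1 = alpha0 / (alpha0 + beta0)
x_b = rng.binomial(1, p1, size=n)

print(x_a.mean(), x_b.mean(), p1)  # both empirical means close to 2/7
```

Empirically the two schemes produce the same frequency of ones, up to Monte Carlo error.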
Schemes A and B don't return the exact same samples, but each item returned has the same distribution under A as under B. The reason is precisely the equation $$ f_X(x) = \int_\theta f_{X|\Theta}(x, \theta) f_\Theta(\theta) d\theta.\tag1 $$ Under scheme A, the joint density of $(X,\Theta)$ is $$ f_{X,\Theta}(x,\theta)=f_{X|\Theta}(x,\theta)f_\Theta(\theta)\tag2 $$ so the marginal density of $X$ is obtained by integrating out $\Theta$, as in equation (1). What (1) is saying is that if you generate $\Theta$ from $f_\Theta$, and then generate $X$ conditional on $\Theta$ from $f_{X|\Theta}$, and then throw away $\Theta$, the resulting $X$ has the same density as if you generated $X$ from $f_X$ directly.
The reason why scheme A is often used is that there may not be a closed form for the marginal density $f_X$ in (1), so scheme B is not possible.
EDIT: An empirical simulation might help. Here's some code in R to apply scheme A to a situation where $\Theta$ has a normal distribution with mean $2$ and SD $1$, and having observed the value $\Theta=\theta$, we generate $X$ as normal with mean $\theta$ and SD $1$.
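(The R snippet is not reproduced here; the following is an equivalent Python sketch of the same scheme-A simulation.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Scheme A: Theta ~ N(2, 1), then X | Theta = theta ~ N(theta, 1)
theta = rng.normal(loc=2.0, scale=1.0, size=n)
x = rng.normal(loc=theta, scale=1.0)

print(x.mean(), x.std())  # approximately 2 and sqrt(2)
```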
When we simulate 5000 times, we find that $X$ has approximately mean $2$ and SD $\sqrt 2$. This agrees with (1), which gives the marginal density of $X$ as normal with mean $2$ and variance $2$.
You can compute the joint distribution of $(X,\Theta)$ using (2): It is bivariate normal with mean vector $\begin{pmatrix}2\\2\\\end{pmatrix}$ and covariance matrix $\begin{pmatrix}2&1\\1&1\\\end{pmatrix}$. The simulation also agrees with this:
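(Again as a Python sketch rather than the original R code; a larger sample size is used here so the empirical moments are stable.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Scheme A joint draw: Theta ~ N(2, 1), then X | Theta ~ N(Theta, 1)
theta = rng.normal(2.0, 1.0, size=n)
x = rng.normal(theta, 1.0)

samples = np.stack([x, theta])       # row 0: X, row 1: Theta
print(samples.mean(axis=1))          # close to [2, 2]
print(np.cov(samples))               # close to [[2, 1], [1, 1]]
```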