In the Metropolis-Hastings algorithm, we choose to accept our sample with probability: $$\rho = \min\left\{1, \frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)}\right\}$$
where $x$ is the current state of the Markov chain, $x'$ is the proposed state, $p(\cdot)$ is the target distribution, and $g(\cdot|\cdot)$ is the proposal distribution.
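For concreteness, here is a minimal sketch of the algorithm as I understand it (the toy target, the symmetric Gaussian random-walk proposal, and all function names are my own choices, not from any particular library):

```python
import math
import random

def metropolis_hastings(log_p, propose, x0, n_steps):
    """Metropolis-Hastings with a symmetric proposal.

    log_p   : log of the (possibly unnormalised) target density p(.)
    propose : draws x' given x; assumed symmetric, so g(x|x')/g(x'|x) = 1
    x0      : initial state of the chain
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose(x)
        # rho = min(1, p(x')/p(x)) for a symmetric proposal, in log space
        log_rho = min(0.0, log_p(x_new) - log_p(x))
        if math.log(random.random()) < log_rho:
            x = x_new          # accept the proposal
        samples.append(x)      # the current state is recorded either way
    return samples

# Toy example: standard normal target, Gaussian random-walk proposal
random.seed(0)
log_p = lambda x: -0.5 * x * x
propose = lambda x: x + random.gauss(0.0, 1.0)
samples = metropolis_hastings(log_p, propose, x0=0.0, n_steps=50000)
```

With enough steps, the empirical mean and variance of `samples` come out near 0 and 1, which is exactly the behaviour I'm asking about: why do the recorded states end up distributed as $p(x)$?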
I have seen the proofs which show that using this $\rho$ as the acceptance probability makes the chain satisfy the detailed-balance equation: $$p(x)\,P(x'|x) = p(x')\,P(x|x')$$ where $P(\cdot|\cdot)$ is the resulting transition kernel of the chain.
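To make detailed balance concrete, here is a small numeric check on a three-state chain (the toy target and the uniform proposal are my own choices for illustration): with the MH acceptance rule, $p(x)P(x'|x) = p(x')P(x|x')$ holds exactly for every pair of states, and summing over $x$ immediately gives stationarity of $p$.

```python
# Numeric check of detailed balance on a tiny discrete chain.
# Target p over states {0, 1, 2}; proposal g is uniform over the other two states.
p = [0.2, 0.3, 0.5]
n = len(p)
g = 1.0 / (n - 1)  # symmetric proposal probability

# MH transition matrix: P[i][j] = g * min(1, p[j]/p[i]) for j != i
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = g * min(1.0, p[j] / p[i])
    P[i][i] = 1.0 - sum(P[i])  # rejected mass stays at state i

# Detailed balance: p_i P_ij == p_j P_ji for all pairs
for i in range(n):
    for j in range(n):
        assert abs(p[i] * P[i][j] - p[j] * P[j][i]) < 1e-12

# Stationarity follows by summing over i: (p P)_j == p_j
for j in range(n):
    assert abs(sum(p[i] * P[i][j] for i in range(n)) - p[j]) < 1e-12
```

The stationarity check at the end is the step I'd like intuition for: detailed balance implies $p$ is invariant, but invariance alone doesn't obviously explain why the chain *converges* to $p$ from an arbitrary start.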
I get that part; nothing fancy. But I don't understand something much more fundamental: why is it that when we collect a bunch of samples using these criteria (proposal + acceptance), the samples actually distribute as $p(x)$? Why does a chain whose transitions merely satisfy detailed balance with respect to $p(x)$ reach $p(x)$ as its steady state? And on a related note, why is there a burn-in period if, the entire way through, we are selecting values that satisfy the balance equation, and our criteria never change?
Does this have to do with the law of large numbers? Is this just one of those things, like the central limit theorem, that you just have to take at face value as an empirical truth?
For accept-reject sampling, I saw a pretty elegant proof that used Bayes' rule to show definitively that each accepted sample is exactly a draw from the target distribution. I am struggling to find a similarly satisfying argument for why the Metropolis-Hastings method does the same.
Cheers!