On the page: https://en.wikipedia.org/wiki/Bayesian_inference#Formal_description_of_Bayesian_inference there is the result:
$$p(\theta \mid \mathbf{X},\alpha) = \frac{p(\mathbf{X} \mid \theta) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} $$
which I am having trouble deriving. Here's my attempt. We have:
$$p(\theta, \mathbf{X},\alpha) = p(\theta \mid \mathbf{X},\alpha)p(\mathbf{X}, \alpha) = p(\theta \mid \mathbf{X},\alpha)p(\mathbf{X} \mid \alpha)p(\alpha)$$ also:
$$p(\theta, \mathbf{X},\alpha) = p( \mathbf{X}\mid \theta, \alpha)p(\theta, \alpha) = p( \mathbf{X}\mid \theta, \alpha)p(\theta\mid \alpha)p(\alpha)$$
Equating these, we have: $$p(\theta \mid \mathbf{X},\alpha) = \frac{p( \mathbf{X}\mid \theta, \alpha)p(\theta\mid \alpha)p(\alpha)}{p(\mathbf{X} \mid \alpha)p(\alpha)} =\frac{p( \mathbf{X}\mid \theta, \alpha)p(\theta\mid \alpha)}{p(\mathbf{X} \mid \alpha)} $$
which is not quite the same as the expression given, because we have a $p( \mathbf{X}\mid \theta, \alpha) $ term rather than a $p(\mathbf{X}\mid \theta) $ term. Where am I going wrong?
Bayes' Theorem (with everything additionally conditioned on $C$) states that: $$\begin{align} \mathsf P(A\mid B, C) & = \dfrac{\mathsf P(B\mid A, C)\;\mathsf P(A\mid C)}{\mathsf P(B\mid C)} \\[2ex] \therefore p(\theta \mid \mathbf X,\alpha) & = \dfrac{p(\mathbf X\mid \theta, \alpha)\; p(\theta\mid \alpha)}{p(\mathbf X\mid \alpha)} \end{align}$$
Now $\mathbf X$ is a vector of data points $x_i$, each drawn from a distribution determined by the parameter $\theta$, which in turn has a distribution determined by the (hyper)parameter $\alpha$. This means $p(\mathbf X\mid \theta, \alpha) = p(\mathbf X\mid \theta)$. (Once you know the parameter, the hyperparameter adds no additional information towards determining the probability measure of the vector of data points: $\alpha \to \theta \to \mathbf X$ forms a chain, so $\mathbf X$ is conditionally independent of $\alpha$ given $\theta$.)
Thus: $$\begin{align} p(\theta \mid \mathbf X,\alpha) & = \dfrac{p(\mathbf X\mid \theta)\; p(\theta\mid \alpha)}{p(\mathbf X\mid \alpha)} \end{align}$$
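As a sanity check, the identity can be verified numerically in a made-up discrete hierarchical model (the two parameter values and all probabilities below are arbitrary assumptions, not from the original post): the posterior computed via $p(\mathbf X\mid\theta)\,p(\theta\mid\alpha)/p(\mathbf X\mid\alpha)$ matches the one obtained by normalising the joint over $\theta$.

```python
import numpy as np

# Hypothetical toy model: a fixed hyperparameter alpha determines the prior
# over theta in {0, 1}; theta determines the likelihood of x in {0, 1}.
p_theta_given_alpha = np.array([0.3, 0.7])   # p(theta | alpha)
p_x_given_theta = np.array([[0.9, 0.1],      # p(x | theta = 0), for x = 0, 1
                            [0.2, 0.8]])     # p(x | theta = 1), for x = 0, 1

x = 1  # observed data point

# Evidence: p(x | alpha) = sum_theta p(x | theta) p(theta | alpha)
p_x_given_alpha = (p_x_given_theta[:, x] * p_theta_given_alpha).sum()

# Posterior via the formula derived above:
# p(theta | x, alpha) = p(x | theta) p(theta | alpha) / p(x | alpha)
posterior = p_x_given_theta[:, x] * p_theta_given_alpha / p_x_given_alpha

# Cross-check: normalise the joint p(theta, x | alpha) over theta directly
joint = p_x_given_theta[:, x] * p_theta_given_alpha
posterior_from_joint = joint / joint.sum()

assert np.allclose(posterior, posterior_from_joint)
assert np.isclose(posterior.sum(), 1.0)
print(posterior)
```

Note that $p(\mathbf X\mid\theta,\alpha) = p(\mathbf X\mid\theta)$ is built into the model here: the likelihood table is indexed by $\theta$ alone, so conditioning on $\alpha$ as well cannot change it.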