I have read other materials on Bayesian statistics but am using the Wikipedia page on Bayesian inference to frame this question.
I have a working understanding of the $$P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$$ version of Bayesian inference, and can do examples/exercises with that method. But as soon as people shift to $$P(\theta \mid X, \alpha) = \frac{P(X \mid \theta) P(\theta \mid \alpha)}{P(X \mid \alpha)}$$ I get confused. I know that they are describing the same theorem, and I have a basic understanding of probability distributions and their parameters, but this still is not clicking for me. Could someone either help me conceptually or give a worked example that might help me understand the formal definition? I'm trying to reach an intuitive understanding of this. I have done a lot of reading and am still extremely confused.
Thank you!
Just work with the discrete setting first. All that is being said is the following. You suspect that some random variable $X$ (say, with values $0,1$) may have one of several possible distribution laws: (1) $P(0)=\frac 13, P(1)=\frac 23$; (2) $P(0)=\frac 12, P(1)=\frac 12$; or (3) $P(0)=\frac 23, P(1)=\frac 13$. So your $\theta\in\{1,2,3\}$ is just the choice of this law, and you want to figure out which law is the case. All you can do is make an initial guess (say, you suspect that (1) is more likely than the other two, which you consider equally likely). So you make an educated guess and bet that there is about a $\frac 12$ chance that you are dealing with law (1), and assign a $\frac 14$ chance to each of the other two. This prior assignment plays the role of $\alpha$.
Now you observe $X$ and it comes out as $1$, say. Look at how this observation changes your opinion about the likelihood of each law. Remember that your model of the generation of $X$ is now a two-step one: first you choose one of the three probability laws at random (according to your prior), and then you choose $X$ according to that law.
The full probability $P(X\mid\alpha)$ of getting $X=1$ is then $\frac 12\cdot\frac 23+\frac 14\cdot\frac 12+\frac 14\cdot\frac 13=\frac{13}{24}$. Now you just look at the part each law contributed to this total probability, which is given by the corresponding individual product of fractions $P(X\mid\theta)P(\theta\mid\alpha)$ (I wrote the factors in the reverse order above). Those parts are $\frac 13=\frac{8}{24}$, $\frac 18=\frac 3{24}$, and $\frac 1{12}=\frac 2{24}$. Now just compute what portion each part makes of the whole $\frac{13}{24}$: you get $\frac 8{13}$, $\frac 3{13}$, and $\frac 2{13}$, respectively. Thus your belief in law (1) got reinforced while your beliefs in the other two laws got diminished (which is not really surprising). The next sample will create another update, and so on.
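If it helps to see the arithmetic mechanized, here is a small Python sketch of the same update, using exact fractions so the numbers match the hand computation above (the priors and likelihoods are the ones assumed in this example, not anything canonical):

```python
from fractions import Fraction

# Prior P(theta | alpha) over the three candidate laws: 1/2, 1/4, 1/4.
priors = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
# Likelihood P(X = 1 | theta) under each law: 2/3, 1/2, 1/3.
likelihoods = [Fraction(2, 3), Fraction(1, 2), Fraction(1, 3)]

# Evidence P(X = 1 | alpha): sum over laws of prior * likelihood.
evidence = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior P(theta | X = 1, alpha): each law's contribution as a
# fraction of the whole.
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]

print(evidence)    # 13/24
print(posteriors)  # [Fraction(8, 13), Fraction(3, 13), Fraction(2, 13)]
```

Feeding the posteriors back in as the new priors and observing another sample repeats the update, which is exactly the "next sample" step described above.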
Just keep in mind that what you are trying to evaluate here is not the likelihood of one event given another event (like the probability of rain given that the sky is cloudy), but the likelihood of the distribution given an event. For example: given that rain fell, what is the chance that it is the first weather station that predicts the weather correctly, if it said it would rain with probability 20% while the second one put that chance at 40%, assuming that one of them is always correct and the other is making its numbers up, but you don't know which one is which? BTW, you can try this example yourself now. All you need is to assign some prior probabilities to the two weather stations and see how your opinions of them get updated after you get wet when walking outside.
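As a check on that exercise, here is one possible worked version in Python, under my own assumption of equal prior trust in the two stations (the text leaves the priors up to you):

```python
from fractions import Fraction

# Assumed priors: equal initial trust in each station (my choice, not
# specified in the exercise).
priors = {"station 1": Fraction(1, 2), "station 2": Fraction(1, 2)}
# If a station is the truthful one, the true chance of rain is whatever
# it announced: 20% for station 1, 40% for station 2.
rain_prob = {"station 1": Fraction(1, 5), "station 2": Fraction(2, 5)}

# Evidence: total probability of rain under this two-step model.
evidence = sum(priors[s] * rain_prob[s] for s in priors)

# Posterior probability that each station is the truthful one,
# given that it did rain.
posteriors = {s: priors[s] * rain_prob[s] / evidence for s in priors}

print(posteriors)  # station 1: 1/3, station 2: 2/3
```

Getting wet shifts your belief toward station 2, since it assigned the higher probability to the rain that actually happened; with different priors the split changes, but the direction of the update does not.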