I've been reading a lot about Bayesian Analysis and there is something that is not quite clicking for me with respect to the motivation for using MCMC.
Assume we are doing analysis of the form: P(Hypothesis | Data) = P(Data | Hypothesis) * P(Hypothesis) / P(Data)
I understand that it may be an intractable problem to calculate P(Data) using analytical methods due to the complexity of the integral.
With MCMC, my understanding is that we:
- Choose a random proposed starting value for the hypothesis.
- Calculate P(Data | Hypothesis) * P(Hypothesis) (just the numerator of the right-hand side of the equation).
- Choose another value for the hypothesis by adding a delta, drawn from a proposal distribution, to the current value, and again calculate P(Data | Hypothesis) * P(Hypothesis).
- If the calculated value is higher, accept the proposed value of the hypothesis. If the value is lower, accept it with a probability equal to the ratio of the new value to the old; otherwise reject it and keep the current value.
- Repeat many times. Put all of the sampled values of the hypothesis into a histogram and then normalize it to turn it into a PDF, which becomes your estimate of the posterior.
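To make sure I'm describing the algorithm correctly, here is a toy sketch of those steps that I put together myself (the problem, names, and numbers are all my own invention: inferring the mean mu of a Gaussian with known sigma = 1, under a standard normal prior on mu):

```python
import math
import random

# Toy data (made up for illustration)
data = [1.2, 0.8, 1.5, 0.9, 1.1]

def log_numerator(mu):
    # log( P(Data | mu) * P(mu) ) -- only the numerator of Bayes' rule,
    # so P(Data) is never needed.
    log_likelihood = sum(-0.5 * (x - mu) ** 2 for x in data)
    log_prior = -0.5 * mu ** 2
    return log_likelihood + log_prior

random.seed(0)
mu = 0.0                                  # arbitrary starting value
samples = []
for _ in range(10_000):
    proposal = mu + random.gauss(0, 0.5)  # delta from the proposal distribution
    log_ratio = log_numerator(proposal) - log_numerator(mu)
    # Accept if the proposal scores higher; otherwise accept with
    # probability equal to the ratio of the two unnormalized densities.
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        mu = proposal
    samples.append(mu)                    # record the current value each step

# A histogram of `samples` (after discarding some burn-in) approximates
# the posterior without ever computing P(Data).
posterior_mean = sum(samples[1000:]) / len(samples[1000:])
```

Note that the loop records the current value on every iteration, including after a rejection, which is what makes high-density regions show up more often in the histogram.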
Basically, we are saying:
- We are going to run an algorithm which will try many proposed values of the hypothesis. The algorithm will search the entire space of possible values of the hypothesis but will visit values that correspond to areas with a higher probability density more frequently. Therefore, by binning and counting the results, we can infer the shape of the probability density function.
I understand all of that but I don't understand why we are doing it instead of a different approach. Wouldn't it be a lot simpler to do this instead?
- Set lower and upper bounds for the value of the hypothesis.
- Choose 10,000 values that are equally distributed between those two bounds.
- Calculate P(Data | Hypothesis) * P(Hypothesis) for each value.
- Plot the values on a chart.
- Run some quick diagnostics to be sure you have chosen reasonable bounds (e.g. there is an area of higher values somewhere in the middle and values that are much much smaller at the upper and lower bounds.)
- Normalize the chart so that area underneath it is equal to 1. At that point, wouldn't the chart be equal to the probability density function calculated by MCMC but without all the complexity?
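For concreteness, here is the grid computation I have in mind, on the same kind of toy problem (my own example: Gaussian mean with known sigma = 1 and a standard normal prior; the bounds and grid size are just guesses):

```python
import math

# Toy data (made up for illustration)
data = [1.2, 0.8, 1.5, 0.9, 1.1]

def unnormalized_posterior(mu):
    # P(Data | mu) * P(mu) -- Gaussian likelihood with sigma = 1,
    # standard normal prior on mu; the numerator of Bayes' rule only.
    log_likelihood = sum(-0.5 * (x - mu) ** 2 for x in data)
    log_prior = -0.5 * mu ** 2
    return math.exp(log_likelihood + log_prior)

# Step 1-2: guess bounds and lay down an evenly spaced grid
lower, upper, n = -3.0, 5.0, 10_000
step = (upper - lower) / (n - 1)
grid = [lower + i * step for i in range(n)]

# Step 3: evaluate the numerator at every grid point
values = [unnormalized_posterior(mu) for mu in grid]

# Step 6: normalize so the area under the curve is 1 (Riemann sum)
area = sum(values) * step
density = [v / area for v in values]

# A summary computed from the grid, e.g. the posterior mean
grid_mean = sum(mu * d * step for mu, d in zip(grid, density))
```

The diagnostic step would then amount to checking that `values` is tiny near both ends of `grid` and peaks somewhere in the middle.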
Is my problem that I am thinking of this for a simple example (e.g. where there are one or two parameters and it is reasonably easy to guess reasonable bounds) but it quickly becomes computationally inefficient for real-world problems? Or is my problem more fundamental and I am completely misunderstanding the situation?