I'm revisiting conditional probability* after many years and fear I never really understood it as well as I thought I did. I have no problem taking the definition:
$$ \mathbb{P}(A\mid B) = \dfrac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} $$
and applying it to simple problems. What I'm having trouble understanding is where this equation came from in the first place. I checked Wikipedia: https://en.wikipedia.org/wiki/Conditional_probability which suggests it is a definition, or can even be taken as an axiom of probability. If it is a definition, it can be shown to honor the Kolmogorov axioms of probability, and I sort of follow all this. And clearly this definition, when used in problems, agrees with my intuitive ideas of how conditional probability should work.

Perhaps that's enough, but I still have this gnawing feeling that it should be "obvious" to me that this is the correct equation. I'm failing to see how conditional probability should be related to the intersection AND why dividing by $\mathbb{P}(B)$ is the correct way to go. I tried thinking about the problem in terms of areas, but that didn't help. I do see that if we asked "what is the probability of $B$ given $B$?" we would have:
$$ \mathbb{P}(B\mid B) = \dfrac{\mathbb{P}(B \cap B)}{\mathbb{P}(B)} $$
and this would equal 1, so perhaps that is some justification for the division by $\mathbb{P}(B)$.
Sorry if I'm rambling, but if this was completely clear in my mind, I wouldn't have a question!
So the questions are: 1. Where does this equation come from? 2. Can someone guide me to see that this is what one would come up with if one were trying to define the idea of "how likely is $A$ given $B$"?
And as always, if there is a better group to ask this in, please let me know. -Dave
*I've started a course on Bayesian Probability: https://www.coursera.org/learn/bayesian-statistics
A simple Google Search gave me the following PDF, which has a good explanation for the intuition behind conditional probability.
An excerpt to make the answer complete:
Now, it is clear how we define $\operatorname{Pr}[A \mid B]$: namely, we just sum up these normalized probabilities over all sample points that lie in both $A$ and $B$:
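The excerpt breaks off before the formula itself. Assuming "normalized probabilities" means that each sample point $\omega \in B$ is reweighted to $\operatorname{Pr}[\omega]/\operatorname{Pr}[B]$ (my reading of the excerpt, not a quotation from the PDF), the sum it describes would be:
$$ \operatorname{Pr}[A \mid B] = \sum_{\omega \in A \cap B} \dfrac{\operatorname{Pr}[\omega]}{\operatorname{Pr}[B]} = \dfrac{\operatorname{Pr}[A \cap B]}{\operatorname{Pr}[B]} $$
A small sketch can check this on a finite sample space. The fair die and the events `A` and `B` below are hypothetical choices for illustration only, as are the helper names `prob` and `conditional`:

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = {w: Fraction(1, 6) for w in range(1, 7)}
A = {2, 3, 5}  # illustrative event: the roll is prime
B = {2, 4, 6}  # illustrative event: the roll is even


def prob(event):
    """Probability of an event: the sum of its sample-point probabilities."""
    return sum(omega[w] for w in event)


def conditional(a, b):
    """Pr[a | b] by restricting to b and renormalizing.

    Every sample point in b is reweighted to Pr[w] / Pr[b], so the new
    weights sum to 1 over b. Points outside b get no weight at all.
    """
    p_b = prob(b)
    normalized = {w: omega[w] / p_b for w in b}
    # Sum the normalized weights over the points lying in both a and b.
    return sum(normalized[w] for w in a & b)


p_a_given_b = conditional(A, B)
ratio = prob(A & B) / prob(B)    # the textbook formula
p_b_given_b = conditional(B, B)  # the sanity check from the question

print(p_a_given_b, ratio, p_b_given_b)
```

Both routes give the same value because dividing each term of the sum by $\operatorname{Pr}[B]$ is the same as dividing the whole sum by it. The intersection appears because only points of $A$ that survive the restriction to $B$ can contribute, and the division is what makes the restricted weights sum to 1, which is the $\mathbb{P}(B \mid B) = 1$ observation in the question.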