I'm reading Jeffrey's "Subjective Probability" and was intrigued by the following passage:
"The quotient rule is often called the definition of conditional probability. It is not. If it were, we could never be in the position we are often in, of making a conditional judgment--say, about how a coin that may or may not be tossed will land--without attributing some particular positive value to the probability of the condition. We can judge that $pr(\text{head} \mid \text{tossed}) = 1/2$ even though
$\frac{pr(\text{head} \ \land \ \text{tossed})}{pr(\text{tossed})} = \frac{\textit{undefined}}{\textit{undefined}}$.
[...] The quotient rule merely restates the product rule; and the product rule is no definition but an essential principle relating two distinct sorts of probability." (Jeffrey 2004, p.14)
The product rule is this:
$pr(H \ \land D) = pr(H|D) pr(D)$
The quotient rule is this:
$pr(H|D) = \frac{pr(H \ \land \ D)}{pr(D)}$, provided $pr(D) > 0$.
$pr(H|D)$ of course means the conditional probability of $H$ given $D$.
It may also be useful to know that Jeffrey uses a Dutch book argument to motivate the product rule (as he also does in motivating the special disjunction axiom). So what is he saying: is the product rule a further axiom of the probability calculus? If not, how is it a theorem? I have been confused more than once by people defining conditional probability out of the blue, without seeing how its definition is implicit in the axioms.
In short, what does Jeffrey mean when he says that the product rule is an "essential principle"? How exactly does it relate to the axioms?
Too long for a comment. Probability theory is based on measure theory. Just as linear algebra begins with the definition of a vector space, the definition of a probability space is laid out with some rules, often labelled "axioms." These rules define what can be called an "abstract probability space." Such a space is a working technical tool; the probabilities are prespecified and given, as are all the events. Intuitively, this means that if we know a probability space, we have knowledge of all the probabilities and events that can be modelled with that space.

The statistician, by contrast, faces a real-life situation: for example, a factory produces a component and they want to estimate the probability of a defective component, or something more complicated such as the weather, winning the lotto, a bet on a game, or estimating whether a treatment will cure a disease. A fairly solid assumption is that if we knew absolutely every single physical variable, we could predict the outcomes with certainty; equivalently, true randomness does not exist. (Warning: this is a contentious claim, as it seems that at infinitesimal scales some fundamental particles exhibit true randomness, emphasis on "it seems.") Such omniscience appears impossible, however, so the statistician instead has to pretend that some mathematical model describes the phenomenon fairly well. This assumption is tacit (i.e., implicit, not stated) and often forgotten. For example, in the case of the factory and its components, the statistician assumes that identical conditions occur every time a new component is created; for the weather, they make a similarly impossible assumption (namely, that the weather behaves the same under the same conditions). In any case, the assumption that a probability space can model a phenomenon is a strong one.
Now, mathematically, you can ask what the intuition is behind defining $P(A \mid E) = P(A \cap E) / P(E).$ First, measure theory tells us that $A \mapsto P(A \cap E)$ defines a measure, albeit not a probability measure, on the same space on which $A \mapsto P(A)$ is defined. This measure is concentrated on $E$; technically, this means it assigns value zero to any event outside of $E$ (i.e., inside $E^\complement$). It is natural to call it the measure induced on $E$ by the measure $P,$ or something similar. As already mentioned, it is not a probability measure, so it is convenient to re-scale it to make it one. Thus, we consider $A \mapsto \lambda P(A \cap E)$ for some constant $\lambda$ that makes this a probability measure. Trivially from the axioms of probability, $\lambda = \dfrac{1}{P(E)}$ (such a measure has to take the value $1$ when $A = \Omega,$ but when $A = \Omega$ we have $\lambda P(E) = 1$). Thus, the new measure $A \mapsto P(A \cap E) / P(E)$ is called the probability measure induced on $E$ by $P.$

Why, then, is this called "conditioning on $E$" or "given $E$"? Essentially because $E$ is an almost sure event for this new measure. Indeed, evaluating at $A = E$ gives $P(E \cap E) / P(E) = 1.$ Therefore, the induced measure so defined, namely $A \mapsto P(A \cap E) / P(E),$ has $E$ as an almost sure event. Intuitively, we are conditioning on $E$ happening. The notation $P(A \mid E)$ is just notation; it could be replaced by any other convenient symbol such as $P_E(A)$ or $P^E(A).$ (Kai Lai Chung uses $P_E(A).$) It then follows by pure mathematics that $P(A \cap E) = P(A \mid E) P(E)$ (note we assume $P(E) > 0$; if $P(E) = 0,$ that means $E$ is an event that shouldn't happen under $P$, or equivalently, if $E$ did happen, then the probability measure $P$ is not a good fit).
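The rescaling construction above can be checked concretely on a finite probability space. Here is a minimal sketch in Python (the names `prob`, `P`, `E`, `A` are mine, chosen for illustration): two fair coin tosses, with $E$ = "first toss is heads" and $A$ = "second toss is heads". The induced measure $A \mapsto P(A \cap E)$ has total mass $P(E) \neq 1$, and dividing by $P(E)$ restores a probability measure under which $E$ is sure, with the product rule holding by construction.

```python
from fractions import Fraction

# Finite sample space: outcomes of two fair coin tosses, uniform measure P.
omega = {"HH", "HT", "TH", "TT"}
prob = {w: Fraction(1, 4) for w in omega}

def P(event):
    """Probability of an event (a subset of omega)."""
    return sum(prob[w] for w in event)

E = {"HH", "HT"}  # "first toss is heads"
A = {"HH", "TH"}  # "second toss is heads"

# The induced measure A |-> P(A & E) is concentrated on E,
# but its total mass is P(E) = 1/2, so it is not a probability measure.
induced_mass = P(omega & E)          # = P(E) = 1/2

# Rescale by lambda = 1/P(E) to get the conditional probability measure.
P_given_E = P(A & E) / P(E)          # = (1/4) / (1/2) = 1/2

assert P(omega & E) / P(E) == 1      # total mass 1: a genuine probability measure
assert P(E & E) / P(E) == 1          # E is an (almost) sure event for the new measure
assert P(A & E) == P_given_E * P(E)  # the product rule holds by construction
```

Note that the product rule assertion at the end is exactly Jeffrey's $pr(H \land D) = pr(H \mid D)\,pr(D)$: on this formal, measure-theoretic reading it is a triviality once $P(E) > 0$, which is consistent with Jeffrey locating the philosophical substance elsewhere.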