I have an algorithm which tries to calculate some $\operatorname{Pr}(X | Y_1 Y_2 \dots)$ (where juxtaposition means event intersection: "given $Y_1$ and $Y_2$ and $\dots$ have happened"). We have some control over the events $Y_1, Y_2, \dots$ that we measure, and we choose them to make the $Y_i$ largely independent. The algorithm works by recursion on the $Y_i.$ The problem is, I made an assumption in the recursively-applied formula that is not obviously equivalent to independence of the $Y_i$, so I would like to know how badly I've shot myself in the foot.
So for the recursion we define $N = Y_k$ and $O = Y_1 \dots Y_{k-1}$ (mnemonic: "new fact" and "old evidence" respectively) -- then repeated application of Bayes' theorem shows that you can essentially tack an extra $|O$ onto every term of the usual Bayes' theorem $\operatorname{Pr}(X|N) = \frac{\operatorname{Pr}(N|X)}{\operatorname{Pr}(N)}\cdot\operatorname{Pr}(X)$ to get $$\operatorname{Pr}(X|NO) = \frac{\operatorname{Pr}(N|XO)}{\operatorname{Pr}(N|O)}\cdot \operatorname{Pr}(X|O).$$The rightmost term then recurses down the $Y_i$; the denominator is easily simplified on the assumption that the $Y_i$ are independent, so that $\operatorname{Pr}(N | O) \approx \operatorname{Pr}(N).$
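For concreteness, here is a minimal sketch of that recursion under both simplifications (Python; the function name and the `(Pr(Y_i), Pr(Y_i|X))` input layout are just my illustration, and the numerator step is the one questioned below):

```python
def posterior(pr_x, evidence):
    """Pr(X | Y_1 ... Y_k) by the recursion above, using both
    Pr(Y_k | O) ~ Pr(Y_k) and the numerator step Pr(Y_k | XO) ~ Pr(Y_k | X).

    pr_x     -- the prior Pr(X)
    evidence -- one (Pr(Y_i), Pr(Y_i | X)) pair per event Y_i
    """
    if not evidence:
        return pr_x                        # no evidence: posterior = prior
    (pr_n, pr_n_given_x), old = evidence[-1], evidence[:-1]
    # Pr(X|NO) = Pr(N|XO)/Pr(N|O) * Pr(X|O)  ~  Pr(N|X)/Pr(N) * Pr(X|O)
    return pr_n_given_x / pr_n * posterior(pr_x, old)
```

One visible symptom of the approximation: nothing here constrains the output to $[0,1]$ (e.g. `posterior(0.9, [(0.1, 0.5)])` returns $4.5$), whereas if both approximations held exactly the product would be a genuine probability.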
Here's the problem: in my derivation I assumed that you could naively do the same to the numerator simultaneously, essentially adjoining an $|X$ to the approximation $\operatorname{Pr}(N|O) \approx \operatorname{Pr}(N)$ to get $\operatorname{Pr}(N|XO) \approx \operatorname{Pr}(N|X).$ This sounds intuitively plausible, because we want to say "none of the old events tell us anything about the new event, because they're all independent" -- but with this $X$ in the way, the algebra does not simplify out the way the earlier algebra did.
Even worse, one of the results I derived while trying to examine this looks unexpectedly symmetric:$$\operatorname{Pr}(X~N~O) = \frac{\operatorname{Pr}(X~O)~\operatorname{Pr}(X~N)~\operatorname{Pr}(N~O)}{\operatorname{Pr}(X)~\operatorname{Pr}(N)~\operatorname{Pr}(O)},$$ where here (as above) juxtaposition means event-intersection (i.e. "the probability of all of these together" / "given all of these together"). This scares me somewhat: a symmetry like this among the variables would seem to indicate that we are in some sort of deeply unusual situation, and the formulas stemming from this approximation are then likely total bunk in the real world!
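(At least the algebra seems internally consistent: the following toy check builds a joint distribution on which both approximations hold exactly -- degenerately, by making $O$ independent of everything -- and the symmetric formula does hold on it. All numbers are arbitrary.)

```python
# Toy joint over atoms (x, n, o): N depends on X, but O is independent
# of both, so Pr(N|XO) = Pr(N|X) and Pr(N|O) = Pr(N) hold *exactly*.
px, po = 0.4, 0.25
pn = {0: 0.3, 1: 0.8}                  # Pr(N | X = x)

joint = {(x, n, o): (px if x else 1 - px)
                    * (pn[x] if n else 1 - pn[x])
                    * (po if o else 1 - po)
         for x in (0, 1) for n in (0, 1) for o in (0, 1)}

def pr(pred):                          # Pr of an event, as a predicate on atoms
    return sum(p for a, p in joint.items() if pred(a))

lhs = pr(lambda a: a == (1, 1, 1))     # Pr(X N O)
rhs = (pr(lambda a: a[0] and a[2]) * pr(lambda a: a[0] and a[1])
       * pr(lambda a: a[1] and a[2])
       / (pr(lambda a: a[0]) * pr(lambda a: a[1]) * pr(lambda a: a[2])))
assert abs(lhs - rhs) < 1e-12          # the symmetric formula holds here
```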
So I need insight on any of the following mostly-equivalent questions:
- How safe is $\operatorname{Pr}(N|XO) \approx \operatorname{Pr}(N|X)$ when you are already trying to control $N$ to ensure that $\operatorname{Pr}(N | O) \approx \operatorname{Pr}(N)$? (See the numerical probe below.)
- Is there a nice interpretation of the above symmetric formula?
- Does this somehow force any really bad results like $\operatorname{Pr}(X|NO) = \operatorname{Pr}(X)$ when you follow the logic out?
There would be other results that might help too, like if $\operatorname{Pr}(N|XO)$ could be phrased as some monotonic function $f\big(\operatorname{Pr}(N|X), \operatorname{Pr}(N), \operatorname{Pr}(O)\big),$ that might be sufficient to say "these are in a one-to-one relationship with the true probability, I just don't know what that relationship is" and save the system.
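Regarding the first bullet, here is the throwaway rejection-sampling experiment mentioned above (Python; the tolerance $0.01$ and the sample count are arbitrary): draw random joint distributions over three binary events, keep those where $\operatorname{Pr}(N|O) \approx \operatorname{Pr}(N)$ already holds, and see how far $\operatorname{Pr}(N|XO)$ can still be from $\operatorname{Pr}(N|X)$:

```python
import random

def random_joint():
    """Random joint distribution over three binary events, encoded as
    normalised weights on the 8 atoms (x, n, o)."""
    w = [random.random() for _ in range(8)]
    total = sum(w)
    return {(x, n, o): w[4 * x + 2 * n + o] / total
            for x in (0, 1) for n in (0, 1) for o in (0, 1)}

def pr(joint, pred):
    """Pr of an event, given as a predicate on atoms."""
    return sum(p for atom, p in joint.items() if pred(atom))

def cond(joint, pred_a, pred_b):
    """Pr(A | B)."""
    return pr(joint, lambda a: pred_a(a) and pred_b(a)) / pr(joint, pred_b)

X = lambda a: a[0] == 1
N = lambda a: a[1] == 1
O = lambda a: a[2] == 1
XO = lambda a: X(a) and O(a)

worst, kept = 0.0, 0
while kept < 1000:
    j = random_joint()
    # keep only joints where the controlled approximation already holds
    if abs(cond(j, N, O) - pr(j, N)) < 0.01:
        kept += 1
        worst = max(worst, abs(cond(j, N, XO) - cond(j, N, X)))
print("worst |Pr(N|XO) - Pr(N|X)| among accepted joints:", worst)
```

Unless I've misunderstood something, nothing forces this gap to be small: marginal independence of $N$ and $O$ says nothing about how they behave once you condition on $X$.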
The mathematics you're looking for is called conditional independence. Your second approximation, $\operatorname{Pr}(N|XO) \approx \operatorname{Pr}(N|X)$, taken as an equality, is equivalent to $$\operatorname{Pr}(X~N~O) = \frac{\operatorname{Pr}(X~O)~\operatorname{Pr}(X~N)}{\operatorname{Pr} X},$$and dividing both sides by $\operatorname{Pr}X$ yields:$$\operatorname{Pr}(N~O | X) = \operatorname{Pr}(N | X)~\operatorname{Pr}(O | X),$$a straightforward statement that $N$ and $O$ are conditionally independent given $X$. Your symmetric formula is just this identity multiplied by the factor $\operatorname{Pr}(N~O)/\big(\operatorname{Pr}(N)~\operatorname{Pr}(O)\big)$, which your first approximation sets to $1$: it follows directly from the conjunction of your two independence assumptions, and no more exotic structure is needed to produce it.
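To make that concrete, here is a quick numerical check (Python; the biases are arbitrary). A joint distribution built to be conditionally independent given $X$ satisfies the identity above exactly, while the extra factor from your symmetric formula is visibly not $1$, because this joint is not *marginally* independent:

```python
# Joint over (X, N, O) built to be conditionally independent given X:
# conditioned on X = x, N and O are independent coins with biases
# pn[x] and po[x].  All numbers are arbitrary.
px = 0.3
pn = {0: 0.6, 1: 0.2}   # Pr(N | X = x)
po = {0: 0.1, 1: 0.7}   # Pr(O | X = x)

joint = {(x, n, o): (px if x else 1 - px)
                    * (pn[x] if n else 1 - pn[x])
                    * (po[x] if o else 1 - po[x])
         for x in (0, 1) for n in (0, 1) for o in (0, 1)}

def pr(pred):
    return sum(p for a, p in joint.items() if pred(a))

# Pr(X N O) = Pr(X O) Pr(X N) / Pr(X): holds exactly by construction.
lhs = pr(lambda a: a == (1, 1, 1))
rhs = (pr(lambda a: a[0] and a[2]) * pr(lambda a: a[0] and a[1])
       / pr(lambda a: a[0]))
assert abs(lhs - rhs) < 1e-12

# The symmetric formula from the question multiplies this identity by
# Pr(N O) / (Pr(N) Pr(O)), which is 1 only under marginal independence:
factor = pr(lambda a: a[1] and a[2]) / (pr(lambda a: a[1]) * pr(lambda a: a[2]))
print(factor)   # 0.625 here, so the symmetric formula fails on this joint
```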
This is a standard assumption for any naive Bayes classifier.
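For reference, a minimal sketch of that classifier for a binary class (Python; the function and its arguments are illustrative, not any particular library's API):

```python
def naive_bayes_posterior(prior, lik_pos, lik_neg):
    """Pr(X | Y_1 ... Y_k), assuming the Y_i are conditionally
    independent given X and given not-X (the naive Bayes assumption).

    prior   -- Pr(X)
    lik_pos -- list of Pr(Y_i | X)
    lik_neg -- list of Pr(Y_i | not X)
    """
    num, den = prior, 1.0 - prior
    for a, b in zip(lik_pos, lik_neg):
        num *= a   # Pr(Y_1 ... Y_k | X) factorises given X ...
        den *= b   # ... and given not-X
    return num / (num + den)   # normalise over the two classes

# e.g. naive_bayes_posterior(0.5, [0.9, 0.8], [0.2, 0.3]) ~ 0.923
```

Note that normalising over $X$ and $\lnot X$ in the last line sidesteps estimating $\operatorname{Pr}(N|O)$ altogether, so the only assumption actually needed is conditional independence given each class; your marginal-independence approximation $\operatorname{Pr}(N|O) \approx \operatorname{Pr}(N)$ drops out entirely.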