I am trying to work through the paper "Repairing Neural Networks by Leaving the Right Past Behind" (arXiv) and am really struggling with the mathematics. The paper states that the key idea is that they can express:
$$ p(\mathcal{D}\setminus\mathcal{C} | \theta) = p(\mathcal{D}|\theta) / p(\mathcal{C}|\theta), \quad\forall\mathcal{C} \subset \mathcal{D} $$
This is possible due to the "i.i.d. modelling assumption".
I tried to understand this formulation using my (limited) intuition, via reformulations, and even by writing out small sets and computing the conditional probabilities by hand, but none of these matched the formulation above.
Under what conditions is the above formulation correct, and what is the intuition behind it?
The symbols $\mathcal D, \mathcal C$ represent sets of data obtained by independent and identically distributed (i.i.d.) sampling from some distribution with parameter $\theta$.
Therefore, the data in $\mathcal{D\smallsetminus C}$ are conditionally independent of the data in $\mathcal{D\cap C}$ given $\theta$, since these two parts of $\mathcal D$ are disjoint.
Further, the text specifies that $\mathcal{\forall C\subset D}$, which means that $\mathcal{C = D\cap C}$.
And so we have this:
$$\begin{align}p(\mathcal D\mid\theta) &=p(\mathcal{(D\smallsetminus C)\cup(D\cap C)}\mid\theta)&&\text{by definition of the union}\\ &=p(\mathcal{D\smallsetminus C}\mid\theta)\cdot p(\mathcal{D\cap C}\mid\theta)&&\text{by independence} \textit{ of the data} \text{ given } \theta\\ &=p(\mathcal{D\smallsetminus C}\mid\theta)\cdot p(\mathcal C\mid\theta)&&\text{when }\mathcal{C\subset D}\\[2ex]\therefore\quad p(\mathcal{D\smallsetminus C}\mid\theta) &= p(\mathcal D\mid\theta) / p(\mathcal C\mid\theta)&&\forall \mathcal{C\subset D}\end{align}$$
That is all.
$p(\mathcal E\mid\theta)$ is the probability of obtaining the data points in $\mathcal E$ across $\lvert\mathcal E\rvert$ trials, each governed by the same parameter $\theta$.
To clarify: the sets are not events; the individual data points are. A set of data corresponds to the conjunction of events from separate trials, so a union of sets of data corresponds to the conjunction of those events.
Thus, for independent sets of data $\mathcal E$ and $\mathcal F$, we have $p(\mathcal{E\cup F}\mid\theta)=p(\mathcal E\mid\theta)\cdot p(\mathcal F\mid\theta)$.
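The identity is easy to check numerically on a toy model. Below is a minimal sketch, assuming a Bernoulli likelihood (my choice for illustration, not the model from the paper), where the first few draws play the role of $\mathcal C$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3  # Bernoulli parameter (illustrative choice)

# D: the full i.i.d. dataset; C: a subset of it
D = rng.binomial(1, theta, size=10)
C = D[:4]          # first 4 points play the role of C
D_minus_C = D[4:]  # the remaining points, D \ C

def likelihood(data, theta):
    # p(data | theta) for i.i.d. Bernoulli observations:
    # the product of the per-trial probabilities
    return float(np.prod(theta ** data * (1 - theta) ** (1 - data)))

# Product rule for disjoint sets: p(D|theta) = p(D\C|theta) * p(C|theta)
product = likelihood(D_minus_C, theta) * likelihood(C, theta)

# The quotient form from the paper: p(D\C|theta) = p(D|theta) / p(C|theta)
lhs = likelihood(D_minus_C, theta)
rhs = likelihood(D, theta) / likelihood(C, theta)
```

Both identities hold (up to floating point) because the i.i.d. likelihood is just a product over trials; dividing out the factors belonging to $\mathcal C$ leaves exactly the factors belonging to $\mathcal{D\smallsetminus C}$.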