Came across this equation for training in a classification task: $$P(\theta | X_{tr},Y_{tr}) = \frac{P(Y_{tr}|X_{tr},\theta)P(\theta)}{P(Y_{tr}|X_{tr})}$$ I can't understand how we get to it from this (Bayes' theorem): $$P(\theta | X_{tr},Y_{tr}) = \frac{P(X_{tr},Y_{tr}|\theta)P(\theta)}{P(X_{tr},Y_{tr})}$$
(Do I need the chain rule or the sum rule?)
I also have a question about the prediction equation:
$$P(Y_{ts}|X_{ts},X_{tr},Y_{tr})= \int{P(Y_{ts}|X_{ts},\theta)P(\theta | X_{tr} ,Y_{tr})d\theta}$$
What's the theory behind it, and what does the whole training process look like?
thanks!
You need the chain rule to get from the second equation to the first. By the chain rule, $P(X_{tr}, Y_{tr})=P(Y_{tr}|X_{tr})P(X_{tr})$ and $P(X_{tr}, Y_{tr}|\theta)=P(Y_{tr}|X_{tr},\theta)P(X_{tr}|\theta)$. Additionally, note that in machine learning the training inputs $X_{tr}$ do not depend on the classifier's model parameters $\theta$ you want to estimate (the model describes $Y$ given $X$, not the distribution of $X$ itself), so $P(X_{tr}|\theta)=P(X_{tr})$. Substituting both identities into the second equation, the factor $P(X_{tr})$ cancels between numerator and denominator, which yields the first equation.

To elaborate on the second equality above, the chain rule applied to the full joint gives $$P(X_{tr}, Y_{tr}, \theta)=P(\theta)P(X_{tr}|\theta)P(Y_{tr}|X_{tr},\theta)$$ On the other hand, $P(X_{tr}, Y_{tr}, \theta)=P(X_{tr}, Y_{tr}|\theta)P(\theta)$ by the usual product rule of conditional probability, which is a special case of the chain rule: simply treat the conjunction of the two events $X_{tr}, Y_{tr}$ as one composite event. Equating the two expressions and dividing by $P(\theta)$ gives $P(X_{tr}, Y_{tr}|\theta)=P(X_{tr}|\theta)P(Y_{tr}|X_{tr},\theta)$, and the first equation follows, clearly showing the application of the chain rule.
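To make the cancellation concrete, here is a small numeric sanity check on a toy discrete model (the distributions and sizes below are made up for illustration): when $P(X|\theta)=P(X)$, the posterior over $\theta$ computed from the full joint likelihood $P(X,Y|\theta)$ equals the one computed from the conditional likelihood $P(Y|X,\theta)$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete world: 3 parameter values, 4 inputs, 2 labels.
# P(x) is chosen independently of theta, and P(y | x, theta) is an
# arbitrary random conditional table.
p_theta = np.array([0.2, 0.5, 0.3])                          # P(theta)
p_x = np.array([0.1, 0.4, 0.3, 0.2])                         # P(x), same for all theta
p_y_given_x_theta = rng.dirichlet(np.ones(2), size=(3, 4))   # P(y | x, theta)

x, y = 2, 1  # one observed training pair

# Form 1: Bayes theorem with the joint likelihood
# P(x, y | theta) = P(x | theta) P(y | x, theta) = P(x) P(y | x, theta)
joint_lik = p_x[x] * p_y_given_x_theta[:, x, y]
post_full = joint_lik * p_theta
post_full /= post_full.sum()

# Form 2: condition on x throughout; likelihood is P(y | x, theta) only
cond_lik = p_y_given_x_theta[:, x, y]
post_cond = cond_lik * p_theta
post_cond /= post_cond.sum()

# Because P(x | theta) = P(x) is constant in theta, it cancels in the
# normalization, so both posteriors over theta coincide.
print(np.allclose(post_full, post_cond))  # True
```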
As for your further question about the prediction equation during testing: this is the posterior predictive distribution. It comes from the sum rule, $P(Y_{ts}|X_{ts},X_{tr},Y_{tr})=\int P(Y_{ts}|X_{ts},\theta,X_{tr},Y_{tr})\,P(\theta|X_{ts},X_{tr},Y_{tr})\,d\theta$, together with two conditional independence assumptions: given $\theta$, the test labels do not depend on the training data, i.e. $P(Y_{ts}|X_{ts},\theta,X_{tr},Y_{tr})=P(Y_{ts}|X_{ts},\theta)$, and the posterior over $\theta$ does not depend on test inputs alone, i.e. $P(\theta|X_{ts},X_{tr},Y_{tr})=P(\theta|X_{tr},Y_{tr})$. So it is mainly computing the marginal probability unconditioned on the nuisance parameter $\theta$, not unlike the denominator evidence in the standard Bayes theorem. The whole process is then: training computes the posterior $P(\theta|X_{tr},Y_{tr})$ via your first equation, and prediction averages the per-$\theta$ predictions $P(Y_{ts}|X_{ts},\theta)$ weighted by that posterior.
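Both steps can be sketched end to end on a minimal toy model, a coin flip with $\theta = P(y=1)$ and a flat prior, using a grid approximation of the integral (the model and grid size are illustrative choices, not anything from the question):

```python
import numpy as np

# Grid approximation of training (posterior) and prediction (posterior
# predictive) for a Bernoulli model with a uniform prior on theta.
theta = np.linspace(1e-3, 1 - 1e-3, 1000)   # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                 # flat prior P(theta)

y_train = np.array([1, 1, 0, 1, 1])         # toy training labels

# Training: P(theta | data) proportional to P(data | theta) P(theta)
k, n = y_train.sum(), len(y_train)
likelihood = theta**k * (1 - theta)**(n - k)
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta       # normalize on the grid

# Prediction: P(y_new=1 | data) = integral of P(y_new=1 | theta) * posterior
p_pred = np.sum(theta * posterior) * dtheta

# With a flat (Beta(1,1)) prior this matches Laplace's rule of
# succession: (k + 1) / (n + 2) = 5/7 ~ 0.714
print(round(p_pred, 3))
```

Note that the integral over $\theta$ is replaced by a finite sum over grid points; for higher-dimensional $\theta$ one would use conjugacy, MCMC, or variational methods instead, but the structure (posterior from training, weighted average for prediction) is the same.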