In one of his lectures, Prof. Nando de Freitas explains the application of Bayes' rule to logistic regression. Here are the video and the slides.
In particular, on slide 10 (around 34:50 in the video) NdF writes the posterior as follows:
$$p(\theta \mid X,y)=\frac{p(y \mid X,\theta)p(\theta)}{p(y \mid X)}$$
where $(X, y)$ are the observed data $D$ and $\theta$ is the parameter of the model.
(1) Strict application of Bayes' rule gives a slightly different equation:
$$p(\theta \mid X,y)=\frac{p(y \mid X,\theta)p(\theta\mid X)}{p(y \mid X)}$$
Here $X$ is simply dropped from the conditioning to form the prior. It's not obvious to me that $X$ and $\theta$ are independent. Why is this true?
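To make explicit where the assumption sits (this is just the product rule applied to the joint, not something from the slides):

$$p(\theta \mid X, y) = \frac{p(\theta, y \mid X)}{p(y \mid X)} = \frac{p(y \mid X, \theta)\,p(\theta \mid X)}{p(y \mid X)},$$

so the formula on slide 10 holds exactly when $p(\theta \mid X) = p(\theta)$, i.e. when $\theta$ and $X$ are independent.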
(2) NdF repeats a similar reasoning on the next slide:
$$\begin{aligned} p(y_{n+1}\mid x_{n+1}, D) &= \int p(y_{n+1}, \theta \mid x_{n+1}, D)\,d\theta \\ &= \int p(y_{n+1}\mid \theta, x_{n+1}, D)\, p(\theta \mid x_{n+1}, D)\,d\theta \\ &= \int p(y_{n+1}\mid \theta, x_{n+1})\, p(\theta \mid D)\,d\theta \end{aligned}$$
In the last equation two conditions disappear, $D$ and $x_{n+1}$. The argument is as follows (around 40:20 in the video): $\theta$ already contains the information about $D$, hence conditioning on $D$ is redundant; and $x_{n+1}$ gives no information about the posterior, hence $x_{n+1}$ is redundant too.
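For concreteness, the final integral can be approximated numerically. Below is a minimal sketch (all names, the true parameter value, and the sampler settings are my own illustrative choices, not from the lecture): draw samples of $\theta$ from $p(\theta \mid D)$ with a crude Metropolis sampler, then average $p(y_{n+1} \mid \theta, x_{n+1})$ over those samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D logistic data; theta_true is an illustrative assumption.
X = rng.normal(size=100)
theta_true = 2.0
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-theta_true * X))).astype(int)

def log_post(theta):
    # Unnormalized log p(theta | X, y): Bernoulli log-likelihood + N(0, 1) log-prior.
    logits = theta * X
    return np.sum(y * logits - np.log1p(np.exp(logits))) - 0.5 * theta ** 2

# Crude random-walk Metropolis sampler for p(theta | D).
theta, samples = 0.0, []
for _ in range(5000):
    prop = theta + rng.normal(scale=0.5)
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
post = np.array(samples[1000:])  # discard burn-in

# Posterior predictive at a new input x_{n+1}:
# p(y_{n+1}=1 | x_{n+1}, D) ~= mean over posterior samples of sigmoid(theta * x_{n+1})
x_new = 1.0
p_pred = np.mean(1.0 / (1.0 + np.exp(-post * x_new)))
print(round(float(p_pred), 3))
```

Note that the averaged factor inside the loop is $p(y_{n+1} \mid \theta, x_{n+1})$ with no $D$ in it, which is exactly the simplification the slide makes: once $\theta$ is sampled from $p(\theta \mid D)$, the data enter only through $\theta$.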
I don't quite understand this reasoning, or the nature of $\theta$ as a random variable. The dependence between $\theta$ and $x$ is not straightforward; it looks like computing $p(\theta \mid x)$ requires marginalizing over all $y$. I would appreciate it if someone could explain the intuition behind this.
It's been a long time, but I think I figured it out. Strictly speaking, the presented derivation is wrong, because $X$ and $y$ are not independent. To see this, imagine a non-ML algorithm that tries to predict $y$ given just $X$. If it does even slightly better than random guessing, then $X$ contains information about $y$.
Consequently, $\theta$, which naturally depends on $(X, y)$, is not independent of $X$ in general. It's not difficult to come up with an example where $y$ can be computed directly from $X$, so that $\theta$ can also be determined from $X$ alone (just fix the random seed and you have a deterministic algorithm).
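A toy sketch of that extreme case (the fitting procedure and all names here are illustrative, not from the lecture): if $y$ is computed directly from $X$ and the fitting algorithm is deterministic, then $\theta$ is literally a function of $X$ alone.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    # Deterministic gradient ascent on the 1-D logistic log-likelihood.
    theta = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta * X))
        theta += lr * np.mean((y - p) * X)
    return theta

X = np.linspace(-3.0, 3.0, 50)
y = (X > 0).astype(int)      # y is a deterministic function of X
theta = fit_logistic(X, y)   # ...so theta is determined by X alone

# Rerunning the whole pipeline from X reproduces theta exactly:
assert theta == fit_logistic(X, (X > 0).astype(int))
```

In this setting $p(\theta \mid X)$ is a point mass, about as far from the independence assumption $p(\theta \mid X) = p(\theta)$ as one can get.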
I think the assumption that $X$ and $\theta$ are independent should have been stated explicitly; that would resolve the issue. Although it is not true in all cases, in most real-world problems it seems reasonable and not far-fetched. After all, the process that computes $\theta$ is usually stochastic and involves so many operations that the value of $\theta$ appears essentially uncorrelated with $X$, which makes the target equation valid to very good precision.