I'm studying Bayesian estimation of model parameters, and I noticed that several ML books (*Deep Learning* by Goodfellow et al., *Machine Learning: A Probabilistic Perspective* by K. Murphy) use Bayes' rule in the following way: $$ p(\theta \mid X, y) \propto p(y \mid X, \theta) \cdot p(\theta) $$ which is then used to derive the MAP estimate: $$ \theta_{MAP} = \operatorname{argmax}\limits_{\theta} \log \left[ p(y \mid X, \theta) \cdot p(\theta) \right] $$
On the other hand, Bayes' rule formally gives the following equation for this case: $$ p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta) \cdot p(\theta \mid X)}{p(y \mid X)} $$
So I don't understand why they drop the conditioning on $X$. It seems to me that they assume $\theta$ and $X$ are independent, but why?
P.S. I found that a similar question was asked 2 years ago, but it remained unanswered :(
[MAP = maximum a posteriori, in case anyone was wondering. —Brian Tung]
The independence of $\theta$ from $X$, i.e. $p(\theta \mid X) = p(\theta)$, is indeed an assumption, and a very common one in Bayesian inference.
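Written out, this assumption is exactly what turns the formal version of Bayes' rule from your question into the proportionality used in the books:

$$ p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta) \cdot p(\theta \mid X)}{p(y \mid X)} = \dfrac{p(y \mid X, \theta) \cdot p(\theta)}{p(y \mid X)} \propto p(y \mid X, \theta) \cdot p(\theta), $$

where the last step holds because the evidence $p(y \mid X)$ does not depend on $\theta$, so it is a constant for the purposes of the $\operatorname{argmax}$.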
It comes from the fact that $p(\theta)$ expresses your prior information about the parameters of your model. These are essentially your modelling assumptions, which are independent of whatever data you choose to show your model.
You can think of it like this: you use MAP instead of MLE to derive the optimal parameters of your model because you believe that the prior information you have about these parameters, based on the structure of your problem, will help produce more sensible answers. (Otherwise you could just impose a uniform prior, under which every parameter value is equally likely, which is equivalent to doing MLE.)
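To make the MAP-vs-MLE relationship concrete, here is a minimal sketch for a Bernoulli (coin-flip) parameter with a conjugate Beta prior. The data counts are hypothetical, chosen just for illustration; the closed-form posterior mode is a standard result for the Beta-Bernoulli model.

```python
# Hypothetical data: 7 heads out of 10 tosses.
heads, tosses = 7, 10

# MLE for a Bernoulli parameter is just the sample mean.
theta_mle = heads / tosses

def theta_map(a, b):
    """MAP estimate under a Beta(a, b) prior.

    The posterior is Beta(a + heads, b + tails), whose mode is
    (a + heads - 1) / (a + b + tosses - 2).
    """
    return (a + heads - 1) / (a + b + tosses - 2)

# A Beta(1, 1) prior is uniform, so MAP reduces to MLE.
print(theta_map(1, 1))  # 0.7, identical to theta_mle
# A Beta(2, 2) prior pulls the estimate toward 0.5.
print(theta_map(2, 2))  # 8/12 ~ 0.667
```

With the uniform prior the two estimates coincide exactly, which is the equivalence mentioned above; a non-uniform prior shifts the MAP estimate toward the prior's mode.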
So even though Bayes' rule would tell you to condition everything on $X$, the way the problem is set up means the choice of priors comes before anything else.
The priors you place on your parameters have a crucial effect on the predictive power of your model. Indeed, if your prior rules out some parameter values by giving them zero weight, you won't be able to obtain them through this kind of inference, no matter how well they explain the observed data through the likelihood. Therefore, it is really important to select sensible priors.
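The zero-weight effect is easy to see numerically. Below is a hypothetical sketch: a grid-based MAP computation where the prior puts zero mass on $\theta > 0.5$, so the MAP estimate is capped at 0.5 even though the data strongly favour a larger value.

```python
import numpy as np

# Grid over the parameter space (avoiding the endpoints 0 and 1).
theta = np.linspace(0.01, 0.99, 99)

# Hypothetical data: 9 heads, 1 tail -- the MLE would be 0.9.
heads, tails = 9, 1
log_lik = heads * np.log(theta) + tails * np.log(1 - theta)

# A prior that gives zero weight to theta > 0.5.
prior = np.where(theta <= 0.5, 1.0, 0.0)

# Log-posterior up to a constant; zero-prior points become -inf,
# so they can never win the argmax.
with np.errstate(divide="ignore"):
    log_post = log_lik + np.log(prior)

theta_hat = theta[np.argmax(log_post)]
print(theta_hat)  # 0.5: the prior caps the estimate despite the likelihood
```

No amount of data can push the estimate past 0.5 here, which is exactly why a prior that assigns zero probability to plausible values is dangerous.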
The bottom line is: your prior is independent of your inputs, and that is why you can drop the conditioning on $X$. Hopefully the above gives you some intuition as to why this is the case.