MAP estimation for conditional probability


I'm studying Bayesian estimation of model parameters, and I noticed that several ML books (Deep Learning by Goodfellow et al., Machine Learning: A Probabilistic Perspective by K. Murphy) use Bayes' rule in the following way: $$ p(\theta \mid X, y) \propto p(y \mid X, \theta) \cdot p(\theta) $$ which is then used to derive the MAP estimate: $$ \theta_{MAP} = \operatorname{argmax}\limits_{\theta} \log \left[ p(y \mid X, \theta) \cdot p(\theta) \right] $$
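To make sure I'm reading that objective correctly, here is a small toy sketch of what I understand $\theta_{MAP}$ to be. The Gaussian likelihood/prior, the variances `sigma2` and `tau2`, and all the numbers are just made up for illustration:

```python
import numpy as np

# Toy model (everything here is made up for illustration):
#   likelihood:  y | X, theta ~ N(X @ theta, sigma2 * I)
#   prior:       theta        ~ N(0, tau2 * I)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 1.0
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=50)

def log_joint(theta):
    """log p(y | X, theta) + log p(theta), up to additive constants."""
    log_lik = -0.5 / sigma2 * np.sum((y - X @ theta) ** 2)
    log_prior = -0.5 / tau2 * np.sum(theta ** 2)
    return log_lik + log_prior

# For this Gaussian/Gaussian toy case the argmax has a closed form
# (ridge regression): theta_MAP = (X^T X + (sigma2/tau2) I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(3), X.T @ y)
print(theta_map, log_joint(theta_map))
```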

On the other hand, Bayes' rule formally gives the following equation for this case: $$ p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta) \cdot p(\theta \mid X)}{p(y \mid X)} $$

So I don't understand why they drop the conditioning on $X$; it seems to me that they assume $\theta$ and $X$ are independent. But why?

P.S. I found that a similar question was asked 2 years ago, but it remained unanswered :(

[MAP = maximum a posteriori, in case anyone was wondering. —Brian Tung]


2 Answers

Accepted answer:

The independence of $\theta$ from $X$ is indeed an assumption that is very common in Bayesian inference.

It comes from the fact that $p(\theta)$ expresses the prior information about the parameters of your model. These are essentially your model assumptions, which are independent of whatever data you choose to show your model.

You can think of it like this: you use MAP instead of MLE to derive the optimal parameters of your model because you believe that the prior information you have about these parameters, based on the structure of your problem, will lead to more sensible answers. Otherwise you could just impose a uniform prior, where every parameter value is equally likely, which is equivalent to doing MLE.
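To spell out the uniform-prior case: with a flat prior $p(\theta) \propto 1$ (improper if the parameter space is unbounded), the prior term is constant in $\theta$ and drops out of the maximization,

$$ \theta_{MAP} = \operatorname{argmax}\limits_{\theta} \bigl[ \log p(y \mid X, \theta) + \log p(\theta) \bigr] = \operatorname{argmax}\limits_{\theta} \log p(y \mid X, \theta) = \theta_{MLE}. $$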

So even though Bayes' rule would tell you to condition everything on $X$, the way the problem is set up means the choice of prior comes before any data are seen.
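Written out, the step where that assumption enters is $p(\theta \mid X) = p(\theta)$:

$$ p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta) \cdot p(\theta \mid X)}{p(y \mid X)} = \dfrac{p(y \mid X, \theta) \cdot p(\theta)}{p(y \mid X)} \propto p(y \mid X, \theta) \cdot p(\theta), $$

where the proportionality is in $\theta$, since the denominator does not depend on $\theta$.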

The priors you place on your parameters have a crucial effect on the predictive power of your model. Indeed if your prior excludes the possibility of some parameter values by giving them zero weight, you won't be able to obtain them through this kind of inference, regardless of how well they explain the observed data through the likelihood. Therefore, it is really important to select sensible priors.
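In symbols: if the prior puts zero density on some value $\theta_0$, then

$$ p(\theta_0 \mid X, y) \propto p(y \mid X, \theta_0) \cdot p(\theta_0) = 0, $$

no matter how large the likelihood $p(y \mid X, \theta_0)$ is.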

The bottom line is, your prior is independent of your inputs and that is why you are able to drop the conditioning on $X$. Hopefully the above gives you some intuition as to why this is the case.

Second answer:

You can add Rasmussen's Gaussian Processes for Machine Learning to the list of books that make this assumption.

This seems to be a very recurrent question on this site ... In addition to the similar question you raised in your post, we can also mention (1) https://stats.stackexchange.com/questions/584279/is-there-an-implicit-independence-assumption-in-bayesian-inference-between-x-and ; (2) Independence of the data and the parameter in Machine Learning or even (3) Bayesian Linear Regression

Some answers seem to treat the prior distribution as a random variable, and then assert the independence between this random variable and $X$. Treating distributions of variables as random variables is, of course, a common practice, but that's not what we're doing here. In the derivation reported here, it is the independence between $X$ and $\theta$ itself that is asserted, not the independence between $X$ and the distribution of your a priori beliefs about $\theta$.

I guess an answer that somehow encompasses Dmarks' intuition and (1)'s comments on the fixity of $X$ is that the source of randomness that makes $\theta$ random is not the same as the source of randomness that determines which inputs are observed.

$X$ can be considered random in the sense that samples are drawn at random from a dataset (some even consider it non-random, but if you come from ML you probably want to think of it as stochastic). On the other hand, $\theta$ is random for Bayesian reasons, whatever your interpretation of Bayesianism (modelling your subjective beliefs, or the uncertainty about which world, among a collection of competing worlds, the observed data come from ...). It can additionally model a source of uncertainty intrinsic to the interaction between $X$ and $y$ (additive noise ...) that we try to imitate with our parameters, but which, in the way we look at the problem, is carried entirely by $y$ and not by $X$. I suppose you could think of it as a product measurable space between Sampling and (Uncertainty & inner randomness), with $X$ varying only according to the first factor.
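One possible way to sketch that picture formally (the notation here is my own, not standard): take

$$ (\Omega, \mathcal{F}, P) = \bigl( \Omega_{S} \times \Omega_{U},\ \mathcal{F}_{S} \otimes \mathcal{F}_{U},\ P_{S} \otimes P_{U} \bigr), \qquad X(\omega_S, \omega_U) = \tilde{X}(\omega_S), \qquad \theta(\omega_S, \omega_U) = \tilde{\theta}(\omega_U). $$

Since $X$ depends only on the sampling coordinate, $\theta$ depends only on the uncertainty coordinate, and the measure is a product measure, $X$ and $\theta$ are independent, which is exactly $p(\theta \mid X) = p(\theta)$.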