Independence of the data and the parameter in Machine Learning


In one of his lectures, Prof. Nando de Freitas explains the application of Bayes' rule to logistic regression. Here are the video and the slides.

In particular, on slide 10 (around 34:50 on the video) NdF writes the posterior as following:

$$p(\theta \mid X,y)=\frac{p(y \mid X,\theta)p(\theta)}{p(y \mid X)}$$

where $(X, y)$ are the observed data $D$ and $\theta$ is the parameter of the model.

(1) Strict application of Bayes' rule gives a slightly different equation:

$$p(\theta \mid X,y)=\frac{p(y \mid X,\theta)p(\theta\mid X)}{p(y \mid X)}$$

That is, $X$ is simply dropped from the conditioning to form the prior: $p(\theta \mid X)$ is replaced by $p(\theta)$. For me it's not obvious that $X$ and $\theta$ are independent. Why is this true?
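To see what the modeling assumption $p(\theta \mid X) = p(\theta)$ buys us, here is a minimal sketch with a toy discrete model (all the numbers and variable ranges are hypothetical, chosen only for illustration): when $\theta$ and $X$ are sampled independently by construction, the two forms of Bayes' rule above coincide.

```python
# Toy discrete model illustrating the assumption p(theta | X) = p(theta).
# theta, X, y each take values in {0, 1}; all probabilities are hypothetical.

p_theta = {0: 0.3, 1: 0.7}          # prior over the parameter
p_X = {0: 0.5, 1: 0.5}              # marginal over the input, sampled independently of theta
# likelihood p(y = 1 | X, theta):
p_y1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.4, (1, 1): 0.9}

def joint(theta, X, y):
    """Joint probability p(theta) p(X) p(y | X, theta)."""
    py1 = p_y1[(X, theta)]
    return p_theta[theta] * p_X[X] * (py1 if y == 1 else 1 - py1)

# Because theta and X are independent by construction, p(theta | X) = p(theta):
for X in (0, 1):
    pX = sum(joint(t, X, y) for t in (0, 1) for y in (0, 1))
    for t in (0, 1):
        p_t_given_X = sum(joint(t, X, y) for y in (0, 1)) / pX
        assert abs(p_t_given_X - p_theta[t]) < 1e-12

# Hence the two forms of Bayes' rule agree:
# p(theta | X, y) = p(y | X, theta) p(theta | X) / p(y | X)
#                 = p(y | X, theta) p(theta)     / p(y | X)
X_obs, y_obs = 1, 1
evidence = sum(joint(t, X_obs, y_obs) for t in (0, 1))
post = {t: joint(t, X_obs, y_obs) / evidence for t in (0, 1)}
print(post)
```

The inner assertions would fail if we instead made the distribution of $X$ depend on $\theta$, which is exactly the situation the question worries about.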

(2) NdF repeats a similar reasoning on the next slide:

$$\begin{aligned}
p(y_{n+1}\mid x_{n+1}, D) &= \int p(y_{n+1}, \theta \mid x_{n+1}, D)\,d\theta \\
&= \int p(y_{n+1}\mid \theta, x_{n+1}, D)\, p(\theta \mid x_{n+1}, D)\,d\theta \\
&= \int p(y_{n+1}\mid \theta, x_{n+1})\, p(\theta\mid D)\,d\theta
\end{aligned}$$

In the last step two conditioning variables disappear: $D$ from the first factor and $x_{n+1}$ from the second. The argument (around 40:20 in the video) is that $\theta$ already contains the information in $D$, so conditioning on $D$ is redundant once we condition on $\theta$; and $x_{n+1}$ gives no information about the posterior over $\theta$, so it can be dropped as well.
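The final integral has a simple computational reading: average the likelihood of $y_{n+1}$ over posterior samples of $\theta$. Below is a minimal Monte Carlo sketch for logistic regression; the Gaussian "posterior" and all the numbers are hypothetical stand-ins (in practice the samples would come from MCMC or a Laplace approximation), the point being only the structure of the average, where $D$ enters solely through the $\theta$ samples.

```python
import numpy as np

# Monte Carlo view of the last line of the derivation:
#   p(y_{n+1} | x_{n+1}, D) = ∫ p(y_{n+1} | theta, x_{n+1}) p(theta | D) dtheta
#                           ≈ (1/S) Σ_s p(y_{n+1} | theta_s, x_{n+1}),
# with theta_s ~ p(theta | D).

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior p(theta | D) ~ N(mean, cov) for a 2-dim weight vector:
post_mean = np.array([0.5, -1.0])
post_cov = 0.1 * np.eye(2)
theta_samples = rng.multivariate_normal(post_mean, post_cov, size=5000)

x_new = np.array([1.0, 2.0])  # the new input x_{n+1}

# p(y_{n+1} = 1 | theta_s, x_{n+1}) for each posterior sample, then average:
per_sample = sigmoid(theta_samples @ x_new)
predictive = per_sample.mean()
print(predictive)  # D influences this only through the theta samples
```

Note that `x_new` never touches the posterior samples; this mirrors the claim that $p(\theta \mid x_{n+1}, D) = p(\theta \mid D)$.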

I don't quite understand this reasoning, or the nature of $\theta$ as a random variable. The dependence between $\theta$ and $x$ is not straightforward, but it looks as if computing $p(\theta \mid x)$ requires marginalizing over all $y$. I would appreciate it if someone could explain the intuition behind this.


Accepted answer:

It's been a long time, but I think I figured it out. Strictly speaking, the presented derivation is wrong, because $X$ and $y$ are not independent. To see this, imagine a non-ML algorithm that tries to predict $y$ from $X$ alone. If it does even slightly better than random guessing, then $X$ contains information about $y$.

Consequently, $\theta$, which naturally depends on $(X, y)$, is in general not independent of $X$. It's also not hard to construct an example where $y$ can be computed directly from $X$, so that $\theta$ can be determined from $X$ alone (just fix the random seed and the training procedure becomes deterministic).
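The extreme case mentioned above can be sketched in a few lines. In this hypothetical example $y$ is an exact, noiseless function of $X$, so the fitted parameter is itself a deterministic function of $X$ alone:

```python
import numpy as np

# Hypothetical extreme case: y is an exact function of X, so the fitted
# parameter theta_hat is fully determined by X alone.

rng = np.random.default_rng(42)   # fixed seed: the whole pipeline is deterministic
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0]                 # y computable from X, no noise

# Least-squares estimate theta_hat = (X^T X)^{-1} X^T y, a function of X only:
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_hat)  # recovers the coefficient 3.0 for any X drawn this way
```

Here knowing $X$ pins down $\hat\theta$ exactly, the opposite of independence between the parameter and the inputs.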

I think the assumption that $X$ and $\theta$ are independent should have been stated explicitly; that would resolve the issue. Although it isn't true in every case, it seems reasonable and not far-fetched in most real-world problems. After all, the process that produces $\theta$ is usually stochastic and involves so many operations that the value of $\theta$ appears essentially uncorrelated with $X$, which makes the target equation valid to a very good approximation.

Another answer:

There's still some confusion in your self-answer, where you say that independence between $\theta$ and $X$ is an assumption that could be true or false, and that it's reasonable in most real-world situations. In fact, this is not a hypothesis about the real world that could turn out to be true, false, or approximately true in certain situations; it is a founding hypothesis of your Bayesian model. $\theta$ is independent of $X$ by design. It's your own model, so you're free to define it as you wish.

It is important to understand that $y$ is not random in the same sense as $X$: their sources of randomness differ. You could even consider $X$ a vector that is not random at all (a radical way of removing any potential confusion); in that case, you can drop all conditionings on $X$ from your equations, or treat them as decorative reminders. The dependency between $X$ and $y$ is then no longer handled by the probabilistic framework of realizations, but simply by the mental matching you make between the rows of $X$ and $y$ that share the same index. The vector $y$, however, remains random, since it includes realizations of additive noise. There are contexts where you might want to treat $X$ as random, especially in machine learning (this is necessary, for example, to discuss the bias-variance dilemma), but then it is random in the sense of sampling and data selection, which is a different source of randomness.

Informally, the Bayesian parameters are introduced here to address the randomness caused by particular noise realizations, not the randomness caused by the selection of examples for a dataset. But in any case, you don't have to justify to anyone how you build your model: you are free to introduce a random variable, and free to declare that it does not depend on the dataset selection.