In Andrew Ng's machine learning course (http://cs229.stanford.edu/notes/cs229-notes1.pdf), page 12, it says that $x$ and $y$ have a linear relationship $$y^{(i)} = \theta^Tx^{(i)}+\epsilon^{(i)}$$ where $$\epsilon^{(i)} \sim N(0,\sigma^2)$$ is assumed to be an IID Gaussian random variable. Therefore $$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$ and if we write $$\epsilon^{(i)} = y^{(i)} - \theta^Tx^{(i)}$$ we obtain a conditional probability: $$p(y^{(i)} \mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^Tx^{(i)})^2}{2\sigma^2}\right)$$
My question is: why is this a conditional probability rather than a 2D joint probability? That is, why can't I write it as $$p(y^{(i)}, x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^Tx^{(i)})^2}{2\sigma^2}\right)$$
One more question: in the notes, $y^{(i)}$ and $x^{(i)}$ are samples, not random variables. Was it OK to write $p(y^{(i)} \mid x^{(i)})$ instead of $p(Y \mid X)$?
Thank you, brilliant guys!
$p(\epsilon^{(i)})$ is the probability density function of the error term $\epsilon^{(i)}$. It expresses how the unmodelled effects influence the value of $y^{(i)}$; that is, it is the density of the influence on $y^{(i)}$ that is not produced by the random variable $x^{(i)}$ (or the parameter $\theta$).
In other words, for a fixed value of $x^{(i)}$, $p(\epsilon^{(i)})$ is the conditional density function of $y^{(i)}$ given that constraint:
$$p(y^{(i)}\mid x^{(i)};\theta)~=~p(\epsilon^{(i)})$$
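To spell out why this identity holds: with $x^{(i)}$ held fixed, the map $\epsilon^{(i)} \mapsto y^{(i)} = \theta^Tx^{(i)}+\epsilon^{(i)}$ is just a shift by the constant $\theta^Tx^{(i)}$, so the change-of-variables Jacobian is $1$ and the two densities coincide:

$$p(y^{(i)}\mid x^{(i)};\theta) ~=~ p_{\mathcal E}\!\left(y^{(i)}-\theta^Tx^{(i)}\right)\left|\frac{d\epsilon^{(i)}}{dy^{(i)}}\right| ~=~ \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^Tx^{(i)})^2}{2\sigma^2}\right), \qquad \frac{d\epsilon^{(i)}}{dy^{(i)}} = 1.$$

A joint density $p(y^{(i)}, x^{(i)};\theta)$, by contrast, would also need a marginal density for $x^{(i)}$, which the model never specifies.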
Yes..ish. It is actually shorthand for $p_{Y\mid X}(y^{(i)}\mid x^{(i)})$, just as $p(\epsilon^{(i)})$ is actually $p_{\mathcal E}(\epsilon^{(i)})$. The subscripts are simply omitted when it is considered unambiguous which random variable the value refers to.
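As a quick numerical sanity check (a sketch with made-up values of $\theta$, $\sigma$, and $x^{(i)}$, not values from the notes): if we hold $x^{(i)}$ fixed and sample $\epsilon^{(i)}$, then $y^{(i)} = \theta^Tx^{(i)} + \epsilon^{(i)}$ is exactly the Gaussian $N(\theta^Tx^{(i)}, \sigma^2)$ that the conditional density describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters -- chosen for this sketch, not taken from the notes.
theta = np.array([2.0, -1.0])
sigma = 0.5
x_i = np.array([1.0, 3.0])   # one fixed sample x^{(i)}

# Simulate y^{(i)} = theta^T x^{(i)} + eps^{(i)}, with eps ~ N(0, sigma^2).
n = 100_000
eps = rng.normal(0.0, sigma, size=n)
y = theta @ x_i + eps

# With x^{(i)} held fixed, y^{(i)} is Gaussian with mean theta^T x^{(i)}
# and standard deviation sigma -- i.e. the conditional density p(y|x; theta).
print(y.mean())  # should be close to theta @ x_i
print(y.std())   # should be close to sigma
```

Note that nothing here says how $x^{(i)}$ itself is distributed, which is precisely why the model gives a conditional density rather than a joint one.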