In the notes of Andrew Ng on linear regression p.11-12 it is written the following:
Let us assume that the target variables and the inputs are related via the equation $$ y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)} $$ where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. We assume that
- the $\epsilon^{(i)}$ are independently and identically distributed
- $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$
i.e. the density of $\epsilon^{(i)}$ is given by $$ p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\Big)\tag{1} $$ This implies that $$ p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2}\Big)\tag{2} $$
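To make (2) concrete, here is a quick simulation (the values of $\theta$, $x$, and $\sigma$ below are made up for illustration): drawing many samples of $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ and forming $y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}$ should give values distributed as $\mathcal{N}(\theta^\top x^{(i)}, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, not from the notes: a fixed input x, parameters theta, noise scale sigma.
theta = np.array([2.0, -1.0])
x = np.array([1.0, 3.0])
sigma = 0.5

# Draw noise samples eps ~ N(0, sigma^2) and form y = theta^T x + eps.
eps = rng.normal(0.0, sigma, size=1_000_000)
y = theta @ x + eps

# Conditional on x, y should be Gaussian with mean theta^T x and standard deviation sigma.
print(y.mean())  # close to theta @ x = -1.0
print(y.std())   # close to sigma = 0.5
```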
My question
If $$ p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\Big) \quad \text{implies that} \quad p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2}\Big) $$ That means that we assume that $$ y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)} \approx \theta^\top x^{(i)}\tag{3} $$ Or in other words: $$ p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(y^{(i)}-(\theta^\top x^{(i)}+ \epsilon^{(i)}))^2}{2\sigma^2}\Big) \approx \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2}\Big)\tag{4} $$
- is it okay to infer (3) and (4) from (1) and (2)?
- if yes, how can we mathematically justify (4) using this notation?
By change of variables. Suppose $X\sim f_X(x)$, where $f_X(\cdot)$ represents the probability density function and $F_X(\cdot)$ represents the cumulative distribution function; $X$ is the random variable and $x$ is a known, deterministic (given) value. Now suppose you want to compute the density of $Y=X+a$ where $a$ is deterministic. Then $F_Y(b)=$Pr$\{Y\le b\}=$Pr$\{X+a\le b\}=$Pr$\{X\le b-a\}=F_X(b-a)$, and differentiating both sides yields $f_Y(b)=f_X(b-a)$. In your setting, conditioning on $x^{(i)}$ makes $a=\theta^\top x^{(i)}$ deterministic, so (2) follows from (1) exactly, with no approximation; neither (3) nor (4) is needed.
Also notice that if $X_1$ and $X_2$ are independent, then $f_{X_1,X_2}(x_1,x_2)=f_{X_1}(x_1)f_{X_2}(x_2)$, and if we define $Y_1=X_1+a$ and $Y_2=X_2+a$, then $F_{Y_1,Y_2}(y_1,y_2)=$Pr$\{Y_1 \le y_1,Y_2 \le y_2\}=$Pr$\{X_1 \le y_1-a,X_2 \le y_2-a\}=$Pr$\{X_1 \le y_1-a\}$Pr$\{X_2 \le y_2-a\}$, which means $Y_1$ and $Y_2$ are also independent, and thus again by differentiation we have $f_{Y_1,Y_2}(y_1,y_2)=f_{X_1}(y_1-a)f_{X_2}(y_2-a)$. This can be generalized to an arbitrary number $n$ of variables.
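The change-of-variables identity $f_Y(b)=f_X(b-a)$ can be checked numerically: shift standard normal samples by a deterministic $a$ and compare the empirical density of $Y$ against the shifted pdf (the specific values of $a$ and the bin grid below are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(t, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2).
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# X ~ N(0, 1); shift by a deterministic constant a, so Y = X + a.
a = 2.5
x_samples = rng.normal(0.0, 1.0, size=1_000_000)
y_samples = x_samples + a

# Compare the empirical density of Y (normalized histogram) with f_X(b - a).
bins = np.linspace(a - 4, a + 4, 81)
hist, edges = np.histogram(y_samples, bins=bins, density=True)
centers = (edges[:-1] + edges[1:]) / 2
max_err = np.max(np.abs(hist - normal_pdf(centers - a)))
print(max_err)  # small; the empirical density matches the shifted pdf
```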
The noise model here is like a memoryless additive white Gaussian noise (AWGN) channel.