I want to fully understand the probabilistic interpretation of linear regression. I know that once we have a probabilistic model, we differentiate the likelihood to find the maximum-likelihood weights/regressors, but what I really find difficult to grasp is how exactly we develop a probabilistic model for linear regression in the first place.
I have seen that initially we write:
$$y_i = \epsilon_i + w^T x_i \tag{1}$$
Here I want to know: what is $y_i$? Is it the observed value? If so, how can we model it as random? Where is the randomness coming from? And what is $\epsilon_i$: is it error or noise?
Please correct me if I am wrong:
What I understand is that our measured data is noisy, i.e. for the same $x_i$, the value of $y_i$ can vary across different draws of samples, due to some inherent randomness in $y_i$. This randomness is what we quantify with $\epsilon_i \sim N(0,\sigma^2)$. Hence, given $x_i$, $y_i$ is a normal random variable with mean $w^T x_i$. We then maximize the likelihood, meaning we maximize the probability that $y_i$ takes the value observed in our current experimental data, given $x_i$; by (1), this probability is parameterized by $w$.
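To make this concrete, here is a minimal simulation sketch (the weight vector, input, and noise level are made-up values for illustration). It draws $y_i$ many times for the *same* $x_i$ and checks that the draws scatter around $w^T x_i$ with standard deviation $\sigma$, exactly as the model $y_i \mid x_i \sim N(w^T x_i, \sigma^2)$ predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([2.0, -1.0])   # assumed "true" weights (for illustration only)
x = np.array([1.0, 3.0])    # one fixed input x_i
sigma = 0.5                 # assumed noise standard deviation

# Repeatedly observe y_i for the SAME x_i: each draw differs
# only through the noise eps_i ~ N(0, sigma^2).
draws = w @ x + sigma * rng.standard_normal(10_000)

print(draws.mean())  # close to w^T x = -1.0
print(draws.std())   # close to sigma = 0.5
```

The point is that the randomness lives entirely in $\epsilon_i$; the regression function $w^T x_i$ is deterministic given $x_i$.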
Your two questions answer each other: the randomness is coming from $\varepsilon_i$. In the probabilistic interpretation of linear regression we assume that the data is being generated by an unknown linear model $w^T x_i$ plus IID Gaussian noise $\varepsilon_i$, which we assume has mean zero and a fixed variance $\sigma^2$ that will turn out to be irrelevant. Equivalently, $y_i \mid x_i \sim N(w^T x_i, \sigma^2)$, so the log likelihood of observing data points $y_1, \dots, y_k$ under the assumption that the weight vector is $w$ is $-\frac{1}{2\sigma^2}\sum_i (y_i - w^T x_i)^2$ plus a constant not depending on $w$, i.e. proportional to $-\sum_i (y_i - w^T x_i)^2$. The maximum likelihood estimate for $w$ is therefore found by maximizing this expression, which means minimizing the sum of squares.
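You can verify this equivalence numerically. The sketch below (synthetic data, made-up true weights) minimizes the negative Gaussian log-likelihood in $w$ by gradient descent and compares the result with the ordinary least-squares solution from `np.linalg.lstsq`; the two estimates coincide. Note that $\sigma^2$ only rescales the gradient, so it does not change the minimizer, which is why it is irrelevant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed true model (illustration only).
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3
y = X @ w_true + sigma * rng.standard_normal(n)

# Negative log-likelihood in w (sigma fixed) is, up to constants,
# 0.5 * sum((y - Xw)^2) / sigma^2. Its gradient is -X^T (y - Xw) / sigma^2.
w_hat = np.zeros(d)
lr = 1e-4
for _ in range(20_000):
    grad = -(X.T @ (y - X @ w_hat)) / sigma**2
    w_hat -= lr * grad

# Ordinary least squares for comparison.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)
print(w_ols)
```

Maximizing the likelihood and minimizing the sum of squared residuals land on the same $w$.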