Connecting regression to the formal definition of a statistical model


I would like to know in what sense a regression model is a statistical model. The formal definition of a statistical model according to Wikipedia is as follows:

a statistical model is usually thought of as a pair $(S, P)$ where $S$ is the set of possible observations, i.e. the sample space, and $P$ is a set of probability distributions on $S$.

How does a regression model, normally written as below, map to the above formal definition of a statistical model? $y_{i}=\beta _{1}x_{i1}+\beta _{2}x_{i2}+\cdots +\beta _{p}x_{ip}+\varepsilon _{i}$


Best answer
  • Observations: data points of the form $(y_i, x_{i1}, \ldots, x_{ip})$ for $i = 1,\ldots,n$.
  • Probability model: conditional distribution of $y_i$ given $x_{i1},\ldots, x_{ip}$ is a normal distribution with mean $\beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ and some variance, for some real coefficients $\beta_1, \ldots, \beta_p$.

Note that one can perform linear regression without any probabilistic assumptions. But the above formulation is the usual set of assumptions when discussing things like bias/variance of the model, inference of coefficients, etc.
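
To illustrate the point that linear regression can be carried out with no probabilistic assumptions, here is a minimal sketch that treats least squares purely as an optimization problem and solves the normal equations. The helper names (`solve_linear_system`, `least_squares`) and the toy data are made up for this example and are not part of the original answer; only the Python standard library is used.

```python
# Least squares as a pure optimization problem: no distribution is assumed anywhere.

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations act on both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Choose the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(x_rows, y):
    """Return beta minimizing sum_i (y_i - sum_j beta_j * x_ij)^2."""
    p = len(x_rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[j] * row[k] for row in x_rows) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(x_rows, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data: y equals 2*x1 - 1*x2 exactly, with no noise term.
x_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
beta_hat = least_squares(x_rows, y)
```

Nothing in this computation refers to a sample space or a family of distributions; those only enter once one asks about the bias, variance, or inference properties of `beta_hat`.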

Note also that the above conditions might vary in different contexts. For instance, the above only models the conditional distribution (effectively keeping the $x_{ij}$ fixed/known); some might model the $x_{ij}$ as random variables as well. Conditions on the variance may vary. Is it known or unknown? Is it constant for all $i$ (homoskedastic) or does it vary with $i$ (heteroskedastic)? What if you don't require the conditional distribution to be normal?


After a long process, I can finally provide my answer. I accepted @angryavian's response, but I am also adding the extra points I learned that are relevant to a fuller answer. In linear regression, for each observation $y_i$ we assume $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$. Let $Y$ be the vector of all $y_i$, let $X$ (the $n \times p$ design matrix) be deterministic, and let $\varepsilon \sim \mathscr N(0, \sigma^2 I_n)$, so that $Y = XB + \varepsilon$. Since $\varepsilon$ is Gaussian and $XB$ is deterministic for any given $B$, the vector $XB + \varepsilon$ is distributed as $\mathscr N_n(XB, \sigma^2 I_n)$. Our model is therefore the family of probability distributions $\{ \mathscr N_n(XB, \sigma^2 I_n) \}_{B \in \mathbb R^p}$ on the sample space $\mathbb R^n$.
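
The family above can be made concrete with a small sketch in which each choice of $B$ selects one distribution on $\mathbb R^n$. This is only an illustration under stated assumptions: the function name `make_distribution`, the toy design matrix, and the two coefficient vectors are invented for the example, and $\sigma$ is treated as a fixed known value, matching the indexing by $B$ alone.

```python
import math
import random


def make_distribution(x_rows, beta, sigma):
    """Return (sample, log_density) for N_n(X beta, sigma^2 I_n), with X fixed."""
    # Mean vector X beta: one entry per observation.
    mean = [sum(b * xij for b, xij in zip(beta, row)) for row in x_rows]
    n = len(mean)

    def sample(rng):
        # Independent normal draws, since the covariance is sigma^2 I_n.
        return [rng.gauss(mu, sigma) for mu in mean]

    def log_density(y):
        # Log of the N_n(X beta, sigma^2 I_n) density evaluated at y.
        sq = sum((yi - mu) ** 2 for yi, mu in zip(y, mean))
        return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)

    return sample, log_density


# Hypothetical fixed design matrix with n = 3 observations and p = 2 predictors.
x_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sigma = 1.0

# Two members of the family P, indexed by two different values of B.
sample_a, logpdf_a = make_distribution(x_rows, [2.0, -1.0], sigma)
sample_b, logpdf_b = make_distribution(x_rows, [0.0, 0.0], sigma)

rng = random.Random(0)
y_obs = sample_a(rng)  # one point of the sample space S = R^n
```

Here the sample space $S$ is the set of all length-$n$ lists of reals, and the set $P$ is everything `make_distribution` can return as `beta` ranges over $\mathbb R^p$. Estimation then amounts to asking which member of $P$ best explains an observed `y_obs`, for example by comparing `logpdf_a(y_obs)` with `logpdf_b(y_obs)`.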