Dimensionality of datasets in multiple regression


As an example, let's say that a linear regression is performed of the form

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p +\varepsilon$$

where $Y$ is a vector of $10,000$ measurements of peak acceleration of different car models, and the regressors correspond to different technical features of the cars.

From a linear algebra standpoint, $Y$ lives in $\mathbb R^{10000}$, and the coefficients are found by projecting this vector onto the hyperplane (the column space of the regressors), i.e., by minimizing the squared Euclidean distance from $Y$ to that hyperplane.
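
To make the setup concrete, here is a rough sketch of the projection I have in mind, using NumPy on made-up numbers (the feature values, coefficients and noise level are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 5                        # 10,000 cars, 5 technical features (made up)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
Y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.1, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)    # minimizes ||Y - X beta||_2
Y_hat = X @ beta_hat                                # projection of Y onto col(X)

print(np.linalg.matrix_rank(X))                     # 6 = p + 1: dimension of the target subspace
print(np.max(np.abs(X.T @ (Y - Y_hat))))            # ~0: residual is orthogonal to col(X)
```

So $Y$ itself is a single point in $\mathbb R^{10000}$, while the projection target is a low-dimensional subspace of that space.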

Now, if dimension means the number of linearly independent vectors needed to span a space, then this single vector $Y$ spans only a $1$-dimensional subspace.

If it is truly $1$-dimensional within the ambient space $\mathbb R^{10000}$, then the Euclidean projection onto the hyperplane that underpins finding the coefficients does not suffer from any high-dimensionality issues (collinearity between the regressors being a separate topic). Otherwise, $L^2$ norms in high dimensions do pose problems.

So is $Y$ (the vector of $10,000$ observations) $1$-dimensional or high-dimensional?



BEST ANSWER

Consider the function $$f(X_1,X_2) = \beta_0 + \beta_1 X_1 +\beta_2 X_2. $$ This is a plane in 3 dimensions no matter how many times you evaluate the function. Thus, your problem "lives" in a $2$-dimensional space.
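
As a quick numerical sanity check (with invented coefficients, using NumPy), the column space spanned by the intercept, $X_1$ and $X_2$ has rank $3$ no matter how many rows you evaluate the plane on:

```python
import numpy as np

rng = np.random.default_rng(1)
b = np.array([1.0, 2.0, -0.5])          # made-up beta_0, beta_1, beta_2
for n in (10, 1_000, 100_000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # columns [1, X1, X2]
    f = X @ b                                                    # evaluate the plane n times
    print(n, np.linalg.matrix_rank(np.column_stack([X, f])))     # rank stays 3 for every n
```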

As for using the $L^2$ norm, you are correct. This link might be useful in that regard: Link

ANSWER

The issue of dimensionality in the context of regression analysis is the ratio between $n$, the number of observations, and $p$, the number of estimated parameters. The closer $n$ is to $p$, the less reliable your estimated model is. Assume that your model is
$$ y = \beta_0 + \sum_{j=1}^p x_j\beta_j + \epsilon, $$
hence in order to find the OLS estimators of $\beta = (\beta_0,\ldots,\beta_p)$ you project the vector $y$ onto the subspace spanned by $(1, x_1, \ldots, x_p)$, which is a $(p+1)$-dimensional subspace of $\mathbb R^n$. The number of observations, $n$, does not count as a dimension of the model.

If you have a continuous stochastic process, then you can sample from it infinitely many times, i.e., $n \to \infty$; that is usually a good feature because you can safely use asymptotic results. Notably, in such a case there is another problem of artificially low p-values, but this is unrelated to the dimension of the model or of the ambient space.
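
A small simulation (invented design matrix and pure-noise responses, using NumPy) illustrates the point about the $n$-to-$p$ ratio: with $n$ fixed, the OLS estimates become far less reliable as $p$ approaches $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100                                  # fixed number of observations (made up)
for p in (5, 50, 95):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta = np.zeros(p + 1)               # true coefficients: all zero, pure-noise response
    errs = []
    for _ in range(200):                 # Monte Carlo over fresh noise draws
        y = X @ beta + rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        errs.append(np.mean((beta_hat - beta) ** 2))
    print(p, round(float(np.mean(errs)), 3))   # average squared error per coefficient grows sharply as p -> n
```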