Optimal Selection of a Subset of Regression Variables


I have encountered a problem at work that looks interesting; I am sure there must be publications on this, but I cannot find any.

Formulation

In a large regression with hundreds of variables, I need to choose a subset of 3-5 variables (usually 3) that "best" captures the RHS, where the sense of "best" is perhaps open to interpretation, but so far I have interpreted it as producing the closest approximation to the RHS, i.e., minimizing the 2-norm of the difference.
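In other words (a minimal sketch, with `X` standing for the matrix of candidate columns and `y` for the RHS vector, names mine), the exact problem is to minimize the least-squares residual over all $\binom{n}{k}$ column subsets:

```python
import numpy as np
from itertools import combinations

def subset_rss(X, y, cols):
    """Residual sum of squares of the least-squares fit of y on the given columns."""
    cols = list(cols)
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ beta
    return float(r @ r)

def best_subset_exhaustive(X, y, k):
    """Exact best size-k subset by enumerating all C(n, k) candidates."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda c: subset_rss(X, y, c))
```

Enumeration is exact but clearly infeasible for hundreds of variables, which is precisely the difficulty of the question.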


Special Case - Orthonormal Vectors

It is clear that when the vectors onto which we project the RHS are orthonormal, it is enough to run one regression and keep the variables with the largest coefficients in absolute value. (In other words, if we are projecting $(3,5,7)^T$ onto the standard basis of $\mathbb{R}^3$, the top 2 vectors would be $(0,0,1)^T$ and $(0,1,0)^T$, in that order.)
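This can be checked numerically (a small sketch; the matrix sizes and names are mine): draw random orthonormal columns via a QR factorization and verify that keeping the largest absolute coefficients matches exhaustive search.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 6)))  # 6 orthonormal columns in R^10
y = rng.standard_normal(10)
k = 3

coef = Q.T @ y  # regression coefficients (for orthonormal columns, just inner products)
greedy = set(np.argsort(-np.abs(coef))[:k])

def rss(cols):
    cols = list(cols)
    beta, *_ = np.linalg.lstsq(Q[:, cols], y, rcond=None)
    r = y - Q[:, cols] @ beta
    return float(r @ r)

exact = set(min(combinations(range(6), k), key=rss))
print(greedy == exact)  # the two selections coincide
```

For orthonormal columns this agreement is guaranteed, since the residual sum of squares of a subset $S$ is $\|y\|^2 - \sum_{i \in S} c_i^2$, so minimizing it is exactly keeping the largest $|c_i|$.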

However, this intuitive greedy algorithm need not produce the optimal solution when the vectors onto which we are projecting are not orthogonal (which in practice is almost always the case, unfortunately). For example, take the basis $(0,1)^T$ and $\left(1/\sqrt{2},1/\sqrt{2}\right)^T$ in $\mathbb{R}^2$ (normalized but not orthogonal). Any vector $(x,x+a)^T$ in the first quadrant (with $x,a \in \mathbb{R}^+$) has the unique representation $$ (x,x+a)^T = x\sqrt{2}\,\left(1/\sqrt{2},1/\sqrt{2}\right)^T + a\,(0,1)^T, $$ which would be our regression output, so the coefficient-based rule would choose $(0,1)^T$ iff $a \ge x\sqrt{2}$.

Meanwhile, the projection of $(x,x+a)^T$ onto $(0,1)^T$ is $(0, x+a)^T$, which is $x$ away, and the projection of the same vector onto $\left(1/\sqrt{2},1/\sqrt{2}\right)^T$ is $(x+a/2, x+a/2)^T$, which is $\sqrt{(a/2)^2 + (a/2)^2} = a/\sqrt{2}$ away, so the optimal single choice is $(0,1)^T$ iff $x < a/\sqrt{2}$. In this two-vector example the two criteria happen to agree ($a \ge x\sqrt{2} \iff x \le a/\sqrt{2}$), but with three or more correlated vectors they genuinely diverge. Take $e_1$, $e_2$, and $v = (1,1,\varepsilon)^T/\sqrt{2+\varepsilon^2}$ in $\mathbb{R}^3$ with target $\vec{b} = (1,1,0)^T$. The full regression is $\vec{b} = 1 \cdot e_1 + 1 \cdot e_2 + 0 \cdot v$ (the zero third coordinate of $\vec{b}$ forces the coefficient of $v$ to vanish), so the coefficient rule ranks $v$ last; yet for small $\varepsilon$ the projection onto $v$ alone leaves a residual of order $\varepsilon$, far better than the residual of $1$ left by $e_1$ or $e_2$ alone. So the intuitive greedy strategy does not provide an optimal solution in general...
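To check numerically that coefficient magnitude can be a bad guide with correlated columns, here is a small constructed example (the columns and target are mine): three unit columns $e_1$, $e_2$, $v$ in $\mathbb{R}^3$, where the full-regression coefficient of $v$ is exactly zero even though $v$ alone leaves by far the smallest residual.

```python
import numpy as np

# Constructed example: three unit columns in R^3; eps controls the correlation
eps = 1e-3
v = np.array([1.0, 1.0, eps])
v /= np.linalg.norm(v)
X = np.column_stack([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], v])
y = np.array([1.0, 1.0, 0.0])

beta = np.linalg.solve(X, y)  # full-regression coefficients: (1, 1, 0)

# residual sum of squares of regressing y on each single (unit) column
single_rss = [float(np.sum((y - X[:, j] * (X[:, j] @ y)) ** 2)) for j in range(3)]
print(beta, single_rss)
```

Here `single_rss` is about $(1, 1, \varepsilon^2)$: the third column wins decisively despite having a zero coefficient in the full regression.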


My Thoughts on How to Approach This Problem

I thought one possible approach is to formulate this as an integer program, letting $z_k \in \{0,1\}$ indicate whether the corresponding column vector is included in the solution. However, it is not entirely clear how to keep this formulation linear, since

  • we would likely have to change the objective from the quadratic 2-norm to the linear 1-norm (which is undesirable); and
  • formulating the objective is a problem: with the $z_k$'s we have to minimize $\left\|\sum_{k=1}^n \vec{M}_k x_k z_k - \vec{b}\right\|_2^2$ subject to $\sum_{k=1}^n z_k = 3$ (or $=5$, depending on the particular problem in question), and the term $x_k z_k$ is not linear.
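One standard workaround (not from the original post, but well known in integer programming) is to keep the quadratic objective in the $x_k$'s alone and couple $x_k$ to $z_k$ with big-$M$ constraints, which removes the bilinear term $x_k z_k$:

$$ \min_{x,\,z} \left\|\sum_{k=1}^n \vec{M}_k x_k - \vec{b}\right\|_2^2 \quad \text{s.t.} \quad -M z_k \le x_k \le M z_k, \quad \sum_{k=1}^n z_k = 3, \quad z_k \in \{0,1\}, $$

for a sufficiently large constant $M$: whenever $z_k = 0$, the constraints force $x_k = 0$. The result is a mixed-integer *quadratic* program rather than a linear one, but it is in a form that off-the-shelf MIQP solvers accept directly.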

More importantly, I am really hoping there is a way to solve this that is not NP-hard. Thank you so much for your help.

Best Answer

You can solve the problem exactly via a branch-and-bound algorithm known as "leaps and bounds":

Furnival, George M., and Robert W. Wilson. "Regressions by Leaps and Bounds." Technometrics 16.4 (1974): 499-511.
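The key fact the bound exploits is that adding regressors never increases the residual sum of squares, so the RSS of a node's entire candidate pool lower-bounds every size-$k$ subset beneath it. A toy sketch of that pruning idea (my own simplified illustration, not Furnival and Wilson's actual algorithm, which is considerably more refined):

```python
import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """RSS of the least-squares fit of y on the given columns (y'y if none)."""
    cols = list(cols)
    if not cols:
        return float(y @ y)
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ beta
    return float(r @ r)

def best_subset_bb(X, y, k):
    """Exact best size-k subset via branch and bound.

    Pruning rule: adding columns never increases RSS, so the RSS of a node's
    entire candidate pool lower-bounds every size-k subset below that node.
    """
    n = X.shape[1]
    best_rss, best_cols = np.inf, None

    def recurse(chosen, start):
        nonlocal best_rss, best_cols
        if len(chosen) == k:
            v = rss(X, y, chosen)
            if v < best_rss:
                best_rss, best_cols = v, tuple(chosen)
            return
        pool = chosen + list(range(start, n))
        if len(pool) < k or rss(X, y, pool) >= best_rss:
            return  # infeasible, or provably no better than the incumbent
        for i in range(start, n):
            recurse(chosen + [i], i + 1)

    recurse([], 0)
    return best_cols, best_rss
```

The original algorithm is implemented, for example, in R's `leaps` package. Branch and bound does not escape the worst-case exponential complexity of best-subset selection, but in practice it prunes aggressively enough to handle problems far beyond brute-force enumeration.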