Predicting future values or modeling data


Suppose the relationship $y=(xb-b)+c+d$ holds. If I had a table of $y$ values for different values of $x$, $b$, $c$, and $d$, but didn't know this relationship, how would I go about finding it? Would regression analysis produce the equation, or would I have to plot the values, set some of the variables to zero, and model it manually?


One "data-driven" approach to finding the relationship between the variables is to do a linear regression against a set of prespecified candidate functions. This requires that you have some idea about what types of functions might relate the variables (for instance, that the relationship is a polynomial of degree $\leq 2$).

First, make vectors: $$ X = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, ~~ B = \begin{bmatrix}b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, ~~ C = \begin{bmatrix}c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}, ~~ D = \begin{bmatrix}d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}. $$

Now, construct a "library" of possible relationships between your variables. The columns of this matrix should be functions of the variables $x,b,c,d$ applied to each data point. For example, if you expect a polynomial relationship of degree $\leq 2$ you could make the library: $$ A = \begin{bmatrix} | & | & | & | & | & | & | & | & & | \\ 1 & X & B & C & D & X^2 & XB & XC & \cdots &D^2\\ | & | & | & | & | & | & | & | & & | \end{bmatrix} $$ where, for example, $$ XB = \begin{bmatrix}x_1b_1 \\ x_2b_2 \\ \vdots \\ x_nb_n \end{bmatrix}. $$

More generally, any column could be $f(X,B,C,D)$, where the $i$-th entry of this column is simply $f(x_i,b_i,c_i,d_i)$.

Now note that the product $Ac$ gives a linear combination (weighted sum) of the columns of $A$. So you can solve $Ac = Y$ to find the relationship between $Y$ and the columns of your library. Of course, in practice you will have to solve the least squares problem $\min_c \Vert Y-Ac \Vert$.

If your data exactly satisfies $y=(xb-b)+c+d$, then $c$ will have a coefficient of $1$ on the $XB$ column, $-1$ on the $B$ column, $1$ on the $C$ and $D$ columns, and $0$ everywhere else.
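As a sketch of the whole pipeline in Python with NumPy (the data here are synthetic and illustrative, generated from the relationship above, so exact recovery is expected; with real noisy data the coefficients would only be approximate):

```python
import numpy as np
from itertools import combinations_with_replacement

# Illustrative synthetic data generated from the (pretend-unknown)
# relationship y = (x*b - b) + c + d.
rng = np.random.default_rng(0)
n = 200
x, b, c, d = rng.uniform(1, 10, size=(4, n))
y = (x * b - b) + c + d

# Library: all monomials of degree <= 2 in (x, b, c, d), plus a constant column.
variables = {"x": x, "b": b, "c": c, "d": d}
names, cols = ["1"], [np.ones(n)]
for deg in (1, 2):
    for combo in combinations_with_replacement(variables.items(), deg):
        names.append("*".join(name for name, _ in combo))
        cols.append(np.prod([col for _, col in combo], axis=0))
A = np.column_stack(cols)

# Least squares fit: min_c ||y - A c||.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Report the terms with non-negligible coefficients.
for name, w in zip(names, coef):
    if abs(w) > 1e-6:
        print(f"{name}: {w:+.3f}")
# prints the four true terms: b: -1.000, c: +1.000, d: +1.000, x*b: +1.000
```

Since the synthetic data satisfies the relationship exactly and the library columns are linearly independent over these 200 points, the fit recovers the true coefficients up to floating-point error.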

Example

Suppose we have data:

Y,   X,  B,  C,  D
16,  6,  2,  2,  4
22,  2,  7,  6,  9
5,   1,  4,  1,  4
33,  5,  7,  0,  5
13,  1,  4,  9,  4

If we know that $Y$ is a linear combination of $X$, $B$, $C$, $D$, and $XB$, we can form the library $A = [X, B, C, D, XB]$:

[[ 6,  2,  2,  4, 12],
 [ 2,  7,  6,  9, 14],
 [ 1,  4,  1,  4,  4],
 [ 5,  7,  0,  5, 35],
 [ 1,  4,  9,  4,  4]]

Now, solving the least squares problem gives:

c = [0, -1, 1, 1, 1]

This tells us that $y = 0\cdot x - 1\cdot b + 1\cdot c + 1\cdot d + 1\cdot xb$, which is exactly what we expected.
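This worked example can be checked in Python with NumPy (a sketch; the $5\times 5$ system here happens to be exactly solvable, so least squares recovers the coefficients up to floating-point error):

```python
import numpy as np

# The data table from the example.
Y = np.array([16, 22, 5, 33, 13], dtype=float)
X = np.array([6, 2, 1, 5, 1], dtype=float)
B = np.array([2, 7, 4, 7, 4], dtype=float)
C = np.array([2, 6, 1, 0, 9], dtype=float)
D = np.array([4, 9, 4, 5, 4], dtype=float)

# The library A = [X, B, C, D, XB].
A = np.column_stack([X, B, C, D, X * B])

# Solve the least squares problem min_c ||Y - A c||.
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(coef, 6))  # coefficients on [X, B, C, D, XB]: approximately [0, -1, 1, 1, 1]
```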

Now, if you didn't know that the only product term would be $xb$, you could add more functions to the library, and ideally the least squares fit would give coefficients of 0 for those extra functions (in our example we get a coefficient of 0 for $x$, since there is no $x$ term in the relationship). The results will vary depending on how much data you have and how noisy it is. If you think the relationship is simple, you can promote sparsity through L1 regularization by instead solving: $$ \min_c \Vert Y-Ac \Vert_2^2 + \lambda \Vert c \Vert_1 $$ where $\lambda$ is a tunable parameter.
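As a sketch of this in Python with NumPy, using a textbook iterative soft-thresholding (ISTA) loop to solve the L1-regularized problem on the example data above (a library solver such as scikit-learn's `Lasso` would also work; the value of $\lambda$ here is an arbitrary small choice, so the answer stays close to the unregularized fit):

```python
import numpy as np

# Example data and library A = [X, B, C, D, XB] from above.
Y = np.array([16, 22, 5, 33, 13], dtype=float)
X = np.array([6, 2, 1, 5, 1], dtype=float)
B = np.array([2, 7, 4, 7, 4], dtype=float)
C = np.array([2, 6, 1, 0, 9], dtype=float)
D = np.array([4, 9, 4, 5, 4], dtype=float)
A = np.column_stack([X, B, C, D, X * B])

def lasso_ista(A, y, lam, n_iter=20000):
    """Minimize 0.5*||y - A c||^2 + lam*||c||_1 by iterative
    soft-thresholding (ISTA), warm-started at the least squares fit."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size from the spectral norm of A
    c = np.linalg.lstsq(A, y, rcond=None)[0]   # warm start
    for _ in range(n_iter):
        g = c - t * A.T @ (A @ c - y)          # gradient step on the quadratic term
        c = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft-threshold
    return c

coef = lasso_ista(A, Y, lam=0.001)
print(np.round(coef, 3))  # close to [0, -1, 1, 1, 1]; small coefficients are shrunk toward 0
```

With larger $\lambda$ (or noisier data) the shrinkage becomes more aggressive, zeroing out spurious library columns at the cost of slightly biasing the surviving coefficients toward zero.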