Linear regression with two possible slopes


Let's say I have a dataset with $X$ and $Y$ values, where $X$ is the monthly average temperature and $Y$ is the money spent on utilities. My underlying hypothesis is that the heating energy (and hence the utility bill) is proportional to the average monthly temperature, but that the slope, in dollars per $^\circ C$, differs depending on whether the house has gas or electric heating.

How can I use linear regression to extract these two slopes? If I just do a simple linear regression of $Y$ on $X$, I will only get a single slope, which represents an average (in dollars per $^\circ C$) over gas and electric heating. If I make a scatter plot, it's quite easy to see two distinct linear relations (as shown in the figure below), but I am lost as to how to extract the two slopes.

enter image description here


There are 3 best solutions below

0
On BEST ANSWER

If you cannot distinguish the two groups directly, then just classify the observations manually, e.g., assign observation $i$ to group $A$ whenever $y_i/x_i$ is larger than some threshold, and then estimate $$ y_i = \beta_0 + \beta_1 x_i + \beta_2 D_Ax_i + \epsilon_i, $$ where $D_A$ is the indicator that observation $i$ belongs to group $A$. The slope for group $A$ is then $(\beta_1 + \beta_2)$, while for the other group it is just $\beta_1$.
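A minimal sketch of this recipe in base MATLAB (it should also run in Octave). The synthetic data, the noise terms and the threshold value 2.5 are assumptions made up for illustration; with real data the threshold has to be picked by looking at the scatter plot.

```matlab
% Synthetic data (assumed for illustration): common intercept 2,
% slope 3 for group A and slope 1 for the other group.
n  = 20;
xA = linspace(2,10,n)';
xB = linspace(2,10,n)';
yA = 2 + 3*xA + 0.05*sin(37*(1:n)');   % small deterministic pseudo-noise
yB = 2 + 1*xB + 0.05*cos(23*(1:n)');
x  = [xA; xB];
y  = [yA; yB];

% Manual classification: threshold on the ratio y/x (value picked by eye)
threshold = 2.5;
D = double(y./x > threshold);          % D = 1 for group A, 0 otherwise

% Least-squares fit of y = beta0 + beta1*x + beta2*D*x
Xd   = [ones(size(x)), x, D.*x];
beta = Xd \ y;

slopeB = beta(2);                      % slope of the other group
slopeA = beta(2) + beta(3);            % slope of group A
fprintf('slope A = %.3f, slope B = %.3f\n', slopeA, slopeB);
```

Note that this model forces both groups to share the intercept $\beta_0$; if that is not plausible for your data, an extra column `D` in `Xd` would give group $A$ its own intercept.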

1
On

As far as I can tell, assuming you cannot label the data manually, you have three options:

  • Full-fledged optimization-based joint classification and regression
  • Two stages with initial unsupervised classification followed by standard regression
  • Some other heuristics based on adding/removing points to two regression problems

For data as well separated as yours, I would go for the second alternative, as it should be very easy to separate these clusters (unless this is just a test and you have a solver framework available).
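A rough sketch of what the two-stage alternative could look like in base MATLAB, without any toolbox. The choice of the ratio $y/x$ as clustering feature, the hand-written two-cluster k-means and the synthetic data are all assumptions for illustration; the ratio only makes sense when $x$ stays away from zero and the two lines do not cross inside the data range.

```matlab
% Synthetic, well-separated data (assumed for illustration)
n  = 25;
x1 = linspace(1,10,n)';
x2 = linspace(1,10,n)';
y1 = 1.0 + 2*x1 + 0.1*sin(37*(1:n)');   % deterministic pseudo-noise
y2 = 1.5 + 5*x2 + 0.1*cos(23*(1:n)');
x  = [x1; x2];
y  = [y1; y2];

% Stage 1: two-cluster k-means on the one-dimensional feature r = y/x
r = y./x;
c = [min(r); max(r)];                          % initial cluster centres
for iter = 1:50
    isHigh = abs(r - c(2)) < abs(r - c(1));    % assign to nearest centre
    cNew   = [mean(r(~isHigh)); mean(r(isHigh))];
    if all(abs(cNew - c) < 1e-12), break; end  % centres stopped moving
    c = cNew;
end

% Stage 2: ordinary least squares on each cluster separately
pLow  = polyfit(x(~isHigh), y(~isHigh), 1);    % [slope, intercept]
pHigh = polyfit(x(isHigh),  y(isHigh),  1);
fprintf('line 1: y = %.3f + %.3f x\n', pLow(2),  pLow(1));
fprintf('line 2: y = %.3f + %.3f x\n', pHigh(2), pHigh(1));
```

Unlike a single regression with an indicator variable, each cluster gets its own slope and its own intercept here.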

I had some old MATLAB code illustrating the first option. The following code sets up a case similar to yours and encodes it as a mixed-integer QP using the YALMIP toolbox. The MIQP solver Gurobi, which I used for testing, already starts to struggle at 100 data points. You essentially assign a binary variable to each data point for each of the two lines, and let this variable determine which residual is added to the objective.

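%% Data: two noisy lines, y = 2 + 3x and y = 1 + 6x
n = 25;
x1 = sort(rand(n,1));
x2 = sort(rand(n,1));
y1 = 2+3*x1+.3*randn(n,1);
y2 = 1+6*x2+.3*randn(n,1);
x = [x1;x2];
y = [y1;y2];

%% Optimization (requires YALMIP and a mixed-integer QP solver)
line1 = binvar(2*n,1);   % line1(i) = 1 if point i is assigned to line 1
line2 = binvar(2*n,1);   % line2(i) = 1 if point i is assigned to line 2
sdpvar a1 b1 a2 b2       % slopes (a) and intercepts (b) of the two lines
e = sdpvar(2*n,1);       % residual of each point w.r.t. its assigned line
Model = [implies(line1,e == y-(a1*x+b1))
         implies(line2,e == y-(a2*x+b2))
         line1+line2 == 1];   % each point belongs to exactly one line
optimize(Model,e'*e);

%% Evaluate
clf
hold on
t = (0:0.1:1);
plot(t,value(a1)*t+value(b1),'k-');   % fitted line 1
plot(t,value(a2)*t+value(b2),'k-');   % fitted line 2
i = find(round(value(line1)));        % round guards against 0.9999-type values
j = find(round(value(line2)));
plot(x(i),y(i),'b*',x(j),y(j),'r*')   % stars: estimated assignment
plot(x1,y1,'ro',x2,y2,'bo')           % circles: true group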

enter image description here

I just had to try the linear least-squares method in the answer by JJacquelin. It seems to work well on data that looks like yours. (I was too lazy to extract the asymptotes, so I just plotted the quadratic symbolically; well, the whole code is lazy.)

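% Fit the conic a*x^2 + b*y^2 + 2*c*x*y + f*x + g*y + 1 = 0 to the data x, y above
sdpvar a b c f g
e = x.^2*a + y.^2*b + 2*x.*y*c + f*x + g*y + 1;
optimize([],e'*e);
% Redefine x and y as symbolic variables to display and plot the fitted conic
sdpvar x y
p = [a b c f g];
s = sdisplay(replace(x^2*a + y^2*b + 2*x*y*c + f*x + g*y + 1,p,value(p)));
l = ezplot([s{1} '= 0']);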

enter image description here
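For completeness, here is a sketch of how the asymptotes could be extracted from the fitted conic $a x^2 + b y^2 + 2cxy + fx + gy + 1 = 0$, written in base MATLAB so it does not need YALMIP. Substituting $y \approx mx$ into the quadratic terms gives $b m^2 + 2cm + a = 0$ for the asymptote slopes, and both asymptotes pass through the centre of the conic, where the two partial derivatives vanish. The data below are exact points on two assumed lines so that the result is easy to check; with noisy data the recovered lines will only be approximate.

```matlab
% Exact points on two assumed lines, y = 1 + 2x and y = 1.5 + 5x
n = 25;
t = linspace(1,10,n)';
x = [t; t];
y = [1 + 2*t; 1.5 + 5*t];

% Linear least squares for a*x^2 + b*y^2 + 2*c*x*y + f*x + g*y + 1 = 0
M = [x.^2, y.^2, 2*x.*y, x, y];
p = M \ (-ones(size(x)));
a = p(1); b = p(2); c = p(3); f = p(4); g = p(5);

% Asymptote slopes m solve b*m^2 + 2*c*m + a = 0 (real when c^2 > a*b)
m = sort(roots([b, 2*c, a]));

% Centre of the conic: 2*a*x + 2*c*y + f = 0 and 2*c*x + 2*b*y + g = 0
ctr = [2*a, 2*c; 2*c, 2*b] \ [-f; -g];

% Each asymptote passes through the centre, which fixes its intercept
k = ctr(2) - m*ctr(1);
fprintf('y = %.3f + %.3f x\n', [k.'; m.']);
```

The normalisation with constant term 1 assumes that the conic does not pass through the origin; if it does, a different normalisation is needed.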

1
On

If you had posted an example of the data (numerical, not graphical), I would have tested the regression method given on page 19 of https://fr.scribd.com/doc/14819165/Regressions-coniques-quadriques-circulaire-spherique

Numerical examples are shown in the paper.

I am reluctant to propose this method without testing it on a representative example of your data because, as pointed out in the paper, its reliability depends heavily on the scatter of the data relative to the order of magnitude of the data.

You can try it and see, but a priori I would not guarantee success.

NOTE: The first step of the method consists of fitting a hyperbola, as shown in the paper. If the fit is successful, the asymptotes can roughly be taken as the two straight lines. Then, for a better fit, the points could be separated into two sets with respect to the axis of the hyperbola.