Is it possible to fit any regression line to a set of data points?


Suppose you have a set of data points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$,

and you know they follow a trend $y=f(x)$, where $f(x)$ is a known function,

say, for example, $f(x)=A x \sin(Bx)$.

Provided $f(x)$ is known, is it possible to fit a regression curve to this data and get values for $A$ and $B$ (and any other coefficients involved) for any $f(x)$?

If it is possible, could you give me some insight into how I would find the regression coefficients of the curve example given above?


Best answer:

Here's some MATLAB code that implements the idea in Claude Leibovici's answer.

First we generate some data

a = 10;
b = 2.5;
x = linspace(0, 2*pi, 200)';            % 200 points between 0 and 2pi
y = a * sin(b * x) + 10*randn(size(x)); % y = a sin(bx) + noise

which look like this

(figure: scatter plot of the noisy data y against x)

Now define your objective function, which will be a function of b only (since, for each candidate b, a is chosen by least squares):

function [mse, a] = obFun(b, x, y)
  sinbx = sin(b * x);
  a = sinbx \ y;          % choose a by OLS
  err = y - a * sinbx;    % compute error terms
  mse = mean(err.^2);
end

Say you suspect that the true value of b is in the interval [0, 5]. Then you can create some values in that interval, and compute the mean-square error on each value.

bValues = 0.01:0.01:5;
mse = zeros(size(bValues));   % preallocate
for i = 1:length(bValues)
  mse(i) = obFun(bValues(i), x, y);
end

The mean-square error as a function of b looks like this

(figure: mean-square error as a function of b)

So you can see that there is an obvious minimum near 2.5. You can locate it exactly, and from that recover the optimal values of b and a.

[mseMin, j] = min(mse);
bOptimal = bValues(j);
[~, aOptimal] = obFun(bOptimal, x, y);

fprintf('a = %.3f, b = %.3f\n', aOptimal, bOptimal);

which results in

a = 10.548, b = 2.540

which are reasonably close to the true values of a = 10.0 and b = 2.5.

Claude Leibovici's answer:

Nonlinear least squares is probably the easiest solution provided that you have reasonable starting values for $A$ and $B$.
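To make this concrete, here is a sketch in Python with SciPy (a translation, since the other answer uses MATLAB); the synthetic data, the noise level, and the starting guess `p0` are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    # The model from the question: y = A * x * sin(B * x)
    return A * x * np.sin(B * x)

# Synthetic data with known true values A = 10, B = 2.5 (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
y = model(x, 10.0, 2.5) + rng.normal(0.0, 5.0, x.size)

# Nonlinear least squares; p0 must be a reasonable starting guess,
# especially for B, because the residual is oscillatory in B.
popt, _ = curve_fit(model, x, y, p0=[5.0, 2.4])
A_hat, B_hat = popt
```

With a starting value of $B$ far from the truth, `curve_fit` can settle into a nearby local minimum, which is exactly why good starting values matter.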

Suppose that you do not know any starting point. What you can do is to fix $B$ at a given value and just compute $A$ according to $$A=\frac{\sum _{i=1}^n x_i y_i \sin (B x_i)}{\sum _{i=1}^n (x_i \sin (B x_i))^2 }$$ since the model is linear with respect to $A$ if $B$ is fixed. For this value of $A$, compute the sum of squares of errors and repeat for other values of $B$. Plot the sum of squares of errors as a function of $B$ until you have something close to a minimum. Now, you have your starting values.
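A NumPy sketch of that grid search (again a translation; the grid range $[0.01, 5]$, the data, and the variable names are illustrative), using the closed-form $A$ above for the model $y = A x \sin(Bx)$:

```python
import numpy as np

# Synthetic data with true values A = 10, B = 2.5 (illustrative)
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 200)
y = 10.0 * x * np.sin(2.5 * x) + rng.normal(0.0, 5.0, x.size)

B_grid = np.arange(0.01, 5.0, 0.01)
sse = np.empty_like(B_grid)
A_grid = np.empty_like(B_grid)
for i, B in enumerate(B_grid):
    g = x * np.sin(B * x)            # regressor for this fixed B
    A = np.dot(g, y) / np.dot(g, g)  # closed-form least-squares A
    A_grid[i] = A
    sse[i] = np.sum((y - A * g) ** 2)

j = np.argmin(sse)
B0, A0 = B_grid[j], A_grid[j]        # starting values
```

`B0` and `A0` can then seed a nonlinear least-squares routine.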

If you want, post a dozen data points and I shall show you how this works.

If the model had been $$y=A x \sin (B x)+C x^2+D x^3$$ the procedure I suggested would still apply since, for a given value of $B$, the parameters $A,C,D$ can be obtained immediately by multilinear regression. Then a look at the plot of the sum of squares of errors as a function of $B$ tells you how to initialize the nonlinear least-squares fit.
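The same profiling idea extends directly: for each fixed $B$ the model is linear in $(A, C, D)$, so each grid step is one ordinary least-squares solve on the design matrix $[x \sin(Bx),\; x^2,\; x^3]$. A NumPy sketch (synthetic data, true coefficients, and names are illustrative):

```python
import numpy as np

# Synthetic data with true values A = 4, B = 1.5, C = 0.5, D = -0.1
rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 300)
y = (4.0 * x * np.sin(1.5 * x) + 0.5 * x**2 - 0.1 * x**3
     + rng.normal(0.0, 1.0, x.size))

best = (np.inf, None, None)
for B in np.arange(0.01, 5.0, 0.01):
    # Columns: x*sin(Bx), x^2, x^3 -- linear in (A, C, D) once B is fixed
    X = np.column_stack([x * np.sin(B * x), x**2, x**3])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    if sse < best[0]:
        best = (sse, B, coef)

sse_min, B0, (A0, C0, D0) = best
```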