How to make small values matter in a non-linear regression analysis


I have a dataset with some of the following values:

n (independent variable): t (dependent variable)
196: 8.32E-05
676: 0.000360012
..: ..
2739025: 17.19871902
4422609: 34.82757854

I am trying to fit this empirical data with the closest matching function. For this, I use SPSS -> Analyze -> Regression -> [Curve Estimation / Nonlinear]. When I analyze the outcome, it appears that the large values are fitted to the same order of magnitude (42 vs. 35), whereas the small values are off by multiple orders of magnitude (0.05 vs. 0.00008).

I assume this is because of the sum-of-squares error measure.

My question: how can I perform an analysis that creates a more balanced fit, preferably using SPSS?

P.S. I apologize if math.stackexchange.com is the wrong forum. If so, please let me know which one to use instead.

There are 2 best solutions below

In the usual least squares analysis you square the error at each point and add the squares to get the total error, which you minimize. If a value like $35$ is fitted with $42$, that contributes $(42-35)^2=49$ to the error. If a value like $0.05$ is fitted with $0$, it contributes only $0.05^2=0.0025$ to the error, so it doesn't contribute much until the fitted value is off by something like $7$.

One approach is to take the log of all your data and fit that. You will then be asking the fitter to match each data point to within a multiplicative error instead of an additive error. Another approach is to weight some data points more than others in the sum of squares. A third approach is to use a functional form that is guaranteed to pass through $(0,0)$.
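A minimal sketch of these approaches, assuming SciPy rather than SPSS; the power-law model $t = a\,n^b$ and the starting values are my own assumptions, chosen only to illustrate the mechanics on the four data points quoted in the question:

```python
import numpy as np
from scipy.optimize import curve_fit

# The four data points quoted in the question.
n = np.array([196.0, 676.0, 2739025.0, 4422609.0])
t = np.array([8.32e-05, 0.000360012, 17.19871902, 34.82757854])

# Hypothetical model for illustration only: t = a * n**b.
def model(n, a, b):
    return a * n**b

# 1) Plain least squares: the two large t values dominate the error sum.
p_raw, _ = curve_fit(model, n, t, p0=[1e-7, 1.3], maxfev=10000)

# 2) Fit log(t) instead: each point now counts by its multiplicative
#    (relative) error rather than its additive error.
def log_model(n, log_a, b):
    return log_a + b * np.log(n)

p_log, _ = curve_fit(log_model, n, np.log(t), p0=[-16.0, 1.3])

# 3) Weighted least squares: sigma=t makes curve_fit minimize
#    sum(((fit - t) / t)**2), i.e. squared relative errors.
p_wt, _ = curve_fit(model, n, t, p0=[1e-7, 1.3], sigma=t, maxfev=10000)
```

In this toy example, approaches 2 and 3 land on similar exponents, while approach 1 lets the two large points dictate the fit.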

First of all, welcome to the site!

As Ross Millikan answered, taking the logarithm of the $y$'s is a solution. In fact, it almost corresponds to the minimization of the sum of squares of relative errors.

If you do it, the residual is $$r_i=\log(y_i^{calc})-\log(y_i^{exp})=\log\left(\frac{y_i^{calc}} {y_i^{exp} } \right)=\log\left(1+\frac{y_i^{calc}-y_i^{exp}} {y_i^{exp} } \right)$$ and, if the error is small, by Taylor, $$r_i \approx \frac{y_i^{calc}-y_i^{exp}} {y_i^{exp} }$$
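The Taylor approximation above is easy to verify numerically; here is a quick check for a hypothetical 3% error:

```python
import numpy as np

# For a small error, the log residual log(y_calc / y_exp) is very
# close to the relative error (y_calc - y_exp) / y_exp.
y_exp, y_calc = 100.0, 103.0
r_log = np.log(y_calc / y_exp)
r_rel = (y_calc - y_exp) / y_exp
```

The two residuals agree to within about 1.5% of each other at this error size.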

This is a typical problem in some areas; for example, the vapor pressure of any molecule is always represented as $\log(P)=f(T)$ and the fit is done that way.

Edit

For illustration purposes, let us consider the vapor pressure of water (values taken here). Units being Kelvin and Pascal, the data are $$\left( \begin{array}{cc} T & P \\ 212.45 & 1 \\ 230.95 & 10 \\ 252.85 & 100 \\ 280.15 & 1000 \\ 318.95 & 10000 \\ 372.75 & 100000 \end{array} \right)$$ and Antoine's very simplistic model is $$P=\exp\left(a+ \frac b {T+c} \right)$$ The fit does not present any difficulties, but let us look at the predicted values $$\left( \begin{array}{ccc} T & P_{exp} & P_{calc} \\ 212.45 & 1 & 1.40872 \\ 230.95 & 10 & 13.5177 \\ 252.85 & 100 & 117.420 \\ 280.15 & 1000 & 993.933 \\ 318.95 & 10000 & 10000.6 \\ 372.75 & 100000 & 100000. \end{array} \right)$$ showing the same problem as in your case (very large relative errors at low temperature).

Let us repeat the problem with the logarithmic transform $$\log(P)=a+ \frac b {T+c}$$ and repeat the calculations $$\left( \begin{array}{ccc} T & P_{exp} & P_{calc} \\ 212.45 & 1 & 0.96378 \\ 230.95 & 10 & 10.6333 \\ 252.85 & 100 & 102.584 \\ 280.15 & 1000 & 937.705 \\ 318.95 & 10000 & 9954.24 \\ 372.75 & 100000 & 101906. \end{array} \right)$$ Finally, using the nonlinear model but minimizing the sum of the squares of relative errors $$\left( \begin{array}{ccc} T & P_{exp} & P_{calc} \\ 212.45 & 1 & 0.96119 \\ 230.95 & 10 & 10.5950 \\ 252.85 & 100 & 102.199 \\ 280.15 & 1000 & 934.653 \\ 318.95 & 10000 & 9934.34 \\ 372.75 & 100000 & 101902. \end{array} \right)$$
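For readers without the original tool, the three fits above can be sketched with SciPy's `curve_fit`; the starting values are assumptions of mine (roughly the Antoine constants for water):

```python
import numpy as np
from scipy.optimize import curve_fit

# Water vapor pressure data from the answer (T in Kelvin, P in Pascal).
T = np.array([212.45, 230.95, 252.85, 280.15, 318.95, 372.75])
P = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0])

def antoine(T, a, b, c):
    # Antoine's model: P = exp(a + b / (T + c)).
    return np.exp(a + b / (T + c))

p0 = [23.0, -3800.0, -46.0]  # assumed starting values, near water's constants

# 1) Direct least squares on P: the 100000 Pa point dominates.
p1, _ = curve_fit(antoine, T, P, p0=p0, maxfev=20000)

# 2) Least squares on log(P): balanced relative errors.
def log_antoine(T, a, b, c):
    return a + b / (T + c)

p2, _ = curve_fit(log_antoine, T, np.log(P), p0=p0)

# 3) Nonlinear model with relative-error weighting (sigma = P).
p3, _ = curve_fit(antoine, T, P, p0=p0, sigma=P, maxfev=20000)
```

With fits 2 and 3, the low-pressure points are matched to within a modest relative error, whereas fit 1 only gets the high-pressure end right.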

As you can see, the second and third steps lead to very similar results.