How to curve fit an unknown function?


I have data that can be described by $y=f(x,z)$, where $z$ ranges from 170 down to 154. The columns labelled $ks$ are known samples, whose $y$ value is the number given in the table header; the columns labelled $uks$ are unknown samples.

$$ \begin{array}{c|ccccccc} z & x_{ks_{1}=0} & x_{ks_{2}=50} & x_{ks_{3} = 60} & \cdots & x_{uks_{1}}& \cdots & x_{uks_{n}} \\ \hline 170 & 3.5 & 5 & 6 &\cdots & 5.3 & \cdots & 12 \\ 168 & 3.7 & 5.1 & 6.6 &\cdots & 5.6 & \cdots & 13.2 \\ \vdots & \vdots & \vdots & \vdots &\ddots & \vdots & \ddots & \vdots \\ 154 & 6.8 & 9.52 & 10 &\cdots & 11.5 & \cdots & 26.5 \end{array} $$

Plotting values of $y$ vs. $x$ for various $z$, I get this plot:

[Plot: $y$ vs. $x$ for various $z$]

and on a semi-log plot:

[Plot: $y$ vs. $x$ for various $z$, semi-log axes]

Note that the values on the $y$-axis are the values given by the subscripts of $x$ in the table headers.

**My objective is to find the value of $y$ for the corresponding $x_{uks_{1}}, x_{uks_{2}}, \cdots, x_{uks_{n}}$ for different $z$.** Alternative methods are most welcome. My approach is as follows:

  1. I fit a function that is linear in $1/\ln[x]^{3}$, of the form:

$$y = \frac{A_{i}}{\ln[x]^{3}} + B_{i},\quad \text{one pair } (A_i, B_i) \text{ for each } z$$

this gives me plots as follows:

[Plot: fitted curves $y = A_i/\ln[x]^{3} + B_i$ for each $z$]

  2. Since $A_{i}$ and $B_{i}$ differ for each $z$, I can fit them as functions of $z$, so that $A_{i} = f(z)$ and $B_{i} = g(z)$. For this step, I used the following functions:

$$A = a_1 z^3 + a_2 z^2 + a_3 z + a_4$$ $$B = b_1 z^2 + b_2 z + b_3$$

  3. Using the correlation:

$$y = \frac{a_1 z^3 + a_2 z^2 + a_3 z + a_4}{\ln[x]^{3}} + b_1 z^2 + b_2 z + b_3$$

Substituting the different values of $x_{uks}$, I get the corresponding $y$, but when I plot these against $z$, I get plots of the form:

[Plot: predicted $y$ vs. $z$ for the unknown samples]
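The three steps above can be sketched in Python as follows. This is only a minimal sketch on synthetic stand-in data, not the data from the table: the helper names (`fit_A_B`, `predict_y`), the grid of $x$ values, the "true" coefficients and the noise level are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_A_B(x, y):
    """Step 1: least-squares fit of y = A / ln(x)**3 + B at one value of z."""
    u = 1.0 / np.log(x) ** 3                      # the model is linear in u
    design = np.column_stack([u, np.ones_like(u)])
    (A, B), *_ = np.linalg.lstsq(design, y, rcond=None)
    return A, B

# Synthetic stand-in data (assumed, NOT the data from the question).
rng = np.random.default_rng(0)
z_values = np.arange(154.0, 171.0, 2.0)           # 154, 156, ..., 170
x_grid = np.linspace(3.5, 12.0, 8)

def true_A(z):
    return 800.0 - 4.0 * (z - 154.0)

def true_B(z):
    return -300.0 + 1.5 * (z - 154.0)

# Step 1: one (A_i, B_i) pair per z.
A_fit, B_fit = [], []
for z in z_values:
    noise = rng.normal(0.0, 0.5, x_grid.size)
    y = true_A(z) / np.log(x_grid) ** 3 + true_B(z) + noise
    A, B = fit_A_B(x_grid, y)
    A_fit.append(A)
    B_fit.append(B)

# Step 2: cubic in z for A, quadratic in z for B.
# Polynomial.fit rescales z internally, which keeps the fit well conditioned.
A_of_z = Polynomial.fit(z_values, np.array(A_fit), deg=3)
B_of_z = Polynomial.fit(z_values, np.array(B_fit), deg=2)

# Step 3: evaluate the combined correlation at any (x, z).
def predict_y(x, z):
    return A_of_z(z) / np.log(x) ** 3 + B_of_z(z)

print(predict_y(5.0, 162.0))
```

For an unknown sample, `predict_y` would be called with that sample's measured $x$ at each $z$.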

Now, my questions are as follows:

  1. How am I getting $y$ to be a linear function of $z$?
  2. Is this approach a valid one?
Best answer

Assuming that $$y = \frac{A_{i}}{\ln[x]^{3}} + B_{i}$$ is the right model to use, your general approach seems sound to me, at least in principle. For each value of $z$, you fitted $A_i$ and $B_i$; plotting these coefficients as functions of $z$, you observed regular trends and decided to fit them with a cubic polynomial for $A_i$ and a quadratic polynomial for $B_i$.

My first question is: did you check that all the coefficients $a_i$ and $b_i$ are statistically significant?

In any case, after doing this, I strongly suggest that you regress all the data simultaneously with all descriptors and check whether each coefficient is significant. If some are not, discard the corresponding terms and refit.
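As a rough illustration of such a simultaneous fit, here is a minimal sketch that computes ordinary least-squares coefficients and their t-statistics with NumPy alone. The data are synthetic (assumed linear in $z$ plus noise), the function names are hypothetical, and $z$ is centered and scaled purely for numerical conditioning, which is my own choice: it changes the values of the lower-order coefficients, but not the significance of the highest-order term of each polynomial.

```python
import numpy as np

def design(x, z, deg_a=3, deg_b=2, z0=162.0, scale=8.0):
    """Columns: zc**k / ln(x)**3 for k = 0..deg_a, then zc**k for k = 0..deg_b."""
    zc = (z - z0) / scale                  # centered, scaled z (an assumption)
    u = 1.0 / np.log(x) ** 3
    cols = [zc**k * u for k in range(deg_a + 1)]
    cols += [zc**k for k in range(deg_b + 1)]
    return np.column_stack(cols)

def ols_with_t(X, y):
    """Ordinary least squares; returns coefficients and their t-statistics."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the estimates
    return beta, beta / np.sqrt(np.diag(cov))

# Synthetic data on a full (z, x) grid; the true A and B are linear in z.
rng = np.random.default_rng(1)
Z, Xg = np.meshgrid(np.arange(154.0, 171.0, 2.0),
                    np.linspace(3.5, 12.0, 8), indexing="ij")
z, x = Z.ravel(), Xg.ravel()
zc = (z - 162.0) / 8.0
y = (768.0 - 32.0 * zc) / np.log(x) ** 3 + (-288.0 + 12.0 * zc)
y = y + rng.normal(0.0, 0.5, y.size)

beta, t_stats = ols_with_t(design(x, z), y)
names = ["a:zc^0", "a:zc^1", "a:zc^2", "a:zc^3", "b:zc^0", "b:zc^1", "b:zc^2"]
for name, b, t in zip(names, beta, t_stats):
    print(f"{name:7s} coef = {b:10.3f}   t = {t:9.2f}")
```

Terms whose $|t|$ stays small (a common rule of thumb is below about 2) are candidates for removal before refitting.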

Alternatively, given the observations you made, you could use a stepwise regression, which enters the most significant terms into the regression one after another.
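A minimal sketch of a forward stepwise selection, under stated assumptions: candidate terms enter one at a time according to the largest reduction in the residual sum of squares, and entry stops when the partial F-statistic falls below a threshold. The threshold `f_enter=4.0` (roughly the 5% level for one added term with many data points), the term names and the synthetic data are all my own illustrative choices.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(candidates, y, f_enter=4.0):
    """candidates: dict name -> column. Returns term names in order of entry."""
    n = y.size
    X = np.ones((n, 1))                    # start from the intercept only
    selected = ["const"]
    current = rss(X, y)
    remaining = dict(candidates)
    while remaining:
        trial = {name: rss(np.column_stack([X, col]), y)
                 for name, col in remaining.items()}
        best = min(trial, key=trial.get)   # largest RSS reduction
        dof = n - (X.shape[1] + 1)
        f_stat = (current - trial[best]) / (trial[best] / dof)
        if f_stat < f_enter:               # no remaining term is worth adding
            break
        X = np.column_stack([X, remaining.pop(best)])
        selected.append(best)
        current = trial[best]
    return selected

# Synthetic data: only u, zc*u and zc truly contribute (assumed).
rng = np.random.default_rng(2)
Z, Xg = np.meshgrid(np.arange(154.0, 171.0, 2.0),
                    np.linspace(3.5, 12.0, 8), indexing="ij")
zc = (Z.ravel() - 162.0) / 8.0             # centered, scaled z
u = 1.0 / np.log(Xg.ravel()) ** 3
y = (768.0 - 32.0 * zc) * u - 288.0 + 12.0 * zc + rng.normal(0.0, 0.5, u.size)

candidates = {"u": u, "zc*u": zc * u, "zc^2*u": zc**2 * u, "zc^3*u": zc**3 * u,
              "zc": zc, "zc^2": zc**2, "zc^3": zc**3}
selected = forward_stepwise(candidates, y)
print(selected)
```

Because each entry test is made at a fixed threshold, an occasional spurious term can still slip in; the selected model is worth re-checking with a full significance analysis.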

From your last plot, $y$ is not linear as a function of $z$; there is some obvious curvature.

Edit

By the way, once you have the $A_i$'s and $B_i$'s, it could be good, starting from these values and $m=3$, to perform a nonlinear regression of $$y = \frac{A_{i}}{\ln[x]^{m}} + B_{i}$$ and see whether $m$ depends on $z$.
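One simple way to try this without a general nonlinear solver is sketched below. For a fixed $m$ the model is linear in $A$ and $B$, so the sketch scans a grid of $m$ values, solves for $A$ and $B$ by linear least squares at each one, and keeps the $m$ with the smallest residual sum of squares. This grid search is a stand-in for a full nonlinear regression and needs no starting values; the grid limits, the function names and the synthetic data (generated with $m=3$) are assumptions.

```python
import numpy as np

def fit_for_m(x, y, m):
    """Linear least squares for A and B at a fixed exponent m."""
    u = 1.0 / np.log(x) ** m
    X = np.column_stack([u, np.ones_like(u)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return coef, r @ r

def fit_exponent(x, y, m_grid=None):
    """Scan m over a grid; return (A, B, m) with the smallest RSS."""
    if m_grid is None:
        m_grid = np.linspace(1.0, 5.0, 401)   # step 0.01 (assumed range)
    rss = [fit_for_m(x, y, m)[1] for m in m_grid]
    m_best = float(m_grid[int(np.argmin(rss))])
    (A, B), _ = fit_for_m(x, y, m_best)
    return A, B, m_best

# Synthetic data with a true exponent of 3 at every z (assumed).
rng = np.random.default_rng(3)
x_grid = np.linspace(3.5, 12.0, 8)
m_by_z = {}
for z in np.arange(154.0, 171.0, 2.0):
    A_true = 800.0 - 4.0 * (z - 154.0)
    B_true = -300.0 + 1.5 * (z - 154.0)
    y = A_true / np.log(x_grid) ** 3 + B_true + rng.normal(0.0, 0.5, x_grid.size)
    m_by_z[z] = fit_exponent(x_grid, y)[2]

for z, m in m_by_z.items():
    print(f"z = {z:5.1f}   m = {m:.2f}")
```

If the fitted $m$ values scatter around a constant without a trend in $z$, a single fixed exponent is probably adequate.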

I think that a detailed analysis of the significance of every single parameter is crucial. Otherwise, you risk serious overfitting.

Edit after receiving the data points

I basically repeated what you did and obtained the different $A_i$'s and $B_i$'s corresponding to each value of $z$. Plotting them as a function of $z$, both seem to follow cubic polynomials; however, some of the coefficients are not statistically significant.

Now, trying to answer the question: "How am I getting $y$ to be a linear function of $z$?" In fact, if you fit using just linear functions, you already get something that is not bad at all: $$A =7365.18 -42.5313 z$$ $$B=378.156 -1.61209 z$$ give adjusted $R^2$ values of $0.917$ and $0.933$ respectively, which is quite good. Admittedly, these $R^2$ values become $0.995$ and $0.971$ when using cubic polynomials, but most of the trend is already captured by straight lines.
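To make this explicit, here is a short derivation that uses only the two straight-line fits quoted above; the shorthand $L$ is introduced here for brevity.

$$y \approx \frac{7365.18 - 42.5313\, z}{L} + 378.156 - 1.61209\, z = \left(\frac{7365.18}{L} + 378.156\right) - \left(\frac{42.5313}{L} + 1.61209\right) z, \qquad L = \ln[x]^{3}$$

At a fixed $x$ both brackets are constants, so with these fits $y$ is exactly a straight line in $z$; the higher-order terms in $A$ and $B$ only add a modest curvature on top. When $x$ itself changes with $z$, the dependence is no longer exactly linear.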

If you perform a simultaneous fit of all the data according to the model $$y = \frac{a_1 z^3 + a_2 z^2 + a_3 z + a_4}{\ln[x]^{3}} + b_1 z^3 + b_2 z^2 + b_3 z+b_4$$ the corresponding adjusted $R^2$ is $0.998585$, while excluding $a_1,a_2,b_1,b_2$ leads to an adjusted $R^2$ of $0.993603$, which is already very good. Again, I suspect some overfitting of the data.
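For reference, this kind of comparison between the full model and the reduced one can be sketched as below. The data are synthetic (straight-line trends in $z$ plus a small assumed curvature in $A$), so the printed values will not match the numbers quoted above; `adjusted_r2` and `build` are hypothetical helper names, and $z$ is centered and scaled for conditioning only.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of the least-squares fit; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - p)) / (tss / (n - 1))

def build(u, zc, deg):
    """Columns zc**k * u and zc**k for k = 0..deg (same degree for A and B)."""
    cols = [zc**k * u for k in range(deg + 1)]
    cols += [zc**k for k in range(deg + 1)]
    return np.column_stack(cols)

# Synthetic data: mostly linear in z, with a small curvature in A (assumed).
rng = np.random.default_rng(4)
Z, Xg = np.meshgrid(np.arange(154.0, 171.0, 2.0),
                    np.linspace(3.5, 12.0, 8), indexing="ij")
zc = (Z.ravel() - 162.0) / 8.0             # centered, scaled z
u = 1.0 / np.log(Xg.ravel()) ** 3
y = (768.0 - 32.0 * zc + 10.0 * zc**2) * u - 288.0 + 12.0 * zc
y = y + rng.normal(0.0, 0.5, u.size)

r2_full = adjusted_r2(build(u, zc, deg=3), y)     # a_1..a_4 and b_1..b_4
r2_reduced = adjusted_r2(build(u, zc, deg=1), y)  # without a_1, a_2, b_1, b_2
print(f"full: {r2_full:.6f}   reduced: {r2_reduced:.6f}")
```

A small gap between the two values, as in the fit above, suggests that the extra polynomial terms buy little and may mostly be fitting noise.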