We have the following problem: we perform a nonlinear regression after applying a non-linear transformation to our data set. Then, we perform an extrapolation using the fitted parameters in the transformed domain. Afterwards, we perform the inverse transformation on this extrapolated data to get an extrapolation in the 'real' domain. The nonlinear transformation makes use of, amongst other operations, a logarithm of the data. Upon rereading the question, we realized this explanation is complicated. However, we think that the example at the bottom can provide the necessary clarity.
The non linear fit works as expected, but when we try to generate confidence bands on our extrapolation using resampling of the residuals, we run into problems. Adding the resampled residuals to the (backtransformed) fitted model can lead to negative data in the 'resampled' data set. Of course, we cannot take a logarithm of the negative data, hence we cannot transform this resampled data set. In addition to being untransformable, negative data is also non-physical for our problem. We have thought about three possible solutions:
We discard resampled data sets that contain negative values. Cons of this approach: we introduce a bias towards certain data sets and thus resampling is not random anymore.
Instead of taking the absolute residuals, we could resample the relative residuals. For example, if we have a data value of 9 and a fitted value of 10, the residual would be equal to 10% of the fitted value. Hence, it would be impossible to get negative values after resampling. Cons: this would lead to a residual 'blow up': for example if we have a fitted data [0.01;100] and measured data [0.1;100], and the 1000% error of the first data point gets resampled to the second point, the second resampled data point blows up to 1000.
We could resample the residuals in the transformed domain. Cons of this approach: we are absolutely not sure whether this still produces the correct statistics. Since you are not allowed to transform a standard deviation through a non-linear transformation, our gut feeling says approach 3. is also not allowed.
So our question is: what is the correct way of performing a resample of the residuals in our case?
Kind regards,
Olivier
A simplified example:
According to the physics of our problem at hand, we can fit the following model $Y = P_1 + P_2*(1-exp(-P_3*t))$ , with parameters to fit being P1, P2 and P3 . But the following model can only be fit after applying the following transformation: $Y_T = ln(\frac{100*Y}{100-Y})$. As a side note: in our case the model equation is more difficult and it is impossible to linearize the model. So please do not provide alternative approaches that only work for linear regression, but alternatives that also work for nonlinear regression are of course welcome!
At the bottom of the example, you can find two graph illustrating the fit in both the measurement and transformed domain.
Suppose we had the following measurements:
time = $t$ = [0, 5, 10, 15, 20, 25, 30, 35, 40]
measurement = $Y_m$ = [1.2, 4.2, 13.1, 19.8, 21.4, 32.7, 35.4, 28.9, 35.3]
So, as a first step in order to perform the fit, we must apply the transformation described above (being $Y_{mT} = ln(\frac{100*Y_m}{100-Y_m})$ ):
The measurements in the transformed domain ($Y_{mT}$) are then:
$Y_{mT}$ = [0.20, 1.48, 2.71, 3.21, 3.30, 3.88, 4.00, 3.71, 4.00]
Now we fit these transformed data to our model equation, being $Y = P_1 + P_2*(1-exp(-P_3*t))$ . The parameters resulting from the fit are: $P_1 = 0.01$ , $P_2 = 4$, $P_3 = 0.1$. With this fitted model, we can calculate the values that our model predicts at the time data in the transformed domain, which we will call $Y_{fT}$:
$Y_{fT}$ = [0.01, 1.58, 2.54, 3.12, 3.47, 3.68, 3.81, 3.89, 3.94]
We can find the expected values according to our model in the domain of the measurement by applying the reverse transformation (the reverse transformation is $\frac{100*exp(Y_{fT})}{100+exp(Y_{fT})}$). Let us call the fitted model values in the measurement domain $Y_{f}$:
$Y_{f}$ = [1.0, 4.6, 11.2, 18.4, 24.3, 28.4, 31.1, 32.8, 33.9]
Now we can calculate the residauls ($r$) with the following formula: $r = Y_m - Y_f$
$r$ = [0.21, -0.45, 1.81, 1.40, -2.89, 4.23, 4.29, -3.91, 1.43 ]
Now we resample the residuals, meaning that we take random samples out of $r$ to create $r_1$, $r_1$ being the first set of resampled residuals. So now we can generate a new data set (called $Y_{r1}$) by the following formula: $Y_{r1} = r_1 + Y_f$. If we by chance resample the residual of -3.91 to the first point of the model, we generate new measurement data that contains negative data points, see for example the following set:
$Y_{r1}$ = [-2.9, 4.9, 10.8, 20.2, 25.7, 25.5, 35.4, 37.1, 35.3]
This resampled data set cannot be refit, hence breaking our application of the resampling of the residuals as a parametric bootstrap approach...