Fitting to top of point cloud rather than the middle - non-linear regression with negative residuals


Problem: There are some types of estimation problems where the noise is not normal and where the mean of the residuals cannot/should not be 0. A common example is trying to resolve the boundary of a point cloud, rather than performing standard OLS with Gaussian, zero-mean error.

I have a data set consisting of sampled (X, Y) pairs where the measurement of Y is subject to noise. However, this noise is exclusively subtractive, by which I mean that all data points underestimate the true function. In the attached image, these points are shown in blue, the true model in red, and an OLS fit in green.

The problem is easy to model. The function is: $$Y = Ae^{(kX)} + C + \epsilon$$

$\epsilon$ is not normally distributed; instead it follows a uniform distribution whose samples are all $< 0$: $\epsilon \sim U(-300, 0)$. This means that all data points fall below the true function.

Performing an OLS fit to this data to recover A, k, and C puts a line through the middle of the data, because OLS assumes zero-mean Gaussian noise. Given that the noise here is not Gaussian, this obviously misestimates the parameters.
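To make the failure mode concrete, here is a small synthetic reproduction (the specific parameter values are illustrative assumptions, not the actual data): with noise $U(-300, 0)$, an ordinary least-squares fit absorbs the noise mean of roughly $-150$, mostly into the offset $C$.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Illustrative "true" parameters (assumptions for this sketch).
A, k, C = 100.0, 0.05, 50.0
x = np.linspace(0, 60, 200)
y_true = A * np.exp(k * x) + C
y = y_true + rng.uniform(-300, 0, size=x.size)  # purely subtractive noise

def model(x, A, k, C):
    return A * np.exp(k * x) + C

# Plain least squares: the fitted curve runs through the middle of the cloud,
# i.e. roughly 150 units (the mean of U(-300, 0)) below the true curve,
# which shows up mainly as a strongly negative estimate of C.
popt, _ = curve_fit(model, x, y, p0=[100, 0.05, 0])
print(popt)
```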

My question: In real life, I have data sets like these and need to estimate the parameters that describe them under this uniform-noise situation. OLS with a mean-squared-error loss does not make the right assumptions. What method (for example MLE) and what loss function would be best?

Apologies if there is a similar question already asked on here - I searched but could not find something similar enough to this problem to help me out.

[Plot: data points (blue), true model (red), and OLS fit (green)]


2 Answers

Answer 1

If the error is uniformly distributed on $[-E,0]$, then every feasible curve gives your data exactly the same likelihood, so the question reduces to "just" finding a curve $y=f(p,x)$ in your family such that $y_i\le f(p,x_i)\le y_i+E$ for all $i$. (If there are several such curves, there is no canonical way to prefer one over another, so you either need to determine the whole admissible region or, at least, find a single point in it.) However, this does not look easy with more than one parameter in general.

What you can try is to define the "conflict function" $$ F(p)=\max_{i}\max\bigl(y_i-f(p,x_i),\,f(p,x_i)-y_i-E\bigr) $$ (or something like that), where $f(p,x)$ is your parametric family, and use some global minimization method. (After I learned about differential evolution, I try it on everything, and it seems to work pretty well here, but you may have your own preferences.)
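A minimal sketch of this idea with SciPy's `differential_evolution` (the synthetic data, the true parameters, the parameter bounds, and the known noise bound $E$ are all assumptions for illustration):

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(1)

# Synthetic data; E is the (assumed known) width of the uniform noise.
A, k, C, E = 100.0, 0.05, 50.0, 300.0
x = np.linspace(0, 60, 200)
y = A * np.exp(k * x) + C + rng.uniform(-E, 0, size=x.size)

def f(p, x):
    A, k, C = p
    return A * np.exp(k * x) + C

def conflict(p):
    r = f(p, x) - y  # residuals of the candidate curve
    # Positive when some point lies above the curve (y_i > f(p, x_i)) or more
    # than E below it (f(p, x_i) > y_i + E); non-positive iff p is feasible.
    return np.max(np.maximum(-r, r - E))

bounds = [(1, 1000), (0.001, 0.2), (-500, 500)]
res = differential_evolution(conflict, bounds, seed=2)
print(res.x, res.fun)  # res.fun <= 0 means a feasible curve was found
```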

Answer 2

Interesting problem. I've been toying with asymmetric l1 loss and got this result here. For the accompanying notebook see here. Hope this helps!

Edit: As you described, OLS does not work in your case because it is based on the symmetric squared error $d=\left(f(x)-y\right)^2$, where $f(x)$ is your model and $y$ your noisy target values. It punishes imperfect predictions equally whether they over- or underestimate $y$, which is a good first guess if you know nothing about your noise distribution.

But since you are in the lucky position of knowing your noise distribution, you can make $d$ sign-dependent: $$d=\begin{cases} a\cdot\left(f(x)-y\right)^2 & \quad \text{if } f(x)-y < 0\\ b\cdot\left(f(x)-y\right)^2 & \quad \text{otherwise,} \end{cases}$$ e.g. with $a=10$ and $b=1$ in your case. The exact values of $a$ and $b$ are not that important; what matters is their ratio. Using $a/b>1$ moves the model prediction progressively upwards for larger ratios, because the loss, summed over all residuals, is hurt more severely by predictions that underestimate $y$ than by those that overestimate it.

This sign dependent approach is not limited to squared differences and also works with $d=|f(x)-y|$ as indicated earlier.
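A minimal sketch of the sign-dependent squared loss with SciPy (the synthetic data and the specific $a=10$, $b=1$ values are illustrative assumptions, as above):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Illustrative true parameters and subtractive uniform noise (assumptions).
A, k, C = 100.0, 0.05, 50.0
x = np.linspace(0, 60, 200)
y = A * np.exp(k * x) + C + rng.uniform(-300, 0, size=x.size)

def model(p, x):
    A, k, C = p
    return A * np.exp(k * x) + C

def asymmetric_sq_loss(p, a=10.0, b=1.0):
    r = model(p, x) - y
    # Penalize residuals where the model underestimates y (r < 0) a/b times
    # more heavily than overestimates, pushing the fit toward the top edge.
    w = np.where(r < 0, a, b)
    return np.sum(w * r**2)

opts = {"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8}
res = minimize(asymmetric_sq_loss, x0=[100, 0.05, 0],
               method="Nelder-Mead", options=opts)
res_ols = minimize(lambda p: asymmetric_sq_loss(p, 1.0, 1.0), x0=[100, 0.05, 0],
                   method="Nelder-Mead", options=opts)
# The asymmetric fit sits higher in the point cloud than the symmetric one.
print(res.x[2], res_ols.x[2])
```

The larger the ratio $a/b$, the closer the fitted curve moves to the top of the cloud; the $a=1$, $b=1$ case reproduces an ordinary symmetric least-squares fit for comparison.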

Using numerical optimizers, you should have no trouble finding good parameter estimates. The same idea carries over to other asymmetric noise distributions if you need something other than a Gaussian (or uniform) error model.

Note: I also updated the gist notebook linked above to enable dynamic exploration of the choice of the $a/b$ ratio and the loss function.