Regression with noise

347 Views Asked by At

I have data acquired from simulation that is tainted by noise, which I know to a Wiener process (integration of white noise, with mean $\mu = 0$ and variance $\sigma^2$ known). I have good reasons to assume that the curve is a mixture of power laws, i.e., $y(x) \sim a_0 x^{-a}$ for $x< x_0$ and $y(x) \sim b_0 x^{-b}$ for $x>x_0$. The usual trick to this kind of function is to plot $\log(y(x))$ against $\log(x)$ and fit two linear curves.

However, if the noise on $y(x)$ becomes significant ($y(x) \approx \sigma$), the noise on the log-log plot becomes asymmetrically distributed. Currently, I use a Tuley's biweight functions instead of the usual least-squares but I was wondering if there was an established estimator for this kind of problem.

Here's an example of the problem : enter image description here

2

There are 2 best solutions below

0
On BEST ANSWER

Your picture demonstrates the bias nicely! And this is to be expected, since the Jensen inequality shows that your mean estimates will be biased if you take logs of additive Gaussian errors.

Since you should not take logs, you face a non-linear regression problem. In your case this should be no problem, since standard numerical software will provide a function to do this (in Matlab this would be "lsqnonlin"). But in addition to the non-linearity your error term does not fulfill the requirements for least squares.

Error term

Let $\epsilon_x$ be the white noise then your errors are $\eta(x_k) = \int_{x_1}^{x_k}{\epsilon_s \,ds}$ which means they are heteroskedastic $\text{Var}[\eta_k]=x_k \sigma^2$ and serially correlated $\text{corr}(\eta_k,\eta_{k-i})=\sqrt{\frac{x_{k-i}}{x_{k}}}$. Both features are clearly visible in the figure you posted.

To get rid of these issues I suggest not to fit to your data $y_k$ directly but to its increments $y_k-y_{k-1}$. This is because (assuming your observations are on a grid, i.e. $x_k=x_{k-1}+d$ with $d=\text{const}.$) then the error term is $\delta_k = \int_{x_{k-1}}^{x_k}{\epsilon_s \,ds}$ with $\text{Var}[\delta_k]=d \sigma^2$ and $\text{corr}(\delta_k,\delta_{k-i})=0$.

Fit with $x_0$ known

Assume now for a moment, that the location of the break $x_0$ would be known. Then your regression function for the increments is $$ f(x) = \begin{cases} a_0 \left( x^{-a} - (x -d)^{-a} \right) & \text{for $x<x_0$} \\ b_0 \left( x^{-b} - (x -d)^{-b} \right) & \text{for $x\geq x_0$}\end{cases}$$ which is a perfectly defined function with parameters, such that the $f(x_k)$ can be fit to the increments $y_k - y_{k-1}$ by least squares.

Finding $x_0$

Standard linear regression/time series analysis knows ways to find changes in trend. But this is always a linear trend. I am not sure something like this would work in the non-linear case. So I suggest a pragmatic solution: Just create a fit for every possible break $x_0$ and then choose the one you like best, for example using a criterion such as residual error.

0
On

Perhaps it is not the same terminology or knowledge domain that you use, but from my point of view this is a problem of heteroscedasticity, i.e., the noise term does not seem to follow $N(0, \sigma^2)$, but rather $N(0,h(x_{4i})\sigma^2)$ where $h(x)$ is monotone increasing function of $x_4$. Thus, my suggestion would be to use the so-called FGLS. Namely, assume some pretty general structure on $h(x)$ and then estimate it, and using this estimator to compute a WLS estimators for $y=f(x_4)+\epsilon_i$, where $f$ can be some polynomial of $x_4$.

In details: Lets assume that your model is given $y_i = \beta_0 + \beta_1x_{4i} +\epsilon_i$, where $\epsilon_i \sim N(0,h(x_{4i})\sigma^2)$. You can assume that $h(x_4)=\exp\{\delta_0 + \delta_1x_4\}$, and estimate it using $$ \log(\epsilon^2)=\log(\sigma^2)+\delta_0+\delta_1x_4+\eta, $$
where $\log(\sigma^2)+\delta_0$ is the intercept term and noise term is following some $N(0, b^2)$ distribution. Finally, you can take $\hat{h}_i=\exp\{\hat{\delta}_0+\hat{\delta}_1x_4\}$ to be the corresponding weight $w_i$ for the $i$th data point, namely $$ y_i^*=y_i/\sqrt{\hat{h}_i} = \beta_0/\sqrt{\hat{h}_i} + \beta_1x_4/\sqrt{\hat{h}_i} + \epsilon_i/\sqrt{\hat{h}_i}. $$ Surely the desired properties of of the FGLS estimator are mainly hold asymptotically, i.e., $Var(y_i^*|x_4)=\sigma^2h(x_4)Var(1/\hat{h}^{1/2}(x_4))\to\sigma^2\frac{h(x_4)}{h(x_4)}=\sigma^2$. But for large enough $n$ it should suffice as well.