Can integral equations be paired with linear regression to fit a double Gaussian regression?

1.3k Views Asked by At

In the paper https://www.scribd.com/doc/14674814/Regressions-et-equations-integrales (page 6) a method is presented to fit a single nonlinear Gaussian curve to noisy data. Later on in the paper, the same method is employed to fit a double exponential regression (and even more). I'm curious if it would be possible to employ the same technique to fit a double Gaussian regression with scaling constants? To be specific, I want to perform a regression of the following equation to data.

$$ f(x)=\frac{c_1}{\sigma_1} \text{exp}\left(-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2\right)+\frac{c_2}{\sigma_2} \text{exp}\left(-\frac{1}{2}\left(\frac{x-\mu_2}{\sigma_2}\right)^2\right) $$

Where you have data $x_i, y_i$ with normally distributed noise $\epsilon_i$.

I know nonlinear least squares works pretty well in this case, but having a non-iterative algorithm to do this (and potentially with more Gaussian kernels) would be useful in many situations.

Take for example this data.

0,0.0953435022119166
0.408163265306122,0.165876041641162
0.816326530612245,0.217055023196137
1.22448979591837,0.336625390515636
1.63265306122449,0.502382096769287
2.04081632653061,0.666515393163172
2.44897959183673,0.883208684189565
2.85714285714286,1.12458773291947
3.26530612244898,1.37726169379448
3.6734693877551,1.61866109023969
4.08163265306122,1.82013750956065
4.48979591836735,1.9779288108796
4.89795918367347,2.06828873076259
5.30612244897959,2.08056217287661
5.71428571428571,2.07533074340141
6.12244897959184,2.0104842044372
6.53061224489796,1.94357288893297
6.93877551020408,1.8682413902483
7.3469387755102,1.82822815415585
7.75510204081633,1.84578299714906
8.16326530612245,1.86852992222851
8.57142857142857,1.95624254355303
8.97959183673469,2.01135490423995
9.38775510204082,2.0796994498718
9.79591836734694,2.09619003334345
10.2040816326531,2.03603444176666
10.6122448979592,1.91638500266306
11.0204081632653,1.77786079492498
11.4285714285714,1.55497697393296
11.8367346938776,1.28953163571889
12.2448979591837,1.06603391254518
12.6530612244898,0.812745486024327
13.0612244897959,0.622710458610182
13.469387755102,0.433501473226128
13.8775510204082,0.333280824649249
14.2857142857143,0.190078385939407
14.6938775510204,0.133877864179819
15.1020408163265,0.0812258047441917
15.5102040816327,0.0427020032495271
15.9183673469388,0.0280925506752726
16.3265306122449,0.0283736350602543
16.734693877551,0.00634269170340754
17.1428571428571,0.00825683417491195
17.5510204081633,-0.0074678105115024
17.9591836734694,-0.000778850177607315
18.3673469387755,-0.00739023269152126
18.7755102040816,-0.00232605185006384
19.1836734693878,0.0194459618205399
19.5918367346939,0.000445784296190274
20,-0.000752024542365978

Which, when plotted looks like this:

double Gaussian curve

1

There are 1 best solutions below

1
On BEST ANSWER

Mainly my answer is already in comments. Nevertheless, more details are given below (with different notations in order to made easier the typing) :

$$y(x)=a\,e^{-(bx+c)^2}+A\,e^{-(Bx+C)^2} \tag 1$$ $$\int ydx=\frac{a\sqrt{\pi}}{2b}\text{erf}(bx+c)+\frac{A\sqrt{\pi}}{2B}\text{erf}(Bx+C) \tag 2$$ $$\int xydx=-\frac{ac\sqrt{\pi}}{2b^2}\text{erf}(bx+c)-\frac{AC\sqrt{\pi}}{2B^2}\text{erf}(Bx+C)-\frac{a}{2b^2}\,e^{-(bx+c)^2}-\frac{A}{2B^2}\,e^{-(Bx+C)^2}\tag 3$$ $$\int x^2ydx=\frac{a(2c^2+1)\sqrt{\pi}}{4b^3}\text{erf}(bx+c)+\frac{A(2C^2+1)\sqrt{\pi}}{4B^3}\text{erf}(Bx+C)+\frac{a(c-bx)}{2b^3}\,e^{-(bx+c)^2}+\frac{A(C-Bx)}{2B^3}\,e^{-(Bx+C)^2} \tag 4$$ $$\int x^3ydx=-\frac{ac(2c^2+3)\sqrt{\pi}}{4b^4}\text{erf}(bx+c)-\frac{AC(2C^2+3)\sqrt{\pi}}{4B^4}\text{erf}(Bx+C)-\frac{a(c^2+1-bcx+b^2x^2)}{2b^4}\,e^{-(bx+c)^2}-\frac{A(C^2+1-BCx+B^2x^2)}{2B^4}\,e^{-(Bx+C)^2} \tag 5$$ Considering $\text{erf}(bx+c)$ , $\text{erf}(Bx+C)$ , $e^{-(bx+c)^2}$ and $e^{-(Bx+C)^2}$ as unknowns, Eqs.$(2,3,4,5)$ is a linear system which can be solved. Then the explicit expressions of $e^{-(bx+c)^2}$ and $e^{-(Bx+C)^2}$ obtained can be put into Eq.$(1)$. The result is an integral equation which could be theoretically used for linear regression.

But, in practice, the calculus to get to the integral equation is much too complicated. Moreover, there is another counter-argument (number of parameters) as it will be pointed out below. So, the method of integral equation appears not convenient for practical use in the case of double Gaussian regression.

In comments, wdkrnls raised the pertinent remark that instead of an integral equation, a differential equation would be simpler, regardless the question of instability of numerical differentiation. This is true and was considered when the above referenced paper was written.

Let : $\quad y(x)=af(x)+AF(x)\quad $ where $\quad f(x)=e^{-(bx+c)^2}\quad$ and $\quad F(x)=e^{-(Bx+C)^2}$

$$y'(x)= -2ab(bx+c)f-2AB(Bx+C)F$$ $$y''(x)= 2ab^2(2b^2x^2+4bcx+2c^2-1)f+2AB^2(2B^2x^2+4BCx+2C^2-1)F$$ $$\left(\begin{matrix} f \\ F \end{matrix}\right) = \left(\begin{matrix} -2ab(bx+c) & -2AB(Bx+C) \\ 2ab^2(2b^2x^2+4bcx+2c^2-1) & 2AB^2(2B^2x^2+4BCx+2C^2-1) \end{matrix}\right)^{-1} \left(\begin{matrix} y' \\ y'' \end{matrix}\right) $$ We put $f$ and $F$ into $y(x) = \left(\begin{matrix} a & A \end{matrix}\right) \left(\begin{matrix} f \\ F \end{matrix}\right) \quad$ which leads to the differential equation on matrix form :

$$\left(\begin{matrix} a & A \end{matrix}\right) \left(\begin{matrix} -2ab(bx+c) & -2AB(Bx+C) \\ 2ab^2(2b^2x^2+4bcx+2c^2-1) & 2AB^2(2B^2x^2+4BCx+2C^2-1) \end{matrix}\right)^{-1} \left(\begin{matrix} y' \\ y'' \end{matrix}\right) - y =0 $$ After calculus and simplification :

$$(b^2-B^2)xy''+4(b^3c-B^3C)y''+2(b^4-B^4)x^2y'+4(b^3c-B^3C)xy'+(2b^2c^2-b^2-2B^2C^2+B^2)y'+4bB(Bb^3-bB^3)x^3y+4bB(b^3C-B^3c+2Bb^2c-2bB^2C)x^2y+4bB(2b^2cC-2B^2Cc+Bbc^2-bBC^2)xy+2bB(2bc^2C-2BC^2c+Bc-bC)y=0$$

This is the differential equation which could be used for linear regression.

But another condition to obtain a robust process is that the number of independent terms in the differential equation be not higher than the number of parameters in the original equation.

This is far to be the case here.

As a consequence, the process will not be robust, even with not scattered data. This means that, sometimes and with luck, the computed values of the parameters will not be too bad. But not in the general case. And even worse with scattered data, of course. I am sorry for this discouraging result obtained when I was writing the referenced paper. That is why the case of double Gaussian is not treated in the paper.

Note: One can obtain an integral equation from the above differential equation with successive integrations. This is an easier way than the direct integration method as shown above. But the number of independent terms will be still larger than with the differential equation. So, the discouraging conclusion is the same.