Regress the mean/stddev of a data set resembling a cumulative gaussian distribution?

67 Views Asked by At

If I have a data set of x,y values that I believe resemble a cumulative gaussian distribution, is there an easy way for me to programmatically derive the mean/stddev of the CDF?

For example, imagine I have the following x/y value pairs:

1.380211242 0.06984097
1.62324929  0.09854223
1.84509804  0.15019463
2.133538908 0.227332755
2.423245874 0.352771719
2.73239376  0.481883449

I believe this roughly corresponds to a CDF gaussian function with mean=2.8 and stddev=0.9, or at least that roughly matches when I plot it:

data set and model plot

I'd like to programmatically derive mean/stddev from the x/y values (instead of trial and error in excel) - is that possible?

1

There are 1 best solutions below

2
On BEST ANSWER

What I would suggest is to look at square differences between your given $y_i$s and the value implied by a normal distribution at the corresponding $x_i$ value.

Lets call your pairs $(x_i,y_i)_{i=1}^N$, so for example $(x_1,y_1) = (1.380211242, 0.06984097)$. I'll also denote by $\Phi$ the cdf of a standard normal distribution, while $F_{\mu,\sigma}$ is the cdf of a normal distribution with parameters $\mu$ and $\sigma$. Then we have the relationship $F_{\mu,\sigma}(x) = \Phi(\frac{x - \mu}{\sigma})$. Now I suggest to minimize the sum of square differences, i.e. define \begin{align} &D(\mu,\sigma) = \sum_{i=1}^N (y_i - F_{\mu,\sigma}(x_i))^2, \mbox{ and}\\ &\min_{\mu,\sigma} D(\mu,\sigma) \end{align}

The minimization is carried out over the two parameters $\mu \in \mathbb{R}$ and $\sigma > 0$, and the optimal parameters (i.e the parameters that give the lowest value) will be your best (in the least square sense) estimates.

If you want to go there you can calculate the derivative of $D$ with respect to $\mu$ and $\sigma$, see https://stats.stackexchange.com/questions/154133/how-to-get-the-derivative-of-a-normal-distribution-w-r-t-its-parameters


Edit: running your values through the R-script below with your starting values of $2.8$ and $0.9$ gives me $\mu = 2.7750658, \sigma = 0.8981213$.


XX = c( 1.380211242, 1.62324929, 1.84509804, 2.133538908, 2.423245874, 2.73239376)

YY = c( 0.06984097, 0.09854223, 0.15019463, 0.227332755, 0.352771719, 0.481883449)

squareDiff <- function(param,dat){

x = dat\$x

y = dat\$y

m=param[1]

s=param[2]

D = sum((y - pnorm(x,mean=m,sd=s))^2)

return(D)

}

data = list(x=XX,y=YY)

result = optim(par=c(2.8,0.9), squareDiff, dat=data)

print(result)