I've recently developed an interest in statistics, so I wrote a program that generates sets of noisy linear data and records the Pearson correlation coefficient of each set against the amplitude of the random noise added to it. Here is the method I wrote to do this:

    def generateData(nLines, nPoints, gradient, yIntercept, xScaleFactor, randomCoefficientScaleFactor):
        Rs = []
        xValues = []
        for i in range(nLines):
            # Build one noisy line; the noise amplitude grows with i
            x = []
            y = []
            for j in range(nPoints):
                x.append(j*xScaleFactor)
                y.append(fLinear(x[j], i*randomCoefficientScaleFactor, gradient, yIntercept))
            dataSet = Data(x, y)
            # Record the noise amplitude and the resulting correlation coefficient
            xValues.append(i*randomCoefficientScaleFactor)
            Rs.append(linearCorrelationCoefficient(dataSet.x, dataSet.y))
        dataSet = Data(xValues, Rs)
        return dataSet

And here is the function fLinear():

    import random

    def fLinear(x, randomCoefficient, gradient, yIntercept):
        return gradient*x + yIntercept + randomCoefficient*random.uniform(0,1)

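The helper `linearCorrelationCoefficient()` is not shown above. For reference, here is a minimal sketch of the standard sample Pearson formula, assuming that is what the helper computes. The name `pearson_r` is a hypothetical stand-in for illustration, not necessarily the helper actually used.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and centered squares
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

With no noise the points lie exactly on a line, so a call such as `pearson_r([0, 1, 2, 3], [1, 3, 5, 7])` should give a value of 1 up to floating-point rounding.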
I ran this with 200 values for R, 1000 points for each line, and a scale factor on the random value of 75.
Outputting this data to a Desmos graph gives this result.
My first guess for the shape was a normal distribution curve, but as you can see here, it does not fit. My next thought was a sigmoid, so I tried three different functions:
- $\frac{1}{e^{kx}+1}$, which yielded this result.
- $\operatorname{erf}(kx)$, which yielded this result.
- $\tan^{-1}(kx)+1$, which yielded this result.
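For anyone who wants to try these candidates outside Desmos, the three functions above can be written in Python roughly as follows. This is only a sketch: the function names are made up for illustration, `k` is the same free parameter as in the formulas, and the logistic form is evaluated through the identity $\frac{1}{e^{z}+1} = \frac{1-\tanh(z/2)}{2}$ so that large values of $kx$ do not overflow.

```python
import math

def logistic_decay(x, k):
    """1 / (e^(k*x) + 1), computed via tanh to avoid overflow in exp."""
    return 0.5 * (1.0 - math.tanh(0.5 * k * x))

def erf_curve(x, k):
    """erf(k*x): the error function with a horizontal scale factor k."""
    return math.erf(k * x)

def arctan_curve(x, k):
    """arctan(k*x) + 1: the inverse tangent shifted up by one."""
    return math.atan(k * x) + 1.0
```

At `x = 0` these evaluate to 0.5, 0 and 1 respectively, which is a quick way to check each curve against the first data point.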
None of these fit the data particularly well, so I am hoping that someone here knows of a function that fits this data set better. Thanks in advance.