I'm working on an algorithm (written in Python/Cython, but it reads like pseudo-code) that estimates the gradient at each point of noisy data, using a variable window size. It works very well, but its runtime appears to be limited by the regression step. Here is what I use:
cdef double regression(np.ndarray[DTYPE_t, ndim=1] data, np.ndarray[DTYPE_t, ndim=1] time, unsigned int leftlim2, unsigned int rightlim2):
    # declare local variables
    cdef unsigned int length, j
    cdef double x, y, sumx, sumy, xy, xx, result, a, b, invlen
    # reset the accumulators
    sumx = 0
    sumy = 0
    xy = 0
    xx = 0
    # accumulate the sums from the left limit to the right limit
    for j in range(leftlim2, rightlim2):
        x = time[j]
        y = data[j]
        sumx += x   # the sum of x
        sumy += y   # the sum of y
        xy += x*y   # the sum of x*y
        xx += x*x   # the sum of x^2
    # estimate the best-fit slope
    length = rightlim2 - leftlim2
    invlen = 1.0/length
    a = xy - (sumx*sumy)*invlen
    b = xx - (sumx*sumx)*invlen
    result = a/b
    return result
Inputs:
- vectors/arrays of the data and time measured during an experiment. The data array contains noisy measurements (for example, applied force); the time array contains equally spaced time stamps (0.1 s, 0.2 s, 0.3 s, etc.)
- the left and right limits of how much data has to be included for the regression, provided as indices (i.e. the data used for regression is given by data[leftlim2:rightlim2])
Output: the slope of a straight line (the a in y = a*x + b) approximating the dataset.
I'm only interested in the slope, not the intercept, hence the loop rather than a regression based on matrix-vector multiplications, which is numerically inconvenient (i.e. slow). I was wondering if anyone knows a way to make the regression more efficient without sacrificing accuracy. Perhaps there is a way to exploit the equal spacing of the time array? For example, the sum of x can be written as 0.5*dx*(l^2 - l) + x[0]*l, where dx is the time step and l = rightlim2 - leftlim2. Is there a way to represent the sum of x*y or the sum of x*x in a similar fashion? (I have tried and failed :( )
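The closed form above for the sum of x can be checked numerically in plain Python (the time step and limits here are illustrative):

```python
import numpy as np

dx = 0.1
time = np.arange(1, 101) * dx          # equally spaced time stamps

leftlim2, rightlim2 = 7, 42
l = rightlim2 - leftlim2
x = time[leftlim2:rightlim2]

# direct sum versus the closed form 0.5*dx*(l^2 - l) + x[0]*l
direct = x.sum()
closed = 0.5 * dx * (l**2 - l) + x[0] * l
print(np.isclose(direct, closed))  # True
```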
You want to fit a model Y = a + b X using linear regression, and you are only concerned with the value of the slope "b".
From the normal equations, you have b = (S1 * S2 - N * S3) / (S2 * S2 - N * S4) where N is the number of data points, S1 the sum of the Y(i), S2 the sum of the X(i), S3 the sum of the X(i) * Y(i) and S4 the sum of the X(i) * X(i).
Now, you want to take advantage that the X(i) are equally spaced; this means that
X(i) = X(1) + (i - 1) dX
where X(1) is the first value and dX the step size. This allows you to directly compute the sums S2 and S4 using the fact that the X(i) are in arithmetic progression: with i running from 1 to N, the sum of (i - 1) is N * (N - 1) / 2 and the sum of (i - 1)^2 is N * (N - 1) * (2 * N - 1) / 6. This leads to
S2 = N * X(1) + N * (N - 1) * dX / 2
S4 = N * X(1) * X(1) + N * (N - 1) * X(1) * dX + N * (N - 1) * (2 * N - 1) * dX * dX / 6
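A quick numerical check of the arithmetic-progression sums, using the identities sum(i-1) = N(N-1)/2 and sum((i-1)^2) = N(N-1)(2N-1)/6 for i = 1..N (the values of X(1), dX, and N are illustrative):

```python
import numpy as np

X1, dX, N = 0.1, 0.1, 50           # illustrative first value, step, and count
X = X1 + dX * np.arange(N)         # X(i) = X(1) + (i-1)*dX for i = 1..N

S2_direct = X.sum()
S4_direct = (X * X).sum()

S2_closed = N * X1 + N * (N - 1) * dX / 2
S4_closed = (N * X1 * X1 + N * (N - 1) * X1 * dX
             + N * (N - 1) * (2 * N - 1) * dX * dX / 6)

print(np.isclose(S2_direct, S2_closed))  # True
print(np.isclose(S4_direct, S4_closed))  # True
```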
If you do not know the number of data points but only the first value X_first, the last value X_last and the step size dX, you have
N = 1 + (X_last - X_first) / dX
and X(1) = X_first.
Similarly, it could be possible to expand S3 and write
S3 = X(1) * S1 + dX * Sum[(i - 1) Y(i),{i,1,N}]
but I do not think this would yield any saving in computation time.
By the way, you can also get the intercept for free, since a = (S1 - b * S2) / N.
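Putting it together, here is a plain-Python sketch (the data and limits are illustrative) that accumulates only S1 and S3 in the loop, takes S2 and S4 from the closed forms, and checks the resulting slope against numpy.polyfit:

```python
import numpy as np

rng = np.random.default_rng(0)
dX = 0.1
time = dX * np.arange(1, 201)                              # equally spaced
data = 2.5 * time + 1.0 + rng.normal(0, 0.1, time.size)    # noisy line, true slope 2.5

def slope_closed(data, time, left, right):
    N = right - left
    X1 = time[left]
    dX = time[left + 1] - time[left]
    # only two running sums are needed inside the loop
    S1 = 0.0
    S3 = 0.0
    for i in range(left, right):
        S1 += data[i]
        S3 += time[i] * data[i]
    # S2 and S4 come from the arithmetic-progression closed forms
    S2 = N * X1 + N * (N - 1) * dX / 2
    S4 = (N * X1 * X1 + N * (N - 1) * X1 * dX
          + N * (N - 1) * (2 * N - 1) * dX * dX / 6)
    return (S1 * S2 - N * S3) / (S2 * S2 - N * S4)

b = slope_closed(data, time, 20, 120)
b_ref = np.polyfit(time[20:120], data[20:120], 1)[0]
print(np.isclose(b, b_ref))  # True
```

The same replacement carries over to the Cython function: drop the `sumx` and `xx` accumulators from the loop and substitute the closed forms for them after it.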