Calculating average value of points over a period of time


I have several values plotted over a period of time on a line graph. The length of time between values is not uniform (i.e. one value could be 3 days after the one before, which itself could be 2 weeks after that one).

Visualised as a line graph, it would look like this:

[Figure: values plotted over time as a line graph]

How would you calculate the average value over the entire period of time, taking into account the increases and decreases between points and the length of time between them? Is this possible? (I may be missing an obvious solution...)


Best answer:

It depends on what the values represent.

$\underline{\text{Example 1}}$

If the value is "the number of people in my neighborhood", then the best approach is to integrate the above plot and divide by the total time.

We know that people arrive and leave somewhat unpredictably, and that sometimes it's one person and sometimes a family of several. If this neighborhood isn't a college town, there generally won't be a seasonal pattern.

For example, say we want to estimate how many people there were on July 25. We know there were 75 on June 20, and 55 on August 29, which is a decrease of 20 people in 70 days. The best estimate we can make is to assume that one person left every 3.5 days, so on July 25, there would have been 65 people. This is purely a guess, but it is the best estimate we have available. Knowing how many people were present in April or October won't improve this estimate.

Thus, the linear plot represents our best guess for the number of people present each day. So the average is the area under the curve divided by the time.
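As a minimal sketch of this approach: treat the plot as a piecewise-linear function of time, integrate it with the trapezoidal rule, and divide by the elapsed time. The sample days and counts below are invented for illustration.

```python
import numpy as np

# Days since the first observation (non-uniform spacing) and the values
# recorded on those days (made-up numbers for illustration).
t = np.array([0.0, 3.0, 17.0, 70.0])
v = np.array([75.0, 72.0, 68.0, 55.0])

# Area under the piecewise-linear curve (trapezoidal rule on non-uniform t),
# divided by the total elapsed time.
dt = np.diff(t)
area = np.sum(0.5 * (v[1:] + v[:-1]) * dt)
time_weighted_avg = area / (t[-1] - t[0])

# Compare with the naive unweighted mean, which ignores the spacing.
naive_avg = v.mean()

print(time_weighted_avg, naive_avg)
```

Because the last two points are far apart in time, the time-weighted average is pulled noticeably below the naive mean.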

$\underline{\text{Example 2}}$

If the value is "the number of people who died from X in NY" (where X is an instantaneous, non-contagious, non-seasonal effect like "stroke"), then the numbers are completely independent. Knowing how many died on October 17 and on October 19 tells us absolutely nothing about how many died on October 18. In this case, the best estimate we can make for the average is to sum the values for the days we have data on, then divide by the number of data points.
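A sketch of this case: when the observations are independent, the gaps between them carry no information, so the plain arithmetic mean over the recorded days is the appropriate average. The counts are invented for illustration.

```python
# Counts on the days we happen to have data for (illustrative numbers).
deaths = [4, 7, 2, 5, 6]

# Independent samples: average by summing and dividing by the number of
# data points, ignoring how far apart in time they are.
average = sum(deaths) / len(deaths)
print(average)  # 4.8
```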

$\underline{\text{Example 3}}$

Other effects like temperature and amount of rainfall can be seasonal, so you would expect perhaps a sinusoidal variation about the average. In that case, fitting to a curve would seem the best approach.
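One way to sketch this: fit a constant plus a single sinusoid of known period (here one year) to irregularly sampled data by least squares; the fitted constant is then the average with the seasonal cycle accounted for. The sample times and temperatures below are synthetic.

```python
import numpy as np

# Synthetic irregularly-sampled "temperature" data: a true average of 15
# plus an annual sinusoid and a little noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 365.0, size=40))      # irregular sample days
temps = 15.0 + 8.0 * np.sin(2 * np.pi * t / 365.0) \
        + rng.normal(0.0, 0.5, t.size)

# Design matrix: constant term, plus cosine and sine with a one-year period.
A = np.column_stack([np.ones_like(t),
                     np.cos(2 * np.pi * t / 365.0),
                     np.sin(2 * np.pi * t / 365.0)])
coeffs, *_ = np.linalg.lstsq(A, temps, rcond=None)

print(coeffs[0])  # the constant term: close to the underlying average
```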

$\underline{\text{Caveat}}$

These estimates suffer from possible sampling bias. In the first example, extra weight is given to values that are far apart in time: nearly half the plot is connected to the single June datapoint. Moving that one point up by 4 would raise the average by 1, i.e. it carries a weight of $1/4$ - roughly three times the $1/12$ it would carry if all twelve points counted equally.

In the third example, a single hurricane in the second half of October could significantly affect a full $1/3$ of the data sample ($4$ out of $12$ data points). Thus, a single weather phenomenon could skew the results.

So, to reiterate: the best approach to calculating an average depends highly on what the values represent.

Second answer:

You could project the series onto a basis of functions of time, and then use the resulting coefficients to compute the mean of the continuous-time function.

That is, approximate your data, $d_j$, with $f(t) = \sum_{k} c_k\phi_k(t)$, where the $\phi_k(t)$ are the basis functions you choose for the approximation.

  • This is particularly easy if your basis is $$ \Phi = 2\left[\frac{1}{2},\cos(2\pi t_j), \cos(2\pi 2 t_j), \cdots,\cos(2\pi N t_j), \sin(2\pi t_j), \sin(2\pi 2 t_j), \cdots,\sin(2\pi N t_j)\right]$$

    where $t_j = (T_j-T_{init})/(T_{final} - T_{init})$.

Your number of samples must be greater than or equal to $2N+1$ and $T_j$ is the time stamp of your $j^{th}$ sample.

Then solve the least squares problem $\Phi c=d$ - here $d$ is the vector containing your sample data.

In this case the coefficient, $c_0$, corresponding to the constant term, will be your average over the entire time interval - regardless of the non-uniform sampling.
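A minimal sketch of this answer's recipe: build the design matrix $\Phi$ on the normalized times $t_j$, solve the least-squares problem $\Phi c = d$, and read off $c_0$ as the average over the interval. The time stamps and samples are invented, and $N$ is chosen so that the number of samples is at least $2N+1$.

```python
import numpy as np

# Invented non-uniformly spaced time stamps T_j and samples d_j.
T = np.array([0.0, 3.0, 10.0, 24.0, 31.0, 45.0, 52.0, 70.0])
d = np.array([75.0, 72.0, 70.0, 66.0, 64.0, 60.0, 58.0, 55.0])
N = 2  # 2N + 1 = 5 coefficients <= 8 samples

# Normalize times to [0, 1]: t_j = (T_j - T_init) / (T_final - T_init).
tj = (T - T[0]) / (T[-1] - T[0])

# Build Phi = 2 * [1/2, cos(2 pi k t_j), sin(2 pi k t_j)] column by column.
cols = [np.full_like(tj, 1.0)]           # constant column: 2 * (1/2) = 1
for k in range(1, N + 1):
    cols.append(2 * np.cos(2 * np.pi * k * tj))
    cols.append(2 * np.sin(2 * np.pi * k * tj))
Phi = np.column_stack(cols)

# Solve the least-squares problem Phi c = d.
c, *_ = np.linalg.lstsq(Phi, d, rcond=None)

# The cos/sin columns integrate to zero over [0, 1], so c_0 is the average
# of the fitted function over the whole interval.
print(c[0])
```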

You can also use the coefficients to re-sample, i.e. estimate the data at points where you did not have a sample.

In the periodic-basis case, if $d_{init} \ne d_{final}$ then you may want to choose a $T_{final}$ a little beyond the last sample, to reduce the Gibbs phenomenon at the ends of the interval.

The same thing can be done with a polynomial basis in time; you then just compute the integrals of your basis functions (easy for polynomials) over the entire interval.
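A sketch of the polynomial variant: fit a low-degree polynomial to the irregular samples by least squares, then average it analytically by integrating its antiderivative over the interval. Data are the same invented samples as above.

```python
import numpy as np

# Invented non-uniformly spaced time stamps and samples.
T = np.array([0.0, 3.0, 10.0, 24.0, 31.0, 45.0, 52.0, 70.0])
d = np.array([75.0, 72.0, 70.0, 66.0, 64.0, 60.0, 58.0, 55.0])

# Least-squares quadratic fit in time.
p = np.polyfit(T, d, deg=2)

# Integrate the fitted polynomial analytically and divide by the interval
# length to get the average over [T_init, T_final].
P = np.polyint(p)
avg = (np.polyval(P, T[-1]) - np.polyval(P, T[0])) / (T[-1] - T[0])
print(avg)
```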