Extremely poor polyfit, what am I doing wrong?

5.2k Views Asked by At

I have a dataset with me: http://pastebin.com/YZArky1j, which I am trying to polyfit. This is what I used to perform the polyfit:

x = [i[0] for i in data] # Get all x co-ordinates
y = [i[1] for i in data] # Get all y co-ordinates

coeff = polyfit(x, y, 5)

I am using the polyfit function from numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html

The coefficients I get are: $$ -2.84904034523 \times 10^{15} \\ 2.46136213423 \times 10^{11} \\ 8.66716249115 \times 10^{6} \\ -5.3534845897 \times 10^{2} \\ -0.577304577155 \times 10^{-1} \\ 2.54464720615 \times 10^{-6} \\ $$

The plot of the polynomial with the plot of data looks like:

Polyfit plots

Here, red is the polynomial function and the blue is a plot of the data. So far, it is looking really good. This data is actually the prize pool of a gaming tournament v/s the day from when it started. So, when I try to project as to how it would look when the tournament actually starts (and there will be no more increase to the prize pool), I get a ridiculous number: $498,527,857.5$. This absolutely cannot be the case since the pool right now is around 6 million, there is no way that in two months it will reach 400 million+ in the remaining 2 months, especially as a look at the graph tells us that the rate of rate of growth is decreasing.

So, maybe I should try to fit it in the form $y = B + log(x)*A$. So, I tried to find the values of A and B:

log_x = [math.log(i) for i in x]
coeff_log = polyfit(log_x, y, 1)

And the coefficients I get are:

$$ 5.28101367 \times 10^{9} \\ -5.11847659 \times 10^{10} $$

And now, the projected prize pool looks something like: $24,987,469$. This also is highly suspect and I don't think is the correct projection.

What am I missing or what is it that I am doing wrong?

1

There are 1 best solutions below

2
On BEST ANSWER

Since I need to post a picture, I will expand some of my comments and turn them into an answer.

About the part what's wrong in your approach, as mentioned in the comments, they are:

  • Once one move away from the range of data, the quality of a polyfit based extrapolation can deteriorate pretty quickly. Don't use it.

  • If you make a plot of $\log x$ vs $x$ over $[16200,16212]$, you will notice the graph is pretty straight there. Your result is no different from doing a linear fit.

It is understandable why you want to use $\log x$ for fitting the data. In fact, it is a good choice to capture the concavity of the provided data. However, to use a logarithm fit, one need to offset $x$ to the point where the data supposed to kick start. i.e one should try a fit of the form:

$$y = y(x) = A \log(x - \alpha) + B\quad\text{ where }\quad \alpha \sim 16200.$$

The next issues is how to fix/determine $\alpha$, $A$ and $B$. Instead of a full scale non-linear least square fit, one can

  1. treat $\alpha$ as a parameter, perform a linear least square fit to obtain $A$ and $B$.
  2. compute the mean standard error for this $\alpha$ once $A$ and $B$ are known.
  3. determine $\alpha$ by minimizing the mean squared error.

I'm not going to derive anything. I'll just write down some formula so that you can try the way I perform the fit.

  • Let $N$ be the number of samples
  • Let $x_i$ be the date number for the $i^{th}$ sample.
    (in unit date/time format, date 0 = 1970/01/01 mid-night)
  • Let $y_i$ be the prize pool in unit of millions.

For any two vectors $X_i$, $Y_i$ of size $N$, define $$\begin{cases} \langle X \rangle &= \frac{1}{N}\sum_{i=1}^{N} X_i\\ \text{Corr}( X, Y ) &= \langle XY \rangle - \langle X\rangle\langle Y\rangle\\ \text{Var}(X) &= \langle X^2 \rangle - \langle X \rangle^2 \end{cases} $$ The $A$ and $B$ which minimize the MSE (mean square error):

$$\text{MSE}(X,Y) = \langle ( Y - A X - B )^2 \rangle = \frac{1}{N}\sum_{i=1}^N (Y_i - A X_i - B)^2$$

is given by $\displaystyle\;\begin{cases} A_{opt} &= \frac{\text{Corr}(X,Y)}{\text{Var}(X)}\\ B_{opt} &= \langle Y \rangle - A \langle X \rangle \end{cases}$. The corresponding optimal MSE also has a simple formula

$$\text{MSE}_{opt}(X,Y) = \text{Var}(Y) - \frac{\text{Corr}(X,Y)^2}{\text{Var}(X)}$$

To fix $\alpha$ for this problem, one just need to find the $\alpha$ which minimize $\;\text{MSE}_{opt}(\log(x-a),y)$.

It turns out I tell you the wrong $\alpha$ in comment. The optimal $\alpha$ should be around $16198.\color{red}{5}9255$.
Following is a plot of the prize pool vs. date.

Two optimal fit

  • $y$-axis - the prize pool in unit of millions.
  • $x$-axis - date number in unix universe (i.e time start at 1970/1/1 mid-night).
  • the $\color{blue}{\text{blue}}$ curve is dataset from user.
  • The $\color{red}{\text{red}}$ curve is the optimal fit using $\alpha = 16198.59255$: $$y = y_{opt}(x) = 2.067125632834611 \log(x-16198.59255) + 0.90491534594457$$
  • The $\color{green}{\text{green}}$ curve is the fit using $\alpha = 16200$: $$y = y_{est}(x) = 1.417797964812813 \log(x-16200) + 2.558265172338731$$

On 2014/07/19 (unix date 16268), the two fit above gives following estimates of the prize pool (in unit of millions): $$y_{opt}(16268) \sim 9.66951606144537 \quad\text{ and }\quad y_{est}(16268) = 8.540674609249399 $$