What does ln() accomplish on a regression input?


I have gotten interested in forecasting using linear/nonlinear regression, particularly using Facebook's Prophet library for R/Python. It makes forecasting on a time-series input pretty straightforward.

However, one thing I don't fully understand in the "Quick Start" tutorial is why a natural logarithm is applied to the inputted values before giving it to the model, like so:

import numpy as np
from prophet import Prophet

# np.log() uses base e (the natural logarithm)
df['y'] = np.log(df['y'])

# fit the model on the log-transformed series
m = Prophet()
m.fit(df)

I somewhat remember logarithms from my high school/college math days, but none of my teachers ever made it clear why Euler's number was useful, much less why it's used as the base of a logarithm.

This led me down a rabbit hole to explore natural logarithms, because I am starting to see them everywhere. Interestingly, Prophet will accept an input regardless of whether ln() was applied, and it will produce somewhat similar curves. I'm guessing the forecasted output has to be exponentiated via $e^x$ to be meaningful as well.
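As the question guesses, $e^x$ (i.e. `np.exp`) is the exact inverse of `np.log`, so a forecast made on the log scale is mapped back to the original scale by exponentiating it. A minimal round-trip sketch (plain numpy, no Prophet required):

```python
import numpy as np

# Original series values (sample data, not from Prophet)
y = np.array([29.0, 3.0, 100.0, 10.0])

log_y = np.log(y)       # transform applied before fitting
y_back = np.exp(log_y)  # exp() undoes the log on the (forecasted) values
```

In a real Prophet workflow, the same `np.exp` would be applied to the `yhat` column of the forecast DataFrame to get predictions back on the original scale.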

What I want to know is what do natural logarithms accomplish in the context of linear regression inputs? For instance, here are two simple Excel charts plotting a series with y and ln(y). The charts look kind of similar, but what effect did ln(y) have?

And why does Prophet choose to take inputs and produce outputs that apparently need to be exponentiated via $e^x$ to be meaningful?

x   y       ln(y)
1   29      3.36729583
2   3       1.098612289
3   100     4.605170186
4   3       1.098612289
5   10      2.302585093
6   11      2.397895273
7   9       2.197224577
8   49      3.891820298
9   97      4.574710979
10  33      3.496507561
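The ln(y) column in the table above is just `np.log` applied elementwise to y, which can be verified directly:

```python
import numpy as np

# y values from the table
y = np.array([29, 3, 100, 3, 10, 11, 9, 49, 97, 33], dtype=float)

# Natural log (base e), matching the ln(y) column
ln_y = np.log(y)
```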

[chart: f(x)]

[chart: ln(f(x))]


There are 2 answers below.

Accepted answer:

"Linear regression" is a technique for finding the straight line that best fits a given set of data points $(x,y)$. It's the right technique to use if the data points actually lie near some line, which is likely to be the case if there is some underlying reason to expect linearity. Regression finds the line that fits best. For a time series, $x$ will be time and $y$ the value you measure and then want to predict using the line.

But suppose your data are about the population of some biological system as time goes on. Then you would not expect those points to lie on or near a straight line; you'd expect some kind of exponential growth, expressed as $p = c e^{rt}$, where $p$ is population, $t$ is time, and $c$ and $r$ are constants. So your data points will lie near a curve like that, and you'd like to know the best values for the constants. Fortunately, linear regression comes to the rescue if you are clever. If you take the (natural) logarithms of the values of $p$, then since $\ln p = \ln c + rt$, the resulting data points will lie near a line, which you can find with linear regression. The slope of that line is $r$ and its intercept is $\ln c$, so from them you can get the constants you want for the exponential best fit.
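The fitting trick described above can be sketched in a few lines of numpy. The constants `c_true` and `r_true` and the noise-free data are made up for illustration; `np.polyfit` with degree 1 is the linear regression:

```python
import numpy as np

# Simulated exponential-growth data: p = c * e^(r t), constants chosen for the demo
t = np.arange(1, 21, dtype=float)
c_true, r_true = 5.0, 0.3
p = c_true * np.exp(r_true * t)

# Linear regression on (t, ln p): slope estimates r, intercept estimates ln c
r_hat, ln_c_hat = np.polyfit(t, np.log(p), 1)
c_hat = np.exp(ln_c_hat)
```

With noise-free data the recovered constants match the true ones; with noisy data they would be least-squares estimates instead.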

The authors of Prophet anticipated that some folks would want to use regression this way, so they show you how to input the logarithms of the measured values.

For the spiky data in your question neither linear regression nor finding the exponential best fit will be much use for prediction.

Second answer:
  1. $\ln( \cdot)$ is a concave transformation, hence if the distribution of your $y$s is right-skewed, it will "flatten" it and thus make it more suitable for linear regression with classic assumptions.

  2. Models of a kind $\log (y_i) = \beta_0 + \beta_1x_i$ are sometimes used when we are interested in a relative (percentage) change in $y$ as a a result of small (infinitesimal) changes in $x$, namely $$ \frac{\partial}{\partial x} (\ln y) = \frac{1}{y}\frac{\partial y}{\partial x} = \beta_1 $$ or $$ \frac{\partial y}{y} = \beta_1\partial x, $$ i.e., $\beta_1$ is the percentage change (estimated by $\hat{\beta_{1}}\times100\% $) of $y$ when $x$ changes by one unit ($\partial x \approx \Delta x = 1$).