Say I have a large set of data. Each data point corresponds to a particular day in the year, so for 1 year I will have 365 points. Say I have collected this sort of data for 5 years.
Now, I want to use this data to predict future values for any particular day. Because I want to have a good model equation, I perform a regression on 1 year's worth of data, 2 years' worth of data, ..., 5 years' worth of data. So essentially I have 5 different equations that I am considering using for forecasting other data points.
Would the assumption that the equation fitted to 5 years' worth of data is the most accurate for forecasting be valid or invalid? My intuition tells me that when forecasting, more data points don't necessarily imply a better fit.
I would like some insight into this, and I know this is the best place to ask such a question. Thanks!
While more data is usually preferable, it all depends on the model and how well it is suited to your data.
Let's say we had exact weather data since before the last ice age. How would modern weather prediction software do? Very poorly, since the climate was so very different back then.
On the other hand, mathematical models that track climate change would need data sets spanning huge periods of time like this to be of use. (Not necessarily the same type of data, but you hopefully get my point.)
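Rather than assuming one way or the other, you can test it empirically: hold out a recent period and compare forecasts from models trained on 1, 2, ..., 5 years of history. Below is a minimal sketch of that idea (all names, the synthetic data, and the simple sinusoidal seasonal model are my assumptions, not anything from your actual setup). The synthetic series deliberately includes a slow drift between years, so that older data describes a slightly different regime, which is exactly the situation where more history can hurt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily data (assumed): yearly cycle + slow level drift + noise.
# The drift plays the role of "the climate was different back then".
days_all = np.arange(6 * 365)
y = (10 * np.sin(2 * np.pi * days_all / 365)
     + 0.5 * days_all / 365
     + rng.normal(0, 1, days_all.size))

def design(days):
    # Simple seasonal regression: intercept + one annual harmonic.
    t = 2 * np.pi * days / 365
    return np.column_stack([np.ones_like(t, dtype=float), np.sin(t), np.cos(t)])

# Hold out the 6th year; train only on the first 5.
test_days = np.arange(5 * 365, 6 * 365)
X_test, y_test = design(test_days), y[5 * 365:]

maes = []
for k in range(1, 6):  # train on the most recent k years
    train_days = np.arange((5 - k) * 365, 5 * 365)
    beta, *_ = np.linalg.lstsq(design(train_days), y[train_days], rcond=None)
    mae = np.abs(X_test @ beta - y_test).mean()
    maes.append(mae)
    print(f"{k} year(s) of training data: held-out MAE = {mae:.3f}")
```

Because the drift makes older years systematically lower, the longest window is not guaranteed to give the smallest held-out error here. With your real data, the same comparison tells you directly which window length forecasts best, which is a more reliable guide than the "more data is better" assumption.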