Testing using training data


I've been trying to prove that estimating a classifier's performance on its training data is a bad thing. Specifically, does "bad" mean the estimate is biased? This is part of a larger proof.

If somebody knows of previous work that proves this or a quick proof, any pointers would be much appreciated!

Thanks in advance, Yakka

There are 2 best solutions below

BEST ANSWER

This issue goes beyond any particular model, like a classifier, to statistical models in general. When you fit a model to training data, you are optimizing its fit or performance relative to that training set. Now, if you took that model and applied it to different data, then unless the new data looks exactly like your training set, the performance will usually be worse.

You can see this even with just the training data by running your classifier on a bootstrap sample of your training set. You'll see that your training-set performance was biased high.
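The bootstrap check described above can be sketched in a few lines. Everything here is an illustrative assumption, not from the answer: a simple nearest-centroid classifier, pure-noise labels (so any apparent skill is overfitting), and the convention of refitting on each resample and comparing resample accuracy with accuracy on the full training set.

```python
# Sketch: estimate the optimism of in-sample accuracy via the bootstrap.
# Classifier (nearest centroid) and pure-noise data are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)

def fit_centroids(X, y):
    """Nearest-centroid 'fit': the per-class mean of the features."""
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    classes = np.array(sorted(centroids))
    d = np.stack([((X - centroids[c]) ** 2).sum(axis=1) for c in classes])
    return classes[d.argmin(axis=0)]

# Labels independent of features: the true accuracy of any classifier is 50%.
n, p = 40, 20
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

optimism = []
for _ in range(300):
    idx = rng.integers(0, n, size=n)              # bootstrap resample
    model = fit_centroids(X[idx], y[idx])
    apparent = (predict(model, X[idx]) == y[idx]).mean()   # score on resample
    on_full = (predict(model, X) == y).mean()              # score on all data
    optimism.append(apparent - on_full)

print(np.mean(optimism))   # positive: resample ("training") accuracy is biased high
```

On average the resample accuracy exceeds the accuracy on the full training set, which is exactly the high bias the answer describes.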

SECOND ANSWER

Estimating a classifier's performance using training data is "bad" because the same data was used to fit the model, i.e. to optimize that very performance metric in the first place.

This means that this performance metric may be very different from the metric evaluated under the true distribution $\mu_X$ of the data source.
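To make that gap concrete, here is a minimal sketch under illustrative assumptions (a 1-nearest-neighbour classifier and pure-noise labels, neither of which is from the answer): the metric under the true distribution is 50% accuracy, yet the training-data estimate is perfect.

```python
# Sketch: in-sample accuracy vs. accuracy under the true distribution.
# With labels independent of features, the true accuracy is 50%,
# but 1-NN scores 100% on its own training data.
import numpy as np

rng = np.random.default_rng(0)

def knn1_predict(X_train, y_train, X):
    """1-NN: predict the label of the closest training point."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

n, p = 200, 5
X_train = rng.normal(size=(n, p))
y_train = rng.integers(0, 2, size=n)       # labels are pure noise
X_test = rng.normal(size=(n, p))           # fresh draw from the same source
y_test = rng.integers(0, 2, size=n)

train_acc = (knn1_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (knn1_predict(X_train, y_train, X_test) == y_test).mean()
print(train_acc)   # 1.0: each point is its own nearest neighbour
print(test_acc)    # near 0.5: the true accuracy on noise
```

The fresh test draw stands in for $\mu_X$; the training estimate of 100% says nothing about it.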

Using held-out test data or cross-validation is a technique to approximate this true metric under $\mu_X$.
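A minimal sketch of K-fold cross-validation as such an approximation, again under the illustrative 1-NN-on-noise assumptions rather than anything from the answer: each point is scored by a model that never saw it, so the averaged score tracks performance under $\mu_X$ instead of the optimistic in-sample figure.

```python
# Sketch: K-fold cross-validation approximates the metric under the
# true distribution by always scoring held-out points.
# The 1-NN classifier and noise data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

def knn1_predict(X_train, y_train, X):
    """1-NN: predict the label of the closest training point."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

n, p, k = 200, 5, 5
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)             # labels are pure noise

folds = np.array_split(rng.permutation(n), k)
cv_scores = []
for fold in folds:
    mask = np.ones(n, dtype=bool)
    mask[fold] = False                     # hold this fold out of fitting
    pred = knn1_predict(X[mask], y[mask], X[fold])
    cv_scores.append((pred == y[fold]).mean())

train_acc = (knn1_predict(X, y, X) == y).mean()  # in-sample: 1.0
cv_acc = np.mean(cv_scores)                      # near the true 0.5
print(train_acc, cv_acc)
```

The cross-validated score lands near the true 50%, while the in-sample score is a useless 100%.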

Research pointers:

  • Statistical learning theory answers the following question: when is the true metric of the fitted model close to the true metric of the best model possible (Bayes optimal, or best in class)? Obviously this skips the question of the "badness" of using an in-sample performance metric.
  • In-sample vs. out-of-sample performance is discussed on page 9 of this. Also see cross-validation.
  • This paper by Hansen essentially shows that, for a variety of modeling problems, in-sample and out-of-sample performance are negatively correlated (loosely speaking; check the abstract for definitions).