In the estimation of a parameter, say the mean of a population, the definition of "bias" is very clear: it is the difference between the expected value of the estimator (averaged over random samples) and the true value of the parameter.
In machine learning models the same term "bias" (as in the bias-variance tradeoff) is used, but I have seen many different definitions of it. Is there a standard one?
I have seen it defined as:
- The bias at a specific point $x$: the average difference (over samples) between $\hat f(x)$ and $f(x)$.
- As before, the average difference, but averaged over all data points $x$ rather than evaluated at a specific point.
- The difference between the best $\hat f$ in the hypothesis class and $f$.
- The difference between $\hat f$ and $f$ as the number of records in the sample goes to infinity.
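To make the difference between the first two definitions concrete, here is a small simulation of my own (the quadratic $f$, the noise level, and the deliberately underfitting linear estimator are all assumptions chosen for illustration):

```python
import numpy as np

# Assumed setup: true f is quadratic, the estimator is a straight-line
# least-squares fit, which underfits and is therefore biased.
rng = np.random.default_rng(0)
f = lambda x: x ** 2
sigma = 0.1                          # noise standard deviation
x_grid = np.linspace(0, 1, 51)       # evaluation points; index 25 is x = 0.5
n_sims = 2000

preds = np.empty((n_sims, x_grid.size))
for s in range(n_sims):
    x = rng.uniform(0, 1, 30)                  # fresh training sample
    y = f(x) + rng.normal(0, sigma, x.size)
    a, b = np.polyfit(x, y, deg=1)             # linear fit: a*x + b
    preds[s] = a * x_grid + b

# Definition 1: bias at a specific point, here x = 0.5
bias_at_half = preds[:, 25].mean() - f(0.5)
# Definition 2: the same difference, averaged over all points x
avg_bias = (preds.mean(axis=0) - f(x_grid)).mean()
print(bias_at_half, avg_bias)
```

In this toy example the pointwise bias at $x = 0.5$ is clearly positive, while the average bias over the grid is close to zero, so the first two definitions can give very different answers for the same estimator.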
I would also like to know whether we should use different definitions for regression and classification. In regression it makes sense to talk about a systematic error, because we can either underestimate or overestimate the value; in classification, every error seems systematic, because there is only one kind of error that can be made.
A final question is how all of this applies to classification with KNN. Conventional wisdom holds that a higher $k$ gives a higher bias, but applying the definitions above, I am not convinced this should always be true.
As you showed yourself, there isn't. You'll have to figure out what the author means exactly by "bias". To add to your list: the most common thing, at least in my experience, that people mean when they talk about model bias is actually the *inductive bias*, i.e. the set of assumptions a learning algorithm uses to generalize beyond its training data. This includes basically any architectural design choice you make when constructing a learning algorithm.
Now, to address your other two points:
I do not agree with the assertion that, in classification, systematic error is the only kind of error that can be made. Mislabelling, a type of random error, is present in almost any larger-scale classification dataset.
Regarding KNN: at least for KNN regression there is a simple closed form for the bias-variance decomposition (conditional on a fixed query point $x$ and fixed training inputs):
$$ \operatorname {E} [(y-{\hat {f}}(x))^{2}\mid X=x]=\underbrace{\left(f(x)-{\frac {1}{k}}\sum _{i=1}^{k}f(N_{i}(x))\right)^{2}}_{\texttt{BIAS}^2}+\underbrace{\frac {\sigma ^{2}}{k}}_{\texttt{VAR}}+\underbrace{\sigma^{2}}_{\texttt{Bayes err.}} $$
Here the squared bias typically grows with $k$, since a larger neighbourhood averages $f$ over points farther from $x$ where $f$ may differ more, while the variance $\sigma^2/k$ is monotonically decreasing in $k$.
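To see the two terms behave as claimed, one can evaluate them directly from the formula on a toy 1-D design (my own sketch; the sinusoidal $f$, the noise level $\sigma$, and the training grid are assumptions chosen for illustration):

```python
import numpy as np

# Sketch: evaluate the bias² and variance terms of the KNN-regression
# decomposition at a fixed query point x0, for growing k.
f = lambda x: np.sin(2 * np.pi * x)    # assumed true function
sigma = 0.2                            # assumed noise standard deviation
x_train = np.linspace(0, 1, 101)       # fixed design points
x0 = 0.25                              # query point (itself a design point)

order = np.argsort(np.abs(x_train - x0))   # training points by distance to x0
bias_sqs, variances = [], []
for k in (1, 5, 25, 75):
    nbrs = x_train[order[:k]]                        # k nearest neighbours N_i(x0)
    bias_sqs.append((f(x0) - f(nbrs).mean()) ** 2)   # (f(x0) - (1/k) Σ f(N_i(x0)))²
    variances.append(sigma ** 2 / k)                 # σ²/k
print(bias_sqs)    # grows with k on this design
print(variances)   # shrinks with k
```

With $x_0$ sitting at a peak of $f$, widening the neighbourhood pulls the average of $f$ over the neighbours further below $f(x_0)$, so the bias² term grows with $k$ here while $\sigma^2/k$ shrinks; on other designs the bias need not be strictly monotone, which matches the questioner's doubt.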