In my machine learning course we saw the error decomposition, which decomposes the error into an approximation error and an estimation error. We also talked a lot about the bias-variance trade-off. As far as I understand, these are two different things. Can you please clarify the difference between them, and the relation between them?
Also, is there a formal definition (some formula I can calculate, like those we have for the approximation and estimation errors) for bias and variance?
Thanks
These are very different things. One simple way to describe the problem is as follows: we assume there is some data space $X$ and some output (or target, or label) space $Y$. We assume that the real data distribution is defined by some probability distribution $P$ with density $p(x)$, for $x\in X$. There is some true mapping $g:X\rightarrow Y$, which we want to learn to approximate. Next, we choose some hypothesis class $\mathcal{H}$, which is a set (or space) of functions within which we will search for our solution. Let $L:Y\times Y\rightarrow\mathbb{R}^+$ be a loss function, so that the expected generalization error is $$ \mathcal{E}(f) = \mathbb{E}_{x\sim P}[L(f(x),g(x))] = \int_X p(x)L(f(x),g(x)) \,dx $$ Note there are more general formulations where $g$ is not deterministic, which you may have seen.
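As a quick numerical illustration (every concrete choice here, namely $g=\sin$, $P=\mathcal{N}(0,1)$, and squared loss, is an assumption made only for the example), the generalization error $\mathcal{E}(f)$ can be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy instance: X = R, P = N(0, 1), g(x) = sin(x), squared loss.
g = np.sin
f = lambda x: 0.6 * x           # some candidate hypothesis f
L = lambda a, b: (a - b) ** 2   # loss function L : Y x Y -> R+

# Monte Carlo estimate of E(f) = E_{x~P}[L(f(x), g(x))]
x = rng.standard_normal(100_000)
gen_error = L(f(x), g(x)).mean()
```

If we had $f=g$ the estimate would be exactly zero; any other $f$ yields a strictly positive error here.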
The approximation error is commonly considered to be the inherent error due to the choice of model alone, before even considering the problems of fitting that model to data. Usually, $g \notin \mathcal{H}$, so any function $f\in \mathcal{H}$ that we choose is guaranteed to have some error. In other words, the error $$ E_A[\mathcal{H}] = \inf_{f\in \mathcal{H}} \mathcal{E}(f) $$ is inevitable. Even if the training/fitting works perfectly, this error will still be there. Notice that $E_A$ depends only on the hypothesis space we choose.
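For instance, if $\mathcal{H}$ is the class of affine functions and $g=\sin$ (again, an assumed setup purely for illustration), then $g\notin\mathcal{H}$ and $E_A[\mathcal{H}]>0$. We can approximate the infimum with a least-squares fit on a very large noiseless sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: H = affine functions {x -> a*x + b}, g = sin, P = N(0, 1).
x = rng.standard_normal(200_000)
y = np.sin(x)

# The least-squares fit on many samples approximates the L2-optimal f in H,
# so its mean squared error approximates the approximation error E_A[H].
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
E_A = np.mean((A @ coef - y) ** 2)  # estimate of E_A[H] > 0, since sin is not affine
```

No matter how good the fitting procedure, an affine model cannot drive this error to zero.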
Now, given $\mathcal{H}$, we choose some fitting procedure (like stochastic gradient descent), which searches $\mathcal{H}$ for the optimal $f$. Usually, we cannot find it! So there is error due to the model training process. This error is called the estimation error, which we can define via $$ E_S[D,\mathcal{A},\mathcal{H}] = \mathcal{E}(f_{D,\mathcal{A}}) - E_A[\mathcal{H}] $$ where $\mathcal{A}$ is a fitting algorithm, $D\sim P_T$ is a training dataset (consisting of many pairs $(x,y)$, where $x\sim P$ and $y = \xi(g(x))$ with $\xi$ being some "noise" function that corrupts the correct label), and $f_{D,\mathcal{A}}\in\mathcal{H}$ is the model our procedure came up with. Note that the performance of $\mathcal{A}$ can depend on $\mathcal{H}$. In other words, the estimation error is the remaining error, caused by training being suboptimal: it's the difference between the best we could do given $\mathcal{H}$ and what our training algorithm and dataset actually achieve.
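Continuing the toy setup (affine hypotheses, $g=\sin$, least squares as the algorithm $\mathcal{A}$, additive Gaussian label noise; all of these are assumptions for illustration), the estimation error is the gap between the trained model's generalization error and the best-in-class error:

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.sin

# Approximate the best-in-class error E_A[H] with a huge noiseless sample.
x_big = rng.standard_normal(200_000)
A_big = np.column_stack([x_big, np.ones_like(x_big)])
coef_star, *_ = np.linalg.lstsq(A_big, g(x_big), rcond=None)
E_A = np.mean((A_big @ coef_star - g(x_big)) ** 2)

# Small noisy training set D; the fitting algorithm A is least squares.
x_tr = rng.standard_normal(20)
y_tr = g(x_tr) + rng.normal(0, 0.5, size=20)   # y = xi(g(x)) = g(x) + noise
A_tr = np.column_stack([x_tr, np.ones_like(x_tr)])
coef_D, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

# Generalization error of f_{D,A}, estimated on fresh samples from P.
x_te = rng.standard_normal(100_000)
E_f = np.mean((coef_D[0] * x_te + coef_D[1] - g(x_te)) ** 2)

E_S = E_f - E_A  # estimation error: the excess over the best in H
```

With only 20 noisy samples, the fitted coefficients miss the optimal ones, so $E_S > 0$; a larger or cleaner dataset would shrink it.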
These two errors are very generic. The bias-variance decomposition is more specific, in that it is concerned with how the error changes with respect to just the dataset $D\sim P_T$. (The estimation error, on the other hand, depends on the dataset, hypothesis class, and training algorithm). The idea is that we are interested in how the error changes as our dataset varies. High variance means that the model acts wildly different when $D$ changes slightly, whereas high bias means that the model is consistently wrong (in the same way, despite $D$ changing); an ideal model has low bias and low variance.
To simplify things, let's consider the special case of (mean) squared error with additive noise, so $ L(a,b) = (a-b)^2 $, $\xi(y) = y + \eta$ with $\eta\sim\mathcal{N}(0,\sigma^2)$, and $Y = \mathbb{R}$. Then (or see here): $$ \mathbb{E}_{D\sim P_T}[(y - f(x))^2] = \mathfrak{B}[f(x)]^2 + \mathbb{V}[f(x)] + \sigma^2 $$ where the bias and variance, respectively, are given by $$ \mathfrak{B}[f(x)] = \mathbb{E}_{D\sim P_T}[f(x)] - g(x) \;\;\;\;\&\;\;\;\; \mathbb{V}[f(x)] = \mathbb{E}_{D\sim P_T}[f(x)^2] - \mathbb{E}_{D\sim P_T}[f(x)]^2 $$ where $(x,y)\notin D$ is some fixed data pair. In other words, the bias measures "how close is the expected prediction to the true value as the dataset varies", whereas the variance measures "how spread out are the model's predictions around their own mean as $D$ varies". Note that the variance here has no dependence on the true value, yet it contributes to the expected squared error. People often describe deep learners as having high variance but low bias: they tend to overfit to the noise in the data (meaning they are very sensitive to changes in $D$), but are powerful enough that on average they learn the correct value (though a particular model, trained on a specific dataset, may be far from the truth at a given input).
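This decomposition can be checked numerically (the concrete setup, cubic polynomial fits to noisy $\sin$ data, is assumed only for illustration): sample many datasets $D$, fit a model on each, and compare the empirical expected squared error at a fixed $x_0$ against $\mathfrak{B}^2+\mathbb{V}+\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.sin      # true mapping (assumed)
sigma = 0.3     # label noise std (assumed)
x0 = 0.5        # fixed test input

def fit_and_predict(deg):
    # Draw a fresh dataset D ~ P_T, fit a degree-`deg` polynomial, predict at x0.
    x = rng.uniform(-np.pi, np.pi, 30)
    y = g(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, deg), x0)

preds = np.array([fit_and_predict(3) for _ in range(5000)])
bias = preds.mean() - g(x0)   # B[f(x0)]: expected prediction minus truth
var = preds.var()             # V[f(x0)]: spread of predictions over datasets

# Direct Monte Carlo estimate of E_D[(y - f(x0))^2] with y = g(x0) + noise
y0 = g(x0) + rng.normal(0, sigma, 5000)
mse = np.mean((y0 - preds) ** 2)

# mse should be close to bias**2 + var + sigma**2
```

Lowering the polynomial degree here typically raises the bias and lowers the variance, which is exactly the trade-off the decomposition describes.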
Summary: the approximation error measures the error due to the choice of hypothesis class, while the estimation error measures the additional error due to the training algorithm and dataset; the bias-variance decomposition, by contrast, is a specific way of rewriting the expected error of a model as the dataset varies. Note that in the former case $f$ is fixed, as we are considering the generalization error, while in the latter case, we care about how $f$ changes as $D$ changes.
We can write them together by considering a fixed training algorithm $\mathcal{A}$ and hypothesis class $\mathcal{H}$, and considering the expected generalization error of our model as the dataset changes: \begin{align} \mathbb{E}_{D\sim P_T}[ \mathcal{E}(f_{D,\mathcal{A}}) ] &= \mathbb{E}_{D\sim P_T}[ E_S[D,\mathcal{A},\mathcal{H}]] + E_A[\mathcal{H}] \\ &= \mathbb{E}_{D\sim P_T}\left[ \mathbb{E}_{x\sim P} [L(f_{D,\mathcal{A}}(x),g(x))] \right] \\ &= \mathbb{E}_{x\sim P} \left[ \mathbb{E}_{D\sim P_T}[ (f_{D,\mathcal{A}}(x) - g(x))^2] \right] \\ &= \mathbb{E}_{x\sim P} \left[\mathfrak{B}[f_{D,\mathcal{A}}(x)]^2 + \mathbb{V}[f_{D,\mathcal{A}}(x)]\right] \end{align} for the case of the mean squared error without label noise (i.e., $\sigma = 0$).
Not sure what exactly you mean here. I think I've given a relatively formal definition above (compared to most ML courses), if that helps; I'll link a few other definitions below as well. If you're asking how to estimate these values, one approach is similar to cross-validation or bootstrapping: split (or resample) your training set into many pieces, fit a model on each, and directly estimate the mean and variance of the resulting predictions. Assuming $\sigma$ is small (that term we cannot estimate this way), this should give an estimate of the bias and variance.
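A minimal sketch of that resampling idea (the data-generating process and the cubic polynomial model are assumed purely for illustration; note also that the bias estimate requires knowing the true value $g(x_0)$, which in practice you'd replace with a holdout proxy):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed dataset: we can only resample it, not draw fresh D ~ P_T.
x = rng.uniform(-np.pi, np.pi, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)
x0, g_x0 = 0.5, np.sin(0.5)   # test point; true value assumed known here

# Bootstrap: refit on resampled datasets, collect predictions at x0.
preds = []
for _ in range(2000):
    idx = rng.integers(0, 200, 200)            # sample with replacement
    coef = np.polyfit(x[idx], y[idx], 3)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias_est = preds.mean() - g_x0   # bias estimate (needs the true value)
var_est = preds.var()            # variance estimate across resamples
```

The bootstrap distribution of predictions stands in for the distribution over datasets $D\sim P_T$, so these are approximations, not the exact quantities from the decomposition.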
On approximation vs estimation, see this post.
On bias vs variance decomposition, see these: [1], [2], [3], [4], [5], [6], [7], [8].