I am studying decision trees and how they can be enhanced via bagging, boosting, and random forests. In order to assess their performance I would like to adopt the notion of bias-variance decomposition for zero-one loss functions as done in *A Unified Bias-Variance Decomposition*.
The setup is as follows: The training set consists of $ \lbrace (x_{1}, t_{1}), \ldots, (x_{n}, t_{n}) \rbrace $ where $ t_{i} $ is the true label of instance $ i $ having covariate $ x_{i} $. The goal is to fit a learner $ f $ such that $ y = f(x) $ is a prediction of some future instance having covariate $ x $, with smallest possible loss. The loss function is denoted by $ L(t,y) $.
The optimal prediction $ y_{\ast} $ for an instance $ x $ is the prediction that minimizes $ E_{t}\left[ L(t,y_{\ast}) \right] $, where the subscript $ t $ denotes that the expectation is taken with respect to all possible values of $ t $, weighted by their probabilities given $ x $.
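As a concrete check of this definition (stated here for zero-one loss, which is the case of interest above), the minimization has a closed form:

$$ E_{t}\left[ L(t,y) \mid X = x \right] = P(t \neq y \mid X = x), \qquad \text{so} \qquad y_{\ast} = \arg\max_{t} P(t \mid X = x), $$

i.e., the optimal prediction is the mode of the conditional label distribution (the Bayes classifier).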
Question 1: I am not happy with the notation $ E_{t}\left[ L(t,y_{\ast}) \right] $; could I substitute it with $ E\left[ L(t,y_{\ast}) \mid X = x \right] $? Alternatively, would $ E_{t|x}\left[ L(t,y_{\ast}) \right] $ be a better notation?
Now, since the same learner will in general produce different models for different training sets, $ L(t,y) $ will be a function of the training data. This dependency can be removed by averaging over training sets. Let $ D $ be a set of training sets. Then the quantity of interest is the expected loss $ E_{D,t}\left[ L(t,y) \right] $, where the expectation is taken with respect to $ t $ and the training sets in $ D $ (i.e., with respect to $ t $ and the predictions $ y = f(x) $ produced for $ x $ by applying the learner to each training set in $ D $).
Question 2: Again, I find the subscript notation in $ E_{D,t}\left[ L(t,y) \right] $ a bit strange. If a subscript notation is to be used, I would maybe substitute this with $ E_{x}\left[ E_{t|x}\left[ L(t,y) \right]\right] $. However, I am not entirely sure what is meant here. Can someone shed some light on this?
The question is basically answered in this fine paper on bias-variance analysis. The answer to question 1 is yes: the interpretation of $E_{t}[L(t,y_{*})]$ is $E[L(t,y_{*})\mid X = x]$, which is calculated with respect to the conditional probability measure $P(t \mid X = x)$.
For question 2, the quantity $E_{D,t}[L(t,y)]$ is defined as $E_{D}[E_{t}[L(t,y)]]$, where the inner expectation is interpreted as in question 1. Note that the outer expectation is over training sets $D$, not over $x$, so $E_{D}\left[E_{t|x}\left[L(t,y)\right]\right]$ would be the natural subscript notation, rather than the $E_{x}\left[\cdot\right]$ suggested in the question. It makes sense to talk about a stochastic set $D$, although its density cannot be written down explicitly. In practice, the expectation $E_{D}[\cdot]$ can be approximated by cross-validation or bootstrapping.
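To make the bootstrap approximation of $E_{D}[E_{t}[L(t,y)]]$ concrete, here is a minimal sketch in pure Python. Everything in it is an assumption for illustration: a toy conditional distribution $P(t \mid x)$, a toy majority-vote "learner" standing in for a decision tree, and a single query point `x_query`. Bootstrap resamples of the observed training set play the role of draws from $D$, and Monte Carlo draws of $t$ approximate the inner conditional expectation.

```python
import random

random.seed(0)

def sample_t(x):
    # toy conditional label distribution (an assumption for illustration):
    # P(t = 1 | x) = 0.8 if x > 0, else 0.2
    p = 0.8 if x > 0 else 0.2
    return 1 if random.random() < p else 0

def fit(train):
    # toy "learner" standing in for a decision tree: majority vote among
    # training points on the same side of zero as the query point
    def f(x):
        labels = [t for xi, t in train if (xi > 0) == (x > 0)]
        return 1 if 2 * sum(labels) >= len(labels) else 0
    return f

# one observed training set of size 200
xs = [random.uniform(-1, 1) for _ in range(200)]
train = [(x, sample_t(x)) for x in xs]

x_query = 0.5
B, M = 200, 500  # bootstrap replicates, Monte Carlo draws of t

avg_loss = 0.0
for _ in range(B):
    # resampled training set: one bootstrap draw playing the role of D
    boot = [random.choice(train) for _ in train]
    y = fit(boot)(x_query)
    # inner expectation E_t[L(t, y) | x]: average zero-one loss over draws of t
    avg_loss += sum(sample_t(x_query) != y for _ in range(M)) / M
avg_loss /= B

print(avg_loss)  # approximates E_{D,t}[L(t, y)] at x_query
```

Under these toy assumptions the learner almost always predicts $y = 1$ at `x_query`, so the printed value should sit near $P(t = 0 \mid x) = 0.2$, the Bayes error at that point.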