I am currently working on an assignment on random forests. Here is the problem:
Consider building a random forest by both subsampling the data and choosing a single feature at random. For example, consider a dataset $D = \{(x_1,y_1),\dots,(x_n,y_n)\}$ where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ for $i=1,\dots,n$. A tree is constructed by first sampling one feature index $j \leq p$ uniformly at random and then drawing a sample of size $m \leq n$ with replacement. This process yields a subsample $D_k=\{(x_{k_1},y_{k_1}),\dots,(x_{k_m},y_{k_m})\}$, where each $x_{k_l}$ retains only the $j$th feature. We then build a decision tree on $D_k$. Averaging many such random trees gives the "very random forest".
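For concreteness, here is a small sketch of this construction in Python (my own code, not part of the assignment). As a simplifying assumption, I model a fully grown tree on a single feature as predicting the mean $y$ of the training points sharing the nearest observed feature value:

```python
import numpy as np

def fit_tree_1d(xj, y):
    """Idealized fully grown tree on one feature: group samples by
    unique feature value and record the group means."""
    vals = np.unique(xj)
    means = np.array([y[xj == v].mean() for v in vals])
    return vals, means

def predict_tree_1d(model, xq):
    vals, means = model
    # the nearest unique training value determines the leaf
    idx = np.abs(vals[None, :] - xq[:, None]).argmin(axis=1)
    return means[idx]

def very_random_forest(X, y, n_trees=200, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m or n
    trees = []
    for _ in range(n_trees):
        j = rng.integers(p)             # one feature index, uniformly at random
        idx = rng.integers(n, size=m)   # sample of size m with replacement
        trees.append((j, fit_tree_1d(X[idx, j], y[idx])))
    return trees

def forest_predict(trees, Xq):
    # average the individual trees' predictions
    preds = [predict_tree_1d(model, Xq[:, j]) for j, model in trees]
    return np.mean(preds, axis=0)
```

Each tree here corresponds to one draw of $\theta_k$ (the feature index plus the bootstrap indices).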
More formally, let $\theta_k$ be a sequence of i.i.d. random vectors controlling the subsampling mechanism, i.e., $\theta_k$ generates $D_k$ from $D$. The ensemble of trees is then $\{h(x^*;\theta_k,D)\}$, where $x^*$ is a test input.
Question:
a) For what true data distributions does the very random forest have zero bias?
b) For the data distributions in part (a), compare this very random forest with a traditional random forest in terms of bias and variance.
My intuition for part (a) is that for the very random forest to have zero bias, the true data distribution has to equal the empirical distribution of the training data. Here's my attempt at a formal argument:
We want to show that for any given $x^*$: $E_\theta[h(x^*; \theta, D)] = E_p[y \mid x^*]$,
where $p$ is the true distribution we are trying to characterize.
My intuition is:
$E_\theta[h(x^*;\theta,D)]=E_D[y \mid x^*]$
In other words, given a particular $x^*$, the expected prediction of the trees equals the mean of $y$ given $x^*$ in the training data.
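Here's a quick numerical check of that intuition on a toy dataset (my own construction, not from the assignment), again modeling a fully grown single-feature tree as predicting the mean response at the nearest observed feature value. In this easy setting, where feature values repeat in well-separated groups, the average over many draws of $\theta$ comes out close to the empirical conditional mean; whether it holds for general data is exactly what I'm unsure about.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy training set: p = 1, the feature takes values 0 and 1 with noisy responses
x = np.repeat([0.0, 1.0], 50)
y = np.where(x == 0.0, 1.0, 3.0) + rng.normal(0, 0.1, size=100)

def tree_predict(xs, ys, xq):
    # idealized fully grown 1-D tree: mean y of the nearest feature value
    vals = np.unique(xs)
    means = np.array([ys[xs == v].mean() for v in vals])
    return means[np.abs(vals - xq).argmin()]

n, m, B = len(x), len(x), 2000
# average prediction at x* = 0 over B independent draws of theta
ens = np.mean([
    tree_predict(x[idx], y[idx], 0.0)
    for idx in (rng.integers(n, size=m) for _ in range(B))
])
emp = y[x == 0.0].mean()   # empirical conditional mean E_D[y | x* = 0]
print(ens, emp)
```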
If that's true, then it implies
$E_D[y \mid x^*] = E_p[y \mid x^*]$
So the true distribution $p$ must equal the empirical distribution of $D$ for the bias to be zero.
What do you think?