How to generate and use random trees?


I have the assignment to implement a random tree classifier in MATLAB.

The lecture says:

Input: observations and labels
While stopping criterion not reached:
1. Node optimization:
   - several split candidates are randomly generated
   - the best splitting function is chosen according to some quality measure
2. Data splitting: observations are pushed to the left or right branch.
3. Move to the next node

Stopping criteria:
      - Quality measure
      - Number of data points in the current node/leaf
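For concreteness, the loop above could be sketched roughly as follows — in Python rather than MATLAB, with weighted Gini impurity standing in as the quality measure and all names being illustrative, not part of the stub:

```python
import random

def gini(labels):
    # Gini impurity: 0 for a pure node, larger for mixed labels
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def majority(labels):
    return max(set(labels), key=labels.count)

def build_node(X, y, n_candidates=10, min_points=5):
    # Stopping criterion: too few data points in the node, or a pure node
    if len(y) < min_points or len(set(y)) == 1:
        return {'leaf': True, 'label': majority(y)}
    best = None
    for _ in range(n_candidates):
        # 1. Node optimization: a split candidate is a random (feature, threshold) pair
        f = random.randrange(len(X[0]))
        t = random.choice(X)[f]
        left = [i for i, row in enumerate(X) if row[f] < t]
        right = [i for i, row in enumerate(X) if row[f] >= t]
        if not left or not right:
            continue
        # Quality measure: weighted impurity after the split (lower is better)
        q = (len(left) * gini([y[i] for i in left]) +
             len(right) * gini([y[i] for i in right])) / len(y)
        if best is None or q < best[0]:
            best = (q, f, t, left, right)
    if best is None:
        return {'leaf': True, 'label': majority(y)}
    _, f, t, left, right = best
    # 2. Data splitting: push observations to the left or right branch
    # 3. Move to the next node: recurse into both children
    return {'leaf': False, 'feature': f, 'threshold': t,
            'left':  build_node([X[i] for i in left],  [y[i] for i in left],
                                n_candidates, min_points),
            'right': build_node([X[i] for i in right], [y[i] for i in right],
                                n_candidates, min_points)}

def predict(node, row):
    # walk from the root to a leaf, following the stored splits
    while not node['leaf']:
        node = node['left'] if row[node['feature']] < node['threshold'] else node['right']
    return node['label']
```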

My problem now is that I do not understand how to get the randomly generated split candidates. Should I draw them from the input values? But then I would get a decision tree (pick a random element and send values > x to the right node and values < x to the left node). I also do not understand what the difference between a random tree and a decision tree is in the end.

Also the lecture says:

Choosing the best candidates: according to a quality measure
Out-of-bag error (OOB)  -  Minimize error rate after splitting using a test set
Information gain  -  Maximize information gain after splitting
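As an illustration of the second measure: information gain is the entropy of the parent node minus the weighted entropy of the children after the split. A small Python sketch (not part of the MATLAB stub):

```python
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, left, right):
    # entropy of the parent minus the weighted entropy of the children
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))
```

A split that perfectly separates a 50/50 node gains one full bit; a split that leaves both children 50/50 gains nothing.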

But what test set should I use? The data already used for training the tree?

Wikipedia and Google did not help me either. The code of the MATLAB stub can be found here: http://pastebin.com/iuzqF8gG

I appreciate your help.

2 Answers

I do not know about MATLAB, but in general you can use stochastic context-free grammars (SCFGs); see e.g. Weinberg and Nebel (2010).

Basic idea: represent trees with a CFG, train or define the production probabilities appropriately, and generate trees top-down by following the grammar.
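A toy sketch of that idea in Python, with made-up probabilities: a single nonterminal T either branches into two subtrees or terminates as a leaf, so sampling the grammar top-down yields a random binary tree.

```python
import random

# Toy SCFG for binary trees:
#   T -> (T, T)  with probability p_branch
#   T -> 'leaf'  otherwise
def generate(p_branch=0.4, depth=0, max_depth=10):
    # expand the nonterminal T top-down, following the grammar's probabilities
    if depth >= max_depth or random.random() >= p_branch:
        return 'leaf'
    return (generate(p_branch, depth + 1, max_depth),
            generate(p_branch, depth + 1, max_depth))
```

With p_branch below 0.5 the expected tree size stays finite; the max_depth cap is just a safeguard.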


My problem now is that I do not understand how to get the randomly generated split candidates. ... I also do not understand what the difference between a random tree and a decision tree is in the end.

Randomly select a split point and decide whether to accept it based on a quality measure such as BIC. It differs from a decision tree in that the split points are generated randomly; you generate several such trees and only keep the most useful one (w.r.t. some metric).
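The difference can be made concrete with a sketch (Python, illustrative names, weighted Gini impurity as the quality measure): a classic decision tree exhaustively scores every possible split at a node, whereas a random tree scores only a handful of randomly drawn candidates.

```python
import random

def split_quality(X, y, f, t):
    # weighted Gini impurity after splitting on (feature f, threshold t); lower is better
    left  = [y[i] for i, row in enumerate(X) if row[f] <  t]
    right = [y[i] for i, row in enumerate(X) if row[f] >= t]
    if not left or not right:
        return float('inf')  # degenerate split: one side is empty
    gini = lambda ys: 1.0 - sum((ys.count(c) / len(ys)) ** 2 for c in set(ys))
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_split_decision_tree(X, y):
    # decision tree: exhaustively try every (feature, observed value) pair
    candidates = [(f, row[f]) for f in range(len(X[0])) for row in X]
    return min(candidates, key=lambda c: split_quality(X, y, *c))

def best_split_random_tree(X, y, n_candidates=5):
    # random tree: score only a few randomly drawn (feature, threshold) candidates
    candidates = []
    for _ in range(n_candidates):
        f = random.randrange(len(X[0]))
        candidates.append((f, random.choice(X)[f]))
    return min(candidates, key=lambda c: split_quality(X, y, *c))
```

The random variant is cheaper per node and, repeated over many trees, produces the diverse ensemble that the classifier averages over.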

But what test set should I use? The data already used for training the tree?

Use cross-validation (e.g. leave-one-out cross-validation) or, more simply, a hold-out split.

Here, you take your data set and randomly split it into two disjoint sets: a training set and a test set (e.g. if you have 100 data points, randomly choose 50 for training and keep the rest for testing). Once you have trained your tree on the training set, you can evaluate it on the test set, which gives you an insight into how well your tree performs on unseen data (the test error).
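A minimal sketch of that split in Python (function name and parameters are illustrative):

```python
import random

def holdout_split(X, y, train_fraction=0.5, seed=0):
    # shuffle the indices, then cut them into disjoint training and test sets
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(X))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test],  [y[i] for i in test])
```

Error measured on the same data the tree was grown on is optimistic; the held-out test error is the honest estimate.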