Tree/Random Forest (Bagging)#

CART#

  • CART is binary: each node has either 2 children or 0 children

    • Other algorithms such as ID3 can produce decision trees whose nodes have more than two children

    • Other types of splitting rules

      • oblique decision trees or binary space partition trees (BSP trees): a linear split at each node

      • sphere trees: space is partitioned by a sphere of a certain radius around a fixed point

  • The CART algorithm searches for the pair \((k, t_k)\), a feature and its cutoff point, that produces the purest subsets (weighted by their size)

    • The CART algorithm is greedy: it searches for an optimal split at the top level, then repeats the process at each subsequent level. It does not check whether the split will lead to the lowest possible impurity several levels down, so it is not guaranteed to find the optimal tree.

    • finding the optimal tree is known to be an NP-Complete problem: it requires \(O(\exp(m))\) time, making the problem intractable even for fairly small training sets.

  • Prediction complexity: Since each node only requires checking the value of one feature, the overall prediction complexity is just \(O(\log_2(m))\), independent of the number of features
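
A minimal scikit-learn sketch of this point (the iris data is just an illustrative choice): for a reasonably balanced tree, the number of feature checks per prediction grows roughly like \(\log_2(m)\).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a CART-style tree and compare its depth (feature checks on the longest
# root-to-leaf path) with log2 of the training-set size.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

print("tree depth:", tree.get_depth())
print("log2(m):   ", np.log2(X.shape[0]))
```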

Regression tree#

  • If the loss function is \(l(\hat{y}, y) = (\hat{y} - y)^2\), then the best choice of score on each leaf is \(\hat{c}_m = avg(y_i|x_i \in R_m)\), where \(\{R_1, \cdots, R_M\}\) is the partition of a given tree and the final prediction of a sample is defined as

    \[f(x) = \sum_{m=1}^{M} \hat{c}_m I(x\in R_m)\]
    • Assume there are \(d\) features \(x = (x_1, \cdots, x_d)\) and the tree is split on the \(j\)th feature (\(j \in \{1, \cdots, d\}\)) at split point \(s\in \mathbb{R}\); then the partition based on \(j\) and \(s\) is:

      \[R_1(j, s) = \{x|x_j\le s\}\]
      \[R_2(j, s) = \{x|x_j > s\}\]
    • For each splitting variable \(x_j\) and split point \(s\),

      \[\hat{c}_1(j, s) = avg(y_i|x_i\in R_1(j, s))\]
      \[\hat{c}_2(j, s) = avg(y_i|x_i\in R_2(j, s))\]
    • Find \(j, s\) by minimizing loss

      \[L(j,s)= \sum_{i:x_i\in R_1(j, s)}(y_i - \hat{c}_1(j, s))^2 + \sum_{i:x_i\in R_2(j, s)}(y_i - \hat{c}_2(j, s))^2\]
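
A minimal pure-numpy sketch of this exhaustive search for \((j, s)\); the function name `best_split` and the toy data are illustrative, not taken from any library:

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, loss) minimizing the squared-error criterion L(j, s)."""
    best_j, best_s, best_loss = None, None, np.inf
    n, d = X.shape
    for j in range(d):
        # Candidate split points: midpoints between consecutive sorted unique values
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss

# Toy example: the target mostly depends on whether feature 1 exceeds 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] > 0.5, 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_split(X, y))
```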
  • Overcome the potential risk of over-fitting (see regularization parameters below)

    • Trees can easily get over-fit if we do enough splitting

    • CART uses the number of terminal nodes (as its complexity measure)

    • Another option is tree depth

  • Complexity of a tree

    • Let \(|T| = M\) denote the number of terminal nodes in \(T\), which is used to measure the complexity of a tree

    • Finding the optimal binary tree of a given complexity is computationally intractable

      • Use greedy algorithm: build the tree one node at a time, without any planning ahead

    • Control the complexity of a tree (i.e. avoid over-fitting)

      • limit max depth of tree

      • require all leaf nodes contain a minimum number of points

      • require that a node have at least a certain number of data points before it can be split

      • do backward pruning (the approach of CART, Breiman et al. 1984)

        • Build a really big tree (e.g. until all regions have \(\le\) 5 points).

        • “Prune” the tree back greedily all the way to the root, assessing performance on a validation set

Classification tree#

  • Similar to regression, the predicted class for node \(m\) is just the class with the highest probability, i.e. the most frequent class, in that node \(m\)

    \[k(m) = \text{argmax}_k\hat{p}_{mk}\]

    where

    \[\hat{p}_{mk} = \frac{1}{N_m}\sum_{i: x_i\in R_m}I(y_i = k)\]
  • The misclassification rate on node \(m\) is \(1 - \hat{p}_{mk(m)}\)

  • Loss function

    • 0/1 loss: tractable for finding the best split, but not usually used, especially when splitting repeatedly rather than splitting just once

    • Eventually we want *pure* leaf nodes, so it is natural to choose the splitting variable and split point that minimize the node impurity after splitting

      • Misclassification error: \(1 - \hat{p}_{mk(m)}\)

      • Gini index: \(\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk}) = 1 - \sum_{k=1}^K \hat{p}^2_{mk}\)

      • Entropy or deviance (= using information gain): \(-\sum_{k=1}^K\hat{p}_{mk}\log \hat{p}_{mk}\)

        • Gini and entropy tend to be more effective splitting criteria than misclassification error: they push for purer nodes, not just a lower misclassification rate

        • Gini impurity is slightly faster to compute, so it is a good default. When they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.

  • How to split

    • Find the split that minimizes the size-weighted sum of node impurities

      \[N_L\, Q(R_L) + N_R\, Q(R_R)\]

      where \(R_L\) and \(R_R\) are the candidate left and right regions, \(N_L\) and \(N_R\) are the numbers of observations in the left and right branches after the split, and \(Q(\cdot)\) is the impurity measure
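
The impurity measures and the weighted split criterion above can be written in a few lines of numpy; this is a sketch with illustrative labels and function names, not library code:

```python
import numpy as np

def proportions(y):
    """Class proportions within a node containing integer labels y."""
    return np.bincount(y) / len(y)

def misclassification(y):
    return 1.0 - proportions(y).max()

def gini(y):
    p = proportions(y)
    return 1.0 - (p ** 2).sum()

def entropy(y):
    p = proportions(y)
    p = p[p > 0]                      # avoid log(0)
    return -(p * np.log2(p)).sum()

def split_criterion(y_left, y_right, impurity=gini):
    """Size-weighted impurity N_L * Q(R_L) + N_R * Q(R_R) of a candidate split."""
    return len(y_left) * impurity(y_left) + len(y_right) * impurity(y_right)

y_left, y_right = np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0, 0])
print(split_criterion(y_left, y_right))           # Gini-based
print(split_criterion(y_left, y_right, entropy))  # entropy-based
```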

Trees in General#

  • Missing value in feature

    • Complete-case analysis (drop observations with missing values)

    • Impute with the feature mean

    • If categorical, set missing as a new category

    • Use surrogate split

      • For every internal node, form a list of surrogate features and split points; the goal is to approximate the original split as well as possible, and surrogates are ordered by how well they do so.
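
A short pandas sketch of two of the simpler strategies above (mean imputation and a "missing" category); the column names are made up for illustration, and note that scikit-learn's trees, for instance, do not implement CART-style surrogate splits:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25.0, np.nan, 40.0, 31.0],   # numeric feature with a missing value
    "city": ["NY", None, "LA", "NY"],     # categorical feature with a missing value
})

df["age"] = df["age"].fillna(df["age"].mean())   # impute with the feature mean
df["city"] = df["city"].fillna("missing")        # treat missing as its own category
print(df)
```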

  • Categorical feature

    • Suppose we have a categorical feature with \(q\) possible values (unordered).

    • For binary classification \(Y =\{0, 1\}\), there is an efficient algorithm

      • Assign each category a number: the proportion of class 0 among observations in that category.

      • Then find optimal split as though it were a numeric feature.

      • This was proved to be equivalent to searching over all possible category splits (Breiman et al. 1984).
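
A minimal pandas sketch of this encoding trick for a binary target; the feature name `color` and the data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green", "red"],
    "y":     [0,     1,      0,     1,       0,      1,       1],
})

# Encode each category by its proportion of class 0; the encoded column can
# then be split with an ordinary numeric threshold.
prop_class0 = df.groupby("color")["y"].apply(lambda s: (s == 0).mean())
df["color_encoded"] = df["color"].map(prop_class0)
print(df)
```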

  • Trees have to work much harder to capture linear relations

  • Trees make no use of geometry

    • No inner products or distances

    • called a “nonmetric” method

    • Feature scale irrelevant

  • Predictions are not continuous

    • not so bad for classification

    • may not be desirable for regression

  • Regularization Hyperparameters: To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during training. As you know by now, this is called regularization

    • max_depth: maximum depth of the tree

    • min_samples_split: the minimum number of samples a node must have before it can be split

    • min_samples_leaf: the minimum number of samples a leaf node must have

    • min_weight_fraction_leaf: same as min_samples_leaf but expressed as a fraction of the total number of weighted instances

    • max_leaf_nodes: maximum number of leaf nodes

    • max_features: maximum number of features that are evaluated for splitting at each node
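
These names match scikit-learn's decision-tree parameters; a brief sketch of setting several of them on a `DecisionTreeClassifier` (the specific values are arbitrary examples, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    max_depth=4,            # cap the depth of the tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=4,     # every leaf keeps at least 4 samples
    max_leaf_nodes=16,      # cap the number of terminal nodes
    max_features="sqrt",    # features evaluated for splitting at each node
    random_state=0,
).fit(X, y)

print("leaves:", clf.get_n_leaves(), "depth:", clf.get_depth())
```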

  • Instability

    • Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation.

      • One way to limit this problem is to use PCA, which often results in a better orientation of the training data.

    • More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data.

      • Random Forests can limit this instability by averaging predictions over many trees
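
A hedged sketch of the PCA idea: rotate the data first so that axis-aligned splits land in a friendlier orientation, then fit the tree. The dataset and parameters are illustrative, and PCA is not guaranteed to help on every problem:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

plain_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
pca_tree = make_pipeline(PCA(n_components=2),
                         DecisionTreeClassifier(max_depth=4, random_state=0))

print("tree alone:", cross_val_score(plain_tree, X, y, cv=5).mean())
print("PCA + tree:", cross_val_score(pca_tree, X, y, cv=5).mean())
```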

Tree Pruning#

  1. Build a “big tree”

    • Keep splitting until every node either has zero error OR has \(C\) (e.g. 5 or 1) or fewer examples

  2. Pruning

    • Consider an internal node \(n\); pruning the subtree rooted at \(n\) eliminates all descendants of \(n\), and \(n\) becomes a terminal node

  3. Cost Complexity Greedy Pruning

    • Cost-complexity criterion with parameter \(\alpha\):

      \[C_{\alpha}(T) = \hat{R}(T) + \alpha|T|\]

      where \(\hat{R}(T)\) quantifies the empirical risk and \(|T|\) is the complexity of a tree

    • Thus, \(C_{\alpha}\) trades off between the empirical risk and the complexity of the tree

    • Cost complexity pruning:

      • For each \(\alpha\), find the subtree \(T \subset T_0\) minimizing \(C_{\alpha}(T)\) (on the training data).

      • Use cross validation to find the right choice of \(\alpha\)

      • We only need to try \(N_{Int}\) trees to find the final pruned tree \(T\), where \(N_{Int}\) is the number of internal nodes of \(T_0\), that is

        \[\mathcal{T}= \{T_0 \supset T_1 \supset T_2 \supset \cdots \supset T_{N_{Int}}\}\]
      • These \(N_{Int}\) trees can be identified by repeatedly pruning the tree (collapsing the subtree below one internal node) and, at each step, choosing the prune that minimizes the resulting increase in error \(\hat{R}(T_{i+1}) - \hat{R}(T_i)\), until all internal nodes are removed and only a single (leaf) node is left

      • Breiman et al. (1984) proved that

        \[\left\{\arg\min_{T\subset T_0} C_{\alpha}(T) \,\middle|\, \alpha \ge 0\right\} \subset \mathcal{T}\]
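
scikit-learn exposes this procedure through `ccp_alpha` and `cost_complexity_pruning_path`; a minimal sketch (the dataset and CV settings are arbitrary choices) of generating the candidate \(\alpha\)'s and picking one by cross-validation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas: one per subtree in the nested sequence T_0 ⊃ T_1 ⊃ ...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Cross-validate each alpha and keep the best one
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in alphas]
print("best alpha:", alphas[int(np.argmax(cv_scores))])
```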

Bagging#

  • Parallel ensemble method \(\rightarrow\) bagging \(\rightarrow\) random forest[1]

  • Averaging independent prediction functions is preferred over a single predictor because it reduces the variance of the prediction; however, in reality we do not have multiple independent samples, so instead we use the bootstrap to construct multiple samples for different predictors (bagging = bootstrap aggregation[2])

  • Bagging (for regression)

    • Draw \(B\) bootstrap samples \(D^1,...,D^B\) from original data \(D\).

    • Let \(\hat{f}_1, \hat{f}_2, \cdots, \hat{f}_B: \mathcal{X} \rightarrow \mathbb{R}\) be the prediction functions for each set

    • Bagged prediction function is given as

      \[\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}_b(x)\]
    • \(\hat{f}_{\text{bag}}(x)\) usually has smaller variance than an individual predictor \(\hat{f}_{b}(x)\)

  • Out-of-Bag error estimation

    • Each bagged predictor is trained on roughly 63% of the data points (the expected fraction of distinct points in a bootstrap sample is \(1 - 1/e \approx 0.632\)); the remaining ~37% are called out-of-bag (OOB) observations.

      \[\hat{f}_{OOB}(x_i) = \frac{1}{|S_i|}\sum_{b\in S_i} \hat{f}_b(x_i)\]

      where \(S_i = \{b \mid D^b \text{ does not contain the } i\text{th point}\}\)

      • The OOB error is a good estimate of the test error.

      • OOB error is similar to cross-validation error – both are computed on the training set.
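
A minimal numpy/scikit-learn sketch of bagging regression trees and computing the OOB estimate defined above; \(B\), the dataset, and the variable names are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
n, B = len(y), 50
rng = np.random.default_rng(0)

preds = np.zeros((B, n))           # every tree's prediction on every training point
in_bag = np.zeros((B, n), bool)    # which points each bootstrap sample D^b contains

for b in range(B):
    idx = rng.integers(0, n, size=n)                       # bootstrap sample D^b
    in_bag[b, idx] = True
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds[b] = tree.predict(X)

# OOB prediction for x_i: average over the trees whose bootstrap sample misses x_i
oob_counts = (~in_bag).sum(axis=0)
oob_pred = (preds * ~in_bag).sum(axis=0) / np.maximum(oob_counts, 1)
mask = oob_counts > 0
print("OOB MSE:", np.mean((y[mask] - oob_pred[mask]) ** 2))
```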

  • General sentiment is that bagging helps most when

    • Relatively unbiased base prediction functions

    • High variance / low stability

  • Bootstrap samples are

    • independent samples from the training set, but are not independent samples from \(P_{\mathcal{X}\times\mathcal{Y}}\).

    • This dependence limits the amount of variance reduction we can get.

Random Forest#

  • Main idea of random forest

    • Use bagged decision trees, but modify the tree-growing procedure to reduce the correlation between trees.

    • Key step in random forests: When constructing each tree node, restrict choice of splitting variable to a randomly chosen subset of features of size \(m\).

    • Typically choose \(m \approx \sqrt{p}\), where \(p\) is the number of features.

    • Can choose \(m\) using cross-validation.
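
A short scikit-learn sketch of the key random-forest knob discussed above: `max_features="sqrt"` restricts the candidate splitting variables at each node. The dataset and `n_estimators` value are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # out-of-bag estimate, no extra hold-out needed
```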

Defining variable importance in RF#

  1. Accuracy based importance

    • Permutation importance: mean decrease in classification accuracy after permuting \(X_j\) over all trees

  2. Gini importance

    • Mean Gini gain produced by splitting \(X_j\) over all trees

  • Permutation importance is typically calculated on each tree's out-of-bag (out-of-sample) observations
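
A hedged scikit-learn sketch of the two importance measures: `feature_importances_` is the impurity (Gini) based importance, and `permutation_importance` measures the accuracy drop after shuffling a feature. Note that scikit-learn computes permutation importance on whatever data you pass it (here a held-out split), rather than on each tree's OOB samples as in Breiman's original formulation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("Gini importance:", rf.feature_importances_[:5])        # impurity-based
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation importance:", perm.importances_mean[:5])   # accuracy-based
```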