Decision Trees and Random Forests¶
Prof. Forrest Davis
Apply a decision tree to new data¶
- Consider the following decision tree (from the book).
Question: Classify a new data point with the following values:
- petal length: 2.50cm
- petal width: 1.75cm
Answer: The point would be classified as versicolor.
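A minimal sketch of checking this in code, assuming scikit-learn and the same setup as the book's iris tree (petal length and petal width only, `max_depth=2`); treat it as illustrative rather than an exact reproduction of the figure.

```python
# Fit a small decision tree on the iris petal features and classify the new point.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length (cm), petal width (cm)
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# petal length = 2.50 cm, petal width = 1.75 cm
prediction = tree_clf.predict([[2.50, 1.75]])
print(iris.target_names[prediction])  # expected: ['versicolor']
```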
Step One: A classification problem (our old friend, the bad plots)¶
- Consider the following figure.
Question: Calculate the Gini impurity.
Hint: You can get the Gini impurity if you can answer the following question:
- What is the probability of classifying a point incorrectly (e.g., if I select a triangle and guess a label based on the class proportions I know exist, what is the likelihood I misclassify that triangle)?
Answer
- Starting with the triangle class, which I mispredict as a circle:
\begin{align} G_\textrm{triangle} &= P(\textrm{select triangle}) * (1 - P(\textrm{classify as triangle}))\\ &= \frac{5}{15} * (1-\frac{5}{15}) \\ &= \frac{2}{9} \end{align}
- Next, the circle class, which I mispredict as a triangle:
\begin{align} G_\textrm{circle} &= P(\textrm{select circle}) * (1 - P(\textrm{classify as circle}))\\ &= \frac{10}{15} * (1-\frac{10}{15}) \\ &= \frac{2}{9} \end{align}
- Total probability of misclassification is $\frac{4}{9} \approx 0.44$
- Notice it is less than $0.5$; why is that?
A note on the equation¶
- Above we use $G_\textrm{i} = \sum_{k=1}^n P_{i,k} (1 - P_{i,k})$ where $n$ is the number of output classes and $P_{i,k}$ is the proportion of points belonging to class $k$ (out of the $n$ classes) at a node $i$ in the tree. Recall that a node holds some subset of the data determined by the splits above it.
- The book instead uses $G_\textrm{i} = 1 - \sum_{k=1}^n P_{i,k}^2$
- These are equivalent:
\begin{align} G_\textrm{i} &= \sum_{k=1}^n P_{i,k} (1 - P_{i,k})\\ &= \sum_{k=1}^n \left(P_{i,k} - P_{i,k}^2\right) \\ &= \sum_{k=1}^n P_{i,k} - \sum_{k=1}^n P_{i,k}^2 \\ &= 1 - \sum_{k=1}^n P_{i,k}^2 \end{align}
Note $\sum_{k=1}^n P_{i,k} = 1$ because, by the fundamental rules of probability, the class probabilities at a node $i$ must sum to 1.
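As a quick numerical check (assuming NumPy is available), both forms give the same value on the 5-triangle / 10-circle node from above:

```python
import numpy as np

p = np.array([5/15, 10/15])      # class probabilities at the node

gini_v1 = np.sum(p * (1 - p))    # sum_k p_k * (1 - p_k)
gini_v2 = 1 - np.sum(p ** 2)     # 1 - sum_k p_k^2

print(gini_v1, gini_v2)          # both are 4/9 ≈ 0.444
```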
Step Two: Apply Gini impurity¶
- Consider the following two potential decision boundaries:
Question: Calculate the Gini impurity for the left and right regions for each decision boundary
Answer:
- Equation for Gini impurity
$$ G = 1 - \sum_{k=1}^K p_{k} ^2 $$
where we have $K$ output classes and $p_k$ is the proportion of points in class $k$.
Let's start with boundary A.
- The left side has a Gini impurity of:
$$ G_\textrm{left A} = 1 - ((\frac{10}{11})^2 + (\frac{1}{11})^2) = 0.165 $$
- The right side has a Gini impurity of:
$$ G_\textrm{right A} = 1 - ((\frac{0}{4})^2 + (\frac{4}{4})^2) = 0 $$
Next, boundary B.
- The left side has a Gini impurity of:
$$ G_\textrm{left B} = 1 - ((\frac{10}{10})^2 + (\frac{0}{10})^2) = 0 $$
- The right side has a Gini impurity of:
$$ G_\textrm{right B} = 1 - ((\frac{0}{5})^2 + (\frac{5}{5})^2) = 0 $$
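A small helper in plain Python (the name `gini` is just illustrative) reproduces these four region impurities from their class counts, written here as (triangles, circles):

```python
def gini(counts):
    """Gini impurity of a region given its per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([1, 10]))   # left of A  -> ~0.165
print(gini([4, 0]))    # right of A -> 0.0
print(gini([0, 10]))   # left of B  -> 0.0
print(gini([5, 0]))    # right of B -> 0.0
```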
Question: Calculate the CART cost function value for both decision boundaries
Answer
- The cost function is as follows:
$$\textrm{Cost}_\textrm{threshold} = \frac{m_\textrm{left}}{m}G_\textrm{left} + \frac{m_\textrm{right}}{m}G_\textrm{right}$$
where $m$ is the total number of samples considered for this split and $G$ is the Gini impurity of each side.
- For A
$$ \textrm{Cost}_\textrm{A} = \frac{11}{15} * 0.165 + \frac{4}{15}*0 = 0.121$$
- For B
$$ \textrm{Cost}_\textrm{B} = \frac{10}{15} * 0 + \frac{5}{15}*0 = 0$$
Question: How much information did I gain with each decision boundary?
Answer
- For A:
- $0.44 - 0.121 = 0.319$
- For B:
- $0.44 - 0 = 0.44$
- I gained the most information with B, so it wins!
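To tie the last two answers together, here is a minimal sketch (plain Python; `gini` and `cart_cost` are illustrative names, not library functions) that computes the CART cost and the information gain for both boundaries:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def cart_cost(left_counts, right_counts):
    """Weighted Gini impurity of a split (the CART cost function)."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return (m_left / m) * gini(left_counts) + (m_right / m) * gini(right_counts)

parent = gini([5, 10])                  # impurity before splitting, 4/9 ≈ 0.444

cost_A = cart_cost([1, 10], [4, 0])     # ≈ 0.121
cost_B = cart_cost([0, 10], [5, 0])     # 0.0

print(parent - cost_A)                  # information gain for A, ≈ 0.32
print(parent - cost_B)                  # information gain for B, ≈ 0.44
```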
Question: Consider a binary classification task. Three regions with the following properties are produced by candidate decision boundaries. Determine their Gini impurities.
- All points belong to class 1
- Half of the points belong to class 1
- None of the points belong to class 1
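Using the Gini formula from above, a quick check of the three cases:

\begin{align} G_\textrm{all class 1} &= 1 - (1^2 + 0^2) = 0 \\ G_\textrm{half class 1} &= 1 - (0.5^2 + 0.5^2) = 0.5 \\ G_\textrm{no class 1} &= 1 - (0^2 + 1^2) = 0 \end{align}

A pure region (all one class) has zero impurity; a 50/50 region is as impure as a two-class region can get.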
Step Three: CART Algorithm Sketch¶
Guiding Questions:
- Sketch out the CART Algorithm
- Why is this algorithm greedy?
- What are possible stopping criteria?
Answers
1. Calculate the Gini impurity of the entire dataset (i.e., the impurity of the root node).
2. For each input feature, calculate the weighted Gini impurity (the CART cost) of the split at every candidate threshold. The feature/threshold pair with the minimum cost wins.
3. Split the data on the chosen feature/threshold pair and create new child nodes.
4. Repeat steps 2 and 3 on each new node until some stopping criterion is met: a maximum tree depth, a minimum number of data points in a leaf node, or a minimum reduction in impurity. (A minimal code sketch of this loop follows.)
- The algorithm is greedy because at each node it picks the split that looks best right now (lowest cost), without considering whether a different split would lead to a better tree overall.
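Below is a minimal, from-scratch sketch of that loop (plain Python/NumPy; `gini`, `best_split`, and `grow_tree` are illustrative names, not any library's API). scikit-learn's `DecisionTreeClassifier` implements the real thing.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of integer class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy step: try every feature/threshold pair, keep the cheapest split."""
    m, n_features = X.shape
    best = (None, None, np.inf)                    # (feature, threshold, cost)
    for j in range(n_features):
        for thresh in np.unique(X[:, j]):
            left = X[:, j] <= thresh
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            cost = (left.sum() / m) * gini(y[left]) + (right.sum() / m) * gini(y[right])
            if cost < best[2]:
                best = (j, thresh, cost)
    return best

def grow_tree(X, y, depth=0, max_depth=3):
    """Recursively split until a node is pure or a stopping criterion is hit."""
    if depth >= max_depth or gini(y) == 0:          # stopping criteria
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    feature, thresh, _ = best_split(X, y)
    if feature is None:                             # no valid split found
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    left = X[:, feature] <= thresh
    return {"leaf": False, "feature": feature, "threshold": float(thresh),
            "left": grow_tree(X[left], y[left], depth + 1, max_depth),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth)}
```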
Learning a decision tree for continuous data¶
- Given the following sample data, use the CART algorithm to learn a decision tree.
Sample | F$_1$ | Y |
---|---|---|
1 | 2.2 | 1 |
2 | 3.2 | 0 |
3 | 4.2 | 0 |
4 | 4.6 | 1 |
5 | 5.6 | 1 |
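A sketch of letting CART learn this tree automatically (assuming scikit-learn); `export_text` prints the splits it chose, which you can compare against your hand-derived tree:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[2.2], [3.2], [4.2], [4.6], [5.6]]   # F1
y = [1, 0, 0, 1, 1]                        # Y

tree_clf = DecisionTreeClassifier(criterion="gini")
tree_clf.fit(X, y)
print(export_text(tree_clf, feature_names=["F1"]))
```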
Learning a decision tree for non-binary categorical data¶
Question
- Given the following sample data, use the CART algorithm to learn a decision tree.
Sample | F$_1$ | F$_2$ | Y |
---|---|---|---|
1 | A | cat | 1 |
2 | B | cat | 1 |
3 | A | cookie | 0 |
4 | C | cookie | 0 |
5 | C | dog | 0 |
6 | B | dog | 1 |
7 | A | cat | 1 |
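scikit-learn's CART implementation expects numeric features, so one (assumed) way to handle this table is to one-hot encode F$_1$ and F$_2$ first, e.g. with pandas, and then fit a tree as before:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "F1": ["A", "B", "A", "C", "C", "B", "A"],
    "F2": ["cat", "cat", "cookie", "cookie", "dog", "dog", "cat"],
    "Y":  [1, 1, 0, 0, 0, 1, 1],
})

X = pd.get_dummies(df[["F1", "F2"]])      # one binary column per category value
y = df["Y"]

tree_clf = DecisionTreeClassifier(criterion="gini")
tree_clf.fit(X, y)
print(export_text(tree_clf, feature_names=list(X.columns)))
```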
Limitations of Decision Trees¶
Question: What are some limitations of decision trees?
Answer
- Our learning algorithm can proceed until all leaf nodes are pure, meaning we can greatly overfit our data.
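A small sketch of that risk (assuming scikit-learn; exact numbers depend on the random seed): an unrestricted tree memorizes a noisy training set, while a regularized one usually holds up a bit better on held-out data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unrestricted: grows until every leaf is pure -> perfect training accuracy
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Regularized: each leaf must contain at least 5 training points
small_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X_train, y_train)

print("unrestricted train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("regularized  train/test:", small_tree.score(X_train, y_train), small_tree.score(X_test, y_test))
```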
Random Forests, Ensemble Learning, and Voting Classifiers¶
- Random forests: train a bunch of decision trees on your data (adding in some randomness so the trees differ from one another) and combine their predictions.
- Ensemble learning: train a bunch of models and use them all to make a prediction.
- A voting classifier is one approach to ensemble learning: every trained model predicts, and the class receiving the most votes wins.
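A minimal sketch of both ideas (assuming scikit-learn), fit on the same noisy moons data as above:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: many randomized trees whose predictions are combined
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("random forest :", forest.score(X_test, y_test))

# Voting classifier: a hard-voting ensemble over a few different model types
voting = VotingClassifier(estimators=[
    ("lr", LogisticRegression()),
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC()),
], voting="hard")
voting.fit(X_train, y_train)
print("voting ensemble:", voting.score(X_test, y_test))
```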