Categorical Support in Decision Trees
In decision tree algorithms, handling categorical features requires special consideration. Unlike numerical features where natural ordering exists, categorical variables need different strategies for finding optimal splits. Here’s how we handle categorical splits for different types of problems:
Binary Classification and Regression:
We group categories based on their relationship with the target variable
Each category’s contribution to the target is measured through mean target values
Categories are sorted and systematically grouped to find the most effective split point
Multiclass Classification:
We examine how each category relates to multiple classes
Since classes don’t have natural ordering
We explore different ways to group categories by considering all possible class orderings
Binary Classification and Regression
For each categorical value \(k\), we calculate its mean target value to understand how that category relates to what we’re trying to predict:
Where:
\(C_k\): Set of samples belonging to category \(k\)
\(y_i\): Target value for sample \(i\)
\(w_i\): Sample weight for instance \(i\)
Detailed Steps:
Sort categories by their mean target values (\(\mu_k\)) in descending order:
\[ \mu_1 \geq \mu_2 \geq \dots \geq \mu_m \]Evaluate all possible splits: For each potential split point \(t\) between 1 and \(m-1\) categories:
Left node: Top \(t\) categories (higher mean values)
Right node: Remaining \(m-t\) categories (lower mean values)
Select the split that gives the best prediction performance by minimizing:
For regression: Mean Squared Error
For binary classification: Gini Impurity
Multiclass Classification
When dealing with multiple classes (3+ categories), we need a more sophisticated approach since there’s no natural ordering between classes.
1. Class Distribution Calculation
Calculate how each category relates to each possible class:
Where:
\(p_{k,j}\): Proportion of class \(j\) in category \(k\)
\(I(y_i=j)\): Equals 1 if sample \(i\) belongs to class \(j\), 0 otherwise
2. Permutation-based Ordering
Since we don’t know the best way to order classes, we try all possible arrangements:
Consider all ways to order the classes
For each ordering \(\pi\), we compare categories \(k_1\) and \(k_2\) based on:
Their class distributions matching up to some position
The first position where they differ determines their order
3. Split Evaluation
For each class ordering \(\pi\) and split point \(t\):
Calculate node purity using Gini impurity:
\[ Gini(S) = 1 - \sum_{j=1}^C p_j^2 \]Where \(p_j\) is the proportion of class \(j\) in a node
Calculate weighted average impurity for both child nodes:
\[ Gini_{split} = \frac{n_L}{n}Gini(S_L) + \frac{n_R}{n}Gini(S_R) \]Where:
\(n_L, n_R\): Number of samples (weighted) in left and right nodes
\(S_L, S_R\): Samples in left and right nodes
4. Best Split Selection
Find the optimal split by minimizing impurity across all possible:
Class orderings (\(\pi\))
Split points (\(t\))
This comprehensive approach ensures we find the most effective way to group categorical values, regardless of the problem type or number of classes.