Oblique Split Algorithm
1. General Framework and Notation
1.1 Data and Parameter Definitions
\(X \in \mathbb{R}^{N \times d}\): Feature matrix
\(\mathbf{y} \in \{0,1,\ldots,K-1\}^N\) or \(\mathbb{R}^N\): Target variable (categorical for classification, continuous for regression)
\(\mathbf{w} \in \mathbb{R}^d\): Optimization parameters defining the split hyperplane
\(\boldsymbol{\omega} \in \mathbb{R}^N\): Sample weights
\(\gamma\): Scaling parameter for the sigmoid function (controls the sharpness of the split)
\(\epsilon\): Small constant for numerical stability
1.2 Basic Computations
For each sample \(i\):
Computes the projection of each sample onto the direction vector \(\mathbf{w}\)
Transforms the projection into a probability of belonging to the left child node
Smoothed version of a hard split, allowing gradient-based optimization
1.3 Optimal Threshold Selection
After obtaining optimal weights \(\mathbf{w}\), we select the threshold \(t\) that minimizes the impurity measure:
where Impurity is either Gini impurity for classification or MSE for regression. In implementation, this threshold is chosen by an exact scan over the sorted projected values \(\{z_i\}_{i=1}^N\) after the weight optimization step.
2. Binary Classification
2.1 Soft Decision Tree Approach
Left Node |
Right Node |
|---|---|
\(S_L = \sum_{i=1}^N \omega_i p_i + \epsilon\) |
\(S_R = \sum_{i=1}^N \omega_i(1-p_i) + \epsilon\) |
\(M_L = \sum_{i=1}^N \omega_i p_i y_i\) |
\(M_R = \sum_{i=1}^N \omega_i(1-p_i)y_i\) |
\(P_{L1} = \frac{M_L}{S_L}\) |
\(P_{R1} = \frac{M_R}{S_R}\) |
These statistics track the weighted distribution of samples and their labels in each child node.
The optimization objective is to minimize:
The optimization requires computing several gradients:
Basic probability gradient:
Node sum gradients:
Class probability gradient:
Final gradient:
2.2 Linear Classification Approach
This approach treats the split as a linear classification problem.
3. Multiclass Classification
3.1 Soft Decision Tree Approach
Left Node |
Right Node |
|---|---|
\(C_{L,k} = \sum_{i:y_i=k} \omega_i p_i\) |
\(C_{R,k} = \sum_{i:y_i=k} \omega_i(1-p_i)\) |
\(S_L = \sum_{k=0}^{K-1} C_{L,k} + \epsilon\) |
\(S_R = \sum_{k=0}^{K-1} C_{R,k} + \epsilon\) |
\(P_{L,k} = \frac{C_{L,k}}{S_L}\) |
\(P_{R,k} = \frac{C_{R,k}}{S_R}\) |
These statistics track the weighted distribution of samples for each class in the child nodes.
The optimization objective is to minimize:
The optimization requires computing several gradients:
Basic probability gradient:
Total soft mass gradient:
Final gradient.
Define
Then
3.2 One-vs-Rest Approach
For \(K\) classes, we train \(K\) binary classifiers where for each classifier \(k\):
Class \(k\) samples are labeled as 1
All other class samples are labeled as 0
The final decision is made by selecting the class whose model gives the minimum loss:
4. Regression
4.1 Soft Decision Tree Approach
Left Node |
Right Node |
|---|---|
\(S_L = \sum_{i=1}^N \omega_i p_i + \epsilon\) |
\(S_R = \sum_{i=1}^N \omega_i(1-p_i) + \epsilon\) |
\(M_L = \sum_{i=1}^N \omega_i p_i y_i\) |
\(M_R = \sum_{i=1}^N \omega_i(1-p_i)y_i\) |
\(m_L = \frac{M_L}{S_L}\) |
\(m_R = \frac{M_R}{S_R}\) |
These statistics track the weighted sum and mean of target values in each child node.
The optimization objective is to minimize:
The optimization requires computing several gradients:
Node mean gradients:
Final gradient:
4.2 Linear Regression Approach
5. L-BFGS Optimization
The optimization process uses a custom unconstrained L-BFGS implementation to solve:
Key elements of the optimization process:
Objective Function:
Minimizes the loss function (Gini impurity or MSE)
Stopping Criteria:
Stops when maximum iterations (
maxiter) is reachedStops when the infinity norm of the gradient falls below a small tolerance
Stops when relative improvement falls below threshold:
\[ \frac{\mathcal{L}(\mathbf{w}_{t-1}) - \mathcal{L}(\mathbf{w}_t)}{\mathcal{L}(\mathbf{w}_{t-1})} \leq \text{relative\_change} \]
Solution Normalization:
Final weights are normalized:
\[ \mathbf{w} \leftarrow \frac{\mathbf{w}}{\max(|\mathbf{w}|)} \]