Getting Started
obliquetree combines advanced capabilities with efficient performance. It supports oblique splits, leveraging a custom L-BFGS optimization routine to determine the best linear weights for splits, ensuring both speed and accuracy.
In traditional mode, without oblique splits, obliquetree outperforms scikit-learn in terms of speed and adds support for categorical variables, providing a significant advantage over many traditional decision tree implementations.
When the oblique feature is enabled, obliquetree dynamically selects the optimal split type between oblique and traditional splits. If no weights can be found to reduce impurity, it defaults to an axis-aligned split, ensuring robustness and adaptability in various scenarios.
In very large trees (e.g., depth 20 or more), the performance of obliquetree may converge closely with that of traditional trees. The true strength of obliquetree lies in its ability to perform well at shallower depths, offering improved generalization with fewer splits. Moreover, thanks to linear projections, obliquetree can significantly outperform traditional trees on datasets that exhibit linear relationships.
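To illustrate why a single oblique split can replace many axis-aligned ones, the following sketch builds a toy dataset whose class boundary is the diagonal line x0 + x1 = 0 and compares the weighted Gini impurity of the children. It uses only NumPy, not obliquetree itself. The weights w = (1, 1) are picked by hand for the example, whereas the library would search for them with L-BFGS.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # diagonal class boundary


def gini(labels):
    """Gini impurity of a set of binary labels; 0.0 for an empty or pure set."""
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / labels.size
    return float(1.0 - np.sum(p ** 2))


def weighted_gini(go_right, labels):
    """Sample-weighted Gini impurity of the two children produced by a split."""
    n_right = int(go_right.sum())
    n_left = labels.size - n_right
    total = n_left * gini(labels[~go_right]) + n_right * gini(labels[go_right])
    return total / labels.size


# Best single axis-aligned split: try each observed value of each feature as a threshold.
best_axis = min(
    weighted_gini(X[:, j] > t, y) for j in range(X.shape[1]) for t in X[:, j]
)

# One oblique split with hand-picked weights w = (1, 1) and threshold 0.
w = np.array([1.0, 1.0])
oblique = weighted_gini(X @ w > 0.0, y)

print(f"parent impurity:            {gini(y):.3f}")
print(f"best axis-aligned children: {best_axis:.3f}")
print(f"oblique children:           {oblique:.3f}")
```

On this data the oblique split yields pure children, while the best axis-aligned split still leaves a substantial share of mixed samples on each side and would need further splits to approximate the diagonal.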
Note
Data Format:
obliquetree expects input data in Fortran order. If the data is not in Fortran order, the library will automatically create a copy in the correct format.
Splitting Criteria:
For regression tasks, the library currently uses Mean Squared Error (MSE).
For classification tasks, it uses Gini Impurity.
Future versions will include additional splitting criteria.
Data Standardization:
It is highly recommended to standardize your data before using obliquetree for better performance and stability.
Flexibility:
The library can be used as a traditional decision tree without oblique splits.
Handling Missing and Infinite Values:
For traditional axis-aligned splits, obliquetree can handle NaN and Inf values.
For oblique splits, any NaN or Inf values must be imputed before use.
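The data-related notes above (Fortran order, standardization, and imputation of NaN/Inf values for oblique splits) can be combined into one preprocessing step. The sketch below is one possible way to do this with NumPy alone. The helper name prepare_for_oblique, the choice of median imputation, and the float64 dtype are assumptions made for illustration, not part of the library. It also assumes every column is numeric.

```python
import numpy as np


def prepare_for_oblique(X):
    """Impute non-finite values, standardize columns, return a Fortran-ordered array."""
    X = np.array(X, dtype=np.float64)  # copy so the caller's array is untouched

    # 1. Replace NaN and +/-Inf with the median of each column's finite values.
    for j in range(X.shape[1]):
        finite = np.isfinite(X[:, j])
        fill = np.median(X[finite, j]) if finite.any() else 0.0
        X[~finite, j] = fill

    # 2. Standardize to zero mean and unit variance (constant columns are only centered).
    std = X.std(axis=0)
    std[std == 0.0] = 1.0
    X = (X - X.mean(axis=0)) / std

    # 3. Store the result column-major (Fortran order).
    return np.asfortranarray(X)


X_raw = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.inf, 6.0],
    [3.0, -np.inf],
])
X_ready = prepare_for_oblique(X_raw)

print(np.isfinite(X_ready).all())       # True
print(X_ready.flags["F_CONTIGUOUS"])    # True
```

In a real train/test workflow, the medians, means, and standard deviations would be computed on the training data only and then reused for the test data.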
Parameter Descriptions
General Parameters
use_oblique (bool, default=True): Specifies whether to use oblique splits.
When set to True, the decision tree can use both linear combinations of features and axis-aligned splits.
When False, the tree uses only traditional axis-aligned splits.
max_depth (int, default=-1): Maximum depth of the tree.
If set to -1, the tree expands until all leaves are pure or contain fewer than min_samples_split samples.
min_samples_leaf (int, default=1): The minimum number of samples required to be at a leaf node.
min_samples_split (int, default=2): The minimum number of samples required to split an internal node.
min_impurity_decrease (float, default=0.0): A node is split if the impurity decrease is greater than or equal to this value.
ccp_alpha (float, default=0.0): Complexity parameter for Minimal Cost-Complexity Pruning.
Larger values result in more aggressive pruning.
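As a rough illustration of how min_impurity_decrease gates a split in a regression tree, the sketch below computes the MSE impurity of a node and the decrease achieved by one candidate split. The weighting by node size follows the convention used by scikit-learn and is assumed here. The source does not state the exact formula obliquetree uses, and the threshold 0.5 is a hypothetical setting.

```python
import numpy as np


def mse(y):
    """MSE impurity of a node: mean squared deviation from the node mean."""
    return float(np.mean((y - y.mean()) ** 2)) if y.size else 0.0


def impurity_decrease(y, go_right, n_total):
    """Weighted impurity decrease of a split (scikit-learn convention, assumed here)."""
    left, right = y[~go_right], y[go_right]
    children = (left.size * mse(left) + right.size * mse(right)) / y.size
    return (y.size / n_total) * (mse(y) - children)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 3.0, 3.0])

decrease = impurity_decrease(y, x > 2.5, n_total=y.size)
min_impurity_decrease = 0.5  # hypothetical setting for this example

print(decrease)                           # 1.0
print(decrease >= min_impurity_decrease)  # True
```

Here the split at x > 2.5 produces two pure children, so the full parent impurity of 1.0 is removed and the split would be accepted.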
Oblique-Specific Parameters
n_pair (int, default=2): The number of features combined in each oblique split candidate.
Candidate tuples are generated from a screened subset of usable numeric features.
If top_k=None, the library uses an internal heuristic: \(k=\min\{p,\max(\lfloor\sqrt{p}\rfloor, 2\,n\_pair)\}\), where \(p\) is the number of usable numeric features.
Example: If there are 20 usable numeric features and n_pair=2, the default heuristic keeps \(k=\max(\lfloor\sqrt{20}\rfloor, 4)=4\) features and evaluates \(\binom{4}{2}=6\) candidate pairs.
top_k (int or None, default=None): Number of screened numeric features kept before generating oblique candidate tuples.
Set a larger value to search more aggressively.
Set top_k equal to the number of usable numeric features to recover exhaustive candidate enumeration.
gamma (float, default=1.0): Controls the separation strength in oblique splits.
Higher values enforce stronger separation in the loss function.
max_iter (int, default=100): Maximum number of iterations for the L-BFGS optimization algorithm.
relative_change (float, default=0.001): Early stopping threshold for L-BFGS optimization.
Smaller values lead to longer optimization times but can improve split quality.
random_state (int, default=None): Seed for random number generation in oblique splits.
Ensures reproducibility of results when set.
Categorical Data Support
categories (List[int], default=None): A list of column indices representing categorical features.
Categorical features are not used directly in oblique splits but are fully supported in axis-aligned splits.
Important
The n_pair parameter is critical for oblique splits. It defines how many features are combined to evaluate split candidates. For example:
If screening keeps k usable numeric features, the algorithm evaluates \(\binom{k}{n\_pair}\) candidates.
With the default heuristic and 20 usable numeric features, n_pair=3 gives \(k=\max(\lfloor\sqrt{20}\rfloor, 6)=6\), so the algorithm evaluates \(\binom{6}{3}=20\) candidates.
If top_k=20, the same setting becomes exhaustive and evaluates \(\binom{20}{3}=1140\) candidates.
Recommended Values:
n_pair=2 or n_pair=3 for most use cases.
Keep top_k=None unless you explicitly want a broader candidate search.
Avoid large top_k together with large n_pair, as the computational cost still grows combinatorially in \(\binom{k}{n\_pair}\).
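To estimate the search cost before training, the documented screening heuristic and candidate count can be reproduced with the standard library. The function names screened_k and n_candidates are hypothetical helpers for this sketch. Capping an explicit top_k at the number of usable features p is an assumption, since the source only describes the top_k=None case in detail.

```python
from math import comb, isqrt


def screened_k(p, n_pair, top_k=None):
    """Number of features kept by screening, following the documented heuristic."""
    if top_k is not None:
        return min(p, top_k)  # assumption: top_k cannot exceed the usable features
    return min(p, max(isqrt(p), 2 * n_pair))


def n_candidates(p, n_pair, top_k=None):
    """Number of feature tuples evaluated: C(k, n_pair)."""
    return comb(screened_k(p, n_pair, top_k), n_pair)


print(n_candidates(20, 2))            # 6
print(n_candidates(20, 3))            # 20
print(n_candidates(20, 3, top_k=20))  # 1140
```

The three printed values match the worked examples in the text and show how quickly the count grows once screening is disabled.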
Important
Oblique split search relies on seeded random initialization and feature screening. Use a fixed random_state when you need reproducible results across runs.