Getting Started

obliquetree is a decision tree implementation built for both speed and accuracy. It supports oblique splits and uses a custom L-BFGS optimization routine to determine the best linear weights for each split.

In traditional mode, without oblique splits, obliquetree outperforms scikit-learn in terms of speed and adds support for categorical variables, providing a significant advantage over many traditional decision tree implementations.

When the oblique feature is enabled, obliquetree dynamically selects the optimal split type between oblique and traditional splits. If no weights can be found to reduce impurity, it defaults to an axis-aligned split, ensuring robustness and adaptability in various scenarios.
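
The choice between split types can be illustrated with a small, self-contained sketch. It is not the library's implementation: the helper names are hypothetical, the oblique weights are hand-picked rather than fitted with L-BFGS, and the "keep the oblique split only if it is strictly better" rule is an assumption used to mimic the fallback described above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(projections, labels, threshold):
    """Impurity after sending samples with projection <= threshold to the left."""
    left = [y for z, y in zip(projections, labels) if z <= threshold]
    right = [y for z, y in zip(projections, labels) if z > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold_impurity(projections, labels):
    """Lowest weighted impurity over thresholds taken at the observed values."""
    return min(weighted_gini(projections, labels, t) for t in set(projections))

# Toy data: the class is 1 exactly when x0 + x1 > 0.
X = [(-2, 1), (1, -2), (2, -1), (-1, 2), (-1, -1), (1, 1)]
y = [0, 0, 1, 1, 0, 1]

# Best axis-aligned split: try each feature on its own.
axis_impurity = min(
    best_threshold_impurity([row[j] for row in X], y) for j in range(2)
)

# Oblique split with hand-picked weights w = (1, 1); an optimizer would fit these.
w = (1.0, 1.0)
oblique_impurity = best_threshold_impurity(
    [w[0] * x0 + w[1] * x1 for x0, x1 in X], y
)

# Assumed selection rule: keep the oblique split only if it is strictly better.
chosen = "oblique" if oblique_impurity < axis_impurity else "axis-aligned"
print(chosen, oblique_impurity, axis_impurity)
```

On this toy data no single-feature threshold separates the classes, while the projection x0 + x1 does, so the oblique candidate is chosen.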

In very deep trees (e.g., depth 20 or more), the performance of obliquetree may converge toward that of traditional trees. The true strength of obliquetree lies in its ability to perform well at shallower depths, offering improved generalization with fewer splits. Moreover, thanks to linear projections, obliquetree significantly outperforms traditional trees on datasets that exhibit linear relationships.

Note

  1. Data Format:

    • obliquetree expects input data in Fortran order. If the data is not in Fortran order, the library will automatically create a copy in the correct format.

  2. Splitting Criteria:

    • For regression tasks, the library currently uses Mean Squared Error (MSE).

    • For classification tasks, it uses Gini Impurity.

    • Future versions will include additional splitting criteria.

  3. Data Standardization:

    • It is highly recommended to standardize your data before using obliquetree for better performance and stability.

  4. Flexibility:

    • The library can be used as a traditional decision tree without oblique splits.

  5. Handling Missing and Infinite Values:

    • For traditional axis-aligned splits, obliquetree can handle NaN and Inf values.

    • For oblique splits, any NaN or Inf values must be imputed before use.
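
The data-preparation notes above can be sketched with NumPy as follows. This is a minimal example under stated assumptions: the toy matrix and variable names are hypothetical, and standardization is done by hand rather than with any particular preprocessing library.

```python
import numpy as np

# Hypothetical toy matrix; rows are samples, columns are numeric features.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [4.0, 240.0]])

# Oblique splits need finite inputs, so check before fitting.
if not np.isfinite(X).all():
    raise ValueError("Impute NaN/Inf values before enabling oblique splits")

# Standardize each column to zero mean and unit variance.
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0.0] = 1.0  # leave constant columns unscaled instead of dividing by zero
X_scaled = (X - mean) / std

# Convert to Fortran (column-major) order up front so the library
# does not need to make its own copy.
X_ready = np.asfortranarray(X_scaled)
print(X_ready.flags["F_CONTIGUOUS"])
```

When a held-out set is involved, the mean and standard deviation computed on the training data should be reused for it rather than recomputed.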


Parameter Descriptions

General Parameters

  • use_oblique (bool, default=True):

    • Specifies whether to use oblique splits.

    • When set to True, the decision tree can use both linear combinations of features and axis-aligned splits.

    • When False, the tree uses traditional axis-aligned splits.

  • max_depth (int, default=-1):

    • Maximum depth of the tree.

    • If set to -1, the tree expands until all leaves are pure or contain fewer than min_samples_split samples.

  • min_samples_leaf (int, default=1):

    • The minimum number of samples required to be at a leaf node.

  • min_samples_split (int, default=2):

    • The minimum number of samples required to split an internal node.

  • min_impurity_decrease (float, default=0.0):

    • A node is split if the impurity decrease is greater than or equal to this value.

  • ccp_alpha (float, default=0.0):

    • Complexity parameter for Minimal Cost-Complexity Pruning.

    • Larger values result in more aggressive pruning.

Oblique-Specific Parameters

  • n_pair (int, default=2):

    • The number of features combined in each oblique split candidate.

    • Candidate tuples are generated from a screened subset of usable numeric features.

    • If top_k=None, the library uses an internal heuristic: \(k=\min\{p,\max(\lfloor\sqrt{p}\rfloor, 2\,n\_pair)\}\) where \(p\) is the number of usable numeric features.

    • Example: If there are 20 usable numeric features and n_pair=2, the default heuristic keeps \(k=\max(\lfloor\sqrt{20}\rfloor, 4)=4\) features and evaluates \(\binom{4}{2}=6\) candidate pairs.

  • top_k (int or None, default=None):

    • Number of screened numeric features kept before generating oblique candidate tuples.

    • Set a larger value to search more aggressively.

    • Set top_k equal to the number of usable numeric features to recover exhaustive candidate enumeration.

  • gamma (float, default=1.0):

    • Controls the separation strength in oblique splits.

    • Higher values enforce stronger separation in the loss function.

  • max_iter (int, default=100):

    • Maximum number of iterations for the L-BFGS optimization algorithm.

  • relative_change (float, default=0.001):

    • Early stopping threshold for L-BFGS optimization.

    • Smaller values lead to longer optimization times but can improve split quality.

  • random_state (int or None, default=None):

    • Seed for random number generation in oblique splits.

    • Ensures reproducibility of results when set.

Categorical Data Support

  • categories (List[int], default=None):

    • A list of column indices representing categorical features.

    • Categorical features are not used directly in oblique splits but are fully supported in axis-aligned splits.


Important

The n_pair parameter is critical for oblique splits. It defines how many features are combined to evaluate split candidates. For example:

  • If screening keeps k usable numeric features, the algorithm evaluates \(\binom{k}{n\_pair}\) candidates.

  • With the default heuristic and 20 usable numeric features, n_pair=3 gives \(k=\max(\lfloor\sqrt{20}\rfloor, 6)=6\), so the algorithm evaluates \(\binom{6}{3}=20\) candidates.

  • If top_k=20, the same setting becomes exhaustive and evaluates \(\binom{20}{3}=1140\) candidates.

Recommended Values:

  • n_pair=2 or n_pair=3 for most use cases.

  • Keep top_k=None unless you explicitly want a broader candidate search.

Avoid large top_k together with large n_pair, as the computational cost still grows combinatorially in \(\binom{k}{n\_pair}\).
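
To estimate the search cost before choosing n_pair and top_k, the counts above can be reproduced with a few lines of standard-library Python. The function names are hypothetical, and capping an explicit top_k at the number of usable features is an assumption; only the default heuristic formula comes from the parameter description.

```python
import math

def screened_feature_count(p, n_pair, top_k=None):
    """Number of screened features k kept before forming candidate tuples."""
    if top_k is None:
        # Documented default heuristic: k = min(p, max(floor(sqrt(p)), 2 * n_pair)).
        return min(p, max(math.isqrt(p), 2 * n_pair))
    # Assumption: an explicit top_k is capped at the number of usable features.
    return min(p, top_k)

def candidate_count(p, n_pair, top_k=None):
    """Number of n_pair-sized feature tuples evaluated per split search."""
    return math.comb(screened_feature_count(p, n_pair, top_k), n_pair)

print(candidate_count(20, 2))            # 6
print(candidate_count(20, 3))            # 20
print(candidate_count(20, 3, top_k=20))  # 1140
```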

Important

Oblique split search relies on random initialization and feature screening, both of which are driven by a seed. Set a fixed random_state when you need reproducible results across runs.
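
The effect of a fixed seed can be illustrated generically. The sketch below is not the library's initialization routine: initial_weights is a hypothetical helper, and the uniform range is an arbitrary choice for illustration.

```python
import random

def initial_weights(n_pair, seed):
    """Draw one starting weight per combined feature from a seeded generator."""
    rng = random.Random(seed)  # private generator; does not touch global state
    return [rng.uniform(-1.0, 1.0) for _ in range(n_pair)]

# The same seed reproduces the same starting point on every call.
first = initial_weights(3, seed=42)
second = initial_weights(3, seed=42)
print(first == second)
```

Because an optimizer such as L-BFGS is deterministic given its starting point, fixing the seed is what makes the fitted weights repeatable.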