# Getting Started


`obliquetree` combines advanced capabilities with efficient performance. It supports **oblique splits**, leveraging a custom **L-BFGS optimization** routine to determine the best linear weights for splits, ensuring both speed and accuracy.

In **traditional mode**, without oblique splits, `obliquetree` outperforms `scikit-learn` in terms of speed and adds support for **categorical variables**, providing a significant advantage over many traditional decision tree implementations.

When the **oblique feature** is enabled, `obliquetree` dynamically selects the optimal split type between oblique and traditional splits. If no weights can be found to reduce impurity, it defaults to an **axis-aligned split**, ensuring robustness and adaptability in various scenarios.

In very large trees (e.g., depth 20 or more), the performance of `obliquetree` may converge closely with **traditional trees**. The true strength of `obliquetree` lies in their ability to perform exceptionally well at **shallower depths**, offering improved generalization with fewer splits. Moreover, thanks to linear projections, `obliquetree` significantly outperform traditional trees when working with datasets that exhibit **linear relationships**.

```{note}
1. **Data Format**:
   - `obliquetree` expects input data in **Fortran order**. If the data is not in Fortran order, the library will automatically create a copy in the correct format.

2. **Splitting Criteria**:
   - For **regression tasks**, the library currently uses **Mean Squared Error (MSE)**.
   - For **classification tasks**, it uses **Gini Impurity**.
   - Future versions will include additional splitting criteria.

3. **Data Standardization**:
   - It is **highly recommended** to standardize your data before using `obliquetree` for better performance and stability.

4. **Flexibility**:
   - The library can be used as a **traditional decision tree** without oblique splits.

5. **Handling Missing and Infinite Values**:
   - For **traditional axis-aligned splits**, `obliquetree` can handle `NaN` and `Inf` values.
   - For **oblique splits**, any `NaN` or `Inf` values must be **imputed** before use.
```

---

## Parameter Descriptions

### General Parameters
- **`use_oblique` (bool, default=True):**
  - Specifies whether to use oblique splits.
  - When set to `True`, the decision tree can use both linear combinations of features and axis-aligned to make splits.
  - When `False`, the tree uses traditional axis-aligned splits.

- **`max_depth` (int, default=-1):**
  - Maximum depth of the tree.
  - If set to `-1`, the tree expands until all leaves are pure or contain fewer than `min_samples_split` samples.

- **`min_samples_leaf` (int, default=1):**
  - The minimum number of samples required to be at a leaf node.

- **`min_samples_split` (int, default=2):**
  - The minimum number of samples required to split an internal node.

- **`min_impurity_decrease` (float, default=0.0):**
  - A node is split if the impurity decrease is greater than or equal to this value.

- **`ccp_alpha` (float, default=0.0):**
  - Complexity parameter for Minimal Cost-Complexity Pruning.
  - Larger values result in more aggressive pruning.

### Oblique-Specific Parameters
- **`n_pair` (int, default=2):**
  - The number of features to consider for oblique splits.
  - Candidate tuples are generated from a screened subset of usable numeric features.
  - If `top_k=None`, the library uses an internal heuristic: $k=\min\{p,\max(\lfloor\sqrt{p}\rfloor, 2\,n\_pair)\}$ where $p$ is the number of usable numeric features.
  - **Example:** If there are 20 usable numeric features and `n_pair=2`, the default heuristic keeps $k=\max(\lfloor\sqrt{20}\rfloor, 4)=4$ features and evaluates $\binom{4}{2}=6$ candidate pairs.

- **`top_k` (int or None, default=None):**
  - Number of screened numeric features kept before generating oblique candidate tuples.
  - Set a larger value to search more aggressively.
  - Set `top_k` equal to the number of usable numeric features to recover exhaustive candidate enumeration.

- **`gamma` (float, default=1.0):**
  - Controls the separation strength in oblique splits.
  - Higher values enforce stronger separation in the loss function.

- **`max_iter` (int, default=100):**
  - Maximum number of iterations for the L-BFGS optimization algorithm.

- **`relative_change` (float, default=0.001):**
  - Early stopping threshold for L-BFGS optimization.
  - Smaller values lead to longer optimization times but can improve split quality.

- **`random_state` (int, default=None):**
  - Seed for random number generation in oblique splits.
  - Ensures reproducibility of results when set.

### Categorical Data Support
- **`categories` (List[int], default=None):**
  - A list of column indices representing categorical features.
  - Categorical features are not used directly in oblique splits but are fully supported in axis-aligned splits.

---


```{important}
The `n_pair` parameter is critical for oblique splits. It defines how many features are combined to evaluate split candidates. For example:
- If screening keeps **k** usable numeric features, the algorithm evaluates $\binom{k}{n\_pair}$ candidates.
- With the default heuristic and **20 usable numeric features**, `n_pair=3` gives $k=\max(\lfloor\sqrt{20}\rfloor, 6)=6$, so the algorithm evaluates $\binom{6}{3}=20$ candidates.
- If `top_k=20`, the same setting becomes exhaustive and evaluates $\binom{20}{3}=1140$ candidates.

**Recommended Values:**
- `n_pair=2` or `n_pair=3` for most use cases.
- Keep `top_k=None` unless you explicitly want a broader candidate search.

Avoid large `top_k` together with large `n_pair`, as the computational cost still grows combinatorially in $\binom{k}{n\_pair}$.
```

```{important}
Oblique split search depends on deterministic random initialization and screening seeds. Use a fixed `random_state` when you need reproducible results across runs.
```