Classification algorithm

Random forest classifier

A non-linear model for predicting a categorical label from feature columns. Trains a scikit-learn RandomForestClassifier — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus impurity-based feature importance.

It is the robust non-linear baseline alongside the linear logreg-classifier-v1 and the gradient-boosted lightgbm-classifier-v1. Pick it when the relationship between features and label is not linear, when features interact, or when you want a model that works well with little tuning.

What it does

You point it at a DataSource and pick:

a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Like logistic regression, and unlike the LightGBM classifier (which consumes categoricals natively), a scikit-learn forest needs every feature to be numeric. The model does that conversion internally, so there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the other classifiers — the same classifier_train / classifier_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► classifier_train      classifier_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, logistic regression, and LightGBM share the exact same scoring + metric code.
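The fit-only / score-only separation can be sketched in plain scikit-learn. The two functions below borrow the node names but are hypothetical stand-ins, not the platform's actual code; the synthetic data, the 75/25 split, and the single metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def classifier_train(model, X_train, y_train):
    """Fit-only step: returns the trained model and computes no metrics."""
    return model.fit(X_train, y_train)


def classifier_eval(trained_model, X_test, y_test):
    """Algorithm-agnostic step: only needs predict() on the held-out rows."""
    predictions = trained_model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "predictions": predictions,
    }


# Synthetic data stands in for the DataSource (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The same eval function scores any fitted classifier.
results = {}
for name, model in [
    ("random_forest", RandomForestClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]:
    trained = classifier_train(model, X_train, y_train)
    results[name] = classifier_eval(trained, X_test, y_test)
```

Because the eval function touches the model only through predict(), swapping the algorithm does not change the scoring path.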

What a random forest is

A random forest fits many decision trees and averages their class-probability predictions. Each tree is trained on a bootstrap sample of the rows (drawn with replacement), and at every split only a random subset of the features is considered. These two sources of randomness make the individual trees disagree with each other; averaging the disagreeing trees cancels much of their individual variance while adding little bias. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and that is hard to overfit badly.
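The two sources of randomness can be illustrated with the standard library alone. This is a minimal sketch of the sampling idea, not scikit-learn's internal code; the row and feature counts are arbitrary.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n_rows, n_features = 10_000, 16

# Bootstrap: draw n_rows row indices with replacement.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap_rows)) / n_rows

# Per-split feature subsampling: sqrt(n_features) candidates, no repeats.
n_candidates = max(1, int(math.sqrt(n_features)))
split_candidates = rng.sample(range(n_features), n_candidates)
```

A bootstrap sample of this size contains roughly 63% of the distinct rows (the limit is 1 - 1/e), so every tree sees a noticeably different slice of the data.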

This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM classifier's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.
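The averaging step can be checked directly against scikit-learn: a fitted forest exposes its trees in `estimators_`, and the mean of their class probabilities should match the forest's own `predict_proba`. The synthetic dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare with the forest's own output.
matches_forest = np.allclose(manual_average, forest.predict_proba(X))
```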

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

Feature kind	Steps
Numeric / boolean	impute missing values with the median → standardize to zero mean, unit variance
String / categorical	impute missing values with the most frequent value → one-hot encode

Scaling is not needed for a tree-based model: decision-tree splits depend only on the ordering of each feature's values, so standardizing numeric features does not change which splits are chosen. The step is kept only so every scikit-learn classifier shares one pipeline shape; for the forest it has no effect on predictions.
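A quick way to convince yourself of the scale invariance is to fit the same forest, with the same seed, on raw and on standardized features and compare the predictions. This is a sketch on synthetic data with deliberately mismatched feature scales; the data and sizes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
z = rng.normal(size=(400, 4))
X = z * np.array([1.0, 10.0, 100.0, 1000.0])  # very different scales
y = (z[:, 0] * z[:, 1] > 0).astype(int)       # label driven by an interaction
X_train, X_test, y_train = X[:300], X[300:], y[:300]

scaler = StandardScaler().fit(X_train)

# Same seed, so both forests draw the same bootstrap rows and features.
raw = RandomForestClassifier(n_estimators=25, random_state=0)
raw.fit(X_train, y_train)
scaled = RandomForestClassifier(n_estimators=25, random_state=0)
scaled.fit(scaler.transform(X_train), y_train)

same_predictions = np.array_equal(
    raw.predict(X_test), scaled.predict(scaler.transform(X_test))
)
```

The comparison holds because standardization preserves the order of each feature's values, which is all a split threshold depends on.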

The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
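The pipeline described above might look roughly like this in scikit-learn. It is a sketch under assumptions: the toy frame, the column names, the step names, and the use of `pickle` for serialization are all illustrative and not taken from the actual implementation.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names are illustrative only.
frame = pd.DataFrame({
    "age": [34, 51, np.nan, 27, 45, 39],
    "is_trial": [True, False, False, True, False, True],
    "plan": ["basic", "pro", np.nan, "basic", "pro", "basic"],
    "churned": ["yes", "no", "no", "yes", "no", "yes"],
})
numeric_cols, categorical_cols = ["age", "is_trial"], ["plan"]

# Booleans are cast to 0/1 so they ride the numeric branch.
X = frame[numeric_cols + categorical_cols].astype({"is_trial": float})
y = frame["churned"]

numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", numeric_branch, numeric_cols),
        ("cat", categorical_branch, categorical_cols),
    ])),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)

# One serialized object carries imputers, scaler, encoder, and forest.
restored = pickle.loads(pickle.dumps(model))
restored_predictions = restored.predict(X)
```

Keeping preprocessing inside the pipeline means the medians, scaling statistics, and category lists learned at fit time are the ones replayed at inference.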

Missing values & unseen categories

Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; scikit-learn forests did not accept NaN before version 1.4, so the pipeline always imputes and you don't have to.
Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
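The unseen-category behaviour matches what scikit-learn's `OneHotEncoder` does with `handle_unknown="ignore"`. That this is the exact setting used in the pipeline is an assumption; the category values are made up.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["basic"], ["pro"]])  # categories seen during training

# A known category sets exactly one indicator.
known = encoder.transform([["pro"]]).toarray()

# An unseen category produces all zeros and raises no exception.
unseen = encoder.transform([["enterprise"]]).toarray()
```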

Metric set

Same as the other classifiers — the eval step is shared:

Metric	Meaning
Accuracy	Fraction of test rows classified correctly
Precision / Recall / F1	Weighted averages across classes; for a binary target, the positive-class values
ROC AUC	Ranking quality — binary single score, multi-class macro one-vs-rest
Confusion matrix	Per-class true-vs-predicted counts
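For a binary target, the metrics in the table can be reproduced with `sklearn.metrics`. The labels and scores below are made-up toy values, and whether the shared eval step calls these exact functions is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
y_score = [0.2, 0.6, 0.9, 0.7, 0.4]  # predicted probability of class 1

accuracy = accuracy_score(y_true, y_pred)  # 3 of 5 correct
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
auc = roc_auc_score(y_true, y_score)       # ranking quality of the scores
matrix = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted
```

For a multi-class target, the analogous calls would use `average="weighted"` for precision, recall, and F1, and `multi_class="ovr", average="macro"` for ROC AUC with a full probability matrix.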

Feature importance

The chart shows impurity decrease (mean decrease in impurity, MDI) — for each (one-hot-expanded) feature, how much that feature reduced classification impurity across all the splits that used it, averaged over every tree. The values are non-negative and sum to 1.

These are impurity-based importances — they are not numerically comparable to logistic regression's |coefficient| bars, nor to the LightGBM classifier's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.

Hyperparameters

Pass these on the model node's hyperparams (all optional):

Key	Default	Meaning
`n_estimators`	`100`	Number of trees in the forest — more trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows
`max_depth`	`None`	Maximum depth of each tree — `None` grows trees fully; set a cap to make a smaller, faster, more regularized model
`min_samples_leaf`	`1`	Minimum rows in a leaf — raising it smooths predictions and curbs overfitting on noisy data
`max_features`	`"sqrt"`	Features considered at each split — `"sqrt"` is the classification default and decorrelates the trees
`class_weight`	`None`	Set to `"balanced"` to up-weight minority classes on imbalanced data
`n_jobs`	`-1`	CPU cores used to fit trees in parallel — `-1` uses all cores

The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
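The keys in the table map onto `RandomForestClassifier` constructor arguments. The sketch below assumes the run seed is forwarded as `random_state`; the `hyperparams` dictionary, its values, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical hyperparams payload using the keys from the table above.
hyperparams = {
    "n_estimators": 200,
    "max_depth": 8,
    "min_samples_leaf": 5,
    "max_features": "sqrt",
    "class_weight": "balanced",
    "n_jobs": -1,
}
run_seed = 42  # assumed to be passed through as random_state

X, y = make_classification(
    n_samples=500, n_features=12, weights=[0.85, 0.15], random_state=0
)

first = RandomForestClassifier(**hyperparams, random_state=run_seed).fit(X, y)
second = RandomForestClassifier(**hyperparams, random_state=run_seed).fit(X, y)

# Same seed and data: the two forests give matching probabilities.
reproducible = np.allclose(first.predict_proba(X), second.predict_proba(X))
```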

Limitations

Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
Importance bias. Impurity-based importance over-credits high-cardinality features — use the ranking as a guide, not a precise measurement.
Probabilities are not calibrated. A random forest's predict_proba values rank well but are not true probabilities: because they are averages over many trees, they rarely reach 0 or 1 and tend to be pulled toward the middle. Trust the ranking (and ROC AUC) more than the raw numbers.
Usually edged out by boosting. On well-behaved tabular data the LightGBM classifier often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
Imbalanced classes. With a strong class imbalance, set class_weight="balanced" — otherwise the forest can collapse toward always predicting the majority class.
Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool — use the forecast template.
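For the imbalanced-classes point in the list above, it can help to see what `class_weight="balanced"` computes. scikit-learn weights each class by n_samples / (n_classes * class_count); the 90/10 label vector below is a made-up example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance

# Weights that scikit-learn derives for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# The same formula written out by hand.
manual = len(y) / (2 * np.bincount(y))
```

Here the minority class gets a weight of 5.0 against roughly 0.56 for the majority class, so each minority row counts about nine times as much during training.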