Classification algorithm
Random forest classifier
A non-linear model for predicting a categorical label from feature columns. Trains a scikit-learn RandomForestClassifier — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus impurity-based feature importance.
It is the robust non-linear baseline alongside the linear logreg-classifier-v1 and the gradient-boosted lightgbm-classifier-v1. Pick it when the relationship between features and label is not linear or the features interact, and you want a model that works well with little tuning.
What it does
You point it at a DataSource and pick:
- a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
- one or more feature columns the model gets to look at.
Feature columns may be numeric, boolean, or string/categorical. Like logistic regression, and unlike the LightGBM classifier, which consumes categoricals natively, a scikit-learn forest needs every feature to be numeric. The model does that conversion for you internally, so there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)
The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.
How it works
The pipeline shape is identical to the other classifiers — the same classifier_train / classifier_eval nodes:
data_source → random_split → train_data + test_data
                                 │            │
                                 ▼            ▼
     model ─────────► classifier_train   classifier_eval
                                 │            ▲
                                 ▼            │
                           trained_model ─────┘
                                              │
                                              ▼
                                         eval_result
classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, logistic regression, and LightGBM share the exact same scoring + metric code.
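The hold-out step at the start of that flow can be sketched in plain Python. This is only an illustration of a seeded random split, not the template's random_split node: the function signature, the `test_fraction` parameter, and its 0.2 default are assumptions made for the example.

```python
import random

def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut off a hold-out test set.

    Hypothetical helper: the parameter names and defaults are illustrative.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = max(1, round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

rows = [{"id": i, "label": "yes" if i % 3 == 0 else "no"} for i in range(10)]
train_data, test_data = random_split(rows, test_fraction=0.2, seed=42)
```

The model is fit on `train_data` only; `test_data` is held back so the evaluation step scores rows the model has never seen.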
What a random forest is
A random forest fits many decision trees and averages their class-probability predictions. Each tree is trained on a bootstrap sample of the rows, and at every split only a random subset of the features is considered. That double dose of randomness makes the individual trees disagree with each other; averaging disagreeing trees cancels out much of their individual quirks (variance) while adding little bias. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and is hard to overfit badly.
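The two sources of randomness described above can be sketched as a toy in plain Python. It is not the scikit-learn implementation: each "tree" here is a single-split stump chosen by misclassification count, and the data, function names, and defaults are all made up for illustration.

```python
import random

def fit_stump(X, y, features):
    """Best one-threshold split over the allowed features; stores P(class 1) per side."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            # Errors if each side predicts its own majority class.
            errors = (min(sum(left), len(left) - sum(left))
                      + min(sum(right), len(right) - sum(right)))
            if best is None or errors < best[0]:
                best = (errors, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # every allowed feature is constant: fall back to the base rate
        rate = sum(y) / len(y)
        return (features[0], float("inf"), rate, rate)
    return best[1:]

def fit_forest(X, y, n_estimators=25, max_features=1, seed=0):
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_estimators):
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap rows
        features = rng.sample(range(n_features), max_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], features))
    return forest

def predict_proba(forest, row):
    """Average the per-tree probabilities of class 1."""
    votes = [p_left if row[f] <= t else p_right for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)

X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y, n_estimators=25, max_features=1, seed=7)
p_high = predict_proba(forest, (9, 9))
p_low = predict_proba(forest, (1, 1))
```

Any single stump depends on which rows and which feature it happened to draw; the averaged probability is far less sensitive to those draws, which is the variance reduction that bagging provides.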
This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM classifier's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.
The preprocessing pipeline
The model is a scikit-learn Pipeline. Inside it:
| Feature kind | Steps |
|---|---|
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |
Scaling is not needed for a tree-based model — decision-tree splits are scale-invariant, so standardizing numeric features changes nothing. It is kept only so every scikit-learn classifier shares one pipeline shape; it is a harmless no-op here.
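A small check of that scale-invariance claim, written in plain Python rather than scikit-learn. It assumes Gini impurity as the split criterion and uses made-up column names and values; it finds the best single threshold on a raw column and on its standardized copy and compares which rows go left.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_partition(values, labels):
    """Row indices sent left by the threshold with the lowest weighted impurity."""
    order = sorted(set(values))
    best_score, best_left = None, None
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Illustrative data only.
income = [18_000, 22_000, 35_000, 61_000, 75_000, 120_000]
churned = ["yes", "yes", "yes", "no", "no", "no"]

mu, sigma = mean(income), pstdev(income)
income_scaled = [(v - mu) / sigma for v in income]  # zero mean, unit variance

raw_partition = best_split_partition(income, churned)
scaled_partition = best_split_partition(income_scaled, churned)
```

Standardizing moves the threshold value but preserves the row order, so both calls send the same rows left and the tree learns the same split.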
The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
Missing values & unseen categories
- Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; a scikit-learn forest does not, so this step is required. The model handles it so you don't have to.
- Unseen categories at inference: a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
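Both behaviours can be illustrated with a few lines of plain Python. This is a sketch of the idea, not the pipeline's code: the helper names and sample values are hypothetical, and missing values are represented as `None` here for simplicity rather than NaN.

```python
from collections import Counter
from statistics import median

def fit_numeric_imputer(values):
    """Learn the median of the non-missing training values."""
    return median(v for v in values if v is not None)

def fit_categorical(values):
    """Learn the most frequent value and the sorted list of known categories."""
    present = [v for v in values if v is not None]
    most_frequent = Counter(present).most_common(1)[0][0]
    return most_frequent, sorted(set(present))

def one_hot(value, most_frequent, categories):
    """Impute a missing value, then emit one indicator per known category."""
    if value is None:
        value = most_frequent
    # An unseen category matches no known column, so every indicator is 0.
    return [1 if value == c else 0 for c in categories]

# Illustrative training columns.
train_age = [34, None, 51, 28]
train_plan = ["basic", "pro", None, "basic"]

age_fill = fit_numeric_imputer(train_age)
plan_fill, plan_categories = fit_categorical(train_plan)

imputed_age = [age_fill if v is None else v for v in train_age]
seen = one_hot("pro", plan_fill, plan_categories)
missing = one_hot(None, plan_fill, plan_categories)
unseen = one_hot("enterprise", plan_fill, plan_categories)
```

The fill values and the category list are learned from the training rows only and then reused at inference, which is why an unseen level such as "enterprise" yields all zeros instead of a new column.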
Metric set
Same as the other classifiers — the eval step is shared:
| Metric | Meaning |
|---|---|
| Accuracy | Fraction of test rows classified correctly |
| Precision / Recall / F1 | Weighted averages across classes (binary: of the positive class) |
| ROC AUC | Ranking quality — binary single score, multi-class macro one-vs-rest |
| Confusion matrix | Per-class true-vs-predicted counts |
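To make the table concrete, here is a plain-Python sketch of the confusion matrix, accuracy, and the support-weighted precision, recall, and F1 for a small multi-class example. It is an illustration of the definitions, not the shared eval code; the function names and labels are made up, and ROC AUC is left out because it needs predicted probabilities rather than hard labels.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def weighted_report(y_true, y_pred, labels):
    """Accuracy plus precision / recall / F1 averaged with class-support weights."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    total = len(y_true)
    scores = {"accuracy": sum(matrix[i][i] for i in range(len(labels))) / total,
              "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for i in range(len(labels)):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum (class support)
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = actual / total
        scores["precision"] += weight * p
        scores["recall"] += weight * r
        scores["f1"] += weight * f
    return matrix, scores

labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "a"]
matrix, scores = weighted_report(y_true, y_pred, labels)
```

With support weighting, a large class contributes more to the averaged score than a small one, so check the per-class rows of the confusion matrix when the minority class is the one you care about.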
Feature importance
The chart shows impurity decrease (mean decrease in impurity, MDI) — for each (one-hot-expanded) feature, how much that feature reduced classification impurity across all the splits that used it, averaged over every tree. The values are non-negative and sum to 1.
These are impurity-based importances — they are not numerically comparable to logistic regression's |coefficient| bars, nor to the LightGBM classifier's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.
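As a worked example of a single impurity decrease, the sketch below assumes Gini impurity (the page does not name the criterion) and uses made-up class counts and feature names. It computes the decrease for one split and then shows the normalization step that makes the reported values sum to 1.

```python
def gini(counts):
    """Gini impurity from per-class row counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity drop of one split, scaled by the share of rows reaching the node."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    drop = (gini(parent)
            - (n_left / n_parent) * gini(left)
            - (n_right / n_parent) * gini(right))
    return (n_parent / n_total) * drop

def normalize(raw):
    """Rescale raw importances so they sum to 1."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# Root node with 5 "yes" and 5 "no" rows, split into a pure left child
# (4 yes, 0 no) and a mixed right child (1 yes, 5 no).
decrease = impurity_decrease(parent=[5, 5], left=[4, 0], right=[1, 5], n_total=10)

# Hypothetical raw totals per one-hot-expanded feature.
raw = {"tenure_months": 0.30, "plan=pro": 0.15, "plan=basic": 0.05}
importance = normalize(raw)
```

Here the parent impurity is 0.5 and the weighted child impurity is 1/6, so the split earns a decrease of 1/3. Because of the normalization, a bar of 0.6 means "60% of the total impurity reduction", not an effect size on the label.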
Hyperparameters
Pass these on the model node's hyperparams (all optional):
| Key | Default | Meaning |
|---|---|---|
| n_estimators | 100 | Number of trees in the forest. More trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows |
| max_depth | None | Maximum depth of each tree. None grows trees fully; set a cap to make a smaller, faster, more regularized model |
| min_samples_leaf | 1 | Minimum rows in a leaf. Raising it smooths predictions and curbs overfitting on noisy data |
| max_features | "sqrt" | Features considered at each split. "sqrt" is the classification default and decorrelates the trees |
| class_weight | None | Set to "balanced" to up-weight minority classes on imbalanced data |
| n_jobs | -1 | CPU cores used to fit trees in parallel; -1 uses all cores |
The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
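A set of overrides might look like the dictionary below. The keys and defaults come from the table; everything else is an assumption for illustration, including the `resolve_hyperparams` helper, the chosen values, and the idea of rejecting unknown keys. How the dictionary is attached to the model node depends on how your pipeline is defined.

```python
# Documented defaults, copied from the hyperparameter table.
DEFAULTS = {
    "n_estimators": 100,
    "max_depth": None,
    "min_samples_leaf": 1,
    "max_features": "sqrt",
    "class_weight": None,
    "n_jobs": -1,
}

def resolve_hyperparams(overrides):
    """Hypothetical helper: merge overrides onto the defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown hyperparameter(s): {unknown}")
    return {**DEFAULTS, **overrides}

# A smaller, more regularized forest for noisy, imbalanced data.
hyperparams = {
    "n_estimators": 300,
    "max_depth": 12,
    "min_samples_leaf": 5,
    "class_weight": "balanced",
}
effective = resolve_hyperparams(hyperparams)
```

Capping max_depth and raising min_samples_leaf trade a little flexibility for a smaller artifact; raising n_estimators mainly costs training time.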
Limitations
- Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
- Importance bias. Impurity-based importance over-credits high-cardinality features; use the ranking as a guide, not a precise measurement.
- Probabilities are not calibrated. A random forest's predict_proba values rank well but are not true probabilities; they tend to be pulled toward the middle. Trust the ranking (and ROC AUC) more than the raw numbers.
- Usually edged out by boosting. On well-behaved tabular data the LightGBM classifier often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
- Imbalanced classes. With a strong class imbalance, set class_weight="balanced"; otherwise the forest can collapse toward always predicting the majority class.
- Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool; use the forecast template.
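For the imbalanced-classes point, scikit-learn's "balanced" mode weights each class by n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic in plain Python on a made-up 90/10 label column; the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative label column: 90 retained, 10 churned.
labels = ["retained"] * 90 + ["churned"] * 10
weights = balanced_class_weights(labels)
```

With these counts each churned row weighs 5.0 and each retained row about 0.56, so both classes carry the same total weight when the trees choose splits.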
See also
- lightgbm-classifier-v1.md: gradient-boosted sister; also tree-based and non-linear, with native categorical handling and usually slightly higher accuracy.
- logreg-classifier-v1.md: the interpretable linear baseline with the same pipeline shape.
- random-forest-regressor-v1.md: numeric-target sister with the same fit/eval shape.
Not sure which to pick?
Choosing a classification algorithm: LightGBM vs Logistic regression vs Random forest for predicting a category. Start linear, go to trees when the boundary curves, and handle imbalanced classes.