Classification algorithm

Random forest classifier

A non-linear model for predicting a categorical label from feature columns. Trains a scikit-learn RandomForestClassifier — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus impurity-based feature importance.

It is the robust non-linear baseline alongside the linear logreg-classifier-v1 and the gradient-boosted lightgbm-classifier-v1. Pick it when the relationship between features and label is not linear or the features interact, and you want a model that works well with little tuning.

What it does

You point it at a DataSource and pick:

  • a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Like logistic regression, and unlike the LightGBM classifier (which consumes categoricals natively), a scikit-learn forest needs every feature to be numeric, so the model does that conversion for you internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to that of the other classifiers, built from the same classifier_train / classifier_eval nodes:

data_source → random_split → train_data  +  test_data
                                  │             │
                                  ▼             ▼
         model ─────────► classifier_train   classifier_eval ───► eval_result
                                  │                 ▲
                                  ▼                 │
                            trained_model ──────────┘

classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, logistic regression, and LightGBM share the exact same scoring + metric code.
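The split, fit-only train step, and algorithm-agnostic eval step described above can be sketched in plain scikit-learn. This is a minimal illustration, not the platform's implementation: the two functions below only borrow the node names, the synthetic data stands in for a DataSource, and the 25% test fraction is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def classifier_train(model, X_train, y_train):
    """Fit only: no scoring happens in this step."""
    return model.fit(X_train, y_train)


def classifier_eval(trained_model, X_test, y_test):
    """Predict on held-out rows and score them; works for any fitted classifier."""
    predictions = trained_model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "predictions": predictions,
    }


# Synthetic stand-in for a DataSource (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Random hold-out split; the 25% test fraction is an assumed value.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

trained_model = classifier_train(
    RandomForestClassifier(random_state=0), X_train, y_train
)
eval_result = classifier_eval(trained_model, X_test, y_test)
```

Because classifier_eval only calls predict, swapping in a different scikit-learn-compatible classifier would leave the scoring code untouched.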

What a random forest is

A random forest fits many decision trees and averages their class-probability predictions. Each tree is trained on a bootstrap sample of the rows, and at every split only a random subset of the features is considered. That double dose of randomness makes the individual trees disagree with each other; averaging disagreeing trees cancels out much of their individual variance while leaving bias roughly unchanged. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and that rarely overfits badly as more trees are added.
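The averaging step can be checked directly against scikit-learn: the forest's predict_proba is the mean of its trees' class probabilities. The sketch below uses synthetic data and an arbitrary tree count, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# bootstrap=True resamples rows per tree; max_features="sqrt" subsamples
# the candidate features at every split.
forest = RandomForestClassifier(
    n_estimators=25, max_features="sqrt", bootstrap=True, random_state=0
).fit(X, y)

# Each fitted tree is exposed in estimators_; average their probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_average = tree_probas.mean(axis=0)

# The forest's own prediction is the same average.
forest_proba = forest.predict_proba(X)
matches = np.allclose(manual_average, forest_proba)
```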

This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM classifier's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

| Feature kind | Steps |
| --- | --- |
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |

Scaling is not needed for a tree-based model — decision-tree splits are scale-invariant, so standardizing numeric features changes nothing. It is kept only so every scikit-learn classifier shares one pipeline shape; it is a harmless no-op here.

The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
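A pipeline of this shape can be sketched with scikit-learn's ColumnTransformer. The column names, the toy rows, and the use of pickle for serialization are assumptions made for illustration; the actual template may wire or store things differently.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature columns.
numeric_cols = ["age", "monthly_spend"]
categorical_cols = ["plan"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("numeric", numeric_steps, numeric_cols),
        ("categorical", categorical_steps, categorical_cols),
    ])),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Toy rows with gaps in both numeric and categorical columns.
frame = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 51, 29],
    "monthly_spend": [20.0, 75.5, 30.0, np.nan, 90.0, 15.0],
    "plan": ["basic", "pro", np.nan, "pro", "pro", "basic"],
})
labels = ["retained", "churned", "retained", "churned", "churned", "retained"]

pipeline.fit(frame, labels)
predicted = pipeline.predict(frame)

# Serializing the whole pipeline keeps imputers, scaler, encoder, and forest together.
restored = pickle.loads(pickle.dumps(pipeline))
restored_predicted = restored.predict(frame)
```

Keeping the preprocessing inside the serialized object means the restored model applies the medians, scaling statistics, and category vocabulary learned at fit time, rather than recomputing them on new data.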

Missing values & unseen categories

  • Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; a scikit-learn forest does not, so this step is required — the model handles it so you don't have to.
  • Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
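Both behaviours can be reproduced with the corresponding scikit-learn transformers. The sketch assumes the pipeline uses SimpleImputer and OneHotEncoder(handle_unknown="ignore"), which matches the behaviour described above but is not confirmed by this page; the category values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the gap is filled with the median of the observed values.
numeric = np.array([[1.0], [np.nan], [3.0], [10.0]])
imputed = SimpleImputer(strategy="median").fit_transform(numeric)

# One-hot encoding that ignores categories it did not see during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["basic"], ["pro"], ["enterprise"]], dtype=object))

# Learned categories are sorted: basic, enterprise, pro.
seen = encoder.transform(np.array([["pro"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["student"]], dtype=object)).toarray()
```

Here `imputed` holds 3.0 in place of the gap, `seen` is `[[0, 0, 1]]`, and `unseen` is the all-zero row `[[0, 0, 0]]`.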

Metric set

Same as the other classifiers — the eval step is shared:

| Metric | Meaning |
| --- | --- |
| Accuracy | Fraction of test rows classified correctly |
| Precision / Recall / F1 | Weighted averages across classes (binary: of the positive class) |
| ROC AUC | Ranking quality: a single score for binary, macro one-vs-rest for multi-class |
| Confusion matrix | Per-class true-vs-predicted counts |
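For a concrete feel of these metrics, the binary case can be computed with scikit-learn's metric functions on a tiny hand-made example. The labels and scores are invented, and using these particular functions is an assumption about how the shared eval step computes its numbers.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.6, 0.8, 0.9, 0.3]  # predicted probability of class 1

accuracy = accuracy_score(y_true, y_pred)  # 4 of 6 rows correct

# Binary case: precision, recall, and F1 of the positive class.
# A multi-class run would pass average="weighted" instead.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)

# Binary ROC AUC from scores; a multi-class run would pass the full probability
# matrix with multi_class="ovr", average="macro".
roc_auc = roc_auc_score(y_true, y_score)  # 7 of 9 positive/negative pairs ranked correctly

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
```

In this example accuracy, precision, recall, and F1 all come out to 2/3, ROC AUC is 7/9, and the confusion matrix is `[[2, 1], [1, 2]]`.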

Feature importance

The chart shows impurity decrease (mean decrease in impurity, MDI) — for each (one-hot-expanded) feature, how much that feature reduced classification impurity across all the splits that used it, averaged over every tree. The values are non-negative and sum to 1.

These are impurity-based importances — they are not numerically comparable to logistic regression's |coefficient| bars, nor to the LightGBM classifier's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.
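In scikit-learn these values are exposed as the fitted forest's feature_importances_ attribute. A minimal sketch on synthetic data (the dataset and tree count are illustrative) shows the normalization and how to read off a ranking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, normalized across features.
importances = forest.feature_importances_

# Feature indices from most to least important; this ordering is what to read.
ranking = np.argsort(importances)[::-1]
```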

Hyperparameters

Pass these on the model node's hyperparams (all optional):

| Key | Default | Meaning |
| --- | --- | --- |
| n_estimators | 100 | Number of trees in the forest. More trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows |
| max_depth | None | Maximum depth of each tree. None grows trees fully; set a cap to make a smaller, faster, more regularized model |
| min_samples_leaf | 1 | Minimum rows in a leaf. Raising it smooths predictions and curbs overfitting on noisy data |
| max_features | "sqrt" | Features considered at each split. "sqrt" is the classification default and decorrelates the trees |
| class_weight | None | Set to "balanced" to up-weight minority classes on imbalanced data |
| n_jobs | -1 | CPU cores used to fit trees in parallel. -1 uses all cores |

The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
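The keys above share their names with scikit-learn's RandomForestClassifier arguments. Assuming the hyperparams are forwarded to the estimator unchanged and the run seed is passed as random_state (neither is confirmed by this page), reproducibility can be sketched like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# The defaults listed in the table above.
hyperparams = {
    "n_estimators": 100,
    "max_depth": None,
    "min_samples_leaf": 1,
    "max_features": "sqrt",
    "class_weight": None,
    "n_jobs": -1,
}
seed = 42  # hypothetical run seed

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two fits with the same seed draw the same bootstrap samples and feature subsets.
first = RandomForestClassifier(random_state=seed, **hyperparams).fit(X, y)
second = RandomForestClassifier(random_state=seed, **hyperparams).fit(X, y)

same_seed_matches = np.allclose(first.predict_proba(X), second.predict_proba(X))
```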

Limitations

  • Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
  • Importance bias. Impurity-based importance over-credits high-cardinality features — use the ranking as a guide, not a precise measurement.
  • Probabilities are not calibrated. A random forest's predict_proba values rank well but are not true probabilities: because they are averages over many trees, they tend to be pulled away from 0 and 1 toward the middle. Trust the ranking (and ROC AUC) more than the raw numbers.
  • Usually edged out by boosting. On well-behaved tabular data the LightGBM classifier often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
  • Imbalanced classes. With a strong class imbalance, set class_weight="balanced" — otherwise the forest can collapse toward always predicting the majority class.
  • Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool — use the forecast template.
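To make the class_weight="balanced" advice concrete: scikit-learn documents the balanced weight for a class as n_samples / (n_classes * class_count). The helper below is a hypothetical re-implementation of that formula for illustration, not a library function, and the 90/10 split is invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# A 90/10 imbalance: the minority class gets a much larger weight.
labels = ["retained"] * 90 + ["churned"] * 10
weights = balanced_class_weights(labels)
```

With these counts the minority class "churned" is weighted 5.0 and the majority class about 0.56, so each class contributes the same total weight during training.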

See also

  • lightgbm-classifier-v1.md — gradient-boosted sister; also tree-based and non-linear, with native categorical handling and usually slightly higher accuracy.
  • logreg-classifier-v1.md — the interpretable linear baseline with the same pipeline shape.
  • random-forest-regressor-v1.md — numeric-target sister with the same fit/eval shape.

Not sure which to pick?

Choosing a classification algorithm

LightGBM vs Logistic regression vs Random forest for predicting a category — start linear, go to trees when the boundary curves, and handle imbalanced classes.