Regression algorithm

Random forest regression

A non-linear model for predicting a numeric value from feature columns. Trains a scikit-learn RandomForestRegressor — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard regression metric set plus impurity-based feature importance.

It is the robust non-linear baseline alongside the linear regressors (ridge-regressor-v1, ols-regressor-v1) and the gradient-boosted lightgbm-regressor-v1. Pick it when the relationship between features and target is not linear, when features interact, and when you want a model that works well with little tuning.

What it does

You point it at a DataSource and pick:

  • a numeric target column you want to predict, and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Like the linear regressors (and unlike the LightGBM regressor, which consumes categoricals natively), a scikit-learn forest needs every feature to be numeric, so the model does that conversion internally: there are no encoder or scaler nodes to wire. You can still wire preprocessing upstream if you want explicit control.

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the other regressors — the same regressor_train / regressor_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► regressor_train       regressor_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

regressor_train is fit-only; regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, ridge, OLS, and LightGBM share the exact same scoring + metric code.
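The fit-only / score-separately split can be sketched with plain scikit-learn. The function names below are borrowed from the diagram, but their signatures, the returned dictionary, and the synthetic data are assumptions made for illustration, not the product's actual node API.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def regressor_train(model, X_train, y_train):
    """Fit only; no scoring happens here (hypothetical signature)."""
    return model.fit(X_train, y_train)


def regressor_eval(trained_model, X_test, y_test):
    """Predict on held-out rows and score; works for any fitted regressor."""
    preds = trained_model.predict(X_test)
    return {"mae": mean_absolute_error(y_test, preds), "predictions": preds}


# Synthetic data with a non-linear term and an interaction term.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)

# Random hold-out split, standing in for the random_split node.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The same eval function scores both algorithms unchanged.
models = {"random_forest": RandomForestRegressor(random_state=0), "ridge": Ridge()}
results = {}
for name, model in models.items():
    trained = regressor_train(model, X_train, y_train)
    results[name] = regressor_eval(trained, X_test, y_test)
```

Because the eval function only calls predict, swapping in another regressor needs no change to the scoring code.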

What a random forest is

A random forest fits many decision trees and averages their predictions. Each tree is trained on a bootstrap sample of the rows, and at every split only a random subset of the features is considered whenever max_features is below 1.0 (at the default of 1.0 every feature is a candidate, so the randomness comes from the bootstrap alone). That randomness makes the individual trees disagree with each other; averaging disagreeing trees smooths out their individual quirks (variance) while adding little bias. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and is hard to overfit badly.

This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM regressor's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.
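To make the averaging concrete, the sketch below fits a small forest on synthetic data (the data and settings are illustrative assumptions) and checks that the forest's prediction matches the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) * X[:, 1]  # non-linear target with an interaction

# max_features=0.5 turns on per-split feature subsampling for this demo.
forest = RandomForestRegressor(
    n_estimators=25, max_features=0.5, random_state=0
).fit(X, y)

# One row of predictions per fitted tree.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Averaging the trees by hand reproduces the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
```

The spread of per_tree along its first axis shows how much the individual trees disagree before they are averaged.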

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

Feature kind | Steps
Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance
String / categorical | impute missing values with the most frequent value → one-hot encode

Scaling is not needed for a tree-based model — decision-tree splits are scale-invariant, so standardizing numeric features changes nothing. It is kept only so every scikit-learn regressor shares one pipeline shape; it is a harmless no-op here.

The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
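A pipeline of this shape can be sketched in scikit-learn as follows. The column names, the toy data, and the use of pickle for serialization are assumptions for illustration; the product's actual step names and serializer are not stated here.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["sqft", "has_garage"]  # hypothetical numeric / boolean columns
categorical_cols = ["city"]            # hypothetical categorical column

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

train = pd.DataFrame({
    "sqft": [800.0, 1200.0, np.nan, 1500.0, 950.0, 2000.0],
    "has_garage": [0.0, 1.0, 1.0, np.nan, 0.0, 1.0],
    "city": ["Oslo", "Bergen", np.nan, "Oslo", "Bergen", "Oslo"],
})
target = np.array([210.0, 260.0, 240.0, 330.0, 220.0, 410.0])
pipeline.fit(train, target)

# New rows with a missing number, a missing category, and an unseen city.
new_rows = pd.DataFrame({
    "sqft": [1100.0, np.nan],
    "has_garage": [1.0, 0.0],
    "city": ["Tromso", np.nan],
})
preds = pipeline.predict(new_rows)

# Serializing the whole pipeline keeps imputers, scaler, encoder, and forest together.
restored = pickle.loads(pickle.dumps(pipeline))
restored_preds = restored.predict(new_rows)
```

Since the fitted preprocessing travels with the forest, the restored object can score raw rows directly.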

Missing values & unseen categories

  • Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; scikit-learn forests gained native NaN support only in version 1.4, so the pipeline imputes unconditionally and you don't have to handle it yourself.
  • Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM regressor treats an unseen level as missing.
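Both behaviors can be reproduced in isolation with scikit-learn's own transformers. This is a minimal sketch on made-up values, assuming the pipeline uses SimpleImputer and OneHotEncoder with handle_unknown="ignore", which matches the behavior described above but is not confirmed as the exact configuration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of 1, 3, and 10.
numeric = np.array([[1.0], [np.nan], [3.0], [10.0]])
imputed = SimpleImputer(strategy="median").fit_transform(numeric)

# One-hot encoding that ignores unknown levels instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["red"], ["green"], ["blue"]], dtype=object))

# Output columns follow the sorted categories: blue, green, red.
seen = encoder.transform(np.array([["green"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["purple"]], dtype=object)).toarray()
```

Here seen has a single 1 in the green column, while unseen is a row of zeros, so an unknown level contributes no indicator at all.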

Metric set

Same as the other regressors — the eval step is shared:

Metric | Meaning
MAE | Mean absolute error: average prediction error in target units
RMSE | Root mean squared error: penalizes large errors more heavily
R² | Coefficient of determination: fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean)
MAPE | Mean absolute percentage error: relative error; None if any test row has target == 0

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
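The four metrics can be computed by hand as in the sketch below. The function name and return shape are hypothetical, MAPE is returned as a fraction rather than a percentage (the unit used by the product is not stated here), and the NaN returned for a constant target is likewise an assumption.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R², and MAPE (hypothetical helper, not the shared eval code)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))

    # R² compares the model's squared error to that of predicting the mean.
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # MAPE divides by the target, so it is undefined if any target is zero.
    if np.any(y_true == 0):
        mape = None
    else:
        mape = float(np.mean(np.abs(errors / y_true)))

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
with_zero_target = regression_metrics([0, 20, 30, 40], [1, 18, 33, 39])
```

In the first call the absolute errors are 2, 2, 3, and 1, so MAE is 2.0; the second call returns None for MAPE because one target is zero.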

Feature importance

The chart shows impurity decrease (mean decrease in impurity, MDI): for each (one-hot-expanded) feature, how much that feature reduced node impurity (the variance of the training targets within a node) across all the splits that used it, weighted by the number of rows reaching each split and averaged over every tree. The values are non-negative and sum to 1.

These are impurity-based importances — they are not numerically comparable to the linear regressors' |coefficient| bars, nor to the LightGBM regressor's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.
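The properties above can be checked on a fitted scikit-learn forest through its feature_importances_ attribute. The synthetic data below is an assumption for illustration: column 0 drives most of the target, column 1 contributes a little, and column 2 is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based (MDI) importances, one value per input column.
importances = forest.feature_importances_

# Column indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
```

The ranking is the part worth reading; the noise column can still receive a small non-zero share because deep trees split on it by chance.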

Hyperparameters

Pass these on the model node's hyperparams (all optional):

Key | Default | Meaning
n_estimators | 100 | Number of trees in the forest. More trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows
max_depth | None | Maximum depth of each tree. None grows trees fully; set a cap to make a smaller, faster, more regularized model
min_samples_leaf | 1 | Minimum rows in a leaf. Raising it smooths predictions and curbs overfitting on noisy data
max_features | 1.0 | Fraction of features considered at each split. The default of 1.0 considers every feature; lower values decorrelate the trees more
n_jobs | -1 | CPU cores used to fit trees in parallel; -1 uses all cores

The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
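As a rough sketch of how optional hyperparameters and the seed might map onto scikit-learn, the helper below merges overrides into the documented defaults and passes the seed as random_state. The helper name, the dictionary, and the seed-to-random_state mapping are assumptions for illustration, not the product's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Defaults as listed in the table above.
DEFAULT_HYPERPARAMS = {
    "n_estimators": 100,
    "max_depth": None,
    "min_samples_leaf": 1,
    "max_features": 1.0,
    "n_jobs": -1,
}


def build_forest(hyperparams=None, seed=0):
    """Merge overrides into the defaults and seed the forest (hypothetical helper)."""
    params = {**DEFAULT_HYPERPARAMS, **(hyperparams or {})}
    return RandomForestRegressor(random_state=seed, **params)


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

# A smaller, shallower, more decorrelated forest than the default.
overrides = {"n_estimators": 30, "max_depth": 6, "max_features": 0.5}

# Same seed and same data: the two fits yield matching predictions.
first = build_forest(overrides, seed=42).fit(X, y).predict(X)
second = build_forest(overrides, seed=42).fit(X, y).predict(X)
```

Any key left out of the overrides falls back to its default, which mirrors the "all optional" behavior described above.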

Limitations

  • Poor extrapolation. A forest predicts by averaging training targets, so it can never predict a value outside the range it saw in training. If your target trends beyond the training range, prefer a linear regressor; LightGBM is also tree-based and generally shares this limitation.
  • Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
  • Importance bias. Impurity-based importance over-credits high-cardinality features — use the ranking as a guide, not a precise measurement.
  • Usually edged out by boosting. On well-behaved tabular data the LightGBM regressor often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
  • No prediction intervals. This is point regression — the model outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them.
  • Random split assumes IID rows. If your data has temporal structure (rows from before vs. after some date should be split that way), use the forecast template instead.
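The extrapolation limit is easy to see on a toy example. In this sketch (synthetic data, chosen for illustration) the target is exactly 2x on the range 0 to 10, and both a forest and a linear model are asked to predict at x = 100.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training targets span 0 to 20.
X_train = np.linspace(0, 10, 101).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# A query point far outside the training range.
X_far = np.array([[100.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# The forest averages training targets, so it cannot exceed the largest one.
forest_far = forest.predict(X_far)[0]

# The linear model continues the trend beyond the training range.
linear_far = linear.predict(X_far)[0]
```

The forest's answer stays at or below the largest training target of 20, while the linear model follows the trend to about 200.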

See also

  • lightgbm-regressor-v1.md — gradient-boosted sister; also tree-based and non-linear, with native categorical handling and usually slightly higher accuracy.
  • ridge-regressor-v1.md — the interpretable linear baseline with the same pipeline shape.
  • ols-regressor-v1.md — the unregularized linear baseline.

Not sure which to pick?

Choosing a regression algorithm

LightGBM vs Ridge vs OLS vs Random forest for predicting a number: why to start with a linear baseline, and when to reach for a tree-based model.