Regression algorithm

Random forest regression

A non-linear model for predicting a numeric value from feature columns. Trains a scikit-learn RandomForestRegressor — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard regression metric set plus impurity-based feature importance.

It is the robust non-linear baseline alongside the linear regressors (ridge-regressor-v1, ols-regressor-v1) and the gradient-boosted lightgbm-regressor-v1. Pick it when the relationship between features and target is not linear, when features interact, or when you want a model that works well with little tuning.

What it does

You point it at a DataSource and pick:

a numeric target column you want to predict, and
one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Like the linear regressors (and unlike the LightGBM regressor, which consumes categoricals natively), a scikit-learn forest needs every feature to be numeric, so the model does that conversion for you internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the other regressors — the same regressor_train / regressor_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► regressor_train       regressor_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

regressor_train is fit-only; regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, ridge, OLS, and LightGBM share the exact same scoring + metric code.
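The fit-only / score-on-hold-out separation described above can be sketched in plain scikit-learn. This is an illustration under stated assumptions, not the platform's node code: `train_step` and `eval_step` are hypothetical stand-ins for the regressor_train and regressor_eval nodes, the data is synthetic, and the 25% test fraction is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_step(model, X_train, y_train):
    """Hypothetical stand-in for the train node: fit only, no scoring."""
    return model.fit(X_train, y_train)


def eval_step(trained_model, X_test, y_test):
    """Hypothetical stand-in for the eval node: predict on held-out rows and score.

    Nothing here depends on the model type, so any fitted regressor works.
    """
    predictions = trained_model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return {"rmse": rmse, "predictions": predictions}


# Synthetic data with a non-linear target, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2

# Random hold-out split, then fit on train and score on test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
trained = train_step(
    RandomForestRegressor(n_estimators=50, random_state=0), X_train, y_train
)
eval_result = eval_step(trained, X_test, y_test)
print(f"hold-out RMSE: {eval_result['rmse']:.3f}")
```

Because `eval_step` only calls `predict`, swapping the forest for a ridge or OLS estimator would leave the scoring code untouched, which is the point of keeping evaluation algorithm-agnostic.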

What a random forest is

A random forest fits many decision trees and averages their predictions. Each tree is trained on a bootstrap sample of the rows, and at every split the tree can be restricted to a random subset of the features (controlled by max_features; at the default of 1.0 every feature is considered, so the randomness comes from the bootstrap alone). That randomness makes the individual trees disagree with each other; averaging disagreeing trees cancels out their individual quirks (variance) without adding bias. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and is hard to overfit badly.

This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM regressor's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.
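To make the "many trees, averaged" idea concrete, here is a minimal sketch using scikit-learn's RandomForestRegressor on synthetic data (the data and the tree count are assumptions chosen for illustration). It averages the fitted trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 3))
# Target with an interaction between the first two features, plus noise.
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=150)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is available in estimators_; averaging their predictions
# by hand reproduces the forest's prediction (up to floating-point rounding).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("matches forest.predict:", np.allclose(manual_average, forest.predict(X)))
print("spread across trees for row 0:", per_tree[:, 0].std())
```

The per-row spread across trees is the disagreement that averaging smooths out.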

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

Feature kind	Steps
Numeric / boolean	impute missing values with the median → standardize to zero mean, unit variance
String / categorical	impute missing values with the most frequent value → one-hot encode

Scaling is not needed for a tree-based model — decision-tree splits are scale-invariant, so standardizing numeric features changes nothing. It is kept only so every scikit-learn regressor shares one pipeline shape; it is a harmless no-op here.

The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
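A pipeline with the shape described above could be assembled roughly as follows. This is a sketch, not the shipped implementation: the column names and data are hypothetical, `handle_unknown="ignore"` is an assumed encoder setting, and `pickle` is used only as an example of serializing the fitted pipeline as one unit.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["area", "rooms"]   # hypothetical numeric features
categorical_cols = ["district"]    # hypothetical categorical feature

# Numeric branch: median imputation, then standardization.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", numeric_branch, numeric_cols),
        ("cat", categorical_branch, categorical_cols),
    ])),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

train = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "rooms": [2, 3, 3, 5, np.nan, 4],
    "district": pd.Series(
        ["north", "south", np.nan, "south", "north", "east"], dtype=object
    ),
})
target = np.array([150.0, 240.0, 210.0, 390.0, 190.0, 300.0])

pipeline.fit(train, target)
predictions = pipeline.predict(train)

# Serializing the whole pipeline keeps imputers, scaler, encoder, and forest together.
restored = pickle.loads(pickle.dumps(pipeline))
print(np.allclose(restored.predict(train), predictions))
```

Keeping preprocessing inside the same object as the estimator means the statistics learned at fit time (medians, category lists) travel with the model to inference.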

Missing values & unseen categories

Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; a scikit-learn forest does not, so this step is required — the model handles it so you don't have to.
Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM regressor treats an unseen level as missing.
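Both behaviors can be observed in isolation with scikit-learn's own transformers. The sketch below assumes the encoder is configured with `handle_unknown="ignore"`, which is the usual way to get an all-zero indicator for an unseen level; the values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of the observed values.
numeric = np.array([[1.0], [3.0], [np.nan], [10.0]])
imputed = SimpleImputer(strategy="median").fit_transform(numeric)
print(imputed.ravel())  # the NaN becomes 3.0, the median of 1, 3 and 10

# One-hot encoding fitted on three colours; categories are stored sorted.
colours = np.array([["red"], ["green"], ["blue"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore").fit(colours)

seen = encoder.transform(np.array([["green"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["purple"]], dtype=object)).toarray()
print(seen)    # exactly one indicator is set
print(unseen)  # no indicator is set: the row is all zeros
```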

Metric set

Same as the other regressors — the eval step is shared:

Metric	Meaning
MAE	Mean absolute error — average prediction error in target units
RMSE	Root mean squared error — penalizes large errors more heavily
R²	Coefficient of determination — fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean)
MAPE	Mean absolute percentage error — relative error, `None` if any test row has `target == 0`

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
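For reference, the four metrics can be computed by hand as in the sketch below. The function name `regression_metrics` is hypothetical, and reporting MAPE as a fraction rather than a percentage is an assumption; only the `None`-on-zero-target rule comes from the table above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2 and MAPE for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # assumes the target is not constant

    # MAPE divides by the true value, so it is undefined if any target is zero.
    if any(t == 0 for t in y_true):
        mape = None
    else:
        mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 10.0])
print(metrics)

with_zero = regression_metrics([0.0, 1.0], [0.5, 1.0])
print(with_zero["mape"])  # None, because one true value is zero
```

For the first call the absolute errors are 1, 0, 1 and 2, giving MAE 1.0, RMSE of about 1.225, R² 0.7 and MAPE of about 0.229. Note that R² is not bounded below by 0: a model that does worse than predicting the mean scores negative.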

Feature importance

The chart shows impurity decrease (mean decrease in impurity, MDI): for each (one-hot-expanded) feature, how much that feature reduced impurity (for regression trees, typically the variance of the target within a node) across all the splits that used it, averaged over every tree. It is computed from the training rows, not the test rows. The values are non-negative and sum to 1.

These are impurity-based importances — they are not numerically comparable to the linear regressors' |coefficient| bars, nor to the LightGBM regressor's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.
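The properties above (non-negative, summing to 1, best read as a ranking) can be checked on a fitted scikit-learn forest through its `feature_importances_` attribute. The data here is synthetic and built so that only the first column carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # only column 0 carries signal

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort features from most to least important and print the ranking.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
print("sum:", importances.sum())
```

The two noise columns still receive small non-zero importance because trees occasionally split on them; that is one reason to trust the order more than the magnitudes.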

Hyperparameters

Pass these on the model node's hyperparams (all optional):

Key	Default	Meaning
`n_estimators`	`100`	Number of trees in the forest — more trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows
`max_depth`	`None`	Maximum depth of each tree — `None` grows trees fully; set a cap to make a smaller, faster, more regularized model
`min_samples_leaf`	`1`	Minimum rows in a leaf — raising it smooths predictions and curbs overfitting on noisy data
`max_features`	`1.0`	Fraction of features considered at each split — lower values decorrelate the trees more
`n_jobs`	`-1`	CPU cores used to fit trees in parallel — `-1` uses all cores

The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
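One plausible way the hyperparams could map onto the estimator is sketched below. `build_forest` and `DEFAULTS` are hypothetical names, and passing the run seed as `random_state` is an assumption about how the seed is wired; the default values are the ones from the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Defaults taken from the hyperparameter table.
DEFAULTS = {
    "n_estimators": 100,
    "max_depth": None,
    "min_samples_leaf": 1,
    "max_features": 1.0,
    "n_jobs": -1,
}


def build_forest(hyperparams=None, seed=0):
    """Merge user overrides onto the defaults and pin the random seed."""
    params = {**DEFAULTS, **(hyperparams or {})}
    return RandomForestRegressor(random_state=seed, **params)


rng = np.random.default_rng(1)
X = rng.uniform(size=(120, 4))
y = X[:, 0] ** 2 + X[:, 1]

# Smaller, shallower forest with per-split feature subsampling switched on.
overrides = {"n_estimators": 30, "max_depth": 6, "max_features": 0.5}
first = build_forest(overrides, seed=42).fit(X, y).predict(X)
second = build_forest(overrides, seed=42).fit(X, y).predict(X)

# Same seed and same data give the same forest; allclose rather than exact
# equality because parallel prediction may sum the trees in a different order.
print(np.allclose(first, second))
```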

Limitations

Poor extrapolation. A forest predicts by averaging training targets, so it can never predict a value outside the range it saw in training. If your target trends beyond the training range, prefer a linear regressor; the LightGBM regressor is also tree-based and shares much of this limitation.
Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
Importance bias. Impurity-based importance over-credits high-cardinality features — use the ranking as a guide, not a precise measurement.
Usually edged out by boosting. On well-behaved tabular data the LightGBM regressor often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
No prediction intervals. This is point regression — the model outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them.
Random split assumes IID rows. If your data has temporal structure (rows from before vs. after some date should be split that way), use the forecast template instead.
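The extrapolation limit listed above is easy to observe. The sketch below trains a forest and a plain linear regression on a synthetic, purely linear target and then asks both for a prediction far outside the training range; scikit-learn's LinearRegression stands in for the linear regressors here as an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train on x in [0, 10] with a purely linear target y = 2x.
X_train = np.linspace(0, 10, 101).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_far = np.array([[100.0]])  # far outside the training range
forest_pred = forest.predict(X_far)[0]
linear_pred = linear.predict(X_far)[0]

print(f"max training target: {y_train.max():.1f}")
print(f"forest at x=100:     {forest_pred:.1f}")
print(f"linear at x=100:     {linear_pred:.1f}")
```

The forest's answer stays at or below the largest target it saw in training, while the linear model continues the trend.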