Regression algorithm

LightGBM regression

Gradient-boosted regression for predicting a numeric value from feature columns. Trains one booster on a random hold-out split, scores on the held-out test set, and surfaces the standard regression metric set (MAE / RMSE / R² / MAPE) plus feature importance.

What it does

You point it at a DataSource and pick:

a numeric target column you want to predict, and
one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or categorical — LightGBM handles all three natively. No manual one-hot encoding or normalization is required at this stage; if you want preprocessing (imputation, scaling, encoding) you can wire it explicitly upstream of the trainer node.

The output is a trained model and an eval_result carrying the metrics, the predictions on the test rows, and a feature importance chart.

How it works

The pipeline shape is:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► regressor_train       regressor_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

regressor_train is fit-only — it produces a fitted booster but no metrics. regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. This mirrors the classification slice; both differ from the forecast trainer (which is monolithic because of the CQR calibration step).

Loss function

V1 defaults to LightGBM's objective="regression" (squared-error / L2 loss). Users who want a loss that is less sensitive to outliers can switch to median regression (L1) by passing hyperparams={"objective": "regression_l1"} on the model node. Both are first-class LightGBM objectives, so no separate algorithm id is needed.
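To see why the L1 objective is called median regression and why it is less sensitive to outliers, consider the simplest possible model: predicting one constant for every row. The sketch below uses made-up target values and a brute-force scan, and is only an illustration of the two loss functions, not of how LightGBM optimizes them.

```python
import statistics

# Made-up targets with one extreme outlier.
targets = [10.0, 11.0, 12.0, 13.0, 500.0]


def l2_loss(c):
    """Total squared error when predicting the constant c for every row."""
    return sum((t - c) ** 2 for t in targets)


def l1_loss(c):
    """Total absolute error when predicting the constant c for every row."""
    return sum(abs(t - c) for t in targets)


# Scan candidate constants from 0.0 to 500.0 in steps of 0.1.
candidates = [x / 10 for x in range(0, 5001)]
best_l2 = min(candidates, key=l2_loss)
best_l1 = min(candidates, key=l1_loss)

mean = statistics.mean(targets)      # pulled up to 109.2 by the outlier
median = statistics.median(targets)  # stays at 12.0

print("L2 minimizer:", best_l2, "mean:", mean)
print("L1 minimizer:", best_l1, "median:", median)
```

The squared-error minimizer lands on the mean, which the single outlier drags far from the bulk of the data, while the absolute-error minimizer lands on the median.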

Metric set

Metric	Meaning
MAE	Mean absolute error — average prediction error in target units
RMSE	Root mean squared error — penalizes large errors more heavily
R²	Coefficient of determination — fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean)
MAPE	Mean absolute percentage error — relative error, NaN if any test row has `target == 0`

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
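For reference, the four metrics can be computed as in the sketch below. The function name regression_metrics is hypothetical, and expressing MAPE as a fraction (0.25 meaning 25%) is an assumption; the product may report it as a percentage. The NaN behavior for zero targets follows the table above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and MAPE for paired lists of floats."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R² compares the model against always predicting the mean of y_true.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # MAPE divides by the target, so any zero target makes it undefined.
    if any(t == 0 for t in y_true):
        mape = float("nan")
    else:
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [1.0, 5.0, 6.0, 10.0])
print(metrics)
```

Note that R² can go negative when the model does worse than predicting the mean.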

Feature handling

Numeric columns pass through as-is.
Boolean columns are cast to int.
Categorical columns are handed to LightGBM with the Categorical dtype so the booster uses Fisher categorical splits (no one-hot blow-up). Vocabulary is captured at fit time and replayed on inference.

If a column has too many unique values to encode meaningfully, the trainer falls back to ordinal encoding with a configurable cap; check the registered model's model_feature_schema for the resolved choice.
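The capture-and-replay idea for categorical vocabularies can be sketched with pandas as follows. The helper names fit_feature_schema and apply_feature_schema, and the dict-of-lists schema format, are hypothetical and do not describe the actual layout of model_feature_schema. Mapping unseen categories to missing values is pandas' default behavior and is an assumption about how the trainer treats them.

```python
import pandas as pd


def fit_feature_schema(train_df, categorical_cols):
    """Capture each categorical column's vocabulary at fit time."""
    return {
        col: sorted(train_df[col].dropna().unique()) for col in categorical_cols
    }


def apply_feature_schema(df, schema):
    """Replay the captured vocabulary; cast booleans to int; leave numerics alone."""
    out = df.copy()
    for col in out.columns:
        if col in schema:
            # Values outside the captured vocabulary become missing (NaN).
            out[col] = pd.Categorical(out[col], categories=schema[col])
        elif pd.api.types.is_bool_dtype(out[col]):
            out[col] = out[col].astype(int)
    return out


train_df = pd.DataFrame(
    {"color": ["red", "blue", "red"], "flag": [True, False, True], "x": [1.0, 2.0, 3.0]}
)
schema = fit_feature_schema(train_df, ["color"])
train_ready = apply_feature_schema(train_df, schema)

# At inference time the same schema is reused, so category codes stay stable.
new_df = pd.DataFrame({"color": ["green", "red"], "flag": [False, True], "x": [4.0, 5.0]})
scored = apply_feature_schema(new_df, schema)
print(scored["color"].cat.codes.tolist())
```

Reusing the fit-time category list matters because the booster sees integer codes; if the codes shifted between training and inference, the learned splits would apply to the wrong categories.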

Limitations

No prediction intervals. This is point regression — the booster outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them; a quantile-regression variant of this trainer is on the roadmap.
MAPE is brittle. If any test row has target == 0, MAPE is undefined and reported as NaN. Use MAE or RMSE as your scoreboard metric in that case.
No automatic feature engineering. Unlike the time-series trainer (which adds lag and calendar features), the regressor uses the columns you give it verbatim. Wire preprocessing nodes upstream if you want imputation, scaling, or encoding.
Random split assumes IID rows. If your data has temporal structure (for example, the model should train on rows before some date and be tested on rows after it), use the forecast template instead.
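The difference between a random split and a time-ordered split can be illustrated with the sketch below. Both helper functions and the toy rows are hypothetical; the chronological split here only demonstrates the concept and is not a description of what the forecast template does internally.

```python
import random
from datetime import date

# Toy rows: one observation per day for ten days.
rows = [{"day": date(2024, 1, d), "target": float(d)} for d in range(1, 11)]


def random_split(rows, test_fraction, seed):
    """Shuffle rows and hold out a fraction; row order in time is ignored."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def chronological_split(rows, cutoff):
    """Train on rows strictly before the cutoff date, test on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


rand_train, rand_test = random_split(rows, test_fraction=0.2, seed=0)
time_train, time_test = chronological_split(rows, cutoff=date(2024, 1, 9))

print("random test days:", sorted(r["day"].day for r in rand_test))
print("chronological test days:", sorted(r["day"].day for r in time_test))
```

With a random split, test rows may predate some training rows, so the model can be scored on a past it has effectively already seen; the chronological split rules that out.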