Regression algorithm
LightGBM regression
Gradient-boosted regression for predicting a numeric value from feature columns. Trains one booster on a random hold-out split, scores on the held-out test set, and surfaces the standard regression metric set (MAE / RMSE / R² / MAPE) plus feature importance.
What it does
You point it at a DataSource and pick:
- a numeric target column you want to predict, and
- one or more feature columns the model gets to look at.
Feature columns may be numeric, boolean, or categorical — LightGBM handles all three natively. No manual one-hot encoding or normalization is required at this stage; if you want preprocessing (imputation, scaling, encoding) you can wire it explicitly upstream of the trainer node.
The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature importance chart.
How it works
The pipeline shape is:
data_source → random_split → train_data   +    test_data
                                 │                 │
                                 ▼                 ▼
         model ─────────► regressor_train    regressor_eval
                                 │                 ▲
                                 ▼                 │
                           trained_model ──────────┘
                                                   │
                                                   ▼
                                              eval_result
regressor_train is fit-only — it produces a fitted booster but no metrics. regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. This mirrors the classification slice; both differ from the forecast trainer (which is monolithic because of the CQR calibration step).
Loss function
V1 defaults to LightGBM's objective="regression" (squared-error / L2 loss). Users who want a fit that is less sensitive to outliers can switch to L1 loss (absolute error, which targets the conditional median) by passing hyperparams={"objective": "regression_l1"} on the model node. Both are first-class LightGBM objectives, so no separate algorithm id is needed.
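To see why the L1 objective is less sensitive to outliers, consider the simplest possible model: a single constant prediction. Under squared error the best constant is the mean, and under absolute error it is the median. The toy numbers below are made up for illustration; only the hyperparams dictionary at the end comes from the text above.

```python
from statistics import mean, median

targets = [10.0, 11.0, 9.0, 10.0, 100.0]  # the last value is an outlier


def l2_loss(c):
    """Total squared error of predicting the constant c for every row."""
    return sum((t - c) ** 2 for t in targets)


def l1_loss(c):
    """Total absolute error of predicting the constant c for every row."""
    return sum(abs(t - c) for t in targets)


l2_fit = mean(targets)    # minimizer of squared error
l1_fit = median(targets)  # minimizer of absolute error

print(l2_fit)  # 28.0, dragged toward the outlier
print(l1_fit)  # 10.0, stays with the bulk of the data

# Selecting the L1 objective on the model node, in the shape given above:
hyperparams = {"objective": "regression_l1"}
```

A boosted model is not a constant predictor, but each leaf value is chosen under the same loss, so the same pull toward (or resistance to) extreme targets applies locally.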
Metric set
| Metric | Meaning |
|---|---|
| MAE | Mean absolute error — average prediction error in target units |
| RMSE | Root mean squared error — penalizes large errors more heavily |
| R² | Coefficient of determination — fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean) |
| MAPE | Mean absolute percentage error — relative error, NaN if any test row has target == 0 |
The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
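For reference, the four metrics can be computed with the conventional textbook formulas as sketched below. This is an illustrative implementation, not the platform's own; in particular, MAPE is returned here as a fraction (0.05 means 5%), and whether the eval result reports a fraction or a percentage is not stated in this document.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R² compares the model's squared error with that of predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # MAPE divides by the target, so a single zero target makes it undefined.
    if any(t == 0 for t in y_true):
        mape = float("nan")
    else:
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)  # mae ≈ 6.67, rmse ≈ 8.16, r2 ≈ 0.99, mape ≈ 0.05
```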
Feature handling
- Numeric columns pass through as-is.
- Boolean columns are cast to int.
- Categorical columns are handed to LightGBM with the Categorical dtype so the booster uses Fisher categorical splits (no one-hot blow-up). Vocabulary is captured at fit time and replayed on inference.
If a column has too many unique values to encode meaningfully, the trainer falls back to ordinal encoding with a configurable cap; check the registered model's model_feature_schema for the resolved choice.
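The casting rules above can be approximated in pandas as follows. The column names, the prepare helper, and the way the vocabulary is stored are hypothetical; the sketch only shows the general idea of capturing category levels at fit time and reusing them at inference. Note that pandas maps a value outside the captured vocabulary to a missing value (code -1); how the trainer itself treats unseen categories is not specified here.

```python
import pandas as pd

train = pd.DataFrame({
    "sqft": [700, 1200, 950],
    "has_garage": [True, False, True],
    "city": ["Oslo", "Bergen", "Oslo"],
})
infer = pd.DataFrame({
    "sqft": [800, 1500],
    "has_garage": [False, True],
    "city": ["Bergen", "Tromso"],  # "Tromso" was never seen at fit time
})


def prepare(frame, city_vocab):
    """Apply the casts: numeric as-is, boolean to int, categorical to Categorical."""
    out = frame.copy()
    out["has_garage"] = out["has_garage"].astype(int)
    # Reusing the fit-time vocabulary keeps category codes stable across frames.
    out["city"] = pd.Categorical(out["city"], categories=city_vocab)
    return out


# Capture the vocabulary once, from the training frame only.
city_vocab = sorted(train["city"].unique())  # ['Bergen', 'Oslo']

train_prepared = prepare(train, city_vocab)
infer_prepared = prepare(infer, city_vocab)

print(infer_prepared["city"].cat.codes.tolist())  # [0, -1]
```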
Limitations
- No prediction intervals. This is point regression — the booster outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them; a quantile-regression variant of this trainer is on the roadmap.
- MAPE is brittle. If any test row has target == 0, MAPE is undefined and reported as NaN. Use MAE or RMSE as your scoreboard metric in that case.
- No automatic feature engineering. Unlike the time-series trainer (which adds lag and calendar features), the regressor uses the columns you give it verbatim. Wire preprocessing nodes upstream if you want imputation, scaling, or encoding.
- Random split assumes IID rows. If your data has temporal structure (rows from before vs. after some date should be split that way), use the forecast template instead.
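The last limitation is easiest to see by comparing the two kinds of split side by side. The helpers below are hypothetical and are not the platform's split nodes; they only show that a random split mixes early and late rows across both sets, while a chronological split keeps every test row strictly after every training row.

```python
import random

# Ten rows with a time index; values are made up for illustration.
rows = [{"day": d, "value": 100 + d} for d in range(10)]


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and hold out a fraction; appropriate for IID data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def chronological_split(rows, test_fraction=0.2, time_key="day"):
    """Hold out the most recent rows so the model never trains on the future."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_test = max(1, int(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


train_r, test_r = random_split(rows)
train_c, test_c = chronological_split(rows)

print("random test days:       ", sorted(r["day"] for r in test_r))
print("chronological test days:", [r["day"] for r in test_c])
```

With a random split on time-ordered data, the model can learn from rows that come after the ones it is scored on, which tends to make held-out metrics look better than they would be in production.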
See also
- lightgbm-v1.md: time-series sister with quantile + CQR calibration.
- lightgbm-classifier-v1.md: categorical-target sister with the same fit/eval shape.
Not sure which to pick?
Choosing a regression algorithm: LightGBM vs Ridge vs OLS vs Random forest for predicting a number, including starting with a linear baseline and when to reach for a tree-based model.