Regression algorithm

LightGBM regression

Gradient-boosted regression for predicting a numeric value from feature columns. Trains one booster on a random hold-out split, scores on the held-out test set, and surfaces the standard regression metric set (MAE / RMSE / R² / MAPE) plus feature importance.

What it does

You point it at a DataSource and pick:

  • a numeric target column you want to predict, and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or categorical — LightGBM handles all three natively. No manual one-hot encoding or normalization is required at this stage; if you want preprocessing (imputation, scaling, encoding) you can wire it explicitly upstream of the trainer node.

The output is a trained model plus an eval_result that carries the metrics, the predictions on the test rows, and a feature importance chart.

How it works

The pipeline shape is:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► regressor_train       regressor_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

regressor_train is fit-only — it produces a fitted booster but no metrics. regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. This mirrors the classification slice; both differ from the forecast trainer (which is monolithic because of the CQR calibration step).
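The fit/eval separation described above can be sketched in plain Python. This is a minimal illustration of the data flow only, not the platform's implementation: the function names simply mirror the node names in the diagram, `MeanModel` is a hypothetical stand-in for the LightGBM booster, and the rows are made up.

```python
import random

def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

class MeanModel:
    """Hypothetical stand-in for the booster: predicts the training mean."""
    def fit(self, features, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]

def regressor_train(model, train_rows):
    """Fit only: returns the fitted model and computes no metrics."""
    features = [x for x, _ in train_rows]
    targets = [y for _, y in train_rows]
    return model.fit(features, targets)

def regressor_eval(trained_model, test_rows):
    """Run the prediction pass on held-out rows and score it."""
    y_true = [y for _, y in test_rows]
    y_pred = trained_model.predict([x for x, _ in test_rows])
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"predictions": y_pred, "mae": mae}

# Made-up (feature, target) pairs.
rows = [(x, 2.0 * x) for x in range(10)]
train_data, test_data = random_split(rows, test_fraction=0.2, seed=42)
trained_model = regressor_train(MeanModel(), train_data)
eval_result = regressor_eval(trained_model, test_data)
```

Because the trainer never sees the test rows, every number in the eval result comes from data the model was not fitted on.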

Loss function

V1 defaults to LightGBM's objective="regression" (squared-error / L2 loss). Users who want a fit that is less sensitive to outliers can switch to median regression (L1 loss) by passing hyperparams={"objective": "regression_l1"} on the model node. Both are first-class LightGBM objectives, so no separate algorithm id is needed.
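To see why the L1 objective is less sensitive to outliers, recall that the best constant prediction under squared error is the mean, while under absolute error it is the median. The sketch below illustrates this with the standard library and made-up numbers. It does not call LightGBM, and the `hyperparams` dict at the end only restates the override quoted above.

```python
import statistics

# Made-up target values; the last one is an outlier.
targets = [10.0, 11.0, 9.0, 10.0, 100.0]

def l2_loss(c, ys):
    """Mean squared error of predicting the constant c for every row."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def l1_loss(c, ys):
    """Mean absolute error of predicting the constant c for every row."""
    return sum(abs(y - c) for y in ys) / len(ys)

mean_pred = statistics.mean(targets)      # best constant under L2 loss
median_pred = statistics.median(targets)  # best constant under L1 loss

# The mean is dragged toward the outlier; the median stays with the bulk.
print("L2-optimal constant:", mean_pred)
print("L1-optimal constant:", median_pred)

# Override from the text; how it reaches the booster is platform-specific.
hyperparams = {"objective": "regression_l1"}
```

A booster fits far more than a constant, but the same pull applies inside each leaf, which is why a few extreme targets can distort an L2 fit more than an L1 fit.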

Metric set

Metric  Meaning
MAE     Mean absolute error: average prediction error in target units
RMSE    Root mean squared error: penalizes large errors more heavily
R²      Coefficient of determination: fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean)
MAPE    Mean absolute percentage error: relative error, NaN if any test row has target == 0

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
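For reference, the four metrics can be computed as follows. This is a minimal standard-library sketch using the textbook definitions, not the platform's own evaluation code. The function name `regression_metrics` is hypothetical, MAPE is returned as a fraction rather than a percentage, and returning NaN for R² on a constant target is an assumption of this sketch.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2 and MAPE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R2 = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: report NaN when the target is constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else math.nan

    # Mirror the documented behaviour: any zero target makes MAPE NaN.
    if any(t == 0 for t in y_true):
        mape = math.nan
    else:
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}
```

Predicting the mean of the test targets for every row gives an R² of exactly 0.0, which is the "no better than predicting the mean" baseline from the table.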

Feature handling

  • Numeric columns pass through as-is.
  • Boolean columns are cast to int.
  • Categorical columns are handed to LightGBM with the Categorical dtype so the booster uses Fisher categorical splits (no one-hot blow-up). Vocabulary is captured at fit time and replayed on inference.

If a column has too many unique values to encode meaningfully, the trainer falls back to ordinal encoding with a configurable cap; check the registered model's model_feature_schema for the resolved choice.
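The idea of capturing a vocabulary at fit time and replaying it at inference can be sketched as below. This is an illustration under stated assumptions, not the trainer's code: `CategoryVocabulary` and `cast_boolean` are hypothetical names, and mapping unseen values to -1 is a choice made for this sketch. The integer codes are identifiers only; a booster that is told the column is categorical does not treat them as ordered.

```python
def cast_boolean(values):
    """Cast True/False to 1/0, as described for boolean columns."""
    return [int(v) for v in values]

class CategoryVocabulary:
    """Learn category-to-code mapping once, then reuse it unchanged."""

    def fit(self, values):
        # Sort so the mapping is deterministic for a given training set.
        self.codes_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Assumption of this sketch: values unseen at fit time become -1.
        return [self.codes_.get(v, -1) for v in values]

# Fit time: the vocabulary is captured from the training column.
vocab = CategoryVocabulary().fit(["red", "green", "red", "blue"])
train_codes = vocab.transform(["red", "green", "red", "blue"])

# Inference time: the same mapping is replayed, including on unseen values.
infer_codes = vocab.transform(["green", "purple"])

flags = cast_boolean([True, False, True])
```

Replaying the fitted mapping matters because re-deriving codes from inference data alone could assign a different integer to the same category.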

Limitations

  • No prediction intervals. This is point regression — the booster outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them; a quantile-regression variant of this trainer is on the roadmap.
  • MAPE is brittle. If any test row has target == 0, MAPE is undefined and reported as NaN. Use MAE or RMSE as your scoreboard metric in that case.
  • No automatic feature engineering. Unlike the time-series trainer (which adds lag and calendar features), the regressor uses the columns you give it verbatim. Wire preprocessing nodes upstream if you want imputation, scaling, or encoding.
  • Random split assumes IID rows. If your data has temporal structure (for example, the model should train on rows before some date and be tested on rows after it), a random split lets later rows leak into training. Use the forecast template instead.

See also

  • lightgbm-v1.md — time-series sister with quantile + CQR calibration.
  • lightgbm-classifier-v1.md — categorical-target sister with the same fit/eval shape.

Not sure which to pick?

Choosing a regression algorithm

LightGBM vs Ridge vs OLS vs Random forest for predicting a number: why to start with a linear baseline, and when to reach for a tree-based model.