Regression algorithm

Linear regression (OLS)

The classic linear model for predicting a numeric value from feature columns. Trains a scikit-learn LinearRegression (ordinary least squares — no regularization) behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard regression metric set plus coefficient-based feature importance.

It is the simplest interpretable baseline in the regression family — pick it when you want plain, textbook least squares whose coefficients are the raw partial effect of each feature, with nothing shrinking them.

What it does

You point it at a DataSource and pick:

  • a numeric target column you want to predict, and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Like the Ridge regressor, and unlike LightGBM, a linear model needs every feature to be numeric, so the model performs that conversion internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the Ridge regressor — the same regressor_train / regressor_eval nodes, the same regression task. regressor_train is fit-only; regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — every regressor shares the exact same scoring + metric code.
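
As a rough illustration of the fit-only / score-on-hold-out split described above, the sketch below trains on one side of a random split and scores only on the other. The function names `train_only` and `evaluate`, the 80/20 split, and the synthetic data are assumptions made for this example; they are not the actual regressor_train / regressor_eval code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def train_only(X_train, y_train):
    # Hypothetical stand-in for a fit-only training step: nothing is scored here.
    return LinearRegression().fit(X_train, y_train)


def evaluate(model, X_test, y_test):
    # Hypothetical stand-in for a shared eval step: it only needs .predict(),
    # so any fitted regressor could be passed in.
    preds = model.predict(X_test)
    rmse = float(np.sqrt(np.mean((y_test - preds) ** 2)))
    return {"predictions": preds, "rmse": rmse}


# Synthetic data (assumption): a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Random hold-out split; the 20% test share is an illustrative choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = train_only(X_train, y_train)
result = evaluate(model, X_test, y_test)
print(result["rmse"])
```

Because `evaluate` touches only `predict`, the same function would score a Ridge or tree-based model unchanged, which is the sense in which the eval step is algorithm-agnostic.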

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

| Feature kind | Steps |
| --- | --- |
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |

A subtlety worth knowing: standardizing the numeric features does not change OLS predictions — ordinary least squares is scale-equivariant, so rescaling a feature just rescales its coefficient inversely and the fitted values are identical. It is kept anyway because it puts the fitted coefficients on a comparable scale for the importance chart. (For Ridge the scaler also matters for the penalty; for OLS it is purely a presentation choice.)
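
The invariance can be checked numerically. The sketch below, on synthetic data chosen only for illustration, fits OLS once on raw features and once on standardized features, then compares predictions and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=[1.0, 100.0], size=(100, 2))
y = 3.0 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(size=100)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
scaled = LinearRegression().fit(X_scaled, y)

# Fitted values agree up to floating-point error.
same_predictions = np.allclose(raw.predict(X), scaled.predict(X_scaled))

# Each standardized coefficient is the raw coefficient times that feature's
# standard deviation, i.e. the coefficient rescales inversely to the feature.
coef_relation_holds = np.allclose(scaled.coef_, raw.coef_ * scaler.scale_)

print(same_predictions, coef_relation_holds)
```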

The whole fitted pipeline — imputers, scaler, encoder, and coefficients — is serialized as one unit, so inference replays exactly what was fit.
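
To illustrate what serializing the pipeline "as one unit" buys you, the sketch below round-trips a small fitted pipeline and checks that predictions match. The use of `pickle` is an assumption for this example; the actual storage format is not specified here.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] - 2.0 * X[:, 1]

fitted = Pipeline([
    ("scale", StandardScaler()),
    ("ols", LinearRegression()),
]).fit(X, y)

# Serialize scaler statistics and coefficients together (pickle is an assumption).
blob = pickle.dumps(fitted)
restored = pickle.loads(blob)

# New rows go through the same fitted scaler before reaching the model.
X_new = rng.normal(size=(5, 2))
round_trip_ok = np.allclose(fitted.predict(X_new), restored.predict(X_new))
print(round_trip_ok)
```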

OLS vs. Ridge

OLS minimizes squared error with no penalty on the coefficients. Ridge adds an L2 penalty (alpha) that shrinks them. The practical differences:

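  • OLS coefficients are the raw partial effects, directly interpretable as "holding everything else fixed, one unit of this feature moves the target by this much." Because numeric features are standardized inside the pipeline, "one unit" means one standard deviation of the original column. Ridge's are biased toward zero by the penalty.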
  • OLS has no tuning knob. There is no alpha to set — the fit is fully determined by the data.
  • OLS is less stable. With many one-hot columns or correlated features, the unpenalized fit can produce large, erratic coefficients. Ridge's penalty tames exactly that.

Use OLS as the transparent reference; reach for Ridge when the feature set is wide or collinear.
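
The stability difference is easy to see on a deliberately collinear toy problem. In the sketch below the data, the noise levels, and `alpha=1.0` are all illustrative assumptions; it compares the size of the coefficient vector under each fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-3, size=200)  # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha chosen arbitrarily for the demo

# Euclidean length of each coefficient vector.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))

print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge.coef_, "norm:", ridge_norm)
```

With near-duplicate columns, OLS is free to split the shared effect between them in an arbitrary, noise-driven way, while the L2 penalty pulls the Ridge coefficients toward a smaller, more even split.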

Metric set

Same as every regressor — the eval step is shared:

| Metric | Meaning |
| --- | --- |
| MAE | Mean absolute error: average prediction error in target units |
| RMSE | Root mean squared error: penalizes large errors more heavily |
| R² | Coefficient of determination: fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean) |
| MAPE | Mean absolute percentage error: relative error, None if any test row has target == 0 |
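
For reference, the four metrics can be computed from scratch as below. This is a sketch of the standard definitions, not the shared eval code; the function name `regression_metrics` is hypothetical, MAPE is returned as a fraction rather than a percentage (an assumption), and a constant target (zero variance) is not handled.

```python
import math


def regression_metrics(y_true, y_pred):
    # Hypothetical helper implementing the textbook definitions.
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # assumes the target is not constant

    # MAPE divides by the true value, so it is undefined if any target is zero.
    if any(t == 0 for t in y_true):
        mape = None
    else:
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
with_zero = regression_metrics([0.0, 20.0, 30.0], [1.0, 18.0, 33.0])
print(metrics)
print(with_zero["mape"])  # None, because one true value is zero
```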

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.

Feature importance

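The chart shows the standardized-coefficient magnitude, |coefficient|, for each (one-hot-expanded) feature. Because numeric features are scaled to unit variance before the fit, their magnitudes are roughly comparable across columns; one-hot indicator columns are not rescaled, so compare them against numeric features with some care.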

These are linear coefficients, not split gains — not numerically comparable to the LightGBM regressor's importance bars — and, being unregularized, not shrunk the way Ridge's are.

Hyperparameters

OLS has no regularization knob — there is nothing equivalent to Ridge's alpha. The only model-node hyperparams scikit-learn exposes are niche:

| Key | Default | Meaning |
| --- | --- | --- |
| fit_intercept | true | Whether to fit an intercept term. Leave on unless you have already centered the target |
| positive | false | Constrain all coefficients to be non-negative |

Most runs leave hyperparams empty.
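
To see what the two keys do, the sketch below passes them straight to scikit-learn's `LinearRegression` on synthetic data. How the node forwards its hyperparams to the estimator is not shown here; calling the estimator directly is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic target with a true intercept of 5 and one negative effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Defaults: intercept fitted, coefficients unconstrained.
default_fit = LinearRegression(fit_intercept=True, positive=False).fit(X, y)

# No intercept: the fitted plane is forced through the origin.
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

# Non-negative constraint: the negative effect cannot be represented.
non_negative = LinearRegression(positive=True).fit(X, y)

print("default:     ", default_fit.intercept_, default_fit.coef_)
print("no intercept:", no_intercept.intercept_, no_intercept.coef_)
print("non-negative:", non_negative.intercept_, non_negative.coef_)
```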

Limitations

  • Linear relationship only. OLS models a linear relationship between features and the target — no interactions, no non-linear effects. If accuracy lags the LightGBM regressor badly, that is usually why.
  • Unstable on wide or collinear feature sets. With no penalty, correlated features (including the full one-hot expansion of a categorical alongside the intercept — the "dummy-variable trap") leave the individual coefficients non-unique and sometimes wildly large. Predictions and R² are still well-defined, but the importance chart can mislead. Switch to Ridge if you see this.
  • One-hot blow-up on high-cardinality columns. A categorical feature with hundreds of distinct values becomes hundreds of indicator columns. Prefer the LightGBM regressor for high-cardinality features, or reduce cardinality upstream.
  • No prediction intervals. This is point regression — a single number per row. For uncertainty bands, use the time-series forecaster.
  • Random split assumes IID rows. If your data has temporal structure, use the forecast template instead.
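
The dummy-variable trap mentioned in the list above can be made concrete with a few lines of NumPy. The sketch uses a hand-written design matrix (an illustrative assumption) to show that an intercept plus a full one-hot expansion is rank-deficient, so different coefficient vectors produce exactly the same fitted values.

```python
import numpy as np

# Three categories, full one-hot expansion (no level dropped).
onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((onehot.shape[0], 1))
design = np.hstack([intercept, onehot])

# The one-hot columns sum to the intercept column, so one column is redundant.
n_columns = design.shape[1]
rank = int(np.linalg.matrix_rank(design))

# Two different coefficient vectors with identical fitted values:
# beta_b moves the intercept of beta_a into every dummy coefficient.
beta_a = np.array([10.0, 1.0, 2.0, 3.0])
beta_b = np.array([0.0, 11.0, 12.0, 13.0])
same_fit = np.allclose(design @ beta_a, design @ beta_b)

print(n_columns, rank, same_fit)
```

The predictions are pinned down by the data, but the individual coefficients are not, which is why the importance chart can mislead in this situation.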

See also

  • ridge-regressor-v1.md — the regularized linear sister; pick it for wide or collinear feature sets.
  • lightgbm-regressor-v1.md — gradient-boosted sister; non-linear, native categorical handling, usually higher accuracy.

Not sure which to pick?

Choosing a regression algorithm

LightGBM vs Ridge vs OLS vs Random forest for predicting a number: why to start with a linear baseline, and when to reach for a tree-based model.