Decision guide

Choosing a regression algorithm

Clarex ships four ways to predict a numeric value: Linear regression (OLS), Ridge regression, Random forest regression, and LightGBM regression. All four take feature columns, split the rows at random into a training set and a hold-out set, and report the same metric set on the hold-out rows (MAE / RMSE / R² / MAPE). What differs is how each one models the relationship between features and target, and what you can read off the result afterwards.
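For reference, the four metrics can be sketched in a few lines of plain Python. This is an illustration of the usual textbook definitions, not Clarex's implementation: the function name `regression_metrics` is made up for this example, and reporting MAPE as a percentage while skipping zero-valued targets is an assumption that a given tool may handle differently.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2 and MAPE for two equal-length sequences.

    Assumption: MAPE is a percentage and rows whose target is zero are
    skipped, since the percentage error is undefined there.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100.0 * sum(pct) / len(pct)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Three hold-out rows: two predictions are off by 10, one is exact.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

MAE and RMSE are in the target's own units, R² is unitless, and MAPE is relative, which is why the same absolute miss of 10 counts for more on the row whose target is 100 than on the row whose target is 200.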

The short version: start with a linear model, and reach for a tree-based model when it isn't enough.

At a glance

Algorithm	Relationship	Interpretability	Categorical features	Tuning
Linear regression (OLS)	Linear only	Highest — coefficients are raw effects	One-hot, automatic	None
Ridge regression	Linear only	High — coefficients, gently shrunk	One-hot, automatic	One regularization knob
Random forest regression	Non-linear	Low — importance ranking only	One-hot, automatic	Minimal
LightGBM regression	Non-linear	Low — importance ranking only	Native categorical splits	Moderate

Start with a linear baseline

A linear model is the right first model almost every time. It trains almost instantly, it rarely overfits badly as long as rows comfortably outnumber features, and its coefficients tell you the story: "each extra bedroom adds about $X to the predicted price." Even when you expect to need something fancier, a linear baseline is the yardstick that tells you whether the fancier model is actually earning its complexity.

Linear regression (OLS) is the plainest choice: textbook least squares, no regularization. Each coefficient is the raw partial effect of its feature. Pick it when you have a modest number of well-behaved features and want the most direct reading possible.
Ridge regression is least squares with L2 regularization — it gently shrinks coefficients toward zero. That makes it steadier than OLS when features are correlated or numerous, at the cost of coefficients biased slightly toward zero. Ridge is the better default of the two; use plain OLS when you specifically want unshrunk effects.
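The shrinkage is easiest to see with a single feature, where both estimators have a closed form. The sketch below is illustrative only and makes no claim about how Clarex fits these models; `fit_slope` and `alpha` (the regularization strength) are names chosen for this example, and the penalty is deliberately large so the effect is visible.

```python
def fit_slope(x, y, alpha=0.0):
    """Least-squares slope for one feature, with an optional L2 penalty.

    The data are centered first, so the intercept is not penalized.
    alpha=0 gives plain OLS; alpha>0 gives the ridge estimate.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly y = 2x

ols_slope = fit_slope(x, y)                 # unshrunk effect
ridge_slope = fit_slope(x, y, alpha=10.0)   # pulled toward zero
```

Here the OLS slope recovers the true effect of 2, while the heavily penalized ridge slope is smaller. With one clean feature that bias buys nothing; the steadiness ridge offers shows up when several correlated features compete for the same effect.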

If a linear model's R² is already where you need it, stop here — you have a model that is both accurate and explainable.

When to go non-linear

Reach for a tree-based model when a linear fit leaves accuracy on the table — typically because the relationship curves, or because features interact (the effect of one depends on another). Trees capture both automatically, with no interaction terms to engineer.
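A toy case makes the point about curvature concrete. The sketch below fits a straight line and a tiny greedy regression tree to y = x² on a symmetric range; it is a from-scratch illustration in plain Python, not the algorithm Clarex runs, and all function names are invented for the example. The R² values are computed on the training points themselves, so they show what each model can represent, not how it generalizes.

```python
def mean(values):
    return sum(values) / len(values)

def r_squared(y_true, y_pred):
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least squares with one feature; returns a predict function."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return lambda q: my + slope * (q - mx)

def fit_tree(x, y, depth):
    """Tiny greedy regression tree on one feature; returns a predict function."""
    if depth == 0 or len(set(x)) < 2:
        leaf = mean(y)
        return lambda q: leaf
    best = None
    for threshold in sorted(set(x))[1:]:
        left = [b for a, b in zip(x, y) if a < threshold]
        right = [b for a, b in zip(x, y) if a >= threshold]
        lm, rm = mean(left), mean(right)
        # Squared error left over if each side predicts its own mean.
        cost = sum((b - lm) ** 2 for b in left) + sum((b - rm) ** 2 for b in right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    split = best[1]
    left_pairs = [(a, b) for a, b in zip(x, y) if a < split]
    right_pairs = [(a, b) for a, b in zip(x, y) if a >= split]
    lo = fit_tree([a for a, _ in left_pairs], [b for _, b in left_pairs], depth - 1)
    hi = fit_tree([a for a, _ in right_pairs], [b for _, b in right_pairs], depth - 1)
    return lambda q: lo(q) if q < split else hi(q)

x = [float(v) for v in range(-4, 5)]
y = [v * v for v in x]  # a curve with no linear trend

line = fit_line(x, y)
tree = fit_tree(x, y, depth=3)
linear_r2 = r_squared(y, [line(v) for v in x])
tree_r2 = r_squared(y, [tree(v) for v in x])
```

The line has nothing to hold on to, because the rising and falling halves cancel and its R² is essentially zero, while three levels of splits already explain most of the variance with no squared term added by hand.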

Random forest regression averages many decision trees, each trained independently on a random resample of the rows. It is the robust, low-effort non-linear option: the defaults give a solid model, there is no learning rate to tune, and it is hard to overfit badly. Pick it when you want non-linear accuracy with near-zero fuss.
LightGBM regression builds trees sequentially, each one correcting the errors left by the trees before it (gradient boosting). On well-behaved tabular data it usually reaches the highest accuracy of the four, and it splits on categorical columns natively, so high-cardinality categories don't blow up into hundreds of one-hot columns. The trade-off is that the learning rate and the tree-complexity settings interact, so it rewards a little tuning.
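The "each tree corrects the last" idea can be sketched with one-split trees (stumps) and squared error, where the quantity each new tree fits is simply the current residual. This is a bare-bones illustration of gradient boosting in plain Python, not LightGBM's actual algorithm, which adds histogram binning, leaf-wise growth, and regularization; the function names and the `learning_rate` and `n_rounds` values are chosen for the example.

```python
def fit_stump(x, residuals):
    """Best single-split predictor on one feature, fitted to the residuals."""
    best = None
    for threshold in sorted(set(x))[1:]:
        left = [r for a, r in zip(x, residuals) if a < threshold]
        right = [r for a, r in zip(x, residuals) if a >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda q: lm if q < threshold else rm

def boost(x, y, n_rounds, learning_rate):
    """Fit stumps one after another, each to the errors left by the rest."""
    base = sum(y) / len(y)          # round zero: predict the mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take only a fraction of each correction; this is the learning rate.
        pred = [p + learning_rate * stump(a) for p, a in zip(pred, x)]
    return lambda q: base + learning_rate * sum(s(q) for s in stumps)

def mse(model, x, y):
    return sum((t - model(a)) ** 2 for a, t in zip(x, y)) / len(y)

x = [float(v) for v in range(-4, 5)]
y = [v * v for v in x]

mse_0 = mse(boost(x, y, n_rounds=0, learning_rate=0.3), x, y)
mse_5 = mse(boost(x, y, n_rounds=5, learning_rate=0.3), x, y)
mse_50 = mse(boost(x, y, n_rounds=50, learning_rate=0.3), x, y)
```

Training error keeps falling as rounds are added, which is also why the settings interact: a smaller learning rate needs more rounds to reach the same fit, and deeper trees need fewer. A random forest, by contrast, would fit every tree to the original target and average them, so it has no equivalent knob.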

Between the two: random forest if you want to set it and forget it; LightGBM if you want peak accuracy or have messy high-cardinality categoricals.
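The high-cardinality concern is about column count. The sketch below contrasts one-hot expansion with a single column of integer category codes, which is the general shape of input that a learner with native categorical support can split on directly. Both helpers are hypothetical and written in plain Python for illustration; they do not show how Clarex or LightGBM encode data internally.

```python
def one_hot(values):
    """Expand one categorical column into one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

def integer_codes(values):
    """Keep a single column, replacing each level with a small integer code."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values]

cities = ["Oslo", "Lima", "Pune", "Oslo", "Kyiv", "Lima"]

levels, encoded = one_hot(cities)   # one column per distinct city
codes = integer_codes(cities)       # still a single column

one_hot_width = len(encoded[0])
```

With four distinct cities the one-hot version is four columns wide; a column with 500 distinct levels would become 500 mostly-zero columns, while the coded version stays at one.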

Rules of thumb

Always run a linear baseline first — it's free, and it tells you what "good" looks like.
Need to explain the model? Prefer Ridge or OLS. A linear model's coefficients say how much, and in which direction, each feature moves the prediction; tree importances only rank what mattered.
Lots of high-cardinality categorical features? LightGBM handles them natively; the others one-hot every level.
Want non-linear accuracy with no tuning? Random forest.
Chasing the last few points of accuracy? LightGBM, with some tuning.
Rows have a time order (before / after a date)? None of these is the right tool — see Choosing a forecasting algorithm instead.
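The rules above can be condensed into a small helper. This is only a restatement of the guide as code: the function, its parameter names, and the order in which the checks are applied are assumptions made for the example, and nothing like it is implied to exist in Clarex.

```python
def suggest_algorithm(
    time_ordered=False,
    needs_explanation=False,
    linear_is_good_enough=False,
    high_cardinality_categoricals=False,
    willing_to_tune=False,
):
    """Map the rules of thumb to a suggested starting point.

    Assumed precedence: time order rules regression out entirely, then
    explainability, then the data and effort considerations.
    """
    if time_ordered:
        return "See Choosing a forecasting algorithm"
    if needs_explanation or linear_is_good_enough:
        # Ridge is the guide's default linear model; use OLS for unshrunk effects.
        return "Ridge regression"
    if high_cardinality_categoricals or willing_to_tune:
        return "LightGBM regression"
    return "Random forest regression"

default_choice = suggest_algorithm()
explainable_choice = suggest_algorithm(needs_explanation=True)
categorical_choice = suggest_algorithm(high_cardinality_categoricals=True)
dated_choice = suggest_algorithm(time_ordered=True, willing_to_tune=True)
```

The helper assumes a linear baseline has already been run, which is how the caller would know whether `linear_is_good_enough` holds.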
