Decision guide
Choosing a regression algorithm
Clarex ships four ways to predict a numeric value: LightGBM regression, Ridge regression, Linear regression (OLS), and Random forest regression. They all take feature columns, hold out a random split of the rows for evaluation, and report the same metric set (MAE / RMSE / R² / MAPE). What differs is how each one models the relationship between features and target, and what you can read off the result afterwards.
The short version: start with a linear model, and reach for a tree-based model when it isn't enough.
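For reference, the four shared metrics can be sketched in a few lines of plain Python. This uses the textbook definitions and is only an illustration: the function name `regression_metrics` is made up here, and how Clarex itself handles edge cases (for example, zero targets in MAPE) is an assumption.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2 and MAPE (in percent) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                         # share of variance explained

    # MAPE is undefined when a true value is 0; this sketch skips those rows
    # (an assumption, not necessarily what Clarex does).
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100.0 * sum(pct) / len(pct)

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}

metrics = regression_metrics(
    [100.0, 200.0, 300.0, 400.0],   # actual values
    [110.0, 190.0, 310.0, 380.0],   # predictions
)
print(metrics)
```

MAE and RMSE are in the target's own units, R² is unitless, and MAPE is a percentage, which is why it helps to read them together rather than pick one.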
At a glance
| Algorithm | Relationship | Interpretability | Categorical features | Tuning |
|---|---|---|---|---|
| Linear regression (OLS) | Linear only | Highest — coefficients are raw effects | One-hot, automatic | None |
| Ridge regression | Linear only | High — coefficients, gently shrunk | One-hot, automatic | One regularization knob |
| Random forest regression | Non-linear | Low — importance ranking only | One-hot, automatic | Minimal |
| LightGBM regression | Non-linear | Low — importance ranking only | Native categorical splits | Moderate |
Start with a linear baseline
A linear model is the right first model almost every time. It trains in a blink, it rarely overfits badly when you have many more rows than features, and its coefficients tell you the story: "each extra bedroom adds about $X to the predicted price." Even when you expect to need something fancier, a linear baseline is the yardstick that tells you whether the fancier model is actually earning its complexity.
- Linear regression (OLS) is the plainest choice: textbook least squares, no regularization. Each coefficient is the raw partial effect of its feature, that is, the change in the prediction per unit change in that feature with the others held fixed. Pick it when you have a modest number of well-behaved features and want the most direct reading possible.
- Ridge regression is least squares with L2 regularization — it gently shrinks coefficients toward zero. That makes it steadier than OLS when features are correlated or numerous, at the cost of coefficients biased slightly toward zero. Ridge is the better default of the two; use plain OLS when you specifically want unshrunk effects.
If a linear model's R² is already where you need it, stop here — you have a model that is both accurate and explainable.
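To make "steadier when features are correlated" concrete, here is a small sketch in plain Python. It is not Clarex's implementation: the helper `fit_two_features`, the toy data, and the penalty value `alpha=1.0` are all illustrative assumptions. It solves the two-feature normal equations directly, once without a penalty (OLS) and once with an L2 penalty (ridge), on clean targets and on targets nudged by ±0.2.

```python
def fit_two_features(x1, x2, y, alpha=0.0):
    """Least squares on two centered features with L2 penalty `alpha`.

    alpha=0 is plain OLS; alpha>0 is ridge. Solves the 2x2 system
    (X'X + alpha*I) b = X'y by Cramer's rule. The intercept is handled
    by centering, so only the two slopes are returned.
    """
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    x1, x2, y = center(x1), center(x2), center(y)
    a = sum(u * u for u in x1) + alpha
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + alpha
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    return ((r1 * d - b * r2) / det, (a * r2 - b * r1) / det)

# Two almost identical features (x2 is x1 plus or minus 0.1).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]

# Clean target y = 2*x1 + 1*x2, then the same target nudged by +/-0.2.
y_clean = [2 * u + v for u, v in zip(x1, x2)]
nudge = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
y_noisy = [t + e for t, e in zip(y_clean, nudge)]

ols_clean = fit_two_features(x1, x2, y_clean)               # about (2, 1)
ols_noisy = fit_two_features(x1, x2, y_noisy)               # about (0, 3)
ridge_clean = fit_two_features(x1, x2, y_clean, alpha=1.0)
ridge_noisy = fit_two_features(x1, x2, y_noisy, alpha=1.0)

print("OLS  :", ols_clean, "->", ols_noisy)
print("Ridge:", ridge_clean, "->", ridge_noisy)
```

A nudge of 0.2 swings the OLS coefficients from roughly (2, 1) to roughly (0, 3), while each ridge coefficient moves by less than 0.1. The price is visible too: the ridge coefficients match neither set of underlying effects exactly, which is the shrinkage bias described above.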
When to go non-linear
Reach for a tree-based model when a linear fit leaves accuracy on the table — typically because the relationship curves, or because features interact (the effect of one depends on another). Trees capture both automatically, with no interaction terms to engineer.
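A toy illustration of "the relationship curves", in plain Python. The tiny one-feature tree below is a teaching sketch with made-up helper names, not how Clarex, random forest, or LightGBM builds trees. The target is y = x² on a symmetric range, so a straight line has nothing to hold on to.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def r_squared(y_true, y_pred):
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / sse(y_true)

def fit_line(xs, ys):
    """One-feature OLS; returns a prediction function."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_tree(xs, ys, depth):
    """Tiny one-feature regression tree; returns a prediction function."""
    thresholds = sorted(set(xs))[:-1]   # dropping the max keeps both sides non-empty
    if depth == 0 or not thresholds:
        leaf = mean(ys)
        return lambda x: leaf

    def split_cost(t):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        return sse(left) + sse(right)

    best = min(thresholds, key=split_cost)   # greedy: lowest squared error after the split
    pairs = list(zip(xs, ys))
    left = fit_tree([x for x, _ in pairs if x <= best],
                    [y for x, y in pairs if x <= best], depth - 1)
    right = fit_tree([x for x, _ in pairs if x > best],
                     [y for x, y in pairs if x > best], depth - 1)
    return lambda x: left(x) if x <= best else right(x)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]            # a curve no straight line can follow

line = fit_line(xs, ys)
tree = fit_tree(xs, ys, depth=2)

r2_line = r_squared(ys, [line(x) for x in xs])
r2_tree = r_squared(ys, [tree(x) for x in xs])
print("linear R2:", r2_line)
print("tree   R2:", r2_tree)
```

The best straight line here is flat, so its R² is 0, while even a depth-2 tree explains most of the variance without any hand-made x² feature. Both scores are measured on the training points, so this shows capacity, not hold-out accuracy.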
- Random forest regression averages many independently trained decision trees. It is the robust, low-effort non-linear option: the defaults give a solid model, there is no learning rate to tune, and it is hard to overfit badly. Pick it when you want non-linear accuracy with near-zero fuss.
- LightGBM regression builds trees sequentially, each one correcting the errors of those before it (gradient boosting). On well-behaved tabular data it usually reaches the highest accuracy of the four, and it splits on categorical columns natively, so high-cardinality categories don't blow up into hundreds of one-hot columns. The trade-off is that its learning rate and tree-complexity settings interact, so it rewards a little tuning.
Between the two: random forest if you want to set it and forget it; LightGBM if you want peak accuracy or have messy high-cardinality categoricals.
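To see why high-cardinality categoricals tip the choice, compare what the two encodings do to a single column. This is a plain-Python sketch with hypothetical helper names and made-up postcode data; how Clarex encodes columns internally is not shown here.

```python
def one_hot(column):
    """One-hot encode a column: one 0/1 indicator column per distinct level."""
    levels = sorted(set(column))
    position = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in column:
        row = [0] * len(levels)
        row[position[value]] = 1
        rows.append(row)
    return levels, rows

def integer_codes(column):
    """Map each level to an integer code, keeping a single column.

    Integer-coded (or category-typed) columns are the kind of input that
    native categorical splitting works from.
    """
    codes = {level: i for i, level in enumerate(sorted(set(column)))}
    return [codes[value] for value in column]

# Illustrative data: 1,000 rows spread over 300 distinct postcodes.
postcodes = [f"PC{i % 300:03d}" for i in range(1000)]

levels, encoded_rows = one_hot(postcodes)
coded = integer_codes(postcodes)

print("one-hot columns:", len(encoded_rows[0]))   # one per distinct postcode
print("native columns :", 1, "with", len(set(coded)), "codes")
```

One column becomes 300 mostly-zero columns under one-hot encoding, and each extra categorical feature adds its own block. With native splits the feature stays a single column, and the tree can group levels together in one split.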
Rules of thumb
- Always run a linear baseline first — it's free, and it tells you what "good" looks like.
- Need to explain the model? A linear model's coefficients explain why; tree importances only rank what mattered. Prefer Ridge or OLS.
- Lots of high-cardinality categorical features? LightGBM handles them natively; the others one-hot every level.
- Want non-linear accuracy with no tuning? Random forest.
- Chasing the last few points of accuracy? LightGBM, with some tuning.
- Rows have a time order (before / after a date)? None of these is the right tool — see Choosing a forecasting algorithm instead.
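The rules above can be folded into a small helper, sketched here in Python. The function name, the flag names, and especially the order in which the rules are checked are assumptions made for illustration; the guide itself does not rank the rules against each other.

```python
def suggest_regressor(time_ordered=False, baseline_good_enough=False,
                      needs_explanation=False, high_cardinality=False,
                      willing_to_tune=False):
    """Return a suggested starting point based on the rules of thumb.

    The precedence below (time order first, then explainability, then
    categoricals and tuning) is an assumption, not part of the guide.
    """
    if time_ordered:
        # Time-ordered rows need a forecasting method, not a regressor.
        return "See: Choosing a forecasting algorithm"
    if baseline_good_enough or needs_explanation:
        # Ridge is the better default of the two linear models.
        return "Ridge regression"
    if high_cardinality or willing_to_tune:
        return "LightGBM regression"
    # Non-linear accuracy with no tuning.
    return "Random forest regression"

print(suggest_regressor(needs_explanation=True))
print(suggest_regressor(high_cardinality=True))
print(suggest_regressor())
```

Whatever the helper suggests, the first rule still applies: fit the linear baseline anyway, so you can tell whether the suggested model is earning its complexity.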
See also
- The full reference for each: LightGBM regression · Ridge regression · Linear regression (OLS) · Random forest regression
- Choosing a classification algorithm — the same decision, for predicting a category.
- Choosing a forecasting algorithm — for time-ordered data.