Decision guide
Choosing a classification algorithm
Clarex ships three ways to predict a category — LightGBM classification, Logistic regression, and Random forest classification. All three take feature columns, train on a random hold-out split, handle binary and multi-class targets, and report the same metric set (accuracy, precision / recall / F1, ROC AUC, confusion matrix). What differs is how each one draws the boundary between classes — and how much it can tell you afterwards.
As with regression, the short version: start linear, go to trees when linear isn't enough.
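To make the shared metric set concrete, here is a minimal sketch in plain Python of how the headline numbers fall out of binary confusion-matrix counts and prediction scores. The function names `binary_metrics` and `roc_auc` are illustrative and not part of Clarex, and the counts and scores are invented for the example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This pairwise form is O(n^2), fine for a demo.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Illustrative counts only: 60 actual positives, 140 actual negatives.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)

# Squeezing every score toward 0.5 keeps the ordering, so AUC is unchanged.
squeezed = [0.5 + 0.2 * (s - 0.5) for s in scores]
auc_squeezed = roc_auc(squeezed, labels)
```

Accuracy uses all four cells of the confusion matrix, whereas precision and recall ignore true negatives, which is why they stay informative when one class dominates. ROC AUC depends only on how the scores order the rows, not on their absolute values.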
At a glance
| Algorithm | Decision boundary | Interpretability | Categorical features | Tuning |
|---|---|---|---|---|
| Logistic regression | Linear | High — readable coefficients | One-hot, automatic | Minimal |
| Random forest classification | Non-linear | Low — importance ranking only | One-hot, automatic | Minimal |
| LightGBM classification | Non-linear | Low — importance ranking only | Native categorical splits | Moderate |
Start with logistic regression
Logistic regression is the right first classifier almost every time. It fits a linear boundary between classes and trains quickly. Because the features are standardized before the fit, its coefficients are roughly comparable, so you can read which features push a row toward which class. Even when you expect to need a tree model, logistic regression is the baseline that tells you whether the tree model is earning its complexity.
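Why standardizing makes coefficients comparable can be shown with a short worked example. The sketch below assumes z-score standardization (subtract the mean, divide by the standard deviation); the guide does not say which scaler Clarex applies, and the column names, values, and coefficients are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a column: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Two hypothetical features on very different scales.
income_usd = [20_000, 40_000, 60_000, 80_000]
discount_rate = [0.00, 0.05, 0.10, 0.15]

income_z = standardize(income_usd)
discount_z = standardize(discount_rate)

# A coefficient per raw unit becomes a coefficient per standard deviation
# when multiplied by the column's standard deviation.
raw_coef_income = 0.00005   # log-odds per dollar (made-up number)
raw_coef_discount = 12.0    # log-odds per unit of rate (made-up number)

std_coef_income = raw_coef_income * pstdev(income_usd)
std_coef_discount = raw_coef_discount * pstdev(discount_rate)
```

On the raw scale the income coefficient looks negligible next to the discount coefficient. Per standard deviation the order reverses (about 1.12 versus 0.67 here), which is the comparison you actually want when reading which feature moves the prediction more.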
Its limit is in the name: the boundary is linear. If classes are separated by a curved or interaction-driven boundary, logistic regression can't bend to fit it. One-hot encoding adds a second cost: a categorical feature with hundreds of values becomes hundreds of columns. When accuracy lags a tree model badly, one of these two limits is usually why.
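The classic case of an interaction-driven boundary is XOR, where the label depends on whether two features disagree. The brute-force sketch below searches a grid of linear boundaries and finds none that classifies all four points. It illustrates the limit of any linear classifier rather than Clarex's own fitting procedure, and the grid range is an arbitrary choice.

```python
from itertools import product

# XOR: the class is 1 only when exactly one of the two features is on.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def correct_count(w1, w2, bias):
    """Number of XOR points a linear rule w1*x1 + w2*x2 + bias > 0 gets right."""
    return sum(
        int(w1 * x1 + w2 * x2 + bias > 0) == label
        for (x1, x2), label in points
    )


# Try every combination of weights and bias from -3.0 to 3.0 in steps of 0.25.
grid = [i / 4 for i in range(-12, 13)]
best = max(correct_count(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

# Adding the interaction x1 * x2 as a third feature makes XOR linearly separable:
# x1 + x2 - 2*x1*x2 - 0.5 is positive exactly on the class-1 points.
with_interaction = sum(
    int(x1 + x2 - 2 * x1 * x2 - 0.5 > 0) == label
    for (x1, x2), label in points
)
```

The best linear rule gets three of the four points right, while the rule with the interaction term gets all four. A tree model can find this kind of interaction on its own; a linear model needs it supplied as an engineered feature.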
When to go non-linear
Reach for a tree-based classifier when the linear boundary leaves accuracy on the table.
- Random forest classification averages many independent decision trees (bagging). It is the robust, low-effort option — solid defaults, no learning rate, hard to overfit badly. Pick it when you want non-linear accuracy with almost no tuning. One caveat: its predicted probabilities rank well but are pulled toward the middle, so trust the ranking (and ROC AUC) over the raw numbers.
- LightGBM classification builds trees sequentially, each correcting the last (boosting). On well-behaved tabular data it usually reaches the highest accuracy of the three, and it splits on categorical columns natively — a real edge when you have high-cardinality categories. The trade-off is more tuning: a learning rate and tree-complexity settings that interact.
Between the two tree models: pick random forest for set-and-forget, LightGBM for peak accuracy or messy high-cardinality categoricals.
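The difference between bagging and boosting shows up in how the individual trees' outputs are combined. The sketch below uses hard-coded numbers as stand-ins for fitted trees, so it illustrates the combining rules in general terms, not Clarex's or LightGBM's internals; `bagging_probability` and `boosting_probability` are illustrative names.

```python
import math


def bagging_probability(tree_probabilities):
    """Bagging: every tree votes independently and the votes are averaged."""
    return sum(tree_probabilities) / len(tree_probabilities)


def boosting_probability(tree_corrections, learning_rate, initial_log_odds=0.0):
    """Boosting: each tree adds a scaled correction to a running log-odds score."""
    score = initial_log_odds
    for correction in tree_corrections:
        score += learning_rate * correction
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid turns log-odds into probability


# Stand-in outputs for one row (made-up numbers, not from a fitted model).
forest_votes = [1.0, 1.0, 0.9, 0.6, 1.0]
p_forest = bagging_probability(forest_votes)

corrections = [2.0, 1.5, 1.0, 0.5, 0.5]
p_slow = boosting_probability(corrections, learning_rate=0.1)
p_fast = boosting_probability(corrections, learning_rate=1.0)
```

A single dissenting tree keeps the averaged probability below 1.0, which is one way a forest's probabilities end up pulled toward the middle. In the boosted score the learning rate scales every correction, so the same trees give a cautious probability at 0.1 and a near-certain one at 1.0; that coupling between learning rate and tree settings is what makes boosting need more tuning.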
A note on imbalanced classes
If one class is rare (fraud, churn, defects), accuracy alone is misleading: when 95% of rows belong to the majority class, a model that always predicts that class scores 95% accuracy while catching none of the rare cases. Watch precision, recall, and ROC AUC, and read the confusion matrix. Logistic regression and random forest both accept class_weight="balanced"; LightGBM accepts is_unbalance: true. Use a stratified split so the rare class is present in both the train and test halves.
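The three points in this note can be checked on a toy label column with plain Python. The weight formula below, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`; that Clarex computes its weights the same way is an assumption. `balanced_class_weights` and `stratified_split` are illustrative helpers, not Clarex functions.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps its share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 950 majority rows (0) and 50 rare rows (1).
labels = [0] * 950 + [1] * 50

# A "model" that always predicts the majority class.
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

weights = balanced_class_weights(labels)
train_idx, test_idx = stratified_split(labels)
rare_in_test = sum(labels[i] for i in test_idx)
```

The always-majority model reaches 0.95 accuracy with a recall of 0.0. The balanced weights come out near 0.53 for the majority class and 10.0 for the rare class, and the stratified split places 10 of the 50 rare rows in the 200-row test set.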
Rules of thumb
- Always run logistic regression first — it's free and it sets the bar.
- Need to explain the prediction? Logistic regression's coefficients explain why; tree importances only rank what mattered.
- Lots of high-cardinality categorical features? LightGBM handles them natively; the others one-hot every level.
- Want non-linear accuracy with no tuning? Random forest.
- Chasing the last few points of accuracy? LightGBM, with some tuning.
- Rare, important class? Watch recall and ROC AUC, set the class weighting, and use a stratified split.
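The high-cardinality rule is easier to weigh with numbers in hand. This minimal sketch shows what one-hot encoding does to a single categorical column; the `one_hot` helper and the postal-code column are hypothetical, and Clarex's own encoder may differ in details such as how it names columns.

```python
def one_hot(values):
    """Expand one categorical column into one 0/1 column per distinct level."""
    levels = sorted(set(values))
    return {level: [int(v == level) for v in values] for level in levels}


# Hypothetical column: 1,000 rows spread over 300 distinct postal zones.
postal_codes = [f"zone_{i % 300:03d}" for i in range(1000)]
encoded = one_hot(postal_codes)

n_columns = len(encoded)                                # one column per level
ones_in_first_row = sum(col[0] for col in encoded.values())
fill_rate = sum(map(sum, encoded.values())) / (n_columns * len(postal_codes))
```

One column turns into 300, each row has a single 1 among them, and only 1 cell in 300 is non-zero. A model that splits on the categorical column directly avoids that expansion.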
See also
- The full reference for each: LightGBM classification · Logistic regression · Random forest classification
- Choosing a regression algorithm — the same decision, for predicting a number.
- Choosing a forecasting algorithm — for time-ordered data.