Classification algorithm

Logistic regression

A linear classifier for predicting a categorical label from feature columns. Trains a scikit-learn LogisticRegression behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus coefficient-based feature importance.

It is the interpretable linear baseline alongside lightgbm-classifier-v1 — pick it when you want a fast, transparent model whose coefficients you can read, or as a yardstick for judging whether a more complex model is earning its keep.

What it does

You point it at a DataSource and pick:

  • a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Unlike the LightGBM classifier, which consumes categoricals natively, logistic regression needs every feature to be numeric, so the model does that conversion for you internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline has the same shape as the LightGBM classifier's and uses the same classifier_train / classifier_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► classifier_train      classifier_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — logistic regression and LightGBM share the exact same scoring + metric code.
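
The fit/eval separation above can be sketched in plain scikit-learn. This is a minimal illustration, not the nodes' actual implementation: the two functions only borrow the node names, the synthetic data comes from make_classification, and the 25% test fraction and the shape of eval_result are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def classifier_train(model, X_train, y_train):
    """Fit only: no scoring happens in this step."""
    return model.fit(X_train, y_train)


def classifier_eval(trained_model, X_test, y_test):
    """Run the prediction pass on held-out rows and score it."""
    predictions = trained_model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "predictions": predictions,
    }


# Synthetic stand-in for a DataSource (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# random_split: rows are assigned to train/test at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

trained_model = classifier_train(LogisticRegression(max_iter=1000), X_train, y_train)
eval_result = classifier_eval(trained_model, X_test, y_test)
```

Because the evaluation function only needs an object with a predict method, the same function would score any other scikit-learn-style classifier unchanged, which is the sense in which the eval step is algorithm-agnostic.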

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

| Feature kind | Steps |
| --- | --- |
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |

The standardization matters twice over: it helps the optimizer converge, and it puts the fitted coefficients on a comparable scale for the importance chart.

The whole fitted pipeline — imputers, scaler, encoder, and coefficients — is serialized as one unit, so inference replays exactly what was fit.
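
A pipeline of this shape might look like the sketch below. The column names and toy data are hypothetical, the step names are arbitrary, and using pickle for the round-trip is an assumption (the text above does not say which serializer is used); the scikit-learn classes themselves are standard.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value per feature column.
df = pd.DataFrame({
    "tenure": [1.0, 24.0, np.nan, 36.0, 3.0, 48.0],
    "plan": pd.Series(["basic", "pro", "basic", np.nan, "basic", "pro"], dtype=object),
    "churned": ["yes", "no", "yes", "no", "yes", "no"],
})
numeric_cols, categorical_cols = ["tenure"], ["plan"]

# Numeric / boolean branch: median impute, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# String / categorical branch: most-frequent impute, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])

X = df[numeric_cols + categorical_cols]
model.fit(X, df["churned"])

# Serializing the whole Pipeline keeps imputers, scaler, encoder and
# coefficients together, so the restored object replays the same transforms.
restored = pickle.loads(pickle.dumps(model))
```

Keeping preprocessing inside the Pipeline means the statistics learned at fit time (medians, means, variances, category lists) travel with the model instead of being recomputed at inference.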

Missing values & unseen categories

  • Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; logistic regression does not, so this step is required — the model handles it so you don't have to.
  • Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
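
Both behaviors can be reproduced in isolation with standard scikit-learn components. This is a small sketch; that the model uses exactly these settings internally (in particular handle_unknown="ignore") is an assumption based on the behavior described above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of the observed values.
numeric = np.array([[1.0], [np.nan], [3.0]])
imputed = SimpleImputer(strategy="median").fit_transform(numeric)

# One-hot encoding fitted on two known categories.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["basic"], ["pro"]], dtype=object))

# A category never seen during fit encodes to all zeros instead of raising.
unseen_row = encoder.transform(np.array([["enterprise"]], dtype=object)).toarray()
known_row = encoder.transform(np.array([["pro"]], dtype=object)).toarray()
```

With an all-zero indicator row the unseen category contributes nothing to the linear score, so the prediction falls back on the intercept and the remaining features.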

Metric set

Same as the LightGBM classifier — the eval step is shared:

| Metric | Meaning |
| --- | --- |
| Accuracy | Fraction of test rows classified correctly |
| Precision / Recall / F1 | Weighted averages across classes (binary: computed for the positive class) |
| ROC AUC | Ranking quality: a single score for binary targets, macro-averaged one-vs-rest for multi-class |
| Confusion matrix | Per-class true-vs-predicted counts |
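
For a binary target, the metric set can be computed with scikit-learn's metric functions as in the sketch below. The labels and probabilities are made-up illustration data, and whether the shared eval step calls these exact functions is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hypothetical test rows: true labels, predicted labels, predicted P("yes").
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["no", "yes", "yes", "yes", "no", "no"]
p_yes = [0.2, 0.6, 0.9, 0.7, 0.1, 0.4]

accuracy = accuracy_score(y_true, y_pred)

# Binary case: precision/recall/F1 of the positive class.
# For multi-class targets, average="weighted" would be used instead.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="yes"
)

# ROC AUC ranks rows by predicted probability of the positive class.
# For multi-class, pass the full probability matrix with
# multi_class="ovr", average="macro".
auc = roc_auc_score([1 if label == "yes" else 0 for label in y_true], p_yes)

# Rows are true classes, columns are predicted classes, in the order given.
matrix = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
# matrix is [[2, 1], [1, 2]]
```

Note that accuracy and the confusion matrix depend on the hard predictions, while ROC AUC depends only on how the probabilities rank the rows.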

Feature importance

The chart shows the standardized-coefficient magnitude, |coefficient|, for each (one-hot-expanded) feature. Because features are scaled to unit variance before the fit, these magnitudes are roughly comparable across columns. For multi-class models the per-class coefficient rows are collapsed by mean absolute value.

These are linear coefficients, not split gains — they are not numerically comparable to the LightGBM classifier's importance bars, and they describe linear effects only.
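
The collapse from a coefficient matrix to one bar per feature can be sketched with NumPy. The matrix and feature names below are invented for illustration; in scikit-learn the matrix would come from the fitted LogisticRegression's coef_ attribute, which has one row per class (a single row for binary targets).

```python
import numpy as np

# Hypothetical coefficients: 3 classes x 4 one-hot-expanded features.
coef = np.array([
    [0.5, -1.0, 0.2, 0.0],
    [-0.5, 0.4, 0.1, 0.3],
    [0.0, 0.6, -0.3, -0.3],
])
feature_names = ["tenure", "spend", "plan_basic", "plan_pro"]

# Mean absolute value down each column: one importance per feature.
# For a binary model (one row) this reduces to plain |coefficient|.
importance = np.abs(coef).mean(axis=0)

# Largest magnitude first, as a chart would typically order the bars.
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)
```

Taking absolute values before averaging matters: a feature that pushes strongly toward one class and away from another would otherwise average out to roughly zero.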

Hyperparameters

Pass these on the model node's hyperparams (all optional):

| Key | Default | Meaning |
| --- | --- | --- |
| C | 1.0 | Inverse regularization strength; smaller means stronger regularization |
| max_iter | 1000 | Solver iteration cap (raised above scikit-learn's default of 100 because the standardized + one-hot space often needs more) |
| class_weight | None | Set to "balanced" to up-weight minority classes on imbalanced data |

Regularization is L2 by default (controlled by C); scikit-learn's penalty argument is deprecated and not exposed.
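
As a rough picture of how optional hyperparams could map onto the estimator, the sketch below merges user-supplied keys over the documented defaults. The helper build_classifier and the choice to reject unknown keys are hypothetical; only the three keys and their defaults come from the table above.

```python
from sklearn.linear_model import LogisticRegression

# Documented defaults for the three exposed keys.
DEFAULTS = {"C": 1.0, "max_iter": 1000, "class_weight": None}


def build_classifier(hyperparams=None):
    """Hypothetical helper: overlay user hyperparams on the defaults."""
    overrides = hyperparams or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Assumption: unexposed keys are rejected rather than passed through.
        raise ValueError(f"unsupported hyperparameters: {sorted(unknown)}")
    return LogisticRegression(**{**DEFAULTS, **overrides})


# Stronger regularization plus class re-weighting for an imbalanced target.
clf = build_classifier({"C": 0.1, "class_weight": "balanced"})
```

Lowering C shrinks coefficients toward zero, which also flattens the importance chart, so compare importance charts only between runs that use the same C.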

Limitations

  • Linear decision boundary. Logistic regression models a linear relationship between features and the log-odds. On its own it cannot capture feature interactions or non-linear effects — if accuracy lags the LightGBM classifier badly, that is usually why. Engineer interaction features upstream, or use the LightGBM classifier.
  • One-hot blow-up on high-cardinality columns. A categorical feature with hundreds of distinct values becomes hundreds of indicator columns. Prefer the LightGBM classifier (native categorical splits) for high-cardinality features, or reduce cardinality upstream.
  • Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool — use the forecast template.
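
The linear-boundary limitation, and the effect of engineering an interaction feature upstream, can be seen on synthetic data. In this sketch the label depends only on the sign of the product of two features, so no straight line separates the classes; the data, seed, and helper name are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))

# The label depends only on the interaction x0 * x1 (an XOR-like pattern).
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Upstream feature engineering: append the product as a third column.
X_with_interaction = np.column_stack([X, X[:, 0] * X[:, 1]])


def holdout_accuracy(features, labels):
    """Fit on a random 75% split and score on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=0
    )
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)


acc_raw = holdout_accuracy(X, y)                       # near chance level
acc_interaction = holdout_accuracy(X_with_interaction, y)  # close to perfect
```

A tree-based model would find this pattern from the raw columns on its own, which is why a large accuracy gap to the LightGBM classifier often points to missing interaction or non-linear terms.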

See also

  • lightgbm-classifier-v1.md — gradient-boosted sister; non-linear, native categorical handling, usually higher accuracy.
  • lightgbm-regressor-v1.md — numeric-target sister with the same fit/eval shape.

Not sure which to pick?

Choosing a classification algorithm

LightGBM vs Logistic regression vs Random forest for predicting a category — start linear, go to trees when the boundary curves, and handle imbalanced classes.