Classification algorithm

Logistic regression

A linear classifier for predicting a categorical label from feature columns. Trains a scikit-learn LogisticRegression behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus coefficient-based feature importance.

It is the interpretable linear baseline alongside lightgbm-classifier-v1 — pick it when you want a fast, transparent model whose coefficients you can read, or as a yardstick for judging whether a more complex model is earning its keep.

What it does

You point it at a DataSource and pick:

a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Unlike the LightGBM classifier, which consumes categoricals natively, logistic regression needs every feature to be numeric, so the model performs that conversion internally; there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model plus an eval_result carrying the metrics, the predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to that of the LightGBM classifier, using the same classifier_train / classifier_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► classifier_train      classifier_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — logistic regression and LightGBM share the exact same scoring + metric code.
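
The split between a fit-only step and a separate scoring step can be sketched in plain scikit-learn. This is a minimal illustration, not the nodes' actual implementation: the two function bodies, the synthetic data, and the 25% hold-out fraction are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def classifier_train(model, X_train, y_train):
    # Fit-only: no predictions or metrics are produced here.
    return model.fit(X_train, y_train)


def classifier_eval(trained_model, X_test, y_test):
    # The only prediction pass happens on the held-out rows.
    y_pred = trained_model.predict(X_test)
    return {
        "predictions": y_pred,
        "accuracy": float((y_pred == y_test).mean()),
    }


# Synthetic, linearly separable data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Random hold-out split; the 25% test fraction is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

trained_model = classifier_train(LogisticRegression(max_iter=1000), X_train, y_train)
eval_result = classifier_eval(trained_model, X_test, y_test)
```

Because classifier_eval only needs an object with a predict method, the same function would score any fitted classifier, which is the sense in which the evaluation step is algorithm-agnostic.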

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

Feature kind	Steps
Numeric / boolean	impute missing values with the median → standardize to zero mean, unit variance
String / categorical	impute missing values with the most frequent value → one-hot encode

The standardization matters twice over: it helps the optimizer converge, and it puts the fitted coefficients on a comparable scale for the importance chart.

The whole fitted pipeline — imputers, scaler, encoder, and coefficients — is serialized as one unit, so inference replays exactly what was fit.
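
As a rough sketch, the steps in the table above map onto scikit-learn building blocks as follows. The step names, the build_pipeline helper, and the toy columns are hypothetical; the actual model may be assembled differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    # Numeric / boolean: median imputation, then standardization.
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # String / categorical: most-frequent imputation, then one-hot encoding.
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(C=1.0, max_iter=1000)),
    ])


# Hypothetical toy data with one missing value in each feature column.
X = pd.DataFrame({
    "tenure": [1.0, 24.0, np.nan, 36.0, 3.0, 48.0],
    "plan": pd.Series(["basic", "pro", "basic", np.nan, "basic", "pro"], dtype=object),
})
y = np.array(["yes", "no", "yes", "no", "yes", "no"])

pipe = build_pipeline(["tenure"], ["plan"])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Keeping the imputers, scaler, and encoder inside one Pipeline object is what allows the fitted statistics (medians, means, category lists) to be serialized and replayed together with the coefficients.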

Missing values & unseen categories

Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; logistic regression does not, so this step is required — the model handles it so you don't have to.
Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
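
The all-zero behaviour for unseen categories matches what scikit-learn's OneHotEncoder does with handle_unknown="ignore". The sketch below assumes that setting is the one in use; the category values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories (sorted internally as "basic", "pro").
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["basic"], ["pro"], ["basic"]], dtype=object))

# A category seen during training sets its indicator column to 1.
known = encoder.transform(np.array([["pro"]], dtype=object)).toarray()

# A category never seen during training yields all zeros instead of an error.
unseen = encoder.transform(np.array([["enterprise"]], dtype=object)).toarray()

print(known)   # [[0. 1.]]
print(unseen)  # [[0. 0.]]
```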

Metric set

Same as the LightGBM classifier — the eval step is shared:

Metric	Meaning
Accuracy	Fraction of test rows classified correctly
Precision / Recall / F1	Weighted averages across classes (binary: of the positive class)
ROC AUC	Ranking quality — binary single score, multi-class macro one-vs-rest
Confusion matrix	Per-class true-vs-predicted counts
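
One way to compute this metric set with scikit-learn is sketched below. The classification_metrics helper and its argument layout are hypothetical, and the shared eval step may compute the values differently; the averaging choices follow the table above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_proba, labels):
    # y_proba has one column per entry of `labels`, in the same order.
    if len(labels) == 2:
        # Binary: precision / recall / F1 of the positive class, single AUC.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="binary", pos_label=labels[1], zero_division=0
        )
        auc = roc_auc_score(y_true, y_proba[:, 1])
    else:
        # Multi-class: weighted averages, macro one-vs-rest AUC.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="weighted", labels=labels, zero_division=0
        )
        auc = roc_auc_score(
            y_true, y_proba, multi_class="ovr", average="macro", labels=labels
        )
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "roc_auc": float(auc),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }


# Hypothetical binary example with four test rows.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
y_proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.35, 0.65], [0.2, 0.8]])

metrics = classification_metrics(y_true, y_pred, y_proba, labels=[0, 1])
```

In this toy case three of four rows are correct (accuracy 0.75), the positive class has precision 2/3 and recall 1.0, and every positive row is ranked above every negative row, so ROC AUC is 1.0 even though accuracy is not.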

Feature importance

The chart shows standardized-coefficient magnitude: |coefficient| for each (one-hot-expanded) feature. Because numeric features are scaled to unit variance before the fit, these magnitudes are roughly comparable across columns; one-hot indicator columns are not rescaled, so read their bars as approximate. For multi-class models the per-class coefficient rows are collapsed by mean absolute value.

These are linear coefficients, not split gains — they are not numerically comparable to the LightGBM classifier's importance bars, and they describe linear effects only.
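
The collapse from a coefficient matrix to one bar per feature can be sketched with NumPy alone. The coefficient_importance helper, the feature names, and the coefficient values are hypothetical; in a real pipeline the matrix would come from the fitted LogisticRegression's coef_ attribute and the names from the preprocessing step's get_feature_names_out().

```python
import numpy as np


def coefficient_importance(coef, feature_names):
    # coef has shape (n_classes_or_1, n_features), like LogisticRegression.coef_.
    coef = np.atleast_2d(coef)
    # Mean absolute value over the class rows; for a binary model there is
    # a single row, so this is just |coefficient|.
    magnitude = np.abs(coef).mean(axis=0)
    order = np.argsort(magnitude)[::-1]
    return [(feature_names[i], float(magnitude[i])) for i in order]


names = ["tenure", "plan_basic", "plan_pro"]

# Binary model: one coefficient row.
binary_ranked = coefficient_importance(np.array([[0.5, -2.0, 0.1]]), names)

# Three-class model: one coefficient row per class.
multi_coef = np.array([
    [1.0, -1.0, 0.0],
    [-3.0, 0.5, 0.0],
    [2.0, 0.5, 0.0],
])
multi_ranked = coefficient_importance(multi_coef, names)
```

Taking the absolute value before averaging matters: in the three-class example the tenure coefficients sum to zero across classes, yet the feature is clearly influential, and the mean absolute value (2.0) reflects that.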

Hyperparameters

Pass these on the model node's hyperparams (all optional):

Key	Default	Meaning
`C`	`1.0`	Inverse regularization strength — smaller means stronger regularization
`max_iter`	`1000`	Solver iteration cap (raised above scikit-learn's default of 100 because the standardized + one-hot space often needs more)
`class_weight`	`None`	Set to `"balanced"` to up-weight minority classes on imbalanced data

Regularization is L2 by default (controlled by C); scikit-learn's penalty argument is deprecated and not exposed.
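
A sketch of how a hyperparams mapping could be merged with the defaults in the table and handed to scikit-learn follows. The make_estimator helper and the rejection of unknown keys are assumptions for illustration, not documented behaviour of the model node.

```python
from sklearn.linear_model import LogisticRegression

# Defaults taken from the table above.
DEFAULTS = {"C": 1.0, "max_iter": 1000, "class_weight": None}


def make_estimator(hyperparams=None):
    hyperparams = dict(hyperparams or {})
    # Assumption: keys outside the documented set are treated as errors.
    unknown = set(hyperparams) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unsupported hyperparameters: {sorted(unknown)}")
    return LogisticRegression(**{**DEFAULTS, **hyperparams})


# Stronger regularization plus class re-weighting for imbalanced labels;
# max_iter is left at its default.
estimator = make_estimator({"C": 0.1, "class_weight": "balanced"})
```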

Limitations

Linear decision boundary. Logistic regression models a linear relationship between features and the log-odds. On its own it cannot capture feature interactions or non-linear effects — if accuracy lags the LightGBM classifier badly, that is usually why. Engineer interaction features upstream, or use the LightGBM classifier.
One-hot blow-up on high-cardinality columns. A categorical feature with hundreds of distinct values becomes hundreds of indicator columns. Prefer the LightGBM classifier (native categorical splits) for high-cardinality features, or reduce cardinality upstream.
Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool — use the forecast template.
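
The linear-boundary limitation, and the effect of engineering an interaction feature upstream, can be seen on a small synthetic example. The data below is made up: the label depends only on the sign of the product of two features, which no straight line in the original two columns can separate. Accuracy is measured on the training rows purely to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic XOR-like data: the label is 1 when both features share a sign.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Raw features only: the model has no way to express the interaction.
raw_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Add the product as an explicit third column (the upstream feature
# engineering step), after which the classes become linearly separable.
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
inter_acc = LogisticRegression(max_iter=1000).fit(X_inter, y).score(X_inter, y)

print(f"raw features:     {raw_acc:.2f}")
print(f"with interaction: {inter_acc:.2f}")
```

A tree-based model such as the LightGBM classifier can discover this kind of interaction through its splits, which is why a large accuracy gap between the two models often points to non-linear structure in the data.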