# How to Handle Imbalanced Data
When one class dominates your dataset (e.g., 99% Non-Fraud, 1% Fraud), standard ML algorithms become heavily biased toward the majority class.
## Why Imbalance is Dangerous
If a model blindly predicts "Non-Fraud" for every single transaction, it achieves 99% Accuracy. However, detecting fraud is the entire business objective. Standard algorithms minimise overall error, sacrificing the minority class entirely to achieve high global accuracy.
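This "accuracy paradox" is easy to demonstrate. The sketch below uses a hypothetical 990/10 label split to show that a model predicting Non-Fraud for everything scores 99% accuracy while catching zero fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 non-fraud (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that blindly predicts Non-Fraud for every transaction
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- detects zero fraud
```

This is why accuracy alone is the wrong metric here; recall on the minority class reveals the failure immediately.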
## Method 1: Algorithmic Class Weights

The simplest solution is adjusting `class_weight`. Most scikit-learn classifiers accept `"balanced"`, which automatically increases the penalty for misclassifying minority observations.
```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample the diamonds dataset; "Premium" cut is the minority class
df = sns.load_dataset("diamonds").sample(1000, random_state=42)
X = df[["carat", "depth", "table"]]
y = (df["cut"] == "Premium").astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# class_weight="balanced" up-weights errors on the rarer class
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```
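Under the hood, `"balanced"` weights each class by `n_samples / (n_classes * count_of_class)`, so rarer classes get proportionally larger weights. You can verify this with scikit-learn's `compute_class_weight` on a toy 90/10 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy 90/10 imbalance (illustrative, not the diamonds data)
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
# 100 / (2 * 90) ~= 0.56 for the majority, 100 / (2 * 10) = 5.0 for the minority
print(dict(zip([0, 1], weights)))
```

Misclassifying one minority observation therefore costs the model roughly nine times as much as one majority observation.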
## Method 2: Synthetic Minority Oversampling (SMOTE)
Instead of reweighting the algorithm, SMOTE synthetically generates new minority-class data points by interpolating between existing minority observations and their nearest neighbours. It requires the imbalanced-learn library: `pip install imbalanced-learn`.
```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split FIRST, then oversample only the training data (see rule below)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(f"Original training shape:  {X_train.shape}")
print(f"Resampled training shape: {X_train_res.shape}")
print(f"Resampled class counts:   {Counter(y_train_res)}")
```
> **Critical Rule:** Never apply SMOTE before `train_test_split`. If you synthesise data before splitting, synthetic copies of minority observations will leak into the test set, giving you artificially inflated scores that do not reflect real-world performance.
## Method 3: Threshold Tuning
By default, classifiers use a 0.5 probability threshold when `predict` converts probabilities into class labels. For imbalanced problems, lowering the threshold (e.g., to 0.3) increases Recall on the minority class at the cost of Precision.
```python
import numpy as np
from sklearn.metrics import classification_report

# Probability of the positive (minority) class
probs = model.predict_proba(X_te)[:, 1]

# Lower the decision threshold from 0.5 to 0.3 to favour Recall
custom_preds = (probs >= 0.3).astype(int)
print(classification_report(y_te, custom_preds))
```
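Rather than guessing 0.3, you can search for a threshold systematically. A minimal sketch using `precision_recall_curve`, with a `make_classification` toy model standing in for the one trained above (an assumption, so the block is self-contained), picking the threshold that maximises F1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy stand-in for the model and data above (assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = model.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)

# F1 at every candidate threshold; epsilon guards against division by zero
f1 = 2 * precision * recall / (precision + recall + 1e-12)

# thresholds is one element shorter than precision/recall, so drop the last F1
best = thresholds[np.argmax(f1[:-1])]
print(f"Best F1 threshold: {best:.2f}")
```

In practice the "best" threshold depends on the business cost of false negatives versus false positives, so F1 is only one reasonable default objective.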
## KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |