Train/Test Split

The most fundamental validation step: hold out a portion of your data that the model never sees during training, then evaluate on it.

Why Split?

If you evaluate a model on the same data it trained on, you measure how well it memorises, not how well it generalises. The test set simulates unseen, real-world data.
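
The gap between memorising and generalising can be made visible with a small sketch. It assumes a synthetic dataset and an unpruned decision tree (both chosen here purely for illustration) and compares accuracy on the training rows with accuracy on held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; every setting here is an illustrative assumption
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# An unpruned tree is free to fit the training rows almost perfectly
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # scored on rows the model has seen
test_acc = tree.score(X_te, y_te)   # scored on rows the model has not seen
print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy:  {test_acc:.4f}")
```

The training score is typically close to perfect, while the test score is the more honest estimate of how the model would behave on new data.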

Implementation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y           # Preserve class proportions
)

model = RandomForestClassifier(random_state=42)
model.fit(X_tr, y_tr)
# Score predictions on the held-out test set
y_pred = model.predict(X_te)
print(f"Test Accuracy: {accuracy_score(y_te, y_pred):.4f}")

Key Parameters

Parameter	Purpose
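`test_size`	Fraction (float) or absolute number (int) of samples held out for testing; defaults to 0.25 when `train_size` is also unset
`random_state`	Seed for the shuffle, so the split is reproducible
`stratify=y`	Preserve the class distribution in both parts; recommended for classification, especially with imbalanced classes (requires `shuffle=True`)
`shuffle=True`	Shuffle before splitting (default); set to False for time series so the test set comes after the training data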
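
As a rough illustration of `stratify` and `shuffle`, the sketch below uses made-up labels with a 10% positive rate and an index column standing in for time order. The data and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 900 negatives, 100 positives
X = np.arange(1000).reshape(-1, 1)  # the row index stands in for time order
y = np.array([0] * 900 + [1] * 100)

# stratify=y keeps the 10% positive rate in both parts
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Positive rate, train: {train_pos_rate:.3f}, test: {test_pos_rate:.3f}")

# shuffle=False keeps the original order, so the last 20% becomes the test set.
# stratify has to stay None in this mode.
X_past, X_future = train_test_split(X, test_size=0.2, shuffle=False)
last_train_idx = X_past[-1, 0]
first_test_idx = X_future[0, 0]
print(f"Last train index: {last_train_idx}, first test index: {first_test_idx}")
```

Without `stratify`, a random split of imbalanced data can leave the test set with noticeably more or fewer positives than the training set, which makes the test score harder to interpret.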

Train / Validation / Test Split

For hyperparameter tuning, you need three sets:

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: train and validation from the remaining 80%
# test_size=0.25 of that 80% is 20% of the full data, giving a 60/20/20 split
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

Set	Purpose
Train	Fit the model
Validation	Tune hyperparameters and select the best model
Test	Final, unbiased evaluation — touch it once
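
A minimal sketch of how the three sets might be used together follows. The candidate `max_depth` values and the choice to refit on train plus validation before the final test are assumptions for illustration, not the only reasonable workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 60/20/20 split using the same two-step pattern as above
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Candidate values are arbitrary, chosen only for illustration
candidate_depths = [2, 5, 10, None]
val_scores = {}
for depth in candidate_depths:
    model = RandomForestClassifier(
        max_depth=depth, n_estimators=50, random_state=42
    )
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)  # compare on validation only

best_depth = max(val_scores, key=val_scores.get)
print(f"Best max_depth on validation: {best_depth}")

# Refit on train + validation, then evaluate on the test set exactly once
final_model = RandomForestClassifier(
    max_depth=best_depth, n_estimators=50, random_state=42
)
final_model.fit(X_temp, y_temp)
test_acc = final_model.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
```

The test set plays no part in choosing `best_depth`, so the final score is not biased by the selection step.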

Common Pitfall

Never tune hyperparameters on the test set. If you do, the test score is no longer an unbiased estimate of real-world performance. Use cross-validation or a separate validation set for tuning.
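
The cross-validation route can be sketched with scikit-learn's `GridSearchCV`, run on the training portion only. The parameter grid below is a small, arbitrary assumption to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Set the test set aside before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid; real searches usually cover more values
param_grid = {"max_depth": [3, None], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation inside the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Mean CV accuracy: {search.best_score_:.4f}")

# The refitted best model is scored on the test set once, at the very end
test_acc = search.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
```

Here the cross-validation folds take the place of a fixed validation set, which tends to help when data is too scarce to hold out a separate validation portion.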

KSB Mapping

KSB	Description	How This Addresses It
K4.4	Resource constraints and trade-offs	Balancing model complexity, performance, and computational cost
S1	Scientific methods and hypothesis testing	Rigorous cross-validation and statistical model comparison
S4	Building models and validating	Systematic hyperparameter tuning and performance evaluation
B5	Impartial, hypothesis-driven approach	Preventing overfitting; honest reporting of generalisation metrics