Train/Test Split

The most fundamental validation step: hold out a portion of your data that the model never sees during training, then evaluate on it.

Why Split?

If you evaluate a model on the same data it trained on, you measure how well it memorises, not how well it generalises. The test set simulates unseen, real-world data.
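The memorisation/generalisation gap is easy to demonstrate: a minimal sketch, using an unpruned decision tree on deliberately noisy synthetic data (the `flip_y` noise level and dataset sizes here are illustrative choices, not from the text above).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Deliberately noisy labels so memorisation and generalisation diverge
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned tree can fit the training set almost perfectly
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"Train accuracy: {tree.score(X_tr, y_tr):.2f}")  # near-perfect
print(f"Test accuracy:  {tree.score(X_te, y_te):.2f}")  # noticeably lower
```

The training score measures memorisation; only the held-out score estimates performance on unseen data.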

Implementation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y           # Preserve class proportions
)

model = RandomForestClassifier(random_state=42)
model.fit(X_tr, y_tr)
print(f"Test Accuracy: {model.score(X_te, y_te):.4f}")

Key Parameters

| Parameter | Purpose |
|---|---|
| test_size | Fraction of data for testing (default 0.25) |
| random_state | Seed for reproducibility |
| stratify=y | Preserve class distribution — always use for classification |
| shuffle=True | Shuffle before splitting (default) — set False for time series |
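To see what stratify=y buys you, here is a small sketch on an artificially imbalanced label vector (the 90/10 class ratio is an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90% class 0, 10% class 1
y = np.array([0] * 900 + [1] * 100)
X = np.arange(1000).reshape(-1, 1)

# Stratified split: the 10% positive rate carries over to the test set
_, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
print(np.bincount(y_te))  # 180 negatives, 20 positives
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance, skewing the evaluation.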

Train / Validation / Test Split

For hyperparameter tuning, you need three sets:

# First split: hold out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: 0.25 of the remaining 80% equals 20% of the total,
# giving a 60/20/20 train/validation/test split
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

| Set | Purpose |
|---|---|
| Train | Fit the model |
| Validation | Tune hyperparameters and select the best model |
| Test | Final, unbiased evaluation — touch it once |
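Putting the three sets to work might look like the sketch below: tune max_depth on the validation set, then touch the test set exactly once with the chosen model. The max_depth grid is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Tune on the validation set only; the test set stays untouched
best_depth, best_score = None, -1.0
for depth in [2, 5, 10, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# One final fit and one final evaluation with the chosen setting
final = RandomForestClassifier(max_depth=best_depth, random_state=42).fit(X_train, y_train)
print(f"Chosen max_depth={best_depth}, test accuracy={final.score(X_test, y_test):.3f}")
```

Because max_depth was chosen using only the validation score, the single test-set evaluation remains an honest estimate of generalisation.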

Common Pitfall

Never tune hyperparameters on the test set. If you do, the test score is no longer an unbiased estimate of real-world performance. Use cross-validation or a separate validation set for tuning.
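The cross-validation route can be sketched with GridSearchCV, which cross-validates entirely within the training set (the parameter grid here is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 5-fold cross-validation within the training set only;
# the test set is never seen during tuning
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, None]},
    cv=5,
)
search.fit(X_tr, y_tr)
print(f"Best params: {search.best_params_}")
print(f"Test accuracy: {search.score(X_te, y_te):.3f}")
```

GridSearchCV refits the best configuration on the full training set, so the single call to score on the test data is the only time that data influences any number you report.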

KSB Mapping

| KSB | Description | How This Addresses It |
|---|---|---|
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |