Skip to content

Scikit-Learn Model Selection Reference

Quick-reference for the sklearn.model_selection module — the toolkit for splitting, validating, and tuning models.

Splitting

Class / Function Purpose
train_test_split(X, y) Single random split into train and test sets
StratifiedShuffleSplit Repeated random splits preserving class proportions
TimeSeriesSplit Expanding-window splits for time-ordered data

Cross-Validation

Function Purpose
cross_val_score(model, X, y, cv=5) Returns array of scores for each fold
cross_validate(model, X, y, cv=5) Returns dict with fit time, score time, and test scores
cross_val_predict(model, X, y, cv=5) Returns out-of-fold predictions for every sample

CV Splitters

Splitter Use Case
KFold(n_splits=5) Standard k-Fold (regression)
StratifiedKFold(n_splits=5) Preserves class ratios per fold (classification)
RepeatedStratifiedKFold(n_splits=5, n_repeats=3) Repeated stratified for more stable estimates
LeaveOneOut() Every sample used as a test set once — expensive
Class Strategy
GridSearchCV Exhaustive search over a parameter grid
RandomizedSearchCV Random sampling from parameter distributions
HalvingGridSearchCV Successive halving — efficient for large grids

Quick Example

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X, y)

print(f"Best params: {grid.best_params_}")
print(f"Best CV Accuracy: {grid.best_score_:.4f}")

KSB Mapping

KSB Description How This Addresses It
K4.4 Resource constraints and trade-offs Balancing model complexity, performance, and computational cost
S1 Scientific methods and hypothesis testing Rigorous cross-validation and statistical model comparison
S4 Building models and validating Systematic hyperparameter tuning and performance evaluation
B5 Impartial, hypothesis-driven approach Preventing overfitting; honest reporting of generalisation metrics