Scikit-Learn Model Selection Reference
Quick reference for the sklearn.model_selection module, the toolkit for splitting data, validating models, and tuning hyperparameters.
Splitting
| Class / Function | Purpose |
| --- | --- |
| train_test_split(X, y) | Single random split into train and test sets |
| StratifiedShuffleSplit | Repeated random splits preserving class proportions |
| TimeSeriesSplit | Expanding-window splits for time-ordered data |
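
The three splitting tools above can be sketched side by side. This is a minimal illustration on toy data: the array shapes, the 80/20 class balance, and the `test_size` values are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.model_selection import (
    StratifiedShuffleSplit,
    TimeSeriesSplit,
    train_test_split,
)

# Toy data (illustrative only): 100 samples with an 80/20 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Single random split; stratify=y keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)

# Repeated random splits, each preserving class proportions.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    print("positive rate in test split:", y[test_idx].mean())

# Expanding-window splits: training indices always precede test indices.
tscv = TimeSeriesSplit(n_splits=4)
ts_splits = list(tscv.split(X))
for train_idx, test_idx in ts_splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Passing `stratify=y` to `train_test_split` is optional; without it the split is purely random and small classes may end up under-represented in the test set.
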
Cross-Validation
| Function | Purpose |
| --- | --- |
| cross_val_score(model, X, y, cv=5) | Returns array of scores for each fold |
| cross_validate(model, X, y, cv=5) | Returns dict with fit time, score time, and test scores |
| cross_val_predict(model, X, y, cv=5) | Returns out-of-fold predictions for every sample |
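
To make the return types in the table concrete, here is a short sketch that calls all three functions on the same model. The synthetic dataset, the choice of `LogisticRegression`, and the two metrics passed to `scoring` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    cross_val_predict,
    cross_val_score,
    cross_validate,
)

# Synthetic data and model, chosen only to keep the example fast.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# One score per fold, as a NumPy array of length cv.
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape, scores.mean())

# Dict of arrays: timings plus one "test_<metric>" entry per requested metric.
results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
print(sorted(results.keys()))

# One prediction per sample, each made by a model that did not train on it.
oof_pred = cross_val_predict(model, X, y, cv=5)
print(oof_pred.shape)
```
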
CV Splitters
| Splitter | Use Case |
| --- | --- |
| KFold(n_splits=5) | Standard k-fold without stratification (typical for regression) |
| StratifiedKFold(n_splits=5) | Preserves class ratios per fold (classification) |
| RepeatedStratifiedKFold(n_splits=5, n_repeats=3) | Repeats stratified k-fold with different randomisation for more stable estimates |
| LeaveOneOut() | Every sample used as a test set once; expensive, since it fits one model per sample |
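
The difference between plain and stratified folds is easiest to see on sorted, imbalanced labels. The sketch below counts positives per test fold; the 40/10 label layout is an assumption picked so the effect is obvious.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Illustrative data: 50 samples, labels sorted so all positives come last.
X = np.zeros((50, 3))
y = np.array([0] * 40 + [1] * 10)

# KFold without shuffling cuts contiguous blocks, so positives land in one fold.
kf_pos = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]
print(kf_pos)  # [0, 0, 0, 0, 10]

# StratifiedKFold spreads each class evenly across the folds.
skf_pos = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]
print(skf_pos)  # [2, 2, 2, 2, 2]

# LeaveOneOut produces one split per sample, hence the cost.
n_loo = LeaveOneOut().get_n_splits(X)
print(n_loo)  # 50
```

Any of these splitter objects can be passed as the `cv` argument of the cross-validation functions and search classes in this module.
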
Hyperparameter Search
| Class | Strategy |
| --- | --- |
| GridSearchCV | Exhaustive search over a parameter grid |
| RandomizedSearchCV | Random sampling from parameter distributions |
| HalvingGridSearchCV | Successive halving; efficient for large grids (experimental, enabled by importing enable_halving_search_cv from sklearn.experimental) |
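
As a counterpart to grid search, random sampling from distributions might look like the following. This is a sketch under assumed settings: the dataset size, the `randint(10, 50)` range, and `n_iter=5` are illustrative values kept small so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions are sampled; plain lists are sampled uniformly.
param_distributions = {
    "n_estimators": randint(10, 50),  # integers in [10, 50)
    "max_depth": [3, 5, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled candidates, independent of grid size
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # one entry per sampled candidate
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is why this strategy tends to scale better than an exhaustive grid when many parameters are involved.
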
Quick Example
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 3 x 4 = 12 candidate combinations
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
}

# Shuffled, stratified 5-fold CV with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,  # use all available CPU cores
)
grid.fit(X, y)

print(f"Best params: {grid.best_params_}")
print(f"Best CV Accuracy: {grid.best_score_:.4f}")
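
Because the best CV score is the maximum over all candidates tried, it tends to be somewhat optimistic. One common way to get a cleaner estimate is to hold out a test set before searching. The variant below is a sketch of that idea, not part of the original example; the smaller grid, the 80/20 split, and `cv=3` are assumptions made to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Set aside a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    {"max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)  # tuning uses the training portion only

# score() evaluates the refitted best estimator with the search's scoring metric.
test_accuracy = grid.score(X_test, y_test)
print(f"Best CV accuracy:  {grid.best_score_:.4f}")
print(f"Held-out accuracy: {test_accuracy:.4f}")
```
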
KSB Mapping
| KSB | Description | How This Addresses It |
| --- | --- | --- |
| K4.4 | Resource constraints and trade-offs | Balancing model complexity, performance, and computational cost |
| S1 | Scientific methods and hypothesis testing | Rigorous cross-validation and statistical model comparison |
| S4 | Building models and validating | Systematic hyperparameter tuning and performance evaluation |
| B5 | Impartial, hypothesis-driven approach | Preventing overfitting; honest reporting of generalisation metrics |