How to Save and Load Machine Learning Models¶

Training a Random Forest on a large dataset can take hours. You must save the trained model to disk so you do not have to retrain it every time you need predictions.

Using `joblib`¶

The standard pickle module works, but scikit-learn recommends joblib because it is optimised for large NumPy arrays (which underpin all model weights).

Saving a Trained Model¶

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Save to disk
joblib.dump(model, "random_forest_v1.pkl")
print("Model saved successfully.")

Loading for Inference¶

import joblib

# Load the saved model
loaded_model = joblib.load("random_forest_v1.pkl")

# Use it for predictions without retraining
preds = loaded_model.predict(X_test)
print(f"Accuracy: {loaded_model.score(X_test, y_test):.2f}")

Versioning Best Practices¶

Include a version number in the filename (e.g., model_v2.pkl) so you can roll back.
Save alongside metadata — record the training date, dataset hash, hyperparameters, and evaluation scores.
Never load untrusted .pkl files — pickle can execute arbitrary code during deserialisation.

Common Pitfall

The scikit-learn version used to save the model must match the version used to load it. Upgrading scikit-learn can break deserialisation. Pin your version in requirements.txt.

KSB Mapping¶

KSB	Description	How This Addresses It
K4.1	Statistical models and methods	Understanding the statistical basis of regression and classification
K4.2	ML and AI techniques	Implementing and comparing supervised learning algorithms
K4.4	Resource constraints and trade-offs	Model complexity vs interpretability; computational cost
S1	Scientific methods and hypothesis testing	Formulating hypotheses and testing with rigorous validation
S4	Building models and validating	Cross-validation, train/test evaluation, performance metrics
B5	Impartial, hypothesis-driven approach	Honest evaluation of model performance and limitations