Handling Missing Values

Missing data is inevitable. How you handle it dictates the integrity of your entire ML pipeline.

What You Will Learn

  • Drop missing values safely when data loss is acceptable
  • Use Scikit-Learn SimpleImputer to replace missing values systematically
  • Visualise the statistical impact of imputation on your target distributions

Prerequisites

  • Completed the Loading & Exploring Data tutorial
  • Basic understanding of mean, median, and mode

Step 1: Drop Missing Values

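We will use the Titanic dataset from seaborn's sample data, whose passenger records are famously messy and incomplete. Note that sns.load_dataset downloads the file on first use, so you need an internet connection. pandas built-ins keep the code concise.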

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('titanic')

# Check baseline missing values (count once, then show only columns with gaps)
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])
Expected Output
age            177
embarked         2
deck           688
embark_town      2
dtype: int64

If a column is overwhelmingly empty (like deck), drop the column. If only a tiny fraction of rows are missing (like embarked), drop those specific rows.

# Drop the 'deck' column completely
df_dropped = df.drop(columns=['deck'])

# Drop rows where 'embarked' is missing
df_dropped = df_dropped.dropna(subset=['embarked'])

print(f"Original shape: {df.shape} | New shape: {df_dropped.shape}")
Expected Output
Original shape: (891, 15) | New shape: (889, 14)
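
Raw counts are hard to judge without the dataset size, so it can help to look at the share of missing values per column before deciding what to drop. The sketch below uses a small hand-built frame (the `toy` data is invented for illustration and is not the Titanic dataset) to show one way of doing this with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, np.nan, np.nan, "C"],
    "embarked": ["S", "C", "S", "S"],
})

# isnull().mean() gives the fraction of missing values in each column
missing_pct = toy.isnull().mean().mul(100).round(1)

# Keep only columns that actually have gaps, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts shown earlier, deck is missing in 688 of 891 rows (about 77%), while embarked is missing in 2 of 891 rows (about 0.2%), which is why the two columns are treated differently.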

Workplace Tip

Never call df.dropna() blindly. With no arguments, it drops every row that contains even a single missing value. In a dataset with 50 columns, dropna() could easily delete 80% of your otherwise usable rows!
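
To make the tip concrete, here is a minimal sketch on a hand-built frame (the `toy` data is invented for illustration). It compares a bare dropna() with the more targeted `subset` and `thresh` arguments, both of which are standard pandas parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row has at least one gap somewhere
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "S", "Q"],
})

# Bare dropna(): removes any row with at least one missing value
all_rows = toy.dropna()

# Targeted: only remove rows where 'embarked' itself is missing
targeted = toy.dropna(subset=["embarked"])

# thresh: keep rows that have at least 2 non-missing values
lenient = toy.dropna(thresh=2)

print(len(toy), len(all_rows), len(targeted), len(lenient))
# prints: 5 0 4 3
```

The bare call wipes out the whole frame, while the targeted call removes only the single row that is missing the column you care about.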

Step 2: Basic Statistical Imputation

For the age column, dropping 177 rows would mean losing roughly 20% of our dataset (177 of 891 rows). Instead, we can impute (fill in) the missing values using Scikit-Learn's SimpleImputer.

from sklearn.impute import SimpleImputer
import numpy as np

# Instantiate the imputer to fill with the 'mean' strategy
imputer = SimpleImputer(strategy='mean')

# Fit and transform the 'age' column. SimpleImputer expects 2D input, so we
# pass [['age']]; ravel() flattens the 2D result back into a single column.
df_imputed = df_dropped.copy()
df_imputed['age'] = imputer.fit_transform(df_imputed[['age']]).ravel()

print(f"Missing ages before: {df_dropped['age'].isnull().sum()}")
print(f"Missing ages after: {df_imputed['age'].isnull().sum()}")
Expected Output
Missing ages before: 177
Missing ages after: 0
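
SimpleImputer is not limited to the mean. As a rough sketch, the same pattern can fill a categorical column with its mode by using strategy='most_frequent'. The `toy` frame and the `embarked_filled` column name below are invented for illustration; this is an alternative to dropping rows, not what the steps above do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two missing ports of embarkation
toy = pd.DataFrame({"embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# 'most_frequent' fills gaps with the mode and works on text columns
mode_imputer = SimpleImputer(strategy="most_frequent")
toy["embarked_filled"] = mode_imputer.fit_transform(toy[["embarked"]]).ravel()

# statistics_ holds the fill value learned for each column
print(mode_imputer.statistics_[0])
print(toy)
```

Here the mode is "S", so both gaps are filled with "S".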

Step 3: Visualise the Impact of Imputation

Whenever you inject synthetic data via imputation, you must verify that you haven't fundamentally distorted the original distribution.

plt.figure(figsize=(10, 5))

# Plot original age distribution (ignoring NaNs)
sns.histplot(data=df_dropped, x='age', bins=30, kde=True, color='skyblue', label='Original')

# Plot the imputed age distribution
sns.histplot(data=df_imputed, x='age', bins=30, kde=True, color='red', alpha=0.3, label='Mean Imputed')

plt.title('Impact of Mean Imputation on Age Distribution')
plt.legend()
plt.tight_layout()
plt.show()
Expected Plot

[Plot: Impact of Imputation]

As the plot shows, inserting the same mean value 177 times creates a large artificial spike at the centre of the distribution and shrinks its variance. This is the primary danger of mean imputation!
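
The distortion can also be checked numerically. The sketch below uses a short invented series of ages (not the Titanic data) and pandas fillna, which for a single column gives the same result as SimpleImputer with strategy='mean'. It shows that the mean is preserved while the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three missing entries, for illustration only
ages = pd.Series([22, 38, 26, 35, np.nan, 54, 2, np.nan, 14, np.nan])

# Mean imputation: every gap receives the same value
filled = ages.fillna(ages.mean())

# pandas skips NaN by default, so 'before' describes the observed values only
print(f"mean before: {ages.mean():.2f} | after: {filled.mean():.2f}")
print(f"std  before: {ages.std():.2f} | after: {filled.std():.2f}")
```

The mean stays the same by construction, but the spread is smaller after imputation because the filled values add no variation of their own.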

Assessment Connection

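In your EPA (end-point assessment), examiners may ask: "Why did you choose median imputation over mean?" You must be able to justify your choice. A strong answer explains that the mean is sensitive to outliers, whereas the median is robust to them and is therefore a more representative fill value for skewed data. Note that median imputation still adds a spike at a single value; it simply places that spike at a more typical point.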
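
A small example can back up that justification. The sketch below uses an invented, heavily skewed `fare` column with one extreme value (the numbers are made up for illustration) and compares the fill value chosen by each strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical skewed data: one very expensive ticket and one missing fare
toy = pd.DataFrame({"fare": [7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 512.0, np.nan]})

mean_imputer = SimpleImputer(strategy="mean").fit(toy[["fare"]])
median_imputer = SimpleImputer(strategy="median").fit(toy[["fare"]])

# statistics_ holds the value each imputer would use to fill the gap
print(f"mean fill:   {mean_imputer.statistics_[0]:.2f}")
print(f"median fill: {median_imputer.statistics_[0]:.2f}")
# prints: mean fill:   83.43
# prints: median fill: 10.00
```

The single outlier drags the mean far above what most passengers in this toy sample paid, while the median stays close to a typical fare.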

Summary

  • As a rule of thumb, drop a column if more than 50% of its values are missing.
  • As a rule of thumb, drop rows only if the affected rows make up less than 5% of the total dataset.
  • Use SimpleImputer to fill missing numeric data with the mean or median.
  • Always plot the distribution before and after imputation to verify you haven't destroyed the variance of your feature.
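
The thresholds above can be sketched as a small triage helper. Everything named here (`suggest_action`, `col_threshold`, `row_threshold`) is hypothetical, and treating the range between the two thresholds as "impute" is an assumption of this sketch, not a fixed rule. The counts are the ones printed in Step 1.

```python
import pandas as pd

def suggest_action(missing_fraction, col_threshold=0.50, row_threshold=0.05):
    """Map a column's missing fraction to a rule-of-thumb action."""
    if missing_fraction == 0:
        return "keep"
    if missing_fraction > col_threshold:
        return "drop column"
    if missing_fraction < row_threshold:
        return "drop rows"
    # Assumption: anything between the two thresholds is a candidate for imputation
    return "impute"

# Missing counts from Step 1, out of 891 rows in total
missing_counts = pd.Series({"age": 177, "embarked": 2, "deck": 688, "embark_town": 2})
fractions = missing_counts / 891

actions = fractions.map(suggest_action)
print(actions)
```

Treat the output as a starting point for discussion; a column that is mostly empty can still be worth keeping if the fact that a value is missing is itself informative.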

Next Steps

Data Types & Encoding — prepare categorical text data for machine learning algorithms.

Stretch & Challenge

For Advanced Learners

1. Multivariate Imputation

Instead of SimpleImputer, try IterativeImputer, Scikit-Learn's multivariate imputer inspired by MICE (Multivariate Imputation by Chained Equations). It models each feature that has missing values as a function of the other features, cycling through them in rounds, and uses those models to predict the missing entries.

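# IterativeImputer is still experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
# fit_transform returns a NumPy array with columns in the order passed in
df_imputed_advanced = imputer.fit_transform(df_dropped[['age', 'fare', 'pclass']])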

This technique avoids the single artificial "spike" caused by mean imputation, because each missing age gets its own prediction based on that passenger's ticket fare and class. The predictions are still estimates, so plot the result just as you did in Step 3.
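
Because fit_transform returns a NumPy array, it is often convenient to wrap the result back into a DataFrame. The sketch below does this on synthetic data (the `toy` frame, the made-up relationship between age and fare, and the chosen `missing_idx` rows are all invented for illustration) and checks that the filled ages differ from one another.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: age loosely depends on fare (an invented relationship)
rng = np.random.default_rng(0)
fare = rng.uniform(5, 100, size=50)
pclass = np.where(fare > 50, 1, 3)
age = 20 + 0.3 * fare + rng.normal(0, 2, size=50)
toy = pd.DataFrame({"age": age, "fare": fare, "pclass": pclass})

# Blank out a few ages to simulate missing values
missing_idx = [3, 10, 27]
toy.loc[missing_idx, "age"] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=42)

# Restore column names and index, which the returned array does not carry
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Each previously missing age now has its own predicted value
print(filled.loc[missing_idx])
```

Unlike mean imputation, the three filled ages are different because each is predicted from that row's own fare and class.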

KSB Mapping

KSB  | Description                            | How This Addresses It
K5.3 | Common patterns in real-world data     | Identifying missing values, duplicates, outliers, and class imbalance
S2   | Data engineering and governance        | Systematic data cleaning, transformation, and quality assessment
S3   | Programming for data manipulation      | pandas pipelines for data preparation
B3   | Adaptability and pragmatism            | Handling imperfect real-world datasets