Detecting & Treating Outliers¶
One bad data point can completely derail a linear model. Learn to find and neutralise model-breaking anomalies.
What You Will Learn¶
- Visually identify extreme outliers using Seaborn Box Plots
- Programmatically calculate Interquartile Range (IQR) boundaries
- Cap extreme values systematically rather than deleting data
Prerequisites¶
- Completed the Scaling & Normalisation tutorial
- Basic understanding of percentiles (25th, 75th)
Step 1: Visual Detection¶
We will use the diamonds dataset again. To demonstrate outliers properly, we will artificially inject some severe anomalies. Box plots are the definitive visual tool for hunting outliers.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('diamonds').head(1000).copy()  # .copy() so editing the slice below doesn't raise SettingWithCopyWarning
# Inject structural anomalies for demonstration
df.loc[df.index[0], 'price'] = 25000
df.loc[df.index[1], 'carat'] = 6.0
# Plotting box plots to instantly spot outliers
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=df, y='price', x='cut', hue='cut', legend=False, ax=axes[0], palette='viridis')
axes[0].set_title('Box Plot for Outlier Detection (Price)')
sns.scatterplot(data=df, x='carat', y='price', alpha=0.5, ax=axes[1])
axes[1].set_title('Scatter Plot (Carat vs Price)')
Any point drawn individually beyond the "whiskers" of the box plot is a statistical outlier.
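If you want to see exactly where matplotlib places those whiskers, you can pull the underlying statistics with matplotlib's boxplot_stats helper. The sample values below are a hypothetical stand-in for one cut group's prices, not the real dataset:

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical prices for one group, with one obvious anomaly
prices = np.array([500, 700, 900, 1100, 1300, 1500, 25000])

stats = boxplot_stats(prices)[0]
# 'whishi' is the top whisker position; 'fliers' are the points drawn individually
print(f"Upper whisker: {stats['whishi']}, outliers: {stats['fliers']}")
```

The whisker sits at the largest data point still inside Q3 + 1.5*IQR, which is exactly the fence we calculate by hand in Step 2.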
Step 2: The IQR Method¶
You cannot rely on humans to visually spot every outlier. The Interquartile Range (IQR) mathematically defines the boundary of "normal". Anything above Q3 + 1.5*IQR or below Q1 - 1.5*IQR is flagged.
# Calculate the 25th (Q1) and 75th (Q3) percentiles
Q1 = df['carat'].quantile(0.25)
Q3 = df['carat'].quantile(0.75)
IQR = Q3 - Q1
# Define our statistical boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# How many rows violate this boundary?
outliers = df[(df['carat'] < lower_bound) | (df['carat'] > upper_bound)]
print(f"Total rows: {len(df)} | Outliers detected: {len(outliers)}")
Let's draw that bound onto our scatter plot from Step 1 to see exactly where the threshold falls:
axes[1].axvline(upper_bound, color='red', linestyle='--', label=f'Upper IQR ({upper_bound:.2f})')
axes[1].legend()
plt.tight_layout()
plt.show()
Expected Plot: two panels — box plots with isolated points above the whiskers, and the scatter plot with a red dashed line at the upper IQR bound.
Workplace Tip
Never delete outliers automatically. In fraud detection, those outliers ARE the entire target! The 6.0 carat diamond in our example may genuinely exist and be priced accurately. Always consult a domain expert to determine if an anomaly is a "data entry error" or a "genuine extreme event."
Step 3: Capping (Winsorisation)¶
If you determine the outliers are destructive but you cannot throw away 21 rows of data, you can "cap" them at a specific ceiling threshold. This technique is formally known as Winsorisation.
# Cap values strictly at the calculated upper/lower bounds
df['carat_capped'] = np.clip(df['carat'], a_min=lower_bound, a_max=upper_bound)
print(f"Max original carat: {df['carat'].max()}")
print(f"Max capped carat: {df['carat_capped'].max():.2f}")
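After capping, it is worth sanity-checking that nothing still sits outside the fence. A minimal sketch using a small hypothetical series in place of df['carat']:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['carat'], with one injected anomaly
carat = pd.Series([0.3, 0.4, 0.5, 0.6, 0.7, 6.0])

Q1, Q3 = carat.quantile(0.25), carat.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

capped = np.clip(carat, lower, upper)

# By construction, np.clip leaves nothing outside the IQR fence
violations = ((capped < lower) | (capped > upper)).sum()
print(f"Violations after capping: {violations}")  # Violations after capping: 0
```

Running the same check on the uncapped series would report one violation, confirming the cap did its job.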
Assessment Connection
In your final EPA portfolio mapping to KSB S4 (Data Cleansing), explicitly documenting why you chose to 'Cap' rather than 'Drop' outliers gives examiners the precise analytical reasoning they need to award top grades.
Summary¶
- Outliers drag mathematical averages and linear models radically off course.
- Use sns.boxplot() prior to modelling to visually identify magnitude anomalies.
- Calculate the IQR (Q3 - Q1) bound dynamically to hunt anomalies across millions of rows without manual visualisation.
- Use np.clip() to pull destructive outliers back to the mathematical fence without deleting the row entirely.
Next Steps¶
→ Building Preprocessing Pipelines — package all cleaning, scaling, and encoding into a single robust transformer block.
Stretch & Challenge
For Advanced Learners¶
Isolation Forest Algorithm
The IQR method only works on one column at a time (Univariate). What if a diamond's carat = 1 (normal) and price = $1,000 (normal), but together 1 carat for $1,000 is an impossible combination? You need a Multivariate outlier algorithm.
IsolationForest builds randomised decision trees that try to isolate each data point across multi-dimensional features. Anomalies are easier to isolate (they need fewer splits), so they are scored as outliers.
from sklearn.ensemble import IsolationForest
# Train the anomaly detector across BOTH columns
clf = IsolationForest(contamination=0.02, random_state=42)
predictions = clf.fit_predict(df[['carat', 'price']])
# Predictions return -1 for an outlier and 1 for normal
df['multivariate_outlier'] = predictions
print(df[df['multivariate_outlier'] == -1].head())
Examine the rows marked -1 and attempt to understand why IsolationForest triggered them without explicitly violating single-column bounds.
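To make the univariate-vs-multivariate distinction concrete, here is a sketch on synthetic data (the numbers and the price ≈ carat × 4000 relationship are invented for illustration, not taken from the diamonds dataset). It injects a row whose carat and price are each individually normal but jointly implausible, then compares what each method flags:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in: price roughly proportional to carat
carat = rng.uniform(0.3, 1.5, 200)
price = carat * 4000 + rng.normal(0, 200, 200)
df = pd.DataFrame({'carat': carat, 'price': price})
# Each value is normal on its own; the combination is not
df.loc[len(df)] = [1.0, 500.0]

clf = IsolationForest(contamination=0.01, random_state=42)
df['flag'] = clf.fit_predict(df[['carat', 'price']])

def iqr_outlier(s):
    """Univariate IQR flag for a single column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

univariate = iqr_outlier(df['carat']) | iqr_outlier(df['price'])
multivariate = df['flag'] == -1

# Rows IsolationForest caught that single-column IQR missed
print(df[multivariate & ~univariate])
```

The printout should contain rows that sit far from the carat-price trend while staying inside both single-column fences, which is precisely the gap the IQR method cannot see.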
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K5.3 | Common patterns in real-world data | Identifying missing values, duplicates, outliers, and class imbalance |
| S2 | Data engineering and governance | Systematic data cleaning, transformation, and quality assessment |
| S3 | Programming for data manipulation | pandas pipelines for data preparation |
| B3 | Adaptability and pragmatism | Handling imperfect real-world datasets |