
Scaling & Normalisation

Distance-based algorithms struggle when one feature is measured in centimetres and another in kilometres. Scaling puts all features on a comparable footing.

What You Will Learn

  • Identify when scaling is actually required
  • Standardise features using StandardScaler
  • Normalise features to a fixed range using MinMaxScaler
  • Visually verify that scaling preserved the shape of the data

Prerequisites

  • Completed the Data Types & Encoding tutorial

Step 1: Why Scaling Matters

Many ML algorithms (KNN, SVM, Neural Networks, PCA, K-Means) are sensitive to feature magnitude. KNN and K-Means compare data points using raw Euclidean distance, PCA looks for the directions of greatest variance, and SVMs and neural networks train more reliably when inputs share a similar scale.

Consider houses: Square Footage might range from 800 to 5000, while Number of Bedrooms ranges from 1 to 5. Because the square footage values are roughly three orders of magnitude larger, they dominate the distance calculation and the algorithm practically ignores the bedroom count. Scaling puts both features on a comparable range so that each can contribute.
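The effect can be sketched with three hypothetical houses (the values below are made up for illustration, and the rescaling simply uses the 800 to 5000 and 1 to 5 ranges quoted above):

```python
import math

# Hypothetical houses as (square footage, bedrooms)
house_a = (1500, 2)
house_b = (1510, 5)   # almost the same size, three more bedrooms
house_c = (1600, 2)   # same bedrooms, 100 sq ft larger

# Raw Euclidean distances: square footage dominates
raw_ab = math.dist(house_a, house_b)
raw_ac = math.dist(house_a, house_c)

def rescale(house):
    """Map each feature onto 0..1 using the ranges quoted in the text."""
    sqft, beds = house
    return ((sqft - 800) / (5000 - 800), (beds - 1) / (5 - 1))

# Distances after rescaling: the bedroom difference now counts
scaled_ab = math.dist(rescale(house_a), rescale(house_b))
scaled_ac = math.dist(rescale(house_a), rescale(house_c))

print(f"raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

On the raw values, house B looks like the nearer neighbour of A even though it has three extra bedrooms. After rescaling, house C becomes the nearer one.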

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = sns.load_dataset('diamonds').head(1000)

print(f"Carat variance: {df['carat'].var():.4f}")
print(f"Price variance: {df['price'].var():.1f}")
Expected Output
Carat variance: 0.2223
Price variance: 15891338.8

Step 2: Standardisation (Z-Score)

StandardScaler subtracts each feature's mean and divides by its standard deviation, so the scaled feature has a mean of 0 and a standard deviation of 1. It is a common default choice. It does not strictly require a bell curve (Normal distribution), but it behaves best when the feature is roughly symmetric and free of extreme outliers.

scaler_std = StandardScaler()

# Transform multiple columns simultaneously
features = ['carat', 'price']
df_std = pd.DataFrame(scaler_std.fit_transform(df[features]), columns=features)

print(df_std.describe().round(2))
Expected Output
         carat    price
count  1000.00  1000.00
mean     -0.00     0.00
std       1.00     1.00
min      -1.13    -0.91
25%      -0.83    -0.73
50%      -0.22    -0.32
75%       0.44     0.46
max       4.88     3.65

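Notice that the mean is effectively \(0.00\) and the standard deviation rounds to \(1.00\). pandas reports the sample standard deviation, while StandardScaler divides by the population standard deviation, so the printed figure sits marginally above 1 before rounding.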
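As a rough check of what the scaler computes, the Z-score formula \(z = (x - \mu) / \sigma\) can be applied by hand and compared with StandardScaler. This is a minimal sketch on a small made-up array, not on the diamonds data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up sample, for illustration only
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))

# Population std of the result is 1; the sample std (ddof=1) is slightly larger
print(z_sklearn.std(ddof=0), z_sklearn.std(ddof=1))
```

The gap between the two standard deviations shrinks as the number of rows grows, which is why it disappears after rounding on 1000 rows.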

Workplace Tip

Tree-based algorithms (RandomForest, XGBoost, DecisionTree) are largely insensitive to feature scaling because they split on per-feature threshold values rather than geometric distances, and rescaling a feature does not change the order of its values. If you are ONLY running trees at work, you can usually skip scaling altogether.
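A quick way to convince yourself is to fit the same tree on raw and scaled copies of a dataset. The sketch below uses synthetic house data and a made-up labelling rule, so treat it as an illustration rather than a proof:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features: square footage and bedroom count
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(800, 5000, 200), rng.integers(1, 6, 200)])

# Made-up label: large houses with at least three bedrooms
y = ((X[:, 0] > 2500) & (X[:, 1] >= 3)).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Compare predictions for the same rows in both representations
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same)
```

In this example the predictions match row for row: the thresholds move with the scaled values, but the rows fall on the same side of each split.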

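Step 3: Min-Max Normalisation

MinMaxScaler rescales each feature linearly so that its minimum maps to 0.0 and its maximum maps to 1.0, using (x - min) / (max - min). Every value seen during fitting therefore lands in the closed range from 0.0 to 1.0.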

scaler_minmax = MinMaxScaler()

df_mm = pd.DataFrame(scaler_minmax.fit_transform(df[features]), columns=features)

print(df_mm.describe().round(2))
Expected Output
         carat    price
count  1000.00  1000.00
mean      0.19     0.20
std       0.17     0.22
min       0.00     0.00
25%       0.05     0.04
50%       0.15     0.13
75%       0.26     0.30
max       1.00     1.00

Here the minimum is exactly \(0.00\) and the maximum is exactly \(1.00\). A bounded input range like this is often preferred for the inputs of deep learning neural networks.
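The min-max formula \((x - x_{min}) / (x_{max} - x_{min})\) is simple enough to apply by hand. The sketch below uses a handful of made-up prices, and also shows what happens to a later value that falls outside the range the minimum and maximum were taken from:

```python
import numpy as np

# Made-up prices standing in for a training column
train = np.array([326.0, 500.0, 1000.0, 2898.0])

lo, hi = train.min(), train.max()
train_mm = (train - lo) / (hi - lo)

# A later value above the training maximum, scaled with the same lo and hi
new_mm = (5000.0 - lo) / (hi - lo)

print(train_mm.round(3))
print(round(new_mm, 3))
```

The training values span 0 to 1, but the new value lands above 1. With its default settings MinMaxScaler likewise does not clip unseen values, so the 0 to 1 guarantee only holds for the data the scaler was fitted on.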

Step 4: Visualising Transformation Integrity

Scaling changes the magnitude of the values, but because both scalers apply a linear transformation to each feature, relative spacing and correlation are preserved. The scatter plots have the same shape in every panel; only the numbers on the X and Y axes change.

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Original
sns.scatterplot(data=df, x='carat', y='price', alpha=0.5, ax=axes[0])
axes[0].set_title('Original Scale (Prices ~ $2000)')

# Standard Scaler
sns.scatterplot(data=df_std, x='carat', y='price', alpha=0.5, ax=axes[1], color='green')
axes[1].set_title('StandardScaler (Mean=0, STD=1)')

# MinMax Scaler
sns.scatterplot(data=df_mm, x='carat', y='price', alpha=0.5, ax=axes[2], color='purple')
axes[2].set_title('MinMaxScaler (Range 0-1)')

plt.tight_layout()
plt.show()
Expected Plot

Scaling visualised
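A numeric check can complement the plot. The sketch below uses synthetic carat and price values (not the real diamonds data) and hand-written versions of both scalers to confirm that the correlation coefficient is unchanged:

```python
import numpy as np

# Synthetic data: price roughly proportional to carat, plus noise
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 1.5, 500)
price = 4000 * carat + rng.normal(0, 300, 500)

def zscore(v):
    """Standardise to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std()

def minmax(v):
    """Rescale linearly onto the range 0..1."""
    return (v - v.min()) / (v.max() - v.min())

r_raw = np.corrcoef(carat, price)[0, 1]
r_std = np.corrcoef(zscore(carat), zscore(price))[0, 1]
r_mm = np.corrcoef(minmax(carat), minmax(price))[0, 1]

print(np.isclose(r_raw, r_std), np.isclose(r_raw, r_mm))
```

Pearson correlation is unaffected by shifting a variable or multiplying it by a positive constant, which is all either scaler does to each column.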

Assessment Connection

Call .fit_transform() on your X_train partition, but only .transform() on your X_test partition. Do NOT call .fit() on test or validation data: doing so leaks information about those rows into preprocessing, which artificially inflates your results and is treated as a fail in the assessment.
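A minimal sketch of that pattern, using synthetic data and the conventional X_train / X_test names (the split size and random seeds are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(2))
print(X_test_scaled.mean(axis=0).round(2))
```

The scaled training columns have a mean of zero by construction. The scaled test columns usually do not, and that is expected: the test rows are measured against statistics they played no part in computing.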

Summary

  • Distance- and variance-based algorithms need features on comparable scales.
  • StandardScaler converts values to Z-scores with mean 0 and standard deviation 1.
  • MinMaxScaler rescales values linearly into the range 0.0 to 1.0.
  • Scaling changes magnitudes, not the shape of the data: scatter plots look the same before and after.

Next Steps

Detecting & Treating Outliers — handle the extreme anomalies that corrupt prediction accuracy and warp scalers.

Stretch & Challenge

For Advanced Learners

RobustScaler for Data with Extreme Outliers

If your dataset contains extreme outliers, they pull the mean and inflate the standard deviation that StandardScaler relies on, squeezing the typical values into a narrow band.

RobustScaler addresses this by subtracting the median and dividing by the interquartile range (IQR) instead of using the mean and standard deviation.

from sklearn.preprocessing import RobustScaler

scaler_robust = RobustScaler()
df_robust = scaler_robust.fit_transform(df[['price']])

Try RobustScaler on heavily skewed data, such as banking transaction tables, and compare how the typical values are spread out against the StandardScaler result.
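The difference is easy to see on a tiny example. The sketch below uses six made-up transaction amounts, one of them an extreme outlier, and also reproduces the RobustScaler result by hand, assuming its default 25th to 75th percentile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up transaction amounts with one extreme outlier
amounts = np.array([[20.0], [25.0], [30.0], [35.0], [40.0], [10000.0]])

robust = RobustScaler().fit_transform(amounts)
standard = StandardScaler().fit_transform(amounts)

# Manual version: (x - median) / IQR
q1, med, q3 = np.percentile(amounts, [25, 50, 75])
manual = (amounts - med) / (q3 - q1)

# Spread of the five typical values under each scaler
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()

print(np.allclose(manual, robust))
print(f"typical-value spread: robust={robust_spread:.3f}, standard={standard_spread:.4f}")
```

Under StandardScaler the five typical amounts collapse into almost the same value because the outlier inflates the standard deviation. Under RobustScaler they keep a usable spread, and the outlier simply sits far away from them.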

KSB Mapping

KSB    Description                            How This Addresses It
K5.3   Common patterns in real-world data     Identifying missing values, duplicates, outliers, and class imbalance
S2     Data engineering and governance        Systematic data cleaning, transformation, and quality assessment
S3     Programming for data manipulation      pandas pipelines for data preparation
B3     Adaptability and pragmatism            Handling imperfect real-world datasets