Dimensionality Reduction with PCA

Feature Selection deletes columns outright. Principal Component Analysis (PCA) instead combines correlated columns into a smaller set of new columns, each a weighted blend of the originals.

What You Will Learn

  • Distinguish Feature Selection from Feature Extraction
  • Compress highly correlated columns into fewer dimensions in Python
  • Check how much variance PCA retains using .explained_variance_ratio_

Prerequisites

  • Completed the Filter / Embedded Methods tutorials
  • Core understanding of StandardScaler (Z-Scores)

Step 1: The Curse of Dimensionality

A 100x100 pixel image, once flattened, becomes a row of 10,000 separate columns, one per pixel.

If you feed 10,000 columns into K-Nearest Neighbors, distance stops being a useful measure: in such a sparse, high-dimensional space, every point ends up almost equally far from every other point.

Selecting the "Top 50" pixels and deleting the other 9,950 would destroy the picture. Principal Component Analysis (PCA) instead finds the directions along which correlated columns vary together and compresses the 10,000 columns into, for example, 100 dense new columns called "Components".
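The distance breakdown described above can be observed directly. The following is a minimal sketch using random uniform points (an assumption made purely for illustration, not image data); the helper name relative_contrast is hypothetical. It compares how much farther the farthest neighbour is than the nearest one as the number of columns grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=200):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))   # illustrative random data
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Same experiment at 2, 100 and 10,000 columns
contrast = {d: relative_contrast(d) for d in (2, 100, 10_000)}
for d, c in contrast.items():
    print(f"{d:>6} dimensions: relative contrast = {c:.3f}")
```

As the dimension count rises, the contrast shrinks towards zero, which means "nearest" and "farthest" neighbours become hard to tell apart.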

Step 2: Explicit Standardisation

PCA looks for the directions of greatest variance in the raw numbers and knows nothing about units. If carat ranges from 0-5 and price ranges from 0-15000, the variance of price is orders of magnitude larger, so PCA will attribute almost all of the variance to price simply because of its scale.

You must standardise every feature before passing the data to PCA.

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = sns.load_dataset('penguins').dropna()
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

# Standardise first: PCA is sensitive to feature scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
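To see why scaling matters, here is a small sketch on made-up data. The carat and price columns below are synthetic stand-ins generated with NumPy (an assumption for illustration, not a real dataset), and the names X_demo, raw_pca and scaled_pca are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: 'carat' on a 0-5 scale, 'price' on a 0-15000 scale
carat = rng.uniform(0, 5, size=500)
price = np.clip(3000 * carat + rng.normal(0, 1500, size=500), 0, 15000)
X_demo = np.column_stack([carat, price])

# Fit PCA once on the raw values and once on standardised values
raw_pca = PCA(n_components=2).fit(X_demo)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print("Unscaled PC1 share of variance:", raw_pca.explained_variance_ratio_[0])
print("Unscaled PC1 weights (carat, price):", raw_pca.components_[0])
print("Scaled PC1 weights (carat, price):  ", scaled_pca.components_[0])
```

Without scaling, the first component is essentially the price column on its own. After scaling, both columns contribute with equal weight.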

Step 3: Extracting Principal Components

We will ask the PCA transformer to convert our 4 penguin columns into just 2 components.

# Reduce the data to exactly 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled) 

print(f"Original Structural Shape: {X_scaled.shape}")
print(f"Reduced PCA Tensor Shape:  {X_pca.shape}")
Expected Output
Original Structural Shape: (333, 4)
Reduced PCA Tensor Shape:  (333, 2)

What is inside X_pca? The original column names (bill_length_mm, body_mass_g) no longer apply. Each new column, PC1 and PC2, is a weighted blend of all four standardised measurements.
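The weights behind each blend are stored in the fitted transformer's components_ attribute. The sketch below uses synthetic stand-in data with the same four column names (the values are invented, not the real penguin measurements), and the names demo, pca_demo and loadings are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the four penguin measurements (not real data)
size = rng.normal(size=300)
demo = pd.DataFrame({
    "bill_length_mm": 44 + 4.0 * size + rng.normal(0, 3.0, 300),
    "bill_depth_mm": 17 - 1.0 * size + rng.normal(0, 1.5, 300),
    "flipper_length_mm": 200 + 12.0 * size + rng.normal(0, 6.0, 300),
    "body_mass_g": 4200 + 700.0 * size + rng.normal(0, 400.0, 300),
})

X_demo = StandardScaler().fit_transform(demo)
pca_demo = PCA(n_components=2)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Each row of components_ holds the weights one component puts on the original columns
loadings = pd.DataFrame(pca_demo.components_, columns=demo.columns, index=["PC1", "PC2"])
print(loadings.round(2))

# Rebuild the scores by hand: centre the data, then take the weighted sums
manual_scores = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(manual_scores, X_demo_pca))
```

Reading the loadings table row by row shows which original measurements drive each component.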

Step 4: Measuring Information Loss

We dropped 2 of the 4 dimensions. Did we lose 50% of the information? Let's check .explained_variance_ratio_, which reports the share of the total variance captured by each component.

variance_ratio = pca.explained_variance_ratio_

print(f"Data mathematically retained in PC1: {variance_ratio[0]*100:.2f}%")
print(f"Data mathematically retained in PC2: {variance_ratio[1]*100:.2f}%")
print(f"Total Cumulative Retained Variance: {sum(variance_ratio)*100:.2f}%")
Expected Output
Data mathematically retained in PC1: 68.63%
Data mathematically retained in PC2: 19.45%
Total Cumulative Retained Variance: 88.09%

We halved the number of columns, yet retained about 88% of the total variance.

Let's see how well the two components separate the species in a 2D plot:
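Where do these percentages come from? Each ratio is one eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues. The sketch below reproduces that calculation with NumPy alone on synthetic correlated data (a stand-in for X_scaled, not the penguin values); the names X_demo, ratio and cumulative are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data standing in for four standardised columns
latent = rng.normal(size=(300, 1))
X_demo = latent @ np.array([[1.0, -0.6, 1.2, 1.1]]) + rng.normal(0, 0.6, size=(300, 4))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along each principal axis
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]  # largest first
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r * 100:6.2f}%   cumulative: {c * 100:6.2f}%")
```

The cumulative column always reaches 100% once every component is kept, so the useful question is how early it gets close enough for your purpose.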

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette='Set2')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2D PCA Projection of Penguins')
plt.tight_layout()
plt.show()
Expected Plot

PCA Compression Output

Assessment Connection

In your EPA evidence for S12 (Feature Engineering), document the threshold you set and why (e.g., "I configured PCA to retain at least 90% of the cumulative variance"). Picking a value such as n_components=10 without justification is considered unsafe by examiners.
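One way to apply such a threshold is to pass a fraction to n_components instead of a count. scikit-learn then keeps the fewest components whose cumulative explained variance exceeds that fraction. The sketch below uses synthetic data (10 invented columns, an assumption for illustration) and hypothetical names such as pca_90.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data: 10 columns driven by 3 hidden factors plus noise
factors = rng.normal(size=(400, 3))
X_demo = factors @ rng.normal(size=(3, 10)) + rng.normal(0, 0.5, size=(400, 10))
X_demo = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks for the fewest components whose cumulative
# explained variance exceeds that fraction
pca_90 = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca_90.fit_transform(X_demo)

retained = pca_90.explained_variance_ratio_.sum()
print(f"Components kept: {pca_90.n_components_}")
print(f"Cumulative variance retained: {retained * 100:.2f}%")
```

On the penguin data above, two components retain about 88%, so a 90% target there would need at least a third component.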

Summary

  • Selection deletes raw columns entirely.
  • Extraction (PCA) blends variables into new columns aligned with the directions of greatest variance.
  • StandardScaler() is a non-negotiable prerequisite before PCA.
  • Summing .explained_variance_ratio_ tells you how much of the original variance was retained, and therefore how much was lost, by the dimensionality reduction.

Next Steps

Domain Expertise in Feature Design: a How-To on moving beyond mechanical techniques and using domain knowledge to design features by hand.

Stretch & Challenge

For Advanced Learners

Inverse Transformation

If PCA compresses the data from 4 dimensions down to 2, can we "uncompress" it back to 4 dimensions?

Yes, using pca.inverse_transform().

# 2D -> 4D
X_recovered = pca.inverse_transform(X_pca)

# Reverse the scaler as well to return to the original units (millimetres and grams)
X_original_approximation = scaler.inverse_transform(X_recovered)

The recovered data will never match the original exactly, because about 12% of the variance was discarded (lossy compression), but it is a close approximation.
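The size of that mismatch can be measured. In the sketch below (synthetic 4-column data, hypothetical names throughout), the share of squared deviation that the round trip fails to restore matches the share of variance that PCA discarded.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic 4-column data standing in for the penguin measurements
latent = rng.normal(size=(300, 2))
raw = latent @ rng.normal(size=(2, 4)) + rng.normal(0, 0.5, size=(300, 4))

scaler_demo = StandardScaler()
X_demo = scaler_demo.fit_transform(raw)

# Compress 4D -> 2D, then expand 2D -> 4D
pca_demo = PCA(n_components=2)
X_demo_recovered = pca_demo.inverse_transform(pca_demo.fit_transform(X_demo))

# Share of the total squared deviation that the round trip failed to restore
lost = np.sum((X_demo - X_demo_recovered) ** 2) / np.sum((X_demo - X_demo.mean(axis=0)) ** 2)
discarded = 1 - pca_demo.explained_variance_ratio_.sum()
print(f"Reconstruction error share: {lost:.4f}")
print(f"Discarded variance share:   {discarded:.4f}")

# Undo the scaling to get back to the original units
raw_approx = scaler_demo.inverse_transform(X_demo_recovered)
print(f"Mean absolute error in original units: {np.abs(raw - raw_approx).mean():.4f}")
```

The two printed shares agree, so the cumulative explained variance doubles as a measure of how faithful the reconstruction can be.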

KSB Mapping

KSB  | Description                          | How This Addresses It
K4.2 | Advanced analytics and ML techniques | Feature selection algorithms and dimensionality reduction
K5.2 | Data formats and structures          | Encoding categorical variables, handling mixed feature types
S2   | Data engineering                     | Creating and transforming features from raw data
S4   | Feature selection and ML             | Applying feature selection methods and PCA
B1   | Inquisitive approach                 | Exploring creative feature engineering strategies