How to Reduce Multicollinearity¶
Multicollinearity occurs when two or more predictor columns are highly correlated with each other. This destabilises the coefficient estimates of linear models such as Linear Regression.
What You Will Learn¶
- Diagnose multicollinearity using Pearson's correlation
- Drop redundant features
- Understand the Variance Inflation Factor (VIF)
Step 1: Detection via Heatmap¶
If a CSV contains both Year_of_Birth and Age, the two columns carry essentially the same information. A linear model can then assign arbitrary, unstable weights to each.
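The instability is easy to see on synthetic data. The sketch below (hypothetical column values, not from any real CSV) fits scikit-learn's LinearRegression on Age and a perfectly collinear Year_of_Birth. The individual weights are not identified; only their difference is pinned down by the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).astype(float)
year_of_birth = 2024 - age                      # perfectly collinear with age
y = 2.0 * age + rng.normal(0, 1, size=200)     # target depends only on age

X = np.column_stack([age, year_of_birth])
model = LinearRegression().fit(X, y)

# Any pair of weights with coef_age - coef_year ≈ 2 fits equally well;
# the solver returns one arbitrary (minimum-norm) choice.
print(model.coef_)
```

Because the two columns are exact mirrors of each other, neither coefficient is individually meaningful, which is exactly why reported weights become unstable across refits.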
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('diamonds').head(1000)
# Compute pairwise Pearson correlations for the numeric columns
correlation_matrix = df.select_dtypes('number').corr()
plt.figure(figsize=(8, 6))
# A correlation above ~0.85 flags severe multicollinearity
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix: Hunting High Redundancy", fontsize=14)
plt.show()
Expected Output
(The heatmap shows that the diamond dimensions x, y, and z each correlate with carat at roughly 0.98.)
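Rather than eyeballing the heatmap, you can extract the offending pairs programmatically. The helper below (`high_corr_pairs` is our own name, not a pandas API) walks the upper triangle of the correlation matrix and returns every pair above the threshold; pass it the same diamonds frame used above.

```python
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.85):
    """Return (col_a, col_b, r) for numeric pairs whose |Pearson r| exceeds threshold."""
    corr = df.select_dtypes('number').corr().abs()
    cols = corr.columns
    return [
        (cols[i], cols[j], round(corr.iloc[i, j], 2))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))   # upper triangle: each pair once
        if corr.iloc[i, j] > threshold
    ]
```

For example, `high_corr_pairs(sns.load_dataset('diamonds').head(1000))` should surface the carat/x/y/z cluster directly, ready for review.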
Step 2: Manual Truncation¶
If carat already captures the geometric variance held by x, y, and z, we can simply drop the redundant features rather than reaching for dimensionality reduction (PCA).
# The physical dimensions destabilise the regression without adding new information
df_clean = df.drop(columns=['x', 'y', 'z'])
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_clean.shape}")
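VIF appears in the learning objectives but is not demonstrated above. A minimal sketch, assuming `statsmodels` is installed and using synthetic columns (x1, x2, x3 are hypothetical names): VIF measures how well each predictor is explained by the others, and a common rule of thumb treats values above 5–10 as problematic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
X = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.05, size=500),  # near-duplicate of x1
    'x3': rng.normal(size=500),                    # independent predictor
})
X = X.assign(const=1.0)  # VIF assumes the regression includes an intercept

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)  # x1 and x2 should be very large; x3 should sit near 1
```

Unlike a pairwise correlation matrix, VIF also catches a column that is a combination of several others, so it complements the heatmap check rather than replacing it.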
Workplace Tip
Tree-based models (Random Forest, XGBoost) are largely robust to multicollinearity: a tree simply splits on the stronger variable and ignores the redundant copy. Aggressive collinearity cleanup is mainly necessary for linear models and neural networks, though correlated features still dilute tree feature-importance scores.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced analytics and ML techniques | Feature selection algorithms and dimensionality reduction |
| K5.2 | Data formats and structures | Encoding categorical variables, handling mixed feature types |
| S2 | Data engineering | Creating and transforming features from raw data |
| S4 | Feature selection and ML | Applying feature selection methods and PCA |
| B1 | Inquisitive approach | Exploring creative feature engineering strategies |