How to Merge Multiple Data Sources¶

Rarely does all your predictive signal live in a single table. Merging correctly is essential for feature enrichment.

What You Will Learn¶

Execute Left, Right, Inner, and Outer joins
Merge on discrepant column names
Verify merge integrity securely

Step 1: The Inner Join¶

If you want to merge two datasets and strictly only retain the records that securely exist identically in both frames, use the inner merge (the Pandas default).

import pandas as pd

# Creating synthetic mock data to simulate relational SQL tables
customers = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
transactions = pd.DataFrame({'cust_id': [2, 3, 4], 'amount': [150, 200, 50]})

# Only Bob (2) and Charlie (3) exist in both. Alice (1) and (4) are dropped.
inner_merged = pd.merge(customers, transactions, on='cust_id', how='inner')
print(inner_merged)

Expected Output

   cust_id     name  amount
0        2      Bob     150
1        3  Charlie     200

Step 2: The Left Join (Enrichment)¶

If predicting customer churn, you absolutely must retain 100% of your primary customers even if they have 0 transactions recorded. Use a left join securely to enrich the left-hand DataFrame.

# Retain everyone in 'customers', populate NaN for missing matches in 'transactions'
left_merged = pd.merge(customers, transactions, on='cust_id', how='left')

# Fill NaN transaction amounts with 0 computationally
left_merged['amount'] = left_merged['amount'].fillna(0)
print(left_merged)

Expected Output

   cust_id     name  amount
0        1    Alice     0.0
1        2      Bob   150.0
2        3  Charlie   200.0

Step 3: Merging on Discrepant Keys¶

When databases aren't homogenised, your keys might be named differently (e.g. cust_id vs client_num). Use left_on and right_on.

# Simulating a third table with a different ID column completely
address = pd.DataFrame({'client_num': [1, 2, 3], 'city': ['London', 'Manchester', 'Leeds']})

# Merge specifying explicit columns
enriched = pd.merge(left_merged, address, left_on='cust_id', right_on='client_num', how='left')

# Drop the redundant secondary key efficiently
enriched = enriched.drop(columns=['client_num'])
print(enriched)

Expected Output

   cust_id     name  amount        city
0        1    Alice     0.0      London
1        2      Bob   150.0  Manchester
2        3  Charlie   200.0       Leeds

Workplace Tip

Always run df.shape before and after executing a pd.merge(). Doing a 1:Many join accidentally on a duplicated ID column can massively multiply your row count silently, leading to fatally flawed analytics statistics.

KSB Mapping¶

KSB	Description	How This Addresses It
K5.3	Common patterns in real-world data	Identifying missing values, duplicates, outliers, and class imbalance
S2	Data engineering and governance	Systematic data cleaning, transformation, and quality assessment
S3	Programming for data manipulation	pandas pipelines for data preparation
B3	Adaptability and pragmatism	Handling imperfect real-world datasets