How to Load and Clean a Messy Dataset¶
Real-world data is rarely formatted cleanly. Learn to quickly bypass bad metadata, skip rows, and coerce stubborn strings into workable floats.
What You Will Learn¶
- Skip preamble metadata headers when reading datasets
- Handle inconsistent missing value indicators (
N/A,?,missing) - Convert currency strings (
$1,000) into numeric columns instantly - Rename columns programmatically into standard python snake_case
Step 1: Handling Inconsistent Missing Indicators¶
When using built-in datasets like sns.load_dataset('titanic'), data is pre-cleaned. In real assessments, your stakeholders will upload files where missing values are marked inconsistently. You can handle this natively inside the read_csv parser.
import pandas as pd
import numpy as np
# We simulate a messy CSV by creating a DataFrame with inconsistent missing values
messy_data = pd.DataFrame({
'Passenger Id': [1, 2, 3],
'Age': ['22', '?', 'N/A'],
'Fare': ['$7.25', '$71.28', 'Missing']
})
# In the real world, you'd use pd.read_csv('messy.csv', na_values=['?', 'N/A', 'Missing'])
# Since we simulate the data, we use replace:
messy_data.replace(['?', 'N/A', 'Missing'], np.nan, inplace=True)
print(messy_data)
Step 2: Fixing Column Names¶
Spaces and Capital Letters in column names break your ability to use dot-notation (e.g. df.Age). Convert them all cleanly using a lambda functional block to standardise them to snake_case.
# Strip whitespace, lowercase, and replace spaces with underscores
messy_data.columns = [col.strip().lower().replace(' ', '_') for col in messy_data.columns]
print(messy_data.columns)
Step 3: Coerce Strings to Numeric Types¶
Often currency or percentages load as object text (e.g. $7.25). We must strip the characters before converting them to floats.
# Convert Age mathematically to numeric
messy_data['age'] = pd.to_numeric(messy_data['age'])
# Strip the '$' from Fare and convert computationally
messy_data['fare'] = messy_data['fare'].str.replace('$', '').astype(float)
print(messy_data.dtypes)
Workplace Tip
Do not iterate over rows manually using for index, row in df.iterrows() to fix strings! Python string vectorised operations like .str.replace() run exponentially faster natively in C underneath pandas and cleanly format 10 million rows in 0.5s.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K5.3 | Common patterns in real-world data | Identifying missing values, duplicates, outliers, and class imbalance |
| S2 | Data engineering and governance | Systematic data cleaning, transformation, and quality assessment |
| S3 | Programming for data manipulation | pandas pipelines for data preparation |
| B3 | Adaptability and pragmatism | Handling imperfect real-world datasets |