Skip to content

How to Load and Clean a Messy Dataset

Real-world data is rarely formatted cleanly. Learn to quickly bypass bad metadata, skip rows, and coerce stubborn strings into workable floats.

What You Will Learn

  • Skip preamble metadata headers when reading datasets
  • Handle inconsistent missing value indicators (N/A, ?, missing)
  • Convert currency strings ($1,000) into numeric columns instantly
  • Rename columns programmatically into standard python snake_case

Step 1: Handling Inconsistent Missing Indicators

When using built-in datasets like sns.load_dataset('titanic'), data is pre-cleaned. In real assessments, your stakeholders will upload files where missing values are marked inconsistently. You can handle this natively inside the read_csv parser.

import pandas as pd
import numpy as np

# We simulate a messy CSV by creating a DataFrame with inconsistent missing values
messy_data = pd.DataFrame({
    'Passenger Id': [1, 2, 3],
    'Age': ['22', '?', 'N/A'],
    'Fare': ['$7.25', '$71.28', 'Missing']
})

# In the real world, you'd use pd.read_csv('messy.csv', na_values=['?', 'N/A', 'Missing'])
# Since we simulate the data, we use replace:
messy_data.replace(['?', 'N/A', 'Missing'], np.nan, inplace=True)

print(messy_data)
Expected Output
   Passenger Id   Age   Fare
0             1    22  $7.25
1             2   NaN  $71.28
2             3   NaN    NaN

Step 2: Fixing Column Names

Spaces and Capital Letters in column names break your ability to use dot-notation (e.g. df.Age). Convert them all cleanly using a lambda functional block to standardise them to snake_case.

# Strip whitespace, lowercase, and replace spaces with underscores
messy_data.columns = [col.strip().lower().replace(' ', '_') for col in messy_data.columns]
print(messy_data.columns)
Expected Output
Index(['passenger_id', 'age', 'fare'], dtype='object')

Step 3: Coerce Strings to Numeric Types

Often currency or percentages load as object text (e.g. $7.25). We must strip the characters before converting them to floats.

# Convert Age mathematically to numeric
messy_data['age'] = pd.to_numeric(messy_data['age'])

# Strip the '$' from Fare and convert computationally
messy_data['fare'] = messy_data['fare'].str.replace('$', '').astype(float)

print(messy_data.dtypes)
Expected Output
passenger_id      int64
age             float64
fare            float64
dtype: object

Workplace Tip

Do not iterate over rows manually using for index, row in df.iterrows() to fix strings! Python string vectorised operations like .str.replace() run exponentially faster natively in C underneath pandas and cleanly format 10 million rows in 0.5s.

KSB Mapping

KSB Description How This Addresses It
K5.3 Common patterns in real-world data Identifying missing values, duplicates, outliers, and class imbalance
S2 Data engineering and governance Systematic data cleaning, transformation, and quality assessment
S3 Programming for data manipulation pandas pipelines for data preparation
B3 Adaptability and pragmatism Handling imperfect real-world datasets