Tidy Data Principles¶
Tidy data is a standard way of structuring a DataFrame so that machine learning algorithms can consume it directly.
The Three Rules of Tidy Data¶
Before an algorithm can ingest your DataFrame, it should adhere to the "Tidy Data" principles popularised by Hadley Wickham.
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
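As a minimal illustration of the three rules (the city/temperature data here is invented for the example), each variable gets a column, each observation a row, and each value a cell:

```python
import pandas as pd

# Illustrative data: each variable (city, date, temp_c) has its own column;
# each observation (one city on one date) has its own row;
# each value sits in exactly one cell.
tidy = pd.DataFrame({
    'city': ['London', 'London', 'Manchester', 'Manchester'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'temp_c': [4.1, 5.3, 3.8, 4.9]
})

print(tidy)
```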
Example: Untidy Data (Wide Format)¶
Often, stakeholders will send you data designed for human readability (like a pivot table), which is awkward for machine learning.
| Region | 2021_Sales | 2022_Sales | 2023_Sales |
|---|---|---|---|
| London | 1500 | 1700 | 2000 |
| Manchester | 800 | 850 | 900 |
Why is this broken?
- 2021_Sales and 2022_Sales are not separate variables; they are the same variable (Sales) recorded in different years.
- The actual variable, Year, is hiding inside the column headers.
The Solution: Tidy Data (Long Format)¶
We can use the pandas .melt() method to reshape this table from wide to long format.
```python
import pandas as pd

untidy_df = pd.DataFrame({
    'Region': ['London', 'Manchester'],
    '2021_Sales': [1500, 800],
    '2022_Sales': [1700, 850]
})

# Melt the wide year columns into Year/Sales pairs
tidy_df = untidy_df.melt(
    id_vars=['Region'],
    var_name='Year',
    value_name='Sales'
)

# Strip the '_Sales' suffix and convert Year to an integer
tidy_df['Year'] = tidy_df['Year'].str.replace('_Sales', '').astype(int)

print(tidy_df)
```
Expected Output

           Region  Year  Sales
    0      London  2021   1500
    1  Manchester  2021    800
    2      London  2022   1700
    3  Manchester  2022    850
Now your data adheres to the rule: 1 Observation = 1 Row. You can feed this table into an algorithm to predict Sales using Region and Year as features.
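As a sketch of that final modelling step (the feature-encoding choice here is illustrative, not prescribed), one-hot encoding the categorical Region column with pd.get_dummies yields a purely numeric feature matrix ready for an estimator:

```python
import pandas as pd

# The tidy table from above, reconstructed for a self-contained example
tidy_df = pd.DataFrame({
    'Region': ['London', 'Manchester', 'London', 'Manchester'],
    'Year': [2021, 2021, 2022, 2022],
    'Sales': [1500, 800, 1700, 850]
})

# One-hot encode Region so every feature column is numeric/boolean,
# which is what most scikit-learn-style estimators expect.
X = pd.get_dummies(tidy_df[['Region', 'Year']], columns=['Region'])
y = tidy_df['Sales']

print(X.columns.tolist())
# ['Year', 'Region_London', 'Region_Manchester']
```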
Workplace Tip
Whenever a colleague sends you a formatted Excel sheet with merged cells, multiple header rows, and colour-coded highlights, strip the visual formatting and .melt() the data into a flat, tidy format before attempting any analysis in Python.
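As a hedged sketch of that clean-up (the inline CSV and its column names are invented stand-ins; a real spreadsheet would arrive via pd.read_excel), a two-row header can be read with header=[0, 1], flattened, and then melted as usual:

```python
import io
import pandas as pd

# Stand-in for a spreadsheet export with two header rows:
# a year row on top and a metric row underneath.
raw = io.StringIO(
    "Region,2021,2022\n"
    ",Sales,Sales\n"
    "London,1500,1700\n"
    "Manchester,800,850\n"
)

# header=[0, 1] reads both header rows into a MultiIndex of columns
df = pd.read_csv(raw, header=[0, 1])

# Flatten ('2021', 'Sales') -> '2021_Sales'; the blank cell under
# 'Region' is read as an 'Unnamed: ...' placeholder, so keep 'Region' as-is.
df.columns = [
    top if str(bottom).startswith('Unnamed') else f"{top}_{bottom}"
    for top, bottom in df.columns
]

tidy = df.melt(id_vars=['Region'], var_name='Year', value_name='Sales')
tidy['Year'] = tidy['Year'].str.replace('_Sales', '').astype(int)
print(tidy)
```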
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K5.3 | Common patterns in real-world data | Identifying missing values, duplicates, outliers, and class imbalance |
| S2 | Data engineering and governance | Systematic data cleaning, transformation, and quality assessment |
| S3 | Programming for data manipulation | pandas pipelines for data preparation |
| B3 | Adaptability and pragmatism | Handling imperfect real-world datasets |