Tidy Data Principles¶
Tidy data is a standard way of structuring a DataFrame so that machine learning algorithms can consume it directly.
The Three Rules of Tidy Data¶
Before an algorithm can ingest your DataFrame, it should adhere to the "Tidy Data" principles popularised by Hadley Wickham.
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
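As a minimal illustration of the three rules (the city/temperature data here is invented for the example), each variable gets a column, each observation a row, and each value a cell:

```python
import pandas as pd

# Illustrative data: each variable (city, date, temp_c) has its own column;
# each observation (one city on one date) has its own row;
# each value sits in exactly one cell.
tidy = pd.DataFrame({
    'city': ['London', 'London', 'Manchester', 'Manchester'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'temp_c': [4.1, 5.3, 3.8, 4.9]
})

print(tidy)
```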
Example: Untidy Data (Wide Format)¶
Often, stakeholders will send you data designed for human readability (like a pivot table), which is awkward for machine learning.
| Region | 2021_Sales | 2022_Sales | 2023_Sales |
|---|---|---|---|
| London | 1500 | 1700 | 2000 |
| Manchester | 800 | 850 | 900 |
Why is this broken?
- 2021_Sales and 2022_Sales are not separate variables; they are the same variable (Sales) recorded in different years.
- The actual variable, Year, is hiding inside the column headers.
The Solution: Tidy Data (Long Format)¶
We can use the pandas .melt() method to reshape this table from wide to long format.
```python
import pandas as pd

untidy_df = pd.DataFrame({
    'Region': ['London', 'Manchester'],
    '2021_Sales': [1500, 800],
    '2022_Sales': [1700, 850]
})

# Melt the wide year columns into Year/Sales pairs
tidy_df = untidy_df.melt(
    id_vars=['Region'],
    var_name='Year',
    value_name='Sales'
)

# Strip the '_Sales' suffix and convert Year to an integer
tidy_df['Year'] = tidy_df['Year'].str.replace('_Sales', '').astype(int)

print(tidy_df)
```
Expected Output

           Region  Year  Sales
    0      London  2021   1500
    1  Manchester  2021    800
    2      London  2022   1700
    3  Manchester  2022    850
Now your data adheres to the rule: 1 Observation = 1 Row. You can feed this table into an algorithm to predict Sales using Region and Year as features.
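As a sketch of that final modelling step (the feature-encoding choice here is illustrative, not prescribed), one-hot encoding the categorical Region column with pd.get_dummies yields a purely numeric feature matrix ready for an estimator:

```python
import pandas as pd

# The tidy table from above, reconstructed for a self-contained example
tidy_df = pd.DataFrame({
    'Region': ['London', 'Manchester', 'London', 'Manchester'],
    'Year': [2021, 2021, 2022, 2022],
    'Sales': [1500, 800, 1700, 850]
})

# One-hot encode Region so every feature column is numeric/boolean,
# which is what most scikit-learn-style estimators expect.
X = pd.get_dummies(tidy_df[['Region', 'Year']], columns=['Region'])
y = tidy_df['Sales']

print(X.columns.tolist())
# ['Year', 'Region_London', 'Region_Manchester']
```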
Workplace Tip
Whenever a colleague sends you a formatted Excel sheet with merged cells, multiple header rows, and colour-coded highlights, strip the visual formatting and .melt() the data into a flat, tidy format before attempting any analysis in Python.
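As a hedged sketch of that clean-up (the inline CSV and its column names are invented stand-ins; a real spreadsheet would arrive via pd.read_excel), a two-row header can be read with header=[0, 1], flattened, and then melted as usual:

```python
import io
import pandas as pd

# Stand-in for a spreadsheet export with two header rows:
# a year row on top and a metric row underneath.
raw = io.StringIO(
    "Region,2021,2022\n"
    ",Sales,Sales\n"
    "London,1500,1700\n"
    "Manchester,800,850\n"
)

# header=[0, 1] reads both header rows into a MultiIndex of columns
df = pd.read_csv(raw, header=[0, 1])

# Flatten ('2021', 'Sales') -> '2021_Sales'; the blank cell under
# 'Region' is read as an 'Unnamed: ...' placeholder, so keep 'Region' as-is.
df.columns = [
    top if str(bottom).startswith('Unnamed') else f"{top}_{bottom}"
    for top, bottom in df.columns
]

tidy = df.melt(id_vars=['Region'], var_name='Year', value_name='Sales')
tidy['Year'] = tidy['Year'].str.replace('_Sales', '').astype(int)
print(tidy)
```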
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K5.3 | Common patterns in real-world data | Identifying missing values, duplicates, outliers, and class imbalance |
| S2 | Data engineering and governance | Systematic data cleaning, transformation, and quality assessment |
| S3 | Programming for data manipulation | pandas pipelines for data preparation |
| B3 | Adaptability and pragmatism | Handling imperfect real-world datasets |