How to Change Data Granularity¶

Machine Learning algorithms require one row per 'event'. If your data is recorded per minute, but your predictive event is 'Total Daily Sales', you must aggregate granularities.

What You Will Learn¶

Downsample high-frequency time-series data using .resample()
Aggregate categorical transactions into grouped numeric rows using .groupby()

Step 1: Time-Series Downsampling¶

If sensors log temperature every 5 seconds, you will have \(17,280\) rows per day. If you only want to predict the "Daily High Temp", you must compress the chronological granularity computationally.

import pandas as pd
import numpy as np

# Simulate 1 week of minute-by-minute sensor data
dates = pd.date_range("2024-01-01", "2024-01-07", freq="min")
df = pd.DataFrame({
    'timestamp': dates,
    'temperature_c': np.random.normal(20, 2, len(dates))
})

# Set the timestamp as the index (mandatory for Pandas time-series)
df.set_index('timestamp', inplace=True)

# Resample to Daily ('D') frequency, capturing the Max and Mean
daily_stats = df.resample('D').agg({
    'temperature_c': ['max', 'mean']
})

print(daily_stats)

Expected Output

            temperature_c           
                      max       mean
timestamp                           
2024-01-01      25.105123  20.001235
2024-01-02      26.002341  19.981290
...

Step 2: Aggregating Categorical Groups¶

If a transactional file has 1 row per checkout receipt, but your objective is to predict "Customer Lifetime Value", you must aggregate the data so the granularity is exactly 1 row per customer_id.

# Simulating receipt-level transactions
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'purchase_amount': [15.50, 20.00, 5.00, 100.00, 50.00, 12.00],
    'item_category': ['Food', 'Food', 'Drink', 'Electronics', 'Clothing', 'Food']
})

# Group statically by Customer
customer_profile = transactions.groupby('customer_id').agg(
    total_spent=('purchase_amount', 'sum'),
    average_order_value=('purchase_amount', 'mean'),
    number_of_purchases=('purchase_amount', 'count')
).reset_index()

print(customer_profile)

Expected Output

   customer_id  total_spent  average_order_value  number_of_purchases
0            1         40.5                13.50                    3
1            2        150.0                75.00                    2
2            3         12.0                12.00                    1

Assessment Connection

In your presentation, examiners will probe whether your dataset granularity matched your theoretical problem statement. Demonstrating explicit .groupby() engineering proves you didn't simply throw raw transactional logs blindly into an algorithm!

KSB Mapping¶

KSB	Description	How This Addresses It
K5.3	Common patterns in real-world data	Identifying missing values, duplicates, outliers, and class imbalance
S2	Data engineering and governance	Systematic data cleaning, transformation, and quality assessment
S3	Programming for data manipulation	pandas pipelines for data preparation
B3	Adaptability and pragmatism	Handling imperfect real-world datasets