Loading & Exploring Data

The first step in any ML project is understanding what you're working with.

What You Will Learn

Load data using pandas and Seaborn's built-in datasets
Inspect dataset shape, types, and basic statistics efficiently
Identify missing values and duplicates with minimal code
Create initial visualisations to understand distributions and relationships

Prerequisites

Python environment set up
Basic familiarity with pandas and seaborn

Step 1: Load Your Data

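Instead of manually loading CSVs, we will use the penguins dataset that ships with Seaborn's example data. It contains physical measurements of three penguin species and is well suited to demonstrating data preparation. Note that sns.load_dataset downloads the file the first time you call it, so it needs an internet connection.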

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Load the built-in penguins dataset
df = sns.load_dataset('penguins')

print(f"Dataset shape: {df.shape}")
df.head(3)

Expected Output

Dataset shape: (344, 7)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female

Workplace Tip

In your workplace project, document exactly where your data comes from. The assessment rubric values transparency about data sources and any SQL transformations applied before your Python analysis begins.
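Workplace data usually arrives as a CSV export rather than a built-in dataset. The following is a minimal sketch of loading one with pd.read_csv, assuming a small hypothetical extract (csv_text stands in for a real file path, and the rows are illustrative only). Pinning the text columns with dtype makes the expected schema explicit, which also helps when you document the source.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """species,island,bill_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,3750,Male
Adelie,Torgersen,,3800,Female
Gentoo,Biscoe,46.1,4500,
"""

# read_csv accepts a file path or any file-like object;
# dtype pins the text columns so they are not inferred
raw = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"species": "string", "island": "string", "sex": "string"},
)

print(f"Dataset shape: {raw.shape}")
print(raw.isna().sum())
```

Empty fields are read as missing values, so the same audit steps used on the penguins data apply unchanged.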

Step 2: First Look at the Data

Use pandas' built-in methods to summarise the columns concisely. By default, df.describe() covers only the numeric columns.

# Summary statistics for numeric columns
df.describe().round(2)

Expected Output

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
count	342.00	342.00	342.00	342.00
mean	43.92	17.15	200.92	4201.75
std	5.46	1.97	14.06	801.95
min	32.10	13.10	172.00	2700.00
25%	39.23	15.60	190.00	3550.00
50%	44.45	17.30	197.00	4050.00
75%	48.50	18.70	213.00	4750.00
max	59.60	21.50	231.00	6300.00
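The numeric summary leaves out column types and the categorical columns. As a sketch of how those could be inspected, the block below uses a small stand-in frame (sample is hypothetical and its values are illustrative, not the real penguins data) so that it runs without a download; the same calls apply to df.

```python
import pandas as pd

# Small stand-in frame using penguins column names (values are illustrative)
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 3800.0, 5000.0, None],
    "sex": ["Male", "Female", None, "Female"],
})

# Column types and non-null counts in one view
sample.info()

# Select the non-numeric columns
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# count, unique, top and freq for the non-numeric columns
print(sample[categorical_cols].describe())

# Category frequencies, keeping missing values visible
species_counts = sample["species"].value_counts(dropna=False)
print(species_counts)
```

Passing dropna=False to value_counts keeps missing categories in the tally, which is useful when a column such as sex has gaps.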

Assessment Connection

Section A of your presentation should demonstrate that you thoroughly understood your data before modelling. Examiners want to see evidence of systematic exploration, not just jumping straight to algorithms.

Step 3: Missing Values Audit

Rather than looping over columns, use pandas method chaining to generate a clean summary of missing data.

# Create a concise missing values summary
missing_summary = df.isnull().sum().sort_values(ascending=False)
print(missing_summary[missing_summary > 0])

Expected Output

sex                  11
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
dtype: int64
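Raw counts are easier to interpret alongside percentages. A minimal sketch, assuming a small hypothetical frame (sample, with illustrative values) in place of df:

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few gaps (values are illustrative)
sample = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
    "sex": ["Male", None, None, "Female"],
})

# isna().mean() gives the fraction of missing values per column
missing_table = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
})

# Keep only columns with gaps, worst first
missing_table = (
    missing_table[missing_table["n_missing"] > 0]
    .sort_values("n_missing", ascending=False)
)
print(missing_table)
```

On the full penguins data, the same table would show that the 11 missing sex values are about 3% of the 344 rows.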

For a visual representation of missing data, the missingno library is a widely used option:

# Matrix view — shows patterns of missingness
msno.matrix(df, figsize=(10, 5))
plt.title('Penguins Dataset: Missing Value Patterns', fontsize=16)
plt.tight_layout()
plt.show()

Expected Plot

Missing Value Patterns
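Duplicate rows are worth checking in the same pass as missing values. The sketch below uses a small hypothetical frame (sample, with one deliberately repeated row) to show the pandas calls involved; on your own data, replace sample with df.

```python
import pandas as pd

# Stand-in frame where the last row repeats the first (values are illustrative)
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Torgersen"],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 3750.0],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(sample.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# keep=False flags all members of each duplicate group for inspection
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# Drop the repeats only once you have confirmed they are not genuine records
deduplicated = sample.drop_duplicates()
print(f"Rows after de-duplication: {len(deduplicated)}")
```

Two animals can legitimately share identical measurements, so inspect the flagged rows before removing anything.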

Step 4: Distribution Analysis

Use Seaborn to quickly visualise the distributions of your numeric variables. Its histplot function ignores NaN values and needs less code than building the same figure directly in Matplotlib.

# Plot numeric distributions
numeric_cols = df.select_dtypes(include=[np.number]).columns
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    sns.histplot(data=df, x=col, kde=True, ax=axes[i], color='#6E368A')
    axes[i].set_title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

Expected Plot

Feature Distributions
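A numeric summary can back up what the histograms suggest. As a rough sketch, comparing the mean with the median and checking skewness gives a quick read on asymmetry; sample here is a hypothetical frame with illustrative values, including one unusually heavy bird.

```python
import numpy as np
import pandas as pd

# Stand-in numeric frame (values are illustrative)
sample = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 230.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 6300.0],
})

# mean, median and skew all skip NaN by default
shape_summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),
}).round(2)
print(shape_summary)
```

A mean well above the median together with positive skew points to a long right tail, which is what the single heavy value produces here.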

Step 5: Correlation Analysis

Check for highly correlated features early: strong pairwise correlation is a warning sign of multicollinearity that is easier to address before modelling.

# Correlation matrix
plt.figure(figsize=(8, 6))
corr_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

Expected Plot

Correlation Matrix
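With more than a handful of features, reading pairs off a heatmap gets error-prone. One way to list them programmatically is sketched below, assuming a hypothetical frame (sample, illustrative values) and an arbitrary cut-off of 0.8; the threshold is an assumption to adjust for your own project.

```python
from itertools import combinations

import pandas as pd

# Stand-in numeric frame (values are illustrative)
sample = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 230.0],
    "body_mass_g": [3750.0, 3800.0, 4000.0, 4800.0, 6000.0],
    "bill_depth_mm": [18.7, 17.4, 18.0, 15.0, 16.2],
})

corr = sample.corr()
threshold = 0.8  # assumed cut-off, not a fixed rule

# Visit each unordered pair of columns once and keep the strong ones
rows = [
    {"feature_a": a, "feature_b": b, "r": corr.loc[a, b]}
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]

high_pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])
# Order by strength regardless of sign
high_pairs = high_pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
print(high_pairs)
```

Filtering on the absolute value keeps strong negative correlations in the list as well as positive ones.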

Summary

Use sns.load_dataset for instant access to practice data
Use df.describe() for summary statistics and df.info() for column types and non-null counts
Use method chaining (df.isnull().sum().sort_values(ascending=False)) for concise reporting
Use missingno to visually identify patterns in your missing data
Use Seaborn heatmaps and histplots to keep plotting code short

Next Steps

→ Handling Missing Values — decide how to treat the missing data you've identified

Stretch & Challenge

For Advanced Learners

Try automated EDA profiling

Instead of writing the exploration steps manually, try using ydata_profiling to generate a comprehensive HTML report of your dataset in just three lines of code:

from ydata_profiling import ProfileReport

# Generate the report
profile = ProfileReport(df, title="Penguins Profiling Report")

# Save to a file to open in your browser
profile.to_file("penguins_report.html")

In your EPA presentation, an automated report like this can save manual plotting time and may surface correlations and interactions you missed.

KSB Mapping

KSB	Description	How This Addresses It
K5.3	Common patterns in real-world data	Identifying missing values, duplicates, outliers, and class imbalance
S2	Data engineering and governance	Systematic data cleaning, transformation, and quality assessment
S3	Programming for data manipulation	pandas pipelines for data preparation
B3	Adaptability and pragmatism	Handling imperfect real-world datasets