Datasets

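A curated list of the datasets used throughout this module. None require API keys or manual downloads: the scikit-learn datasets ship with the library, and seaborn fetches its datasets from its online repository on first use, then caches them locally.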

Seaborn Datasets

Load any of these with `sns.load_dataset("name")`:

Dataset	Task	Target Variable
`titanic`	Binary Classification	`survived` (0/1)
`penguins`	Multiclass Classification	`species` (Adelie, Chinstrap, Gentoo)
`tips`	Regression	`tip` (continuous)
`taxis`	Regression	`fare` (continuous)
`diamonds`	Regression	`price` (continuous)
`iris`	Multiclass Classification	`species`

import seaborn as sns

df = sns.load_dataset("titanic")
print(df.shape)
print(df.head())
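
Each row of the table above pairs a dataset with its target column, so a common next step is to separate that column from the features. The following is a minimal sketch of that step. `split_features_target` is a hypothetical helper (not part of seaborn or pandas), and the small frame only mimics the layout of `tips` with made-up rows so the snippet runs offline.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str):
    """Return (X, y): every column except `target`, and the target column itself."""
    # Hypothetical helper for illustration; not part of seaborn or pandas.
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")
    X = df.drop(columns=[target])
    y = df[target]
    return X, y


# Made-up rows that only mimic the layout of the `tips` dataset.
toy = pd.DataFrame(
    {
        "total_bill": [10.0, 20.5, 15.25],
        "size": [2, 3, 2],
        "tip": [1.5, 3.0, 2.25],
    }
)

X, y = split_features_target(toy, "tip")
print(list(X.columns))
print(y.name)
```

With a real dataset, the call would look like `split_features_target(sns.load_dataset("tips"), "tip")`.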

Scikit-learn ships these datasets with the library. Load them with `from sklearn.datasets import <function>`:

Function	Task	Samples	Features
`load_breast_cancer()`	Binary Classification	569	30
`load_iris()`	Multiclass Classification	150	4
`load_diabetes()`	Regression	442	10
`load_wine()`	Multiclass Classification	178	13
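
As a quick check, each loader can be called with `return_X_y=True` to get the feature matrix and target as arrays. The sketch below loads every dataset in the table and records its shape. The `loaders` and `shapes` dictionaries are illustrative names only.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine

loaders = {
    "breast_cancer": load_breast_cancer,
    "iris": load_iris,
    "diabetes": load_diabetes,
    "wine": load_wine,
}

shapes = {}
for name, loader in loaders.items():
    # return_X_y=True returns (data, target) instead of a Bunch object.
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features")
```

Calling a loader without `return_X_y=True` returns a `Bunch` object instead, which also carries the feature names and a dataset description.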

Generate synthetic data with these `sklearn.datasets` functions:

Function	Purpose
`make_classification()`	Generate synthetic classification data with controllable complexity
`make_regression()`	Generate synthetic regression data
`make_blobs()`	Generate clustered data for unsupervised learning
`make_moons()`	Generate non-linear, crescent-shaped clusters

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
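
The other generators in the table follow the same calling pattern. The sketch below shows one possible set of arguments, chosen only for illustration. The sample counts, noise levels, and seeds are assumptions, not values used elsewhere in this module.

```python
from sklearn.datasets import make_blobs, make_moons, make_regression

# Regression: continuous target with Gaussian noise added to a linear signal.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Blobs: three isotropic Gaussian clusters in two dimensions.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Moons: two interleaving half-circles, a standard non-linearly separable case.
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)

print(X_reg.shape, X_blobs.shape, X_moons.shape)
```

`make_blobs` also returns the cluster labels. These can be ignored for unsupervised work or kept to evaluate a clustering afterwards.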

This module addresses the following Knowledge, Skills and Behaviours (KSBs):

KSB	Description	How This Addresses It
K1	Context of Data Science	Understanding where ML sits within the broader discipline
S3	Programming languages and tools	Setting up the development environment and dependencies
B6	Commitment to keeping up to date	Engaging with current ML resources and research