Datasets
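A curated list of built-in datasets used throughout this module. No API keys are required. The scikit-learn loaders and generators work offline, while seaborn fetches each dataset from its online repository on first use and caches it locally.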
Seaborn Datasets
Load any of these with `sns.load_dataset('name')`:

| Dataset | Task | Target Variable |
|---|---|---|
| titanic | Binary Classification | survived (0/1) |
| penguins | Multiclass Classification | species (Adelie, Chinstrap, Gentoo) |
| tips | Regression | tip (continuous) |
| taxis | Regression | fare (continuous) |
| diamonds | Regression | price (continuous) |
| iris | Multiclass Classification | species |

```python
import seaborn as sns

# Load the Titanic dataset as a pandas DataFrame
df = sns.load_dataset("titanic")
print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
```
Scikit-Learn Datasets
Load these with `from sklearn.datasets import <function>`:

| Function | Task | Samples | Features |
|---|---|---|---|
| load_breast_cancer() | Binary Classification | 569 | 30 |
| load_iris() | Multiclass Classification | 150 | 4 |
| load_diabetes() | Regression | 442 | 10 |
| load_wine() | Multiclass Classification | 178 | 13 |

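
The sample and feature counts listed for the scikit-learn loaders can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the `loaders` and `shapes` names are illustrative and not part of the module.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine

# Illustrative mapping from a short label to each loader function.
loaders = {
    "breast_cancer": load_breast_cancer,
    "iris": load_iris,
    "diabetes": load_diabetes,
    "wine": load_wine,
}

shapes = {}
for name, loader in loaders.items():
    # return_X_y=True returns (data, target) arrays instead of a Bunch object.
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features")
```

Passing `as_frame=True` instead returns pandas objects, which can be convenient when the column names matter.
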
Synthetic Generators
| Function | Purpose |
|---|---|
| make_classification() | Generate synthetic classification data with controllable complexity |
| make_regression() | Generate synthetic regression data |
| make_blobs() | Generate clustered data for unsupervised learning |
| make_moons() | Generate non-linear, crescent-shaped clusters |

```python
from sklearn.datasets import make_classification

# 1,000 samples with 20 features, 10 of them informative; seeded for reproducibility
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
```
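
The other generators follow the same pattern. The sketch below shows one plausible call for each; the sample sizes, noise levels, and variable names are illustrative assumptions rather than values used elsewhere in the module.

```python
from sklearn.datasets import make_blobs, make_moons, make_regression

# Regression: 500 samples, 5 features, Gaussian noise added to the target.
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

# Clustering: three isotropic Gaussian blobs in two dimensions.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Two interleaving half-circles; noise controls how much the crescents blur.
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)

print(X_reg.shape, X_blobs.shape, X_moons.shape)
```

Fixing `random_state` keeps the generated data identical between runs, which makes results easier to compare.
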
KSB Mapping
| KSB | Description | How This Addresses It |
|---|---|---|
| K1 | Context of Data Science | Understanding where ML sits within the broader discipline |
| S3 | Programming languages and tools | Setting up the development environment and dependencies |
| B6 | Commitment to keeping up to date | Engaging with current ML resources and research |