Pandas Data Prep Cheatsheet¶
Quick reference for the most common Pandas operations used during Data Preparation.
Reading Data¶
import pandas as pd
import numpy as np
import seaborn as sns
# Load a built-in seaborn dataset effortlessly
df = sns.load_dataset('titanic')
# Read standard formats
csv_df = pd.read_csv('file.csv', na_values=['?'])
excel_df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
Basic Inspection¶
df.head(5) # View first 5 rows
df.tail(3) # View last 3 rows
df.shape # Tuple of (rows, cols)
df.info() # Column types, Non-Null counts, Memory usage
df.describe() # Summary statistics (mean, std, min, max, quartiles)
df.columns # List of column names
df.dtypes # Data types of all columns
Selecting & Filtering Columns¶
# Select a single column (returns a Series)
ages = df['age']
# Select multiple columns (returns a DataFrame)
subset = df[['age', 'fare', 'survived']]
# Select columns algorithmically by mathematical datatype
numeric_df = df.select_dtypes(include=[np.number])
text_df = df.select_dtypes(include=['object', 'category'])
Filtering Rows (Boolean Indexing)¶
# Filter conditionally by value
adults = df[df['age'] >= 18]
# Filter using multiple exact conditions (use & for AND, | for OR)
female_survivors = df[(df['sex'] == 'female') & (df['survived'] == 1)]
# Filter using exactly matched lists
first_class = df[df['pclass'].isin([1, 2])]
# Filter natively using string matching
miss_titles = df[df['who'].str.contains('child')]
Missing Data Operations¶
df.isnull().sum() # Count NaNs structurally per column
df.dropna() # EXTREMELY DANGEROUS: Drops any row with a NaN
df.dropna(subset=['age']) # Safely drop rows ONLY if 'age' is missing
df.dropna(thresh=10, axis=1) # Drop specifically columns containing less than 10 non-NaN values
df['age'].fillna(30) # Permanently replace ALL NaNs with 30
df['age'].fillna(df['age'].median()) # Replace with the median algorithmically
Column Manipulation¶
# Rename explicitly using dictionaries
df = df.rename(columns={'survived': 'Labels'})
# Drop permanently single columns
df = df.drop(columns=['Labels', 'age'])
# Typecast mathematically
df['fare'] = df['fare'].astype('float32')
Grouping & Aggregating¶
# Group securely by categorical text and calculate aggregates
grouped = df.groupby('sex').agg({
'age': ['mean', 'max'],
'fare': 'sum'
}).reset_index()
Workplace Tip
Bookmark this page during your EPA coding challenges. Quickly retrieving exact Pandas syntax for .agg() or .select_dtypes() under pressure demonstrates fluency and saves critical computational time during assessment interviews!