How to Build Features using Domain Knowledge¶
The best algorithms in the world cannot discover business logic that isn't physically present in the training matrix.
What You Will Learn¶
- Transform institutional business rules explicitly into code
- Write vectorised operations to map continuous values conditionally
Step 1: Flagging High-Value Interactions¶
A Retail dataset might possess hundreds of columns, but the business structurally defines a "VIP Customer" purely via an arbitrary threshold: someone who spent \(> \$100\) and registered \(> 5\) years ago.
An algorithm will struggle blindly to discover this exact boundary linearly. You must explicitly inject this Domain Knowledge geometrically!
import pandas as pd
import numpy as np
import seaborn as sns
# We utilize Pandas DataFrame manipulation without external libraries
df = pd.DataFrame({
'total_spend': [50, 150, 200, 45],
'years_registered': [1, 6, 2, 8]
})
# Condition: Spend > 100 AND Years > 5
condition = (df['total_spend'] > 100) & (df['years_registered'] > 5)
# np.where executes this vectorised across 10 million rows instantly
df['is_vip'] = np.where(condition, 1, 0)
print(df)
Step 2: Complex Business Logic (np.select)¶
Instead of a binary 1/0 VIP status dynamically, your company might have complex tiered pricing logically.
Instead of writing dangerously slow .apply(lambda x: ...) functions natively, np.select() structurally chains multiple conditions simultaneously.
df_flights = sns.load_dataset('flights')
# We know historically logically that July/August are "Peak",
# Dec/Jan are "Holiday", and everything else is "Off-Peak"
conditions = [
df_flights['month'].isin(['Jul', 'Aug']),
df_flights['month'].isin(['Dec', 'Jan'])
]
choices = ['Peak', 'Holiday']
# The 'default' is what happens geographically if no conditions trigger
df_flights['seasonality_tier'] = np.select(conditions, choices, default='Off-Peak')
print(df_flights.sample(5, random_state=1))
Expected Output
Workplace Tip
Whenever a domain expert logically tells you "Sales explicitly drop whenever it rains consecutively after a bank holiday", mechanically encode that EXACT sentence recursively into a mathematical feature is_rain_after_holiday = [0,1]. Domain experts understand causal reality natively better than XGBoost does experimentally!
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.2 | Advanced analytics and ML techniques | Feature selection algorithms and dimensionality reduction |
| K5.2 | Data formats and structures | Encoding categorical variables, handling mixed feature types |
| S2 | Data engineering | Creating and transforming features from raw data |
| S4 | Feature selection and ML | Applying feature selection methods and PCA |
| B1 | Inquisitive approach | Exploring creative feature engineering strategies |