Why Data Preparation Matters¶
"Garbage In, Garbage Out." The performance of a Machine Learning algorithm is bounded by the quality and structure of the numeric arrays it receives.
The Algorithmic Reality¶
Algorithms are blind. A Random Forest does not know what a "Customer" or a "Sensor" is. It only sees a geometric landscape of Floats. If you feed it raw CSV data without preparation, it will either:
- Crash immediately: algorithms throw a `ValueError` if they encounter `NaN`s (missing values) or strings.
- Learn the wrong patterns: if you feed it a UUID column like `customer_id`, the algorithm will attempt to find a mathematical correlation between being `Customer 1000` and `Customer 1001`, inventing a numerical relationship where none exists.
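A minimal sketch of the first failure mode, using a toy matrix with a missing value (the data and column layout here are illustrative, not from the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix: an ID-like column plus a measurement with a missing value.
X = np.array([[1000, 5.0],
              [1001, np.nan],
              [1002, 7.0]])
y = np.array([0, 1, 0])

try:
    RandomForestClassifier().fit(X, y)
except ValueError as exc:
    # The NaN in row 1 is rejected outright before any training happens.
    print(f"ValueError: {exc}")
```

Removing or imputing the `NaN` (and dropping the ID column) before `fit` is exactly the preparation work this module covers.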
The 80/20 Rule of Data Science¶
Industry surveys consistently show Data Scientists spend 80% of their time finding, cleaning, and organising data, leaving only 20% for actual algorithm training.
Why? Because algorithmic development is largely automated now via libraries like Scikit-Learn. You can train a state-of-the-art Gradient Boosting model in exactly three lines of code:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
```
The complexity of modern Data Science lies entirely in the engineering required to produce that flawless X_train matrix.
The Three Pillars of Preparation¶
Data Preparation is divided into three strict chronological phases:
- Data Cleansing (Quality): finding and destroying `NaN`s, `Null`s, and structural anomalies.
- Feature Engineering (Extraction): safely deriving a numeric `Age` float column from a messy raw `Date of Birth` string column.
- Data Transformation (Formatting): encoding strings into one-hot arrays and mathematically standardising numeric scales.
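The three phases can be sketched as a short pandas pipeline. The column names and reference date below are hypothetical, chosen only to mirror the `Date of Birth` → `Age` example above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with the usual defects: missing values, strings.
raw = pd.DataFrame({
    "date_of_birth": ["1990-04-01", "1985-12-15", None],
    "country": ["UK", "FR", "UK"],
    "income": [42_000.0, None, 51_000.0],
})

# 1. Cleansing: drop rows missing a date of birth, impute missing income.
clean = raw.dropna(subset=["date_of_birth"]).copy()
clean["income"] = clean["income"].fillna(clean["income"].median())

# 2. Feature engineering: derive a numeric Age from the raw string column.
dob = pd.to_datetime(clean["date_of_birth"])
clean["age"] = (pd.Timestamp("2024-01-01") - dob).dt.days / 365.25

# 3. Transformation: one-hot encode strings, standardise numeric scales.
encoded = pd.get_dummies(clean.drop(columns=["date_of_birth"]),
                         columns=["country"])
encoded[["income", "age"]] = StandardScaler().fit_transform(
    encoded[["income", "age"]])
```

The result is a fully numeric matrix ready to become `X_train`; each phase runs in the strict order listed above because later steps assume earlier ones have succeeded.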
The Consequence of Failure¶
If you fail to standardise your data (e.g. comparing kilometres to millimetres):

- Optimisation algorithms like Gradient Descent may converge very slowly or not at all.
- Distance-based methods like K-Means and K-Nearest Neighbors will be dominated by features with larger absolute magnitudes, effectively ignoring small-scale features regardless of their genuine predictive signal.
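A minimal sketch of the distance problem, using two invented features on wildly different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two samples; column 0 is in metres, column 1 is a small ratio.
X = np.array([[10_000.0, 0.1],
              [10_050.0, 0.9]])

# Raw Euclidean distance is dominated by the metres column:
# sqrt(50**2 + 0.8**2) is roughly 50, so the ratio barely registers.
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardisation both features contribute comparably.
scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
```

Any distance-based model fed the raw matrix would cluster and classify almost entirely on the metres column, whatever the ratio's true predictive value.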
Assessment Connection
In your EPA presentation, explicitly documenting why you spent time preparing the dataset before launching an algorithm demonstrates the architectural maturity expected at Distinction level.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K5.3 | Common patterns in real-world data | Identifying missing values, duplicates, outliers, and class imbalance |
| S2 | Data engineering and governance | Systematic data cleaning, transformation, and quality assessment |
| S3 | Programming for data manipulation | pandas pipelines for data preparation |
| B3 | Adaptability and pragmatism | Handling imperfect real-world datasets |