Skip to content

Why Data Preparation Matters

"Garbage In, Garbage Out." The performance of a Machine Learning algorithm is strictly bounded by the structural quality of the tensor arrays it receives.

The Algorithmic Reality

Algorithms are blind. A Random Forest does not know what a "Customer" or a "Sensor" is. It only sees a geometric landscape of Floats. If you feed it raw CSV data without preparation, it will either:

  1. Crash immediately: Algorithms throw ValueError if they encounter NaNs (missing values) or Strings.
  2. Learn the wrong patterns: If you feed it a UUID column like customer_id, the algorithm will attempt to find a mathematical correlation between being Customer 1000 and Customer 1001, finding an entirely fictional numerical relationship where none physically exists.

The 80/20 Rule of Data Science

Industry surveys consistently show Data Scientists spend 80% of their time finding, cleaning, and organising data, leaving only 20% for actual algorithm training.

Why? Because algorithmic development is largely automated now via libraries like Scikit-Learn. You can train a state-of-the-art Gradient Boosting model in exactly three lines of code:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)

The complexity of modern Data Science lies entirely in the engineering required to produce that flawless X_train matrix.

The Three Pillars of Preparation

Data Preparation is divided into three strict chronological phases:

  1. Data Cleansing (Quality): Finding and destroying NaNs, Nulls, and structural anomalies.
  2. Feature Engineering (Extraction): Generating a Age numeric float column safely from a messy raw Date of Birth string column.
  3. Data Transformation (Formatting): Encoding strings universally into One-Hot Arrays and mathematically standardizing numeric scales.

The Consequence of Failure

If you fail to standardise your data (e.g. comparing Kilometres to Millimetres): - Your optimization algorithms (like Gradient Descent) will mathematically fail to converge. - K-Means and K-Nearest Neighbors will prioritize features with larger absolute numbers linearly, entirely ignoring small decimals regardless of their genuine predictive signal.

Assessment Connection

In your EPA presentation, explicitly documenting why you spent time preparing the dataset before launching an algorithm demonstrates the architectural maturity level required for an immediate Distinction.

KSB Mapping

KSB Description How This Addresses It
K5.3 Common patterns in real-world data Identifying missing values, duplicates, outliers, and class imbalance
S2 Data engineering and governance Systematic data cleaning, transformation, and quality assessment
S3 Programming for data manipulation pandas pipelines for data preparation
B3 Adaptability and pragmatism Handling imperfect real-world datasets