Scikit-Learn Preprocessing Reference¶

The sklearn.preprocessing module is the engine driving mathematical transformations.

Encoders (Categorical to Numeric)¶

Mapping unstructured text structurally into arrays.

Class	Use Case	Example
`OrdinalEncoder()`	Hierarchical/Ranked Text: Low < Medium < High	`[[0], [1], [2]]`
`OneHotEncoder(drop='first')`	Nominal Text: Red vs Blue vs Green	`[[1,0], [0,1], [0,0]]`
`LabelEncoder()`	Target Variables strictly (`Y`): Binary or Multiclass strings	`0, 1`

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Ordinal mathematically enforces scale
rank_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# OneHot purely routes string values out to parallel binary columns
ohe = OneHotEncoder(sparse_output=False, drop='first')

Scalers (Numeric to Numeric)¶

Harmonising the distance between numerical ranges.

Class	Behaviour	Use Case
`StandardScaler()`	Z-score scaling. Mean is strictly 0, std is 1.	DEFAULT: Linear regressions, SVMs, PCA
`MinMaxScaler()`	Compresses purely between minimum 0.0 to maximum 1.0	Deep Learning Tensors (Neural Networks)
`RobustScaler()`	Scales purely around Median and strictly Interquartile Range	Data containing severe extreme outliers

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sets feature distribution to Mean = 0, Std = 1
std = StandardScaler()

# Sets boundary dynamically between Min = 0.0, Max = 1.0 
minmax = MinMaxScaler()

Imputers (Missing Data)¶

Located uniquely in sklearn.impute.

Class	Strategy	Use Case
`SimpleImputer(strategy='mean')`	Fills statically with statistical Average	Normally distributed continuous algorithms
`SimpleImputer(strategy='median')`	Fills statically with geometric middle	Highly skewed geometric boundaries
`SimpleImputer(strategy='most_frequent')`	Fills statically with textual Mode	Pure categorical (Text/Object) columns

from sklearn.impute import SimpleImputer

# Instantiation mapping structurally to Median arithmetic
imputer = SimpleImputer(strategy='median')

Assessment Connection

For your K3 (Data Management) portfolio, documenting why StandardScaler was chosen explicitly over MinMaxScaler based on your analysis of the statistical feature skew guarantees top marks from assessment moderators.