Algorithm Selection Flowchart¶
Choosing the right algorithm requires answering a short sequence of questions about your data and target variable.
The Decision Flow¶
Step 1: Do you have labels (a target variable y)?¶
- No → Unsupervised Learning (see Topic 5: Clustering — K-Means, DBSCAN).
- Yes → Supervised Learning. Move to Step 2.
Step 2: Is your target variable continuous (e.g., £ price, temperature)?¶
- Yes → Regression.
- Start with:
LinearRegression(interpretable baseline). - If non-linear patterns exist:
RandomForestRegressororGradientBoostingRegressor. - If you need regularisation:
Ridge,Lasso, orElasticNet.
- Start with:
- No → Classification. Move to Step 3.
Step 3: Classification — how complex is the decision boundary?¶
- Start with:
LogisticRegression(interpretable baseline). - If non-linear patterns exist:
RandomForestClassifierorGradientBoostingClassifier. - If you need probability outputs: Ensure the model supports
predict_proba()(most tree-based and logistic models do; default SVM does not). - If you have very high-dimensional sparse data (e.g., text):
MultinomialNBorSGDClassifier.
Step 4: Scale and performance considerations¶
| Consideration | Recommended Action |
|---|---|
| Dataset > 100k rows | Use HistGradientBoostingClassifier or LightGBM for speed |
| Interpretability required | Use LogisticRegression or DecisionTreeClassifier (shallow) |
| Maximum accuracy needed | Use GradientBoostingClassifier / XGBoost with hyperparameter tuning |
| Many irrelevant features | Apply Lasso (L1) for automatic feature selection |
Workplace Tip
When in doubt, train a RandomForestClassifier with default parameters as your first model. It handles mixed feature types, non-linear relationships, and missing values (in some implementations) with minimal preprocessing.
KSB Mapping¶
| KSB | Description | How This Addresses It |
|---|---|---|
| K4.1 | Statistical models and methods | Understanding the statistical basis of regression and classification |
| K4.2 | ML and AI techniques | Implementing and comparing supervised learning algorithms |
| K4.4 | Resource constraints and trade-offs | Model complexity vs interpretability; computational cost |
| S1 | Scientific methods and hypothesis testing | Formulating hypotheses and testing with rigorous validation |
| S4 | Building models and validating | Cross-validation, train/test evaluation, performance metrics |
| B5 | Impartial, hypothesis-driven approach | Honest evaluation of model performance and limitations |