Introduction: Understanding Core ML Algorithms
Machine learning algorithms enable computers to learn patterns from data without explicit programming. These foundational algorithms form the backbone of modern data science, analytics, and artificial intelligence applications. This cheatsheet covers five essential ML algorithms—Linear Regression, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and K-Nearest Neighbors (KNN)—providing a practical reference for understanding their mechanics, use cases, strengths, limitations, and implementation considerations. Whether you’re a beginner learning the basics or a practitioner refreshing your knowledge, this guide offers concise information to support your ML journey.
Core Machine Learning Concepts & Terminology
Before diving into specific algorithms, let’s establish the fundamental concepts that apply across machine learning:
Learning Paradigms
- Supervised Learning: Training with labeled data (input-output pairs)
- Unsupervised Learning: Finding patterns in unlabeled data
- Semi-supervised Learning: Training with both labeled and unlabeled data
- Reinforcement Learning: Learning through action-reward feedback
Dataset Terminology
- Features (X): Input variables or attributes (independent variables)
- Target (y): Output variable to predict (dependent variable)
- Training Set: Data used to train the model
- Validation Set: Data used for tuning hyperparameters
- Test Set: Unseen data used to evaluate model performance
- Cross-validation: Technique to assess model performance by partitioning data
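As a quick illustration of cross-validation, the following sketch (assuming a feature matrix X and target y are already loaded; the estimator choice is illustrative) uses scikit-learn's cross_val_score to estimate performance across folds:
# Minimal cross-validation sketch (X and y assumed to exist)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # one score per fold
print(scores.mean(), scores.std())  # average performance and its variability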
Model Evaluation Metrics
- Regression Metrics: MSE, RMSE, MAE, R²
- Classification Metrics: Accuracy, Precision, Recall, F1-score, AUC-ROC
- Confusion Matrix: Table showing TP, TN, FP, FN predictions
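For concreteness, here is a minimal sketch of computing these classification metrics with scikit-learn, assuming y_test and y_pred come from a fitted binary classifier:
# Classification metrics and confusion matrix (binary case)
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # TN, FP, FN, TP counts
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall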
Key Terminology
- Hyperparameters: Model configuration settings set before training
- Parameters: Values learned during training
- Overfitting: Model performs well on training data but poorly on new data
- Underfitting: Model fails to capture the underlying pattern in the data
- Bias-Variance Tradeoff: Balance between model simplicity and complexity
Linear Regression
Conceptual Overview
Linear Regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the data.
Mathematical Formula
Simple Linear Regression: y = β₀ + β₁x + ε
Multiple Linear Regression: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε
Where:
- y is the dependent variable (target)
- x, x₁, x₂, etc. are independent variables (features)
- β₀ is the y-intercept (bias)
- β₁, β₂, etc. are coefficients (weights)
- ε is the error term
Cost Function
Mean Squared Error (MSE): J(β) = (1/n) Σ(yᵢ - ŷᵢ)²
Optimization Method
- Ordinary Least Squares (OLS): Analytical solution for minimizing MSE
- Gradient Descent: Iterative optimization to find the minimum of the cost function
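To make the gradient-descent route concrete, here is a minimal NumPy sketch (not scikit-learn's implementation) that minimizes MSE for simple linear regression; X is assumed here to be a 1-D feature array, y the target, and the learning rate is illustrative:
# Gradient descent on MSE for y = b0 + b1*x (toy sketch)
import numpy as np
b0, b1 = 0.0, 0.0
lr = 0.01  # learning rate; may need tuning, especially for unscaled features
for _ in range(1000):
    y_hat = b0 + b1 * X                  # current predictions
    error = y_hat - y
    b0 -= lr * 2 * error.mean()          # gradient of MSE with respect to b0
    b1 -= lr * 2 * (error * X).mean()    # gradient of MSE with respect to b1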
Key Characteristics
- Model Type: Parametric, supervised learning
- Output: Continuous values
- Assumptions:
- Linearity: Relationship between X and y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance in errors
- Normality: Errors are normally distributed
- No multicollinearity: Independent variables aren’t highly correlated
Advantages
- Simple and interpretable
- Computationally efficient
- Provides feature importance through coefficients
- Works well with large datasets
- Easily extendable (polynomial, regularized versions)
Limitations
- Only captures linear relationships
- Sensitive to outliers
- Assumes independence among features
- Performs poorly with non-linear data
- Gradient-descent and regularized variants need feature scaling for good performance
Use Cases
- Housing price prediction
- Sales forecasting
- Risk assessment
- Resource allocation
- Trend analysis and time-series forecasting
Python Implementation
# Using scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Access model parameters
coefficients = model.coef_
intercept = model.intercept_
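As noted under advantages, the model extends easily to polynomial and regularized forms; a brief sketch using the same train/test split (the degree and alpha values here are illustrative):
# Polynomial features + L2-regularized (Ridge) regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
poly_model.fit(X_train, y_train)
print(poly_model.score(X_test, y_test))  # R² on held-out data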
Logistic Regression
Conceptual Overview
Despite its name, Logistic Regression is a classification algorithm that estimates the probability of an instance belonging to a particular class. It applies a logistic function to a linear combination of features.
Mathematical Formula
Logistic Function (Sigmoid): P(y=1|X) = σ(z) = 1 / (1 + e^(-z))
Where:
- z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
- P(y=1|X) is the probability that y=1 given X
- σ is the sigmoid function
Decision Boundary
- If P(y=1|X) ≥ 0.5, predict class 1
- If P(y=1|X) < 0.5, predict class 0
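A tiny NumPy sketch of how the sigmoid and the threshold combine (z is assumed to be the linear combination β₀ + β₁x₁ + …; the 0.5 cutoff can be shifted for imbalanced problems):
# Sigmoid plus decision threshold (illustrative)
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # maps any real z to (0, 1)
prob = sigmoid(z)                        # P(y=1 | X)
prediction = (prob >= 0.5).astype(int)   # class 1 when the probability crosses the threshold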
Cost Function
Binary Cross-Entropy: J(β) = -(1/n) Σ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]
Optimization Method
- Gradient Descent (or second-order methods such as Newton’s method and L-BFGS)
Key Characteristics
- Model Type: Parametric, supervised learning
- Output: Probability (0 to 1)
- Classification Type:
- Binary (standard)
- Multiclass (using One-vs-Rest or multinomial approaches)
Advantages
- Provides probability scores, not just classifications
- Resistant to overfitting (especially with regularization)
- No assumptions about feature distributions
- Efficient training with convex optimization
- Easily interpretable via odds ratios
Limitations
- Assumes linear decision boundary
- Struggles with highly correlated features
- May underperform with small datasets
- Sensitive to imbalanced data
- Limited expressiveness for complex relationships
Use Cases
- Spam detection
- Credit approval
- Disease diagnosis
- Customer churn prediction
- Sentiment analysis
Python Implementation
# Using scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model (with regularization)
model = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test) # Probabilities for each class
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
# Access model parameters
coefficients = model.coef_
intercept = model.intercept_
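Since interpretation via odds ratios is listed among the advantages, a short follow-up sketch using the coefficients just extracted: exponentiating a coefficient gives the multiplicative change in the odds of class 1 per unit increase in that feature.
# Odds ratios from logistic regression coefficients
import numpy as np
odds_ratios = np.exp(model.coef_[0])  # one odds ratio per feature
# e.g., an odds ratio of 1.5 means the odds of class 1 grow by 50% per unit increase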
Support Vector Machines (SVM)
Conceptual Overview
SVM finds the optimal hyperplane that maximizes the margin between classes in feature space. It can handle linear and non-linear classification through kernel methods.
Mathematical Formula
Linear SVM Objective: Minimize (1/2)||w||² + C Σ ξᵢ
Subject to: yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ and ξᵢ ≥ 0
Where:
- w is the normal vector to the hyperplane
- b is the bias term
- ξᵢ are slack variables for soft margin
- C is the regularization parameter
Key Concepts
- Support Vectors: Data points closest to the decision boundary
- Margin: Distance between support vectors and hyperplane
- Kernel Trick: Implicit mapping to higher-dimensional space
- Common Kernels:
- Linear: K(x, y) = x·y
- Polynomial: K(x, y) = (γx·y + r)^d
- RBF/Gaussian: K(x, y) = exp(-γ||x-y||²)
- Sigmoid: K(x, y) = tanh(γx·y + r)
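To make the kernel idea concrete, here is a minimal NumPy sketch of the RBF kernel for two feature vectors (the names x_a, x_b and the gamma value are illustrative); SVM implementations evaluate this similarity implicitly rather than mapping points into the higher-dimensional space:
# RBF (Gaussian) kernel between two vectors
import numpy as np
def rbf_kernel(x_a, x_b, gamma=0.5):
    return np.exp(-gamma * np.sum((x_a - x_b) ** 2))  # similarity decays with squared distance
print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.5])))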
Key Characteristics
- Model Type: Non-parametric, supervised learning
- Output: Class label (or distance to hyperplane)
- Variants:
- SVC: Classification
- SVR: Regression
- One-class SVM: Anomaly detection
Advantages
- Effective in high-dimensional spaces
- Robust to overfitting, especially in high-dimensional data
- Versatile with different kernel functions
- Memory efficient (only stores support vectors)
- Handles non-linear decision boundaries well
Limitations
- Computationally intensive for large datasets
- Sensitive to choice of kernel and hyperparameters
- Probability estimates require additional calibration
- Not naturally suited for multiclass problems
- Less effective when classes overlap significantly
Use Cases
- Text categorization
- Image classification
- Bioinformatics sequence analysis
- Handwriting recognition
- Face detection
Python Implementation
# Using scikit-learn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Scale features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Create and train model
model = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test) # Only if probability=True
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
# Access support vectors
support_vectors = model.support_vectors_
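Because SVM performance hinges on the kernel, C, and gamma (see the practical tips later), a brief hyperparameter-search sketch over the same training split; the grid values are illustrative:
# Grid search for C and gamma with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)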
Decision Trees
Conceptual Overview
Decision Trees create a model that predicts the target by learning simple decision rules from features. The tree structure consists of nodes (decision points), branches (decisions), and leaves (outcomes).
Key Concepts
- Node Splitting Criteria:
- Gini Impurity: 1 - Σ(pᵢ)²
- Entropy: -Σ pᵢ log₂(pᵢ)
- Information Gain: Entropy(parent) - weighted sum of Entropy(children)
- MSE (for regression): (1/n) Σ(yᵢ - ŷᵢ)²
- Pruning: Removing branches to reduce complexity and overfitting
- Tree Depth: Maximum number of levels in the tree
- Leaf Node: Terminal node that provides prediction
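A small NumPy sketch of the Gini and entropy criteria above, computed from the class proportions at a node (the proportions in p are illustrative):
# Impurity measures from class proportions at a node
import numpy as np
def gini(p):
    return 1.0 - np.sum(p ** 2)    # 0 for a pure node, larger when classes are mixed
def entropy(p):
    p = p[p > 0]                   # avoid log2(0)
    return -np.sum(p * np.log2(p))
p = np.array([0.8, 0.2])           # e.g., an 80/20 class split
print(gini(p), entropy(p))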
Algorithm Flow
- Start with entire dataset at root node
- Find the best feature and threshold to split on (maximizing information gain)
- Create child nodes based on the split
- Recursively repeat for each child node until stopping criteria met
- Assign class or value to leaf nodes
Key Characteristics
- Model Type: Non-parametric, supervised learning
- Output:
- Classification: Class label or probability
- Regression: Continuous value
- Interpretability: Highly interpretable, visualizable structure
Advantages
- Intuitive and easy to interpret
- Handles mixed data types (numerical and categorical)
- Requires minimal data preprocessing
- Automatically handles feature interactions
- Relatively robust to outliers and, with proper handling, to missing values
Limitations
- Prone to overfitting (especially deep trees)
- Can create biased trees if classes are imbalanced
- Unstable (small variations in data can create different trees)
- May create overly complex trees
- Suboptimal for linear relationships
Use Cases
- Customer segmentation
- Credit risk assessment
- Medical diagnosis
- Fault diagnosis
- Decision support systems
Python Implementation
# Using scikit-learn
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_split=2)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
# Visualize tree
plt.figure(figsize=(20,10))
plot_tree(model, filled=True, feature_names=feature_names, class_names=class_names)
plt.show()
# Feature importance
importances = model.feature_importances_
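For the post-pruning mentioned above, scikit-learn exposes minimal cost-complexity pruning via ccp_alpha; a sketch that picks a pruning strength by cross-validation (candidate alphas come from the tree's own pruning path):
# Cost-complexity (post-)pruning: choose ccp_alpha by cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np
path = model.cost_complexity_pruning_path(X_train, y_train)
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a), X_train, y_train, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X_train, y_train)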
K-Nearest Neighbors (KNN)
Conceptual Overview
KNN is a simple, instance-based learning algorithm that classifies new data points based on the majority class (for classification) or average value (for regression) of its k-nearest neighbors in the feature space.
Key Concepts
- Distance Metrics:
- Euclidean: √(Σ(xᵢ - yᵢ)²)
- Manhattan: Σ|xᵢ - yᵢ|
- Minkowski: (Σ|xᵢ - yᵢ|ᵖ)^(1/p)
- Hamming: Count of different attributes (for categorical)
- k Value: Number of neighbors to consider
- Weighting: Optional weighting of neighbors by distance
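A quick NumPy sketch of the first three distance metrics for two feature vectors (the vectors a and b are illustrative):
# Distance metrics between two points
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])
euclidean = np.sqrt(np.sum((a - b) ** 2))        # straight-line distance
manhattan = np.sum(np.abs(a - b))                # sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1/3)  # Minkowski with p=3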
Algorithm Flow
- Choose k (number of neighbors)
- Calculate distance from new point to all training points
- Select k points with smallest distances
- For classification: Take majority vote of neighbors
- For regression: Take average (weighted or simple) of neighbors
Key Characteristics
- Model Type: Non-parametric, instance-based, supervised learning
- Output:
- Classification: Class label or probability
- Regression: Continuous value
- Training: Lazy learning (stores training data, computation at prediction time)
Advantages
- Simple and intuitive
- No training phase (just stores data)
- Naturally handles multi-class problems
- Can learn complex decision boundaries
- No assumptions about data distribution
Limitations
- Computationally expensive for large datasets
- Sensitive to irrelevant features
- Requires feature scaling
- Memory-intensive (stores entire dataset)
- Sensitive to imbalanced data
- Curse of dimensionality
Use Cases
- Recommendation systems
- Credit rating
- Pattern recognition
- Image classification
- Medical diagnosis
Python Implementation
# Using scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Scale features (important for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Create and train model
model = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
# Find optimal k
accuracy_list = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_list.append(accuracy_score(y_test, knn.predict(X_test)))
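Note that the loop above scores each k on the test set; a more cautious variant (a sketch, not the only approach) selects k by cross-validation on the training data and keeps the test set for the final estimate:
# Choose k by cross-validation instead of the test set
from sklearn.model_selection import cross_val_score
import numpy as np
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
             for k in range(1, 31)]
best_k = int(np.argmax(cv_scores)) + 1  # range starts at k=1
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)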
Algorithm Comparison Tables
Overall Comparison
Algorithm | Model Type | Interpretability | Training Speed | Prediction Speed | Memory Usage | Handles High Dimensions | Handles Outliers | Feature Scaling Needed |
---|---|---|---|---|---|---|---|---|
Linear Regression | Parametric | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★☆☆☆☆ | ★★★☆☆ |
Logistic Regression | Parametric | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★☆☆ | ★★★★☆ |
SVM | Non-parametric | ★★☆☆☆ | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★★ |
Decision Trees | Non-parametric | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★☆☆☆☆ |
KNN | Non-parametric | ★★★★☆ | ★★★★★ | ★☆☆☆☆ | ★☆☆☆☆ | ★☆☆☆☆ | ★☆☆☆☆ | ★★★★★ |
Use Case Fit
Algorithm | Linear Relationships | Non-linear Relationships | Binary Classification | Multi-class Classification | Regression | Large Datasets | Imbalanced Data | Sparse Data |
---|---|---|---|---|---|---|---|---|
Linear Regression | ★★★★★ | ★☆☆☆☆ | N/A | N/A | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ |
Logistic Regression | ★★★★☆ | ★★☆☆☆ | ★★★★☆ | ★★★☆☆ | N/A | ★★★★☆ | ★★☆☆☆ | ★★★★☆ |
SVM | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★★☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ | ★★★★★ |
Decision Trees | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★☆ | ★★★☆☆ |
KNN | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★☆☆☆☆ | ★★☆☆☆ | ★☆☆☆☆ |
Hyperparameter Overview
Algorithm | Key Hyperparameters | Common Ranges | Impact on Model |
---|---|---|---|
Linear Regression | Regularization (L1, L2)<br>Learning rate (for GD) | alpha: 0.001 – 10<br>learning_rate: 0.001 – 0.1 | Controls overfitting<br>Affects convergence speed |
Logistic Regression | Regularization strength (C)<br>Penalty type<br>Solver | C: 0.001 – 100<br>penalty: l1, l2, elasticnet<br>solver: liblinear, saga, lbfgs | Inversely controls regularization<br>Determines regularization type<br>Optimization method |
SVM | Kernel type<br>C<br>Gamma<br>Degree | kernel: linear, poly, rbf, sigmoid<br>C: 0.1 – 100<br>gamma: 0.001 – 10<br>degree: 2 – 5 | Transforms feature space<br>Controls margin vs. errors<br>Kernel coefficient<br>For polynomial kernel |
Decision Trees | Max depth<br>Min samples split<br>Min samples leaf<br>Criterion | max_depth: 3 – 20<br>min_samples_split: 2 – 20<br>min_samples_leaf: 1 – 10<br>criterion: gini, entropy | Controls tree complexity<br>Min samples to split node<br>Min samples in leaf node<br>Node splitting metric |
KNN | Number of neighbors (k)<br>Weight function<br>Distance metric | k: 1 – 20<br>weights: uniform, distance<br>metric: euclidean, manhattan, minkowski | Controls model complexity<br>How neighbors are weighted<br>How distance is measured |
Common Challenges and Solutions
Challenge | Affected Algorithms | Solution |
---|---|---|
Overfitting | Decision Trees (most)<br>SVM<br>KNN | Pruning/max depth limits<br>Increase regularization<br>Increase k value |
Underfitting | Linear/Logistic Regression<br>Decision Trees<br>SVM | Add polynomial features<br>Increase max_depth / decrease min_samples_split<br>Use a more flexible kernel or increase C |
Class imbalance | Logistic Regression<br>SVM<br>KNN | Class weighting<br>SMOTE/data resampling<br>Adjust decision threshold |
High dimensionality | KNN<br>Linear/Logistic Regression | Feature selection<br>PCA/dimensionality reduction<br>Regularization |
Feature scaling | SVM<br>KNN<br>L2-regularized models | StandardScaler<br>MinMaxScaler<br>RobustScaler for outliers |
Missing values | All | Imputation methods<br>Use algorithms that handle missing values (Trees)<br>Remove rows/features |
Categorical data | All | One-hot encoding<br>Label encoding<br>Target encoding |
Computational efficiency | KNN<br>SVM<br>Decision Trees | Approximate KNN<br>Linear kernel for SVM<br>Limit tree depth |
Best Practices for Algorithm Selection & Implementation
Algorithm Selection Guidelines
Consider your data size:
- Small data: KNN, SVM
- Medium data: Decision Trees, Linear/Logistic Regression
- Large data: Linear/Logistic Regression, optimized implementations of others
Consider interpretability needs:
- High: Linear/Logistic Regression, Decision Trees
- Medium: KNN
- Low: SVM with non-linear kernels
Consider data characteristics:
- Linear relationships: Linear/Logistic Regression
- Non-linear relationships: SVM, Decision Trees, KNN
- High-dimensional: SVM, Linear/Logistic with regularization
- Noisy data: Ensemble methods, SVM, Regularized models
Consider computational constraints:
- Training speed: Linear/Logistic Regression, KNN
- Prediction speed: Linear/Logistic Regression, Decision Trees
- Memory usage: Linear/Logistic Regression, SVM
Implementation Best Practices
Preprocessing:
- Scale features for distance-based algorithms (KNN, SVM)
- Encode categorical features appropriately
- Handle missing values before modeling
- Consider dimensionality reduction for high-dimensional data
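One way to keep scaling and the model together (and avoid leaking test-set statistics) is a scikit-learn Pipeline; a minimal sketch assuming the usual X_train/X_test split:
# Bundle preprocessing and model in one estimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
pipe = Pipeline([('scaler', StandardScaler()),              # scaler is fit on training data only
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)                               # test data is scaled with training statistics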
Model Development:
- Split data into training/validation/test sets (70/15/15 or 80/10/10)
- Use cross-validation for smaller datasets
- Tune hyperparameters systematically (Grid/Random search)
- Try simple models before complex ones
- Consider ensemble methods for improved performance
Evaluation:
- Choose appropriate metrics for your problem
- Use baseline models as comparison points
- Evaluate on multiple metrics, not just accuracy
- Consider business impact, not just statistical measures
- Test model on fresh data when possible
Deployment:
- Save preprocessing steps alongside the model
- Monitor model performance over time
- Plan for model updating and retraining
- Consider explainability requirements
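A brief persistence sketch with joblib, assuming a fitted pipeline or model object named pipe (the filename is illustrative); saving the whole pipeline keeps the preprocessing steps with the model:
# Save and reload the fitted pipeline
import joblib
joblib.dump(pipe, 'model_pipeline.joblib')
loaded = joblib.load('model_pipeline.joblib')
loaded.predict(X_test)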
Practical Tips for Each Algorithm
Linear Regression
- Check for assumptions (linearity, independence, homoscedasticity)
- Address multicollinearity with VIF (Variance Inflation Factor)
- Try polynomial features for non-linear relationships
- Consider regularization (Ridge, Lasso, ElasticNet) for many features
- Plot residuals to identify patterns and outliers
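A short sketch of the VIF check mentioned above, using statsmodels (X_df is assumed to be a pandas DataFrame of features):
# Variance Inflation Factor per feature
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_const = sm.add_constant(X_df)   # VIF is usually computed with an intercept term included
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != 'const'}
# Rule of thumb: VIF above ~5-10 suggests problematic multicollinearity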
Logistic Regression
- Balance classes for better performance
- Feature scaling improves convergence
- Use L1 regularization for feature selection
- Check the calibration of probability estimates
- Consider the appropriate threshold (not always 0.5)
SVM
- Kernel selection is crucial (start with linear for high dimensions)
- Scale your features for optimal performance
- Use GridSearchCV to find optimal C and gamma
- Consider SVC(probability=True) if you need probabilities
- Use LinearSVC for large datasets with linear kernels
Decision Trees
- Visualize the tree to understand decisions
- Control depth to prevent overfitting
- Consider feature importance for insights
- Apply pre-pruning (constraints while growing, e.g., max_depth) and post-pruning (e.g., cost-complexity pruning)
- Usually better as ensemble methods (Random Forest, Gradient Boosting)
KNN
- Choose k using cross-validation (often sqrt(n) is a starting point)
- Consider weighted voting for better decision boundaries
- Dimensionality reduction is critical for high-dimensional data
- Use ball tree or KD tree for large datasets
- Feature scaling is essential for performance
Resources for Further Learning
Books
- “Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani
- “Pattern Recognition and Machine Learning” by Christopher Bishop
- “Machine Learning: A Probabilistic Perspective” by Kevin Murphy
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron
- “Python Machine Learning” by Sebastian Raschka and Vahid Mirjalili
Online Courses
- Andrew Ng’s Machine Learning (Coursera)
- Machine Learning (Stanford Online)
- Applied Machine Learning in Python (Coursera)
- Machine Learning A-Z (Udemy)
- Machine Learning Crash Course (Google)
Interactive Learning
- Kaggle Learn
- DataCamp
- Machine Learning Mastery
- Towards Data Science tutorials
- Scikit-learn documentation tutorials
Research Papers
- “A Few Useful Things to Know about Machine Learning” by Pedro Domingos
- “Random Forests” by Leo Breiman
- “Support-Vector Networks” by Cortes and Vapnik
- “Least Angle Regression” by Efron et al.
- “Greedy Function Approximation: A Gradient Boosting Machine” by Friedman
GitHub Repositories
- Scikit-learn: https://github.com/scikit-learn/scikit-learn
- Awesome Machine Learning: https://github.com/josephmisiti/awesome-machine-learning
- Homemade Machine Learning: https://github.com/trekhleb/homemade-machine-learning
- Data Science IPython Notebooks: https://github.com/donnemartin/data-science-ipython-notebooks
- Machine Learning From Scratch: https://github.com/eriklindernoren/ML-From-Scratch
Remember that algorithms are tools in your toolbox—understanding their strengths, weaknesses, and appropriate use cases is more important than memorizing their details. Start simple, validate thoroughly, and iterate based on performance and insights.