Introduction to AI Algorithms
Artificial Intelligence (AI) algorithms are computational procedures that enable machines to perform tasks that typically require human intelligence. These algorithms form the foundation of modern AI systems, from simple rule-based decision-making to complex pattern recognition and prediction. AI algorithms leverage data, statistical models, and optimization techniques to learn from experience, identify patterns, make decisions, and continuously improve performance. Understanding these algorithms is essential for developing effective AI solutions and navigating the rapidly evolving field of artificial intelligence.
Core Concepts and Terminology
Term | Definition | Example |
---|---|---|
Supervised Learning | Learning from labeled training data to predict outputs for unseen inputs | Email spam classification using labeled emails |
Unsupervised Learning | Finding patterns in unlabeled data | Customer segmentation based on purchasing behavior |
Reinforcement Learning | Learning optimal actions through trial and error with rewards/penalties | Game AI learning to win by maximizing score |
Feature | Input variable used for prediction | Age, income, or transaction history in a loan approval model |
Label | Target output value for supervised learning | “Spam” or “Not Spam” for email classification |
Training Set | Data used to train the model | 80% of available labeled data |
Validation Set | Data used to tune hyperparameters | 10% of available labeled data |
Test Set | Data used to evaluate final model performance | 10% of available labeled data |
Overfitting | Model learns training data too well, including noise, performing poorly on new data | A model that achieves 99% accuracy on training but only 70% on test data |
Underfitting | Model is too simple to capture underlying patterns | A linear model trying to fit highly non-linear data |
Hyperparameters | Model configuration settings set before training | Learning rate, number of hidden layers, regularization strength |
Supervised Learning Algorithms
Linear and Logistic Regression
Linear Regression
- Purpose: Predict continuous values
- Equation: $y = w_0 + w_1x_1 + w_2x_2 + … + w_nx_n$
- Loss Function: Mean Squared Error (MSE)
- Optimization: Ordinary Least Squares or Gradient Descent
- Best For: Simple predictive relationships, baseline models, interpretable results
- Limitations: Cannot capture non-linear relationships, sensitive to outliers
- Complexity: Time: O(n³) for exact solution, O(n²d) for iterative; Space: O(nd)
Logistic Regression
- Purpose: Binary classification
- Equation: $P(y=1) = \frac{1}{1 + e^{-(w_0 + w_1x_1 + … + w_nx_n)}}$
- Loss Function: Binary Cross-Entropy (Log Loss)
- Optimization: Maximum Likelihood Estimation via Gradient Descent
- Best For: Probability estimation, interpretable models, binary outcomes
- Limitations: Limited to linear decision boundaries, struggles with imbalanced data
- Complexity: Time: O(nd) per iteration; Space: O(nd)
Decision Trees and Ensemble Methods
Decision Trees
- Purpose: Classification and regression
- Construction: Recursive binary splitting based on feature values
- Split Criteria: Gini impurity/entropy (classification), MSE (regression)
- Best For: Interpretable models, handling mixed feature types, non-linear patterns
- Limitations: Prone to overfitting, unstable (small changes in data can significantly change the tree)
- Complexity: Time: O(nd log d) for training; Space: O(d) where d is tree depth
Random Forests
- Purpose: Classification and regression through ensemble of trees
- Construction: Build multiple decision trees on random subsets of data and features
- Prediction: Average (regression) or majority vote (classification) from all trees
- Best For: High accuracy, handling large datasets, feature importance ranking
- Limitations: Less interpretable than single trees, more computationally expensive
- Complexity: Time: O(Knd log d) where K is number of trees; Space: O(Kd)
Gradient Boosting Machines (GBM)
- Purpose: Classification and regression through sequential tree building
- Construction: Build trees sequentially, each correcting errors of the previous
- Variants: XGBoost, LightGBM, CatBoost
- Best For: Winning competitions, handling mixed data types, high predictive accuracy
- Limitations: More prone to overfitting, requires careful tuning, slower training
- Complexity: Time: O(Knd log d); Space: O(Kd)
Support Vector Machines (SVM)
Linear SVM
- Purpose: Find optimal hyperplane for classification
- Objective: Maximize margin between classes
- Formulation: $\min \frac{1}{2}||w||^2$ subject to $y_i(w \cdot x_i + b) \geq 1$
- Best For: High-dimensional data, clear margin of separation
- Limitations: Scaling to large datasets, sensitivity to parameter tuning
- Complexity: Time: O(n²d) to O(n³d); Space: O(n²)
Kernel SVM
- Purpose: Handle non-linear classification boundaries
- Common Kernels: Polynomial, Radial Basis Function (RBF), Sigmoid
- RBF Kernel: $K(x_i, x_j) = \exp(-\gamma ||x_i – x_j||^2)$
- Best For: Complex non-linear boundaries, medium-sized datasets
- Limitations: Computational complexity, memory requirements, black-box predictions
- Complexity: Time: O(n²d) to O(n³d); Space: O(n²)
Neural Networks
Multilayer Perceptron (MLP)
- Purpose: General-purpose supervised learning
- Architecture: Input layer, 1+ hidden layers, output layer
- Activation Functions: ReLU, Sigmoid, Tanh
- Loss Functions: MSE (regression), Cross-entropy (classification)
- Optimization: Backpropagation with variants of Gradient Descent
- Best For: Complex patterns, feature learning, large datasets
- Limitations: Computationally intensive, requires significant data, black-box nature
- Complexity: Time: O(nihl) where n=samples, i=input size, h=hidden units, l=layers; Space: O(ihl)
Convolutional Neural Networks (CNN)
- Purpose: Image processing, visual recognition
- Key Layers: Convolutional, Pooling, Fully Connected
- Operations: Convolution, Max/Avg Pooling, Flattening
- Popular Architectures: LeNet, AlexNet, VGG, ResNet, Inception
- Best For: Image classification, object detection, computer vision tasks
- Limitations: Computational resources, large datasets required, limited interpretability
- Complexity: Varies by architecture, typically O(n²) for standard convolutions
Recurrent Neural Networks (RNN)
- Purpose: Sequential data processing
- Key Feature: Memory of previous inputs through recurrent connections
- Variants: LSTM, GRU, Bidirectional RNN
- Best For: Time series, natural language, speech recognition
- Limitations: Vanishing/exploding gradients, difficulty with long-term dependencies
- Complexity: Time: O(nlt²) where n=samples, l=sequence length, t=hidden state size; Space: O(t²)
Long Short-Term Memory (LSTM)
- Purpose: Improved processing of long sequences
- Components: Input/output/forget gates, cell state
- Advantage: Better at capturing long-term dependencies than basic RNNs
- Best For: Machine translation, speech recognition, complex sequence tasks
- Limitations: Computational complexity, still struggles with very long sequences
- Complexity: Time: O(nlt²); Space: O(t²)
Unsupervised Learning Algorithms
Clustering Algorithms
K-Means
- Purpose: Partition data into K distinct clusters
- Algorithm:
- Initialize K centroids randomly
- Assign points to nearest centroid
- Update centroids as mean of assigned points
- Repeat until convergence
- Best For: Spherical clusters, large datasets, initial exploration
- Limitations: Requires pre-specified K, sensitive to initialization, struggles with non-spherical clusters
- Complexity: Time: O(nKdi) where i=iterations; Space: O(n+K)
Hierarchical Clustering
- Types: Agglomerative (bottom-up), Divisive (top-down)
- Agglomerative Process:
- Start with each point as individual cluster
- Merge closest clusters iteratively
- Continue until single cluster remains
- Linkage Methods: Single, Complete, Average, Ward
- Best For: Hierarchical structure exploration, creating dendrograms, unknown cluster numbers
- Limitations: Computationally expensive, cannot scale to very large datasets
- Complexity: Time: O(n³) naive implementation, O(n²log n) optimized; Space: O(n²)
DBSCAN
- Purpose: Density-based clustering, identifying clusters of arbitrary shape
- Parameters: ε (neighborhood distance), MinPts (minimum points for core point)
- Point Types: Core, Border, Noise
- Best For: Arbitrary shaped clusters, automatic noise detection, unknown cluster numbers
- Limitations: Struggles with varying density clusters, sensitive to parameters
- Complexity: Time: O(n²) naive, O(n log n) with spatial indexing; Space: O(n)
Dimensionality Reduction
Principal Component Analysis (PCA)
- Purpose: Linear dimensionality reduction
- Process:
- Standardize data
- Calculate covariance matrix
- Compute eigenvectors and eigenvalues
- Select top K eigenvectors (principal components)
- Project data onto these components
- Best For: Visualization, preprocessing, noise reduction
- Limitations: Only captures linear relationships, sensitive to outliers
- Complexity: Time: O(nd² + d³); Space: O(nd)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Purpose: Non-linear dimensionality reduction for visualization
- Process: Maps high-dimensional similarities to low-dimensional distances
- Best For: Visualizing high-dimensional data, preserving local structure
- Limitations: Computationally expensive, non-deterministic, focus on local structure
- Complexity: Time: O(n² log n); Space: O(n²)
Uniform Manifold Approximation and Projection (UMAP)
- Purpose: General dimensionality reduction with preserved global structure
- Process: Constructs fuzzy topological representation and optimizes low-dimensional layout
- Best For: Faster alternative to t-SNE, preserving both local and global structure
- Limitations: Complex parameter tuning, stochastic results
- Complexity: Time: O(n log n); Space: O(n)
Generative Models
Autoencoders
- Purpose: Self-supervised learning for dimensionality reduction and generation
- Architecture: Encoder (compresses) + Decoder (reconstructs)
- Variants: Vanilla, Sparse, Denoising, Variational (VAE)
- Best For: Feature learning, anomaly detection, data generation
- Limitations: Training complexity, reconstructions often blurry
- Complexity: Similar to neural networks, depends on architecture
Generative Adversarial Networks (GANs)
- Purpose: Generate realistic new data samples
- Components: Generator (creates samples) + Discriminator (classifies real/fake)
- Training: Zero-sum game between generator and discriminator
- Variants: DCGAN, StyleGAN, CycleGAN, BigGAN
- Best For: Image generation, style transfer, data augmentation
- Limitations: Training instability, mode collapse, evaluation difficulty
- Complexity: High, varies by architecture
Reinforcement Learning Algorithms
Value-Based Methods
Q-Learning
- Purpose: Learn optimal action-value function Q(s,a)
- Update Rule: $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a’} Q(s’,a’) – Q(s,a)]$
- Exploration Strategy: ε-greedy (random action with probability ε)
- Best For: Discrete action spaces, tabular representation
- Limitations: Struggles with large state spaces, requires discretization
- Complexity: Time: O(|S|²|A|) for tabular form; Space: O(|S||A|)
Deep Q-Network (DQN)
- Purpose: Q-learning with neural network function approximation
- Key Innovations:
- Experience replay (store and randomly sample transitions)
- Target network (separate network for stable Q-targets)
- Best For: High-dimensional state spaces like images
- Limitations: Discrete action spaces, sample inefficiency
- Complexity: Varies based on network architecture
Policy-Based Methods
Policy Gradient Methods
- Purpose: Directly optimize policy parameters
- Key Algorithm: REINFORCE
- Update Direction: Gradient of expected return
- Formula: $\nabla_\theta J(\theta) = E[\nabla_\theta \log \pi_\theta(a|s) \cdot G_t]$
- Best For: Continuous action spaces, stochastic policies
- Limitations: High variance, sample inefficiency
- Complexity: Depends on policy representation
Actor-Critic Methods
- Purpose: Combine policy-based and value-based approaches
- Components: Actor (policy network) + Critic (value network)
- Examples: A2C, A3C, PPO, SAC
- Best For: Reducing variance in policy gradients, stable learning
- Limitations: Complexity, hyperparameter sensitivity
- Complexity: Higher than pure policy or value methods
Advanced RL Methods
Proximal Policy Optimization (PPO)
- Purpose: Stable policy optimization with constraint updates
- Key Innovation: Clipped surrogate objective to limit policy changes
- Formula: $L^{CLIP}(\theta) = E[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]$
- Best For: Sample-efficient learning, avoiding destructive policy updates
- Limitations: Implementation complexity, hyperparameter tuning
- Complexity: Similar to actor-critic methods
Soft Actor-Critic (SAC)
- Purpose: Off-policy actor-critic with entropy maximization
- Key Innovation: Trading off reward maximization with policy entropy
- Best For: Continuous action spaces, exploration-exploitation balance
- Limitations: Complex implementation, sensitive to temperature parameter
- Complexity: High, multiple networks involved
Modern Transformer Architectures
Transformer Base Architecture
- Purpose: Process sequential data through self-attention
- Key Components: Multi-head attention, feed-forward networks, positional encoding
- Advantages: Parallelization, handling long-range dependencies
- Applications: NLP, computer vision, sequential data processing
- Complexity: Time: O(n²d) where n=sequence length, d=dimension; Space: O(n²)
BERT (Bidirectional Encoder Representations from Transformers)
- Purpose: Pretrained language understanding model
- Architecture: Transformer encoder stack (bidirectional context)
- Training Objectives: Masked Language Modeling, Next Sentence Prediction
- Size Variants: BERT-base (110M params), BERT-large (340M params)
- Best For: Text classification, question answering, named entity recognition
- Limitations: Limited context length, computationally intensive fine-tuning
- Complexity: High, billions of operations per forward pass
GPT (Generative Pre-trained Transformer)
- Purpose: Autoregressive language generation
- Architecture: Decoder-only transformer stack
- Evolution: GPT-1 → GPT-2 → GPT-3 → GPT-4
- Size Growth: GPT-3 (175B params), GPT-4 (estimated trillions)
- Best For: Text generation, completion, few-shot learning
- Limitations: Context window constraints, hallucinations, bias
- Complexity: Extremely high, requiring specialized hardware for inference
T5 (Text-to-Text Transfer Transformer)
- Purpose: Unified approach to NLP tasks as text-to-text
- Architecture: Encoder-decoder transformer
- Training: Masked span prediction, multi-task learning
- Best For: Translation, summarization, question answering
- Limitations: Training complexity, resource requirements
- Complexity: High, similar to other large transformer models
Model Evaluation Metrics
Classification Metrics
Accuracy
- Formula: $\frac{TP + TN}{TP + TN + FP + FN}$
- Range: 0 to 1 (higher is better)
- Best For: Balanced datasets
- Limitations: Misleading for imbalanced classes
Precision
- Formula: $\frac{TP}{TP + FP}$
- Interpretation: Of all predicted positives, how many are actually positive
- Best For: When false positives are costly
- Limitations: Doesn’t account for false negatives
Recall (Sensitivity)
- Formula: $\frac{TP}{TP + FN}$
- Interpretation: Of all actual positives, how many were identified
- Best For: When false negatives are costly
- Limitations: Doesn’t account for false positives
F1 Score
- Formula: $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
- Interpretation: Harmonic mean of precision and recall
- Best For: Imbalanced datasets, trading off precision and recall
- Limitations: Equal weight to precision and recall
ROC-AUC
- Definition: Area Under the Receiver Operating Characteristic curve
- Range: 0.5 (random) to 1 (perfect)
- Best For: Evaluating ranking quality, threshold-invariant assessment
- Limitations: Less informative for highly imbalanced datasets
Regression Metrics
Mean Squared Error (MSE)
- Formula: $\frac{1}{n}\sum_{i=1}^{n}(y_i – \hat{y}_i)^2$
- Interpretation: Average squared difference between predicted and actual
- Best For: General-purpose regression evaluation
- Limitations: Sensitive to outliers, not in original unit
Root Mean Squared Error (RMSE)
- Formula: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i – \hat{y}_i)^2}$
- Interpretation: Square root of MSE, in original unit
- Best For: Error magnitude in original units
- Limitations: Still sensitive to outliers
Mean Absolute Error (MAE)
- Formula: $\frac{1}{n}\sum_{i=1}^{n}|y_i – \hat{y}_i|$
- Interpretation: Average absolute difference
- Best For: Robust to outliers, intuitive interpretation
- Limitations: Less sensitive to learning improvements than MSE
R² (Coefficient of Determination)
- Formula: $1 – \frac{\sum(y_i – \hat{y}_i)^2}{\sum(y_i – \bar{y})^2}$
- Range: (-∞, 1] (1 is perfect fit)
- Interpretation: Proportion of variance explained by model
- Best For: Model comparison, goodness of fit
- Limitations: Can be misleading for non-linear relationships
Clustering Metrics
Silhouette Coefficient
- Range: -1 to 1 (higher is better)
- Interpretation: Measure of how similar points are to their cluster vs. others
- Best For: Evaluating cluster separation and cohesion
- Limitations: Computationally expensive for large datasets
Davies-Bouldin Index
- Range: 0 to ∞ (lower is better)
- Interpretation: Average similarity of each cluster with its most similar cluster
- Best For: Dense, well-separated clusters
- Limitations: Sensitive to noisy clusters
Optimization Algorithms
First-Order Optimization
Gradient Descent
- Update Rule: $\theta_{t+1} = \theta_t – \eta \nabla_\theta J(\theta_t)$
- Variants: Batch, Mini-batch, Stochastic
- Best For: Convex optimization, baseline approach
- Limitations: Slow convergence, stuck in local minima/saddle points
- Learning Rate: Typically 0.01-0.1 for normalized data
Momentum
- Update Rule:
- $v_{t+1} = \gamma v_t + \eta \nabla_\theta J(\theta_t)$
- $\theta_{t+1} = \theta_t – v_{t+1}$
- Hyperparameters: Learning rate η, momentum factor γ (typically 0.9)
- Best For: Accelerating convergence, overcoming small local minima
- Limitations: Additional hyperparameter, potential overshooting
- Update Rule:
Nesterov Accelerated Gradient (NAG)
- Update Rule:
- $v_{t+1} = \gamma v_t + \eta \nabla_\theta J(\theta_t – \gamma v_t)$
- $\theta_{t+1} = \theta_t – v_{t+1}$
- Advantage: Looks ahead to where parameters will be
- Best For: Faster convergence than standard momentum
- Limitations: More complex implementation
- Update Rule:
Adaptive Learning Rate Methods
AdaGrad
- Update Rule:
- $G_{t+1} = G_t + (\nabla_\theta J(\theta_t))^2$
- $\theta_{t+1} = \theta_t – \frac{\eta}{\sqrt{G_{t+1} + \epsilon}} \nabla_\theta J(\theta_t)$
- Best For: Sparse data, different learning rates per parameter
- Limitations: Learning rate diminishes too quickly for deep learning
- Update Rule:
RMSProp
- Update Rule:
- $G_{t+1} = \gamma G_t + (1-\gamma)(\nabla_\theta J(\theta_t))^2$
- $\theta_{t+1} = \theta_t – \frac{\eta}{\sqrt{G_{t+1} + \epsilon}} \nabla_\theta J(\theta_t)$
- Best For: Non-stationary objectives, deep neural networks
- Limitations: Sensitive to initial learning rate
- Update Rule:
Adam (Adaptive Moment Estimation)
- Update Rule:
- $m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla_\theta J(\theta_t)$ (first moment)
- $v_{t+1} = \beta_2 v_t + (1-\beta_2)(\nabla_\theta J(\theta_t))^2$ (second moment)
- Bias correction applied to moments
- $\theta_{t+1} = \theta_t – \frac{\eta}{\sqrt{\hat{v}{t+1}} + \epsilon} \hat{m}{t+1}$
- Hyperparameters: η (typically 0.001), β₁ (0.9), β₂ (0.999), ε (10⁻⁸)
- Best For: Default choice for deep learning, works well in most cases
- Limitations: Potential convergence issues, generalization concerns
- Update Rule:
Regularization Techniques
Parameter Norm Penalties
L1 Regularization (Lasso)
- Modified Objective: $J(\theta) + \lambda \sum_i |\theta_i|$
- Effect: Encourages sparse solutions (many weights exactly zero)
- Best For: Feature selection, compression, sparse models
- Typical λ Range: 0.01 to 0.0001
L2 Regularization (Ridge)
- Modified Objective: $J(\theta) + \lambda \sum_i \theta_i^2$
- Effect: Smaller weights across all parameters
- Best For: Preventing overfitting, handling collinearity
- Typical λ Range: 0.01 to 0.0001
Elastic Net
- Modified Objective: $J(\theta) + \lambda_1 \sum_i |\theta_i| + \lambda_2 \sum_i \theta_i^2$
- Effect: Combines L1 and L2 benefits
- Best For: When both sparsity and small weights are desired
- Parameters: Ratio between λ₁ and λ₂
Neural Network Specific
Dropout
- Mechanism: Randomly set activations to zero during training
- Implementation: Apply mask with probability p to layer outputs
- Typical Rates: 0.2-0.5 (higher for larger models)
- Best For: Preventing co-adaptation, ensemble-like effects
- Usage: Applied during training only, scaled during inference
Batch Normalization
- Formula: $y = \frac{x – \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$
- Effect: Normalizes layer inputs, stabilizes training
- Benefits: Faster convergence, higher learning rates, regularization effect
- Limitations: Batch size dependency, inference/training mismatch
Early Stopping
- Mechanism: Stop training when validation performance worsens
- Implementation: Monitor validation metric, save best model, stop after patience period
- Best For: Preventing overfitting without modifying model
- Parameters: Patience (typically 10-50 epochs), monitoring metric
Common Hyperparameters
General ML Algorithms
Decision Trees
- max_depth: Maximum depth of tree (typical: 3-10)
- min_samples_split: Minimum samples required to split (typical: 2-10)
- min_samples_leaf: Minimum samples required in leaf node (typical: 1-5)
- max_features: Number of features to consider for split (typical: “sqrt”, “log2”)
Random Forest/GBM
- n_estimators: Number of trees (typical: 100-1000)
- learning_rate: Step size shrinkage (GBM only, typical: 0.01-0.2)
- subsample: Fraction of samples for tree (typical: 0.5-1.0)
- colsample_bytree: Fraction of features per tree (typical: 0.5-1.0)
SVM
- C: Regularization parameter (typical: 0.1-100)
- kernel: Kernel type (typical: linear, rbf, poly)
- gamma: Kernel coefficient for rbf/poly (typical: 0.001-10)
- degree: Polynomial degree for poly kernel (typical: 2-5)
Neural Networks
Network Architecture
- hidden_layers: Number of hidden layers (typical: 1-5 for MLPs, more for deep networks)
- hidden_units: Neurons per layer (typical: powers of 2, 32-512)
- activation: Activation functions (typical: ReLU, LeakyReLU, Tanh)
Training Parameters
- batch_size: Samples per gradient update (typical: 32-512)
- learning_rate: Step size for optimizer (typical: 0.1-0.0001)
- epochs: Complete passes through dataset (depends on data/model)
Regularization Parameters
- dropout_rate: Probability of dropping neuron (typical: 0.2-0.5)
- weight_decay: L2 regularization strength (typical: 0.0001-0.01)
- batch_norm_momentum: Momentum for batch norm moving average (typical: 0.9-0.99)
Resources for Further Learning
Foundational Literature
- “Pattern Recognition and Machine Learning” by Christopher Bishop
- “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto
- “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Online Courses
- Stanford CS229: Machine Learning
- Stanford CS231n: Convolutional Neural Networks
- UC Berkeley CS285: Deep Reinforcement Learning
- Coursera: Deep Learning Specialization (Andrew Ng)
Frameworks and Libraries
- General ML: scikit-learn, XGBoost, LightGBM
- Deep Learning: PyTorch, TensorFlow/Keras, JAX
- Reinforcement Learning: OpenAI Gym, Stable Baselines, RLlib
- NLP: Hugging Face Transformers, spaCy, NLTK
Research Conferences
- NeurIPS (Neural Information Processing Systems)
- ICML (International Conference on Machine Learning)
- ICLR (International Conference on Learning Representations)
- ACL (Association for Computational Linguistics)
- CVPR (Computer Vision and Pattern Recognition)
This cheat sheet provides a comprehensive overview of AI algorithms, their technical details, and practical implementation considerations. From classical machine learning to cutting-edge deep learning and reinforcement learning, it serves as a quick reference guide for understanding the vast landscape of artificial intelligence techniques.