Complete Bayesian Optimization Cheat Sheet: Methods, Implementation & Applications

Introduction to Bayesian Optimization

Bayesian Optimization (BO) is a sequential, model-based approach for optimizing expensive black-box functions. It excels at finding global optima with minimal function evaluations, making it ideal for tasks where each evaluation is costly in terms of time, computation, or resources.

Why Bayesian Optimization Matters:

  • Efficiently optimizes expensive-to-evaluate functions
  • Requires fewer function evaluations than grid/random search
  • Handles noisy objective functions naturally
  • Works well with non-convex, multi-modal functions
  • Balances exploration and exploitation automatically
  • Provides uncertainty estimates for predictions

Common Applications:

  • Hyperparameter tuning for machine learning models
  • Experimental design in scientific research
  • Drug discovery and materials science
  • Robotics and control systems
  • A/B testing optimization
  • Sensor placement and network design

Core Concepts & Principles

The Bayesian Optimization Framework

Bayesian Optimization has two main components:

  1. Surrogate Model: Approximates the objective function and quantifies uncertainty
  2. Acquisition Function: Determines where to sample next based on the surrogate model

The process iteratively builds a probabilistic model of the objective function and uses it to make informed decisions about which points to evaluate next.

Key Components

| Component | Description | Common Options |
|---|---|---|
| Surrogate Model | Probabilistic model of the objective function | Gaussian Process, Random Forest, Bayesian Neural Network |
| Acquisition Function | Strategy for selecting the next points to evaluate | EI, UCB, PI, Thompson Sampling |
| Domain/Search Space | Set of possible input configurations | Continuous, discrete, mixed, constrained |
| Objective Function | Black-box function being optimized | Can be deterministic or noisy |

Surrogate Models

| Model | Pros | Cons | Best For |
|---|---|---|---|
| Gaussian Process | Uncertainty quantification, theoretical guarantees, smooth interpolation | Scales poorly with dimensions and data size (O(n³)) | Low-dimensional problems (<20 dims), when uncertainty is critical |
| Random Forest | Handles categorical variables, scales to high dimensions | Less precise uncertainty estimates | High-dimensional problems, mixed variable types |
| TPE (Tree-structured Parzen Estimator) | Works well with conditional parameters | Weaker theoretical foundation | Hierarchical search spaces |
| Bayesian Neural Network | Scales to high dimensions and complex functions | Harder to train and calibrate | Very complex, high-dimensional problems |
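
As a concrete illustration of the most common surrogate, here is a minimal from-scratch Gaussian Process regressor in NumPy. This is a sketch with fixed kernel hyperparameters; real libraries fit the lengthscale, signal variance, and noise by maximizing the marginal likelihood:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-|a - b|^2 / (2 l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-sq / (2 * lengthscale ** 2))

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at X_test."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)                      # the O(n^3) step from the table
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Fit on a few noisy-free observations of sin(x) and query a grid
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(8, 1))
y = np.sin(X).ravel()
Xq = np.linspace(0, 2 * np.pi, 50)[:, None]
mu, sd = gp_posterior(X, y, Xq)
```

Note how the posterior standard deviation is near zero at observed points and grows between and beyond them; this uncertainty estimate is exactly what the acquisition functions below consume.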

Acquisition Functions

| Function | Formula | Behavior | When to Use |
|---|---|---|---|
| Expected Improvement (EI) | EI(x) = E[max(f(x) − f(x⁺), 0)] | Balanced exploration/exploitation | General-purpose default |
| Upper Confidence Bound (UCB) | UCB(x) = μ(x) + κσ(x) | Tunable exploration parameter κ | When exploration needs adjustment |
| Probability of Improvement (PI) | PI(x) = P(f(x) > f(x⁺) + ξ) | More exploitative | When focusing on promising regions |
| Thompson Sampling | Sample from posterior and maximize | Naturally balances exploration | Inherently Bayesian approach |
| Entropy Search | Reduces uncertainty about the optimum | Information-theoretic approach | When the optimum's location matters more than its value |
| Knowledge Gradient | Expected increase in the best value | One-step lookahead | When evaluations are extremely limited |
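
The closed-form acquisition functions above take only a few lines given the surrogate's posterior mean μ(x) and standard deviation σ(x). A NumPy/SciPy sketch, assuming a maximization problem where `best` is the incumbent value f(x⁺):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization: E[max(f(x) - best - xi, 0)] under f(x) ~ N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance at observed points
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: larger kappa means more exploration."""
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, best, xi=0.01):
    """PI: probability of beating the incumbent by at least xi."""
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((mu - best - xi) / sigma)

# Three candidate points: uncertain, middling, and confident-but-better
mu = np.array([0.0, 0.5, 1.0])
sigma = np.array([1.0, 0.5, 0.1])
best = 0.8
ei = expected_improvement(mu, sigma, best)
```

Note that the first candidate has mean well below the incumbent yet still receives positive EI because of its large uncertainty; this is the exploration term at work.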

Step-by-Step Bayesian Optimization Process

  1. Define Problem & Search Space

    • Identify parameters to optimize
    • Determine parameter ranges and types
    • Specify constraints if applicable
    • Define objective function
  2. Initialize with Initial Points

    • Generate initial design points (random, Latin hypercube, etc.)
    • Evaluate objective function at those points
    • Create initial dataset D = {(x₁,y₁), (x₂,y₂), …, (xₙ,yₙ)}
  3. Fit Surrogate Model

    • Choose appropriate surrogate model
    • Fit/update model using all available data
    • Validate model quality if possible
  4. Optimize Acquisition Function

    • Select acquisition function
    • Optimize acquisition function over search space
    • Identify next point(s) to evaluate
  5. Evaluate Objective Function

    • Evaluate objective at selected point(s)
    • Add new observation(s) to dataset
  6. Update & Iterate

    • Refit the surrogate model with the expanded dataset
    • Repeat steps 3–6 until a stopping criterion (evaluation budget, convergence, or time limit) is met
  7. Select Final Configuration

    • Choose best observed point or
    • Fit final model and predict optimum
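
The steps above can be sketched end-to-end on a 1-D toy problem. This is pure NumPy/SciPy with a fixed-hyperparameter GP and grid-based acquisition optimization for simplicity; the objective and all constants are illustrative:

```python
import numpy as np
from scipy.stats import norm

def kernel(A, B, ls=0.5):
    """RBF kernel on 1-D inputs with unit signal variance."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def gp(Xt, yt, Xq, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at query points Xq."""
    K = kernel(Xt, Xt) + noise * np.eye(len(Xt))
    Ks = kernel(Xt, Xq)
    mu = Ks.T @ np.linalg.solve(K, yt)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def objective(x):                             # toy black box; maximum at x = 2.0
    return -(x - 2.0) ** 2

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 5.0, 400)             # step 1: search space
X = rng.uniform(0.0, 5.0, size=4)             # step 2: initial design
y = objective(X)
for _ in range(15):                           # steps 3-6: the BO loop
    mu, sd = gp(X, y, grid)                   # fit surrogate
    best = y.max()
    z = (mu - best) / sd
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)   # acquisition (EI)
    x_next = grid[np.argmax(ei)]              # optimize acquisition on the grid
    X = np.append(X, x_next)                  # evaluate and augment dataset
    y = np.append(y, objective(x_next))
x_best = X[np.argmax(y)]                      # step 7: best observed point
```

In practice the acquisition is optimized with multi-start gradient methods rather than a grid, and kernel hyperparameters are refit each iteration, but the control flow is exactly this loop.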

Implementation Techniques

Search Space Design

| Type | Description | Implementation Approaches |
|---|---|---|
| Continuous | Real-valued parameters | Direct parameterization; log/logit transforms for bounded variables |
| Integer | Whole-number parameters | Rounding, discretization, specialized kernels |
| Categorical | Finite set of options | One-hot encoding, specialized kernels, embeddings |
| Conditional | Parameters dependent on others | Hierarchical models, structured search spaces |
| Constrained | Valid region defined by constraints | Penalty methods, constrained acquisition optimization |
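
A sketch of how a mixed space might be encoded into the numeric vector a surrogate model sees, combining a log transform for the continuous parameter and one-hot encoding for the categorical one. The parameter names and bounds here are hypothetical:

```python
import numpy as np

# Hypothetical mixed search space: learning rate (continuous, log-scaled),
# num_layers (integer), optimizer (categorical, one-hot encoded).
LR_BOUNDS = (1e-5, 1e-1)
LAYER_BOUNDS = (1, 8)
OPTIMIZERS = ["sgd", "adam", "rmsprop"]

def encode(lr, num_layers, optimizer):
    """Map a configuration into the numeric vector the surrogate model sees."""
    lo, hi = np.log10(LR_BOUNDS[0]), np.log10(LR_BOUNDS[1])
    x_lr = (np.log10(lr) - lo) / (hi - lo)            # log transform, then scale to [0, 1]
    x_l = (num_layers - LAYER_BOUNDS[0]) / (LAYER_BOUNDS[1] - LAYER_BOUNDS[0])
    onehot = [1.0 if o == optimizer else 0.0 for o in OPTIMIZERS]
    return np.array([x_lr, x_l] + onehot)

def decode(x):
    """Inverse map: surrogate vector back to a concrete configuration."""
    lo, hi = np.log10(LR_BOUNDS[0]), np.log10(LR_BOUNDS[1])
    lr = 10 ** (lo + x[0] * (hi - lo))
    num_layers = int(round(LAYER_BOUNDS[0] + x[1] * (LAYER_BOUNDS[1] - LAYER_BOUNDS[0])))
    optimizer = OPTIMIZERS[int(np.argmax(x[2:]))]
    return lr, num_layers, optimizer

vec = encode(3e-3, 4, "adam")
```

The log transform matters: without it, the surrogate would treat the distance between 1e-5 and 1e-4 as negligible compared to the distance between 0.05 and 0.1, even though both spans are equally significant for training behavior.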

Initialization Strategies

| Strategy | Description | Benefits |
|---|---|---|
| Random Sampling | Randomly sample from the search space | Simple, unbiased |
| Latin Hypercube | Space-filling design | Better coverage than random |
| Sobol Sequence | Low-discrepancy sequence | Progressive, uniform coverage |
| Prior Knowledge | Start with known good configurations | Faster convergence to good regions |
| Warm Start | Use results from similar problems | Transfers knowledge across tasks |
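
Latin hypercube and Sobol initial designs are available in `scipy.stats.qmc`; a brief sketch of generating and scaling them to a search space:

```python
import numpy as np
from scipy.stats import qmc

dim, n = 3, 16
bounds_lo, bounds_hi = [0.0, -5.0, 1.0], [1.0, 5.0, 10.0]

# Latin hypercube: exactly one sample per axis-aligned stratum in each dimension.
lhs = qmc.LatinHypercube(d=dim, seed=0).random(n)

# Sobol: low-discrepancy sequence; powers of two preserve its balance properties.
sobol = qmc.Sobol(d=dim, seed=0).random(n)

# Scale unit-cube samples to the actual search-space bounds.
init_points = qmc.scale(lhs, bounds_lo, bounds_hi)
```

Either design can feed the initial dataset of the BO loop; both avoid the clustering and gaps that plain random sampling tends to produce with small budgets.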

Practical Considerations

| Issue | Description | Solutions |
|---|---|---|
| Noise Handling | Dealing with stochastic objective functions | Repeated evaluations, noise-aware surrogate models |
| Batch Evaluation | Evaluating multiple points in parallel | q-EI, q-UCB, BUCB, local penalization |
| Multi-Objective | Optimizing multiple objectives simultaneously | Pareto front modeling, scalarization, ParEGO |
| High Dimensionality | Handling many parameters | Random embeddings, TuRBO, trust regions, additive GPs |
| Cost-Aware Optimization | Varying evaluation costs | Cost-weighted acquisition functions |
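
The simplest noise-handling tactic from the table, repeated evaluation, can be sanity-checked directly: the standard deviation of an averaged observation shrinks roughly as 1/√k for k repeats. The stochastic toy objective below is illustrative:

```python
import numpy as np

def noisy_objective(x, rng):
    """Hypothetical stochastic black box: true value sin(x) plus N(0, 0.3^2) noise."""
    return np.sin(x) + rng.normal(0.0, 0.3)

def evaluate(x, rng, repeats=1):
    """Average repeated evaluations to reduce observation noise."""
    return np.mean([noisy_objective(x, rng) for _ in range(repeats)])

rng = np.random.default_rng(0)
x = 1.0
single = np.array([evaluate(x, rng, repeats=1) for _ in range(2000)])
averaged = np.array([evaluate(x, rng, repeats=16) for _ in range(2000)])
```

Averaging 16 repeats cuts the observation noise by a factor of about 4, at 16 times the cost; a noise-aware surrogate (a GP with an inferred noise term) often gets the same robustness without the repeated evaluations.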

Comparison with Other Optimization Methods

| Method | Pros | Cons | When to Use Over BO |
|---|---|---|---|
| Grid Search | Simple, parallelizable | Exponential scaling with dimensions | Very few parameters, exhaustive search needed |
| Random Search | Simple, scales better than grid | Still inefficient | Quick prototype, baseline comparison |
| Genetic Algorithms | Works well for combinatorial problems | Requires many evaluations | Discrete, combinatorial problems with cheap evaluations |
| Gradient Descent | Fast convergence when applicable | Requires gradient information | When gradients are available and the function is smooth |
| Simulated Annealing | Good for discrete spaces | Many evaluations needed | Discrete problems with cheap evaluations |
| CMA-ES | Works well for non-convex problems | Requires many evaluations | Black-box functions with moderate evaluation cost |

Common Challenges & Solutions

| Challenge | Description | Solutions |
|---|---|---|
| Exploration-Exploitation Tradeoff | Balancing new regions vs. promising areas | Adjust acquisition function parameters, ensemble strategies |
| High-Dimensional Spaces | BO performance degrades with many dimensions | Trust regions, random embeddings, additive GPs |
| Categorical Variables | Handling non-continuous parameters | One-hot encoding, specialized kernels, hierarchical models |
| Constrained Optimization | Respecting constraints in the search | Constrained acquisition functions, penalty methods |
| Multi-Fidelity Optimization | Leveraging cheap approximations | Multi-fidelity GPs, knowledge transfer, early stopping |
| Outliers & Model Misspecification | Dealing with model breakdown | Robust regression, outlier detection, model validation |

Advanced Techniques

Multi-Fidelity Bayesian Optimization

Uses evaluations of varying quality/cost to accelerate optimization:

  • Multi-task Gaussian Processes across fidelity levels
  • Freeze-Thaw Bayesian Optimization (models partial training curves)
  • FABOLAS (models validation loss as a function of training-set size)
  • Bandit-based early stopping combined with BO (e.g., BOHB)

Parallel & Distributed Optimization

Methods for batch parallel evaluation:

  • q-EI/q-UCB: Multi-point acquisition functions
  • Local Penalization: Penalizes points near previous batch members
  • Thompson Sampling: Naturally parallelizable
  • BUCB: Batch Upper Confidence Bound
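
Thompson sampling yields a batch almost for free: draw q independent samples from the surrogate posterior and let each worker evaluate the maximizer of its own draw. A grid-based sketch with an illustrative GP:

```python
import numpy as np

def kernel(A, B, ls=0.6):
    """RBF kernel on 1-D inputs with unit signal variance."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def posterior(Xt, yt, Xq, noise=1e-4):
    """GP posterior mean and full covariance on a query grid."""
    K = kernel(Xt, Xt) + noise * np.eye(len(Xt))
    Ks = kernel(Xt, Xq)
    Kss = kernel(Xq, Xq)
    A = np.linalg.solve(K, Ks)
    return A.T @ yt, Kss - Ks.T @ A

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 5.0, 200)
Xt = np.array([0.5, 2.0, 4.0])            # points already evaluated
yt = np.sin(Xt)
mu, cov = posterior(Xt, yt, grid)
cov += 1e-8 * np.eye(len(grid))           # jitter for numerical stability

# One batch member per posterior draw: each worker maximizes its own sample.
q = 4
draws = rng.multivariate_normal(mu, cov, size=q)
batch = grid[np.argmax(draws, axis=1)]
```

Because each draw is an independent plausible objective function, the batch naturally spreads across regions the model considers promising, with no explicit diversity penalty needed.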

Transfer Learning in BO

Leveraging knowledge from related tasks:

  • Multi-task Gaussian Processes
  • Ranking-weighted Gaussian Process Ensemble
  • Meta-Learning for BO
  • Combining Multiple Surrogate Models

Best Practices & Tips

General Guidelines

  • Start with sufficient initial points (rule of thumb: 10× number of dimensions)
  • Use Expected Improvement (EI) as the default acquisition function unless you have a specific reason to prefer another
  • Log-transform parameters with large ranges (e.g., learning rates)
  • Normalize input dimensions to similar scales
  • Verify model fit periodically (posterior predictive checks)
  • Save all evaluations for reproducibility and analysis

For Machine Learning Hyperparameters

  • Start with the most important hyperparameters
  • Use domain knowledge to set reasonable bounds
  • Consider dependencies between parameters
  • Log-transform learning rates, regularization strengths
  • Use early stopping criteria when possible
  • Consider multi-fidelity approaches with subset of data

For Experimental Design

  • Incorporate known constraints explicitly
  • Account for measurement error in surrogate model
  • Consider cost of different experiments in acquisition function
  • Record all experimental conditions, not just varied parameters

Software Tools & Libraries

| Tool | Language | Key Features | Best For |
|---|---|---|---|
| Scikit-Optimize | Python | Simple API, integration with scikit-learn | Getting started, ML hyperparameters |
| Optuna | Python | Pruning, visualization, distributed optimization | Complex ML pipelines, conditional spaces |
| GPyOpt | Python | Extensive GP models, advanced acquisition functions | Research, custom BO implementations |
| BoTorch | Python | PyTorch integration, multi-objective, custom acquisition functions | Deep learning, custom models |
| Bayesian Optimization | R | R interface, various surrogate models | R users, statistical applications |
| SMAC3 | Python | Random forests, categorical variables | High-dimensional, mixed parameter types |
| Hyperopt | Python | Tree-structured Parzen Estimator, parallel evaluation | Hierarchical search spaces |
| MOE | Python/C++ | Metric optimization engine, parallel evaluation | Industrial applications, performance |

Resources for Further Learning

Books & Surveys

  • “Gaussian Processes for Machine Learning” by Rasmussen & Williams
  • “Taking the Human Out of the Loop: A Review of Bayesian Optimization” by Shahriari et al.
  • “A Tutorial on Bayesian Optimization” by Frazier

Key Papers

  • Snoek et al. (2012) – “Practical Bayesian Optimization of Machine Learning Algorithms”
  • Hernández-Lobato et al. (2014) – “Predictive Entropy Search”
  • Srinivas et al. (2010) – “Gaussian Process Optimization in the Bandit Setting”
  • Kandasamy et al. (2015) – “High Dimensional Bayesian Optimisation and Bandits via Additive Models”
  • Springenberg et al. (2016) – “Bayesian Optimization with Robust Bayesian Neural Networks”
