Navigating the maze of hyperparameter optimization: insights from a systematic study

Summary of my preprint https://arxiv.org/abs/2311.15854

Balázs Kégl
6 min read · Dec 2, 2023

Introduction

In the ever-evolving world of machine learning, hyperparameter optimization (hyperopt) plays a pivotal role. For practicing data scientists, choosing the right hyperopt technique is a strategic decision that can significantly influence the performance of machine learning models.

In this blog post I summarize my independent comparison of hyperopt engines available in the Ray Tune library. I introduce two methods to normalize and aggregate statistics across datasets and models:

  1. Rank-based metrics: This approach answers the question: what is the probability that an engine outperforms random search?
  2. Score-based metrics: This approach measures the degree of improvement over random search. To normalize, I sandwich the performance score of each engine between the scores reached by random search and full grid search (see the sketch below).
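
To make this concrete, here is a minimal sketch of the score-based normalization in Python. It is one plausible reading of the sandwiching described above, not the exact formula from the paper; the function name and the numbers are purely illustrative.

    def improvement_degree(score_engine, score_random, score_grid):
        """Illustrative normalization: 0 means the engine only matches random
        search, 100 means it matches full grid search (the paper's exact
        formula may differ)."""
        if score_grid == score_random:
            return 0.0
        return 100.0 * (score_engine - score_random) / (score_grid - score_random)

    # An engine halfway between random search and full grid search:
    print(improvement_degree(0.85, 0.80, 0.90))  # -> 50.0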

I had three goals:

  1. Ranking hyperopt engines for solving typical classification problems on tabular data.
  2. Making statistically significant statements about their improvement over random search.
  3. Providing actionable recommendations for practicing data scientists.

Key Findings

  1. Superiority over random search. Most of the hyperopt engines outperformed random search, with the best ones accelerating the search two to three times.
  2. Top performers. Among the engines tested, three particularly distinguished themselves: HEBO, AX, and BlendSearch.
  3. Specialization. Some engines seem to specialize in hyperopting certain learning algorithms. This makes it tricky to use hyperopt in comparison studies, since the choice of the hyperopt technique may favor some of the models in the comparison.

In the next sections I will dive into the methodology and the results. For the detailed study, I encourage you to read the paper.

Methodology

I aimed to test various implementations of sequential gradient-free (Bayesian) optimization, a widely used method in hyperopt since its introduction into machine learning. This approach involves learning a probabilistic surrogate model on the completed trials and then optimizing an acquisition function to propose new hyperparameter combinations (arms).
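
To illustrate the sequential loop, here is a generic sketch (not the code used in the study): a Gaussian-process surrogate is fit on the trials completed so far, an acquisition function, here a simple upper confidence bound, is maximized over the finite set of arms, and the proposed arm is evaluated. The toy objective and all constants below are placeholders.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    # Finite set of arms (hyperparameter combinations); here a toy 1-D grid.
    arms = np.linspace(-2.0, 2.0, 41).reshape(-1, 1)

    def evaluate(x):  # placeholder for the expensive train-test-score step
        return -(x[0] ** 2) + 0.1 * np.random.randn()

    rng = np.random.default_rng(0)
    tried, scores = [], []
    for i in rng.choice(len(arms), size=3, replace=False):  # random warm-up trials
        tried.append(arms[i]); scores.append(evaluate(arms[i]))

    for _ in range(20):  # sequential Bayesian optimization loop
        gp = GaussianProcessRegressor(alpha=1e-3, normalize_y=True).fit(np.array(tried), scores)
        mu, sigma = gp.predict(arms, return_std=True)
        ucb = mu + 1.0 * sigma  # upper-confidence-bound acquisition
        proposed = arms[int(np.argmax(ucb))]
        tried.append(proposed); scores.append(evaluate(proposed))

    print("best validation score found:", max(scores))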

An essential component of my experimental methodology was the decision to use a predefined grid. Here’s why this approach was pivotal:

  1. Pre-computing validation scores. By employing a finite grid, I was able to pre-compute the validation scores for the full grid. This let me rapidly retrieve results when an engine pulled an arm, decoupling the computationally expensive train-test-score step from the sequential optimization process (see the sketch after this list).
  2. Real-life relevance of predefined grids. In practical data science scenarios, hyperparameters are often discretized, with data scientists using their domain knowledge to define the meaningfully different values each hyperparameter is allowed to take.
  3. Optimal grid resolution. There is a sweet spot in the grid resolution. Too coarse a grid might miss the optimum, while too fine a grid, combined with a good optimizer, might lead to overfitting.
  4. Simplicity and robustness. Discretization can simplify the hyperopt algorithms and make them more robust, especially in the context of Gaussian processes (GP) and bandits, where discrete input spaces are easier to manage.
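
The decoupling in point 1 can be sketched in a few lines. The grid values and the train_and_score function below are hypothetical placeholders, not the actual grids or models of the study:

    from itertools import product

    # Hypothetical discretized search space (the grids in the study differ).
    grid = {
        "max_depth": [3, 5, 7, 10],
        "n_estimators": [30, 100, 300],
        "learning_rate": [0.01, 0.1, 0.3],
    }

    def train_and_score(config):  # placeholder for the expensive train-test-score step
        return 0.8 + 0.01 * config["max_depth"] - 0.1 * config["learning_rate"]

    # Pre-compute the score of every arm once...
    keys = list(grid)
    lookup = {
        combo: train_and_score(dict(zip(keys, combo)))
        for combo in product(*grid.values())
    }

    # ...so that any engine's query becomes a cheap dictionary lookup.
    def pull_arm(config):
        return lookup[tuple(config[k] for k in keys)]

    print(pull_arm({"max_depth": 7, "n_estimators": 100, "learning_rate": 0.1}))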

Hyperopt engines

I used all Ray Tune engines “out of the box”, following the examples provided in the documentation (except for SigOpt, which sits behind a paywall, and Dragonfly, which cannot handle an integer grid), to avoid “overfitting” my set of experiments. A minimal sketch of how an engine is plugged into Ray Tune follows the list below.

  1. AX is a domain-agnostic engine built and used by Meta for a wide variety of sequential optimization tasks (besides hyperopt, for A/B testing, infrastructure optimization, and hardware design). It links to BOTorch, a Bayesian optimization library built on GPyTorch, a GP library built on PyTorch. It is one of the top three engines overall, and especially good in the low-number-of-trials regime, which may be due to the smart space-filling strategy that jump-starts the optimization.
  2. BayesOpt is a standalone vanilla GP-based Bayesian optimization library. It performed very well on forest-type models, and very badly on SVM and neural nets.
  3. BOHB combines Bayesian optimization and Hyperband, which is a bandit-based approach that speeds up random search using adaptive resource allocation and early stopping. In this study, BOHB did not beat random search, possibly due to its default settings being inadequate for my setup.
  4. CFO is the first of two engines in Microsoft’s FLAML library. It is a local search method with randomized restarts. It is at a disadvantage in this study since its main forte is to manage trials with varying costs, whereas here we measure performance at a given number of trials.
  5. BlendSearch is the second of two engines in Microsoft’s FLAML library. It combines local search (CFO) with a global search. Its forte is still cost-sensitive optimization, and it uses no surrogate model, yet it is one of the top three engines overall.
  6. HEBO is Huawei’s engine that won the 2020 NeurIPS BBO Challenge. It adds sophisticated processing steps to the classical BO framework, such as output warping, multi-objective acquisitions, non-stationary and heteroscedastic models, and input warping. It is one of the top three engines overall, and especially good with moderate and higher numbers of trials.
  7. Hyperopt is one of the oldest hyperopt libraries, implementing Bergstra et al.’s tree of Parzen estimators (TPE). It is tuned for high-dimensional computer vision tasks, so it seems to be disadvantaged in the low-dimensional low-number-of-trials regime.
  8. Nevergrad is a gradient-free optimization engine from Meta. As suggested by the Ray Tune documentation, I use the 1+1 engine, which is perhaps not the best choice for my regime.
  9. Optuna is a new-generation hyperopt library providing features such as pruning and dynamically constructed search spaces. Here I use its basic Bayesian optimization core, which barely beats random search.
  10. SkOpt is an open-source community-developed Bayesian optimization library. It performs better than random search but overall does not reach the performance of the top engines.
  11. ZOOpt is a zeroth-order (derivative-free) optimization engine. It performs well on the rank-based metrics but not on the score-based metrics. Its main issue seems to be that it uses only a small fraction of the trial budget: once it thinks it has found the optimum, it stops exploring.
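
For reference, here is a minimal sketch of how one of these engines is plugged into Ray Tune, roughly following the documented examples. Module paths and APIs vary across Ray versions (the layout below assumes the Ray 2.x ray.tune.search modules and an installed hyperopt package), and the objective is a toy placeholder rather than a pre-computed validation score:

    from ray import tune
    from ray.tune.search.hyperopt import HyperOptSearch  # or HEBOSearch, AxSearch, BlendSearch, ...

    def trainable(config):
        # Toy objective; in the study this would be the pre-computed validation score of the arm.
        score = -(config["learning_rate"] - 0.1) ** 2 - 0.001 * config["max_depth"]
        return {"score": score}

    search_space = {
        "learning_rate": tune.choice([0.01, 0.1, 0.3]),  # discretized grid, as in the study
        "max_depth": tune.choice([3, 5, 7, 10]),
    }

    tuner = tune.Tuner(
        trainable,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            search_alg=HyperOptSearch(metric="score", mode="max"),
            num_samples=20,
        ),
    )
    results = tuner.fit()
    print(results.get_best_result(metric="score", mode="max").config)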

Results

I tested all engines on five binary classification algorithms and on five classification data sets, all downloaded from OpenML. I selected the data sets for their large size, to be able to precisely measure the test scores.

The next table and figure summarize the results:

Three engines seem to perform significantly better than the rest: HEBO, BlendSearch, and AX. Using these engines, we can consistently accelerate random search by two to three times or, from another angle, cut the difference between full grid search and random search with a given budget by about half (improvement degree of 50). HEBO is especially robust, managing to reach an improvement degree of 50 on all five models. The forte of AX is its performance on an extra-small budget (m = 1).

I found that some engines seem to specialize; for example, Nevergrad is strong at optimizing SVM, whereas SkOpt is good at random forests and XGBoost. What this means is that in a comparison study of two algorithms on a data set, the winner may depend on which hyperopt engine is used. In fact, I found that out of the 8250 possible pairwise comparisons (a pair of models, a pair of engines, and one of the data sets), about 4.3% invert the winning model. This may have quite serious consequences in systematic comparison studies, so in such studies I suggest that either full grid search or random search be used. The latter will add noise to the statistical tests but will not bias them.

Limitations

  1. Coarse grid. The search grids I used are relatively coarse, and I found that this may disadvantage some of the engines towards the bottom of the rankings. In preliminary experiments I found that some of these engines can make up the difference if a finer grid is used. Nevertheless, the overall best result will not be better than with a coarser grid; these engines improve only relative to random search and to their own coarse-grid performance.
  2. Sequential optimization. To simplify the experiments, I did not use parallel search. Arguably, some engines that specialize in optimizing parallel search may be disfavored by our approach.
  3. Limited data, metrics, and models. I will open-source the code and the experimental setup to accommodate additional experiments on a wider range of data sets, metrics, and models.

Please let me know in the comment section if you find my study useful and which additional experiments you would run. If there are other classification algorithms that you would like me to include in the study, send me the library and the hyperparameter grid.


Balázs Kégl

Head of AI research, Huawei France, previously head of the Paris-Saclay Center for Data Science. Podcast: https://www.youtube.com/@I.scientist.balazskegl