src.environment.problem.SOO.HPO_B.hpob_dataset

Problem Difficulty Classification

By default, the dataset is split into a fixed training set and a testing set.

Set Type        Source Data                        Number of Problems
Training Set    meta_train_data                    758
Testing Set     meta_vali_data + meta_test_data    177

Note: If difficulty is set to ‘all’, the training and testing sets are merged into a single set containing all 935 problems.
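
The split described above can be sketched in a few lines. The helper `split_problems` and the placeholder identifiers are hypothetical and not part of this module. The sketch also assumes that in ‘all’ mode both returned sets are the same merged list, which the note does not spell out.

```python
def split_problems(meta_train, meta_vali, meta_test, difficulty=None):
    """Return (train, test) problem lists (hypothetical helper)."""
    train = list(meta_train)
    # The testing set combines the validation and test sources.
    test = list(meta_vali) + list(meta_test)
    if difficulty == "all":
        # Assumption: both returned sets hold the merged list.
        merged = train + test
        return merged, merged
    return train, test


# Placeholder identifiers. Only the totals 758 and 177 come from the table;
# the 100/77 division between validation and test is arbitrary.
meta_train = [f"train-{i}" for i in range(758)]
meta_vali = [f"vali-{i}" for i in range(100)]
meta_test = [f"test-{i}" for i in range(77)]

train, test = split_problems(meta_train, meta_vali, meta_test)
print(len(train), len(test))  # 758 177

all_train, all_test = split_problems(meta_train, meta_vali, meta_test, "all")
print(len(all_train))  # 935
```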

Module Contents

Classes

HPOB_Dataset

Introduction

HPO-B is an AutoML hyperparameter optimization benchmark that includes a wide range of hyperparameter optimization tasks for 16 different model types (e.g., SVM, XGBoost), resulting in a total of 935 problem instances. The dimension of these problem instances ranges from 2 to 16. HPO-B also features problems with ill-conditioned landscapes, such as large flat regions.

Functions

get_data

Introduction

Loads and returns training, validation, and test datasets along with Bayesian Optimization (BO) initializations and surrogate statistics based on the specified mode and directory paths.

load_data

Introduction

Loads HPO-B benchmark datasets according to the specified parameters, supporting different dataset versions, test/train splits, and augmented training data.

get_bst

Introduction

Loads a pre-trained XGBoost surrogate model and retrieves its associated normalization statistics for a given search space and dataset.

API

class src.environment.problem.SOO.HPO_B.hpob_dataset.HPOB_Dataset(data, batch_size=1)[source]

Bases: torch.utils.data.Dataset

Introduction

HPO-B is an AutoML hyperparameter optimization benchmark that includes a wide range of hyperparameter optimization tasks for 16 different model types (e.g., SVM, XGBoost), resulting in a total of 935 problem instances. The dimension of these problem instances ranges from 2 to 16. HPO-B also features problems with ill-conditioned landscapes, such as large flat regions.

Original paper

“HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO Based on OpenML.” arXiv preprint arXiv:2106.06257 (2021).

Official Implementation

HPO-B

License

None

Initialization

Introduction

Initializes the dataset object for handling batches of data items, determining the maximum dimension among items, and setting up batch pointers and indices.

Args:

  • data (list): A list of data items, each expected to have a dim attribute.

  • batch_size (int, optional): The number of items per batch. Defaults to 1.

Built-in Attribute:

  • self.data (list): Stores the input data.

  • self.maxdim (int): The maximum dimension found among all data items. Defaults to 0.

  • self.batch_size (int): The batch size for processing data.

  • self.N (int): The total number of data items.

  • self.ptr (list): List of starting indices for each batch.

  • self.index (np.ndarray): Array of indices for the data items.

Returns:

  • None
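
A minimal sketch of how these attributes could be derived from the inputs is given below. `ToyProblem` and `ToyDataset` are hypothetical stand-ins, not the real classes. The sketch assumes that `ptr` holds the multiples of `batch_size` below `N`, and it uses a plain list for `index` where the documented attribute is an `np.ndarray`.

```python
from dataclasses import dataclass


@dataclass
class ToyProblem:
    """Stand-in for a problem instance; only `dim` matters here."""
    dim: int


class ToyDataset:
    """Hypothetical re-creation of the attributes listed above."""

    def __init__(self, data, batch_size=1):
        self.data = data
        self.batch_size = batch_size
        self.N = len(data)
        # Start index of each batch: 0, batch_size, 2 * batch_size, ...
        self.ptr = list(range(0, self.N, batch_size))
        # A plain list keeps this sketch dependency-free.
        self.index = list(range(self.N))
        # Largest dimension among the items, 0 for an empty dataset.
        self.maxdim = max((item.dim for item in data), default=0)


ds = ToyDataset([ToyProblem(2), ToyProblem(16), ToyProblem(5)], batch_size=2)
print(ds.N, ds.ptr, ds.maxdim)  # 3 [0, 2] 16
```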

static get_datasets(datapath=None, train_batch_size=1, test_batch_size=1, upperbound=None, difficulty=None, user_train_list=None, user_test_list=None, cost_normalize=False)[source]

Introduction

Loads and processes the HPO-B benchmark datasets, returning train and test sets as HPOB_Dataset objects. Handles dataset downloading, extraction, and filtering based on user-specified lists or difficulty level.

Args:

  • datapath (str, optional): Path to the root directory containing the HPO-B data. If None, defaults to a subdirectory in the current working directory.

  • train_batch_size (int, optional): Batch size for the training dataset. Defaults to 1.

  • test_batch_size (int, optional): Batch size for the test dataset. Defaults to 1.

  • upperbound (float, optional): Upper bound for the problem domain. Used to set the search space limits.

  • difficulty (str, optional): If set to ‘all’, merges train and test sets. Otherwise, uses user-specified lists for filtering.

  • user_train_list (list of str, optional): List of problem identifiers to include in the training set.

  • user_test_list (list of str, optional): List of problem identifiers to include in the test set.

  • cost_normalize (bool, optional): Whether to normalize the cost values in the problems. Defaults to False.

Returns:

  • tuple: A tuple containing:

    • HPOB_Dataset: The training dataset.

    • HPOB_Dataset: The test dataset.

Raises:

  • NotImplementedError: If neither user_train_list nor user_test_list is provided when required for filtering.

__getitem__(item)[source]

Introduction

Retrieves a batch of data samples corresponding to the given index.

Args:

  • item (int): The index of the batch to retrieve.

Returns:

  • list: A list containing the data samples for the specified batch.

Raises:

  • IndexError: If item is out of range of the available batches.
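
The batch lookup can be illustrated with a standalone function. `get_batch` is a hypothetical helper that assumes batches are read through `ptr` (batch start positions) and `index` (item order), as the attribute descriptions suggest; the real method may differ in details such as how negative indices are handled.

```python
def get_batch(data, index, ptr, batch_size, item):
    """Return the items of batch number `item` (hypothetical helper)."""
    if item < 0 or item >= len(ptr):
        raise IndexError(f"batch index {item} out of range")
    start = ptr[item]
    # The last batch may be shorter than batch_size.
    stop = min(start + batch_size, len(data))
    # `index` holds the (possibly shuffled) order in which items are read.
    return [data[index[i]] for i in range(start, stop)]


data = ["p0", "p1", "p2", "p3", "p4"]
index = [4, 2, 0, 1, 3]  # e.g. after a shuffle
ptr = list(range(0, len(data), 2))

print(get_batch(data, index, ptr, 2, 0))  # ['p4', 'p2']
print(get_batch(data, index, ptr, 2, 2))  # ['p3']
```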

__len__()[source]

Introduction

Returns the number of elements in the dataset.

Returns:

  • int: The total number of elements in the dataset.

__add__(other: src.environment.problem.SOO.HPO_B.hpob_dataset.HPOB_Dataset)[source]

Introduction

Combines two HPOB_Dataset instances by concatenating their data attributes.

Args:

  • other (HPOB_Dataset): Another dataset instance to be added.

Returns:

  • HPOB_Dataset: A new dataset containing the combined data from both instances.

Raises:

  • AttributeError: If other does not have a data attribute.
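
The concatenation behaviour might look like the sketch below. `ToyDataset` is a hypothetical stand-in, and keeping the left operand's batch size for the result is an assumption that the description above does not confirm.

```python
class ToyDataset:
    """Hypothetical stand-in holding only data and batch_size."""

    def __init__(self, data, batch_size=1):
        self.data = data
        self.batch_size = batch_size

    def __add__(self, other):
        # Accessing other.data raises AttributeError if it is missing.
        # Assumption: the result keeps the left operand's batch size.
        return ToyDataset(self.data + other.data, self.batch_size)


a = ToyDataset(["p0", "p1"], batch_size=2)
b = ToyDataset(["p2"])
c = a + b
print(c.data)  # ['p0', 'p1', 'p2']
```

List concatenation builds a new list, so neither operand is modified in this sketch.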

shuffle()[source]

Introduction

Randomly shuffles the indices of the dataset to change the order of data access.

Built-in Attribute:

  • self.N (int): The total number of data points in the dataset.

Returns:

  • None

Side Effects:

  • Updates self.index with a new permutation of indices from 0 to self.N - 1.
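
The effect on `self.index` can be sketched with the standard library alone. `shuffled_index` is a hypothetical helper; the documented attribute is an `np.ndarray`, so the real method presumably relies on NumPy, and the `seed` parameter exists only to make this sketch reproducible.

```python
import random


def shuffled_index(n, seed=None):
    """Return a random permutation of 0..n-1 (hypothetical helper)."""
    rng = random.Random(seed)
    index = list(range(n))
    rng.shuffle(index)  # in-place shuffle of the index list
    return index


index = shuffled_index(5, seed=0)
# Every index from 0 to n - 1 still appears exactly once.
print(sorted(index))  # [0, 1, 2, 3, 4]
```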

src.environment.problem.SOO.HPO_B.hpob_dataset.get_data(mode, surrogates_dir, root_dir)[source]

Introduction

Loads and returns training, validation, and test datasets along with Bayesian Optimization (BO) initializations and surrogate statistics based on the specified mode and directory paths.

Args:

  • mode (str): The mode specifying which dataset version or configuration to load. Supported values are “v1”, “v2”, “v3”, “v3-test”, and “v3-train-augmented”.

  • surrogates_dir (str): Directory path where the surrogate statistics file (“summary-stats.json”) is located.

  • root_dir (str): Root directory path containing the dataset files.

Returns:

  • train_set: The training dataset, or None if not applicable for the selected mode.

  • vali_set: The validation dataset, or None if not applicable for the selected mode.

  • test_set: The test dataset.

  • bo_initializations: Initializations for Bayesian Optimization.

  • surrogates_stats (dict): Dictionary containing surrogate statistics loaded from “summary-stats.json”.

Raises:

  • ValueError: If an invalid mode is provided.
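
One plausible way to translate the mode string into `load_data` arguments is sketched here. `resolve_load_args` is hypothetical, and the mapping is an assumption inferred from the mode names and the `load_data` parameters documented on this page, not a confirmed description of the implementation.

```python
SUPPORTED_MODES = ("v1", "v2", "v3", "v3-test", "v3-train-augmented")


def resolve_load_args(mode):
    """Map a mode string to load_data keyword arguments (hypothetical)."""
    if mode == "v3-test":
        # Assumed: only the test data is loaded.
        return {"version": "v3", "only_test": True, "augmented_train": False}
    if mode == "v3-train-augmented":
        # Assumed: the augmented training data replaces the plain one.
        return {"version": "v3", "only_test": False, "augmented_train": True}
    if mode in ("v1", "v2", "v3"):
        return {"version": mode, "only_test": False, "augmented_train": False}
    raise ValueError(f"Invalid mode {mode!r}; expected one of {SUPPORTED_MODES}")


print(resolve_load_args("v2"))
# {'version': 'v2', 'only_test': False, 'augmented_train': False}
```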

src.environment.problem.SOO.HPO_B.hpob_dataset.load_data(rootdir='', version='v3', only_test=True, augmented_train=False)[source]

Introduction

Loads HPO-B benchmark datasets according to the specified parameters, supporting different dataset versions, test/train splits, and augmented training data.

Args:

  • rootdir (str, optional): Path to the directory containing the benchmark data files. Defaults to “”.

  • version (str, optional): HPOB dataset version to use. Options are “v1”, “v2”, or “v3”. Defaults to “v3”.

  • only_test (bool, optional): If True, loads only the test data (valid only for version “v3”). Defaults to True.

  • augmented_train (bool, optional): If True, loads the augmented training data (valid only for version “v3”). Defaults to False.

Returns:

  • meta_train_data (dict or None): The meta-training dataset, or None if only_test is True.

  • meta_validation_data (dict or None): The meta-validation dataset, or None if only_test is True.

  • meta_test_data (dict): The meta-testing dataset.

  • bo_initializations (dict): The Bayesian optimization initializations.

Raises:

  • FileNotFoundError: If any of the required dataset files are missing in the specified root directory.

  • json.JSONDecodeError: If any of the dataset files are not valid JSON.
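
The two error cases can be reproduced with a small standalone reader. `read_json` and the file names below are hypothetical; the actual names of the benchmark files are not given on this page.

```python
import json
import os
import tempfile


def read_json(rootdir, filename):
    """Read one JSON file from rootdir (hypothetical helper)."""
    path = os.path.join(rootdir, filename)
    with open(path, "r") as f:  # FileNotFoundError if the file is missing
        return json.load(f)     # json.JSONDecodeError if it is malformed


with tempfile.TemporaryDirectory() as root:
    # One valid file and one malformed file with placeholder content.
    with open(os.path.join(root, "example.json"), "w") as f:
        json.dump({"key": [1, 2, 3]}, f)
    with open(os.path.join(root, "broken.json"), "w") as f:
        f.write("{not json")

    data = read_json(root, "example.json")

    outcomes = []
    for name in ("missing.json", "broken.json"):
        try:
            read_json(root, name)
        except FileNotFoundError:
            outcomes.append("missing")
        except json.JSONDecodeError:
            outcomes.append("invalid")

print(data)      # {'key': [1, 2, 3]}
print(outcomes)  # ['missing', 'invalid']
```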

src.environment.problem.SOO.HPO_B.hpob_dataset.get_bst(surrogates_dir, search_space_id, dataset_id, surrogates_stats)[source]

Introduction

Loads a pre-trained XGBoost surrogate model and retrieves its associated normalization statistics for a given search space and dataset.

Args:

  • surrogates_dir (str): Directory path where surrogate models are stored.

  • search_space_id (str): Identifier for the search space.

  • dataset_id (str): Identifier for the dataset.

  • surrogates_stats (dict): Dictionary containing normalization statistics for each surrogate model.

Returns:

  • bst_surrogate (xgboost.Booster): Loaded XGBoost surrogate model.

  • y_min (float): Minimum target value used for normalization.

  • y_max (float): Maximum target value used for normalization.

Raises:

  • AssertionError: If y_min is None for the specified surrogate model.
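
The statistics lookup and the `y_min` check can be sketched without loading any model. `lookup_stats`, the key format `surrogate-<search_space_id>-<dataset_id>`, and the identifiers used below are assumptions made for illustration; the real naming scheme may differ.

```python
def lookup_stats(surrogates_stats, search_space_id, dataset_id):
    """Return (y_min, y_max) for one surrogate (hypothetical helper)."""
    # Assumed key format; the real naming scheme may differ.
    name = f"surrogate-{search_space_id}-{dataset_id}"
    y_min = surrogates_stats[name]["y_min"]
    y_max = surrogates_stats[name]["y_max"]
    assert y_min is not None, f"no y_min recorded for {name}"
    return y_min, y_max


# Placeholder statistics, not taken from summary-stats.json.
stats = {
    "surrogate-spaceA-data1": {"y_min": 0.1, "y_max": 0.9},
    "surrogate-spaceA-data2": {"y_min": None, "y_max": 0.9},
}

print(lookup_stats(stats, "spaceA", "data1"))  # (0.1, 0.9)
```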