src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset

Problem Difficulty Classification

The dataset is split deterministically using a fixed random seed (1035 by default, configurable through dataset_seed).

Difficulty Mode    Training Set Ratio    Testing Set Ratio
easy               75%                   25%
difficult          25%                   75%

Note: The split is applied to each protein category (‘rigid’, ‘medium’, ‘difficult’) separately. When difficulty is ‘all’, both sets contain all 280 problems.
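
The per-category split described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper `split_by_category`, the placeholder protein IDs, the use of Python's `random` module, and rounding the training count down are all assumptions.

```python
import random


def split_by_category(categories, train_ratio, seed=1035):
    """Split each category separately with one seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_ids, test_ids = [], []
    for name in sorted(categories):  # fixed iteration order for determinism
        ids = list(categories[name])
        rng.shuffle(ids)
        n_train = int(len(ids) * train_ratio)  # assumption: round down
        train_ids.extend(ids[:n_train])
        test_ids.extend(ids[n_train:])
    return train_ids, test_ids


# Placeholder IDs; the real dataset uses benchmark complex identifiers.
categories = {
    "rigid": ["r0", "r1", "r2", "r3"],
    "medium": ["m0", "m1", "m2", "m3"],
    "difficult": ["d0", "d1", "d2", "d3"],
}

easy_train, easy_test = split_by_category(categories, train_ratio=0.75)
hard_train, hard_test = split_by_category(categories, train_ratio=0.25)
print(len(easy_train), len(easy_test))  # 9 3
print(len(hard_train), len(hard_test))  # 3 9
```

Because the ratio is applied inside each category, both sets keep the same mix of 'rigid', 'medium', and 'difficult' complexes regardless of the difficulty mode.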

Module Contents

Classes

Protein_Docking_Dataset

Introduction

A protein-docking benchmark in which the objective is to minimize the Gibbs free energy of the protein-protein interaction between a given complex and any other conformation. The benchmark selects 28 protein complexes and randomly initializes 10 starting points for each complex, giving 280 problem instances. To simplify the problem structure, only 12 interaction points are optimized per complex instance, so each problem is 12-dimensional.
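
The instance count follows from pairing every complex with every starting point. The sketch below only illustrates that arithmetic; the `complex_XX` names are placeholders, not the identifiers used by the dataset.

```python
from itertools import product

N_PROTEINS = 28      # complexes selected from the benchmark
N_START_POINTS = 10  # random starting points per complex
DIM = 12             # interaction points optimized per instance

# Placeholder names (assumption); the real dataset uses benchmark complex IDs.
proteins = [f"complex_{i:02d}" for i in range(N_PROTEINS)]

# One problem instance per (complex, starting point) pair.
instances = list(product(proteins, range(N_START_POINTS)))
print(len(instances))  # 280
```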

API

class src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset.Protein_Docking_Dataset(data, batch_size=1)[source]

Bases: torch.utils.data.Dataset

Original paper

“Protein–protein docking benchmark version 4.0.” Proteins: Structure, Function, and Bioinformatics 78.15 (2010): 3111-3114.

Official Implementation

Protein-Docking

License

None

Initialization

Initializes the protein docking dataset object with provided data and batch size.

Args:

  • data (list): A list of data items, each expected to have a dim attribute.

  • batch_size (int, optional): The number of samples per batch. Defaults to 1.

Built-in Attributes:

  • data (list): Stores the input data.

  • batch_size (int): Stores the batch size.

  • N (int): The total number of data items.

  • ptr (list): List of starting indices for each batch.

  • index (np.ndarray): Array of indices for the data items. Defaults to the range 0 to N - 1.

  • maxdim (int): The maximum dim value among all data items. Defaults to 0.
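
How these attributes relate to one another can be sketched as below. This is an illustration under stated assumptions rather than the class's actual constructor: `batch_bookkeeping` is a hypothetical helper, `SimpleNamespace` objects stand in for real problem instances, and a plain list replaces the `np.ndarray` used for `index`.

```python
from types import SimpleNamespace


def batch_bookkeeping(data, batch_size=1):
    """Derive N, ptr, index and maxdim from a list of items (hypothetical helper)."""
    n = len(data)
    ptr = list(range(0, n, batch_size))  # starting position of each batch
    index = list(range(n))               # identity order until shuffled
    maxdim = 0
    for item in data:
        maxdim = max(maxdim, item.dim)   # each item must expose a dim attribute
    return n, ptr, index, maxdim


# Stand-in items: five 12-dimensional problems.
data = [SimpleNamespace(dim=12) for _ in range(5)]
n, ptr, index, maxdim = batch_bookkeeping(data, batch_size=2)
print(n, ptr, index, maxdim)  # 5 [0, 2, 4] [0, 1, 2, 3, 4] 12
```

With five items and a batch size of 2, the last batch starts at position 4 and holds a single item.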

proteins_set[source]

None

n_start_points[source]

10

static get_datasets(version, train_batch_size=1, test_batch_size=1, user_train_list=None, user_test_list=None, difficulty='easy', dataset_seed=1035)[source]

Introduction

Generates training and testing datasets for the protein docking problem, partitioning protein instances based on the specified difficulty level or user-provided lists. Supports both NumPy and PyTorch problem representations.

Args:

  • version (str): The backend to use for problem instances. Must be either ‘numpy’ or ‘torch’.

  • train_batch_size (int, optional): Batch size for the training dataset. Defaults to 1.

  • test_batch_size (int, optional): Batch size for the testing dataset. Defaults to 1.

  • user_train_list (list, optional): List of protein IDs to include in the training set. If None, uses automatic partitioning. Defaults to None.

  • user_test_list (list, optional): List of protein IDs to include in the testing set. If None, uses automatic partitioning. Defaults to None.

  • difficulty (str, optional): Difficulty level for dataset partitioning. Can be ‘easy’, ‘difficult’, or ‘all’. Defaults to ‘easy’.

  • dataset_seed (int, optional): Random seed for reproducible dataset partitioning. Defaults to 1035.

Returns:

  • Protein_Docking_Dataset: The training dataset.

  • Protein_Docking_Dataset: The testing dataset.

Raises:

  • ValueError: If the specified version is not supported.

__getitem__(item)[source]

Introduction

Retrieves a batch of data samples corresponding to the given index.

Args:

  • item (int): The batch index to retrieve data for.

Returns:

  • list: A list containing the data samples for the specified batch.

Raises:

  • IndexError: If item is out of range of available batches.
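
A batch lookup of this kind can be sketched as follows. The function `get_batch` and the string items are assumptions made for illustration; the real method operates on the instance's own `data`, `index`, and `ptr` attributes.

```python
def get_batch(data, index, ptr, batch_size, item):
    """Return the items of batch number `item` (hypothetical stand-in)."""
    start = ptr[item]  # raises IndexError when item exceeds the last batch
    stop = min(start + batch_size, len(data))  # the last batch may be shorter
    return [data[i] for i in index[start:stop]]


data = ["p0", "p1", "p2", "p3", "p4"]
batch_size = 2
index = list(range(len(data)))
ptr = list(range(0, len(data), batch_size))  # [0, 2, 4]

print(get_batch(data, index, ptr, batch_size, 0))  # ['p0', 'p1']
print(get_batch(data, index, ptr, batch_size, 2))  # ['p4']
```

Going through `index` rather than reading `data` directly is what lets a reordering of `index` change the composition of each batch without moving the underlying items.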

__len__()[source]

Introduction

Returns the number of items in the dataset.

Returns:

  • int: The total number of items in the dataset.

__add__(other: src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset.Protein_Docking_Dataset)[source]

Introduction

Implements the addition operator for the Protein_Docking_Dataset class, allowing two datasets to be combined into a new dataset.

Args:

  • other (Protein_Docking_Dataset): Another instance of Protein_Docking_Dataset to be added.

Returns:

  • Protein_Docking_Dataset: A new dataset instance containing the combined data from both datasets, using the current instance’s batch size.

Raises:

  • AttributeError: If other does not have a data attribute.

  • TypeError: If other is not an instance of Protein_Docking_Dataset.
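
The combining behaviour can be sketched with a small stand-in class. `MiniDataset` is hypothetical, and returning `NotImplemented` for foreign operands (which makes Python raise `TypeError`) is one idiomatic way to get the documented error, not necessarily how the class does it.

```python
class MiniDataset:
    """Hypothetical stand-in that mimics only the addition behaviour."""

    def __init__(self, data, batch_size=1):
        self.data = list(data)
        self.batch_size = batch_size

    def __add__(self, other):
        if not isinstance(other, MiniDataset):
            return NotImplemented  # Python turns this into a TypeError
        # New object: concatenated data, left operand's batch size.
        return MiniDataset(self.data + other.data, self.batch_size)


train = MiniDataset(["p0", "p1"], batch_size=2)
test = MiniDataset(["p2"], batch_size=1)
combined = train + test
print(combined.data, combined.batch_size)  # ['p0', 'p1', 'p2'] 2
```

Neither operand is modified, so `train` and `test` remain usable after they are combined.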

shuffle()[source]

Introduction

Randomly shuffles the indices of the dataset, updating the internal index order.

Built-in Attributes:

  • self.N (int): The total number of items in the dataset.

  • self.index (np.ndarray): The array storing the current order of indices.

Returns:

  • None

Notes:

This method uses np.random.permutation to generate a new random ordering of indices for the dataset.
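
The effect of shuffling on batch composition can be sketched as below. To keep the example dependency-free it uses `random.sample` from the standard library as a stand-in for `np.random.permutation`; `shuffled_index` and the seeded generator are assumptions for illustration.

```python
import random


def shuffled_index(n, rng):
    """Return a random ordering of 0..n-1 (stand-in for np.random.permutation)."""
    return rng.sample(range(n), n)


rng = random.Random(0)  # seeded here only so the example is repeatable
n, batch_size = 6, 3
ptr = list(range(0, n, batch_size))  # batch boundaries stay fixed: [0, 3]

index = shuffled_index(n, rng)       # only the index order changes
batches = [index[p:p + batch_size] for p in ptr]
print(batches)
```

Every index still appears exactly once, so one pass over the batches visits each item once, only in a different grouping.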