src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset¶
Problem Difficulty Classification¶
The dataset is deterministically split based on a fixed random seed (1035).
| Difficulty Mode | Training Set Ratio | Testing Set Ratio |
|---|---|---|
| easy | 75% | 25% |
| difficult | 25% | 75% |
Note: The split is applied to each protein category (‘rigid’, ‘medium’, ‘difficult’) separately. When difficulty is ‘all’, both sets contain all 280 problems.
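The per-category split described above can be illustrated with a small sketch. It is not the library's implementation: the helper name `split_by_category`, the placeholder IDs, the use of Python's `random.Random`, and the rounding rule are all assumptions made for illustration. Only the ratios, the seed value 1035, and the 'all' behaviour come from this page.

```python
import random

# Training ratios per difficulty mode, taken from the table above.
TRAIN_RATIO = {"easy": 0.75, "difficult": 0.25}


def split_by_category(categories, difficulty="easy", seed=1035):
    """Split each category's IDs into train/test lists (illustrative only)."""
    if difficulty == "all":
        # Both sets contain every problem.
        everything = [pid for ids in categories.values() for pid in ids]
        return list(everything), list(everything)
    rng = random.Random(seed)  # fixed seed -> same split on every run
    train, test = [], []
    for name in sorted(categories):  # fixed category order for determinism
        ids = list(categories[name])
        rng.shuffle(ids)
        n_train = round(len(ids) * TRAIN_RATIO[difficulty])  # assumed rounding
        train += ids[:n_train]
        test += ids[n_train:]
    return train, test


# Hypothetical placeholder IDs, not the real protein complexes.
demo = {
    "rigid": [f"R{i}" for i in range(8)],
    "medium": [f"M{i}" for i in range(4)],
    "difficult": [f"D{i}" for i in range(4)],
}
train_ids, test_ids = split_by_category(demo, "easy")
```

Because the ratio is applied inside each category, every category is represented in both sets whenever it has enough members, which a single global split would not guarantee.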
Module Contents¶
Classes¶
Protein-Docking benchmark, where the objective is to minimize the Gibbs free energy resulting from protein-protein interaction between a given complex and any other conformation. We select 28 protein complexes and randomly initialize 10 starting points for each complex, resulting in 280 problem instances. To simplify the problem structure, we only optimize 12 interaction points in a complex instance (12D problem).
API¶
- class src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset.Protein_Docking_Dataset(data, batch_size=1)[source]¶
Bases: `torch.utils.data.Dataset`
Introduction¶
Protein-Docking benchmark, where the objective is to minimize the Gibbs free energy resulting from protein-protein interaction between a given complex and any other conformation. We select 28 protein complexes and randomly initialize 10 starting points for each complex, resulting in 280 problem instances. To simplify the problem structure, we only optimize 12 interaction points in a complex instance (12D problem).
Original paper¶
“Protein–protein docking benchmark version 4.0.” Proteins: Structure, Function, and Bioinformatics 78.15 (2010): 3111-3114.
Official Implementation¶
License¶
None
Initialization
Initializes the protein docking dataset object with provided data and batch size.
Args:¶
data (list): A list of data items, each expected to have a `dim` attribute.
batch_size (int, optional): The number of samples per batch. Defaults to 1.
Built-in Attributes:¶
data (list): Stores the input data.
batch_size (int): Stores the batch size.
N (int): The total number of data items.
ptr (list): List of starting indices for each batch.
index (np.ndarray): Array of indices for the data items. Defaults to a range from 0 to N.
maxdim (int): The maximum `dim` value among all data items. Defaults to 0.
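As a rough illustration of how the attributes listed above relate to each other, the following sketch derives them from a list of items. The helper `describe_dataset` is hypothetical, and the exact expressions (for example, `ptr` as every `batch_size`-th index and `index` as `np.arange(N)`) are assumptions consistent with the descriptions, not a copy of the class's constructor.

```python
from types import SimpleNamespace

import numpy as np


def describe_dataset(data, batch_size=1):
    """Compute the documented attributes for a list of items (sketch)."""
    n = len(data)
    return {
        "N": n,                                # total number of items
        "ptr": list(range(0, n, batch_size)),  # start index of each batch
        "index": np.arange(n),                 # identity order before shuffling
        "maxdim": max((item.dim for item in data), default=0),
    }


# Five placeholder items, each exposing the expected `dim` attribute.
items = [SimpleNamespace(dim=12) for _ in range(5)]
info = describe_dataset(items, batch_size=2)
```

With five items and a batch size of 2, `ptr` holds three starting positions, so the last batch is shorter than the others.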
- static get_datasets(version, train_batch_size=1, test_batch_size=1, user_train_list=None, user_test_list=None, difficulty='easy', dataset_seed=1035)[source]¶
Introduction¶
Generates training and testing datasets for the protein docking problem, partitioning protein instances based on the specified difficulty level or user-provided lists. Supports both NumPy and PyTorch problem representations.
Args:¶
version (str): The backend to use for problem instances. Must be either ‘numpy’ or ‘torch’.
train_batch_size (int, optional): Batch size for the training dataset. Defaults to 1.
test_batch_size (int, optional): Batch size for the testing dataset. Defaults to 1.
user_train_list (list, optional): List of protein IDs to include in the training set. If None, uses automatic partitioning. Defaults to None.
user_test_list (list, optional): List of protein IDs to include in the testing set. If None, uses automatic partitioning. Defaults to None.
difficulty (str, optional): Difficulty level for dataset partitioning. Can be ‘easy’, ‘difficult’, or ‘all’. Defaults to ‘easy’.
dataset_seed (int, optional): Random seed for reproducible dataset partitioning. Defaults to 1035.
Returns:¶
Protein_Docking_Dataset: The training dataset.
Protein_Docking_Dataset: The testing dataset.
Raises:¶
ValueError: If the specified `version` is not supported.
- __getitem__(item)[source]¶
Introduction¶
Retrieves a batch of data samples corresponding to the given index.
Args:¶
item (int): The batch index to retrieve data for.
Returns:¶
list: A list containing the data samples for the specified batch.
Raises:¶
IndexError: If `item` is out of range of available batches.
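Batch retrieval of this kind can be sketched as a lookup of the batch's starting position followed by a slice of the index array. The function `get_batch` and its arguments are hypothetical names, and the sketch assumes batches are cut from `index` in steps of `batch_size`; the real method may differ in detail.

```python
import numpy as np


def get_batch(data, index, ptr, batch_size, item):
    """Return the samples of batch number `item` (illustrative only)."""
    start = ptr[item]  # raises IndexError when item exceeds the batch count
    chosen = index[start:start + batch_size]  # slicing clips the final batch
    return [data[i] for i in chosen]


# Placeholder samples standing in for problem instances.
data = ["p0", "p1", "p2", "p3", "p4"]
index = np.arange(len(data))
ptr = list(range(0, len(data), 2))  # batch starts for batch_size=2

first = get_batch(data, index, ptr, 2, 0)
last = get_batch(data, index, ptr, 2, 2)
```

Reading samples through `index` rather than directly from `data` means a shuffled index changes batch contents without moving the underlying items.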
- __len__()[source]¶
Introduction¶
Returns the number of items in the dataset.
Returns:¶
int: The total number of items in the dataset.
- __add__(other: src.environment.problem.SOO.PROTEIN_DOCKING.protein_docking_dataset.Protein_Docking_Dataset)[source]¶
Introduction¶
Implements the addition operator for the `Protein_Docking_Dataset` class, allowing two datasets to be combined into a new dataset.
Args:¶
other (Protein_Docking_Dataset): Another instance of `Protein_Docking_Dataset` to be added.
Returns:¶
Protein_Docking_Dataset: A new dataset instance containing the combined data from both datasets, using the current instance’s batch size.
Raises:¶
AttributeError: If `other` does not have a `data` attribute.
TypeError: If `other` is not an instance of `Protein_Docking_Dataset`.
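The combining behaviour can be pictured with a tiny stand-in class. `TinyDataset` is hypothetical and only mirrors what is documented here: the data lists are concatenated and the left operand's batch size is kept. The explicit type check is an assumption about how the `TypeError` might arise.

```python
class TinyDataset:
    """Hypothetical stand-in, not the real Protein_Docking_Dataset."""

    def __init__(self, data, batch_size=1):
        self.data = list(data)
        self.batch_size = batch_size

    def __add__(self, other):
        if not isinstance(other, TinyDataset):
            raise TypeError("can only add another TinyDataset")
        # New object: concatenated data, batch size of the left operand.
        return TinyDataset(self.data + other.data, self.batch_size)


left = TinyDataset(["a", "b"], batch_size=2)
right = TinyDataset(["c"], batch_size=4)
merged = left + right
```

Returning a new object leaves both operands untouched, so `left` and `right` can still be used separately afterwards.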
- shuffle()[source]¶
Introduction¶
Randomly shuffles the indices of the dataset, updating the internal index order.
Built-in Attribute:¶
self.N (int): The total number of items in the dataset.
self.index (np.ndarray): The array storing the current order of indices.
Returns:¶
None
Notes:¶
This method uses `np.random.permutation` to generate a new random ordering of indices for the dataset.
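A minimal sketch of this shuffling step, using plain variables in place of the dataset's attributes. The seed call is there only to make the demonstration repeatable and is not something this page says the method does.

```python
import numpy as np

np.random.seed(0)  # seeding only so this demo is repeatable

N = 6
index = np.arange(N)              # original order: 0, 1, ..., N-1
index = np.random.permutation(N)  # new random order of the same indices
```

Only the index array is replaced; the stored data items keep their positions, and every index from 0 to N-1 still appears exactly once.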