src.rl.reinforce¶
Module Contents¶
Classes¶
Memory | A class to store and manage the memory required for reinforcement learning algorithms. It keeps track of log probabilities and rewards during an episode and provides functionality to clear the stored memory. |
REINFORCE_Agent | The REINFORCE_Agent class implements a REINFORCE algorithm-based agent for reinforcement learning. |
API¶
- class src.rl.reinforce.Memory[source]¶
Introduction¶
A class to store and manage the memory required for reinforcement learning algorithms. It keeps track of log probabilities and rewards during an episode and provides functionality to clear the stored memory.
Methods:¶
__init__(): Initializes the memory by creating empty lists for log probabilities and rewards.
clear_memory(): Clears the stored memory by deleting the lists of log probabilities and rewards.
Initialization
Initializes the memory by creating empty lists for log probabilities and rewards.
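The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the attribute names `logprobs` and `rewards` are assumptions, and the sketch empties the lists in place, which may differ from how the real class deletes them.

```python
class Memory:
    """Illustrative per-episode buffer (attribute names are assumptions)."""

    def __init__(self):
        # One entry is appended per environment step during an episode.
        self.logprobs = []
        self.rewards = []

    def clear_memory(self):
        # Empty both lists in place so the buffer can be reused
        # for the next episode.
        del self.logprobs[:]
        del self.rewards[:]


memory = Memory()
memory.logprobs.append(-0.69)  # log-probability of the sampled action
memory.rewards.append(1.0)     # reward received for that action
memory.clear_memory()          # both lists are empty again
```

Clearing after every update matters for REINFORCE because the gradient estimate should only use samples collected under the current policy.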
- class src.rl.reinforce.REINFORCE_Agent(config, networks: dict, learning_rates: float)[source]¶
Bases: src.rl.basic_agent.Basic_Agent
Introduction¶
The REINFORCE_Agent class implements a REINFORCE algorithm-based agent for reinforcement learning. This agent uses policy gradient methods to optimize the policy directly by maximizing the expected cumulative reward. It supports parallelized environments, logging to TensorBoard, and saving/loading checkpoints for training continuation.
Original paper¶
Williams, R. J. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning, 1992.
Args¶
config: Configuration object containing all necessary parameters for the experiment. For details, see config.py.
networks (dict): A dictionary of neural networks used by the agent, with keys as network names (e.g., ‘actor’, ‘critic’) and values as the corresponding network instances.
learning_rates (float): Learning rate for the optimizer.
Attributes¶
gamma (float): Discount factor for future rewards.
max_grad_norm (float): Maximum gradient norm for gradient clipping.
device (str): Device to run the computations on (e.g., ‘cpu’ or ‘cuda’).
network (list): List of network names used by the agent.
optimizer (torch.optim.Optimizer): Optimizer for training the networks.
learning_time (int): Counter for the number of training steps completed.
cur_checkpoint (int): Counter for the current checkpoint index.
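To make the role of gamma concrete, here is a small sketch of how discounted returns and the basic REINFORCE loss are commonly computed. It uses plain Python floats instead of tensors, and the function names `discounted_returns` and `reinforce_loss` are hypothetical. Whether this agent additionally normalizes returns or subtracts a baseline is not stated in this page.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(logprobs, returns):
    """Negative sum of log-probabilities weighted by their returns."""
    # Minimizing this value performs gradient ascent on expected return.
    return -sum(lp * g for lp, g in zip(logprobs, returns))


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
loss = reinforce_loss([-1.0, -2.0, -0.5], returns)
```

With these inputs the returns are `[1.5, 1.0, 2.0]` and the loss is `4.5`. In a tensor-based implementation the same loss would be backpropagated, with the gradient norm capped at max_grad_norm before the optimizer step.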
Methods¶
set_network(networks, learning_rates): Configures the networks and optimizer for the agent.
update_setting(config): Updates the agent’s configuration and resets training-related attributes.
train_episode(envs, seeds, para_mode, compute_resource, tb_logger, required_info): Trains the agent for one episode using the REINFORCE algorithm.
rollout_episode(env, seed, required_info): Executes a single rollout episode in a given environment without training.
log_to_tb_train(tb_logger, mini_step, grad_norms, loss, Return, Reward, logprobs, extra_info): Logs training metrics and additional information to TensorBoard.
Initialization
Initializes the REINFORCE agent with the given configuration, networks, and learning rates. Stores the initial agent in the checkpoint directory.
Args:¶
config: Configuration object containing all necessary parameters for the experiment.
networks (dict): A dictionary of neural networks used by the agent.
learning_rates (float): Learning rate for the optimizer.
- set_network(networks: dict, learning_rates: float)[source]¶
Configures the networks and optimizer for the agent.
Args:¶
networks (dict): A dictionary of neural networks used by the agent.
learning_rates (float): Learning rate for the optimizer.
Raises:¶
ValueError: If the length of the learning rates list does not match the number of networks.
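The signature annotates learning_rates as a float, while the Raises entry refers to a list of learning rates. One way to reconcile the two is sketched below, assuming a single float is broadcast to every network. The helper `pair_learning_rates` is hypothetical and is not part of this module.

```python
def pair_learning_rates(networks, learning_rates):
    """Map each network name to a learning rate (illustrative helper)."""
    # Assumption: a single number applies to every network.
    if isinstance(learning_rates, (int, float)):
        learning_rates = [learning_rates] * len(networks)
    if len(learning_rates) != len(networks):
        raise ValueError(
            f"Expected {len(networks)} learning rates, "
            f"got {len(learning_rates)}."
        )
    # Iterating over a dict yields its keys, i.e. the network names.
    return dict(zip(networks, learning_rates))


networks = {"actor": object(), "critic": object()}  # placeholder networks
per_network = pair_learning_rates(networks, [1e-3, 1e-4])
shared = pair_learning_rates(networks, 3e-4)
```

A mapping like this could then be turned into per-network optimizer parameter groups.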
- update_setting(config)[source]¶
Updates the agent’s configuration and resets training-related attributes.
Args:¶
config: Configuration object containing updated parameters.
- train_episode(envs, seeds: Optional[Union[int, List[int], numpy.ndarray]], para_mode: Literal['dummy', 'subproc', 'ray', 'ray-subproc'] = 'dummy', compute_resource={}, tb_logger=None, required_info={})[source]¶
Trains the agent for one episode using the REINFORCE algorithm.
Args:¶
envs: List of environments for training.
seeds: Seeds for reproducibility.
para_mode (str): Parallelization mode for the environments.
compute_resource (dict): Resources for computation (e.g., CPUs, GPUs).
tb_logger: TensorBoard logger for logging training metrics.
required_info (dict): Additional information required from the environment.
Returns:¶
tuple: A boolean indicating whether training has ended and a dictionary with training metrics.
- rollout_episode(env, seed=None, required_info={})[source]¶
Executes a single rollout episode in a given environment without training.
Args:¶
env: The environment for the rollout.
seed (int, optional): Seed for reproducibility.
required_info (dict): Additional information required from the environment.
Returns:¶
dict: A dictionary containing rollout results such as return, cost, and metadata.
- log_to_tb_train(tb_logger, mini_step, grad_norms, loss, Return, Reward, logprobs, extra_info={})[source]¶
Logs training metrics and additional information to TensorBoard.
Args:¶
tb_logger: TensorBoard logger for logging training metrics.
mini_step (int): Current mini-batch step.
grad_norms (tuple): Gradient norms for the networks.
loss (torch.Tensor): Training loss.
Return (torch.Tensor): Episode return.
Reward (torch.Tensor): Target reward.
logprobs (torch.Tensor): Log probabilities.
extra_info (dict): Additional information to log.