src.rl.ppo¶
Module Contents¶
Classes¶
Memory: A class to store and manage the memory required for reinforcement learning algorithms. It keeps track of actions, states, log probabilities, and rewards during an episode and provides functionality to clear the stored memory.
PPO_Agent: A Proximal Policy Optimization (PPO) agent for reinforcement learning.
API¶
- class src.rl.ppo.Memory[source]¶
Introduction¶
A class to store and manage the memory required for reinforcement learning algorithms. It keeps track of actions, states, log probabilities, and rewards during an episode and provides functionality to clear the stored memory.
Methods:¶
__init__(): Initializes the memory by creating empty lists for actions, states, log probabilities, and rewards.
clear_memory(): Clears the stored memory by deleting the lists of actions, states, log probabilities, and rewards.
Initialization
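The buffer described above can be sketched in a few lines. This is a minimal illustration, not the library's source: the class name `EpisodeMemory` and the attribute names are assumptions chosen to match the four quantities listed in the introduction.

```python
class EpisodeMemory:
    """Hypothetical stand-in for src.rl.ppo.Memory (names are assumptions)."""

    def __init__(self):
        # One list per quantity tracked during an episode
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []

    def clear_memory(self):
        # Delete the stored items in place so the same list objects are reused
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]


memory = EpisodeMemory()
memory.actions.append(1)
memory.rewards.append(0.5)
memory.clear_memory()  # all four lists are empty again
```

Clearing in place rather than rebinding the attributes keeps any external references to the lists valid, which is one plausible reason for a "delete the lists" design.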
- class src.rl.ppo.PPO_Agent(config, networks: dict, learning_rates: float)[source]¶
Bases: src.rl.basic_agent.Basic_Agent
Introduction¶
The PPO_Agent class implements a Proximal Policy Optimization (PPO) agent for reinforcement learning. This agent uses an actor-critic architecture, generalized advantage estimation, and clipping techniques to optimize policies in a stable and efficient manner. It supports parallelized environments, logging to TensorBoard, and saving/loading checkpoints for training continuation.
Original paper¶
Args¶
config: Configuration object containing all necessary parameters for the experiment. For details, see config.py.
networks (dict): A dictionary of neural networks used by the agent, with keys as network names (e.g., ‘actor’, ‘critic’) and values as the corresponding network instances.
learning_rates (float): Learning rate for the optimizer.
Attributes¶
gamma (float): Discount factor for future rewards.
n_step (int): Number of steps for n-step returns.
K_epochs (int): Number of epochs for PPO updates.
eps_clip (float): Clipping parameter for PPO objective.
max_grad_norm (float): Maximum gradient norm for gradient clipping.
device (str): Device to run the computations on (e.g., ‘cpu’ or ‘cuda’).
network (list): List of network names initialized in the agent.
optimizer (torch.optim.Optimizer): Optimizer for training the networks.
learning_time (int): Counter for the total number of training steps.
cur_checkpoint (int): Counter for the current checkpoint index.
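To make the roles of gamma and n_step concrete, here is a hedged sketch of a bootstrapped n-step return. The function name `n_step_returns` and the `bootstrap_value` argument (a critic estimate for the state after the last collected step) are assumptions for illustration. The agent's actual return computation is not shown in this reference.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns over a window of n collected rewards.

    rewards: rewards r_t, ..., r_{t+n-1} from one rollout segment
    bootstrap_value: assumed critic estimate V(s_{t+n}) for the state after the segment
    gamma: discount factor for future rewards
    """
    returns = []
    running = bootstrap_value
    # Work backwards so each step reuses the return of its successor
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(n_step_returns([1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5))
# prints [1.75, 1.5, 1.0]
```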
Methods¶
set_network(networks, learning_rates): Initializes the actor and critic networks, sets up the optimizer, and moves networks to the specified device.
get_step(): Returns the current training step count.
update_setting(config): Updates the agent’s configuration and resets training-related attributes.
train_episode(envs, seeds, para_mode, compute_resource, tb_logger, required_info): Trains the agent for one episode using the PPO algorithm.
rollout_episode(env, seed, required_info): Executes a single rollout in the environment and collects results.
log_to_tb_train(tb_logger, mini_step, grad_norms, reinforce_loss, baseline_loss, Return, Reward, memory_reward, critic_output, logprobs, entropy, approx_kl_divergence, extra_info): Logs training metrics to TensorBoard.
Initialization
Initializes the PPO agent with the given configuration, networks, and learning rates, then stores the initial agent in the checkpoint directory.
Args:¶
config: Configuration object containing all necessary parameters for the experiment.
networks (dict): A dictionary of neural networks used by the agent.
learning_rates (float): Learning rate for the optimizer.
- set_network(networks: dict, learning_rates: float)[source]¶
Initializes the actor and critic networks, sets up the optimizer, and moves networks to the specified device.
Args:¶
networks (dict): A dictionary of neural networks used by the agent.
learning_rates (float): Learning rate for the optimizer.
Raises:¶
ValueError: If the length of the learning rates list does not match the number of networks.
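The signature annotates learning_rates as a float, while the Raises entry mentions a list of learning rates. The sketch below shows one way both could be handled, assuming a single float is broadcast to every network and a list must match the networks one-to-one. The helper name `pair_learning_rates` is hypothetical and is not part of the documented API.

```python
def pair_learning_rates(networks, learning_rates):
    """Map each network name to a learning rate (illustrative helper)."""
    # Assumption: a single number applies to every network
    if isinstance(learning_rates, (int, float)):
        learning_rates = [learning_rates] * len(networks)
    if len(learning_rates) != len(networks):
        raise ValueError(
            f"Expected {len(networks)} learning rates, got {len(learning_rates)}"
        )
    # dicts preserve insertion order, so rates pair with networks positionally
    return dict(zip(networks.keys(), learning_rates))


# Placeholder objects stand in for real network instances
nets = {"actor": object(), "critic": object()}
print(pair_learning_rates(nets, [3e-4, 1e-3]))
# prints {'actor': 0.0003, 'critic': 0.001}
```

In a PyTorch setting, a mapping like this would typically feed per-parameter-group options of the optimizer, though the reference does not state how set_network builds its optimizer.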
- get_step()[source]¶
Returns the current training step count.
Returns:¶
int: The current training step count.
- update_setting(config)[source]¶
Updates the agent’s configuration and resets training-related attributes.
Args:¶
config: Configuration object containing updated parameters.
- train_episode(envs, seeds: Optional[Union[int, List[int], numpy.ndarray]], para_mode: Literal['dummy', 'subproc', 'ray', 'ray-subproc'] = 'dummy', compute_resource={}, tb_logger=None, required_info={})[source]¶
Trains the agent for one episode using the PPO algorithm.
Args:¶
envs: List of environments for training.
seeds: Seeds for reproducibility.
para_mode (str): Parallelization mode for the environments.
compute_resource (dict): Resources for computation (e.g., CPUs, GPUs).
tb_logger: TensorBoard logger for logging training metrics.
required_info (dict): Additional information required from the environment.
Returns:¶
tuple: A boolean indicating whether training has ended and a dictionary with training metrics.
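The core of a PPO update is the clipped surrogate objective, controlled by eps_clip. The following is a dependency-free sketch of that objective as defined in the PPO paper, not the body of train_episode. The function name and the plain-list inputs are assumptions; a real implementation would operate on tensors.

```python
import math


def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, eps_clip=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log probabilities
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
        # Take the more pessimistic of the unclipped and clipped objectives
        terms.append(min(ratio * adv, clipped_ratio * adv))
    # Negate because optimizers minimize, while the objective is maximized
    return -sum(terms) / len(terms)


# Identical policies give ratio 1, so the loss is minus the mean advantage
print(clipped_surrogate_loss([0.0, 0.0], [0.0, 0.0], [2.0, 4.0]))
# prints -3.0
```

The clipping removes any incentive to move the ratio outside [1 - eps_clip, 1 + eps_clip], which is what keeps each of the K_epochs updates close to the policy that collected the data.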
- rollout_episode(env, seed=None, required_info={})[source]¶
Executes a single rollout in the environment and collects results.
Args:¶
env: The environment for the rollout.
seed (int, optional): Seed for reproducibility.
required_info (dict): Additional information required from the environment.
Returns:¶
dict: A dictionary containing results of the rollout episode, including return and environment-specific metrics.
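A rollout of this kind can be sketched with a toy environment. Everything here is an assumption made for illustration: the `CountdownEnv` class, the `reset`/`step`/`seed` interface, the `policy` callable, and the way required_info keys are read from the environment's info dictionary. The real environment interface is not described in this reference.

```python
class CountdownEnv:
    """Toy environment (hypothetical): reward 1 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0
        self.last_seed = None

    def seed(self, seed):
        # A real environment would seed its random generator here
        self.last_seed = seed

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {"steps": self.t}


def rollout(env, policy, seed=None, required_info=None):
    """Run one episode without learning and collect summary results."""
    required_info = required_info or {}
    if seed is not None:
        env.seed(seed)
    state = env.reset()
    total_return, done, info = 0.0, False, {}
    while not done:
        state, reward, done, info = env.step(policy(state))
        total_return += reward
    results = {"return": total_return}
    # Copy the requested environment-specific metrics from the final info dict
    for key in required_info:
        results[key] = info.get(key)
    return results


print(rollout(CountdownEnv(horizon=3), policy=lambda s: 0, seed=1,
              required_info={"steps": None}))
# prints {'return': 3.0, 'steps': 3}
```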
- log_to_tb_train(tb_logger, mini_step, grad_norms, reinforce_loss, baseline_loss, Return, Reward, memory_reward, critic_output, logprobs, entropy, approx_kl_divergence, extra_info={})[source]¶
Logs training metrics to TensorBoard.
Args:¶
tb_logger: TensorBoard logger for logging training metrics.
mini_step (int): Current mini-batch step.
grad_norms (tuple): Gradient norms for the networks.
reinforce_loss (torch.Tensor): Actor loss.
baseline_loss (torch.Tensor): Critic loss.
Return (torch.Tensor): Episode return.
Reward (torch.Tensor): Target reward.
memory_reward (torch.Tensor): Memory reward.
critic_output (torch.Tensor): Critic output.
logprobs (torch.Tensor): Log probabilities.
entropy (torch.Tensor): Entropy of the policy.
approx_kl_divergence (torch.Tensor): Approximate KL divergence.
extra_info (dict): Additional information to log.
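Of the logged quantities, approx_kl_divergence is the least self-explanatory. One common estimator is the mean difference between old and new log probabilities of the sampled actions. The sketch below shows that estimator under the assumption that it matches what the agent logs, which this reference does not confirm. The function name is hypothetical.

```python
def approx_kl(old_logprobs, new_logprobs):
    """Sample-based estimate of KL(pi_old || pi_new) from action log probabilities."""
    diffs = [old - new for old, new in zip(old_logprobs, new_logprobs)]
    return sum(diffs) / len(diffs)


# The new policy assigns lower log probability to the sampled actions
print(approx_kl([-1.0, -2.0], [-1.5, -2.5]))
# prints 0.5
```

A value that grows quickly across the PPO epochs of one update suggests the policy is drifting far from the one that collected the data, which is why it is useful to track alongside the losses.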