# Stable Baselines

Stable Baselines is a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.

You can read a detailed presentation of Stable Baselines in the Medium article.

These algorithms will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

Note: despite its simplicity of use, Stable Baselines (SB) assumes you have some knowledge about Reinforcement Learning (RL). You should not use this library without some RL practice. To that end, we provide good resources in the documentation to get started with RL.

## Main differences with OpenAI Baselines

This toolset is a fork of OpenAI Baselines, with a major structural refactoring and code cleanups:

• Unified structure for all algorithms
• PEP8 compliant (unified code style)
• Documented functions and classes
• More tests & more code coverage
• Additional algorithms: SAC and TD3 (+ HER support for DQN, DDPG, SAC and TD3)
| Features | Stable-Baselines | OpenAI Baselines |
| --- | --- | --- |
| State of the art RL methods | ✔️ (1) | ✔️ |
| Documentation | ✔️ | ❌ |
| Custom environments | ✔️ | ✔️ |
| Custom policies | ✔️ | ❌ (2) |
| Common interface | ✔️ | ❌ (3) |
| Tensorboard support | ✔️ | ❌ (4) |
| Ipython / Notebook friendly | ✔️ | ❌ |
| PEP8 code style | ✔️ | ✔️ (5) |
| Custom callback | ✔️ | ❌ (6) |

(1): Forked from the previous version of OpenAI Baselines, now with SAC and TD3 added
(2): Currently not available for DDPG, and only from the run script.
(3): Only via the run script.
(4): Rudimentary logging of training information (no loss nor graph).
(5): EDIT: you did it, OpenAI! 🐱
(6): Passing a callback function is only available for DQN.

## RL Baselines Zoo: A Collection of 100+ Trained RL Agents

RL Baselines Zoo is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Goals of this repository:

1. Provide a simple interface to train and enjoy RL agents
2. Benchmark the different Reinforcement Learning algorithms
3. Provide tuned hyperparameters for each environment and RL algorithm
4. Have fun with the trained agents!

Github repo: https://github.com/araffin/rl-baselines-zoo

## Installation

Note: Stable-Baselines supports TensorFlow versions from 1.8.0 to 1.14.0. Support for the TensorFlow 2 API is planned.

### Prerequisites

Baselines requires Python 3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. Those can be installed as follows:

#### Ubuntu

```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```

#### Mac OS X

Installation of system packages on Mac requires Homebrew. With Homebrew installed, run the following:

```bash
brew install cmake openmpi
```

#### Windows 10

To install stable-baselines on Windows, please look at the documentation.

### Install using pip

Install the Stable Baselines package:

```bash
pip install stable-baselines[mpi]
```


This includes an optional dependency on MPI, which enables the DDPG, GAIL, PPO1 and TRPO algorithms. If you do not need these algorithms, you can install without MPI:

```bash
pip install stable-baselines
```


Please read the documentation for more details and alternatives (from source, using docker).

## Example

Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.

Here is a quick example of how to train and run PPO2 on a cartpole environment:

```python
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env = gym.make('CartPole-v1')
# Optional: PPO2 requires a vectorized environment to run
# the env is now wrapped automatically when passing it to the constructor
# env = DummyVecEnv([lambda: env])

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

env.close()
```

Or just train a model with a one-liner, if the environment is registered in Gym and the policy is registered:

```python
from stable_baselines import PPO2

model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)
```

## Try it online with Colab Notebooks!

All the following examples can be executed online using Google Colab notebooks:

## Implemented Algorithms

| Name | Refactored (1) | Recurrent | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A2C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| ACER | ✔️ | ✔️ | ❌ (5) | ✔️ | ❌ | ❌ | ✔️ |
| ACKTR | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
| DDPG | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ✔️ (4) |
| DQN | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| GAIL (2) | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ (4) |
| HER (3) | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ❌ | ❌ |
| PPO1 | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ (4) |
| PPO2 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| SAC | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| TD3 | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| TRPO | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ (4) |

(1): Whether or not the algorithm has been refactored to fit the BaseRLModel class.
(2): Only implemented for TRPO.
(3): Re-implemented from scratch; now supports DQN, DDPG, SAC and TD3.
(4): Multiprocessing with MPI.
(5): TODO, in project scope.

NOTE: Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) were not part of the original baselines and HER was reimplemented from scratch.

Action spaces (gym.spaces):

• Box: An N-dimensional box that contains every point in the action space.
• Discrete: A list of possible actions, where only one action can be used at each timestep.
• MultiDiscrete: A list of possible actions, where only one action from each discrete set can be used at each timestep.
• MultiBinary: A list of possible actions, where any of the actions can be used in any combination at each timestep.
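These four space types map directly onto gym.spaces classes; a quick sketch (assuming a plain gym installation):

```python
import numpy as np
from gym import spaces

# Box: continuous actions, here 3 values each in [-1, 1]
box = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

# Discrete: exactly one of n actions per timestep
disc = spaces.Discrete(4)

# MultiDiscrete: one choice from each discrete set per timestep
multi_disc = spaces.MultiDiscrete([3, 2])

# MultiBinary: any combination of n binary actions per timestep
multi_bin = spaces.MultiBinary(4)

print(box.sample().shape, disc.sample(), multi_disc.sample(), multi_bin.sample())
```

The space an environment declares determines which algorithms from the table above can be used with it.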

## MuJoCo

Some of the Baselines examples use the MuJoCo (multi-joint dynamics with contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from www.mujoco.org). Instructions on setting up MuJoCo can be found here.

## Testing the installation

All unit tests in baselines can be run using the pytest runner:

```bash
pip install pytest pytest-cov
make pytest
```


## Projects Using Stable-Baselines

We try to maintain a list of projects using stable-baselines in the documentation; please tell us if you want your project to appear on this page ;)

## Citing the Project

To cite this repository in publications:

```
@misc{stable-baselines,
  author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Kanervisto, Anssi and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
  title = {Stable Baselines},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
```


## Maintainers

Stable-Baselines is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian Ernestus (aka @erniejunior), Adam Gleave (@AdamGleave) and Anssi Kanervisto (@Miffyli).

Important note: we do not provide technical support or consulting, and we do not answer personal questions by email.

## How To Contribute

For anyone interested in making the baselines better: there is still documentation that needs to be done. If you want to contribute, please read the CONTRIBUTING.md guide first.

## Acknowledgments

Stable Baselines was created in the robotics lab U2IS (INRIA Flowers team) at ENSTA ParisTech.

Logo credits: L.M. Tenkes

• #### Invalid Action Mask [WIP]

Right now, a number of tests don't pass, but this is per @araffin's request for a draft PR.

closes #351

• #### V3.0 implementation design

Version 3 is now online: https://github.com/DLR-RM/stable-baselines3

Hello,

Before starting the migration to TF2 for Stable Baselines v3, I would like to discuss some design points we should agree on.

### Which tf paradigm should we use?

I would go for a PyTorch-like "eager mode", wrapping the methods in a tf.function to improve performance (as it is done here). Define-by-run is usually easier to read and debug (and I can compare it to my internal PyTorch version), and wrapping it with a tf.function should preserve performance.

My idea would be:

1. Refactor the common folder (as done by @Miffyli in #540)
2. Implement one on-policy and one off-policy algorithm: I would go for PPO/TD3, and I can be in charge of that. This would allow us to discuss concrete implementation details.
3. Implement the rest, in order:
• SAC
• A2C
• DQN
• DDPG
• HER
• TRPO
4. Implement the recurrent versions?

I'm afraid the remaining ones (ACKTR, GAIL and ACER) are not the easiest ones to implement. And for GAIL, we can refer to https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.

### Are there other breaking changes we should make? Changes in the interface?

There are different things that I would like to change/add.

First, I would like to add evaluation to the training loop. That is to say, we allow the user to pass an eval_env on which the agent will be evaluated every eval_freq steps for n_eval_episodes episodes. This is a true measure of the agent's performance, compared to the training reward.
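Such an evaluation helper could be sketched as follows (illustrative only; the model/env interface and the name n_eval_episodes follow the proposal above, not an existing API):

```python
import numpy as np

def evaluate_policy(model, eval_env, n_eval_episodes=5):
    """Run the policy for n episodes and return mean/std of episode returns."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs, done, total = eval_env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs)
            obs, reward, done, _ = eval_env.step(action)
            total += reward
        episode_rewards.append(total)
    return np.mean(episode_rewards), np.std(episode_rewards)

# Tiny stand-ins for a model and env, just to show the call pattern:
class StubEnv:
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 3, {}

class StubModel:
    def predict(self, obs):
        return 0, None

mean_reward, std_reward = evaluate_policy(StubModel(), StubEnv(), n_eval_episodes=2)
print(mean_reward, std_reward)  # → 3.0 0.0
```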

I would like to manipulate only VecEnv inside the algorithms (and wrap a gym.Env automatically if necessary); this simplifies things, since we don't have to think about the type of the env. Currently, we are using an UnVecEnvWrapper, which makes things complicated for DQN, for instance.

Should we maintain MPI support? I would favor switching to VecEnv there too; this removes a dependency and unifies the rest (and would maybe allow an easy way to multiprocess SAC/DDPG or TD3, cf #324). This would mean removing PPO1 too.

The next thing I would like to make default is the Monitor wrapper. This allows retrieving statistics about the training, and would remove the need for a buggy version of total_episode_reward_logger for computing the reward (cf #143).

As discussed in another issue, I would like to unify the learning rate schedules too (which would not be too difficult).

I would also like to unify the parameter names (e.g. ent_coef vs ent_coeff).

Anyway, I plan to open a PR, and we can then discuss it there.

### Regarding the transition

As we will be switching to the Keras interface (at least for most of the layers), this will break previously saved models. I propose creating scripts that convert old models to the new SB version, rather than trying to be backward-compatible.

PS: I hope I did not forget any important points.

EDIT: the draft repo is here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2 (ppo and td3 included for now)

• #### Multithreading broken pipeline on custom Env

First of all, thank you for this wonderful project; I can't stress enough how badly baselines needed such a project.

Now, the multiprocessing tutorial created by stable-baselines (see) states that the following is to be used to generate multiple envs, as an example of course:

```python
def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param num_env: (int) the number of environments you wish to have in subprocesses
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init
```


However, for some obscure reason, Python never calls _init: even though it has no arguments, it is still a function, hence please replace it with 'return _init()'.

Secondly, even doing so results in an error when building SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)]), namely:

```
Traceback (most recent call last):
  File "", line 1, in <module>
    runfile('C:/Users/X/Desktop/thesis.py', wdir='C:/Users/X/Desktop')
  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/X/Desktop/thesis.py", line 133, in <module>
    env = SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)])
  File "D:\Programs\Anaconda3\lib\site-packages\stable_baselines\common\vec_env\subproc_vec_env.py", line 52, in __init__
    process.start()
  File "D:\Programs\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Programs\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Programs\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
```

Any ideas on how to fix this? I have implemented a simple Gym env; does it need to extend/implement SubprocVecEnv?

• #### [question] [feature request] support for Dict and Tuple spaces

I want to train using two images from different cameras and a 1D array of data from a sensor. I'm passing these inputs as my env state. Obviously I need a CNN that can take those inputs, concatenate them, and train on them. My question is how to pass such inputs to a custom CNN in policies.py. Also, I tried to pass two images, and apparently dummy_vec_env.py had trouble with that:

```
obs = env.reset()
  File "d:\resources\stable-baselines\stable_baselines\common\vec_env\dummy_vec_env.py", line 57, in reset
    self._save_obs(env_idx, obs)
  File "d:\resources\stable-baselines\stable_baselines\common\vec_env\dummy_vec_env.py", line 75, in _save_obs
    self.buf_obs[key][env_idx] = obs
ValueError: cannot copy sequence with size 2 to array axis with dimension 80
```

I appreciate any thoughts or examples.

• #### Policy base invalid action mask

Currently supported:

• Algorithms: PPO1, PPO2, A2C, ACER, ACKTR, TRPO
• Action spaces: Discrete, MultiDiscrete
• Policy networks: MlpPolicy, MlpLnLstmPolicy, MlpLstmPolicy
• Policy networks (theoretically supported, but not tested): CnnPolicy, CnnLnLstmPolicy, CnnLstmPolicy
• Vectorized environments: DummyVecEnv, SubprocVecEnv

How to use: Environment, Test

• #### ppo2 performance and gpu utilization

I am running a PPO2 model. I see high CPU utilization and low GPU utilization.

When running:

```python
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
```


I get:

```
Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2019-05-06 11:06:02.117760: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-05-06 11:06:02.341488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.92GiB
2019-05-06 11:06:02.348112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-06 11:06:02.838521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-06 11:06:02.842724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-05-06 11:06:02.845154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-05-06 11:06:02.848092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 4641 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8905916217148098349
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4866611609
locality {
bus_id: 1
}
incarnation: 7192145949653879362
physical_device_desc: "device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
]
```


I understand that TensorFlow is "seeing" my GPU. Why is utilization so low when training a Stable Baselines model?

```python
# multiprocess environment
n_cpu = 4
env = PortfolioEnv(total_steps=settings['total_steps'], window_length=settings['window_length'], allow_short=settings['allow_short'])
env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

if settings['policy'] == 'MlpPolicy':
    model = PPO2(MlpPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLstmPolicy':
    model = PPO2(MlpLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLnLstmPolicy':
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

model.learn(total_timesteps=settings['total_timesteps'])

model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])
model.save(model_name)
```

• #### [Feature Request] Invalid Action Mask

It would be very useful to be able to adjust the gradient based on a binary vector of which outputs should be considered when computing the gradient.

This would be insanely helpful when dealing with environments where the set of valid actions depends on the observation. A simple example of this is StarCraft: at the beginning of a game, not every action is valid.
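One common way to implement such a mask (a sketch, not part of Stable Baselines' API) is to add a large negative constant to the logits of invalid actions before the softmax, so they receive near-zero probability and, consequently, near-zero gradient:

```python
import numpy as np

def masked_softmax(logits, valid_mask):
    """Suppress invalid actions: valid_mask is 1 for valid, 0 for invalid."""
    masked_logits = np.where(valid_mask.astype(bool), logits, -1e8)
    exp = np.exp(masked_logits - masked_logits.max())  # stable softmax
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([1, 0, 1, 0])  # actions 1 and 3 are invalid here
probs = masked_softmax(logits, mask)
print(probs)  # probability mass only on actions 0 and 2
```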

• #### ValueError: could not broadcast input array from shape (2) into shape (7,3,5)

Describe the bug: I am trying to run stable_baselines algorithms such as PPO1 and DDPG and get this error: ValueError: could not broadcast input array from shape (2) into shape (7,1,5)

Code example

```python
# action will be the portfolio weights from 0 to 1 for each asset
self.action_space = gym.spaces.Box(-1, 1, shape=(len(instruments) + 1,), dtype=np.float32)  # include cash

# get the observation space from the data min and max
self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(len(instruments), window_length, history.shape[-1]), dtype=np.float32)
```


I tried using obs.reshape(-1), obs.flatten() and obs.ravel(); nothing works. I also tried CnnPolicy instead of MlpPolicy and got:

ValueError: Negative dimension size caused by subtracting 8 from 7 for 'model/c1/Conv2D' (op: 'Conv2D') with input shapes: [?,7,1,5], [8,8,5,32].

System info; describe the characteristics of your environment:

• Library was installed using:

```bash
git clone https://github.com/hill-a/stable-baselines.git
cd stable-baselines
pip install -e .
```

• GPU models and configuration: no GPU, CPU only
• Python 3.7.2
• tensorflow 1.12.0
• stable-baselines 2.4.1
• tensorflow==1.13.1 (CPU)

• #### Why does env.render() create multiple render screens? | LSTM policy predict with one env [question]

When I run the code example from the docs for CartPole multiprocessing, it renders one window with all envs playing the game. It also renders individual windows with the same envs playing the same games.

```python
import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR

def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param num_env: (int) the number of environments you wish to have in subprocesses
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```


System info; describe the characteristics of your environment:

• Vanilla install, followed the docs using pip
• gpus: 2-gtx-1080ti's
• Python version 3.6.5
• Tensorflow version 1.12.0
• ffmpeg 4.0

• #### [feature request] Implement goal-parameterized algorithms (HER)

I'd like to implement Hindsight Experience Replay (HER). It can be based on any goal-parameterized off-policy RL algorithm.

Goal-parameterized architectures: these require a variable for the current goal and one for the current outcome. By outcome, I mean anything that is required to compute the reward in the process of targeting the goal. E.g., if the RL task is to reach a 3D target (the goal) with a robotic hand, the position of the target is the goal and the position of the hand is the outcome; the reward is a function of the distance between the two. Goals and outcomes are usually subparts of the state space.

How Gym handles this: in Gym, there is a class called GoalEnv to deal with such environments.

• The variable observation_space is replaced by another class that contains the true observation space (observation_space.spaces['observation']), the goal space (observation_space.spaces['desired_goal']) and the outcome space (observation_space.spaces['achieved_goal']).
• The observation returned by env.step is now a dictionary: obs['observation'], obs['desired_goal'], obs['achieved_goal'].
• The environment defines a reward function (compute_reward) that takes the goal and the outcome as arguments and returns the reward.
• It also contains a sample_goal function that simply samples a goal uniformly from the goal space.
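To make the dict-observation convention concrete, here is a tiny hand-rolled sketch of a GoalEnv-style reach task in plain NumPy (illustrative only; a real environment would subclass gym.GoalEnv, and the 0.05 tolerance is an assumption):

```python
import numpy as np

def compute_reward(achieved_goal, desired_goal, info=None):
    # Sparse reward, as in the Fetch environments: 0 if close enough, else -1
    dist = np.linalg.norm(achieved_goal - desired_goal)
    return 0.0 if dist < 0.05 else -1.0

# A GoalEnv-style step() returns a dict instead of a flat observation:
hand_pos = np.array([0.1, 0.2, 0.3])     # the outcome (achieved goal)
target_pos = np.array([0.1, 0.2, 0.31])  # the goal
obs = {
    "observation": np.concatenate([hand_pos, np.zeros(3)]),  # e.g. position + velocity
    "achieved_goal": hand_pos,
    "desired_goal": target_pos,
}
reward = compute_reward(obs["achieved_goal"], obs["desired_goal"])
print(reward)  # → 0.0 (within the tolerance)
```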

Stable-Baselines does not consider this so far. The replay buffer and the BasePolicy, BaseRLModel and OffPolicyRLModel classes only consider observations, and are not made to include a notion of goal or outcome. Two solutions:

1. Adapt these default classes to optionally allow the representation of goals and outcomes.
2. Concatenate the goal and outcome to the observation everywhere, so that the previously mentioned classes don't see the difference. This requires keeping track of the indices of goals and outcomes in the full obs_goal_outcome vector. Ashley started to do this, from what I understood. However, he did not take into account the GoalEnv class of Gym, which I think we should use, as it's kind of neat and it's used for the robotic Fetch environments, which are kind of the only ones generally used so far.

I think the second is clearer, as it separates observations from goals and outcomes, but it would probably make the code less easy to follow and would require more changes than the first option. So let's go with the first, as Ashley started.

First thoughts on how it could be done.

1. We need (as Ashley started to do) a wrapper around the Gym environment. GoalEnvs are different from usual envs because they return a dict in place of the former observation vector. This wrapper would unpack the observation from GoalEnv.step into obs, goal and outcome, and return a concatenation of all of them. Ashley assumed the goal lives in the observation space, so that the concatenation was twice as long as the observation; this is generally not true, so we need to keep the sizes of the goal and outcome spaces as attributes. The wrapper would also keep the different spaces as attributes, along with the function to sample goals and the reward function.

2. A multi-goal replay buffer to implement HER replay. It takes observations from the buffer and decomposes them back into obs, goal and outcome before performing replay.

I think it does not require too much work beyond what Ashley started. It would take a few modifications to integrate Gym's GoalEnv, as it is the standard way to use multi-goal environments, and then correcting the assumption he made about the dimension of the goal.

If you're all OK with this, I will start in that direction and test on the Fetch environments. In the baselines, the reported performance is achieved with 19 processes in parallel; they basically average the updates of the 19 actors. I'll try first without parallelization.
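The goal relabelling that such a multi-goal buffer would perform can be sketched independently of the library (illustrative only; this assumes HER's "final" strategy and the sparse reward described above):

```python
import numpy as np

def her_relabel_final(episode):
    """Relabel each transition with the episode's final achieved outcome as goal.

    episode: list of dicts with keys achieved_goal, desired_goal, reward.
    Returns the extra, relabelled transitions (the 'final' strategy of HER).
    """
    final_outcome = episode[-1]["achieved_goal"]
    relabelled = []
    for t in episode:
        # recompute the sparse reward against the substituted goal
        reward = 0.0 if np.linalg.norm(t["achieved_goal"] - final_outcome) < 0.05 else -1.0
        relabelled.append({**t, "desired_goal": final_outcome, "reward": reward})
    return relabelled

episode = [
    {"achieved_goal": np.array([0.0, 0.0]), "desired_goal": np.array([5.0, 5.0]), "reward": -1.0},
    {"achieved_goal": np.array([1.0, 1.0]), "desired_goal": np.array([5.0, 5.0]), "reward": -1.0},
]
extra = her_relabel_final(episode)
print(extra[-1]["reward"])  # → 0.0 (the final transition trivially reaches its own outcome)
```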

• #### Image input into TD3

Hi,

I have a custom env with an image observation space and a continuous action space. After training TD3 policies, when I evaluate them there seems to be no reaction to the image observation (I manually drag objects in front of the camera to see what happens).

```python
from stable_baselines.td3.policies import CnnPolicy as td3CnnPolicy
from stable_baselines import TD3

env = gym.make('GripperEnv-v0')
env = Monitor(env, log_dir)
ExperimentName = "TD3_test"
policy_kwargs = dict(layers=[64, 64])
model = TD3(td3CnnPolicy, env, verbose=1, policy_kwargs=policy_kwargs, tensorboard_log="tmp/", buffer_size=15000,
            batch_size=2200, train_freq=2200, learning_starts=10000, learning_rate=1e-3)

callback = SaveOnBestTrainingRewardCallback(check_freq=1100, log_dir=log_dir)
time_steps = 50000
model.learn(total_timesteps=int(time_steps), callback=callback)
model.save("128128/" + ExperimentName)
```


I can view the observation using OpenCV, and it is the right image (single channel, pixels between 0 and 1).

As I understand it, the CNN is 3 Conv2D layers that connect to two layers 64 units wide. Is it possible that I somehow disconnected these two parts, or could it be that my hyper-parameters are just that bad? The behavior learnt by the policies is similar to feeding all-zero pixels into the network.

• #### True rewards remaining "zero" in the trajectories in Stable Baselines 2 for custom environments

I am using reinforcement learning for mathematical optimization, with a PPO2 agent in Google Colab. With my custom environment, episode rewards remain zero when I look at TensorBoard. Also, when I use a print statement to print out the "true_reward" inside the ppo2.py file (as shown in the figure), I get nothing but a zero vector.

Due to this, my agent is not learning correctly.

The following things are important to note here:

1. My environment is giving the agent nonzero rewards (I have checked this thoroughly), but on the agent side the rewards are not being collected.
2. This happens mostly but not always; sometimes when I install stable-baselines, the whole system works perfectly.
3. This happens only with my custom environment and not with other OpenAI Gym environments.

• #### Deep Q-value network evaluation in SAC algorithm

I am implementing a Soft Actor-Critic (SAC) agent and need to evaluate the Q-value network inside my custom environment (for the implementation of a special algorithm, called the Wolpertinger algorithm, to handle large discrete action spaces). I have tried to get the Q-values from the SAC class object, but failed. Any method or function like the one in Stable Baselines' PPO implementation (namely, .value) would be very helpful.

• #### Link to Gym docs on creating a custom environment broken

The link in the Stable Baselines docs to the OpenAI Gym docs on creating a custom environment is broken.

• #### 1D Vector of floats as an observation space

Hey there, I've been working on this environment for a bit but just can't seem to grasp the observation space. Essentially, I have a list of attributes (13 floats) that I need held in the observation space. The max they could be is 1200 (x and y coords).

Do I need a vector defining the low and high for each value? Some values can only go up to 7.

This is my observation space: self.observation_space = spaces.Box(low=0, high=1200, shape=(13,), dtype=np.float32). In my reset(), I return a NumPy vector of 13 floats; however, when I run check_env, I get the following: AssertionError: The observation returned by the reset() method does not match the given observation space.

Several people online mentioned using a dict instead, but I tried that and it didn't work. I also read that I'm supposed to be using values between 0 and 1? I'm a bit confused about that. I'm just really unfamiliar with gym in general and not quite sure what I'm doing, so any help would be appreciated.
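For reference (a sketch, assuming plain gym): Box accepts array-valued low/high, so each of the 13 dimensions can get its own bounds, and values do not need to be scaled to [0, 1]. A common cause of that check_env assertion is a dtype mismatch, fixed by returning the observation as float32:

```python
import numpy as np
from gym import spaces

# Per-dimension bounds: say 11 values up to 1200, the last 2 only up to 7
high = np.array([1200.0] * 11 + [7.0] * 2, dtype=np.float32)
observation_space = spaces.Box(low=np.zeros(13, dtype=np.float32), high=high, dtype=np.float32)

# reset() must return an array matching both shape AND dtype:
obs = np.zeros(13, dtype=np.float32)  # returning float64 here can trip check_env
assert observation_space.contains(obs)
```

The 11/2 split of the bounds is only for illustration; the point is that low and high can be arrays of shape (13,).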

• #### Problem retraining PPO1 model and using Tensorflow with Stable Baselines 2

Dear altruists, I am new to Stable Baselines and RL. I am trying to retrain my previously trained PPO1 model so that it starts learning from where it left off in the previous training. What I am trying to do is:

1. Load my previously trained model from my computer and then re-train it from the point where its last training ended. For that, I am loading my previously saved model inside policy_fn() and passing policy_fn as a parameter to the pposgd_simple.learn() method. It shows the error "ValueError: At least two variables have the same name: pi/obfilter/count".

I am also unsure whether it starts the training from the previous ending point or from the very beginning (when it trains correctly in a different setting). Can anyone please help me find a way to verify this? One option may be printing the model parameters, but I am unsure of that.

2. I am also trying to use TensorBoard to monitor my training. But when I run the training, the program says "tensorboard_log=logger_path, TypeError: learn() got an unexpected keyword argument 'tensorboard_log'". My Stable Baselines version is 2.10.2. I am attaching my entire training code below. I would appreciate any suggestions. Thanks in advance.
```python
def make_env(seed=None):
    reward_scale = 1.0

    rank = MPI.COMM_WORLD.Get_rank()
    myseed = seed + 1000 * rank if seed is not None else None
    set_global_seeds(myseed)
    env = Env()

    env = Monitor(env, logger_path, allow_early_resets=True)

    env.seed(seed)
    if reward_scale != 1.0:
        from baselines.common.retro_wrappers import RewardScaler
        env = RewardScaler(env, reward_scale)
    return env

def train(num_timesteps, path=None):
    from baselines.ppo1 import mlp_policy, pposgd_simple

    sess = U.make_session(num_cpu=1)
    sess.__enter__()

    def policy_fn(name, ob_space, ac_space):
        policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                      hid_size=64, num_hid_layers=3)
        saver = tf.train.Saver()
        if path is not None:
            print("Tried to restore from ", path)
            U.initialize()
            saver.restore(tf.get_default_session(), path)
            saver2 = tf.train.import_meta_graph('/srcs/src/models/model1.meta')
            model = saver.restore(sess, tf.train.latest_checkpoint('/srcs/src/models/'))
        #return policy
        return saver2

    env = make_env()

    pi = pposgd_simple.learn(env, policy_fn,
                             max_timesteps=num_timesteps,
                             timesteps_per_actorbatch=1024,
                             clip_param=0.2, entcoeff=0.0,
                             optim_epochs=10,
                             optim_stepsize=5e-5,
                             optim_batchsize=64,
                             gamma=0.99,
                             lam=0.95,
                             schedule='linear',
                             tensorboard_log=logger_path,
                             #tensorboard_log="./ppo1_tensorboard/",
                             )
    env.env.plotSave()
    saver = tf.train.Saver(tf.all_variables())
    saver.save(sess, '/models/model1')
    return pi

def main():
    logger.configure()
    path_ = "/models/model1"
    train(num_timesteps=409600, path=path_)

if __name__ == '__main__':
    rank = MPI.COMM_WORLD.Get_rank()
    logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
    main()
```

• #### Running Stable Baselines on M1 Macs?

Hi everyone,

A while ago, early 2020 to be precise, I compiled a DRL project using SB. Since then, I've moved on to other things. Recently I purchased an ARM-architecture (i.e. M1) Mac and transferred my files over; however, Stable Baselines doesn't function. When I run the project, the kernel keeps restarting with no error message displayed. I've followed the installation instructions and even matched the version numbers of TF etc. from my old Intel-based machine (on which the code still appears to run fine).

I also experimented with SB3, for which the CartPole training example using PPO from the documentation (https://stable-baselines3.readthedocs.io/en/master/guide/examples.html) does run; however, it is extremely slow.

I'd really appreciate it if someone could answer the following questions:

1. Is there a special procedure for installing Stable Baselines and its dependencies (e.g. TensorFlow) on M1-based Macs?
2. Is there a performance penalty for running this code on these machines?

Cheers!
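Not an answer from the thread, but one quick general diagnostic (standard library only) is to check whether the Python interpreter is running natively on arm64 or under Rosetta 2 x86_64 emulation, since an x86_64 interpreter will pull x86_64 wheels for TensorFlow and friends:

```python
import platform

# On an M1 Mac, a natively running interpreter reports "arm64";
# under Rosetta 2 emulation it reports "x86_64" instead.
arch = platform.machine()
print(f"Python is running on: {arch}")

# Additional context about the host, for bug reports:
print(platform.system(), platform.python_version())
```

If this prints `x86_64` on Apple silicon, the whole stack is being emulated, which would explain a large performance penalty.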

## Related Projects

• **Coach** (Intel AI Lab) — a Python reinforcement learning framework containing implementations of many state-of-the-art algorithms, with an easy-to-use API for experimenting with new RL algorithms.
• **SLM Lab** — a modular deep reinforcement learning framework in PyTorch; companion library of the book "Foundations of Deep Reinforcement Learning". Documentation: https://slm-lab.gitbook.io/slm-lab/
• **OpenAI Gym** — a toolkit for developing and comparing reinforcement learning algorithms (status: maintenance; expect bug fixes and minor updates).
• **Dopamine** — a research framework for fast prototyping of reinforcement learning algorithms, aiming for a small, easily grokkable codebase.
• **ViZDoom** — a Doom-based AI research platform for developing bots that play Doom using only raw visual information (the screen buffer).
• **garage** — a toolkit for developing and evaluating reinforcement learning algorithms for reproducible research, with an accompanying library of state-of-the-art implementations.
• **Meta-World** — an open-source simulated benchmark for meta-reinforcement learning and multi-task learning, consisting of 50 distinct robotic manipulation tasks.
• **ReAgent** (Facebook) — an open-source end-to-end platform for applied reasoning systems (reinforcement learning, contextual bandits, etc.).
• **TF-Agents** — a reliable, scalable, and easy-to-use TensorFlow library for contextual bandits and reinforcement learning.
• **Tensorforce** — an open-source deep reinforcement learning framework for applied reinforcement learning, built on TensorFlow.
• **TRFL** (pronounced "truffle") — a library built on top of TensorFlow that exposes useful building blocks for implementing reinforcement learning agents.
• **keras-rl** — implements state-of-the-art deep reinforcement learning algorithms in Python for Keras.
• **ChainerRL** — a deep reinforcement learning library implementing various state-of-the-art algorithms in Python on top of Chainer.
• **Crafter** — an open-world, procedurally generated 2D survival environment for reinforcement learning.
• **RIIT / MARL Tricks** — standardized implementations for "Rethinking the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning".