TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

TF-Agents makes implementing, deploying, and testing new Bandits and RL algorithms easier. It provides well-tested, modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking.

To get started, we recommend checking out one of our Colab tutorials. If you need an intro to RL (or a quick recap), start here. Otherwise, check out our DQN tutorial to get an agent up and running in the Cartpole environment. API documentation for the current stable release is on tensorflow.org.

TF-Agents is under active development and interfaces may change at any time. Feedback and comments are welcome.

Table of contents

Agents
Tutorials
Multi-Armed Bandits
Examples
Installation
Contributing
Releases
Principles
Citation
Disclaimer

Agents

In TF-Agents, the core elements of RL algorithms are implemented as Agents. An agent has two main responsibilities: defining a Policy to interact with the Environment, and learning/training that Policy from collected experience.
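
A minimal sketch of that split, modeled on the DQN tutorial (hyperparameters are illustrative, not prescriptive):

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# Environment and Q-network, as in the DQN tutorial.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
q_net = q_network.QNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(100,))

# The agent bundles both responsibilities.
agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

eval_policy = agent.policy             # Policy used for evaluation/deployment.
collect_policy = agent.collect_policy  # Policy used to gather experience.
# Training consumes batched experience, typically sampled from a replay buffer:
# loss_info = agent.train(experience)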

Currently the following algorithms are available under TF-Agents:

Tutorials

See docs/tutorials/ for tutorials on the major components provided.

Multi-Armed Bandits

The TF-Agents library contains a comprehensive Multi-Armed Bandits suite, including Bandits environments and agents. RL agents can also be used on Bandit environments. There is a tutorial in bandits_tutorial.ipynb, and ready-to-run examples in tf_agents/bandits/agents/examples/v2.
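
A minimal sketch in the spirit of the Bandits tutorial; the context sampler and per-arm reward functions below are illustrative stand-ins rather than the tutorial's own:

import numpy as np

from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_py_environment as sspe
from tf_agents.environments import tf_py_environment

batch_size = 1
context_dim = 2

def context_sampling_fn():
  # One batch of random 2-dimensional contexts.
  return np.random.uniform(-1.0, 1.0, (batch_size, context_dim)).astype(np.float32)

# Two arms; each reward function maps a single context to a scalar reward.
arm_weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
reward_fns = [lambda x, w=w: float(np.dot(x, w)) for w in arm_weights]

env = tf_py_environment.TFPyEnvironment(
    sspe.StationaryStochasticPyEnvironment(
        context_sampling_fn, reward_fns, batch_size=batch_size))

# LinUCB is one of the bandit agents included in the suite.
agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=env.time_step_spec(),
    action_spec=env.action_spec())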

Examples

End-to-end examples that train agents can be found under each agent directory, e.g.:

Installation

TF-Agents publishes nightly and stable builds. For a list of releases read the Releases section. The commands below cover installing TF-Agents stable and nightly from pypi.org as well as from a GitHub clone.

Stable

Run the commands below to install the most recent stable release. API documentation for the release is on tensorflow.org.

$ pip install --user tf-agents[reverb]

# Use this tag to get the matching examples and colabs.
$ git clone https://github.com/tensorflow/agents.git
$ cd agents
$ git checkout v0.6.0

If you want to install TF-Agents with versions of TensorFlow or Reverb that are flagged as not compatible by the pip dependency check, use the following pattern at your own risk.

$ pip install --user tensorflow
$ pip install --user dm-reverb
$ pip install --user tf-agents

If you want to use TF-Agents with TensorFlow 1.15 or 2.0, install version 0.3.0:

# Newer versions of tensorflow-probability require newer versions of TensorFlow.
$ pip install tensorflow-probability==0.8.0
$ pip install tf-agents==0.3.0

Nightly

Nightly builds include newer features, but may be less stable than the versioned releases. The nightly build is pushed as tf-agents-nightly. We suggest installing nightly versions of TensorFlow (tf-nightly) and TensorFlow Probability (tfp-nightly), as those are the versions TF-Agents nightly is tested against.

To install the nightly build version, run the following:

# `--force-reinstall` helps guarantee the right versions.
$ pip install --user --force-reinstall tf-nightly
$ pip install --user --force-reinstall tfp-nightly
$ pip install --user --force-reinstall dm-reverb-nightly

# Installing with the `--upgrade` flag ensures you'll get the latest version.
$ pip install --user --upgrade tf-agents-nightly

From GitHub

After cloning the repository, the dependencies can be installed by running pip install -e .[tests]. TensorFlow needs to be installed independently: pip install --user tf-nightly.

Contributing

We're eager to collaborate with you! See CONTRIBUTING.md for a guide on how to contribute. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

Releases

TF-Agents has stable and nightly releases. The nightly releases are often fine but can have issues due to upstream libraries being in flux. The table below lists the version(s) of TensorFlow tested with each TF-Agents release, to help users who may be locked into a specific version of TensorFlow. 0.3.0 was the last release compatible with Python 2.

Release   Branch / Tag   TensorFlow Version
Nightly   master         tf-nightly
0.7.1     v0.7.1         2.4.0
0.6.0     v0.6.0         2.3.0
0.5.0     v0.5.0         2.2.0
0.4.0     v0.4.0         2.1.0
0.3.0     v0.3.0         1.15.0 and 2.0.0

Principles

This project adheres to Google's AI principles. By participating in, using, or contributing to this project, you are expected to adhere to these principles.

Citation

If you use this code, please cite it as:

@misc{TFAgents,
  title = {{TF-Agents}: A library for Reinforcement Learning in TensorFlow},
  author = {Sergio Guadarrama and Anoop Korattikara and Oscar Ramirez and
     Pablo Castro and Ethan Holly and Sam Fishman and Ke Wang and
     Ekaterina Gonina and Neal Wu and Efi Kokiopoulou and Luciano Sbaiz and
     Jamie Smith and Gábor Bartók and Jesse Berent and Chris Harris and
     Vincent Vanhoucke and Eugene Brevdo},
  howpublished = {\url{https://github.com/tensorflow/agents}},
  url = "https://github.com/tensorflow/agents",
  year = 2018,
  note = "[Online; accessed 25-June-2019]"
}

Disclaimer

This is not an official Google product.

Comments
  • Error loading DqnAgent saved model.

    I am creating a tf-agent DqnAgent in the following code:

        tf_agent = dqn_agent.DqnAgent(
            train_env.time_step_spec(),
            train_env.action_spec(),
            q_network=q_net,
            optimizer=optimizer,
            td_errors_loss_fn=dqn_agent.element_wise_squared_loss,
            train_step_counter=train_step_counter
    )
    

    During the training loop I am saving this model with

        tf.saved_model.save(tf_agent, saved_models_path)
    

    Once trained, I want to load saved model with

        if tf.saved_model.contains_saved_model(saved_models_path):
            tf_agent = tf.saved_model.load(saved_models_path)
    

    This code will load the saved model only if the folder in saved_models_path contains one. The function contains_saved_model(saved_models_path) returns True, so the model is loaded, but then there is an exception and the program crashes:

        Traceback (most recent call last):
            File "/home/claudino/Projetos/dino-tf-agents/dino_ia/model/agent.py", line 50, in <module>
                tf_agent = tf.saved_model.load(saved_models_path)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 408, in load
                return load_internal(export_dir, tags)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 432, in load_internal
                export_dir)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 58, in __init__
                self._load_all()
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 168, in _load_all
                slot_variable = optimizer_object.add_slot(
            AttributeError: '_UserObject' object has no attribute 'add_slot'
    
            Process finished with exit code 1
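
    For reference, the TF-Agents tutorials save only the policy via PolicySaver rather than the whole agent object (the same pattern appears in a later comment below); a minimal sketch, with the directory name as a placeholder:

        from tf_agents.policies import policy_saver

        # Save just the policy instead of the agent object.
        saver = policy_saver.PolicySaver(tf_agent.policy)
        saver.save('policy_dir')  # placeholder directory

        # Later: load it back as a SavedModel and query actions.
        saved_policy = tf.saved_model.load('policy_dir')
        action_step = saved_policy.action(time_step)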
    
  • TRAIN TF-AGENTS WITH MULTIPLE GPUs

    Hi, I finally got my VM up and running using: 2x Tesla P100, NVIDIA driver 440.33.01, CUDA 10.2, tensorflow==2.1.0, tf_agents==0.3.0.

    I start training a custom model/env based on the SAC agent v2 train loop, but only one GPU is used. My question: is tf-agents currently able to manage distributed training on multiple GPUs, or should I use only one?

  • network.create_variables() clogs all GPU memory

    On calling network.create_variables() for my agent (using a DDPG agent), my GPU memory gets used 100% instantly and never clears up. I can control it by using a virtual memory cap, but I need memory for other computation downstream (CNN etc.) and the memory cap ensures there is no memory left for anything else.

    Why might this be happening and how do I get around this?
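
    For what it's worth, TensorFlow itself pre-allocates nearly all GPU memory by default; a standard TensorFlow-level workaround (not specific to TF-Agents) is to enable memory growth before any networks are built, sketched below:

        import tensorflow as tf

        # Ask TensorFlow to allocate GPU memory on demand instead of
        # grabbing (almost) all of it up front.
        for gpu in tf.config.list_physical_devices('GPU'):
            tf.config.experimental.set_memory_growth(gpu, True)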

  • tf-agents SAC 10x slower than stable-baselines on same hardware

    I am running a simple test of SAC using the LunarLanderContinuous-v2 environment. Training is for 500,000 steps with a replay buffer of size 50,000 (see code below). tf-agents takes over 10 hours to complete training, whereas the stable-baselines implementation of SAC using the same hyperparameters takes only 39 minutes. I've checked and double-checked my versions of CUDA, tensorflow-gpu, tf-agents, etc. and cannot speed things up.

    Here are the details to reproduce:

    Ubuntu 16.04, tf-agents==0.3.0, tensorflow-gpu==1.15.0, gym==0.15.4, CUDA==10.0, cudnn==7.6.5, stable-baselines==2.9.0a0, GPU==Quadro M4000 8Gb, CPU==i7 64 Gb

    My tf-agents test script is simply the v2 train_eval.py script from the sac/examples after substituting the LunarLanderContinuous-v2 environment for Half Cheetah and changing the hyperparameters as you can see below:

    # coding=utf-8
    # Copyright 2018 The TF-Agents Authors.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    r"""Train and Eval SAC.
    
    To run:
    
    #bash
    #tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
    #
    #python tf_agents/agents/sac/examples/v2/train_eval.py \
    #  --root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
    #  --alsologtostderr
    #```
    #"""
    
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import os
    import time
    
    from absl import app
    from absl import flags
    from absl import logging
    
    import gin
    import tensorflow as tf
    
    from tf_agents.agents.ddpg import critic_network
    from tf_agents.agents.sac import sac_agent
    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.environments import parallel_py_environment
    from tf_agents.environments import suite_mujoco
    from tf_agents.environments import tf_py_environment
    from tf_agents.eval import metric_utils
    from tf_agents.metrics import tf_metrics
    from tf_agents.networks import actor_distribution_network
    from tf_agents.networks import normal_projection_network
    from tf_agents.policies import greedy_policy
    from tf_agents.policies import random_tf_policy
    from tf_agents.replay_buffers import tf_uniform_replay_buffer
    from tf_agents.utils import common
    
    flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
                        'Root directory for writing logs/summaries/checkpoints.')
    flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
    flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
    
    FLAGS = flags.FLAGS
    
    
    @gin.configurable
    def normal_projection_net(action_spec,
                              init_action_stddev=0.35,
                              init_means_output_factor=0.1):
      del init_action_stddev
      return normal_projection_network.NormalProjectionNetwork(
          action_spec,
          mean_transform=None,
          state_dependent_std=True,
          init_means_output_factor=init_means_output_factor,
          std_transform=sac_agent.std_clip_transform,
          scale_distribution=True)
    
    
    _DEFAULT_REWARD_SCALE = 0
    
    
    @gin.configurable
    def train_eval(
        root_dir,
        env_name='LunarLanderContinuous-v2',
        eval_env_name=None,
        env_load_fn=suite_mujoco.load,
        num_iterations=500000,
        actor_fc_layers=(64, 64),
        critic_obs_fc_layers=None,
        critic_action_fc_layers=None,
        critic_joint_fc_layers=(64, 64),
        num_parallel_environments=1,
        # Params for collect
        initial_collect_steps=100,
        collect_steps_per_iteration=1,
        replay_buffer_capacity=50000,
        # Params for target update
        target_update_tau=0.005,
        target_update_period=1,
        # Params for train
        train_steps_per_iteration=1,
        batch_size=64,
        actor_learning_rate=3e-4,
        critic_learning_rate=3e-4,
        alpha_learning_rate=3e-4,
        td_errors_loss_fn=tf.compat.v1.losses.mean_squared_error,
        gamma=0.99,
        reward_scale_factor=_DEFAULT_REWARD_SCALE,
        gradient_clipping=None,
        use_tf_functions=True,
        # Params for eval
        num_eval_episodes=100,
        eval_interval=1000,
        # Params for summaries and logging
        train_checkpoint_interval=10000,
        policy_checkpoint_interval=5000,
        rb_checkpoint_interval=50000,
        log_interval=1000,
        summary_interval=1000,
        summaries_flush_secs=10,
        debug_summaries=False,
        summarize_grads_and_vars=False,
        eval_metrics_callback=None):
      """A simple train and eval for SAC on Mujoco.
    
      All hyperparameters come from the original SAC paper
      (https://arxiv.org/pdf/1801.01290.pdf).
      """
    
      if reward_scale_factor == _DEFAULT_REWARD_SCALE:
        # Use value recommended by https://arxiv.org/abs/1801.01290
        if env_name.startswith('Humanoid'):
          reward_scale_factor = 20.0
        else:
          reward_scale_factor = 5.0
    
      root_dir = os.path.expanduser(root_dir)
    
      summary_writer = tf.compat.v2.summary.create_file_writer(
          root_dir, flush_millis=summaries_flush_secs * 1000)
      summary_writer.set_as_default()
    
      eval_metrics = [
          tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
          tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
      ]
    
      global_step = tf.compat.v1.train.get_or_create_global_step()
      with tf.compat.v2.summary.record_if(
          lambda: tf.math.equal(global_step % summary_interval, 0)):
        # create training environment
        if num_parallel_environments == 1:
          py_env = env_load_fn(env_name)
        else:
          py_env = parallel_py_environment.ParallelPyEnvironment(
              [lambda: env_load_fn(env_name)] * num_parallel_environments)
        tf_env = tf_py_environment.TFPyEnvironment(py_env)
        # create evaluation environment
        eval_env_name = eval_env_name or env_name
        eval_py_env = env_load_fn(eval_env_name)
        eval_tf_env = tf_py_environment.TFPyEnvironment(eval_py_env)
    
        time_step_spec = tf_env.time_step_spec()
        observation_spec = time_step_spec.observation
        action_spec = tf_env.action_spec()
    
        actor_net = actor_distribution_network.ActorDistributionNetwork(
            observation_spec,
            action_spec,
            fc_layer_params=actor_fc_layers,
            continuous_projection_net=normal_projection_net)
        critic_net = critic_network.CriticNetwork(
            (observation_spec, action_spec),
            observation_fc_layer_params=critic_obs_fc_layers,
            action_fc_layer_params=critic_action_fc_layers,
            joint_fc_layer_params=critic_joint_fc_layers)
    
        tf_agent = sac_agent.SacAgent(
            time_step_spec,
            action_spec,
            actor_network=actor_net,
            critic_network=critic_net,
            actor_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=actor_learning_rate),
            critic_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=critic_learning_rate),
            alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=alpha_learning_rate),
            target_update_tau=target_update_tau,
            target_update_period=target_update_period,
            td_errors_loss_fn=td_errors_loss_fn,
            gamma=gamma,
            reward_scale_factor=reward_scale_factor,
            gradient_clipping=gradient_clipping,
            debug_summaries=debug_summaries,
            summarize_grads_and_vars=summarize_grads_and_vars,
            train_step_counter=global_step)
        tf_agent.initialize()
    
        # Make the replay buffer.
        replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
            data_spec=tf_agent.collect_data_spec,
            batch_size=num_parallel_environments,
            max_length=replay_buffer_capacity)
        replay_observer = [replay_buffer.add_batch]
    
        env_steps = tf_metrics.EnvironmentSteps(prefix='Train')
        average_return = tf_metrics.AverageReturnMetric(
            prefix='Train',
            buffer_size=num_eval_episodes,
            batch_size=tf_env.batch_size)
        train_metrics = [
            tf_metrics.NumberOfEpisodes(prefix='Train'),
            env_steps,
            average_return,
            tf_metrics.AverageEpisodeLengthMetric(
                prefix='Train',
                buffer_size=num_eval_episodes,
                batch_size=tf_env.batch_size),
        ]
    
        eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
        initial_collect_policy = random_tf_policy.RandomTFPolicy(
            tf_env.time_step_spec(), tf_env.action_spec())
        collect_policy = tf_agent.collect_policy
    
        train_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'train'),
            agent=tf_agent,
            global_step=global_step,
            metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
        policy_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'policy'),
            policy=eval_policy,
            global_step=global_step)
        rb_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'replay_buffer'),
            max_to_keep=1,
            replay_buffer=replay_buffer)
    
        train_checkpointer.initialize_or_restore()
        rb_checkpointer.initialize_or_restore()
    
        initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
            tf_env,
            initial_collect_policy,
            observers=replay_observer + train_metrics,
            num_steps=initial_collect_steps)
    
        collect_driver = dynamic_step_driver.DynamicStepDriver(
            tf_env,
            collect_policy,
            observers=replay_observer + train_metrics,
            num_steps=collect_steps_per_iteration)
    
        if use_tf_functions:
          initial_collect_driver.run = common.function(initial_collect_driver.run)
          collect_driver.run = common.function(collect_driver.run)
          tf_agent.train = common.function(tf_agent.train)
    
        # Collect initial replay data.
        if env_steps.result() == 0 or replay_buffer.num_frames() == 0:
          logging.info(
              'Initializing replay buffer by collecting experience for %d steps'
              'with a random policy.', initial_collect_steps)
          initial_collect_driver.run()
    
        results = metric_utils.eager_compute(
            eval_metrics,
            eval_tf_env,
            eval_policy,
            num_episodes=num_eval_episodes,
            train_step=env_steps.result(),
            summary_writer=summary_writer,
            summary_prefix='Eval',
        )
        if eval_metrics_callback is not None:
          eval_metrics_callback(results, env_steps.result())
        metric_utils.log_metrics(eval_metrics)
    
        time_step = None
        policy_state = collect_policy.get_initial_state(tf_env.batch_size)
    
        time_acc = 0
        env_steps_before = env_steps.result().numpy()
    
        # Dataset generates trajectories with shape [Bx2x...]
        dataset = replay_buffer.as_dataset(
            num_parallel_calls=3, sample_batch_size=batch_size,
            num_steps=2).prefetch(3)
        iterator = iter(dataset)
    
        def train_step():
          experience, _ = next(iterator)
          return tf_agent.train(experience)
    
        if use_tf_functions:
          train_step = common.function(train_step)
    
        for _ in range(num_iterations):
          start_time = time.time()
          time_step, policy_state = collect_driver.run(
              time_step=time_step,
              policy_state=policy_state,
          )
          for _ in range(train_steps_per_iteration):
            train_step()
          time_acc += time.time() - start_time
    
          if global_step.numpy() % log_interval == 0:
            logging.info('env steps = %d, average return = %f', env_steps.result(),
                         average_return.result())
            env_steps_per_sec = (env_steps.result().numpy() -
                                 env_steps_before) / time_acc
            logging.info('%.3f env steps/sec', env_steps_per_sec)
            tf.compat.v2.summary.scalar(
                name='env_steps_per_sec',
                data=env_steps_per_sec,
                step=env_steps.result())
            time_acc = 0
            env_steps_before = env_steps.result().numpy()
    
          for train_metric in train_metrics:
            train_metric.tf_summaries(train_step=env_steps.result())
    
          if global_step.numpy() % eval_interval == 0:
            results = metric_utils.eager_compute(
                eval_metrics,
                eval_tf_env,
                eval_policy,
                num_episodes=num_eval_episodes,
                train_step=env_steps.result(),
                summary_writer=summary_writer,
                summary_prefix='Eval',
            )
            if eval_metrics_callback is not None:
              eval_metrics_callback(results, env_steps.result())
            metric_utils.log_metrics(eval_metrics)
    
          global_step_val = global_step.numpy()
          if global_step_val % train_checkpoint_interval == 0:
            train_checkpointer.save(global_step=global_step_val)
    
          if global_step_val % policy_checkpoint_interval == 0:
            policy_checkpointer.save(global_step=global_step_val)
    
          if global_step_val % rb_checkpoint_interval == 0:
            rb_checkpointer.save(global_step=global_step_val)
    
    
    def main(_):
      tf.compat.v1.enable_v2_behavior()
      logging.set_verbosity(logging.INFO)
      gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
      train_eval(FLAGS.root_dir)
    
    
    if __name__ == '__main__':
      flags.mark_flag_as_required('root_dir')
      app.run(main)
    

    My stable-baselines script looks like this:

    import gym
    import numpy as np
    
    from stable_baselines.common.vec_env import DummyVecEnv
    from stable_baselines.common import make_vec_env
    from stable_baselines.sac.policies import MlpPolicy
    from stable_baselines import SAC
    
    env = make_vec_env('LunarLanderContinuous-v2', n_envs=1)
    
    model_name = "sac_lunar_lander"
    
    model = SAC(MlpPolicy, env, verbose=1, tensorboard_log="./tensorboard_logs/stable_baselines_test")
    
    model.learn(total_timesteps=500000, log_interval=10)
    model.save(model_name)
    
    

    Finally, here is the output when I run the tf-agents script to show that the GPU is being detected and used:

    2019-12-22 11:26:35.054589: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
    2019-12-22 11:26:35.068596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
    name: Quadro M4000 major: 5 minor: 2 memoryClockRate(GHz): 0.7725
    pciBusID: 0000:01:00.0
    2019-12-22 11:26:35.068767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.069770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2019-12-22 11:26:35.070479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
    2019-12-22 11:26:35.070640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
    2019-12-22 11:26:35.071572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
    2019-12-22 11:26:35.072306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
    2019-12-22 11:26:35.074604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2019-12-22 11:26:35.075808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
    2019-12-22 11:26:35.076022: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2019-12-22 11:26:35.080915: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3407920000 Hz
    2019-12-22 11:26:35.081214: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555945a77880 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2019-12-22 11:26:35.081228: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2019-12-22 11:26:35.144953: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555945a9b180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    2019-12-22 11:26:35.144974: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro M4000, Compute Capability 5.2
    2019-12-22 11:26:35.145550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
    name: Quadro M4000 major: 5 minor: 2 memoryClockRate(GHz): 0.7725
    pciBusID: 0000:01:00.0
    2019-12-22 11:26:35.145578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.145588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2019-12-22 11:26:35.145597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
    2019-12-22 11:26:35.145605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
    2019-12-22 11:26:35.145629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
    2019-12-22 11:26:35.145650: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
    2019-12-22 11:26:35.145674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2019-12-22 11:26:35.146551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
    2019-12-22 11:26:35.146575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.147375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-12-22 11:26:35.147384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
    2019-12-22 11:26:35.147388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
    2019-12-22 11:26:35.148348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6876 MB memory) -> physical GPU (device: 0, name: Quadro M4000, pci bus id: 0000:01:00.0, compute capability: 5.2)
    /home/patrick/src/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
      warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/agents/ddpg/critic_network.py:141: The name tf.keras.initializers.RandomUniform is deprecated. Please use tf.compat.v1.keras.initializers.RandomUniform instead.
    
    W1222 11:26:35.589284 140187933329152 module_wrapper.py:139] From /home/patrick/src/tf_agents/tf_agents/agents/ddpg/critic_network.py:141: The name tf.keras.initializers.RandomUniform is deprecated. Please use tf.compat.v1.keras.initializers.RandomUniform instead.
    
    2019-12-22 11:26:35.600509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/distributions/utils.py:92: AffineScalar.__init__ (from tensorflow_probability.python.bijectors.affine_scalar) is deprecated and will be removed after 2020-01-01.
    Instructions for updating:
    `AffineScalar` bijector is deprecated; please use `tfb.Shift(loc)(tfb.Scale(...))` instead.
    W1222 11:26:35.787435 140187933329152 deprecation.py:323] From /home/patrick/src/tf_agents/tf_agents/distributions/utils.py:92: AffineScalar.__init__ (from tensorflow_probability.python.bijectors.affine_scalar) is deprecated and will be removed after 2020-01-01.
    Instructions for updating:
    `AffineScalar` bijector is deprecated; please use `tfb.Shift(loc)(tfb.Scale(...))` instead.
    I1222 11:26:35.814536 140187933329152 common.py:920] Checkpoint available: tensorboard_logs/tf_agents_v2/train/ckpt-30000
    I1222 11:26:35.902629 140187933329152 common.py:920] Checkpoint available: tensorboard_logs/tf_agents_v2/policy/ckpt-35000
    I1222 11:26:35.908307 140187933329152 common.py:923] No checkpoint available at tensorboard_logs/tf_agents_v2/replay_buffer
    I1222 11:26:35.910735 140187933329152 tf_agents_v2_lunar_lander.py:267] Initializing replay buffer by collecting experience for 100 stepswith a random policy.
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/metrics/tf_metrics.py:161: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    W1222 11:26:36.424730 140187933329152 deprecation.py:323] From /home/patrick/src/tf_agents/tf_agents/metrics/tf_metrics.py:161: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    I1222 11:28:23.095548 140187933329152 metric_utils.py:47]  
    		 AverageReturn = 1.452040195465088
    		 AverageEpisodeLength = 501.0
    I1222 11:28:34.015443 140187933329152 tf_agents_v2_lunar_lander.py:314] env steps = 31200, average return = -80.228371
    I1222 11:28:34.015817 140187933329152 tf_agents_v2_lunar_lander.py:317] 131.060 env steps/sec
    etc.
    

    And the output from nvidia-smi while running the script:

    Sun Dec 22 11:29:16 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro M4000        Off  | 00000000:01:00.0  On |                  N/A |
    | 51%   56C    P0    43W / 120W |   7865MiB /  8104MiB |     10%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1370      G   /usr/lib/xorg/Xorg                           435MiB |
    |    0      2062      G   compiz                                       146MiB |
    |    0      3479      G   ...uest-channel-token=17571043003057555071   211MiB |
    |    0     17466      C   python                                      7057MiB |
    +-----------------------------------------------------------------------------+
    
  • tf-agents-nightly installed on colab seems very different from the master branch

    tf-agents-nightly installed on Colab seems very different from the master branch. The experimental examples folder is missing. Not 100% sure whether this is a Colab issue or a tf-agents issue.

  • Problem with importing the "reverb" package with Tutorial: SAC minitaur with the Actor-Learner API

    Hi,

    I am getting an ImportError when trying to import the "reverb" package as done in the tutorial.

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-2-38745e83da94> in <module>
          4 import matplotlib.pyplot as plt
          5 import os
    ----> 6 import reverb
          7 import tempfile
          8 import PIL.Image
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/__init__.py in <module>
         25 # pylint: enable=g-bad-import-order
         26 
    ---> 27 from reverb import item_selectors as selectors
         28 from reverb import rate_limiters
         29 
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/item_selectors.py in <module>
         17 import functools
         18 
    ---> 19 from reverb import pybind
         20 
         21 Fifo = pybind.FifoSelector
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/pybind.py in <module>
    ----> 1 import tensorflow as _tf; from .libpybind import *; del _tf
    
    ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
    

    I have tried to export this variable: export LD_LIBRARY_PATH=/home/orie/Desktop/AI/ai_venv/lib/

    I have also tried including this environment variable in my python notebook:

    import os
    os.environ['LD_LIBRARY_PATH'] = '/home/orie/Desktop/AI/ai_venv/lib/'
    

    I also tried: sudo ldconfig /home/orie/Desktop/AI/ai_venv/lib. I'm using Ubuntu and a virtual environment.

    Thanks to anyone who helps!

  • DQN Agent Issue With Custom Environment

    So I've been following the DQN agent example / tutorial and set it up like in the example; the only difference is that I built my own custom Python environment, which I then wrapped in TensorFlow. However, no matter how I shape my observation and action specs, I can't seem to get it to work whenever I give it an observation and request an action. Here's the error that I get:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0] is not a matrix. Instead it has shape [10] [Op:MatMul]

    Here's how I'm setting up my agent:

        layer_parameters = (10,)  # one fully-connected layer with 10 units
        
        #placeholders 
        learning_rate = 1e-3  # @param {type:"number"}
        train_step_counter = tf.Variable(0)
    
        #instantiate agent
    
        optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
        
        env = SumoEnvironment(self._num_actions,self._num_states)
        env2 = tf_py_environment.TFPyEnvironment(env)
        q_net= q_network.QNetwork(env2.observation_spec(),env2.action_spec(),fc_layer_params = layer_parameters)
        
        print("Time step spec")
        print(env2.time_step_spec())
    
        agent = dqn_agent.DqnAgent(env2.time_step_spec(),
        env2.action_spec(),
        q_network=q_net,
        optimizer = optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    

    And here's how I'm setting up my environment:

    class SumoEnvironment(py_environment.PyEnvironment):

    def __init__(self, no_of_Actions, no_of_Observations):
    
        #this means that the observation consists of a number of arrays equal to self._num_states, with datatype float32
        self._observation_spec = specs.TensorSpec(shape=(16,),dtype=np.float32,name='observation')
        #action spec, shape unknown, min is 0, max is the number of actions
        self._action_spec = specs.BoundedArraySpec(shape=(1,),dtype=np.int32,minimum=0,maximum=no_of_Actions-1,name='action')
        
       
        self._state = 0
        self._episode_ended = False
    

    And here is what my input / observations look like:

    tf.Tensor([ 0. 0. 0. 0. 0. 0. 0. 0. -1. -1. -1. -1. 0. 0. 0. -1.], shape=(16,), dtype=float32)

    I've tried experimenting with the shape and depth of my Q_Net, and it seems to me that the [10] in the error is related to the shape of my Q-network. Setting its layer parameters to (4,) yields an error of:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0] is not a matrix. Instead it has shape [4] [Op:MatMul]
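
    For comparison, the DQN tutorial always feeds the network batched time steps taken from the wrapped TF environment, so observations carry an outer batch dimension; a minimal sketch using the names above:

        # Time steps from the TFPyEnvironment wrapper (env2 above) are batched,
        # e.g. the observation has shape (1, 16) rather than (16,).
        time_step = env2.reset()
        action_step = agent.policy.action(time_step)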

  • Feature request make it easier to supply custom model

    I tried assigning my own layers to the post_processing variable within my categorical QNetwork, but when I then try to create my categorical DQN agent I get a message that weights are shared. It would be nice if the main categorical QNetwork constructor allowed a parameter for providing a set of Keras layers, with the q_layer simply appended to the end like in the encoding network scheme, and the weights copied for you.

  • AttributeError: 'tuple' object has no attribute 'rank'

    Trying out the most basic example on

    • Windows 10
    • Python 3.7
    • tensorflow 2.1.0
    • tf-agents 0.4.0

    The error I get:

    Traceback (most recent call last):
      File "src\agent.py", line 58, in <module>
        action, _states = agent.policy.action(obs)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\policies\tf_policy.py", line 279, in action
        step = action_fn(time_step=time_step, policy_state=policy_state, seed=seed)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\common.py", line 154, in with_check_resource_vars
        return fn(*fn_args, **fn_kwargs)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\policies\random_tf_policy.py", line 89, in _action
        outer_dims = nest_utils.get_outer_shape(time_step, self._time_step_spec)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 394, in get_outer_shape
        nested_tensor, spec, num_outer_dims=num_outer_dims):
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 97, in is_batched_nested_tensors
        if any(spec_shape.rank is None for spec_shape in spec_shapes):
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 97, in <genexpr>
        if any(spec_shape.rank is None for spec_shape in spec_shapes):
    AttributeError: 'tuple' object has no attribute 'rank'
    
    

    The code I run:

    import tensorflow as tf
    from collections import Counter, defaultdict
    from tf_agents.networks import q_network
    from tf_agents.utils import common
    from tf_agents.agents.dqn import dqn_agent
    from tf_agents.agents.random.random_agent import RandomAgent
    from tf_agents.environments import suite_gym
    from environment import StockExchangeEnv01
    
    # tried with and without..error persists
    # tf.compat.v1.enable_v2_behavior()
    
    learning_rate = 0.0001
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
    
    # tried both my own Environment and the basic "cartpole-v0"
    train_env = StockExchangeEnv01()
    env_name = 'CartPole-v0'
    #train_env = suite_gym.load(env_name)
    
    train_env.reset()
    print(train_env.action_spec())
    """
    # Neural Net of the Agent. This NN will get x (env) and spit out y (action).
    q_net = q_network.QNetwork(
      train_env.observation_spec(),
      train_env.action_spec(),
      fc_layer_params=(100,))
    print(train_env.action_spec())
    
    #
    agent = dqn_agent.DqnAgent(
      train_env.time_step_spec(),
      train_env.action_spec(),
      q_network=q_net,
      optimizer=optimizer)
    """
    
    # tried both..dqn agent and random agent
    
    agent = RandomAgent(
        train_env.time_step_spec(),
        train_env.action_spec()
    )
    agent.initialize()
    
    obs = train_env.reset()
    actions = Counter()
    pnl = defaultdict(float)
    total_rewards = 0.0
    
    for i in range(300):
        #action, _states = model.predict(obs)
        action, _states = agent.policy.action(obs)
        obs, rewards, dones, info = train_env.step(action)
        actions[action[0].item()] += 1
        pnl[action[0].item()] += rewards
        total_rewards += rewards
        if dones:
            break
    
    print('actions : {}'.format(actions))
    print('rewards : {}'.format(total_rewards))
    
    

    The code in tf-agents gets the 'shape' from the action_spec, which is a tuple in my case. Then it tries to access the attribute 'rank' on that tuple.

    What am I missing?
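
    As a point of reference, the DQN tutorial builds agents and policies on top of a TFPyEnvironment wrapper rather than the raw Python environment; a minimal sketch using the environment above:

    from tf_agents.environments import tf_py_environment

    # Wrapping the pure-Python environment turns specs and time steps into
    # batched tensors before they reach TF policies.
    train_env = tf_py_environment.TFPyEnvironment(StockExchangeEnv01())
    agent = RandomAgent(train_env.time_step_spec(), train_env.action_spec())
    agent.initialize()

    time_step = train_env.reset()
    action_step = agent.policy.action(time_step)
    time_step = train_env.step(action_step.action)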

  • Memory leak with DqnAgent

    I have built a basic DQN agent to play in the CartPole environment by following the DQN tutorial: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial. However, after a couple of hours of training I noticed that the process's memory consumption was increasing substantially. I was able to simplify the training script to narrow down the problem and found that memory leaks whenever the driver uses agent.policy or agent.collect_policy (replacing it with RandomTFPolicy eliminates the issue):

    import tensorflow as tf
    import gc
    
    from tf_agents.environments import suite_gym, tf_py_environment
    from tf_agents.networks import q_network
    from tf_agents.agents.dqn import dqn_agent
    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.utils import common
    
    tf.compat.v1.enable_v2_behavior()
    
    # Create CartPole as TFPyEnvironment
    env = suite_gym.load('CartPole-v0')
    tf_env = tf_py_environment.TFPyEnvironment(env)
    
    # Create DQN Agent
    q_net = q_network.QNetwork(
            tf_env.observation_spec(),
            tf_env.action_spec(),
            fc_layer_params=(100,))
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    train_step_counter = tf.Variable(0)
    
    agent = dqn_agent.DqnAgent(
        tf_env.time_step_spec(),
        tf_env.action_spec(),
        q_network=q_net,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    
    agent.initialize()
    
    # Replacing agent.collect_policy with tf_policy eliminates issue a of memory leak
    # tf_policy = random_tf_policy.RandomTFPolicy(action_spec=train_env.action_spec(),
    #                                            time_step_spec=train_env.time_step_spec())
    
    # Create dynamic step driver with no observers
    driver = dynamic_step_driver.DynamicStepDriver(
        env = tf_env,
        policy = agent.collect_policy,
        observers = [],
        num_steps = 1)
    
    # Calls to driver end up continuously increasing memory consumption 
    while True:
        driver.run()
        # One of the possible solutions is to call gc.collect() but it significantly slows down training
    

    Another hotfix, as mentioned in the code above, is to call gc.collect() after each driver.run(), but that has a huge impact on performance.

    This memory leak prevents a long-running training process, which might be a bit of a bummer for more complex environments based on DQN.

    Running setup:

    • Ubuntu 20.10 / 64-bit
    • Python 3.8.6 + tensorflow==2.4.1 + tf-agents==0.7.1
    • Running on the CPU: AMD Ryzen Threadripper 3960x
    • RAM: 128GB

    The same script has also been run within a Docker container, and the memory leak was confirmed there as well.

    What could be the possible cause of this problem, and how can it be properly fixed?

  • OOM after a couple of iterations

    I am running DQN on an Atari game (BeamRider-v0). I just take the input image, flatten it, and connect it to a fully connected layer with 32 neurons. It runs for 14,000 iterations on a Tesla V100 GPU; after 14,000 iterations, I get OOM. Is there a memory leak? I am using tf-nightly-gpu-2.0-preview. I have also tried tf-nightly-gpu and the same problem exists. My question is: why don't I get the error in the very first iterations? What causes memory usage to grow for 14,000 iterations?

  • DQN sample - AverageReturn output is same as AverageEpisodeLength

    I have run the sample: https://github.com/tensorflow/agents/tree/942db59044f2b25151f313dc9a098ff652ab90f2/tf_agents/agents/dqn/examples/v2

    Apparently AverageReturn always equals AverageEpisodeLength. Potential bug?

    INFO:absl: 
    		 AverageReturn = 119.5999984741211
    		 AverageEpisodeLength = 119.5999984741211
    INFO:absl:step = 3000, loss = 2.203988
    INFO:absl:403.487 steps/sec
    
  • Implement batched observer unbatching

    This is essentially an adapter that allows observers which don't support a batch dimension (looking at you, ReverbObservers) to be used in batch contexts.
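
    A rough sketch of what such an adapter could look like (a hypothetical helper, not an existing TF-Agents API):

    import tensorflow as tf

    def unbatching_observer(observer):
      """Hypothetical adapter: feed a per-item observer from batched trajectories."""
      def _call(batched_trajectory):
        # Slice off the leading batch dimension and call the wrapped
        # observer once per batch element.
        batch_size = tf.nest.flatten(batched_trajectory)[0].shape[0]
        for i in range(batch_size):
          observer(tf.nest.map_structure(lambda t: t[i], batched_trajectory))
      return _call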

  • Loaded policy eval runs 4 times faster than original policy eval

    I have made an RL solution largely based on: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial

    After training finished, I ran eval once. Then I saved the policy and ran eval again. The timing is 400% different. Is this expected? Is there a reasonable explanation?

    This is original eval:

    start = timer()
    avg_return, total_rewards = policy_eval(eval_env, agent.policy, total_eval_episodes)
    end = timer()
    print('{2} | steps = {0:6}: Average Return = {1:<+9e}, per step: {3}'.format(total_eval_episodes, avg_return, timedelta(seconds=end-start), timedelta(seconds=(end-start)/total_eval_episodes)))
    
    Out:
    0:25:27.958856 | steps =   1000: Average Return = -4.910102e-07, per step: 0:00:01.527959
    

    this is save/load:

    tf_policy_saver = policy_saver.PolicySaver(agent.policy)
    tf_policy_saver.save(policy_dir)
    . . .
    saved_policy = tf.saved_model.load(policy_dir)
    

    Eval using loaded policy:

    start = timer()
    avg_return2, total_rewards2 = policy_eval(eval_env, saved_policy, total_eval_episodes)
    end = timer()
    print('{2} | Saved policy: steps = {0:6}: Average Return = {1:<+9e}, per step: {3}'.format(total_eval_episodes, avg_return2, timedelta(seconds=end-start), timedelta(seconds=(end-start)/total_eval_episodes)))
    
    Out:
    0:03:47.331221 | Saved policy: steps =   1000: Average Return = -7.780847e-07, per step: 0:00:00.227331
    

    eval function (almost the same as in the tutorial):

    def policy_eval(environment, policy, num_episodes=10):
    
      total_return = 0.0
      episode_returns = []
      policy_state = policy.get_initial_state(environment.batch_size)
    
      for _ in range(num_episodes):
    
        time_step = environment.reset()
    
        while not time_step.is_last():
          action_step = policy.action(time_step, policy_state)
          policy_state = action_step.state
          time_step = environment.step(action_step.action)
          total_return += time_step.reward
          episode_returns.append(time_step.reward)
    
      avg_return = total_return / num_episodes
        
      return avg_return.numpy()[0], episode_returns
    
  • Batching ReverbAddEpisodeObserver with variable length episodes

    It looks like this bug is already tracking the issue, but I would like to use ParallelPyEnvironment in conjunction with ReverbAddEpisodeObserver where my environment has variable length episodes. Is there a timeline for addressing this bug? If not, would it be something I could contribute?

  • Keras model usage

    I'm trying to get a custom Keras model to work with the DDPG agent, but I'm having no luck. I looked at this question: https://github.com/tensorflow/agents/issues/457

    It did not work. Ideally I need to be able to save out the underlying Keras model for conversion; it is destined for an edge device that cannot run TensorFlow. Is there a tutorial on using Keras models?

    ValueError: actor_network output spec does not match action spec:
    TensorSpec(shape=(1,), dtype=tf.float32, name=None)
    vs.
    BoundedTensorSpec(shape=(), dtype=tf.float32, name='action', minimum=array(-1., dtype=float32), maximum=array(17.6, dtype=float32))
      In call to configurable 'ActorPolicy' (<class 'tf_agents.policies.actor_policy.ActorPolicy'>)
      In call to configurable 'DdpgAgent' (<class 'tf_agents.agents.ddpg.ddpg_agent.DdpgAgent'>)
    

    This is the model. Both the spec and model only have 1 action. I don't quite understand this error.

    input_1 (InputLayer)         [(None, 32)]              0         
    _________________________________________________________________
    dense (Dense)                (None, 64)                2112      
    _________________________________________________________________
    dense_1 (Dense)              (None, 16)                1040      
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 17        
    _________________________________________________________________
    
  • Value Error

    import random
    from abc import ABC
    from random import choice

    import numpy as np
    from tf_agents.environments import py_environment
    from tf_agents.specs import is_continuous
    from tf_agents.trajectories import time_step as ts, time_step

    class WarehouseEnv(py_environment.PyEnvironment, ABC):

    def __init__(self, col, row):
        super(WarehouseEnv).__init__()
        self.ARTICLE_DICT = dict(Cola=1, Sprite=2, Fanta=3, Cola_Zero=4, Cola_Light=5, Mezzo_Mix=6)
        # (ARTICLE_DICT)
        self.ARTICLE = list(self.ARTICLE_DICT.values())
        # print(ARTICLE)
    
        self.DEMAND_WAREHOUSE = [['1', '2', '3', '1', '1', '1'],
                                 ['1', '2'],
                                 ['3', '2', '1', '1'],
                                 ['4'],
                                 ['1', '6', '1']]
    
        self._episode_ended = False
        self.row = row
        self.col = col
        self._action_spec = is_continuous(spec=np.float64).BoundedArraySpec(shape=(1,), dtype=np.float64, minimum=0,
                                                                            maximum=1,
                                                                            name='action')
        self._observation_spec = is_continuous(spec=np.float64).BoundedArraySpec(shape=(25,), dtype=np.float64,
                                                                                 minimum=0,
                                                                                 maximum=25, name='observation')
        self.Supply_Warehouse = np.zeros([self.row, self.col], dtype=np.float64)
        print(self.Supply_Warehouse)
        self._state = self.Supply_Warehouse
    
    def _reset(self):
        self._state = self.Supply_Warehouse
        self._episode_ended = False
        return time_step.restart(np.array([self._state], dtype=np.float64))
    
    def action_spec(self):
        return self._action_spec
    
    def observation_spec(self):
        return self._observation_spec
    
    def add_to_supply_warehouse(self):
        for row in range(self.row):
            for col in range(self.col):
                if self.Supply_Warehouse[row][col] == 0:
                    if self.Supply_Warehouse[row - 1][col - 1] != 0:
                        self.Supply_Warehouse[row][col] += random.choice(list(self.ARTICLE_DICT.values()))
                        break
                return self._state
    
    def check_demand(self):
        for row in self._state:
            for col in self._state:
                if self._state[row][col] > self._state[:, 1] or self._state[row][col] < self._state[
                    row + 1, col + 1] == 0:
                    if self._state[row][col] == self.DEMAND_WAREHOUSE[:, :]:
                        return self._state and self.DEMAND_WAREHOUSE
    
    def remove_article(self):
        for row in range(self.row):
            for col in range(self.col):
                if self.check_demand is True:
                    self.DEMAND_WAREHOUSE[:, :] = 0
                    return self.Supply_Warehouse
    
    def _step(self, action):
        if self._episode_ended:
            return self.reset()
        # action = action.item()
        if action == 0:  # add article to the supply warehouse
            self.add_to_supply_warehouse()
            new_article = random.choice(list(self.ARTICLE_DICT.values()))
            self._state += new_article
            return ts.transition(np.array([self._state], dtype=np.float64), reward=1, discount=0.9)
        if all in self._state > 0:
            self._episode_ended = True
            return ts.termination(np.array([self._state], dtype=np.float64), reward=-9)
    
        if action == 1:  # Checks demand arbitrarily and remove article from supply warehouse and similar value in
            # Demand list becomes 0
            check = [self.add_to_supply_warehouse, self.check_demand]
            choice(check)()
            if self.check_demand is True:
                self.remove_article()
                return ts.transition(np.array([self._state], dtype=np.float64), reward=5, discount=0.9)
            elif len(self.DEMAND_WAREHOUSE) == 0:
                self._episode_ended = True
                return ts.termination(np.array([self._state], dtype=np.float64), reward=2)
            else:
                return ts.transition(np.array([self._state], dtype=np.float64), reward=0, discount=0.9)
    
        info = {}
        return self._observation_spec, self._reward, info
    
    from tf_agents.environments import tf_py_environment, utils
    from WarehouseEnv import WarehouseEnv

    python_environment = WarehouseEnv(5, 7)
    utils.validate_py_environment(python_environment, episodes=5)
    tf_env = tf_py_environment.TFPyEnvironment(python_environment)

    Description: I'm trying to design and simulate a warehouse environment where articles are added to the warehouse column-wise (a column is only filled once the previous column is filled). The demand list is checked, and if the article in the demand list (the article at index 0 of the demand list) matches an article in the warehouse (articles in columns towards the right side), then both values become 0 (i.e., the article is removed from the warehouse to satisfy the demand, and the matching value in the demand list is also cleared). The function that removes the article after a match is called remove_article. This cycle continues until the demand list is empty or the warehouse is full. Rewards are based on how I want the Env to behave.

    But I get a ValueError when I use a bounded (discrete) array:

    ValueError: Given `time_step`: TimeStep(
    {'discount': array(1., dtype=float32),
     'observation': array([[[0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0]]], dtype=int32),
     'reward': array(0., dtype=float32),
     'step_type': array(0, dtype=int32)}) does not match expected `time_step_spec`: TimeStep(
    {'discount': BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0),
     'observation': BoundedArraySpec(shape=(5, 5), dtype=dtype('int32'), name='observation', minimum=0, maximum=10),
     'reward': ArraySpec(shape=(), dtype=dtype('float32'), name='reward'),
     'step_type': ArraySpec(shape=(), dtype=dtype('int32'), name='step_type')})

    I then tried changing the action spec to continuous, which gives me this error:

      line 29, in __init__
        self._action_spec = is_continuous(spec=np.float64).BoundedArraySpec(shape=(1,), dtype=np.float64, minimum=0, ...
    AttributeError: 'bool' object has no attribute 'BoundedArraySpec'
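
    The AttributeError is raised because the is_continuous(...) call returns a boolean, and BoundedArraySpec is then looked up as an attribute of that boolean. An action spec is normally constructed directly from tf_agents.specs.array_spec.BoundedArraySpec; the shapes and bounds below are illustrative assumptions, not values taken from the original environment:

    import numpy as np
    from tf_agents.specs import array_spec

    # Discrete variant: one of two actions, 0 (add article) or 1 (check/remove demand).
    discrete_action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')

    # Continuous variant: a single float action bounded to [0.0, 1.0] (bounds are illustrative).
    continuous_action_spec = array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.float64, minimum=0.0, maximum=1.0, name='action')

    Whichever variant is chosen, the same object should be returned from action_spec() so that validation and the agent sample actions matching it.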

    I'm in desperate need of help. Please help!!!!

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms

Coach is a Python reinforcement learning framework containing implementations of many state-of-the-art algorithms. It exposes a set of easy-to-use …

Sep 20, 2022
Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".

SLM Lab: a modular Deep Reinforcement Learning framework in PyTorch. Documentation: https://slm-lab.gitbook.io/slm-lab/

Sep 24, 2022
Tensorforce: a TensorFlow library for applied reinforcement learning

Tensorforce is an open-source deep reinforcement learning framework …

Sep 23, 2022
TensorFlow Reinforcement Learning

TRFL (pronounced "truffle") is a library built on top of TensorFlow that exposes several useful building blocks for implementing Reinforcement Learning agents.

Sep 23, 2022
ChainerRL is a deep reinforcement learning library built on top of Chainer.

ChainerRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement algorithms in Python using Chainer.

Sep 22, 2022
A toolkit for developing and comparing reinforcement learning algorithms.

OpenAI Gym (status: maintenance; expect bug fixes and minor updates) is a toolkit for developing and comparing reinforcement learning algorithms.

Sep 26, 2022
An open source robotics benchmark for meta- and multi-task reinforcement learning

Meta-World is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks.

Sep 20, 2022
Doom-based AI Research Platform for Reinforcement Learning from Raw Visual Information. :godmode:

ViZDoom allows developing AI bots that play Doom using only the visual information (the screen buffer). It is primarily intended for research …

Sep 24, 2022
A toolkit for reproducible reinforcement learning research.

garage is a toolkit for developing and evaluating reinforcement learning algorithms, and an accompanying library of state-of-the-art implementations …

Sep 20, 2022
OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

OpenAI Baselines (status: maintenance; expect bug fixes and minor updates) is a set of high-quality implementations of reinforcement learning algorithms.

Sep 20, 2022
A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Stable Baselines is a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.

Sep 23, 2022
Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It aims to fill the need for a small, easily grokked codebase …

Sep 19, 2022
Deep Reinforcement Learning for Keras.

keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras.

Sep 26, 2022
Open world survival environment for reinforcement learning

Crafter is a procedurally generated 2D world where the agent finds food …

Sep 13, 2022
Rethinking the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning

MARL Tricks: our code for RIIT: Rethinking the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning. We implemented and standardized …

Sep 21, 2022
Paddle-RLBooks is a reinforcement learning code study guide based on pure PaddlePaddle.

Welcome to Paddle-RLBooks, a reinforcement learning code study guide based on pure PaddlePaddle.

Sep 10, 2022
A platform for Reasoning systems (Reinforcement Learning, Contextual Bandits, etc.)

ReAgent (Applied Reinforcement Learning @ Facebook) is an open-source end-to-end platform for applied reinforcement learning (RL) developed and used at Facebook.

Sep 27, 2022
Source code and data from the RecSys 2020 article "Carousel Personalization in Music Streaming Apps with Contextual Bandits" by W. Bendada, G. Salha and T. Bontempelli

Carousel Personalization in Music Streaming Apps with Contextual Bandits (RecSys 2020): this repository provides Python code and data to reproduce the experiments …

Sep 14, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine learning …

Sep 24, 2022