Tensorforce: a TensorFlow library for applied reinforcement learning


Introduction

Tensorforce is an open-source deep reinforcement learning framework, with an emphasis on a modular, flexible library design and straightforward usability for applications in research and practice. Tensorforce is built on top of Google's TensorFlow framework and requires Python 3.

Tensorforce follows a set of high-level design choices which differentiate it from other similar libraries:

  • Modular component-based design: Feature implementations, above all, strive to be as generally applicable and configurable as possible, potentially at some cost of faithfully resembling details of the introducing paper.
  • Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.
  • Full-on TensorFlow models: The entire reinforcement learning logic, including control flow, is implemented in TensorFlow, to enable portable computation graphs independent of application programming language, and to facilitate the deployment of models.


Installation

A stable version of Tensorforce is periodically published on PyPI and can be installed as follows:

pip3 install tensorforce

To always use the latest version of Tensorforce, install the GitHub version instead:

git clone https://github.com/tensorforce/tensorforce.git
pip3 install -e tensorforce

Environments require additional packages, for which setup options are available (ale, gym, retro, vizdoom, carla; or envs for all environments); however, some environments require additional tools that have to be installed separately (see the environments documentation). Further setup options include tfa for TensorFlow Addons and tune for HpBandSter, which is required for the tune.py script.
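For example, to install Tensorforce together with the OpenAI Gym environment dependencies (one of the setup options named above), the following should work:

pip3 install tensorforce[gym]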

Note on GPU usage: Unlike (un)supervised deep learning, RL does not always benefit from running on a GPU, depending on the environment and agent configuration. In particular, for environments with low-dimensional state spaces (i.e., no images), it is often worth trying to run on CPU only.
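If you want to force CPU-only execution in such cases, one generic option (a general TensorFlow environment variable, not a Tensorforce-specific setting) is to hide the GPU before TensorFlow is imported:

import os

# Hide all CUDA devices so TensorFlow (and hence Tensorforce) runs on CPU only;
# must be set before tensorflow/tensorforce is imported
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

from tensorforce import Agent, Environment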

Quickstart example code

from tensorforce import Agent, Environment

# Pre-defined or custom environment
environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

# Instantiate a Tensorforce agent
agent = Agent.create(
    agent='tensorforce',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    policy=dict(network='auto'),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20)
)

# Train for 300 episodes
for _ in range(300):

    # Initialize episode
    states = environment.reset()
    terminal = False

    while not terminal:
        # Episode timestep
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)

agent.close()
environment.close()
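To sanity-check the trained agent, a small evaluation loop can be added before the two close() calls above. The following is a minimal sketch using the act() arguments independent and deterministic so that no further training updates are triggered; it assumes the agent and environment from the quickstart are still open:

# Evaluate for 10 episodes without training updates
sum_rewards = 0.0
for _ in range(10):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True
        )
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward
print('Mean evaluation return:', sum_rewards / 10.0)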

Command line usage

Tensorforce comes with a range of example configurations for different popular reinforcement learning environments. For instance, to run Tensorforce's implementation of the popular Proximal Policy Optimization (PPO) algorithm on the OpenAI Gym CartPole environment, execute the following line:

python3 run.py --agent benchmarks/configs/ppo.json --environment gym \
    --level CartPole-v1 --episodes 100

For more information check out the documentation.

Features

  • Network layers: Fully-connected, 1- and 2-dimensional convolutions, embeddings, pooling, RNNs, dropout, normalization, and more; plus support of Keras layers.
  • Network architecture: Support for multi-state inputs and layer (block) reuse, simple definition of directed acyclic graph structures via register/retrieve layer, plus support for arbitrary architectures.
  • Memory types: Simple batch buffer memory, random replay memory.
  • Policy distributions: Bernoulli distribution for boolean actions, categorical distribution for (finite) integer actions, Gaussian distribution for continuous actions, Beta distribution for range-constrained continuous actions, multi-action support.
  • Reward estimation: Configuration options for estimation horizon, future reward discount, state/state-action/advantage estimation, and for whether to consider terminal and horizon states.
  • Training objectives: (Deterministic) policy gradient, state-(action-)value approximation.
  • Optimization algorithms: Various gradient-based optimizers provided by TensorFlow like Adam/AdaDelta/RMSProp/etc, evolutionary optimizer, natural-gradient-based optimizer, plus a range of meta-optimizers.
  • Exploration: Randomized actions, sampling temperature, variable noise.
  • Preprocessing: Clipping, deltafier, sequence, image processing.
  • Regularization: L2 and entropy regularization.
  • Execution modes: Parallelized execution of multiple environments based on Python's multiprocessing and socket.
  • Optimized act-only SavedModel extraction.
  • TensorBoard support.

By combining these modular components in different ways, a variety of popular deep reinforcement learning models/features can be replicated:

Note that, in general, the replication is not 100% faithful, since the models as described in the corresponding papers often involve additional minor tweaks and modifications which are hard to support with a modular design (and it is arguably questionable whether supporting them is important or desirable). On the upside, these models are just a few examples of the many module combinations supported by Tensorforce; for instance, see the sketch below.
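As an illustration, a DQN-style agent can be assembled roughly as follows (a sketch based on the same Agent.create API as the quickstart; it assumes an existing environment instance, and exact argument names may vary slightly between Tensorforce versions):

from tensorforce import Agent

# DQN-style configuration: replay memory, batched updates, random-action exploration
agent = Agent.create(
    agent='dqn',
    environment=environment,  # an Environment instance as in the quickstart above
    memory=50000,             # replay memory capacity
    batch_size=32,
    exploration=0.1           # probability of choosing a random action
)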

Environment adapters

  • Arcade Learning Environment, a simple object-oriented framework that allows researchers and hobbyists to develop AI agents for Atari 2600 games.
  • CARLA, an open-source simulator for autonomous driving research.
  • OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms which supports teaching agents everything from walking to playing games like Pong or Pinball.
  • OpenAI Retro, lets you turn classic video games into Gym environments for reinforcement learning and comes with integrations for ~1000 games.
  • OpenSim, reinforcement learning with musculoskeletal models.
  • PyGame Learning Environment, learning environment which allows a quick start to Reinforcement Learning in Python.
  • ViZDoom, allows developing AI bots that play Doom using only the visual information.

Support, feedback and donating

Please get in touch via mail or on Gitter if you have questions, feedback, ideas for features/collaboration, or if you seek support for applying Tensorforce to your problem.

If you want to support the Tensorforce core team (see below), please also consider donating: GitHub Sponsors or Liberapay.

Core team and contributors

Tensorforce is currently developed and maintained by Alexander Kuhnle.

Earlier versions of Tensorforce (<= 0.4.2) were developed by Michael Schaarschmidt, Alexander Kuhnle and Kai Fricke.

The advanced parallel execution functionality was originally contributed by Jean Rabault (@jerabaul29) and Vincent Belus (@vbelus). Moreover, the pretraining feature was largely developed in collaboration with Hongwei Tang (@thw1021) and Jean Rabault (@jerabaul29).

The CARLA environment wrapper is currently developed by Luca Anzalone (@luca96).

We are very grateful for our open-source contributors (listed according to GitHub, updated periodically):

Islandman93, sven1977, Mazecreator, wassname, lefnire, daggertye, trickmeyer, mkempers, mryellow, ImpulseAdventure, janislavjankov, andrewekhalel, HassamSheikh, skervim, beflix, coord-e, benelot, tms1337, vwxyzjn, erniejunior, Deathn0t, petrbel, nrhodes, batu, yellowbee686, tgianko, AdamStelmaszczyk, BorisSchaeling, christianhidber, Davidnet, ekerazha, gitter-badger, kborozdin, Kismuz, mannsi, milesmcc, nagachika, neitzal, ngoodger, perara, sohakes, tomhennigan.

Cite Tensorforce

Please cite the framework as follows:

@misc{tensorforce,
  author       = {Kuhnle, Alexander and Schaarschmidt, Michael and Fricke, Kai},
  title        = {Tensorforce: a TensorFlow library for applied reinforcement learning},
  howpublished = {Web page},
  url          = {https://github.com/tensorforce/tensorforce},
  year         = {2017}
}

If you use the parallel execution functionality, please additionally cite it as follows:

@article{rabault2019accelerating,
  title        = {Accelerating deep reinforcement learning strategies of flow control through a multi-environment approach},
  author       = {Rabault, Jean and Kuhnle, Alexander},
  journal      = {Physics of Fluids},
  volume       = {31},
  number       = {9},
  pages        = {094105},
  year         = {2019},
  publisher    = {AIP Publishing}
}

If you use Tensorforce in your research, you may additionally consider citing the following paper:

@article{lift-tensorforce,
  author       = {Schaarschmidt, Michael and Kuhnle, Alexander and Ellis, Ben and Fricke, Kai and Gessert, Felix and Yoneki, Eiko},
  title        = {{LIFT}: Reinforcement Learning in Computer Systems by Learning From Demonstrations},
  journal      = {CoRR},
  volume       = {abs/1808.07903},
  year         = {2018},
  url          = {http://arxiv.org/abs/1808.07903},
  archivePrefix = {arXiv},
  eprint       = {1808.07903}
}
Comments
  • Unable to train for many episodes: RAM usage too high!


    Hi @AlexKuhnle, I have some trouble training a ppo agent. Basically, I'm able to train it for only very few episodes (e.g. 4, 8). If I increase the number of episodes, my laptop will crash or freeze due to running out of RAM.

    I have a linux machine with 16GB of RAM. Tensorflow 2.1.0 (cpu-only) and Tensorforce 0.5.4. The agent I'm trying to train is defined as follows:

    policy_network = dict(type='auto', size=64, depth=2,
                          final_size=256, final_depth=2, internal_rnn=10)

    agent = Agent.create(
        agent='ppo',
        environment=environment,
        max_episode_timesteps=200,
        network=policy_network,
        # Optimization
        batch_size=4,
        update_frequency=1,
        learning_rate=1e-3,
        subsampling_fraction=0.5,
        optimization_steps=5,
        # Reward estimation
        likelihood_ratio_clipping=0.2, discount=0.99, estimate_terminal=False,
        # Critic
        critic_network='auto',
        critic_optimizer=dict(optimizer='adam', multi_step=10, learning_rate=1e-3),
        # Exploration
        exploration=0.0, variable_noise=0.0,
        # Regularization
        l2_regularization=0.0, entropy_regularization=0.0,
    )
    

    The environment is a custom one I've made: it has a complex state space (i.e. an image and some feature vectors), and a simple action space (i.e. five float actions).

    I use a Runner to train the agent:

    runner = Runner(agent, environment, max_episode_timesteps=200, num_parallel=None)
    runner.run(num_episodes=100)
    

    As you can see from the above code snippet, I'd like to train my agent for (at least) 100 episodes but the system crashes after completing episode 4.

    I noticed that, during training, every batch_size episodes (4 in my case) Tensorforce allocates an additional 6-7 GB of RAM, which causes the system to crash: my OS uses 1 GB, the environment simulator another 2-3 GB, and the agent plus environment a further 3-4 GB.

    This is what happens slightly before freezing (screenshot: memory_issue_tensorforce).

    Is this behaviour normal? Just to be sure, I tested a similar (but simpler) agent on the CartPole environment for 300 episodes and it works fine with very little memory overhead. How is that possible?

    Thank you in advance.

  • Custom network and layer freezing


    Hi,

    I want to build a custom environment in which an action would be a 2d matrix (basically a b/w image), and one of the solutions I found is to use a policy based algorithm such as PPO with the policy network having layers of deconvolutions (I would probably use a U-net).

    I first intended to use baselines, but I want the output of my network to match the action pixel-wise (the output at position (x,y) is used for the value at position (x,y) in the action), which I believe is not the case in the PPO2 implementation of baselines, where a fully-connected layer turns the output of the network into the parameters of a probability distribution from which the action is sampled.

    Would it be possible to simply write the U-net architecture as a dictionary in your implementation and have it work as I want, given that the action space and network output shape match, or am I missing something?

    Also, is it possible to freeze layers of the network, and/or use a pre-trained network?

    I read through the documentation, but some of my questions probably have an obvious answer somewhere in the repo; sorry for that!

  • Quickstart example get stuck [GPU]


    Hi,

    I just installed tensorforce (from pip) with tensorflow-gpu 1.7 and tried to run example/quickstart.py. The training starts but then gets stuck after n episodes, where n is the minimum of the batch_size and frequency values of the update_mode argument of PPOAgent.

    update_mode=dict(
        unit='episodes',
        # 20 episodes per update
        batch_size=20,
        # Every 20 episodes
        frequency=20
    ),
    

    No error message is displayed, it just hangs forever. Has anyone experienced something similar?

    Thanks,

  • tf2 branch: unable to use "saved_model"


    Hi,

    I've started to look at the saved_model export in the tf2 branch and I face some issues: First, I had to change tensorforce/core/utils/dicts.py, line 121 to accept all data types - it seems that tensorflow tries to rebuild dictionaries in the process: value_type=(tf.IndexedSlices, tf.Tensor, tf.Variable, object)

    Then, in tensorforce/core/models/model.py line 678, I got errors caused by the signature: ValueError: Got non-flat outputs '(TensorDict(main_sail_angle=Tensor("StatefulPartitionedCall:1", shape=(None,), dtype=float32), jib_angle=Tensor("StatefulPartitionedCall:0", shape=(None,), dtype=float32), rudder_angle=Tensor("StatefulPartitionedCall:2", shape=(None,), dtype=float32)), TensorDict())' from 'b'__inference_function_graph_2203'' for SavedModel signature 'serving_default'. Signatures have one Tensor per output, so to have predictable names Python functions used to generate these signatures should avoid outputting Tensors in nested structures.

    I tried to remove the signature in the saved_model.save call, and then I ran into trouble with tensorforce/core/module.py: the tf_function helper builds function graphs whose keys are tuples, which tensorflow doesn't like. I converted them to strings and could then save a file, but it's totally unusable.

    So I'm stuck here and would need more help: what is tf_function doing exactly? Why don't you use tf.function instead?

    Thanks! Ben

  • Error: Invalid Gradient


    Hi! I got this error during the training which I never saw before. Could you please help me with it?

    Thank you very much! Zebin Li

    Traceback (most recent call last):
      File "C:/Users/lizeb/Box/research projects/active learning for image classification/code/run_manytimes_RAL_AL_samedata_0522.py", line 18, in <module>
        performance_history_RL, performance_history_RL_test, performance_history_AL, performance_history_AL_test, test_RL, test_AL, rewards = runmanytimes()
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\RAL_AL_samedata_0522.py", line 407, in runmanytimes
        agent.observe(terminal=terminal, reward=reward)
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorforce\agents\agent.py", line 510, in observe
        updated, episodes, updates = self.model.observe(
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorforce\core\module.py", line 128, in decorated
        output_args = function_graphsstr(graph_params)
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in call
        result = self._call(*args, **kwds)
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 814, in _call
        results = self._stateful_fn(*args, **kwds)
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in call
        return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1843, in _filtered_call
        return self._call_flat(
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1923, in _call_flat
        return self._build_call_outputs(self._inference_function.call(
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\function.py", line 545, in call
        outputs = execute.execute(
      File "C:\Users\lizeb\Box\research projects\active learning for image classification\code\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
        tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid gradient: contains inf or nan. : Tensor had NaN values
      [[{{node agent/StatefulPartitionedCall/agent/cond_1/then/_262/agent/cond_1/StatefulPartitionedCall/agent/StatefulPartitionedCall_5/policy_optimizer/StatefulPartitionedCall/policy_optimizer/StatefulPartitionedCall/policy_optimizer/while/body/_1185/policy_optimizer/while/StatefulPartitionedCall/policy_optimizer/cond/then/_1464/policy_optimizer/cond/StatefulPartitionedCall/policy_optimizer/VerifyFinite/CheckNumerics}}]] [Op:__inference_observe_5103]

    Function call stack: observe

  • Some questions about tensorforce


    Hi, thanks for your great work. But when I read the docs I have some questions about this framework.

    Q1: How does the network update? Does agent.observe(terminal=terminal, reward=reward) collect experiences until the number of timesteps/episodes specified in update_model is reached?

    Q2: Is the output layer of the network defined automatically when we define an agent? For example, if I define a DQNAgent with three actions to choose from, do I still need to define the last layer as dict(type='dense', size=3, activation='softmax')?

    Q3: DQNAgent needs to collect [St, a, r, St+1]; in the following example:

    while True:
        state2=f(state)  
        action = agent.act(states=state2)
        action2=g(action) 
        state, reward, terminal = environment.execute(actions=action2)
        agent.observe(reward=reward, terminal=terminal)
    

    Does it collect [state2, action2, r, state2'] or [state, action, r, state']?
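    (For reference, a minimal way to reason about this from the snippet itself: the agent only ever receives the values passed to act() and observe(), so what it can record is state2 and the action it sampled, never the raw state or the transformed action2.)

        # What the agent can store is exactly what crosses its API:
        action = agent.act(states=state2)          # the agent sees state2, not state
        state, reward, terminal = environment.execute(actions=action2)  # action2 never reaches the agent
        agent.observe(reward=reward, terminal=terminal)  # transition built from (state2, action, reward, ...)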

    Q4: How can I output the training loss?

    Actually, I use a DQNAgent for a robot navigation task. The input is a compressed image, the goal, and the previous action; the output is one of three actions (forward, left, right). The agent is defined as:

    network_spec = [
        dict(type='dense', size=128, activation='relu'),
        dict(type='dense', size=32, activation='relu')
    ]
    
    memory = dict(
        type='replay',
        include_next_states=True,
        capacity=10000
    )
    
    exploration = dict(
        type='epsilon_decay',
        initial_epsilon=1.0,
        final_epsilon=0.1,
        timesteps=10000,
        start_timestep=0
    )
    
    update_model = dict(
        unit='timesteps',
        batch_size=64,
        frequency=64
    )
    
    optimizer = dict(
        type='adam',
        learning_rate=0.0001
    )
    
    agent = DQNAgent(
        states=dict(shape=(36,), type='float'), 
        actions=dict(shape=(3,), type='int'), 
        network=network_spec,
        update_mode=update_model,
        memory=memory,
        actions_exploration=exploration,
        optimizer=optimizer,
        double_q_model=True
    )
    

    Because I need to turn the captured image into a compressed vector as part of the state, I run an episode as follows rather than using the runner.

        while True:
            compressed_image = compress_image(observation)   # map the capture image to a 32-dim vector
            goal = env.goal   # shape(2, )
            pre_action = action  # shape(2, )
            state = compressed_image + goal + pre_action
            action = agent.act(state)
            observation, terminal, reward = env.execute(action)
            agent.observe(terminal=terminal, reward=reward)
            timestep += 1
            episode_reward += reward
            if terminal or timestep == max_timesteps:
                success = env.success
                break
    

    Can it work? I have trained for a long time but the result is not ideal, so I want to know if I am using tensorforce correctly. Thank you!

  • [silent BUG] Saving/Restoring/Seeding PPO model when action_spec has multiple actions


    I'm still a novice with tensorforce. I'm trying to save my PPO agent after training. The agent trains well, but when I save the model, stop the program, relaunch it, and restore the model, the agent's performance is as if it were starting from scratch, whereas it was working well before.

    To save/restore I use:

    agent.save_model(directory=directory)
    agent.restore_model(directory=directory)
    

    I have checked, using:

    tf_weights = agent.model.network.get_variables()
    weights = agent.model.session.run(tf_weights)
    print(weights)
    

    that the saved weights are correctly restored.

    I tried to set a seed using tf.set_random_seed(42) at the beginning of my program in the hope of obtaining reproducible results (my env is fully deterministic), but across two sequential launches from the same restored weights, I get different actions for the same input.

    First run first action after restore :

    input : 
    [[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
       0.05  0.    0.    0.    0.05  0.    0.    0.    0.05]]
    action : 
    [-1.65043855 -0.12582253  0.33019719 -0.42400551  0.39128172 -0.1892394
     -1.38783872 -0.84797424 -0.76125687 -0.44233581  0.2647087   0.57517719]
    

    Second run first action after restore :

    input : 
    [[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
       0.05  0.    0.    0.    0.05  0.    0.    0.    0.05]]
    action : 
    [ 0.00452828  1.70186901  0.18290332  0.1153125   0.80178595 -1.31738091
      0.2404308  -0.16986398 -1.69459999  2.09507513 -0.46165684 -0.34024456]
    

    I have disabled exploration and created the agent with:

    layerSize = 300
    actions = {}
    for i in range(12):
        actions[str(i)] = {'type': 'float'}
    network_spec = [
        dict(type='dense', size=layerSize, activation='selu'),
        dict(type='dense', size=layerSize, activation='selu'),
        dict(type='dense', size=layerSize, activation='selu')
    ]
    agent = PPOAgent(
        states=dict(type='float', shape=(12+9,)),
        actions=actions,
        batching_capacity=1000,
        network=network_spec,
        states_preprocessing=None,
        entropy_regularization=1e-3,
        actions_exploration=None,
        step_optimizer=dict(
            type='adam',
            learning_rate=1e-5
        ),
    )
    

    Are there some extra parameters which need to be saved when saving a PPO agent? (Maybe the parameters of the last layer, which are used to generate the mean and variance of the Gaussians needed to generate the continuous actions.)

    tensorforce.__version__
    '0.4.2'
    

    Thanks

  • What is the output of Agent Neural Network? If there is a std, can we fix it manually?


    Hi there, I'm curious about the output of the actor NN. In RL, the action is obtained by sampling from the output distribution of the actor NN. Therefore, the output of the actor NN must include something like a mean and standard deviation if it is a Gaussian. We could also fix the std and let the NN give us only the mean. What is the setting in your library? Can we change it manually?

    Besides, when we create the agent, we only provide the max and min values of the actions. How do you choose the action if the sampled action falls outside this range? Do you select the boundary value or shrink the distribution?

    Help appreciated!

  • Network Spec / Layers Documentation


    First of all, hello! I'm glad to have discovered this project and am planning on trying to use it.

    As for my question - I am unable to find any documentation describing what each of the layers is, what it does, and what its parameters are. Have I missed it, or is it nonexistent? If it doesn't exist, I'd be happy to add some.

  • InvalidArgumentError on terminal observe call


    Perhaps something is wrong with my code, but almost half the time when the episode ends, I get an assertion error when I run observe on my PPO agent:

    Traceback (most recent call last):
      File "ll.py", line 208, in <module>
        main()
      File "ll.py", line 181, in main
        agent.give_reward(reward, terminal)
      File "ll.py", line 123, in give_reward
        self.agent.observe(reward=reward, terminal=terminal)
      File "c:\users\connor\desktop\tensorforce\tensorforce\agents\agent.py", line 534, in observe
        terminal=terminal, reward=reward, parallel=[parallel], **kwargs
      File "c:\users\connor\desktop\tensorforce\tensorforce\core\module.py", line 578, in fn
        fetched = self.monitored_session.run(fetches=fetches, feed_dict=feed_dict)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\training\monitored_session.py", line 754, in run
        run_metadata=run_metadata)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\training\monitored_session.py", line 1360, in run
        raise six.reraise(*original_exc_info)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\six.py", line 696, in reraise
        raise value
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\training\monitored_session.py", line 1345, in run
        return self._sess.run(*args, **kwargs)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\training\monitored_session.py", line 1418, in run
        run_metadata=run_metadata)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\training\monitored_session.py", line 1176, in run
        return self._sess.run(*args, **kwargs)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
        run_metadata_ptr)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
        feed_dict_tensor, options, run_metadata)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
        run_metadata)
      File "C:\Users\Connor\Anaconda3\envs\ll\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x <= y did not hold element-wise:x (baseline-network-state.observe/baseline-network-state.core_observe/baseline-network-state.core_experience/memory.enqueue/strided_slice:0) = ] [18243] [y (baseline-network-state.observe/baseline-network-state.core_observe/baseline-network-state.core_experience/memory.enqueue/sub_2:0) = ] [17999]
             [[{{node Assert}}]]
    

    My original theory was that I was accidentally calling observe again after setting terminal=True and before resetting the agent, or some other misuse of observe, but I prevented that from happening in my code, so I don't believe that's the case. Also, the episode runs completely fine, and I get through thousands of calls to observe without ever running into any issues. It's only when terminal=True that it seems to occur.

    Running on Windows 10 x64, with tensorflow-gpu v2.0.0 on an RTX 2070, Tensorforce installed from GitHub at commit 827febcf8ffda851e5e4f0c9d12d4a8e8502b282.

  • Configuration refactoring - thoughts and suggestions welcome!


    Configuration has been a topic of discussion for quite some time now, so I thought it'd be a good idea to get all those thoughts in one place and hopefully solicit some user thoughts as well.

    From my understanding, the current purposes of configs are:

    1. make it easy for people to get up and running
    2. get all parameters in one place for ease of setting up experiments and making them interpretable
    3. keep signatures simple, which makes it easy to create arbitrary things from one big blob

    The current issues I'm having with configs are:

    1. they are somewhere between a dictionary and a blob-object, which makes them confusing
    2. they aren't serializable, so I have to create config wrappers around Configurations. Eww.
    3. defaults and unused parameters make it challenging to know what's really being used.

    My personal opinion is that we can keep benefits (1) and (2) above and get rid of all three issues in exchange for the small sacrifice of benefit (3). In fact, I don't know how much of a benefit (3) really is, as it obfuscates the true parameters of all objects in the codebase.

    I would propose doing so by putting the burden of creating parameters and passing them into the constructors of objects onto the user. Any intermediate user will have experience creating parameter/config generation wrappers. Less experienced users who want to get up and running quickly can still use the same JSON objects you've already written, with something like this when actually creating the objects downstream:

    SomeObject(config_dict['some_key'], config_dict['another_key'])

    Users who want to create defaults can do this:

    SomeObject(config_dict.get('some_key', default_value), config_dict.get('another_key', another_default_value))

    Which to me is a much clearer way of handling defaults.
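    A slightly fuller sketch of the same pattern (SomeObject and the keys here are placeholders, not actual Tensorforce classes):

    import json

    # Placeholder class standing in for any component that takes plain constructor arguments
    class SomeObject:
        def __init__(self, some_key, another_key=None):
            self.some_key = some_key
            self.another_key = another_key

    # A JSON config stays a plain, serializable dict
    config_dict = json.loads('{"some_key": 1}')

    # Defaults are decided explicitly at the call site
    obj = SomeObject(
        config_dict['some_key'],
        config_dict.get('another_key', 'default-value')
    )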

    The side benefit of all of this is that you aren't stuck supporting configurations for users. Configuration and deployment are two challenging parts of any project, and I personally would prefer everyone's time be spent on RL, not on trying to solve a fundamental CS issue (configuration!) that in the end is always problem-specific and, no matter how hard we try, never suits everyone's needs.

  • Bump mistune from 0.8.4 to 2.0.3 in /docs


    Bumps mistune from 0.8.4 to 2.0.3.

    Release notes

    Sourced from mistune's releases.

    Version 2.0.2

    Fix escape_url via lepture/mistune#295

    Version 2.0.1

    Fix XSS for image link syntax.

    Version 2.0.0

    First release of Mistune v2.

    Version 2.0.0 RC1

    In this release, we have a Security Fix for harmful links.

    Version 2.0.0 Alpha 1

    This is the first release of v2. An alpha version for users to have a preview of the new mistune.

    Changelog

    Sourced from mistune's changelog.

    Changelog

    Here is the full history of mistune v2.

    Version 2.0.4

    
    Released on Jul 15, 2022
    
    • Fix url plugin in <a> tag
    • Fix * formatting

    Version 2.0.3

    Released on Jun 27, 2022

    • Fix table plugin
    • Security fix for CVE-2022-34749

    Version 2.0.2

    
    Released on Jan 14, 2022
    

    Fix escape_url

    Version 2.0.1

    Released on Dec 30, 2021

    XSS fix for image link syntax.

    Version 2.0.0

    
    Released on Dec 5, 2021
    

    This is the first non-alpha release of mistune v2.

    Version 2.0.0rc1

    Released on Feb 16, 2021

    Version 2.0.0a6

    
    

    ... (truncated)

    Commits
    • 3f422f1 Version bump 2.0.3
    • a6d4321 Fix asteris emphasis regex CVE-2022-34749
    • 5638e46 Merge pull request #307 from jieter/patch-1
    • 0eba471 Fix typo in guide.rst
    • 61e9337 Fix table plugin
    • 76dec68 Add documentation for renderer heading when TOC enabled
    • 799cd11 Version bump 2.0.2
    • babb0cf Merge pull request #295 from dairiki/bug.escape_url
    • fc2cd53 Make mistune.util.escape_url less aggressive
    • 3e8d352 Version bump 2.0.1
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

  • parallel processing fails when preprocessing is Sequence or Deltafier and batch_agent_calls=True


    When I have code like this:

    class PongRamEnvironment(Environment):

        def __init__(self):
            self.base_env = Environment.create(
                environment='gym', level='Pong-ram-v4', max_episode_timesteps=10000)
            super().__init__()

        def states(self):
            return {'type': 'float', 'shape': (4,), 'min_value': 0.0, 'max_value': 255.0}

        def actions(self):
            return {'type': 'int', 'shape': (), 'num_values': 2}

        def preprocess_state(self, state):
            # cpu_score = ram[13]  # computer/ai opponent score
            # player_score = ram[14]  # your score
            cpu_paddle_y = state[21]     # Y coordinate of computer paddle
            player_paddle_y = state[51]  # Y coordinate of your paddle
            ball_x = state[49]           # X coordinate of ball
            ball_y = state[54]           # Y coordinate of ball
            obs = np.array([cpu_paddle_y, player_paddle_y, ball_x, ball_y], dtype=np.float32)
            return obs

        # Optional: should only be defined if environment has a natural fixed
        # maximum episode length; restrict training timesteps via
        #     Environment.create(..., max_episode_timesteps=???)
        def max_episode_timesteps(self):
            return super().max_episode_timesteps()

        # Optional additional steps to close environment
        def close(self):
            super().close()

        def reset(self):
            state = self.base_env.reset()
            state = self.preprocess_state(state)
            return state

        def execute(self, actions):
            # actions mapped
            actions = {0: 2, 1: 3}[actions]
            next_state, terminal, reward = self.base_env.execute(actions)
            next_state = self.preprocess_state(next_state)
            return next_state, terminal, reward

    DQN agent specification

    agent = dict(
        agent='dqn',
        # Automatically configured network
        network=dict(type='auto', size=64, depth=1),
        # Parameters
        memory=20000,
        batch_size=32,
        # Reward estimation
        discount=0.99,
        predict_terminal_values=False,
        state_preprocessing=[
            dict(type='deltafier', concatenate=0),
            dict(type='linear_normalization')
        ],
        reward_processing=None,
        # Regularization
        l2_regularization=0.0,
        entropy_regularization=0.0,
        # Exploration
        exploration=0.1,
        variable_noise=0.0,
        # Default additional config values
        config=None,
        # Save agent every 20 updates and keep the 5 most recent checkpoints
        saver=dict(directory='model_ram', frequency=20, max_checkpoints=5),
        # Log all available Tensorboard summaries
        summarizer=dict(directory='summaries_ram', summaries='all'),
        # Do record agent-environment interaction trace
        # recorder=dict(directory='record')
        max_episode_timesteps=10000
    )

    runner = Runner(agent=agent, environment=dict(environment=PongRamEnvironment), num_parallel=4, max_episode_timesteps=10000)

    runner.run(num_episodes=10, batch_agent_calls=True)
    runner.close()

    I get the following error:


    InvalidArgumentError                      Traceback (most recent call last)
    Input In [139], in <cell line: 4>()
          1 runner = Runner(agent=agent, environment=dict(environment=PongRamEnvironment), num_parallel=4, max_episode_timesteps=10000)
          3 # Train for 200 episodes
    ----> 4 runner.run(num_episodes=10,batch_agent_calls=True)
          5 runner.close()

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/execution/runner.py:604, in Runner.run(self, num_episodes, num_timesteps, num_updates, batch_agent_calls, sync_timesteps, sync_episodes, num_sleep_secs, callback, callback_episode_frequency, callback_timestep_frequency, use_tqdm, mean_horizon, evaluation, save_best_agent, evaluation_callback)
        601 self.terminals[n] = self.prev_terminals[n]
        603 self.handle_observe_joint()
    --> 604 self.handle_act_joint()
        606 # Parallel environments loop
        607 no_environment_ready = True

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/execution/runner.py:726, in Runner.handle_act_joint(self)
        724 if len(parallel) > 0:
        725     agent_start = time.time()
    --> 726     self.actions = self.agent.act(
        727         states=[self.states[p] for p in parallel], parallel=parallel
        728     )
        729     agent_second = (time.time() - agent_start) / len(parallel)
        730     for p in parallel:

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/agents/agent.py:415, in Agent.act(self, states, internals, parallel, independent, deterministic, evaluation)
        410 if evaluation is not None:
        411     raise TensorforceError.deprecated(
        412         name='Agent.act', argument='evaluation', replacement='independent'
        413     )
    --> 415 return super().act(
        416     states=states, internals=internals, parallel=parallel, independent=independent,
        417     deterministic=deterministic
        418 )

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/agents/recorder.py:262, in Recorder.act(self, states, internals, parallel, independent, deterministic, **kwargs)
        260 # fn_act()
        261 if self._is_agent:
    --> 262     actions, internals = self.fn_act(
        263         states=states, internals=internals, parallel=parallel, independent=independent,
        264         deterministic=deterministic, is_internals_none=is_internals_none,
        265         num_parallel=num_parallel
        266     )
        267 else:
        268     if batched:

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/agents/agent.py:462, in Agent.fn_act(self, states, internals, parallel, independent, deterministic, is_internals_none, num_parallel)
        460 # Model.act()
        461 if not independent:
    --> 462     actions, timesteps = self.model.act(
        463         states=states, auxiliaries=auxiliaries, parallel=parallel
        464     )
        465     self.timesteps = timesteps.numpy().item()
        467 elif len(self.internals_spec) > 0:

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/core/module.py:136, in tf_function.<locals>.decorator.<locals>.decorated(self, _initialize, *args, **kwargs)
        134 # Apply function graph
        135 with self:
    --> 136     output_args = function_graphsstr(graph_params)
        137 if not is_loop_body:
        138     return output_signature.args_to_kwargs(
        139         args=output_args, outer_tuple=True, from_dict=dict_interface
        140     )

    File ~/anaconda3/envs/master/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
        151 except Exception as e:
        152     filtered_tb = _process_traceback_frames(e.__traceback__)
    --> 153     raise e.with_traceback(filtered_tb) from None
        154 finally:
        155     del filtered_tb

    File ~/anaconda3/envs/master/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
         52 try:
         53     ctx.ensure_initialized()
    ---> 54     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
         55                                         inputs, attrs, num_outputs)
         56 except core._NotOkStatusException as e:
         57     if name is not None:
    InvalidArgumentError: Graph execution error:

  • 🐛: TensorforceError has no attribute 'dtype'. So fixed typo TensorforceError.dtype -> TensorforceError.type (Bugfix)


    Hi developers!

    As commented in issue #879, TensorforceError has no attribute 'dtype', but it does have a 'type' method, so this PR fixes the typo below:

    https://github.com/tensorforce/tensorforce/blob/868d12d7db655c816dc1439dd7826eef06d6ef0e/tensorforce/core/parameters/decaying.py#L136-L145

    https://github.com/tensorforce/tensorforce/blob/868d12d7db655c816dc1439dd7826eef06d6ef0e/tensorforce/core/parameters/constant.py#L35-L44

    I really appreciate your time and effort.

  • Parameters are not working


    When running:

    from tensorforce.core.parameters import Linear
    Linear(unit='episodes', num_steps=100000, initial_value=1.0, final_value=0.05)

    I get:


    AttributeError                            Traceback (most recent call last)
    Input In [2], in <cell line: 1>()
    ----> 1 Linear(unit='episodes', num_steps=100000, initial_value=1.0, final_value=0.05)

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/core/parameters/linear.py:42, in Linear.__init__(self, unit, num_steps, initial_value, final_value, name, dtype, min_value, max_value)
         38 def __init__(
         39     self, *, unit, num_steps, initial_value, final_value, name=None, dtype=None, min_value=None,
         40     max_value=None
         41 ):
    ---> 42     super().__init__(
         43         decay='linear', unit=unit, num_steps=num_steps, initial_value=initial_value, name=name,
         44         dtype=dtype, min_value=min_value, max_value=max_value, final_value=final_value
         45     )

    File ~/Desktop/Semantic-Reasoning-in-Reinforcement-Learning/tensorforce/tensorforce/core/parameters/decaying.py:143, in Decaying.__init__(self, decay, unit, num_steps, initial_value, increasing, inverse, scale, name, dtype, min_value, max_value, **kwargs)
        141 elif isinstance(initial_value, float):
        142     if dtype != 'float':
    --> 143         raise TensorforceError.dtype(
        144             name='Decaying', argument='initial_value', dtype=type(initial_value)
        145         )
        146 else:
        147     raise TensorforceError.unexpected()

    AttributeError: type object 'TensorforceError' has no attribute 'dtype'

    I am using Tensorforce version 0.6.5.

  • Having "saver" in the agent json throws "Checkpoint does not exist"

    Having "saver" in the agent json throws "Checkpoint does not exist"

    Hi,

    When I define the agent configuration through a JSON file (e.g., ppo.json) and define a "saver" entry inside the JSON to have the intermediate models saved, it throws an error saying: tensorforce.exception.TensorforceError: Checkpoint does not exist: model.

    However, when I define the exact same properties through a "dict" inside the code itself, everything seems OK: the simulation starts and intermediate models are correctly saved.

    Is there something wrong here, or am I doing something incorrectly? (I am relatively new to this code.)

    I appreciate any help you can provide. Saeed

  • Fixed bug communications larger than MAX_BYTES


    After the first iteration, str_result should have length cls.MAX_BYTES; however, since n = 0, the checks return an error even if the recv operation was successful. With this fix, we check that the received string has the correct length.
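    A minimal sketch of the kind of length-checked receive loop described here (the names recv_exactly and MAX_BYTES are illustrative placeholders, not the actual Tensorforce code):

    import socket

    MAX_BYTES = 4096  # stands in for cls.MAX_BYTES

    def recv_exactly(conn: socket.socket, num_bytes: int = MAX_BYTES) -> bytes:
        # Keep calling recv() until the full message has arrived, instead of
        # assuming a single recv() call returns everything at once.
        chunks = []
        received = 0
        while received < num_bytes:
            chunk = conn.recv(num_bytes - received)
            if not chunk:  # connection closed before the full message arrived
                raise ConnectionError(f'socket closed after {received} of {num_bytes} bytes')
            chunks.append(chunk)
            received += len(chunk)
        return b''.join(chunks)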
