Deep Reinforcement Learning Hands-On - Second Edition

Book description

Revised and expanded to include multi-agent methods, discrete optimization, RL in robotics, advanced exploration techniques, and more

Key Features

  • Second edition of the bestselling introduction to deep reinforcement learning, expanded with six new chapters
  • Learn advanced exploration techniques including noisy networks, pseudo-count, and network distillation methods
  • Apply RL methods to cheap hardware robotics platforms

Deep Reinforcement Learning Hands-On, Second Edition is an updated and expanded version of the bestselling guide to the very latest reinforcement learning (RL) tools and techniques. It provides you with an introduction to the fundamentals of RL, along with the hands-on ability to code intelligent learning agents to perform a range of practical tasks.

With six new chapters devoted to a variety of up-to-the-minute developments in RL, including discrete optimization (solving the Rubik's Cube), multi-agent methods, Microsoft's TextWorld environment, advanced exploration techniques, and more, you will come away from this book with a deep understanding of the latest innovations in this emerging field.

In addition, you will gain actionable insights into topics such as deep Q-networks, policy gradient methods, continuous control problems, and highly scalable, non-gradient methods. You will also discover how to build a real hardware robot trained with RL for less than $100 and how to solve the Pong environment in just 30 minutes of training using step-by-step code optimization.

In short, Deep Reinforcement Learning Hands-On, Second Edition is your companion for navigating the exciting complexities of RL, helping you gain experience and knowledge through real-world examples.

What you will learn

  • Understand the deep learning context of RL and implement complex deep learning models
  • Evaluate RL methods including cross-entropy, DQN, actor-critic, TRPO, PPO, DDPG, D4PG, and others
  • Build a practical hardware robot trained with RL methods for less than $100
  • Discover Microsoft's TextWorld environment, an interactive fiction games platform
  • Use discrete optimization in RL to solve a Rubik's Cube
  • Teach your agent to play Connect 4 using AlphaGo Zero
  • Explore the very latest deep RL research on topics including AI chatbots
  • Discover advanced exploration techniques, including noisy networks and network distillation techniques

Who this book is for

This book is an introduction to deep RL and requires no prior background in RL. Some fluency in Python is assumed, and a sound understanding of the fundamentals of deep learning will be helpful.

Table of contents

  1. Preface
    1. Why I wrote this book
    2. The approach
    3. Who this book is for
    4. What this book covers
    5. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    6. Get in touch
      1. Reviews
  2. What Is Reinforcement Learning?
    1. Supervised learning
    2. Unsupervised learning
    3. Reinforcement learning
    4. RL's complications
    5. RL formalisms
      1. Reward
      2. The agent
      3. The environment
      4. Actions
      5. Observations
    6. The theoretical foundations of RL
      1. Markov decision processes
        1. The Markov process
        2. Markov reward processes
        3. Adding actions
      2. Policy
    7. Summary
  3. OpenAI Gym
    1. The anatomy of the agent
    2. Hardware and software requirements
    3. The OpenAI Gym API
      1. The action space
      2. The observation space
      3. The environment
      4. Creating an environment
      5. The CartPole session
    4. The random CartPole agent
    5. Extra Gym functionality – wrappers and monitors
      1. Wrappers
      2. Monitor
    6. Summary
  4. Deep Learning with PyTorch
    1. Tensors
      1. The creation of tensors
      2. Scalar tensors
      3. Tensor operations
      4. GPU tensors
    2. Gradients
      1. Tensors and gradients
    3. NN building blocks
    4. Custom layers
    5. The final glue – loss functions and optimizers
      1. Loss functions
      2. Optimizers
    6. Monitoring with TensorBoard
      1. TensorBoard 101
      2. Plotting stuff
    7. Example – GAN on Atari images
    8. PyTorch Ignite
      1. Ignite concepts
    9. Summary
  5. The Cross-Entropy Method
    1. The taxonomy of RL methods
    2. The cross-entropy method in practice
    3. The cross-entropy method on CartPole
    4. The cross-entropy method on FrozenLake
    5. The theoretical background of the cross-entropy method
    6. Summary
  6. Tabular Learning and the Bellman Equation
    1. Value, state, and optimality
    2. The Bellman equation of optimality
    3. The value of the action
    4. The value iteration method
    5. Value iteration in practice
    6. Q-learning for FrozenLake
    7. Summary
  7. Deep Q-Networks
    1. Real-life value iteration
    2. Tabular Q-learning
    3. Deep Q-learning
      1. Interaction with the environment
      2. SGD optimization
      3. Correlation between steps
      4. The Markov property
      5. The final form of DQN training
    4. DQN on Pong
      1. Wrappers
      2. The DQN model
      3. Training
      4. Running and performance
      5. Your model in action
    5. Things to try
    6. Summary
  8. Higher-Level RL Libraries
    1. Why RL libraries?
    2. The PTAN library
      1. Action selectors
      2. The agent
        1. DQNAgent
        2. PolicyAgent
      3. Experience source
        1. Toy environment
        2. The ExperienceSource class
        3. ExperienceSourceFirstLast
      4. Experience replay buffers
      5. The TargetNet class
      6. Ignite helpers
    3. The PTAN CartPole solver
    4. Other RL libraries
    5. Summary
  9. DQN Extensions
    1. Basic DQN
      1. Common library
      2. Implementation
      3. Results
    2. N-step DQN
      1. Implementation
      2. Results
    3. Double DQN
      1. Implementation
      2. Results
    4. Noisy networks
      1. Implementation
      2. Results
    5. Prioritized replay buffer
      1. Implementation
      2. Results
    6. Dueling DQN
      1. Implementation
      2. Results
    7. Categorical DQN
      1. Implementation
      2. Results
    8. Combining everything
      1. Results
    9. Summary
    10. References
  10. Ways to Speed up RL
    1. Why speed matters
    2. The baseline
    3. The computation graph in PyTorch
    4. Several environments
    5. Play and train in separate processes
    6. Tweaking wrappers
    7. Benchmark summary
    8. Going hardcore: CuLE
    9. Summary
    10. References
  11. Stocks Trading Using RL
    1. Trading
    2. Data
    3. Problem statements and key decisions
    4. The trading environment
    5. Models
    6. Training code
    7. Results
      1. The feed-forward model
      2. The convolution model
    8. Things to try
    9. Summary
  12. Policy Gradients – an Alternative
    1. Values and policy
      1. Why the policy?
      2. Policy representation
      3. Policy gradients
    2. The REINFORCE method
      1. The CartPole example
      2. Results
      3. Policy-based versus value-based methods
    3. REINFORCE issues
      1. Full episodes are required
      2. High gradients variance
      3. Exploration
      4. Correlation between samples
    4. Policy gradient methods on CartPole
      1. Implementation
      2. Results
    5. Policy gradient methods on Pong
      1. Implementation
      2. Results
    6. Summary
  13. The Actor-Critic Method
    1. Variance reduction
    2. CartPole variance
    3. Actor-critic
    4. A2C on Pong
    5. A2C on Pong results
    6. Tuning hyperparameters
      1. Learning rate
      2. Entropy beta
      3. Count of environments
      4. Batch size
    7. Summary
  14. Asynchronous Advantage Actor-Critic
    1. Correlation and sample efficiency
    2. Adding an extra A to A2C
    3. Multiprocessing in Python
    4. A3C with data parallelism
      1. Implementation
      2. Results
    5. A3C with gradients parallelism
      1. Implementation
      2. Results
    6. Summary
  15. Training Chatbots with RL
    1. An overview of chatbots
    2. Chatbot training
    3. The deep NLP basics
      1. RNNs
      2. Word embedding
      3. The Encoder-Decoder architecture
    4. Seq2seq training
      1. Log-likelihood training
      2. The bilingual evaluation understudy (BLEU) score
      3. RL in seq2seq
      4. Self-critical sequence training
    5. Chatbot example
      1. The example structure
      2. Modules: cornell.py and data.py
      3. BLEU score and utils.py
      4. Model
    6. Dataset exploration
    7. Training: cross-entropy
      1. Implementation
      2. Results
    8. Training: SCST
      1. Implementation
      2. Results
    9. Models tested on data
    10. Telegram bot
    11. Summary
  16. The TextWorld Environment
    1. Interactive fiction
    2. The environment
      1. Installation
      2. Game generation
      3. Observation and action spaces
      4. Extra game information
    3. Baseline DQN
      1. Observation preprocessing
      2. Embeddings and encoders
      3. The DQN model and the agent
      4. Training code
      5. Training results
    4. The command generation model
      1. Implementation
      2. Pretraining results
      3. DQN training code
      4. The result of DQN training
    5. Summary
  17. Web Navigation
    1. Web navigation
      1. Browser automation and RL
      2. The MiniWoB benchmark
    2. OpenAI Universe
      1. Installation
      2. Actions and observations
      3. Environment creation
      4. MiniWoB stability
    3. The simple clicking approach
      1. Grid actions
      2. Example overview
      3. The model
      4. The training code
      5. Starting containers
      6. The training process
      7. Checking the learned policy
      8. Issues with simple clicking
    4. Human demonstrations
      1. Recording the demonstrations
      2. The recording format
      3. Training using demonstrations
      4. Results
      5. The tic-tac-toe problem
    5. Adding text descriptions
      1. Implementation
      2. Results
    6. Things to try
    7. Summary
  18. Continuous Action Space
    1. Why a continuous space?
      1. The action space
      2. Environments
    2. The A2C method
      1. Implementation
      2. Results
      3. Using models and recording videos
    3. Deterministic policy gradients
      1. Exploration
      2. Implementation
      3. Results
      4. Recording videos
    4. Distributional policy gradients
      1. Architecture
      2. Implementation
      3. Results
      4. Video recordings
    5. Things to try
    6. Summary
  19. RL in Robotics
    1. Robots and robotics
      1. Robot complexities
      2. The hardware overview
      3. The platform
      4. The sensors
      5. The actuators
      6. The frame
    2. The first training objective
    3. The emulator and the model
      1. The model definition file
      2. The robot class
    4. DDPG training and results
    5. Controlling the hardware
      1. MicroPython
      2. Dealing with sensors
        1. The I2C bus
        2. Sensor initialization and reading
        3. Sensor classes and timer reading
        4. Observations
      3. Driving servos
      4. Moving the model to hardware
        1. The model export
        2. Benchmarks
      5. Combining everything
    6. Policy experiments
    7. Summary
  20. Trust Regions – PPO, TRPO, ACKTR, and SAC
    1. Roboschool
    2. The A2C baseline
      1. Implementation
      2. Results
      3. Video recording
    3. PPO
      1. Implementation
      2. Results
    4. TRPO
      1. Implementation
      2. Results
    5. ACKTR
      1. Implementation
      2. Results
    6. SAC
      1. Implementation
      2. Results
    7. Summary
  21. Black-Box Optimization in RL
    1. Black-box methods
    2. Evolution strategies
      1. ES on CartPole
        1. Results
      2. ES on HalfCheetah
        1. Implementation
        2. Results
    3. Genetic algorithms
      1. GA on CartPole
        1. Results
      2. GA tweaks
        1. Deep GA
        2. Novelty search
      3. GA on HalfCheetah
        1. Results
    4. Summary
    5. References
  22. Advanced Exploration
    1. Why exploration is important
    2. What's wrong with ε-greedy?
    3. Alternative ways of exploration
      1. Noisy networks
      2. Count-based methods
      3. Prediction-based methods
    4. MountainCar experiments
      1. The DQN method with ε-greedy
      2. The DQN method with noisy networks
      3. The DQN method with state counts
      4. The proximal policy optimization method
      5. The PPO method with noisy networks
      6. The PPO method with count-based exploration
      7. The PPO method with network distillation
    5. Atari experiments
      1. The DQN method with ε-greedy
      2. The classic PPO method
      3. The PPO method with network distillation
      4. The PPO method with noisy networks
    6. Summary
    7. References
  23. Beyond Model-Free – Imagination
    1. Model-based methods
      1. Model-based versus model-free
      2. Model imperfections
    2. The imagination-augmented agent
      1. The EM
      2. The rollout policy
      3. The rollout encoder
      4. The paper's results
    3. I2A on Atari Breakout
      1. The baseline A2C agent
      2. EM training
      3. The imagination agent
        1. The I2A model
        2. The Rollout encoder
        3. The training of I2A
    4. Experiment results
      1. The baseline agent
      2. Training EM weights
      3. Training with the I2A model
    5. Summary
    6. References
  24. AlphaGo Zero
    1. Board games
    2. The AlphaGo Zero method
      1. Overview
      2. MCTS
      3. Self-play
      4. Training and evaluation
    3. The Connect 4 bot
      1. The game model
      2. Implementing MCTS
      3. The model
      4. Training
      5. Testing and comparison
    4. Connect 4 results
    5. Summary
    6. References
  25. RL in Discrete Optimization
    1. RL's reputation
    2. The Rubik's Cube and combinatorial optimization
    3. Optimality and God's number
    4. Approaches to cube solving
      1. Data representation
      2. Actions
      3. States
    5. The training process
      1. The NN architecture
      2. The training
    6. The model application
    7. The paper's results
    8. The code outline
      1. Cube environments
      2. Training
      3. The search process
    9. The experiment results
      1. The 2×2 cube
      2. The 3×3 cube
    10. Further improvements and experiments
    11. Summary
  26. Multi-agent RL
    1. Multi-agent RL explained
      1. Forms of communication
      2. The RL approach
    2. The MAgent environment
      1. Installation
      2. An overview
      3. A random environment
    3. Deep Q-network for tigers
      1. Training and results
    4. Collaboration by the tigers
    5. Training both tigers and deer
    6. The battle between equal actors
    7. Summary
  27. Other Books You May Enjoy
  28. Index

Product information

  • Title: Deep Reinforcement Learning Hands-On - Second Edition
  • Author(s): Maxim Lapan
  • Release date: January 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781838826994