Deep Reinforcement Learning with Python - Second Edition

Book description

An example-rich guide for beginners to start their reinforcement learning and deep reinforcement learning journey with state-of-the-art algorithms

Key Features

  • Covers a vast spectrum of basic-to-advanced RL algorithms, with a mathematical explanation of each
  • Learn how to implement each algorithm in code by following examples with line-by-line explanations
  • Explore the latest RL methodologies such as DDPG, PPO, and the use of expert demonstrations

Book Description

With significant improvements in the quality and quantity of reinforcement learning (RL) algorithms in recent years, this second edition of Hands-On Reinforcement Learning with Python has been revamped into an example-rich guide to learning state-of-the-art RL and deep RL algorithms with TensorFlow 2 and the OpenAI Gym toolkit.

In addition to exploring RL basics and foundational concepts such as the Bellman equation, Markov decision processes, and dynamic programming algorithms, this second edition dives deep into the full spectrum of value-based, policy-based, and actor-critic RL methods. It explores state-of-the-art algorithms such as DQN, TRPO, PPO, ACKTR, DDPG, TD3, and SAC in depth, demystifying the underlying math and demonstrating implementations through simple code examples.

The book has several new chapters dedicated to new RL techniques, including distributional RL, imitation learning, inverse RL, and meta RL. You will learn to leverage Stable Baselines, an improved implementation of OpenAI's Baselines library, to effortlessly implement popular RL algorithms. The book concludes with an overview of promising research directions, such as meta-learning and imagination-augmented agents.

By the end, you will become skilled in effectively employing RL and deep RL in your real-world projects.

What you will learn

  • Understand core RL concepts including the methodologies, math, and code
  • Train an agent to solve Blackjack, FrozenLake, and many other problems using OpenAI Gym
  • Train an agent to play Ms Pac-Man using a Deep Q Network
  • Learn policy-based, value-based, and actor-critic methods
  • Master the math behind DDPG, TD3, TRPO, PPO, and many others
  • Explore new avenues such as distributional RL, meta RL, and inverse RL
  • Use Stable Baselines to train an agent to walk and play Atari games (see the short sketch after this list)
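
As a short sketch of the workflow the Gym and Stable Baselines points above refer to, here is one way to train and run an agent. It assumes the Stable Baselines (2.x) and OpenAI Gym APIs used in the book; the environment, policy, and step counts are illustrative choices rather than examples reproduced from the book:

    # Train and run a DQN agent with Stable Baselines on a Gym environment.
    # Environment, policy, and timestep counts are illustrative assumptions.
    import gym
    from stable_baselines import DQN

    env = gym.make("CartPole-v1")             # any discrete-action Gym environment
    model = DQN("MlpPolicy", env, verbose=1)  # DQN with a simple MLP policy network
    model.learn(total_timesteps=10000)        # train the agent

    obs = env.reset()
    for _ in range(200):                      # run the trained agent for a few steps
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()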

Who this book is for

If you're a machine learning developer with little or no experience with neural networks, or someone interested in artificial intelligence who wants to learn reinforcement learning from scratch, this book is for you.

Basic familiarity with linear algebra, calculus, and the Python programming language is required. Some experience with TensorFlow would be a plus.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Fundamentals of Reinforcement Learning
    1. Key elements of RL
      1. Agent
      2. Environment
      3. State and action
      4. Reward
    2. The basic idea of RL
    3. The RL algorithm
      1. RL agent in the grid world
    4. How RL differs from other ML paradigms
    5. Markov Decision Processes
      1. The Markov property and Markov chain
      2. The Markov Reward Process
      3. The Markov Decision Process
    6. Fundamental concepts of RL
      1. Math essentials
        1. Expectation
      2. Action space
      3. Policy
        1. Deterministic policy
        2. Stochastic policy
      4. Episode
      5. Episodic and continuous tasks
      6. Horizon
      7. Return and discount factor
        1. Small discount factor
        2. Large discount factor
        3. What happens when we set the discount factor to 0?
        4. What happens when we set the discount factor to 1?
      8. The value function
      9. Q function
      10. Model-based and model-free learning
      11. Different types of environments
        1. Deterministic and stochastic environments
        2. Discrete and continuous environments
        3. Episodic and non-episodic environments
        4. Single and multi-agent environments
    7. Applications of RL
    8. RL glossary
    9. Summary
    10. Questions
    11. Further reading
  3. A Guide to the Gym Toolkit
    1. Setting up our machine
      1. Installing Anaconda
      2. Installing the Gym toolkit
        1. Common error fixes
    2. Creating our first Gym environment
      1. Exploring the environment
        1. States
        2. Actions
        3. Transition probability and reward function
      2. Generating an episode in the Gym environment
        1. Action selection
        2. Generating an episode
    3. More Gym environments
      1. Classic control environments
        1. State space
        2. Action space
        3. Cart-Pole balancing with random policy
      2. Atari game environments
      3. General environment
        1. Deterministic environment
        2. No frame skipping
        3. State and action space
        4. An agent playing the Tennis game
        5. Recording the game
      4. Other environments
        1. Box2D
        2. MuJoCo
        3. Robotics
        4. Toy text
        5. Algorithms
    4. Environment synopsis
    5. Summary
    6. Questions
    7. Further reading
  4. The Bellman Equation and Dynamic Programming
    1. The Bellman equation
      1. The Bellman equation of the value function
      2. The Bellman equation of the Q function
      3. The Bellman optimality equation
      4. The relationship between the value and Q functions
    2. Dynamic programming
      1. Value iteration
        1. The value iteration algorithm
        2. Solving the Frozen Lake problem with value iteration
      2. Policy iteration
        1. Algorithm – policy iteration
        2. Solving the Frozen Lake problem with policy iteration
    3. Is DP applicable to all environments?
    4. Summary
    5. Questions
  5. Monte Carlo Methods
    1. Understanding the Monte Carlo method
    2. Prediction and control tasks
      1. Prediction task
      2. Control task
    3. Monte Carlo prediction
      1. MC prediction algorithm
      2. Types of MC prediction
        1. First-visit Monte Carlo
        2. Every-visit Monte Carlo
      3. Implementing the Monte Carlo prediction method
        1. Understanding the blackjack game
        2. The blackjack environment in the Gym library
        3. Every-visit MC prediction with the blackjack game
        4. First-visit MC prediction with the blackjack game
      4. Incremental mean updates
      5. MC prediction (Q function)
    4. Monte Carlo control
      1. MC control algorithm
      2. On-policy Monte Carlo control
        1. Monte Carlo exploring starts
        2. Monte Carlo with the epsilon-greedy policy
        3. Implementing on-policy MC control
      3. Off-policy Monte Carlo control
    5. Is the MC method applicable to all tasks?
    6. Summary
    7. Questions
  6. Understanding Temporal Difference Learning
    1. TD learning
    2. TD prediction
      1. TD prediction algorithm
        1. Predicting the value of states in the Frozen Lake environment
    3. TD control
      1. On-policy TD control – SARSA
        1. Computing the optimal policy using SARSA
      2. Off-policy TD control – Q learning
        1. Computing the optimal policy using Q learning
      3. The difference between Q learning and SARSA
    4. Comparing the DP, MC, and TD methods
    5. Summary
    6. Questions
    7. Further reading
  7. Case Study – The MAB Problem
    1. The MAB problem
      1. Creating a bandit in the Gym
      2. Exploration strategies
        1. Epsilon-greedy
        2. Softmax exploration
        3. Upper confidence bound
        4. Thompson sampling
    2. Applications of MAB
    3. Finding the best advertisement banner using bandits
      1. Creating a dataset
      2. Initialize the variables
      3. Define the epsilon-greedy method
      4. Run the bandit test
    4. Contextual bandits
    5. Summary
    6. Questions
    7. Further reading
  8. Deep Learning Foundations
    1. Biological and artificial neurons
    2. ANN and its layers
      1. Input layer
      2. Hidden layer
      3. Output layer
    3. Exploring activation functions
      1. The sigmoid function
      2. The tanh function
      3. The Rectified Linear Unit function
      4. The softmax function
    4. Forward propagation in ANNs
    5. How does an ANN learn?
    6. Putting it all together
      1. Building a neural network from scratch
    7. Recurrent Neural Networks
      1. The difference between feedforward networks and RNNs
      2. Forward propagation in RNNs
      3. Backpropagating through time
    8. LSTM to the rescue
      1. Understanding the LSTM cell
    9. What are CNNs?
      1. Convolutional layers
        1. Strides
        2. Padding
      2. Pooling layers
      3. Fully connected layers
    10. The architecture of CNNs
    11. Generative adversarial networks
      1. Breaking down the generator
      2. Breaking down the discriminator
      3. How do they learn, though?
      4. Architecture of a GAN
      5. Demystifying the loss function
        1. Discriminator loss
        2. Generator loss
    12. Total loss
    13. Summary
    14. Questions
    15. Further reading
  9. A Primer on TensorFlow
    1. What is TensorFlow?
    2. Understanding computational graphs and sessions
      1. Sessions
    3. Variables, constants, and placeholders
      1. Variables
      2. Constants
      3. Placeholders and feed dictionaries
    4. Introducing TensorBoard
      1. Creating a name scope
    5. Handwritten digit classification using TensorFlow
      1. Importing the required libraries
      2. Loading the dataset
      3. Defining the number of neurons in each layer
      4. Defining placeholders
      5. Forward propagation
      6. Computing loss and backpropagation
      7. Computing accuracy
      8. Creating a summary
      9. Training the model
      10. Visualizing graphs in TensorBoard
    6. Introducing eager execution
    7. Math operations in TensorFlow
    8. TensorFlow 2.0 and Keras
      1. Bonjour Keras
        1. Defining the model
        2. Compiling the model
        3. Training the model
        4. Evaluating the model
      2. MNIST digit classification using TensorFlow 2.0
    9. Summary
    10. Questions
    11. Further reading
  10. Deep Q Network and Its Variants
    1. What is DQN?
      1. Understanding DQN
        1. Replay buffer
        2. Loss function
        3. Target network
      2. Putting it all together
        1. The DQN algorithm
    2. Playing Atari games using DQN
      1. Architecture of the DQN
      2. Getting hands-on with the DQN
        1. Preprocess the game screen
        2. Defining the DQN class
        3. Training the DQN
    3. The double DQN
      1. The double DQN algorithm
    4. DQN with prioritized experience replay
      1. Types of prioritization
        1. Proportional prioritization
        2. Rank-based prioritization
      2. Correcting the bias
    5. The dueling DQN
      1. Understanding the dueling DQN
      2. The architecture of a dueling DQN
    6. The deep recurrent Q network
      1. The architecture of a DRQN
    7. Summary
    8. Questions
    9. Further reading
  11. Policy Gradient Method
    1. Why policy-based methods?
    2. Policy gradient intuition
      1. Understanding the policy gradient
      2. Deriving the policy gradient
      3. Algorithm – policy gradient
    3. Variance reduction methods
      1. Policy gradient with reward-to-go
        1. Algorithm – Reward-to-go policy gradient
      2. Cart pole balancing with policy gradient
        1. Computing discounted and normalized reward
        2. Building the policy network
        3. Training the network
      3. Policy gradient with baseline
        1. Algorithm – REINFORCE with baseline
    4. Summary
    5. Questions
    6. Further reading
  12. Actor-Critic Methods – A2C and A3C
    1. Overview of the actor-critic method
      1. Understanding the actor-critic method
      2. The actor-critic algorithm
    2. Advantage actor-critic (A2C)
    3. Asynchronous advantage actor-critic (A3C)
      1. The three As
      2. The architecture of A3C
      3. Mountain car climbing using A3C
        1. Creating the mountain car environment
        2. Defining the variables
        3. Defining the actor-critic class
        4. Defining the worker class
        5. Training the network
        6. Visualizing the computational graph
    4. A2C revisited
    5. Summary
    6. Questions
    7. Further reading
  13. Learning DDPG, TD3, and SAC
    1. Deep deterministic policy gradient
      1. An overview of DDPG
        1. Actor
        2. Critic
      2. DDPG components
        1. Critic network
        2. Actor network
      3. Putting it all together
      4. Algorithm – DDPG
      5. Swinging up a pendulum using DDPG
        1. Creating the Gym environment
        2. Defining the variables
        3. Defining the DDPG class
        4. Training the network
    2. Twin delayed DDPG
      1. Key features of TD3
        1. Clipped double Q learning
        2. Delayed policy updates
        3. Target policy smoothing
      2. Putting it all together
      3. Algorithm – TD3
    3. Soft actor-critic
      1. Understanding soft actor-critic
        1. V and Q functions with the entropy term
      2. Components of SAC
        1. Critic network
        2. Actor network
      3. Putting it all together
      4. Algorithm – SAC
    4. Summary
    5. Questions
    6. Further reading
  14. TRPO, PPO, and ACKTR Methods
    1. Trust region policy optimization
      1. Math essentials
        1. The Taylor series
        2. The trust region method
        3. The conjugate gradient method
        4. Lagrange multipliers
        5. Importance sampling
      2. Designing the TRPO objective function
        1. Parameterizing the policies
        2. Sample-based estimation
      3. Solving the TRPO objective function
        1. Computing the search direction
        2. Performing a line search in the search direction
      4. Algorithm – TRPO
    2. Proximal policy optimization
      1. PPO with a clipped objective
        1. Algorithm – PPO-clipped
      2. Implementing the PPO-clipped method
        1. Creating the Gym environment
        2. Defining the PPO class
        3. Training the network
      3. PPO with a penalized objective
        1. Algorithm – PPO-penalty
    3. Actor-critic using Kronecker-factored trust region
      1. Math essentials
        1. Block matrix
        2. Block diagonal matrix
        3. The Kronecker product
        4. The vec operator
        5. Properties of the Kronecker product
      2. Kronecker-Factored Approximate Curvature (K-FAC)
      3. K-FAC in actor-critic
      4. Incorporating the trust region
    4. Summary
    5. Questions
    6. Further reading
  15. Distributional Reinforcement Learning
    1. Why distributional reinforcement learning?
    2. Categorical DQN
      1. Predicting the value distribution
      2. Selecting an action based on the value distribution
      3. Training the categorical DQN
        1. Projection step
      4. Putting it all together
      5. Algorithm – categorical DQN
      6. Playing Atari games using a categorical DQN
        1. Defining the variables
        2. Defining the replay buffer
        3. Defining the categorical DQN class
    3. Quantile Regression DQN
      1. Math essentials
        1. Quantile
        2. Inverse CDF (quantile function)
      2. Understanding QR-DQN
        1. Action selection
        2. Loss function
    4. Distributed Distributional DDPG
      1. Critic network
      2. Actor network
      3. Algorithm – D4PG
    5. Summary
    6. Questions
    7. Further reading
  16. Imitation Learning and Inverse RL
    1. Supervised imitation learning
    2. DAgger
      1. Understanding DAgger
      2. Algorithm – DAgger
    3. Deep Q learning from demonstrations
      1. Phases of DQfD
        1. Pre-training phase
        2. Training phase
      2. Loss function of DQfD
      3. Algorithm – DQfD
    4. Inverse reinforcement learning
      1. Maximum entropy IRL
        1. Key terms
        2. Back to maximum entropy IRL
        3. Computing the gradient
        4. Algorithm – maximum entropy IRL
    5. Generative adversarial imitation learning
      1. Formulation of GAIL
    6. Summary
    7. Questions
    8. Further reading
  17. Deep Reinforcement Learning with Stable Baselines
    1. Installing Stable Baselines
    2. Creating our first agent with Stable Baselines
      1. Evaluating the trained agent
      2. Storing and loading the trained agent
      3. Viewing the trained agent
      4. Putting it all together
    3. Vectorized environments
      1. SubprocVecEnv
      2. DummyVecEnv
    4. Integrating custom environments
    5. Playing Atari games with a DQN and its variants
      1. Implementing DQN variants
    6. Lunar lander using A2C
      1. Creating a custom network
    7. Swinging up a pendulum using DDPG
      1. Viewing the computational graph in TensorBoard
    8. Training an agent to walk using TRPO
      1. Installing the MuJoCo environment
      2. Implementing TRPO
      3. Recording the video
    9. Training a cheetah bot to run using PPO
      1. Making a GIF of a trained agent
    10. Implementing GAIL
    11. Summary
    12. Questions
    13. Further reading
  18. Reinforcement Learning Frontiers
    1. Meta reinforcement learning
      1. Model-agnostic meta learning
        1. Understanding MAML
        2. MAML in a supervised learning setting
        3. MAML in a reinforcement learning setting
    2. Hierarchical reinforcement learning
      1. MAXQ value function decomposition
    3. Imagination augmented agents
    4. Summary
    5. Questions
    6. Further reading
  19. Appendix 1 – Reinforcement Learning Algorithms
    1. Reinforcement learning algorithm
    2. Value Iteration
    3. Policy Iteration
    4. First-Visit MC Prediction
    5. Every-Visit MC Prediction
    6. MC Prediction – the Q Function
    7. MC Control Method
    8. On-Policy MC Control – Exploring Starts
    9. On-Policy MC Control – Epsilon-Greedy
    10. Off-Policy MC Control
    11. TD Prediction
    12. On-Policy TD Control – SARSA
    13. Off-Policy TD Control – Q Learning
    14. Deep Q Learning
    15. Double DQN
    16. REINFORCE Policy Gradient
    17. Policy Gradient with Reward-To-Go
    18. REINFORCE with Baseline
    19. Advantage Actor Critic
    20. Asynchronous Advantage Actor-Critic
    21. Deep Deterministic Policy Gradient
    22. Twin Delayed DDPG
    23. Soft Actor-Critic
    24. Trust Region Policy Optimization
    25. PPO-Clipped
    26. PPO-Penalty
    27. Categorical DQN
    28. Distributed Distributional DDPG
    29. DAgger
    30. Deep Q learning from demonstrations
    31. MaxEnt Inverse Reinforcement Learning
    32. MAML in Reinforcement Learning
  20. Appendix 2 – Assessments
    1. Chapter 1 – Fundamentals of Reinforcement Learning
    2. Chapter 2 – A Guide to the Gym Toolkit
    3. Chapter 3 – The Bellman Equation and Dynamic Programming
    4. Chapter 4 – Monte Carlo Methods
    5. Chapter 5 – Understanding Temporal Difference Learning
    6. Chapter 6 – Case Study – The MAB Problem
    7. Chapter 7 – Deep Learning Foundations
    8. Chapter 8 – A Primer on TensorFlow
    9. Chapter 9 – Deep Q Network and Its Variants
    10. Chapter 10 – Policy Gradient Method
    11. Chapter 11 – Actor-Critic Methods – A2C and A3C
    12. Chapter 12 – Learning DDPG, TD3, and SAC
    13. Chapter 13 – TRPO, PPO, and ACKTR Methods
    14. Chapter 14 – Distributional Reinforcement Learning
    15. Chapter 15 – Imitation Learning and Inverse RL
    16. Chapter 16 – Deep Reinforcement Learning with Stable Baselines
    17. Chapter 17 – Reinforcement Learning Frontiers
  21. Other Books You May Enjoy
  22. Index

Product information

  • Title: Deep Reinforcement Learning with Python - Second Edition
  • Author(s): Sudharsan Ravichandiran
  • Release date: September 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781839210686