In the previous chapter, we looked at policy gradient algorithms. Their uniqueness lies in the order in which they solve a reinforcement learning (RL) problem—policy gradient algorithms take a step in the direction of the highest gain of the reward. The simpler version of this algorithm (REINFORCE) has a straightforward implementation that alone achieves good results. Nevertheless, it is slow and has a high variance. For this reason, we introduced a value function that has a double goal—to critique the actor and to provide a baseline. Despite their great potential, these actor-critic algorithms can suffer from unwanted rapid variations in the action distribution that may cause a drastic change in the states that ...
TRPO and PPO Implementation
Get Reinforcement Learning Algorithms with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.