Policy Gradients

So far, we have seen how the value-based approach derives an implicit policy from a value function. Here, the agent instead tries to learn the policy directly. The overall loop is similar: as the agent gathers experience, it adjusts the policy in light of what it has witnessed.
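
To make "learning the policy directly" concrete, here is a minimal sketch of the idea (not taken from this book, and written in plain NumPy rather than TensorFlow for brevity): a softmax policy over two actions is updated with the REINFORCE rule, theta <- theta + alpha * R * grad log pi(action). The two-armed bandit and its payoff probabilities are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)          # policy parameters: one preference per action

    def softmax(x):
        z = np.exp(x - x.max())  # subtract max for numerical stability
        return z / z.sum()

    def pull(action):
        # Hypothetical bandit: arm 1 pays off more often than arm 0.
        return float(rng.random() < (0.4 if action == 0 else 0.8))

    alpha = 0.1                  # learning rate
    for _ in range(2000):
        probs = softmax(theta)
        action = rng.choice(2, p=probs)      # sample an action from the policy
        reward = pull(action)
        grad_log_pi = -probs                 # grad of log softmax is
        grad_log_pi[action] += 1.0           # one_hot(action) - probs
        theta += alpha * reward * grad_log_pi

    print("learned action probabilities:", softmax(theta))

After training, the printed probabilities should strongly favor arm 1, the higher-paying action; note that no value function is ever estimated along the way.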

Value iteration, policy iteration, and Q-learning fall under the value-based approach and are solved with dynamic programming, while the policy optimization approach centers on policy gradients. Combining policy gradients with value-function estimation gives rise to actor-critic algorithms.

In the dynamic programming method, the Q and V values must satisfy a set of self-consistent equations (the Bellman equations). Policy optimization is different: the policy is learned directly, unlike deriving ...
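
For reference, the self-consistent equations mentioned above are the standard Bellman expectation equations, and the policy gradient theorem gives the direct update that policy optimization uses instead (textbook results, written here in LaTeX):

    \begin{align*}
    V^{\pi}(s)   &= \sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\,\bigl[r(s,a,s') + \gamma\,V^{\pi}(s')\bigr] \\
    Q^{\pi}(s,a) &= \sum_{s'}P(s'\mid s,a)\,\Bigl[r(s,a,s') + \gamma\sum_{a'}\pi(a'\mid s')\,Q^{\pi}(s',a')\Bigr] \\
    \nabla_{\theta}\,J(\theta) &= \mathbb{E}_{\pi_{\theta}}\bigl[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,Q^{\pi_{\theta}}(s,a)\bigr]
    \end{align*}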
