Now, the first thing we need to clarify is how to sample from the environment, and how to interact with it to get usable information about its dynamics:
$$V_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
The simplest way to do this is to execute the current policy until the end of the episode. You would end up with a trajectory as shown in figure 4.1. Once the episode terminates, the return $G_t$ can be computed for each visited state by propagating the sum of the discounted rewards backward from the terminal state. Repeating this process multiple times (that is, running multiple trajectories) and averaging the returns obtained for each state yields an estimate of the expected return, and therefore of the value function.
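To make this concrete, the following is a minimal sketch of the procedure in Python, assuming a Gym-style environment with the classic `reset()`/`step()` API and discrete, hashable states; the function and variable names here are illustrative, not taken from the book's code:

```python
from collections import defaultdict

def run_episode(env, policy):
    """Follow the policy until the episode terminates,
    recording (state, reward) pairs along the trajectory."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def compute_returns(trajectory, gamma=0.99):
    """Propagate the rewards backward from the terminal state
    to obtain the return G_t for every step of the trajectory."""
    returns = []
    g = 0.0
    for state, reward in reversed(trajectory):
        g = reward + gamma * g
        returns.append((state, g))
    return list(reversed(returns))

def mc_value_estimate(env, policy, n_episodes=1000, gamma=0.99):
    """Estimate V(s) by averaging the first-visit returns
    of each state across many sampled episodes."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        trajectory = run_episode(env, policy)
        visited = set()
        for state, g in compute_returns(trajectory, gamma):
            if state not in visited:  # first-visit Monte Carlo
                visited.add(state)
                sums[state] += g
                counts[state] += 1
    return {s: sums[s] / counts[s] for s in sums}
```

Averaging only the first occurrence of each state per episode, as above, is the first-visit Monte Carlo convention; averaging every occurrence (every-visit Monte Carlo) is an equally valid variant, and both converge to the true value function as the number of episodes grows.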