Reinforcement Learning and Stochastic Optimization

Book description

Clearing the jungle of stochastic optimization

Sequential decision problems, which consist of “decision, information, decision, information, …,” are ubiquitous, spanning virtually every human activity: business applications, health (personal and public health, and medical decision making), energy, the sciences, all fields of engineering, finance, and e-commerce. The diversity of applications has attracted the attention of at least 15 distinct fields of research, using eight distinct notational systems, which have produced a vast array of analytical tools. A byproduct is that powerful tools developed in one community may be unknown to other communities.

Reinforcement Learning and Stochastic Optimization offers a single canonical framework that can model any sequential decision problem using five core components: state variables, decision variables, exogenous information variables, transition function, and objective function. This book highlights twelve types of uncertainty that might enter any model and pulls together the diverse set of methods for making decisions, known as policies, into four fundamental classes that span every method suggested in the academic literature or used in practice.
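
To make the five components concrete, here is a minimal sketch (not taken from the book; the toy inventory example, function names, and numbers are illustrative assumptions) showing how state, decision, exogenous information, transition function, and objective function typically map onto a simulation loop:

```python
import random

def policy(state, theta=5):
    """Decision x_t: order up to a target level theta (a simple rule-based policy)."""
    return max(0, theta - state["inventory"])

def exogenous_info(rng):
    """Exogenous information W_{t+1}: random demand revealed after the decision."""
    return {"demand": rng.randint(0, 8)}

def transition(state, decision, info):
    """Transition function: next state from current state, decision, and new information."""
    inventory = max(0, state["inventory"] + decision - info["demand"])
    return {"inventory": inventory}

def contribution(state, decision, info):
    """One-period contribution: revenue from sales minus ordering cost."""
    sales = min(state["inventory"] + decision, info["demand"])
    return 10 * sales - 3 * decision

# Objective function: accumulate contributions over the horizon for this policy.
rng = random.Random(0)
state = {"inventory": 0}          # initial state S_0
total = 0
for t in range(20):
    x = policy(state)              # decision
    w = exogenous_info(rng)        # information
    total += contribution(state, x, w)
    state = transition(state, x, w)
print("cumulative contribution:", total)
```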

Reinforcement Learning and Stochastic Optimization is the first book to provide a balanced treatment of the different methods for modeling and solving sequential decision problems, following the style used by most books on machine learning, optimization, and simulation. The presentation is designed for readers with a course in probability and statistics, and an interest in modeling and applications. Linear programming is occasionally used for specific problem classes. The book is designed for readers who are new to the field, as well as those with some background in optimization under uncertainty.

Throughout this book, readers will find references to over 100 different applications, spanning pure learning problems, dynamic resource allocation problems, general state-dependent problems, and hybrid learning/resource allocation problems such as those that arose in the COVID pandemic. There are 370 exercises, organized into seven groups, ranging from review questions through modeling, computation, problem solving, theory, and programming exercises to a “diary problem” that readers choose at the beginning of the book and use as the basis for questions throughout the rest of the book.

Table of contents

  1. Cover
  2. Title page
  3. Copyright
  4. Preface
  5. Acknowledgments
  6. Part I – Introduction
    1. 1 Sequential Decision Problems
      1. 1.1 The Audience
      2. 1.2 The Communities of Sequential Decision Problems
      3. 1.3 Our Universal Modeling Framework
      4. 1.4 Designing Policies for Sequential Decision Problems
      5. 1.4.1 Policy Search
      6. 1.4.2 Policies Based on Lookahead Approximations
      7. 1.4.3 Mixing and Matching
      8. 1.4.4 Optimality of the Four Classes
      9. 1.4.5 Pulling it All Together
      10. 1.5 Learning
      11. 1.6 Themes
      12. 1.6.1 Blending Learning and Optimization
      13. 1.6.2 Bridging Machine Learning to Sequential Decisions
      14. 1.6.3 From Deterministic to Stochastic Optimization
      15. 1.6.4 From Single to Multiple Agents
      16. 1.7 Our Modeling Approach
      17. 1.8 How to Read this Book
      18. 1.8.1 Organization of Topics
      19. 1.8.2 How to Read Each Chapter
      20. 1.8.3 Organization of Exercises
      21. 1.9 Bibliographic Notes
      22. Exercises
      23. Bibliography
    2. 2 Canonical Problems and Applications
      1. 2.1 Canonical Problems
      2. 2.1.1 Stochastic Search – Derivative-based and Derivative-free
      3. 2.1.1.1 Derivative-based Stochastic Search
      4. 2.1.1.2 Derivative-free Stochastic Search
      5. 2.1.2 Decision Trees
      6. 2.1.3 Markov Decision Processes
      7. 2.1.4 Optimal Control
      8. 2.1.5 Approximate Dynamic Programming
      9. 2.1.6 Reinforcement Learning
      10. 2.1.7 Optimal Stopping
      11. 2.1.8 Stochastic Programming
      12. 2.1.9 The Multiarmed Bandit Problem
      13. 2.1.10 Simulation Optimization
      14. 2.1.11 Active Learning
      15. 2.1.12 Chance-constrained Programming
      16. 2.1.13 Model Predictive Control
      17. 2.1.14 Robust Optimization
      18. 2.2 A Universal Modeling Framework for Sequential Decision Problems
      19. 2.2.1 Our Universal Model for Sequential Decision Problems
      20. 2.2.2 A Compact Modeling Presentation
      21. 2.2.3 MDP/RL vs. Optimal Control Modeling Frameworks
      22. 2.3 Applications
      23. 2.3.1 The Newsvendor Problems
      24. 2.3.1.1 Basic Newsvendor – Final Reward
      25. 2.3.1.2 Basic Newsvendor – Cumulative Reward
      26. 2.3.1.3 Contextual Newsvendor
      27. 2.3.1.4 Multidimensional Newsvendor Problems
      28. 2.3.2 Inventory/Storage Problems
      29. 2.3.2.1 Inventory Without Lags
      30. 2.3.2.2 Inventory Planning with Forecasts
      31. 2.3.2.3 Lagged Decisions
      32. 2.3.3 Shortest Path Problems
      33. 2.3.3.1 A Deterministic Shortest Path Problem
      34. 2.3.3.2 A Stochastic Shortest Path Problem
      35. 2.3.3.3 A Dynamic Shortest Path Problem
      36. 2.3.3.4 A Robust Shortest Path Problem
      37. 2.3.4 Some Fleet Management Problems
      38. 2.3.4.1 The Nomadic Trucker
      39. 2.3.4.2 From One Driver to a Fleet
      40. 2.3.5 Pricing
      41. 2.3.6 Medical Decision Making
      42. 2.3.7 Scientific Exploration
      43. 2.3.8 Machine Learning vs. Sequential Decision Problems
      44. 2.4 Bibliographic Notes
      45. Exercises
      46. Bibliography
    3. 3 Online Learning
      1. 3.1 Machine Learning for Sequential Decisions
      2. 3.1.1 Observations and Data in Stochastic Optimization
      3. 3.1.2 Indexing Input xn and Response yn+1
      4. 3.1.3 Functions We are Learning
      5. 3.1.4 Sequential Learning: From Very Little Data to … More Data
      6. 3.1.5 Approximation Strategies
      7. 3.1.6 From Data Analytics to Decision Analytics
      8. 3.1.7 Batch vs. Online Learning
      9. 3.2 Adaptive Learning Using Exponential Smoothing
      10. 3.3 Lookup Tables with Frequentist Updating
      11. 3.4 Lookup Tables with Bayesian Updating
      12. 3.4.1 The Updating Equations for Independent Beliefs
      13. 3.4.2 Updating for Correlated Beliefs
      14. 3.4.3 Gaussian Process Regression
      15. 3.5 Computing Bias and Variance*
      16. 3.6 Lookup Tables and Aggregation*
      17. 3.6.1 Hierarchical Aggregation
      18. 3.6.2 Estimates of Different Levels of Aggregation
      19. 3.6.3 Combining Multiple Levels of Aggregation
      20. 3.7 Linear Parametric Models
      21. 3.7.1 Linear Regression Review
      22. 3.7.2 Sparse Additive Models and Lasso
      23. 3.8 Recursive Least Squares for Linear Models
      24. 3.8.1 Recursive Least Squares for Stationary Data
      25. 3.8.2 Recursive Least Squares for Nonstationary Data*
      26. 3.8.3 Recursive Estimation Using Multiple Observations*
      27. 3.9 Nonlinear Parametric Models
      28. 3.9.1 Maximum Likelihood Estimation
      29. 3.9.2 Sampled Belief Models
      30. 3.9.3 Neural Networks – Parametric*
      31. 3.9.4 Limitations of Neural Networks
      32. 3.10 Nonparametric Models*
      33. 3.10.1 K-Nearest Neighbor
      34. 3.10.2 Kernel Regression
      35. 3.10.3 Local Polynomial Regression
      36. 3.10.4 Deep Neural Networks
      37. 3.10.5 Support Vector Machines
      38. 3.10.6 Indexed Functions, Tree Structures, and Clustering
      39. 3.10.7 Comments on Nonparametric Models
      40. 3.11 Nonstationary Learning*
      41. 3.11.1 Nonstationary Learning I – Martingale Truth
      42. 3.11.2 Nonstationary Learning II – Transient Truth
      43. 3.11.3 Learning Processes
      44. 3.12 The Curse of Dimensionality
      45. 3.13 Designing Approximation Architectures in Adaptive Learning
      46. 3.14 Why Does It Work?**
      47. 3.14.1 Derivation of the Recursive Estimation Equations
      48. 3.14.2 The Sherman-Morrison Updating Formula
      49. 3.14.3 Correlations in Hierarchical Estimation
      50. 3.14.4 Proof of Proposition 3.14.1
      51. 3.15 Bibliographic Notes
      52. Exercises
      53. Bibliography
    4. 4 Introduction to Stochastic Search
      1. 4.1 Illustrations of the Basic Stochastic Optimization Problem
      2. 4.2 Deterministic Methods
      3. 4.2.1 A “Stochastic” Shortest Path Problem
      4. 4.2.2 A Newsvendor Problem with Known Distribution
      5. 4.2.3 Chance-Constrained Optimization
      6. 4.2.4 Optimal Control
      7. 4.2.5 Discrete Markov Decision Processes
      8. 4.2.6 Remarks
      9. 4.3 Sampled Models
      10. 4.3.1 Formulating a Sampled Model
      11. 4.3.1.1 A Sampled Stochastic Linear Program
      12. 4.3.1.2 Sampled Chance-Constrained Models
      13. 4.3.1.3 Sampled Parametric Models
      14. 4.3.2 Convergence
      15. 4.3.3 Creating a Sampled Model
      16. 4.3.4 Decomposition Strategies*
      17. 4.4 Adaptive Learning Algorithms
      18. 4.4.1 Modeling Adaptive Learning Problems
      19. 4.4.2 Online vs. Offline Applications
      20. 4.4.2.1 Machine Learning
      21. 4.4.2.2 Optimization
      22. 4.4.3 Objective Functions for Learning
      23. 4.4.4 Designing Policies
      24. 4.5 Closing Remarks
      25. 4.6 Bibliographic Notes
      26. Exercises
      27. Bibliography
  7. Part II – Stochastic Search
    1. 5 Derivative-Based Stochastic Search
      1. 5.1 Some Sample Applications
      2. 5.2 Modeling Uncertainty
      3. 5.2.1 Training Uncertainty W1, …, WN
      4. 5.2.2 Model Uncertainty S0
      5. 5.2.3 Testing Uncertainty
      6. 5.2.4 Policy Evaluation
      7. 5.2.5 Closing Notes
      8. 5.3 Stochastic Gradient Methods
      9. 5.3.1 A Stochastic Gradient Algorithm
      10. 5.3.2 Introduction to Stepsizes
      11. 5.3.3 Evaluating a Stochastic Gradient Algorithm
      12. 5.3.4 A Note on Notation
      13. 5.4 Styles of Gradients
      14. 5.4.1 Gradient Smoothing
      15. 5.4.2 Second-Order Methods
      16. 5.4.3 Finite Differences
      17. 5.4.4 SPSA
      18. 5.4.5 Constrained Problems
      19. 5.5 Parameter Optimization for Neural Networks*
      20. 5.5.1 Computing the Gradient
      21. 5.5.2 The Stochastic Gradient Algorithm
      22. 5.6 Stochastic Gradient Algorithm as a Sequential Decision Problem
      23. 5.7 Empirical Issues
      24. 5.8 Transient Problems*
      25. 5.9 Theoretical Performance*
      26. 5.10 Why Does it Work?
      27. 5.10.1 Some Probabilistic Preliminaries
      28. 5.10.2 An Older Proof*
      29. 5.10.3 A More Modern Proof**
      30. 5.11 Bibliographic Notes
      31. Exercises
      32. Bibliography
    2. 6 Stepsize Policies
      1. 6.1 Deterministic Stepsize Policies
      2. 6.1.1 Properties for Convergence
      3. 6.1.2 A Collection of Deterministic Policies
      4. 6.1.2.1 Constant Stepsizes
      5. 6.1.2.2 Generalized Harmonic Stepsizes
      6. 6.1.2.3 Polynomial Learning Rates
      7. 6.1.2.4 McClain’s Formula
      8. 6.1.2.5 Search-then-Converge Learning Policy
      9. 6.2 Adaptive Stepsize Policies
      10. 6.2.1 The Case for Adaptive Stepsizes
      11. 6.2.2 Convergence Conditions
      12. 6.2.3 A Collection of Stochastic Policies
      13. 6.2.3.1 Kesten’s Rule
      14. 6.2.3.2 Trigg’s Formula
      15. 6.2.3.3 Stochastic Gradient Adaptive Stepsize Rule
      16. 6.2.3.4 ADAM
      17. 6.2.3.5 AdaGrad
      18. 6.2.3.6 RMSProp
      19. 6.2.4 Experimental Notes
      20. 6.3 Optimal Stepsize Policies*
      21. 6.3.1 Optimal Stepsizes for Stationary Data
      22. 6.3.2 Optimal Stepsizes for Nonstationary Data – I
      23. 6.3.3 Optimal Stepsizes for Nonstationary Data – II
      24. 6.4 Optimal Stepsizes for Approximate Value Iteration*
      25. 6.5 Convergence
      26. 6.6 Guidelines for Choosing Stepsize Policies
      27. 6.7 Why Does it Work?*
      28. 6.7.1 Proof of BAKF Stepsize
      29. 6.8 Bibliographic Notes
      30. Exercises
      31. Bibliography
    3. 7 Derivative-Free Stochastic Search
      1. 7.1 Overview of Derivative-free Stochastic Search
      2. 7.1.1 Applications and Time Scales
      3. 7.1.2 The Communities of Derivative-free Stochastic Search
      4. 7.1.3 The Multiarmed Bandit Story
      5. 7.1.4 From Passive Learning to Active Learning to Bandit Problems
      6. 7.2 Modeling Derivative-free Stochastic Search
      7. 7.2.1 The Universal Model
      8. 7.2.2 Illustration: Optimizing a Manufacturing Process
      9. 7.2.3 Major Problem Classes
      10. 7.3 Designing Policies
      11. 7.4 Policy Function Approximations
      12. 7.5 Cost Function Approximations
      13. 7.6 VFA-based Policies
      14. 7.6.1 An Optimal Policy
      15. 7.6.2 Beta-Bernoulli Belief Model
      16. 7.6.3 Backward Approximate Dynamic Programming
      17. 7.6.4 Gittins Indices for Learning in Steady State
      18. 7.7 Direct Lookahead Policies
      19. 7.7.1 When do we Need Lookahead Policies?
      20. 7.7.2 Single Period Lookahead Policies
      21. 7.7.3 Restricted Multiperiod Lookahead
      22. 7.7.4 Multiperiod Deterministic Lookahead
      23. 7.7.5 Multiperiod Stochastic Lookahead Policies
      24. 7.7.6 Hybrid Direct Lookahead
      25. 7.8 The Knowledge Gradient (Continued)*
      26. 7.8.1 The Belief Model
      27. 7.8.2 The Knowledge Gradient for Maximizing Final Reward
      28. 7.8.3 The Knowledge Gradient for Maximizing Cumulative Reward
      29. 7.8.4 The Knowledge Gradient for Sampled Belief Model*
      30. 7.8.5 Knowledge Gradient for Correlated Beliefs
      31. 7.9 Learning in Batches
      32. 7.10 Simulation Optimization*
      33. 7.10.1 An Indifference Zone Algorithm
      34. 7.10.2 Optimal Computing Budget Allocation
      35. 7.11 Evaluating Policies
      36. 7.11.1 Alternative Performance Metrics*
      37. 7.11.2 Perspectives of Optimality*
      38. 7.12 Designing Policies
      39. 7.12.1 Characteristics of a Policy
      40. 7.12.2 The Effect of Scaling
      41. 7.12.3 Tuning
      42. 7.13 Extensions*
      43. 7.13.1 Learning in Nonstationary Settings
      44. 7.13.2 Strategies for Designing Time-dependent Policies
      45. 7.13.3 A Transient Learning Model
      46. 7.13.4 The Knowledge Gradient for Transient Problems
      47. 7.13.5 Learning with Large or Continuous Choice Sets
      48. 7.13.6 Learning with Exogenous State Information – the Contextual Bandit Problem
      49. 7.13.7 State-dependent vs. State-independent Problems
      50. 7.14 Bibliographic Notes
      51. Exercises
      52. Bibliography
  8. Part III – State-dependent Problems
    1. 8 State-dependent Problems
      1. 8.1 Graph Problems
      2. 8.1.1 A Stochastic Shortest Path Problem
      3. 8.1.2 The Nomadic Trucker
      4. 8.1.3 The Transformer Replacement Problem
      5. 8.1.4 Asset Valuation
      6. 8.2 Inventory Problems
      7. 8.2.1 A Basic Inventory Problem
      8. 8.2.2 The Inventory Problem – II
      9. 8.2.3 The Lagged Asset Acquisition Problem
      10. 8.2.4 The Batch Replenishment Problem
      11. 8.3 Complex Resource Allocation Problems
      12. 8.3.1 The Dynamic Assignment Problem
      13. 8.3.2 The Blood Management Problem
      14. 8.4 State-dependent Learning Problems
      15. 8.4.1 Medical Decision Making
      16. 8.4.2 Laboratory Experimentation
      17. 8.4.3 Bidding for Ad-clicks
      18. 8.4.4 An Information-collecting Shortest Path Problem
      19. 8.5 A Sequence of Problem Classes
      20. 8.6 Bibliographic Notes
      21. Exercises
      22. Bibliography
    2. 9 Modeling Sequential Decision Problems
      1. 9.1 A Simple Modeling Illustration
      2. 9.2 Notational Style
      3. 9.3 Modeling Time
      4. 9.4 The States of Our System
      5. 9.4.1 Defining the State Variable
      6. 9.4.2 The Three States of Our System
      7. 9.4.3 Initial State S0 vs. Subsequent States St, t > 0
      8. 9.4.4 Lagged State Variables*
      9. 9.4.5 The Post-decision State Variable*
      10. 9.4.6 A Shortest Path Illustration
      11. 9.4.7 Belief States*
      12. 9.4.8 Latent Variables*
      13. 9.4.9 Rolling Forecasts*
      14. 9.4.10 Flat vs. Factored State Representations*
      15. 9.4.11 A Programmer’s Perspective of State Variables
      16. 9.5 Modeling Decisions
      17. 9.5.1 Types of Decisions
      18. 9.5.2 Initial Decision x0 vs. Subsequent Decisions xt, t > 0
      19. 9.5.3 Strategic, Tactical, and Execution Decisions
      20. 9.5.4 Constraints
      21. 9.5.5 Introducing Policies
      22. 9.6 The Exogenous Information Process
      23. 9.6.1 Basic Notation for Information Processes
      24. 9.6.2 Outcomes and Scenarios
      25. 9.6.3 Lagged Information Processes*
      26. 9.6.4 Models of Information Processes*
      27. 9.6.5 Supervisory Processes*
      28. 9.7 The Transition Function
      29. 9.7.1 A General Model
      30. 9.7.2 Model-free Dynamic Programming
      31. 9.7.3 Exogenous Transitions
      32. 9.8 The Objective Function
      33. 9.8.1 The Performance Metric
      34. 9.8.2 Optimizing the Policy
      35. 9.8.3 Dependence of Optimal Policy on S0
      36. 9.8.4 State-dependent Variations
      37. 9.8.5 Uncertainty Operators
      38. 9.9 Illustration: An Energy Storage Model
      39. 9.9.1 With a Time-series Price Model
      40. 9.9.2 With Passive Learning
      41. 9.9.3 With Active Learning
      42. 9.9.4 With Rolling Forecasts
      43. 9.10 Base Models and Lookahead Models
      44. 9.11 A Classification of Problems*
      45. 9.12 Policy Evaluation*
      46. 9.13 Advanced Probabilistic Modeling Concepts**
      47. 9.13.1 A Measure-theoretic View of Information**
      48. 9.13.2 Policies and Measurability
      49. 9.14 Looking Forward
      50. 9.15 Bibliographic Notes
      51. Exercises
      52. Bibliography
    3. 10 Uncertainty Modeling
      1. 10.1 Sources of Uncertainty
      2. 10.1.1 Observational Errors
      3. 10.1.2 Exogenous Uncertainty
      4. 10.1.3 Prognostic Uncertainty
      5. 10.1.4 Inferential (or Diagnostic) Uncertainty
      6. 10.1.5 Experimental Variability
      7. 10.1.6 Model Uncertainty
      8. 10.1.7 Transitional Uncertainty
      9. 10.1.8 Control/implementation Uncertainty
      10. 10.1.9 Communication Errors and Biases
      11. 10.1.10 Algorithmic Instability
      12. 10.1.11 Goal Uncertainty
      13. 10.1.12 Political/regulatory Uncertainty
      14. 10.1.13 Discussion
      15. 10.2 A Modeling Case Study: The COVID Pandemic
      16. 10.3 Stochastic Modeling
      17. 10.3.1 Sampling Exogenous Information
      18. 10.3.2 Types of Distributions
      19. 10.3.3 Modeling Sample Paths
      20. 10.3.4 State-action-dependent Processes
      21. 10.3.5 Modeling Correlations
      22. 10.4 Monte Carlo Simulation
      23. 10.4.1 Generating Uniform [0, 1] Random Variables
      24. 10.4.2 Uniform and Normal Random Variables
      25. 10.4.3 Generating Random Variables from Inverse Cumulative Distributions
      26. 10.4.4 Inverse Cumulative From Quantile Distributions
      27. 10.4.5 Distributions with Uncertain Parameters
      28. 10.5 Case Study: Modeling Electricity Prices
      29. 10.5.1 Mean Reversion
      30. 10.5.2 Jump-diffusion Models
      31. 10.5.3 Quantile Distributions
      32. 10.5.4 Regime Shifting
      33. 10.5.5 Crossing Times
      34. 10.6 Sampling vs. Sampled Models
      35. 10.6.1 Iterative Sampling: A Stochastic Gradient Algorithm
      36. 10.6.2 Static Sampling: Solving a Sampled Model
      37. 10.6.3 Sampled Representation with Bayesian Updating
      38. 10.7 Closing Notes
      39. 10.8 Bibliographic Notes
      40. Exercises
      41. Bibliography
    4. 11 Designing Policies
      1. 11.1 From Optimization to Machine Learning to Sequential Decision Problems
      2. 11.2 The Classes of Policies
      3. 11.3 Policy Function Approximations
      4. 11.4 Cost Function Approximations
      5. 11.5 Value Function Approximations
      6. 11.6 Direct Lookahead Approximations
      7. 11.6.1 The Basic Idea
      8. 11.6.2 Modeling the Lookahead Problem
      9. 11.6.3 The Policy-Within-a-Policy
      10. 11.7 Hybrid Strategies
      11. 11.7.1 Cost Function Approximation with Policy Function Approximations
      12. 11.7.2 Lookahead Policies with Value Function Approximations
      13. 11.7.3 Lookahead Policies with Cost Function Approximations
      14. 11.7.4 Tree Search with Rollout Heuristic and a Lookup Table Policy
      15. 11.7.5 Value Function Approximation with Policy Function Approximation
      16. 11.7.6 Fitting Value Functions Using ADP and Policy Search
      17. 11.8 Randomized Policies
      18. 11.9 Illustration: An Energy Storage Model Revisited
      19. 11.9.1 Policy Function Approximation
      20. 11.9.2 Cost Function Approximation
      21. 11.9.3 Value Function Approximation
      22. 11.9.4 Deterministic Lookahead
      23. 11.9.5 Hybrid Lookahead-Cost Function Approximation
      24. 11.9.6 Experimental Testing
      25. 11.10 Choosing the Policy Class
      26. 11.10.1 The Policy Classes
      27. 11.10.2 Policy Complexity-Computational Tradeoffs
      28. 11.10.3 Screening Questions
      29. 11.11 Policy Evaluation
      30. 11.12 Parameter Tuning
      31. 11.12.1 The Soft Issues
      32. 11.12.2 Searching Across Policy Classes
      33. 11.13 Bibliographic Notes
      34. Exercises
      35. Bibliography
  9. Part IV – Policy Search
    1. 12 Policy Function Approximations and Policy Search
      1. 12.1 Policy Search as a Sequential Decision Problem
      2. 12.2 Classes of Policy Function Approximations
      3. 12.2.1 Lookup Table Policies
      4. 12.2.2 Boltzmann Policies for Discrete Actions
      5. 12.2.3 Linear Decision Rules
      6. 12.2.4 Monotone Policies
      7. 12.2.5 Nonlinear Policies
      8. 12.2.6 Nonparametric/Locally Linear Policies
      9. 12.2.7 Contextual Policies
      10. 12.3 Problem Characteristics
      11. 12.4 Flavors of Policy Search
      12. 12.5 Policy Search with Numerical Derivatives
      13. 12.6 Derivative-Free Methods for Policy Search
      14. 12.6.1 Belief Models
      15. 12.6.2 Learning Through Perturbed PFAs
      16. 12.6.3 Learning CFAs
      17. 12.6.4 DLA Using the Knowledge Gradient
      18. 12.6.5 Comments
      19. 12.7 Exact Derivatives for Continuous Sequential Problems*
      20. 12.8 Exact Derivatives for Discrete Dynamic Programs**
      21. 12.8.1 A Stochastic Policy
      22. 12.8.2 The Objective Function
      23. 12.8.3 The Policy Gradient Theorem
      24. 12.8.4 Computing the Policy Gradient
      25. 12.9 Supervised Learning
      26. 12.10 Why Does it Work?
      27. 12.10.1 Derivation of the Policy Gradient Theorem
      28. 12.11 Bibliographic Notes
      29. Exercises
      30. Bibliography
    2. 13 Cost Function Approximations
      1. 13.1 General Formulation for Parametric CFA
      2. 13.2 Objective-Modified CFAs
      3. 13.2.1 Linear Cost Function Correction
      4. 13.2.2 CFAs for Dynamic Assignment Problems
      5. 13.2.3 Dynamic Shortest Paths
      6. 13.2.4 Dynamic Trading Policy
      7. 13.2.5 Discussion
      8. 13.3 Constraint-Modified CFAs
      9. 13.3.1 General Formulation of Constraint-Modified CFAs
      10. 13.3.2 A Blood Management Problem
      11. 13.3.3 An Energy Storage Example with Rolling Forecasts
      12. 13.4 Bibliographic Notes
      13. Exercises
      14. Bibliography
  10. Part V – Lookahead Policies
    1. 14 Exact Dynamic Programming
      1. 14.1 Discrete Dynamic Programming
      2. 14.2 The Optimality Equations
      3. 14.2.1 Bellman’s Equations
      4. 14.2.2 Computing the Transition Matrix
      5. 14.2.3 Random Contributions
      6. 14.2.4 Bellman’s Equation Using Operator Notation*
      7. 14.3 Finite Horizon Problems
      8. 14.4 Continuous Problems with Exact Solutions
      9. 14.4.1 The Gambling Problem
      10. 14.4.2 The Continuous Budgeting Problem
      11. 14.5 Infinite Horizon Problems*
      12. 14.6 Value Iteration for Infinite Horizon Problems*
      13. 14.6.1 A Gauss-Seidel Variation
      14. 14.6.2 Relative Value Iteration
      15. 14.6.3 Bounds and Rates of Convergence
      16. 14.7 Policy Iteration for Infinite Horizon Problems*
      17. 14.8 Hybrid Value-Policy Iteration*
      18. 14.9 Average Reward Dynamic Programming*
      19. 14.10 The Linear Programming Method for Dynamic Programs**
      20. 14.11 Linear Quadratic Regulation
      21. 14.12 Why Does it Work?**
      22. 14.12.1 The Optimality Equations
      23. 14.12.2 Convergence of Value Iteration
      24. 14.12.3 Monotonicity of Value Iteration
      25. 14.12.4 Bounding the Error from Value Iteration
      26. 14.12.5 Randomized Policies
      27. 14.13 Bibliographic Notes
      28. Exercises
      29. Bibliography
    2. 15 Backward Approximate Dynamic Programming
      1. 15.1 Backward Approximate Dynamic Programming for Finite Horizon Problems
      2. 15.1.1 Some Preliminaries
      3. 15.1.2 Backward ADP Using Lookup Tables
      4. 15.1.3 Backward ADP Algorithm with Continuous Approximations
      5. 15.2 Fitted Value Iteration for Infinite Horizon Problems
      6. 15.3 Value Function Approximation Strategies
      7. 15.3.1 Linear Models
      8. 15.3.2 Monotone Functions
      9. 15.3.3 Other Approximation Models
      10. 15.4 Computational Observations
      11. 15.4.1 Experimental Benchmarking of Backward ADP
      12. 15.4.2 Computational Notes
      13. 15.5 Bibliographic Notes
      14. Exercises
      15. Bibliography
    3. 16 Forward ADP I: The Value of a Policy
      1. 16.1 Sampling the Value of a Policy
      2. 16.1.1 Direct Policy Evaluation for Finite Horizon Problems
      3. 16.1.2 Policy Evaluation for Infinite Horizon Problems
      4. 16.1.3 Temporal Difference Updates
      5. 16.1.4 TD(λ)
      6. 16.1.5 TD(0) and Approximate Value Iteration
      7. 16.1.6 TD Learning for Infinite Horizon Problems
      8. 16.2 Stochastic Approximation Methods
      9. 16.3 Bellman’s Equation Using a Linear Model*
      10. 16.3.1 A Matrix-based Derivation**
      11. 16.3.2 A Simulation-based Implementation
      12. 16.3.3 Least Squares Temporal Difference Learning (LSTD)
      13. 16.3.4 Least Squares Policy Evaluation
      14. 16.4 Analysis of TD(0), LSTD, and LSPE Using a Single State*
      15. 16.4.1 Recursive Least Squares and TD(0)
      16. 16.4.2 Least Squares Policy Evaluation
      17. 16.4.3 Least Squares Temporal Difference Learning
      18. 16.4.4 Discussion
      19. 16.5 Gradient-based Methods for Approximate Value Iteration*
      20. 16.5.1 Approximate Value Iteration with Linear Models**
      21. 16.5.2 A Geometric View of Linear Models*
      22. 16.6 Value Function Approximations Based on Bayesian Learning*
      23. 16.6.1 Minimizing Bias for Infinite Horizon Problems
      24. 16.6.2 Lookup Tables with Correlated Beliefs
      25. 16.6.3 Parametric Models
      26. 16.6.4 Creating the Prior
      27. 16.7 Learning Algorithms and Stepsizes
      28. 16.7.1 Least Squares Temporal Differences
      29. 16.7.2 Least Squares Policy Evaluation
      30. 16.7.3 Recursive Least Squares
      31. 16.7.4 Bounding 1/n Convergence for Approximate Value Iteration
      32. 16.7.5 Discussion
      33. 16.8 Bibliographic Notes
      34. Exercises
      35. Bibliography
    4. 17 Forward ADP II: Policy Optimization
      1. 17.1 Overview of Algorithmic Strategies
      2. 17.2 Approximate Value Iteration and Q-Learning Using Lookup Tables
      3. 17.2.1 Value Iteration Using a Pre-Decision State Variable
      4. 17.2.2 Q-Learning
      5. 17.2.3 Value Iteration Using a Post-Decision State Variable
      6. 17.2.4 Value Iteration Using a Backward Pass
      7. 17.3 Styles of Learning
      8. 17.3.1 Offline Learning
      9. 17.3.2 From Offline to Online
      10. 17.3.3 Evaluating Offline and Online Learning Policies
      11. 17.3.4 Lookahead Policies
      12. 17.4 Approximate Value Iteration Using Linear Models
      13. 17.5 On-policy vs. off-policy learning and the exploration–exploitation problem
      14. 17.5.1 Terminology
      15. 17.5.2 Learning with Lookup Tables
      16. 17.5.3 Learning with Generalized Belief Models
      17. 17.6 Applications
      18. 17.6.1 Pricing an American Option
      19. 17.6.2 Playing “Lose Tic-Tac-Toe”
      20. 17.6.3 Approximate Dynamic Programming for Deterministic Problems
      21. 17.7 Approximate Policy Iteration
      22. 17.7.1 Finite Horizon Problems Using Lookup Tables
      23. 17.7.2 Finite Horizon Problems Using Linear Models
      24. 17.7.3 LSTD for Infinite Horizon Problems Using Linear Models
      25. 17.8 The Actor–Critic Paradigm
      26. 17.9 Statistical Bias in the Max Operator*
      27. 17.10 The Linear Programming Method Using Linear Models*
      28. 17.11 Finite Horizon Approximations for Steady-State Applications
      29. 17.12 Bibliographic Notes
      30. Exercises
      31. Bibliography
    5. 18 Forward ADP III: Convex Resource Allocation Problems
      1. 18.1 Resource Allocation Problems
      2. 18.1.1 The Newsvendor Problem
      3. 18.1.2 Two-Stage Resource Allocation Problems
      4. 18.1.3 A General Multiperiod Resource Allocation Model*
      5. 18.2 Values Versus Marginal Values
      6. 18.3 Piecewise Linear Approximations for Scalar Functions
      7. 18.3.1 The Leveling Algorithm
      8. 18.3.2 The CAVE Algorithm
      9. 18.4 Regression Methods
      10. 18.5 Separable Piecewise Linear Approximations
      11. 18.6 Benders Decomposition for Nonseparable Approximations**
      12. 18.6.1 Benders’ Decomposition for Two-Stage Problems
      13. 18.6.2 Asymptotic Analysis of Benders with Regularization**
      14. 18.6.3 Benders with Regularization
      15. 18.7 Linear Approximations for High-Dimensional Applications
      16. 18.8 Resource Allocation with Exogenous Information State
      17. 18.9 Closing Notes
      18. 18.10 Bibliographic Notes
      19. Exercises
      20. Bibliography
    6. 19 Direct Lookahead Policies
      1. 19.1 Optimal Policies Using Lookahead Models
      2. 19.2 Creating an Approximate Lookahead Model
      3. 19.2.1 Modeling the Lookahead Model
      4. 19.2.2 Strategies for Approximating the Lookahead Model
      5. 19.3 Modified Objectives in Lookahead Models
      6. 19.3.1 Managing Risk
      7. 19.3.2 Utility Functions for Multiobjective Problems
      8. 19.3.3 Model Discounting
      9. 19.4 Evaluating DLA Policies
      10. 19.4.1 Evaluating Policies in a Simulator
      11. 19.4.2 Evaluating Risk-Adjusted Policies
      12. 19.4.3 Evaluating Policies in the Field
      13. 19.4.4 Tuning Direct Lookahead Policies
      14. 19.5 Why Use a DLA?
      15. 19.6 Deterministic Lookaheads
      16. 19.6.1 A Deterministic Lookahead: Shortest Path Problems
      17. 19.6.2 Parameterized Lookaheads
      18. 19.7 A Tour of Stochastic Lookahead Policies
      19. 19.7.1 Lookahead PFAs
      20. 19.7.2 Lookahead CFAs
      21. 19.7.3 Lookahead VFAs for the Lookahead Model
      22. 19.7.4 Lookahead DLAs for the Lookahead Model
      23. 19.7.5 Discussion
      24. 19.8 Monte Carlo Tree Search for Discrete Decisions
      25. 19.8.1 Basic Idea
      26. 19.8.2 The Steps of MCTS
      27. 19.8.3 Discussion
      28. 19.8.4 Optimistic Monte Carlo Tree Search
      29. 19.9 Two-Stage Stochastic Programming for Vector Decisions*
      30. 19.9.1 The Basic Two-Stage Stochastic Program
      31. 19.9.2 Two-Stage Approximation of a Sequential Problem
      32. 19.9.3 Discussion
      33. 19.10 Observations on DLA Policies
      34. 19.11 Bibliographic Notes
      35. Exercises
      36. Bibliography
  11. Part VI – Multiagent Systems
    1. 20 Multiagent Modeling and Learning
      1. 20.1 Overview of Multiagent Systems
      2. 20.1.1 Dimensions of a Multiagent System
      3. 20.1.2 Communication
      4. 20.1.3 Modeling a Multiagent System
      5. 20.1.4 Controlling Architectures
      6. 20.2 A Learning Problem – Flu Mitigation
      7. 20.2.1 Model 1: A Static Model
      8. 20.2.2 Variations of Our Flu Model
      9. 20.2.3 Two-Agent Learning Models
      10. 20.2.4 Transition Functions for Two-Agent Model
      11. 20.2.5 Designing Policies for the Flu Problem
      12. 20.3 The POMDP Perspective*
      13. 20.4 The Two-Agent Newsvendor Problem
      14. 20.5 Multiple Independent Agents – An HVAC Controller Model
      15. 20.5.1 Model
      16. 20.5.2 Designing Policies
      17. 20.6 Cooperative Agents – A Spatially Distributed Blood Management Problem
      18. 20.7 Closing Notes
      19. 20.8 Why Does it Work?
      20. 20.8.1 Derivation of the POMDP Belief Transition Function
      21. 20.9 Bibliographic Notes
      22. Exercises
      23. Bibliography
  12. Index
  13. End User License Agreement

Product information

  • Title: Reinforcement Learning and Stochastic Optimization
  • Author(s): Warren B. Powell
  • Release date: March 2022
  • Publisher(s): Wiley
  • ISBN: 9781119815037