Book description
REINFORCEMENT LEARNING AND STOCHASTIC OPTIMIZATION: Clearing the jungle of stochastic optimization
Sequential decision problems, which consist of “decision, information, decision, information,” are ubiquitous, spanning virtually every human activity: business applications, health (personal and public health, and medical decision making), energy, the sciences, all fields of engineering, finance, and e-commerce. The diversity of applications has attracted the attention of at least 15 distinct fields of research, working with eight distinct notational systems, which have produced a vast array of analytical tools. A byproduct is that powerful tools developed in one community may be unknown to other communities.
Reinforcement Learning and Stochastic Optimization offers a single canonical framework that can model any sequential decision problem using five core components: state variables, decision variables, exogenous information variables, transition function, and objective function. This book highlights twelve types of uncertainty that might enter any model and pulls together the diverse set of methods for making decisions, known as policies, into four fundamental classes that span every method suggested in the academic literature or used in practice.
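To make the five-component framework concrete, here is a minimal Python sketch of a sequential decision loop for a toy inventory problem. It is an illustration only, not code from the book: the order-up-to policy, the Poisson demand, and all prices and names are assumptions chosen for the example.

```python
import numpy as np

# Illustrative sketch of the five elements of the universal framework
# (state, decision, exogenous information, transition, objective) for a
# toy inventory problem. All names and numbers are assumptions.

def transition(state, decision, exog_info):
    """Transition function: next inventory is what is left after demand."""
    return max(0, state + decision - exog_info)

def contribution(state, decision, exog_info, price=10.0, cost=7.0):
    """One-period contribution: revenue from units sold minus ordering cost."""
    sales = min(state + decision, exog_info)
    return price * sales - cost * decision

def order_up_to_policy(state, theta=5):
    """A simple parameterized policy: order up to the target level theta."""
    return max(0, theta - state)

def simulate(policy, horizon=50, seed=0):
    """Objective: accumulate contributions along one sample path of
    'decision, information, decision, information, ...'."""
    rng = np.random.default_rng(seed)
    state, total = 0, 0.0
    for _ in range(horizon):
        decision = policy(state)        # decision ...
        demand = rng.poisson(4)         # ... then exogenous information
        total += contribution(state, decision, demand)
        state = transition(state, decision, demand)
    return total

print(simulate(order_up_to_policy))
```

Tuning theta by repeated simulation is the simplest example of policy search; the four policy classes described in the book organize this and the other ways of computing a decision from a state.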
Reinforcement Learning and Stochastic Optimization is the first book to provide a balanced treatment of the different methods for modeling and solving sequential decision problems, following the style used by most books on machine learning, optimization, and simulation. The presentation is designed for readers with a course in probability and statistics, and an interest in modeling and applications. Linear programming is occasionally used for specific problem classes. The book is designed for readers who are new to the field, as well as those with some background in optimization under uncertainty.
Throughout this book, readers will find references to over 100 different applications, spanning pure learning problems, dynamic resource allocation problems, general state-dependent problems, and hybrid learning/resource allocation problems such as those that arose during the COVID pandemic. There are 370 exercises, organized into seven groups: review questions, modeling, computation, problem solving, theory, programming exercises, and a “diary problem” that the reader chooses at the beginning of the book and uses as the basis for questions throughout the rest of the book.
Table of contents
- Cover
- Title page
- Copyright
- Preface
- Acknowledgments
- Part I – Introduction
- 1 Sequential Decision Problems
- 1.1 The Audience
- 1.2 The Communities of Sequential Decision Problems
- 1.3 Our Universal Modeling Framework
- 1.4 Designing Policies for Sequential Decision Problems
- 1.4.1 Policy Search
- 1.4.2 Policies Based on Lookahead Approximations
- 1.4.3 Mixing and Matching
- 1.4.4 Optimality of the Four Classes
- 1.4.5 Pulling it All Together
- 1.5 Learning
- 1.6 Themes
- 1.6.1 Blending Learning and Optimization
- 1.6.2 Bridging Machine Learning to Sequential Decisions
- 1.6.3 From Deterministic to Stochastic Optimization
- 1.6.4 From Single to Multiple Agents
- 1.7 Our Modeling Approach
- 1.8 How to Read this Book
- 1.8.1 Organization of Topics
- 1.8.2 How to Read Each Chapter
- 1.8.3 Organization of Exercises
- 1.9 Bibliographic Notes
- Exercises
- Bibliography
- 2 Canonical Problems and Applications
- 2.1 Canonical Problems
- 2.1.1 Stochastic Search – Derivative-based and Derivative-free
- 2.1.1.1 Derivative-based Stochastic Search
- 2.1.1.2 Derivative-free Stochastic Search
- 2.1.2 Decision Trees
- 2.1.3 Markov Decision Processes
- 2.1.4 Optimal Control
- 2.1.5 Approximate Dynamic Programming
- 2.1.6 Reinforcement Learning
- 2.1.7 Optimal Stopping
- 2.1.8 Stochastic Programming
- 2.1.9 The Multiarmed Bandit Problem
- 2.1.10 Simulation Optimization
- 2.1.11 Active Learning
- 2.1.12 Chance-constrained Programming
- 2.1.13 Model Predictive Control
- 2.1.14 Robust Optimization
- 2.2 A Universal Modeling Framework for Sequential Decision Problems
- 2.2.1 Our Universal Model for Sequential Decision Problems
- 2.2.2 A Compact Modeling Presentation
- 2.2.3 MDP/RL vs. Optimal Control Modeling Frameworks
- 2.3 Applications
- 2.3.1 The Newsvendor Problems
- 2.3.1.1 Basic Newsvendor – Final Reward
- 2.3.1.2 Basic Newsvendor – Cumulative Reward
- 2.3.1.3 Contextual Newsvendor
- 2.3.1.4 Multidimensional Newsvendor Problems
- 2.3.2 Inventory/Storage Problems
- 2.3.2.1 Inventory Without Lags
- 2.3.2.2 Inventory Planning with Forecasts
- 2.3.2.3 Lagged Decisions
- 2.3.3 Shortest Path Problems
- 2.3.3.1 A Deterministic Shortest Path Problem
- 2.3.3.2 A Stochastic Shortest Path Problem
- 2.3.3.3 A Dynamic Shortest Path Problem
- 2.3.3.4 A Robust Shortest Path Problem
- 2.3.4 Some Fleet Management Problems
- 2.3.4.1 The Nomadic Trucker
- 2.3.4.2 From One Driver to a Fleet
- 2.3.5 Pricing
- 2.3.6 Medical Decision Making
- 2.3.7 Scientific Exploration
- 2.3.8 Machine Learning vs. Sequential Decision Problems
- 2.4 Bibliographic Notes
- Exercises
- Bibliography
- 3 Online Learning
- 3.1 Machine Learning for Sequential Decisions
- 3.1.1 Observations and Data in Stochastic Optimization
- 3.1.2 Indexing Input xn and Response yn+1
- 3.1.3 Functions We are Learning
- 3.1.4 Sequential Learning: From Very Little Data to … More Data
- 3.1.5 Approximation Strategies
- 3.1.6 From Data Analytics to Decision Analytics
- 3.1.7 Batch vs. Online Learning
- 3.2 Adaptive Learning Using Exponential Smoothing
- 3.3 Lookup Tables with Frequentist Updating
- 3.4 Lookup Tables with Bayesian Updating
- 3.4.1 The Updating Equations for Independent Beliefs
- 3.4.2 Updating for Correlated Beliefs
- 3.4.3 Gaussian Process Regression
- 3.5 Computing Bias and Variance*
- 3.6 Lookup Tables and Aggregation*
- 3.6.1 Hierarchical Aggregation
- 3.6.2 Estimates of Different Levels of Aggregation
- 3.6.3 Combining Multiple Levels of Aggregation
- 3.7 Linear Parametric Models
- 3.7.1 Linear Regression Review
- 3.7.2 Sparse Additive Models and Lasso
- 3.8 Recursive Least Squares for Linear Models
- 3.8.1 Recursive Least Squares for Stationary Data
- 3.8.2 Recursive Least Squares for Nonstationary Data*
- 3.8.3 Recursive Estimation Using Multiple Observations*
- 3.9 Nonlinear Parametric Models
- 3.9.1 Maximum Likelihood Estimation
- 3.9.2 Sampled Belief Models
- 3.9.3 Neural Networks – Parametric*
- 3.9.4 Limitations of Neural Networks
- 3.10 Nonparametric Models*
- 3.10.1 K-Nearest Neighbor
- 3.10.2 Kernel Regression
- 3.10.3 Local Polynomial Regression
- 3.10.4 Deep Neural Networks
- 3.10.5 Support Vector Machines
- 3.10.6 Indexed Functions, Tree Structures, and Clustering
- 3.10.7 Comments on Nonparametric Models
- 3.11 Nonstationary Learning*
- 3.11.1 Nonstationary Learning I – Martingale Truth
- 3.11.2 Nonstationary Learning II – Transient Truth
- 3.11.3 Learning Processes
- 3.12 The Curse of Dimensionality
- 3.13 Designing Approximation Architectures in Adaptive Learning
- 3.14 Why Does It Work?**
- 3.14.1 Derivation of the Recursive Estimation Equations
- 3.14.2 The Sherman-Morrison Updating Formula
- 3.14.3 Correlations in Hierarchical Estimation
- 3.14.4 Proof of Proposition 3.14.1
- 3.15 Bibliographic Notes
- Exercises
- Bibliography
- 4 Introduction to Stochastic Search
- 4.1 Illustrations of the Basic Stochastic Optimization Problem
- 4.2 Deterministic Methods
- 4.2.1 A “Stochastic” Shortest Path Problem
- 4.2.2 A Newsvendor Problem with Known Distribution
- 4.2.3 Chance-Constrained Optimization
- 4.2.4 Optimal Control
- 4.2.5 Discrete Markov Decision Processes
- 4.2.6 Remarks
- 4.3 Sampled Models
- 4.3.1 Formulating a Sampled Model
- 4.3.1.1 A Sampled Stochastic Linear Program
- 4.3.1.2 Sampled Chance-Constrained Models
- 4.3.1.3 Sampled Parametric Models
- 4.3.2 Convergence
- 4.3.3 Creating a Sampled Model
- 4.3.4 Decomposition Strategies*
- 4.4 Adaptive Learning Algorithms
- 4.4.1 Modeling Adaptive Learning Problems
- 4.4.2 Online vs. Offline Applications
- 4.4.2.1 Machine Learning
- 4.4.2.2 Optimization
- 4.4.3 Objective Functions for Learning
- 4.4.4 Designing Policies
- 4.5 Closing Remarks
- 4.6 Bibliographic Notes
- Exercises
- Bibliography
- Part II – Stochastic Search
- 5 Derivative-Based Stochastic Search
- 5.1 Some Sample Applications
- 5.2 Modeling Uncertainty
- 5.2.1 Training Uncertainty W1, …, WN
- 5.2.2 Model Uncertainty S0
- 5.2.3 Testing Uncertainty
- 5.2.4 Policy Evaluation
- 5.2.5 Closing Notes
- 5.3 Stochastic Gradient Methods
- 5.3.1 A Stochastic Gradient Algorithm
- 5.3.2 Introduction to Stepsizes
- 5.3.3 Evaluating a Stochastic Gradient Algorithm
- 5.3.4 A Note on Notation
- 5.4 Styles of Gradients
- 5.4.1 Gradient Smoothing
- 5.4.2 Second-Order Methods
- 5.4.3 Finite Differences
- 5.4.4 SPSA
- 5.4.5 Constrained Problems
- 5.5 Parameter Optimization for Neural Networks*
- 5.5.1 Computing the Gradient
- 5.5.2 The Stochastic Gradient Algorithm
- 5.6 Stochastic Gradient Algorithm as a Sequential Decision Problem
- 5.7 Empirical Issues
- 5.8 Transient Problems*
- 5.9 Theoretical Performance*
- 5.10 Why Does It Work?
- 5.10.1 Some Probabilistic Preliminaries
- 5.10.2 An Older Proof*
- 5.10.3 A More Modern Proof**
- 5.11 Bibliographic Notes
- Exercises
- Bibliography
- 6 Stepsize Policies
- 6.1 Deterministic Stepsize Policies
- 6.1.1 Properties for Convergence
- 6.1.2 A Collection of Deterministic Policies
- 6.1.2.1 Constant Stepsizes
- 6.1.2.2 Generalized Harmonic Stepsizes
- 6.1.2.3 Polynomial Learning Rates
- 6.1.2.4 McClain’s Formula
- 6.1.2.5 Search-then-Converge Learning Policy
- 6.2 Adaptive Stepsize Policies
- 6.2.1 The Case for Adaptive Stepsizes
- 6.2.2 Convergence Conditions
- 6.2.3 A Collection of Stochastic Policies
- 6.2.3.1 Kesten’s Rule
- 6.2.3.2 Trigg’s Formula
- 6.2.3.3 Stochastic Gradient Adaptive Stepsize Rule
- 6.2.3.4 ADAM
- 6.2.3.5 AdaGrad
- 6.2.3.6 RMSProp
- 6.2.4 Experimental Notes
- 6.3 Optimal Stepsize Policies*
- 6.3.1 Optimal Stepsizes for Stationary Data
- 6.3.2 Optimal Stepsizes for Nonstationary Data – I
- 6.3.3 Optimal Stepsizes for Nonstationary Data – II
- 6.4 Optimal Stepsizes for Approximate Value Iteration*
- 6.5 Convergence
- 6.6 Guidelines for Choosing Stepsize Policies
- 6.7 Why Does It Work?*
- 6.7.1 Proof of BAKF Stepsize
- 6.8 Bibliographic Notes
- Exercises
- Bibliography
- 7 Derivative-Free Stochastic Search
- 7.1 Overview of Derivative-free Stochastic Search
- 7.1.1 Applications and Time Scales
- 7.1.2 The Communities of Derivative-free Stochastic Search
- 7.1.3 The Multiarmed Bandit Story
- 7.1.4 From Passive Learning to Active Learning to Bandit Problems
- 7.2 Modeling Derivative-free Stochastic Search
- 7.2.1 The Universal Model
- 7.2.2 Illustration: Optimizing a Manufacturing Process
- 7.2.3 Major Problem Classes
- 7.3 Designing Policies
- 7.4 Policy Function Approximations
- 7.5 Cost Function Approximations
- 7.6 VFA-based Policies
- 7.6.1 An Optimal Policy
- 7.6.2 Beta-Bernoulli Belief Model
- 7.6.3 Backward Approximate Dynamic Programming
- 7.6.4 Gittins Indices for Learning in Steady State
- 7.7 Direct Lookahead Policies
- 7.7.1 When do we Need Lookahead Policies?
- 7.7.2 Single Period Lookahead Policies
- 7.7.3 Restricted Multiperiod Lookahead
- 7.7.4 Multiperiod Deterministic Lookahead
- 7.7.5 Multiperiod Stochastic Lookahead Policies
- 7.7.6 Hybrid Direct Lookahead
- 7.8 The Knowledge Gradient (Continued)*
- 7.8.1 The Belief Model
- 7.8.2 The Knowledge Gradient for Maximizing Final Reward
- 7.8.3 The Knowledge Gradient for Maximizing Cumulative Reward
- 7.8.4 The Knowledge Gradient for Sampled Belief Model*
- 7.8.5 Knowledge Gradient for Correlated Beliefs
- 7.9 Learning in Batches
- 7.10 Simulation Optimization*
- 7.10.1 An Indifference Zone Algorithm
- 7.10.2 Optimal Computing Budget Allocation
- 7.11 Evaluating Policies
- 7.11.1 Alternative Performance Metrics*
- 7.11.2 Perspectives of Optimality*
- 7.12 Designing Policies
- 7.12.1 Characteristics of a Policy
- 7.12.2 The Effect of Scaling
- 7.12.3 Tuning
- 7.13 Extensions*
- 7.13.1 Learning in Nonstationary Settings
- 7.13.2 Strategies for Designing Time-dependent Policies
- 7.13.3 A Transient Learning Model
- 7.13.4 The Knowledge Gradient for Transient Problems
- 7.13.5 Learning with Large or Continuous Choice Sets
- 7.13.6 Learning with Exogenous State Information – the Contextual Bandit Problem
- 7.13.7 State-dependent vs. State-independent Problems
- 7.14 Bibliographic Notes
- Exercises
- Bibliography
- Part III – State-dependent Problems
- 8 State-dependent Problems
- 8.1 Graph Problems
- 8.1.1 A Stochastic Shortest Path Problem
- 8.1.2 The Nomadic Trucker
- 8.1.3 The Transformer Replacement Problem
- 8.1.4 Asset Valuation
- 8.2 Inventory Problems
- 8.2.1 A Basic Inventory Problem
- 8.2.2 The Inventory Problem – II
- 8.2.3 The Lagged Asset Acquisition Problem
- 8.2.4 The Batch Replenishment Problem
- 8.3 Complex Resource Allocation Problems
- 8.3.1 The Dynamic Assignment Problem
- 8.3.2 The Blood Management Problem
- 8.4 State-dependent Learning Problems
- 8.4.1 Medical Decision Making
- 8.4.2 Laboratory Experimentation
- 8.4.3 Bidding for Ad-clicks
- 8.4.4 An Information-collecting Shortest Path Problem
- 8.5 A Sequence of Problem Classes
- 8.6 Bibliographic Notes
- Exercises
- Bibliography
- 9 Modeling Sequential Decision Problems
- 9.1 A Simple Modeling Illustration
- 9.2 Notational Style
- 9.3 Modeling Time
- 9.4 The States of Our System
- 9.4.1 Defining the State Variable
- 9.4.2 The Three States of Our System
- 9.4.3 Initial State S0 vs. Subsequent States St, t > 0
- 9.4.4 Lagged State Variables*
- 9.4.5 The Post-decision State Variable*
- 9.4.6 A Shortest Path Illustration
- 9.4.7 Belief States*
- 9.4.8 Latent Variables*
- 9.4.9 Rolling Forecasts*
- 9.4.10 Flat vs. Factored State Representations*
- 9.4.11 A Programmer’s Perspective of State Variables
- 9.5 Modeling Decisions
- 9.5.1 Types of Decisions
- 9.5.2 Initial Decision x0 vs. Subsequent Decisions xt, t > 0
- 9.5.3 Strategic, Tactical, and Execution Decisions
- 9.5.4 Constraints
- 9.5.5 Introducing Policies
- 9.6 The Exogenous Information Process
- 9.6.1 Basic Notation for Information Processes
- 9.6.2 Outcomes and Scenarios
- 9.6.3 Lagged Information Processes*
- 9.6.4 Models of Information Processes*
- 9.6.5 Supervisory Processes*
- 9.7 The Transition Function
- 9.7.1 A General Model
- 9.7.2 Model-free Dynamic Programming
- 9.7.3 Exogenous Transitions
- 9.8 The Objective Function
- 9.8.1 The Performance Metric
- 9.8.2 Optimizing the Policy
- 9.8.3 Dependence of Optimal Policy on S0
- 9.8.4 State-dependent Variations
- 9.8.5 Uncertainty Operators
- 9.9 Illustration: An Energy Storage Model
- 9.9.1 With a Time-series Price Model
- 9.9.2 With Passive Learning
- 9.9.3 With Active Learning
- 9.9.4 With Rolling Forecasts
- 9.10 Base Models and Lookahead Models
- 9.11 A Classification of Problems*
- 9.12 Policy Evaluation*
- 9.13 Advanced Probabilistic Modeling Concepts**
- 9.13.1 A Measure-theoretic View of Information**
- 9.13.2 Policies and Measurability
- 9.14 Looking Forward
- 9.15 Bibliographic Notes
- Exercises
- Bibliography
- 10 Uncertainty Modeling
- 10.1 Sources of Uncertainty
- 10.1.1 Observational Errors
- 10.1.2 Exogenous Uncertainty
- 10.1.3 Prognostic Uncertainty
- 10.1.4 Inferential (or Diagnostic) Uncertainty
- 10.1.5 Experimental Variability
- 10.1.6 Model Uncertainty
- 10.1.7 Transitional Uncertainty
- 10.1.8 Control/implementation Uncertainty
- 10.1.9 Communication Errors and Biases
- 10.1.10 Algorithmic Instability
- 10.1.11 Goal Uncertainty
- 10.1.12 Political/regulatory Uncertainty
- 10.1.13 Discussion
- 10.2 A Modeling Case Study: The COVID Pandemic
- 10.3 Stochastic Modeling
- 10.3.1 Sampling Exogenous Information
- 10.3.2 Types of Distributions
- 10.3.3 Modeling Sample Paths
- 10.3.4 State-action-dependent Processes
- 10.3.5 Modeling Correlations
- 10.4 Monte Carlo Simulation
- 10.4.1 Generating Uniform [0, 1] Random Variables
- 10.4.2 Uniform and Normal Random Variables
- 10.4.3 Generating Random Variables from Inverse Cumulative Distributions
- 10.4.4 Inverse Cumulative From Quantile Distributions
- 10.4.5 Distributions with Uncertain Parameters
- 10.5 Case Study: Modeling Electricity Prices
- 10.5.1 Mean Reversion
- 10.5.2 Jump-diffusion Models
- 10.5.3 Quantile Distributions
- 10.5.4 Regime Shifting
- 10.5.5 Crossing Times
- 10.6 Sampling vs. Sampled Models
- 10.6.1 Iterative Sampling: A Stochastic Gradient Algorithm
- 10.6.2 Static Sampling: Solving a Sampled Model
- 10.6.3 Sampled Representation with Bayesian Updating
- 10.7 Closing Notes
- 10.8 Bibliographic Notes
- Exercises
- Bibliography
- 11 Designing Policies
- 11.1 From Optimization to Machine Learning to Sequential Decision Problems
- 11.2 The Classes of Policies
- 11.3 Policy Function Approximations
- 11.4 Cost Function Approximations
- 11.5 Value Function Approximations
- 11.6 Direct Lookahead Approximations
- 11.6.1 The Basic Idea
- 11.6.2 Modeling the Lookahead Problem
- 11.6.3 The Policy-Within-a-Policy
- 11.7 Hybrid Strategies
- 11.7.1 Cost Function Approximation with Policy Function Approximations
- 11.7.2 Lookahead Policies with Value Function Approximations
- 11.7.3 Lookahead Policies with Cost Function Approximations
- 11.7.4 Tree Search with Rollout Heuristic and a Lookup Table Policy
- 11.7.5 Value Function Approximation with Policy Function Approximation
- 11.7.6 Fitting Value Functions Using ADP and Policy Search
- 11.8 Randomized Policies
- 11.9 Illustration: An Energy Storage Model Revisited
- 11.9.1 Policy Function Approximation
- 11.9.2 Cost Function Approximation
- 11.9.3 Value Function Approximation
- 11.9.4 Deterministic Lookahead
- 11.9.5 Hybrid Lookahead-Cost Function Approximation
- 11.9.6 Experimental Testing
- 11.10 Choosing the Policy Class
- 11.10.1 The Policy Classes
- 11.10.2 Policy Complexity-Computational Tradeoffs
- 11.10.3 Screening Questions
- 11.11 Policy Evaluation
- 11.12 Parameter Tuning
- 11.12.1 The Soft Issues
- 11.12.2 Searching Across Policy Classes
- 11.13 Bibliographic Notes
- Exercises
- Bibliography
- Part IV – Policy Search
- 12 Policy Function Approximations and Policy Search
- 12.1 Policy Search as a Sequential Decision Problem
- 12.2 Classes of Policy Function Approximations
- 12.2.1 Lookup Table Policies
- 12.2.2 Boltzmann Policies for Discrete Actions
- 12.2.3 Linear Decision Rules
- 12.2.4 Monotone Policies
- 12.2.5 Nonlinear Policies
- 12.2.6 Nonparametric/Locally Linear Policies
- 12.2.7 Contextual Policies
- 12.3 Problem Characteristics
- 12.4 Flavors of Policy Search
- 12.5 Policy Search with Numerical Derivatives
- 12.6 Derivative-Free Methods for Policy Search
- 12.6.1 Belief Models
- 12.6.2 Learning Through Perturbed PFAs
- 12.6.3 Learning CFAs
- 12.6.4 DLA Using the Knowledge Gradient
- 12.6.5 Comments
- 12.7 Exact Derivatives for Continuous Sequential Problems*
- 12.8 Exact Derivatives for Discrete Dynamic Programs**
- 12.8.1 A Stochastic Policy
- 12.8.2 The Objective Function
- 12.8.3 The Policy Gradient Theorem
- 12.8.4 Computing the Policy Gradient
- 12.9 Supervised Learning
- 12.10 Why Does It Work?
- 12.10.1 Derivation of the Policy Gradient Theorem
- 12.11 Bibliographic Notes
- Exercises
- Bibliography
- 13 Cost Function Approximations
- 13.1 General Formulation for Parametric CFA
- 13.2 Objective-Modified CFAs
- 13.2.1 Linear Cost Function Correction
- 13.2.2 CFAs for Dynamic Assignment Problems
- 13.2.3 Dynamic Shortest Paths
- 13.2.4 Dynamic Trading Policy
- 13.2.5 Discussion
- 13.3 Constraint-Modified CFAs
- 13.3.1 General Formulation of Constraint-Modified CFAs
- 13.3.2 A Blood Management Problem
- 13.3.3 An Energy Storage Example with Rolling Forecasts
- 13.4 Bibliographic Notes
- Exercises
- Bibliography
- Part V – Lookahead Policies
- 14 Exact Dynamic Programming
- 14.1 Discrete Dynamic Programming
- 14.2 The Optimality Equations
- 14.2.1 Bellman’s Equations
- 14.2.2 Computing the Transition Matrix
- 14.2.3 Random Contributions
- 14.2.4 Bellman’s Equation Using Operator Notation*
- 14.3 Finite Horizon Problems
- 14.4 Continuous Problems with Exact Solutions
- 14.4.1 The Gambling Problem
- 14.4.2 The Continuous Budgeting Problem
- 14.5 Infinite Horizon Problems*
- 14.6 Value Iteration for Infinite Horizon Problems*
- 14.6.1 A Gauss-Seidel Variation
- 14.6.2 Relative Value Iteration
- 14.6.3 Bounds and Rates of Convergence
- 14.7 Policy Iteration for Infinite Horizon Problems*
- 14.8 Hybrid Value-Policy Iteration*
- 14.9 Average Reward Dynamic Programming*
- 14.10 The Linear Programming Method for Dynamic Programs**
- 14.11 Linear Quadratic Regulation
- 14.12 Why Does It Work?**
- 14.12.1 The Optimality Equations
- 14.12.2 Convergence of Value Iteration
- 14.12.3 Monotonicity of Value Iteration
- 14.12.4 Bounding the Error from Value Iteration
- 14.12.5 Randomized Policies
- 14.13 Bibliographic Notes
- Exercises
- Bibliography
- 15 Backward Approximate Dynamic Programming
- 15.1 Backward Approximate Dynamic Programming for Finite Horizon Problems
- 15.1.1 Some Preliminaries
- 15.1.2 Backward ADP Using Lookup Tables
- 15.1.3 Backward ADP Algorithm with Continuous Approximations
- 15.2 Fitted Value Iteration for Infinite Horizon Problems
- 15.3 Value Function Approximation Strategies
- 15.3.1 Linear Models
- 15.3.2 Monotone Functions
- 15.3.3 Other Approximation Models
- 15.4 Computational Observations
- 15.4.1 Experimental Benchmarking of Backward ADP
- 15.4.2 Computational Notes
- 15.5 Bibliographic Notes
- Exercises
- Bibliography
- 16 Forward ADP I: The Value of a Policy
- 16.1 Sampling the Value of a Policy
- 16.1.1 Direct Policy Evaluation for Finite Horizon Problems
- 16.1.2 Policy Evaluation for Infinite Horizon Problems
- 16.1.3 Temporal Difference Updates
- 16.1.4 TD(λ)
- 16.1.5 TD(0) and Approximate Value Iteration
- 16.1.6 TD Learning for Infinite Horizon Problems
- 16.2 Stochastic Approximation Methods
- 16.3 Bellman’s Equation Using a Linear Model*
- 16.3.1 A Matrix-based Derivation**
- 16.3.2 A Simulation-based Implementation
- 16.3.3 Least Squares Temporal Difference Learning (LSTD)
- 16.3.4 Least Squares Policy Evaluation
- 16.4 Analysis of TD(0), LSTD, and LSPE Using a Single State*
- 16.4.1 Recursive Least Squares and TD(0)
- 16.4.2 Least Squares Policy Evaluation
- 16.4.3 Least Squares Temporal Difference Learning
- 16.4.4 Discussion
- 16.5 Gradient-based Methods for Approximate Value Iteration*
- 16.5.1 Approximate Value Iteration with Linear Models**
- 16.5.2 A Geometric View of Linear Models*
- 16.6 Value Function Approximations Based on Bayesian Learning*
- 16.6.1 Minimizing Bias for Infinite Horizon Problems
- 16.6.2 Lookup Tables with Correlated Beliefs
- 16.6.3 Parametric Models
- 16.6.4 Creating the Prior
- 16.7 Learning Algorithms and Stepsizes
- 16.7.1 Least Squares Temporal Differences
- 16.7.2 Least Squares Policy Evaluation
- 16.7.3 Recursive Least Squares
- 16.7.4 Bounding 1/n Convergence for Approximate Value Iteration
- 16.7.5 Discussion
- 16.8 Bibliographic Notes
- Exercises
- Bibliography
- 17 Forward ADP II: Policy Optimization
- 17.1 Overview of Algorithmic Strategies
- 17.2 Approximate Value Iteration and Q-Learning Using Lookup Tables
- 17.2.1 Value Iteration Using a Pre-Decision State Variable
- 17.2.2 Q-Learning
- 17.2.3 Value Iteration Using a Post-Decision State Variable
- 17.2.4 Value Iteration Using a Backward Pass
- 17.3 Styles of Learning
- 17.3.1 Offline Learning
- 17.3.2 From Offline to Online
- 17.3.3 Evaluating Offline and Online Learning Policies
- 17.3.4 Lookahead Policies
- 17.4 Approximate Value Iteration Using Linear Models
- 17.5 On-policy vs. Off-policy Learning and the Exploration–Exploitation Problem
- 17.5.1 Terminology
- 17.5.2 Learning with Lookup Tables
- 17.5.3 Learning with Generalized Belief Models
- 17.6 Applications
- 17.6.1 Pricing an American Option
- 17.6.2 Playing “Lose Tic-Tac-Toe”
- 17.6.3 Approximate Dynamic Programming for Deterministic Problems
- 17.7 Approximate Policy Iteration
- 17.7.1 Finite Horizon Problems Using Lookup Tables
- 17.7.2 Finite Horizon Problems Using Linear Models
- 17.7.3 LSTD for Infinite Horizon Problems Using Linear Models
- 17.8 The Actor–Critic Paradigm
- 17.9 Statistical Bias in the Max Operator*
- 17.10 The Linear Programming Method Using Linear Models*
- 17.11 Finite Horizon Approximations for Steady-State Applications
- 17.12 Bibliographic Notes
- Exercises
- Bibliography
- 18 Forward ADP III: Convex Resource Allocation Problems
- 18.1 Resource Allocation Problems
- 18.1.1 The Newsvendor Problem
- 18.1.2 Two-Stage Resource Allocation Problems
- 18.1.3 A General Multiperiod Resource Allocation Model*
- 18.2 Values Versus Marginal Values
- 18.3 Piecewise Linear Approximations for Scalar Functions
- 18.3.1 The Leveling Algorithm
- 18.3.2 The CAVE Algorithm
- 18.4 Regression Methods
- 18.5 Separable Piecewise Linear Approximations
- 18.6 Benders Decomposition for Nonseparable Approximations**
- 18.6.1 Benders’ Decomposition for Two-Stage Problems
- 18.6.2 Asymptotic Analysis of Benders with Regularization**
- 18.6.3 Benders with Regularization
- 18.7 Linear Approximations for High-Dimensional Applications
- 18.8 Resource Allocation with Exogenous Information State
- 18.9 Closing Notes
- 18.10 Bibliographic Notes
- Exercises
- Bibliography
- 19 Direct Lookahead Policies
- 19.1 Optimal Policies Using Lookahead Models
- 19.2 Creating an Approximate Lookahead Model
- 19.2.1 Modeling the Lookahead Model
- 19.2.2 Strategies for Approximating the Lookahead Model
- 19.3 Modified Objectives in Lookahead Models
- 19.3.1 Managing Risk
- 19.3.2 Utility Functions for Multiobjective Problems
- 19.3.3 Model Discounting
- 19.4 Evaluating DLA Policies
- 19.4.1 Evaluating Policies in a Simulator
- 19.4.2 Evaluating Risk-Adjusted Policies
- 19.4.3 Evaluating Policies in the Field
- 19.4.4 Tuning Direct Lookahead Policies
- 19.5 Why Use a DLA?
- 19.6 Deterministic Lookaheads
- 19.6.1 A Deterministic Lookahead: Shortest Path Problems
- 19.6.2 Parameterized Lookaheads
- 19.7 A Tour of Stochastic Lookahead Policies
- 19.7.1 Lookahead PFAs
- 19.7.2 Lookahead CFAs
- 19.7.3 Lookahead VFAs for the Lookahead Model
- 19.7.4 Lookahead DLAs for the Lookahead Model
- 19.7.5 Discussion
- 19.8 Monte Carlo Tree Search for Discrete Decisions
- 19.8.1 Basic Idea
- 19.8.2 The Steps of MCTS
- 19.8.3 Discussion
- 19.8.4 Optimistic Monte Carlo Tree Search
- 19.9 Two-Stage Stochastic Programming for Vector Decisions*
- 19.9.1 The Basic Two-Stage Stochastic Program
- 19.9.2 Two-Stage Approximation of a Sequential Problem
- 19.9.3 Discussion
- 19.10 Observations on DLA Policies
- 19.11 Bibliographic Notes
- Exercises
- Bibliography
- Part VI – Multiagent Systems
- 20 Multiagent Modeling and Learning
- 20.1 Overview of Multiagent Systems
- 20.1.1 Dimensions of a Multiagent System
- 20.1.2 Communication
- 20.1.3 Modeling a Multiagent System
- 20.1.4 Controlling Architectures
- 20.2 A Learning Problem – Flu Mitigation
- 20.2.1 Model 1: A Static Model
- 20.2.2 Variations of Our Flu Model
- 20.2.3 Two-Agent Learning Models
- 20.2.4 Transition Functions for Two-Agent Model
- 20.2.5 Designing Policies for the Flu Problem
- 20.3 The POMDP Perspective*
- 20.4 The Two-Agent Newsvendor Problem
- 20.5 Multiple Independent Agents – An HVAC Controller Model
- 20.5.1 Model
- 20.5.2 Designing Policies
- 20.6 Cooperative Agents – A Spatially Distributed Blood Management Problem
- 20.7 Closing Notes
- 20.8 Why Does It Work?
- 20.8.1 Derivation of the POMDP Belief Transition Function
- 20.9 Bibliographic Notes
- Exercises
- Bibliography
- Index
- End User License Agreement
Product information
- Title: Reinforcement Learning and Stochastic Optimization
- Author(s): Warren B. Powell
- Release date: March 2022
- Publisher(s): Wiley
- ISBN: 9781119815037