3 An Abstract View of Reinforcement Learning

Contents
3.1. Bellman Operators
3.2. Approximation in Value Space and Newton's Method
3.3. Region of Stability
3.4. Policy Iteration, Rollout, and Newton's Method
3.5. How Sensitive is On-Line Play to the Off-Line Training Process?
3.6. Why Not Just Train a Policy Network and Use it Without On-Line Play?
3.7. Multiagent Problems and Multiagent Rollout
3.8. On-Line Simplified Policy Iteration
3.9. Exceptional Cases
3.10. Notes and Sources

In this chapter we will use geometric constructions to obtain insight into Bellman's equation, the value and policy iteration algorithms, approximation in value space, and some of the properties of the corresponding one-step or multistep lookahead policy μ̃. To understand these constructions, we need an abstract notational framework that is based on the operators that are involved in the Bellman equations.

3.1 BELLMAN OPERATORS

We denote by TJ the function of x that appears in the right-hand side of Bellman's equation. Its value at state x is given by

\[
(TJ)(x) = \min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha J\bigl(f(x,u,w)\bigr) \bigr\}, \qquad \text{for all } x.
\tag{3.1}
\]

Also for each policy μ, we introduce the corresponding function TμJ, which has value at x given by

\[
(T_\mu J)(x) = E\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl(f(x,\mu(x),w)\bigr) \bigr\}, \qquad \text{for all } x.
\tag{3.2}
\]

Thus T and Tμ can be viewed as operators (broadly referred to as the Bellman operators), which map functions J to other functions (TJ or TμJ, respectively).†

† Within the context of this work, the functions J on which T and Tμ operate will be real-valued functions of x, which we denote by J ∈ R(X). We will assume throughout that the expected values in Eqs. (3.1) and (3.2) are well-defined and finite when J is real-valued. This implies that TμJ will also be a real-valued function of x. On the other hand, (TJ)(x) may take the value −∞ because of the minimization in Eq. (3.1). We allow this possibility, although our illustrations will primarily depict the case where TJ is real-valued. Note that the general theory of abstract DP is developed with the use of extended real-valued functions; see [Ber22a].

An important property of the operators T and Tμ is that they are monotone, in the sense that if J and J′ are two functions of x such that J(x) ≥ J′(x) for all x, then we have

\[
(TJ)(x) \ge (TJ')(x), \qquad (T_\mu J)(x) \ge (T_\mu J')(x), \qquad \text{for all } x \text{ and } \mu.
\tag{3.3}
\]

This monotonicity property is evident from Eqs. (3.1) and (3.2), where the values of J are multiplied by nonnegative numbers.

Another important property is that the Bellman operator Tμ is linear, in the sense that it has the form TμJ = G + AμJ, where G ∈ R(X) is some function and Aμ : R(X) → R(X) is an operator such that for any functions J1, J2, and scalars γ1, γ2, we have‡

\[
A_\mu(\gamma_1 J_1 + \gamma_2 J_2) = \gamma_1 A_\mu J_1 + \gamma_2 A_\mu J_2.
\]

Moreover, from the definitions (3.1) and (3.2), we have

\[
(TJ)(x) = \min_{\mu \in M} (T_\mu J)(x), \qquad \text{for all } x,
\]

where M is the set of stationary policies. This is true because for any policy μ, there is no coupling constraint between the controls μ(x) and μ(x′) that correspond to two different states x and x′.

‡ An operator Tμ with this property is often called "affine," but in this work we just call it "linear." Also we use abbreviated notation to express pointwise equalities and inequalities, so that we write J = J′ or J ≤ J′ to express the fact that J(x) = J′(x) or J(x) ≤ J′(x), for all x, respectively.
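As a concrete reference point, here is a minimal tabular sketch of the two operators for a finite-state, finite-control discounted problem, written under the assumption that transition probabilities and expected stage costs are stored in arrays P[u, x, y] and g[u, x, y]; the array layout, the random test data, and the function names are illustrative choices, not part of the text. The last two lines check the monotonicity property (3.3) numerically.

```python
import numpy as np

def T(J, P, g, alpha):
    # (TJ)(x) = min_u sum_y P[u,x,y] * (g[u,x,y] + alpha*J[y]),  cf. Eq. (3.1)
    Q = (P * (g + alpha * J)).sum(axis=2)      # Q[u, x]
    return Q.min(axis=0)

def T_mu(J, mu, P, g, alpha):
    # (T_mu J)(x) applies the control mu[x] at each state x,  cf. Eq. (3.2)
    Q = (P * (g + alpha * J)).sum(axis=2)
    return Q[mu, np.arange(len(J))]

rng = np.random.default_rng(0)
m, n, alpha = 3, 4, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # stochastic rows
g = rng.random((m, n, n))                                      # stage costs
mu = rng.integers(0, m, size=n)                                # an arbitrary policy
J = rng.random(n)
J_prime = J + rng.random(n)                                    # J_prime >= J pointwise
assert np.all(T(J_prime, P, g, alpha) >= T(J, P, g, alpha))            # cf. Eq. (3.3)
assert np.all(T_mu(J_prime, mu, P, g, alpha) >= T_mu(J, mu, P, g, alpha))
```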
It follows that (TJ)(x) is a concave function of J for every x (the pointwise minimum of linear functions is a concave function). This will be important for our interpretation of one-step and multistep lookahead minimization as a Newton iteration for solving the Bellman equation J = TJ.

Example 3.1.1 (A Two-State and Two-Control Example)

Assume that there are two states 1 and 2, and two controls u and v. Consider the policy μ that applies control u at state 1 and control v at state 2. Then the operator Tμ takes the form

\[
(T_\mu J)(1) = \sum_{y=1}^{2} p_{1y}(u) \bigl( g(1,u,y) + \alpha J(y) \bigr),
\tag{3.4}
\]

\[
(T_\mu J)(2) = \sum_{y=1}^{2} p_{2y}(v) \bigl( g(2,v,y) + \alpha J(y) \bigr),
\tag{3.5}
\]

where pxy(u) and pxy(v) are the probabilities that the next state will be y, when the current state is x, and the control is u or v, respectively. Clearly, (TμJ)(1) and (TμJ)(2) are linear functions of J. Also the operator T of the Bellman equation J = TJ takes the form

\[
(TJ)(1) = \min\Bigl\{ \sum_{y=1}^{2} p_{1y}(u) \bigl( g(1,u,y) + \alpha J(y) \bigr),\ \sum_{y=1}^{2} p_{1y}(v) \bigl( g(1,v,y) + \alpha J(y) \bigr) \Bigr\},
\tag{3.6}
\]

\[
(TJ)(2) = \min\Bigl\{ \sum_{y=1}^{2} p_{2y}(u) \bigl( g(2,u,y) + \alpha J(y) \bigr),\ \sum_{y=1}^{2} p_{2y}(v) \bigl( g(2,v,y) + \alpha J(y) \bigr) \Bigr\}.
\tag{3.7}
\]

Thus, (TJ)(1) and (TJ)(2) are concave and piecewise linear as functions of the two-dimensional vector J (with two pieces; more generally, as many linear pieces as the number of controls). This concavity property holds in general since (TJ)(x) is the minimum of a collection of linear functions of J, one for each u ∈ U(x). Figure 3.1.1 illustrates (TμJ)(1) for the cases where μ(1) = u and μ(1) = v, (TμJ)(2) for the cases where μ(2) = u and μ(2) = v, (TJ)(1), and (TJ)(2), as functions of J = (J(1), J(2)).

Critical properties from the DP point of view are whether T and Tμ have fixed points; equivalently, whether the Bellman equations J = TJ and J = TμJ have solutions within the class of real-valued functions, and whether the set of solutions includes J* and Jμ, respectively. It may thus be important to verify that T or Tμ are contraction mappings. This is true for example in the benign case of discounted problems with bounded cost per stage. However, for undiscounted problems, asserting the contraction property of T or Tμ may be more complicated, and even impossible; the abstract DP book [Ber22a] deals extensively with such questions, and related issues regarding the solution sets of the Bellman equations.

Geometrical Interpretations

We will now interpret the Bellman operators geometrically, starting with Tμ. Figure 3.1.2 illustrates its form. Note here that the functions J and TμJ are multidimensional. They have as many scalar components J(x) and (TμJ)(x), respectively, as there are states x, but they can only be shown projected onto one dimension. The function TμJ for each policy μ is linear. The cost function Jμ satisfies Jμ = TμJμ, so it is obtained from the intersection of the graph of TμJ and the 45-degree line, when Jμ is real-valued. Later we will interpret the case where Jμ is not real-valued as the system being unstable under μ [we have Jμ(x) = ∞ for some initial states x]. The form of the Bellman operator T is illustrated in Fig. 3.1.3.
Again the functions J, J*, TJ, TμJ, etc., are multidimensional, but they are shown projected onto one dimension (alternatively, they are illustrated for a system with a single state, plus possibly a termination state). The Bellman equation J = TJ may have one or many real-valued solutions. It may also have no real-valued solution in exceptional situations, as we will discuss later (see Section 3.9). The figure assumes a unique real-valued solution of the Bellman equations J = TJ and J = TμJ, which is true if T and Tμ are contraction mappings, as is the case for discounted problems with bounded cost per stage. Otherwise, these equations may have no solution or multiple solutions within the class of real-valued functions (see Section 3.9). The equation J = TJ typically has J* as a solution, but may have more than one solution in cases where either α = 1, or α < 1 and the cost per stage is unbounded.

Figure 3.1.1 Geometric illustrations of the Bellman operators Tμ and T for states 1 and 2 in Example 3.1.1; cf. Eqs. (3.4)-(3.7). The problem's transition probabilities are p11(u) = 0.3, p12(u) = 0.7, p21(u) = 0.4, p22(u) = 0.6, p11(v) = 0.6, p12(v) = 0.4, p21(v) = 0.9, p22(v) = 0.1. The stage costs are g(1, u, 1) = 3, g(1, u, 2) = 10, g(2, u, 1) = 0, g(2, u, 2) = 6, g(1, v, 1) = 7, g(1, v, 2) = 5, g(2, v, 1) = 3, g(2, v, 2) = 12. The discount factor is α = 0.9, and the optimal costs are J*(1) = 50.59 and J*(2) = 47.41. The optimal policy is μ*(1) = v and μ*(2) = u. The figure also shows two one-dimensional slices of T that are parallel to the J(1) and J(2) axes and pass through J*.

Figure 3.1.2 Geometric interpretation of the linear Bellman operator Tμ and the corresponding Bellman equation. The graph of Tμ is a plane in the space R(X) × R(X), and when projected on a one-dimensional plane that corresponds to a single state and passes through Jμ, it becomes a line. Then there are three cases:
(a) The line has slope less than 45 degrees, so it intersects the 45-degree line at a unique point, which is equal to Jμ, the solution of the Bellman equation J = TμJ. This is true if Tμ is a contraction mapping, as is the case for discounted problems with bounded cost per stage.
(b) The line has slope greater than 45 degrees. Then it intersects the 45-degree line at a unique point, which is a solution of the Bellman equation J = TμJ, but is not equal to Jμ. Then Jμ is not real-valued; we will call such μ unstable in Section 3.2.
(c) The line has slope exactly equal to 45 degrees. This is an exceptional case where the Bellman equation J = TμJ has an infinite number of real-valued solutions or no real-valued solution at all; we will provide examples where this occurs in Section 3.9.
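The numbers quoted in the caption of Figure 3.1.1 can be reproduced with a few lines of code. The sketch below (the names and indexing conventions are illustrative: states 1 and 2 and controls u and v are indexed by 0 and 1) evaluates each of the four stationary policies of Example 3.1.1 exactly, by solving the linear equation J = TμJ, and recovers J*(1) = 50.59, J*(2) = 47.41 together with the optimal policy μ*(1) = v, μ*(2) = u.

```python
import numpy as np
from itertools import product

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],   # stage costs under control u
              [[7.0, 5.0], [3.0, 12.0]]])  # stage costs under control v

def J_mu(mu):
    # Solve the linear Bellman equation J = T_mu J, i.e. (I - alpha*P_mu) J = g_bar_mu.
    P_mu = np.array([P[mu[x], x] for x in (0, 1)])
    g_bar = np.array([P[mu[x], x] @ g[mu[x], x] for x in (0, 1)])
    return np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)

costs = {mu: J_mu(mu) for mu in product((0, 1), repeat=2)}
J_star = np.stack(list(costs.values())).min(axis=0)
print(np.round(J_star, 2))           # [50.59 47.41], as reported in Figure 3.1.1
print(np.round(costs[(1, 0)], 2))    # the policy mu(1)=v, mu(2)=u attains J* at both states
```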
Figure 3.1.3 Geometric interpretation of the Bellman operator T, and the corresponding Bellman equation. For a fixed x, the function (TJ)(x) can be written as min_μ (TμJ)(x), so it is concave as a function of J. The optimal cost function J* satisfies J* = TJ*, so it is obtained from the intersection of the graph of TJ and the 45-degree line shown, assuming J* is real-valued. Note that the graph of T lies below the graph of every operator Tμ, and is in fact obtained as the lower envelope of the graphs of Tμ as μ ranges over the set of policies M. In particular, for any given function J̃, for every x, the value (TJ̃)(x) is obtained by finding a support hyperplane/subgradient of the graph of the concave function (TJ)(x) at J = J̃, as shown in the figure. This support hyperplane is defined by the control μ̃(x) of a policy μ̃ that attains the minimum of (TμJ̃)(x) over μ:

\[
\tilde\mu(x) \in \arg\min_{\mu \in M} (T_\mu \tilde J)(x)
\]

(there may be multiple policies attaining this minimum, defining multiple support hyperplanes).

Example 3.1.2 (A Two-State and Infinite Controls Problem)

Let us consider the mapping T for a problem that involves two states, 1 and 2, but an infinite number of controls. In particular, the control space at both states is the unit interval, U(1) = U(2) = [0, 1]. Here (TJ)(1) and (TJ)(2) are given by

\[
(TJ)(1) = \min_{u \in [0,1]} \bigl\{ g_1 + r_{11} u^2 + r_{12} (1-u)^2 + \alpha u J(1) + \alpha (1-u) J(2) \bigr\},
\]

\[
(TJ)(2) = \min_{u \in [0,1]} \bigl\{ g_2 + r_{21} u^2 + r_{22} (1-u)^2 + \alpha u J(1) + \alpha (1-u) J(2) \bigr\}.
\]

The control u at each state x = 1, 2 has the meaning of a probability that we must select at that state. In particular, we control the probabilities u and (1 − u) of moving to states y = 1 and y = 2, at a control cost that is quadratic in u and (1 − u), respectively. For this problem (TJ)(1) and (TJ)(2) can be calculated in closed form, so they are easy to plot and understand. They are piecewise quadratic, unlike the corresponding plots of Fig. 3.1.1, which are piecewise linear; see Fig. 3.1.4.

Figure 3.1.4 Illustration of the Bellman operator T for states 1 and 2 in Example 3.1.2. The parameter values are g1 = 5, g2 = 3, r11 = 3, r12 = 15, r21 = 9, r22 = 1, and the discount factor is α = 0.9. The optimal costs are J*(1) = 49.7 and J*(2) = 40.0, and the optimal policy is μ*(1) = 0.59 and μ*(2) = 0. The figure also shows the two one-dimensional slices of the operators at J(1) = 15 and J(2) = 30 that are parallel to the J(1) and J(2) axes.

Visualization of Value Iteration

The operator notation simplifies algorithmic descriptions, derivations, and proofs related to DP. For example, we can write the VI algorithm in the compact form

\[
J_{k+1} = T J_k, \qquad k = 0, 1, \ldots,
\]

as illustrated in Fig. 3.1.5. Moreover, the VI algorithm for a given policy μ can be written as

\[
J_{k+1} = T_\mu J_k, \qquad k = 0, 1, \ldots,
\]

and it can be similarly interpreted, except that the graph of the function TμJ is linear. Also we will see shortly that there is a similarly compact description for the policy iteration algorithm.
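As an illustration, the following sketch runs the VI algorithm Jk+1 = TJk for Example 3.1.2, where the minimization over u ∈ [0, 1] defining T is a convex quadratic and can be carried out in closed form (the unconstrained minimizer is clipped to [0, 1]). The variable names are illustrative; the iterates converge to values consistent with Figure 3.1.4.

```python
import numpy as np

alpha = 0.9
gc = np.array([5.0, 3.0])                  # g1, g2
r = np.array([[3.0, 15.0], [9.0, 1.0]])    # [[r11, r12], [r21, r22]]

def T(J):
    TJ, u_star = np.zeros(2), np.zeros(2)
    for x in range(2):
        r1, r2 = r[x]
        # unconstrained minimizer of the convex quadratic in u, clipped to [0, 1]
        u = float(np.clip((2 * r2 - alpha * (J[0] - J[1])) / (2 * (r1 + r2)), 0.0, 1.0))
        TJ[x] = gc[x] + r1 * u**2 + r2 * (1 - u)**2 + alpha * (u * J[0] + (1 - u) * J[1])
        u_star[x] = u
    return TJ, u_star

J = np.zeros(2)
for k in range(300):                     # value iteration J_{k+1} = T J_k
    J, u = T(J)
print(np.round(J, 1), np.round(u, 2))    # roughly [49.7 40.] and [0.59 0.], cf. Figure 3.1.4
```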
Figure 3.1.5 Geometric interpretation of the VI algorithm Jk+1 = TJk, starting from some initial function J0. Successive iterates are obtained through the staircase construction shown in the figure. The VI algorithm Jk+1 = TμJk for a given policy μ can be similarly interpreted, except that the graph of the function TμJ is linear.

To keep the presentation simple, we will focus our attention on the abstract DP framework as it applies to the optimal control problems of Section 2.1. In particular, we will assume without further mention that T and Tμ have the monotonicity property (3.3), that TμJ is linear for all μ, and that (as a consequence) the component (TJ)(x) is concave as a function of J for every state x. We note, however, that the abstract notation facilitates the extension of the infinite horizon DP theory to models beyond the ones that we discuss in this work. Such models include semi-Markov problems, minimax control problems, risk-sensitive problems, Markov games, and others (see the DP textbook [Ber12], and the abstract DP monograph [Ber22a]).

3.2 APPROXIMATION IN VALUE SPACE AND NEWTON'S METHOD

Let us now consider approximation in value space and an abstract geometric interpretation, first provided in the author's book [Ber20a]. By using the operators T and Tμ, for a given J̃, a one-step lookahead policy μ̃ is characterized by the equation Tμ̃J̃ = TJ̃, or equivalently

\[
\tilde\mu(x) \in \arg\min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha \tilde J\bigl(f(x,u,w)\bigr) \bigr\},
\tag{3.8}
\]

as in Fig. 3.2.1. Furthermore, this equation implies that the graph of Tμ̃J just touches the graph of TJ at J̃, as shown in the figure.

In mathematical terms, for each state x ∈ X, the hyperplane Hμ̃(x) ⊂ R(X) × ℜ, given by

\[
H_{\tilde\mu}(x) = \bigl\{ (J, \xi) \mid (T_{\tilde\mu} J)(x) = \xi \bigr\},
\tag{3.9}
\]

supports from above the hypograph of the concave function (TJ)(x), i.e., the convex set {(J, ξ) | (TJ)(x) ≥ ξ}. The point of support is (J̃, (Tμ̃J̃)(x)), and relates the function J̃ with the corresponding one-step lookahead minimization policy μ̃, the one that satisfies Tμ̃J̃ = TJ̃. The hyperplane Hμ̃(x) of Eq. (3.9) defines a subgradient of (TJ)(x) at J̃. Note that the one-step lookahead policy μ̃ need not be unique, since T need not be differentiable, so there may be multiple hyperplanes of support at J̃. Still this construction shows that the linear operator Tμ̃ is a linearization of the operator T at the point J̃ (pointwise for each x). Equivalently, for every x ∈ X, the linear scalar equation J(x) = (Tμ̃J)(x) is a linearization of the nonlinear equation J(x) = (TJ)(x) at the point J̃. Consequently, the linear operator equation J = Tμ̃J is a linearization of the equation J = TJ at J̃, and its solution, Jμ̃, can be viewed as the result of a Newton iteration at the point J̃ (here we adopt an expanded view of the Newton iteration that applies to possibly nondifferentiable fixed point equations; see the Appendix). In summary, the Newton iterate at J̃ is Jμ̃, the solution of the linearized equation J = Tμ̃J.†

† The classical Newton's method for solving a fixed point problem of the form y = G(y), where y is an n-dimensional vector, operates as follows: At the current iterate yk, we linearize G and find the solution yk+1 of the corresponding linear fixed point problem. Assuming G is differentiable, the linearization is obtained by using a first order Taylor expansion:

\[
y_{k+1} = G(y_k) + \frac{\partial G(y_k)}{\partial y}\,(y_{k+1} - y_k),
\]

where ∂G(yk)/∂y is the n × n Jacobian matrix of G evaluated at the vector yk. The most commonly given convergence rate property of Newton's method is quadratic convergence. It states that near the solution y*, we have

\[
\| y_{k+1} - y^* \| = O\bigl( \| y_k - y^* \|^2 \bigr),
\]

where ‖·‖ is the Euclidean norm, and holds assuming the Jacobian matrix exists, is invertible, and is Lipschitz continuous (see the books by Ortega and Rheinboldt [OrR70], and by the author [Ber16], Section 1.4). There are well-studied extensions of Newton's method that are based on solving a linearized system at the current iterate, but relax the differentiability requirement through alternative requirements of piecewise differentiability, B-differentiability, and semi-smoothness, while maintaining the superlinear convergence property of the method. In particular, the quadratic rate of convergence result for differentiable G of Prop. 1.4.1 of the book [Ber16] admits a straightforward and intuitive extension to piecewise differentiable G, given in the paper [KoS86]; see the Appendix, which contains references to the literature.
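To see the Newton step in the simplest possible setting, the sketch below revisits the data of Example 3.1.1 (Figure 3.1.1) with the cost approximation J̃ ≡ 0: it forms the one-step lookahead policy μ̃ of Eq. (3.8), and then solves the linearized equation J = Tμ̃J exactly to obtain the Newton iterate Jμ̃. For comparison it also prints the plain VI iterate TJ̃. The code and its names are illustrative; in this particular example a single Newton step from J̃ ≡ 0 already lands on J*.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],
              [[7.0, 5.0], [3.0, 12.0]]])

J_tilde = np.zeros(2)                                # cost approximation
Q = (P * (g + alpha * J_tilde)).sum(axis=2)          # Q[u, x] = (T_u J_tilde)(x)
mu = Q.argmin(axis=0)                                # one-step lookahead policy, cf. Eq. (3.8)
P_mu = P[mu, np.arange(2)]                           # transition matrix of mu
g_bar = (P_mu * g[mu, np.arange(2)]).sum(axis=1)     # expected stage cost under mu
J_newton = np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)   # solution of J = T_mu J

print(mu)                      # [1 0]: the lookahead policy plays v at state 1 and u at state 2
print(Q.min(axis=0))           # the VI iterate T J_tilde = [6.2 3.6], still far from J*
print(np.round(J_newton, 2))   # the Newton iterate J_mu = [50.59 47.41], which here equals J*
```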
Figure 3.2.1 Geometric interpretation of approximation in value space and the one-step lookahead policy μ̃ as a step of Newton's method [cf. Eq. (3.8)]. Given J̃, we find a policy μ̃ that attains the minimum in the relation TJ̃ = min_μ TμJ̃. This policy satisfies TJ̃ = Tμ̃J̃, so the graphs of TJ and Tμ̃J touch at J̃, as shown. It may not be unique. Because TJ has concave components, the equation J = Tμ̃J is the linearization of the equation J = TJ at J̃ [for each x, the hyperplane Hμ̃(x) of Eq. (3.9) defines a subgradient of (TJ)(x) at J̃]. The linearized equation is solved at the typical step of Newton's method to provide the next iterate, which is just Jμ̃.

The structure of the Bellman operators (3.1) and (3.2), with their monotonicity and concavity properties, tends to enhance the convergence and the rate of convergence properties of Newton's method, even in the absence of differentiability, as evidenced by the favorable Newton-related convergence analysis of PI, and the extensive favorable experience with rollout, PI, and MPC. In fact, the role of monotonicity and concavity in affecting the convergence properties of Newton's method has been addressed in the mathematical literature.†

† See the papers by Ortega and Rheinboldt [OrR67], and Vandergraft [Van67], the books by Ortega and Rheinboldt [OrR70], and Argyros [Arg08], and the references cited there. In this connection, it is worth noting that in the case of Markov games, where the concavity property does not hold, the PI method may oscillate, as shown by Pollatschek and Avi-Itzhak [PoA69], and needs to be modified to restore its global convergence; see the author's paper [Ber21c], and the references cited there.
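For a side-by-side view with the classical form of the method, here is a minimal scalar sketch of the Newton iteration for a fixed point problem y = G(y), following the linearization described in the earlier footnote on the classical Newton's method; the choice G(y) = cos y is purely illustrative.

```python
import math

# Newton iteration for y = G(y): linearize G at y_k and solve the linear fixed
# point equation y = G(y_k) + G'(y_k) * (y - y_k), which gives the update below.
def newton_fixed_point(G, dG, y, iters=6):
    for _ in range(iters):
        y = (G(y) - dG(y) * y) / (1.0 - dG(y))
    return y

y = newton_fixed_point(math.cos, lambda t: -math.sin(t), 1.0)
print(y, abs(y - math.cos(y)))   # about 0.739085, with residual near machine precision
```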
Figure 3.2.2 Geometric interpretation of approximation in value space with ℓ-step lookahead (in this figure ℓ = 3). It is the same as approximation in value space with one-step lookahead using T^{ℓ−1}J̃ as cost approximation. It can be viewed as a Newton step at the point T^{ℓ−1}J̃, the result of ℓ − 1 value iterations applied to J̃. Note that as ℓ increases the cost function Jμ̃ of the ℓ-step lookahead policy μ̃ approaches more closely the optimal J*, and that lim_{ℓ→∞} Jμ̃ = J*.

As noted earlier, approximation in value space with ℓ-step lookahead using J̃ is the same as approximation in value space with one-step lookahead using the (ℓ−1)-fold operation of T on J̃, T^{ℓ−1}J̃. Thus it can be interpreted as a Newton step starting from T^{ℓ−1}J̃, the result of ℓ − 1 value iterations applied to J̃. This is illustrated in Fig. 3.2.2.‡

‡ We note that several variants of Newton's method that involve combinations of first-order iterative methods, such as the Gauss-Seidel and Jacobi algorithms, and Newton's method, are well-known in numerical analysis. They belong to the general family of Newton-SOR methods (SOR stands for "successive over-relaxation"); see the book by Ortega and Rheinboldt [OrR70] (Section 13.4). Their convergence rate is superlinear, similar to Newton's method, as long as they involve a pure Newton step, along with the first-order steps.

Let us also note that ℓ-step lookahead minimization involves ℓ successive VI iterations, but only the first of these iterations has a Newton step interpretation. As an example, consider two-step lookahead minimization with a terminal cost approximation J̃. The second step minimization is a VI that starts from J̃ to produce TJ̃. The first step minimization is a VI that starts from TJ̃ to produce T²J̃, but it also does something else that is more significant: it produces a two-step lookahead minimization policy μ̃ through Tμ̃(TJ̃) = T(TJ̃), and the step from TJ̃ to Jμ̃ (the cost function of μ̃) is the Newton step. Thus, there is only one policy produced (i.e., μ̃) and only one Newton step (from TJ̃ to Jμ̃). In the case of one-step lookahead minimization, the Newton step starts from J̃ and ends at Jμ̃. Similarly, in the case of ℓ-step lookahead minimization, the first step of the lookahead is the Newton step (from T^{ℓ−1}J̃ to Jμ̃), and whatever follows the first step of the lookahead is preparation for the Newton step.

Finally, it is worth mentioning that the approximation in value space algorithm computes Jμ̃ differently than both the PI method and the classical form of Newton's method. It does not explicitly compute any values of Jμ̃; instead, the control is applied to the system and the cost is accumulated accordingly. Thus the values of Jμ̃ are implicitly computed only for those x that are encountered in the system trajectory that is generated on-line.
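The following sketch makes the ℓ-step construction concrete on the data of Example 3.1.1: it performs ℓ − 1 value iterations on a (deliberately poor) J̃, forms the lookahead policy at T^{ℓ−1}J̃, and evaluates that policy exactly, which is precisely the Newton iterate. All names are illustrative.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],
              [[7.0, 5.0], [3.0, 12.0]]])

def ell_step_lookahead(J_tilde, ell):
    J = J_tilde.copy()
    for _ in range(ell - 1):                               # ell - 1 preparatory VI steps
        J = (P * (g + alpha * J)).sum(axis=2).min(axis=0)  # J <- T J
    Q = (P * (g + alpha * J)).sum(axis=2)                  # lookahead minimization at T^{ell-1} J_tilde
    mu = Q.argmin(axis=0)
    P_mu = P[mu, np.arange(2)]
    g_bar = (P_mu * g[mu, np.arange(2)]).sum(axis=1)
    return mu, np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)   # the Newton iterate J_mu

J_tilde = np.array([0.0, 100.0])     # a deliberately poor cost approximation
for ell in (1, 2, 3):
    mu, J_mu = ell_step_lookahead(J_tilde, ell)
    print(ell, mu, np.round(J_mu, 2))
# The cost of the lookahead policy moves toward J* = [50.59 47.41] as ell grows;
# with this J_tilde, ell = 3 already yields the optimal policy [1 0].
```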
Certainty Equivalent Approximations and the Newton Step

We noted earlier that for stochastic DP problems, ℓ-step lookahead can be computationally expensive, because the lookahead graph expands fast as ℓ increases, due to the stochastic character of the problem. The certainty equivalence approach is an important approximation idea for dealing with this difficulty. In the classical form of this approach, some or all of the stochastic disturbances wk are replaced by some deterministic quantities, such as their expected values. Then a policy is computed off-line for the resulting deterministic problem, and it is used on-line for the actual stochastic problem.

The certainty equivalence approach can also be used to expedite the computations of the ℓ-step lookahead minimization. One way to do this is to simply replace each of the ℓ uncertain quantities wk, wk+1, . . . , wk+ℓ−1 by a deterministic value w̄. Conceptually, this replaces the Bellman operators T and Tμ,

\[
(TJ)(x) = \min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha J\bigl(f(x,u,w)\bigr) \bigr\},
\qquad
(T_\mu J)(x) = E\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl(f(x,\mu(x),w)\bigr) \bigr\},
\]

[cf. Eqs. (3.1) and (3.2)] with deterministic operators T̄ and T̄μ, given by

\[
(\bar T J)(x) = \min_{u \in U(x)} \bigl[ g(x,u,\bar w) + \alpha J\bigl(f(x,u,\bar w)\bigr) \bigr],
\qquad
(\bar T_\mu J)(x) = g\bigl(x,\mu(x),\bar w\bigr) + \alpha J\bigl(f(x,\mu(x),\bar w)\bigr).
\]

The resulting ℓ-step lookahead minimization then becomes simpler; for example, in the case of a finite control space problem, it is a deterministic shortest path computation, involving an acyclic ℓ-stage graph that expands at each stage by a factor n, where n is the size of the control space. However, this approach yields a policy μ such that

\[
\bar T_{\mu}\bigl(\bar T^{\ell-1} \tilde J\bigr) = \bar T\bigl(\bar T^{\ell-1} \tilde J\bigr),
\]

and the cost function Jμ of this policy is generated by a Newton step, which aims to find a fixed point of T̄ (not T), starting from T̄^{ℓ−1}J̃. Thus the Newton step now aims at a fixed point of T̄, which is not equal to J*. As a result the benefit of the Newton step is lost to a great extent. Still, we may largely correct this di