3 An Abstract View of Reinforcement Learning

Contents
3.1. Bellman Operators
3.2. Approximation in Value Space and Newton's Method
3.3. Region of Stability
3.4. Policy Iteration, Rollout, and Newton's Method
3.5. How Sensitive is On-Line Play to the Off-Line Training Process?
3.6. Why Not Just Train a Policy Network and Use it Without On-Line Play?
3.7. Multiagent Problems and Multiagent Rollout
3.8. On-Line Simplified Policy Iteration
3.9. Exceptional Cases
3.10. Notes and Sources

In this chapter we will use geometric constructions to obtain insight into Bellman's equation, the value and policy iteration algorithms, approximation in value space, and some of the properties of the corresponding one-step or multistep lookahead policy μ̃. To understand these constructions, we need an abstract notational framework that is based on the operators that are involved in the Bellman equations.

3.1 BELLMAN OPERATORS

We denote by TJ the function of x that appears in the right-hand side of Bellman's equation. Its value at state x is given by

\[
(TJ)(x) = \min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha J\bigl(f(x,u,w)\bigr) \bigr\}, \qquad \text{for all } x.
\tag{3.1}
\]

Also for each policy μ, we introduce the corresponding function TμJ, which has value at x given by

\[
(T_\mu J)(x) = E\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl(f(x,\mu(x),w)\bigr) \bigr\}, \qquad \text{for all } x.
\tag{3.2}
\]

Thus T and Tμ can be viewed as operators (broadly referred to as the Bellman operators), which map functions J to other functions (TJ or TμJ, respectively).†

† Within the context of this work, the functions J on which T and Tμ operate will be real-valued functions of x, which we denote by J ∈ R(X). We will assume throughout that the expected values in Eqs. (3.1) and (3.2) are well-defined and finite when J is real-valued. This implies that TμJ will also be a real-valued function of x. On the other hand, (TJ)(x) may take the value −∞ because of the minimization in Eq. (3.1). We allow this possibility, although our illustrations will primarily depict the case where TJ is real-valued. Note that the general theory of abstract DP is developed with the use of extended real-valued functions; see [Ber22a].

An important property of the operators T and Tμ is that they are monotone, in the sense that if J and J′ are two functions of x such that J(x) ≥ J′(x) for all x, then we have

\[
(TJ)(x) \ge (TJ')(x), \qquad (T_\mu J)(x) \ge (T_\mu J')(x), \qquad \text{for all } x \text{ and } \mu.
\tag{3.3}
\]

This monotonicity property is evident from Eqs. (3.1) and (3.2), where the values of J are multiplied by nonnegative numbers.

Another important property is that the Bellman operator Tμ is linear, in the sense that it has the form TμJ = G + AμJ, where G ∈ R(X) is some function and Aμ : R(X) → R(X) is an operator such that for any functions J1, J2, and scalars γ1, γ2, we have‡

\[
A_\mu(\gamma_1 J_1 + \gamma_2 J_2) = \gamma_1 A_\mu J_1 + \gamma_2 A_\mu J_2.
\]

Moreover, from the definitions (3.1) and (3.2), we have

\[
(TJ)(x) = \min_{\mu \in M} (T_\mu J)(x), \qquad \text{for all } x,
\]

where M is the set of stationary policies. This is true because for any policy μ, there is no coupling constraint between the controls μ(x) and μ(x′) that correspond to two different states x and x′.

‡ An operator Tμ with this property is often called "affine," but in this work we just call it "linear." Also we use abbreviated notation to express pointwise equalities and inequalities, so that we write J = J′ or J ≤ J′ to express the fact that J(x) = J′(x) or J(x) ≤ J′(x), for all x, respectively.
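As a concrete reference point, here is a minimal tabular sketch of the two operators for a finite-state, finite-control discounted problem, written under the assumption that transition probabilities and expected stage costs are stored in arrays P[u, x, y] and g[u, x, y]; the array layout, the random test data, and the function names are illustrative choices, not part of the text. The last two lines check the monotonicity property (3.3) numerically.

```python
import numpy as np

def T(J, P, g, alpha):
    # (TJ)(x) = min_u sum_y P[u,x,y] * (g[u,x,y] + alpha*J[y]),  cf. Eq. (3.1)
    Q = (P * (g + alpha * J)).sum(axis=2)      # Q[u, x]
    return Q.min(axis=0)

def T_mu(J, mu, P, g, alpha):
    # (T_mu J)(x) applies the control mu[x] at each state x,  cf. Eq. (3.2)
    Q = (P * (g + alpha * J)).sum(axis=2)
    return Q[mu, np.arange(len(J))]

rng = np.random.default_rng(0)
m, n, alpha = 3, 4, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # stochastic rows
g = rng.random((m, n, n))                                      # stage costs
mu = rng.integers(0, m, size=n)                                # an arbitrary policy
J = rng.random(n)
J_prime = J + rng.random(n)                                    # J_prime >= J pointwise
assert np.all(T(J_prime, P, g, alpha) >= T(J, P, g, alpha))            # cf. Eq. (3.3)
assert np.all(T_mu(J_prime, mu, P, g, alpha) >= T_mu(J, mu, P, g, alpha))
```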
It follows that (TJ)(x) is a concave function of J for every x (the pointwise minimum of linear functions is a concave function). This will be important for our interpretation of one-step and multistep lookahead minimization as a Newton iteration for solving the Bellman equation J = TJ.

Example 3.1.1 (A Two-State and Two-Control Example)

Assume that there are two states 1 and 2, and two controls u and v. Consider the policy μ that applies control u at state 1 and control v at state 2. Then the operator Tμ takes the form

\[
(T_\mu J)(1) = \sum_{y=1}^{2} p_{1y}(u) \bigl( g(1,u,y) + \alpha J(y) \bigr),
\tag{3.4}
\]

\[
(T_\mu J)(2) = \sum_{y=1}^{2} p_{2y}(v) \bigl( g(2,v,y) + \alpha J(y) \bigr),
\tag{3.5}
\]

where pxy(u) and pxy(v) are the probabilities that the next state will be y, when the current state is x, and the control is u or v, respectively. Clearly, (TμJ)(1) and (TμJ)(2) are linear functions of J. Also the operator T of the Bellman equation J = TJ takes the form

\[
(TJ)(1) = \min\Bigl\{ \sum_{y=1}^{2} p_{1y}(u) \bigl( g(1,u,y) + \alpha J(y) \bigr),\ \sum_{y=1}^{2} p_{1y}(v) \bigl( g(1,v,y) + \alpha J(y) \bigr) \Bigr\},
\tag{3.6}
\]

\[
(TJ)(2) = \min\Bigl\{ \sum_{y=1}^{2} p_{2y}(u) \bigl( g(2,u,y) + \alpha J(y) \bigr),\ \sum_{y=1}^{2} p_{2y}(v) \bigl( g(2,v,y) + \alpha J(y) \bigr) \Bigr\}.
\tag{3.7}
\]

Thus, (TJ)(1) and (TJ)(2) are concave and piecewise linear as functions of the two-dimensional vector J (with two pieces; more generally, as many linear pieces as the number of controls). This concavity property holds in general since (TJ)(x) is the minimum of a collection of linear functions of J, one for each u ∈ U(x). Figure 3.1.1 illustrates (TμJ)(1) for the cases where μ(1) = u and μ(1) = v, (TμJ)(2) for the cases where μ(2) = u and μ(2) = v, (TJ)(1), and (TJ)(2), as functions of J = (J(1), J(2)).

Critical properties from the DP point of view are whether T and Tμ have fixed points; equivalently, whether the Bellman equations J = TJ and J = TμJ have solutions within the class of real-valued functions, and whether the set of solutions includes J* and Jμ, respectively. It may thus be important to verify that T or Tμ are contraction mappings. This is true for example in the benign case of discounted problems with bounded cost per stage. However, for undiscounted problems, asserting the contraction property of T or Tμ may be more complicated, and even impossible; the abstract DP book [Ber22a] deals extensively with such questions, and related issues regarding the solution sets of the Bellman equations.

Geometrical Interpretations

We will now interpret the Bellman operators geometrically, starting with Tμ. Figure 3.1.2 illustrates its form. Note here that the functions J and TμJ are multidimensional. They have as many scalar components J(x) and (TμJ)(x), respectively, as there are states x, but they can only be shown projected onto one dimension. The function TμJ for each policy μ is linear. The cost function Jμ satisfies Jμ = TμJμ, so it is obtained from the intersection of the graph of TμJ and the 45-degree line, when Jμ is real-valued. Later we will interpret the case where Jμ is not real-valued as the system being unstable under μ [we have Jμ(x) = ∞ for some initial states x]. The form of the Bellman operator T is illustrated in Fig. 3.1.3.
Again the functions J, J*, TJ, TμJ, etc., are multidimensional, but they are shown projected onto one dimension (alternatively, they are illustrated for a system with a single state, plus possibly a termination state). The Bellman equation J = TJ may have one or many real-valued solutions. It may also have no real-valued solution in exceptional situations, as we will discuss later (see Section 3.9). The figure assumes a unique real-valued solution of the Bellman equations J = TJ and J = TμJ, which is true if T and Tμ are contraction mappings, as is the case for discounted problems with bounded cost per stage. Otherwise, these equations may have no solution or multiple solutions within the class of real-valued functions (see Section 3.9). The equation J = TJ typically has J* as a solution, but may have more than one solution in cases where either α = 1, or α < 1 and the cost per stage is unbounded.

Figure 3.1.1 Geometric illustrations of the Bellman operators Tμ and T for states 1 and 2 in Example 3.1.1; cf. Eqs. (3.4)-(3.7). The problem's transition probabilities are p11(u) = 0.3, p12(u) = 0.7, p21(u) = 0.4, p22(u) = 0.6, p11(v) = 0.6, p12(v) = 0.4, p21(v) = 0.9, p22(v) = 0.1. The stage costs are g(1, u, 1) = 3, g(1, u, 2) = 10, g(2, u, 1) = 0, g(2, u, 2) = 6, g(1, v, 1) = 7, g(1, v, 2) = 5, g(2, v, 1) = 3, g(2, v, 2) = 12. The discount factor is α = 0.9, and the optimal costs are J*(1) = 50.59 and J*(2) = 47.41. The optimal policy is μ*(1) = v and μ*(2) = u. The figure also shows two one-dimensional slices of T that are parallel to the J(1) and J(2) axes and pass through J*.

Figure 3.1.2 Geometric interpretation of the linear Bellman operator Tμ and the corresponding Bellman equation. The graph of Tμ is a plane in the space R(X) × R(X), and when projected on a one-dimensional plane that corresponds to a single state and passes through Jμ, it becomes a line. Then there are three cases:
(a) The line has slope less than 45 degrees, so it intersects the 45-degree line at a unique point, which is equal to Jμ, the solution of the Bellman equation J = TμJ. This is true if Tμ is a contraction mapping, as is the case for discounted problems with bounded cost per stage.
(b) The line has slope greater than 45 degrees. Then it intersects the 45-degree line at a unique point, which is a solution of the Bellman equation J = TμJ, but is not equal to Jμ. Then Jμ is not real-valued; we will call such μ unstable in Section 3.2.
(c) The line has slope exactly equal to 45 degrees. This is an exceptional case where the Bellman equation J = TμJ has an infinite number of real-valued solutions or no real-valued solution at all; we will provide examples where this occurs in Section 3.9.
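The numbers quoted in the caption of Figure 3.1.1 can be reproduced with a few lines of code. The sketch below (the names and indexing conventions are illustrative: states 1 and 2 and controls u and v are indexed by 0 and 1) evaluates each of the four stationary policies of Example 3.1.1 exactly, by solving the linear equation J = TμJ, and recovers J*(1) = 50.59, J*(2) = 47.41 together with the optimal policy μ*(1) = v, μ*(2) = u.

```python
import numpy as np
from itertools import product

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],   # stage costs under control u
              [[7.0, 5.0], [3.0, 12.0]]])  # stage costs under control v

def J_mu(mu):
    # Solve the linear Bellman equation J = T_mu J, i.e. (I - alpha*P_mu) J = g_bar_mu.
    P_mu = np.array([P[mu[x], x] for x in (0, 1)])
    g_bar = np.array([P[mu[x], x] @ g[mu[x], x] for x in (0, 1)])
    return np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)

costs = {mu: J_mu(mu) for mu in product((0, 1), repeat=2)}
J_star = np.stack(list(costs.values())).min(axis=0)
print(np.round(J_star, 2))           # [50.59 47.41], as reported in Figure 3.1.1
print(np.round(costs[(1, 0)], 2))    # the policy mu(1)=v, mu(2)=u attains J* at both states
```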
Figure 3.1.3 Geometric interpretation of the Bellman operator T, and the corresponding Bellman equation. For a fixed x, the function (TJ)(x) can be written as min_μ (TμJ)(x), so it is concave as a function of J. The optimal cost function J* satisfies J* = TJ*, so it is obtained from the intersection of the graph of TJ and the 45-degree line shown, assuming J* is real-valued. Note that the graph of T lies below the graph of every operator Tμ, and is in fact obtained as the lower envelope of the graphs of Tμ as μ ranges over the set of policies M. In particular, for any given function J̃, for every x, the value (TJ̃)(x) is obtained by finding a support hyperplane/subgradient of the graph of the concave function (TJ)(x) at J = J̃, as shown in the figure. This support hyperplane is defined by the control μ̃(x) of a policy μ̃ that attains the minimum of (TμJ̃)(x) over μ:

\[
\tilde\mu(x) \in \arg\min_{\mu \in M} (T_\mu \tilde J)(x)
\]

(there may be multiple policies attaining this minimum, defining multiple support hyperplanes).

Example 3.1.2 (A Two-State and Infinite Controls Problem)

Let us consider the mapping T for a problem that involves two states, 1 and 2, but an infinite number of controls. In particular, the control space at both states is the unit interval, U(1) = U(2) = [0, 1]. Here (TJ)(1) and (TJ)(2) are given by

\[
(TJ)(1) = \min_{u \in [0,1]} \bigl\{ g_1 + r_{11} u^2 + r_{12} (1-u)^2 + \alpha u J(1) + \alpha (1-u) J(2) \bigr\},
\]

\[
(TJ)(2) = \min_{u \in [0,1]} \bigl\{ g_2 + r_{21} u^2 + r_{22} (1-u)^2 + \alpha u J(1) + \alpha (1-u) J(2) \bigr\}.
\]

The control u at each state x = 1, 2 has the meaning of a probability that we must select at that state. In particular, we control the probabilities u and (1 − u) of moving to states y = 1 and y = 2, at a control cost that is quadratic in u and (1 − u), respectively. For this problem (TJ)(1) and (TJ)(2) can be calculated in closed form, so they are easy to plot and understand. They are piecewise quadratic, unlike the corresponding plots of Fig. 3.1.1, which are piecewise linear; see Fig. 3.1.4.

Figure 3.1.4 Illustration of the Bellman operator T for states 1 and 2 in Example 3.1.2. The parameter values are g1 = 5, g2 = 3, r11 = 3, r12 = 15, r21 = 9, r22 = 1, and the discount factor is α = 0.9. The optimal costs are J*(1) = 49.7 and J*(2) = 40.0, and the optimal policy is μ*(1) = 0.59 and μ*(2) = 0. The figure also shows the two one-dimensional slices of the operators at J(1) = 15 and J(2) = 30 that are parallel to the J(1) and J(2) axes.

Visualization of Value Iteration

The operator notation simplifies algorithmic descriptions, derivations, and proofs related to DP. For example, we can write the VI algorithm in the compact form

\[
J_{k+1} = T J_k, \qquad k = 0, 1, \ldots,
\]

as illustrated in Fig. 3.1.5. Moreover, the VI algorithm for a given policy μ can be written as

\[
J_{k+1} = T_\mu J_k, \qquad k = 0, 1, \ldots,
\]

and it can be similarly interpreted, except that the graph of the function TμJ is linear. Also we will see shortly that there is a similarly compact description for the policy iteration algorithm.
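As an illustration, the following sketch runs the VI algorithm Jk+1 = TJk for Example 3.1.2, where the minimization over u ∈ [0, 1] defining T is a convex quadratic and can be carried out in closed form (the unconstrained minimizer is clipped to [0, 1]). The variable names are illustrative; the iterates converge to values consistent with Figure 3.1.4.

```python
import numpy as np

alpha = 0.9
gc = np.array([5.0, 3.0])                  # g1, g2
r = np.array([[3.0, 15.0], [9.0, 1.0]])    # [[r11, r12], [r21, r22]]

def T(J):
    TJ, u_star = np.zeros(2), np.zeros(2)
    for x in range(2):
        r1, r2 = r[x]
        # unconstrained minimizer of the convex quadratic in u, clipped to [0, 1]
        u = float(np.clip((2 * r2 - alpha * (J[0] - J[1])) / (2 * (r1 + r2)), 0.0, 1.0))
        TJ[x] = gc[x] + r1 * u**2 + r2 * (1 - u)**2 + alpha * (u * J[0] + (1 - u) * J[1])
        u_star[x] = u
    return TJ, u_star

J = np.zeros(2)
for k in range(300):                     # value iteration J_{k+1} = T J_k
    J, u = T(J)
print(np.round(J, 1), np.round(u, 2))    # roughly [49.7 40.] and [0.59 0.], cf. Figure 3.1.4
```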
Figure 3.1.5 Geometric interpretation of the VI algorithm Jk+1 = TJk, starting from some initial function J0. Successive iterates are obtained through the staircase construction shown in the figure. The VI algorithm Jk+1 = TμJk for a given policy μ can be similarly interpreted, except that the graph of the function TμJ is linear.

To keep the presentation simple, we will focus our attention on the abstract DP framework as it applies to the optimal control problems of Section 2.1. In particular, we will assume without further mention that T and Tμ have the monotonicity property (3.3), that TμJ is linear for all μ, and that (as a consequence) the component (TJ)(x) is concave as a function of J for every state x. We note, however, that the abstract notation facilitates the extension of the infinite horizon DP theory to models beyond the ones that we discuss in this work. Such models include semi-Markov problems, minimax control problems, risk-sensitive problems, Markov games, and others (see the DP textbook [Ber12], and the abstract DP monograph [Ber22a]).

3.2 APPROXIMATION IN VALUE SPACE AND NEWTON'S METHOD

Let us now consider approximation in value space and an abstract geometric interpretation, first provided in the author's book [Ber20a]. By using the operators T and Tμ, for a given J̃, a one-step lookahead policy μ̃ is characterized by the equation Tμ̃J̃ = TJ̃, or equivalently

\[
\tilde\mu(x) \in \arg\min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha \tilde J\bigl(f(x,u,w)\bigr) \bigr\},
\tag{3.8}
\]

as in Fig. 3.2.1. Furthermore, this equation implies that the graph of Tμ̃J just touches the graph of TJ at J̃, as shown in the figure.

In mathematical terms, for each state x ∈ X, the hyperplane Hμ̃(x) ⊂ R(X) × ℜ, given by

\[
H_{\tilde\mu}(x) = \bigl\{ (J, \xi) \mid (T_{\tilde\mu} J)(x) = \xi \bigr\},
\tag{3.9}
\]

supports from above the hypograph of the concave function (TJ)(x), i.e., the convex set {(J, ξ) | (TJ)(x) ≥ ξ}. The point of support is (J̃, (Tμ̃J̃)(x)), and relates the function J̃ with the corresponding one-step lookahead minimization policy μ̃, the one that satisfies Tμ̃J̃ = TJ̃. The hyperplane Hμ̃(x) of Eq. (3.9) defines a subgradient of (TJ)(x) at J̃. Note that the one-step lookahead policy μ̃ need not be unique, since T need not be differentiable, so there may be multiple hyperplanes of support at J̃. Still this construction shows that the linear operator Tμ̃ is a linearization of the operator T at the point J̃ (pointwise for each x). Equivalently, for every x ∈ X, the linear scalar equation J(x) = (Tμ̃J)(x) is a linearization of the nonlinear equation J(x) = (TJ)(x) at the point J̃. Consequently, the linear operator equation J = Tμ̃J is a linearization of the equation J = TJ at J̃, and its solution, Jμ̃, can be viewed as the result of a Newton iteration at the point J̃ (here we adopt an expanded view of the Newton iteration that applies to possibly nondifferentiable fixed point equations; see the Appendix). In summary, the Newton iterate at J̃ is Jμ̃, the solution of the linearized equation J = Tμ̃J.†

† The classical Newton's method for solving a fixed point problem of the form y = G(y), where y is an n-dimensional vector, operates as follows: At the current iterate yk, we linearize G and find the solution yk+1 of the corresponding linear fixed point problem. Assuming G is differentiable, the linearization is obtained by using a first order Taylor expansion:

\[
y_{k+1} = G(y_k) + \frac{\partial G(y_k)}{\partial y}\,(y_{k+1} - y_k),
\]

where ∂G(yk)/∂y is the n × n Jacobian matrix of G evaluated at the vector yk. The most commonly given convergence rate property of Newton's method is quadratic convergence. It states that near the solution y*, we have

\[
\| y_{k+1} - y^* \| = O\bigl( \| y_k - y^* \|^2 \bigr),
\]

where ‖·‖ is the Euclidean norm, and holds assuming the Jacobian matrix exists, is invertible, and is Lipschitz continuous (see the books by Ortega and Rheinboldt [OrR70], and by the author [Ber16], Section 1.4). There are well-studied extensions of Newton's method that are based on solving a linearized system at the current iterate, but relax the differentiability requirement through alternative requirements of piecewise differentiability, B-differentiability, and semi-smoothness, while maintaining the superlinear convergence property of the method. In particular, the quadratic rate of convergence result for differentiable G of Prop. 1.4.1 of the book [Ber16] admits a straightforward and intuitive extension to piecewise differentiable G, given in the paper [KoS86]; see the Appendix, which contains references to the literature.
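To see the Newton step in the simplest possible setting, the sketch below revisits the data of Example 3.1.1 (Figure 3.1.1) with the cost approximation J̃ ≡ 0: it forms the one-step lookahead policy μ̃ of Eq. (3.8), and then solves the linearized equation J = Tμ̃J exactly to obtain the Newton iterate Jμ̃. For comparison it also prints the plain VI iterate TJ̃. The code and its names are illustrative; in this particular example a single Newton step from J̃ ≡ 0 already lands on J*.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],
              [[7.0, 5.0], [3.0, 12.0]]])

J_tilde = np.zeros(2)                                # cost approximation
Q = (P * (g + alpha * J_tilde)).sum(axis=2)          # Q[u, x] = (T_u J_tilde)(x)
mu = Q.argmin(axis=0)                                # one-step lookahead policy, cf. Eq. (3.8)
P_mu = P[mu, np.arange(2)]                           # transition matrix of mu
g_bar = (P_mu * g[mu, np.arange(2)]).sum(axis=1)     # expected stage cost under mu
J_newton = np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)   # solution of J = T_mu J

print(mu)                      # [1 0]: the lookahead policy plays v at state 1 and u at state 2
print(Q.min(axis=0))           # the VI iterate T J_tilde = [6.2 3.6], still far from J*
print(np.round(J_newton, 2))   # the Newton iterate J_mu = [50.59 47.41], which here equals J*
```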
Figure 3.2.1 Geometric interpretation of approximation in value space and the one-step lookahead policy μ̃ as a step of Newton's method [cf. Eq. (3.8)]. Given J̃, we find a policy μ̃ that attains the minimum in the relation TJ̃ = min_μ TμJ̃. This policy satisfies TJ̃ = Tμ̃J̃, so the graphs of TJ and Tμ̃J touch at J̃, as shown. It may not be unique. Because TJ has concave components, the equation J = Tμ̃J is the linearization of the equation J = TJ at J̃ [for each x, the hyperplane Hμ̃(x) of Eq. (3.9) defines a subgradient of (TJ)(x) at J̃]. The linearized equation is solved at the typical step of Newton's method to provide the next iterate, which is just Jμ̃.

The structure of the Bellman operators (3.1) and (3.2), with their monotonicity and concavity properties, tends to enhance the convergence and the rate of convergence properties of Newton's method, even in the absence of differentiability, as evidenced by the favorable Newton-related convergence analysis of PI, and the extensive favorable experience with rollout, PI, and MPC. In fact, the role of monotonicity and concavity in affecting the convergence properties of Newton's method has been addressed in the mathematical literature.†

† See the papers by Ortega and Rheinboldt [OrR67], and Vandergraft [Van67], the books by Ortega and Rheinboldt [OrR70], and Argyros [Arg08], and the references cited there. In this connection, it is worth noting that in the case of Markov games, where the concavity property does not hold, the PI method may oscillate, as shown by Pollatschek and Avi-Itzhak [PoA69], and needs to be modified to restore its global convergence; see the author's paper [Ber21c], and the references cited there.
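For a side-by-side view with the classical form of the method, here is a minimal scalar sketch of the Newton iteration for a fixed point problem y = G(y), following the linearization described in the earlier footnote on the classical Newton's method; the choice G(y) = cos y is purely illustrative.

```python
import math

# Newton iteration for y = G(y): linearize G at y_k and solve the linear fixed
# point equation y = G(y_k) + G'(y_k) * (y - y_k), which gives the update below.
def newton_fixed_point(G, dG, y, iters=6):
    for _ in range(iters):
        y = (G(y) - dG(y) * y) / (1.0 - dG(y))
    return y

y = newton_fixed_point(math.cos, lambda t: -math.sin(t), 1.0)
print(y, abs(y - math.cos(y)))   # about 0.739085, with residual near machine precision
```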
Figure 3.2.2 Geometric interpretation of approximation in value space with ℓ-step lookahead (in this figure ℓ = 3). It is the same as approximation in value space with one-step lookahead using T^{ℓ−1}J̃ as cost approximation. It can be viewed as a Newton step at the point T^{ℓ−1}J̃, the result of ℓ − 1 value iterations applied to J̃. Note that as ℓ increases the cost function Jμ̃ of the ℓ-step lookahead policy μ̃ approaches more closely the optimal J*, and that lim_{ℓ→∞} Jμ̃ = J*.

As noted earlier, approximation in value space with ℓ-step lookahead using J̃ is the same as approximation in value space with one-step lookahead using the (ℓ−1)-fold operation of T on J̃, T^{ℓ−1}J̃. Thus it can be interpreted as a Newton step starting from T^{ℓ−1}J̃, the result of ℓ − 1 value iterations applied to J̃. This is illustrated in Fig. 3.2.2.‡

‡ We note that several variants of Newton's method that involve combinations of first-order iterative methods, such as the Gauss-Seidel and Jacobi algorithms, and Newton's method, are well-known in numerical analysis. They belong to the general family of Newton-SOR methods (SOR stands for "successive over-relaxation"); see the book by Ortega and Rheinboldt [OrR70] (Section 13.4). Their convergence rate is superlinear, similar to Newton's method, as long as they involve a pure Newton step, along with the first-order steps.

Let us also note that ℓ-step lookahead minimization involves ℓ successive VI iterations, but only the first of these iterations has a Newton step interpretation. As an example, consider two-step lookahead minimization with a terminal cost approximation J̃. The second step minimization is a VI that starts from J̃ to produce TJ̃. The first step minimization is a VI that starts from TJ̃ to produce T²J̃, but it also does something else that is more significant: it produces a two-step lookahead minimization policy μ̃ through Tμ̃(TJ̃) = T(TJ̃), and the step from TJ̃ to Jμ̃ (the cost function of μ̃) is the Newton step. Thus, there is only one policy produced (i.e., μ̃) and only one Newton step (from TJ̃ to Jμ̃). In the case of one-step lookahead minimization, the Newton step starts from J̃ and ends at Jμ̃. Similarly, in the case of ℓ-step lookahead minimization, the first step of the lookahead is the Newton step (from T^{ℓ−1}J̃ to Jμ̃), and whatever follows the first step of the lookahead is preparation for the Newton step.

Finally, it is worth mentioning that the approximation in value space algorithm computes Jμ̃ differently than both the PI method and the classical form of Newton's method. It does not explicitly compute any values of Jμ̃; instead, the control is applied to the system and the cost is accumulated accordingly. Thus the values of Jμ̃ are implicitly computed only for those x that are encountered in the system trajectory that is generated on-line.
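The following sketch makes the ℓ-step construction concrete on the data of Example 3.1.1: it performs ℓ − 1 value iterations on a (deliberately poor) J̃, forms the lookahead policy at T^{ℓ−1}J̃, and evaluates that policy exactly, which is precisely the Newton iterate. All names are illustrative.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.3, 0.7], [0.4, 0.6]],    # transition probabilities under control u
              [[0.6, 0.4], [0.9, 0.1]]])   # transition probabilities under control v
g = np.array([[[3.0, 10.0], [0.0, 6.0]],
              [[7.0, 5.0], [3.0, 12.0]]])

def ell_step_lookahead(J_tilde, ell):
    J = J_tilde.copy()
    for _ in range(ell - 1):                               # ell - 1 preparatory VI steps
        J = (P * (g + alpha * J)).sum(axis=2).min(axis=0)  # J <- T J
    Q = (P * (g + alpha * J)).sum(axis=2)                  # lookahead minimization at T^{ell-1} J_tilde
    mu = Q.argmin(axis=0)
    P_mu = P[mu, np.arange(2)]
    g_bar = (P_mu * g[mu, np.arange(2)]).sum(axis=1)
    return mu, np.linalg.solve(np.eye(2) - alpha * P_mu, g_bar)   # the Newton iterate J_mu

J_tilde = np.array([0.0, 100.0])     # a deliberately poor cost approximation
for ell in (1, 2, 3):
    mu, J_mu = ell_step_lookahead(J_tilde, ell)
    print(ell, mu, np.round(J_mu, 2))
# The cost of the lookahead policy moves toward J* = [50.59 47.41] as ell grows;
# with this J_tilde, ell = 3 already yields the optimal policy [1 0].
```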
Certainty Equivalent Approximations and the Newton Step

We noted earlier that for stochastic DP problems, ℓ-step lookahead can be computationally expensive, because the lookahead graph expands fast as ℓ increases, due to the stochastic character of the problem. The certainty equivalence approach is an important approximation idea for dealing with this difficulty. In the classical form of this approach, some or all of the stochastic disturbances wk are replaced by some deterministic quantities, such as their expected values. Then a policy is computed off-line for the resulting deterministic problem, and it is used on-line for the actual stochastic problem.

The certainty equivalence approach can also be used to expedite the computations of the ℓ-step lookahead minimization. One way to do this is to simply replace each of the ℓ uncertain quantities wk, wk+1, . . . , wk+ℓ−1 by a deterministic value w̄. Conceptually, this replaces the Bellman operators T and Tμ,

\[
(TJ)(x) = \min_{u \in U(x)} E\bigl\{ g(x,u,w) + \alpha J\bigl(f(x,u,w)\bigr) \bigr\},
\qquad
(T_\mu J)(x) = E\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl(f(x,\mu(x),w)\bigr) \bigr\},
\]

[cf. Eqs. (3.1) and (3.2)] with deterministic operators T̄ and T̄μ, given by

\[
(\bar T J)(x) = \min_{u \in U(x)} \bigl[ g(x,u,\bar w) + \alpha J\bigl(f(x,u,\bar w)\bigr) \bigr],
\qquad
(\bar T_\mu J)(x) = g\bigl(x,\mu(x),\bar w\bigr) + \alpha J\bigl(f(x,\mu(x),\bar w)\bigr).
\]

The resulting ℓ-step lookahead minimization then becomes simpler; for example, in the case of a finite control space problem, it is a deterministic shortest path computation, involving an acyclic ℓ-stage graph that expands at each stage by a factor n, where n is the size of the control space. However, this approach yields a policy μ such that

\[
\bar T_{\mu}\bigl(\bar T^{\ell-1} \tilde J\bigr) = \bar T\bigl(\bar T^{\ell-1} \tilde J\bigr),
\]

and the cost function Jμ of this policy is generated by a Newton step, which aims to find a fixed point of T̄ (not T), starting from T̄^{ℓ−1}J̃. Thus the Newton step now aims at a fixed point of T̄, which is not equal to J*. As a result the benefit of the Newton step is lost to a great extent. Still, we may largely correct this di