3

Parametric Approximation

Contents

3.1. Approximation Architectures . . . . . . . . . . . . . . . . p. 126
     3.1.1. Linear and Nonlinear Feature-Based Architectures . . p. 126
     3.1.2. Training of Linear and Nonlinear Architectures . . . p. 134
     3.1.3. Incremental Gradient and Newton Methods . . . . . .  p. 135
3.2. Neural Networks . . . . . . . . . . . . . . . . . . . . . . p. 149
     3.2.1. Training of Neural Networks . . . . . . . . . . . .  p. 153
     3.2.2. Multilayer and Deep Neural Networks . . . . . . . .  p. 157
3.3. Sequential Dynamic Programming Approximation . . . . . . .  p. 161
3.4. Q-Factor Parametric Approximation . . . . . . . . . . . . . p. 162
3.5. Parametric Approximation in Policy Space by Classification  p. 165
3.6. Notes and Sources . . . . . . . . . . . . . . . . . . . . . p. 171

Clearly, for the success of approximation in value space, it is important to select a class of lookahead functions J̃_k that is suitable for the problem at hand. In the preceding chapter we discussed several methods for choosing J̃_k based mostly on problem approximation and rollout. In this chapter we discuss an alternative approach, whereby J̃_k is chosen to be a member of a parametric class of functions, including neural networks, with the parameters "optimized" or "trained" by using some algorithm. The training methods used for parametric approximation in value space can also be used for approximation in policy space, as we will discuss in Section 3.5.

3.1 APPROXIMATION ARCHITECTURES

The starting point for the schemes of this chapter is a class of functions J̃_k(x_k, r_k) that, for each k, depend on the current state x_k and a vector r_k = (r_{1,k}, . . . , r_{m_k,k}) of m_k "tunable" scalar parameters, also called weights. By adjusting the weights, one can change the "shape" of J̃_k so that it is a reasonably good approximation to the true optimal cost-to-go function J*_k. The class of functions J̃_k(x_k, r_k) is called an approximation architecture, and the process of choosing the parameter vectors r_k is commonly called training or tuning the architecture.

The simplest training approach is to perform some form of semi-exhaustive or semi-random search in the space of parameter vectors, and to adopt the parameters that result in the best performance of the associated one-step lookahead controller (according to some criterion). More systematic approaches are based on numerical optimization, such as for example a least squares fit that aims to match the cost approximation produced by the architecture to a "training set," i.e., a large number of pairs of state and cost values that are obtained through some form of sampling process. Throughout this chapter we will focus on this latter approach.

3.1.1 Linear and Nonlinear Feature-Based Architectures

There is a large variety of approximation architectures, based for example on polynomials, wavelets, radial basis functions, discretization/interpolation schemes, neural networks, and others. A particularly interesting type of cost approximation involves feature extraction, a process that maps the state x_k into some vector φ_k(x_k), called the feature vector associated with x_k at time k. The vector φ_k(x_k) consists of scalar components

$$\phi_{1,k}(x_k), \ldots, \phi_{m_k,k}(x_k),$$

called features. A feature-based cost approximation has the form

$$\tilde J_k(x_k, r_k) = \tilde J_k\big(\phi_k(x_k), r_k\big),$$
where r_k is a parameter vector and J̃_k is some function. Thus, the cost approximation depends on the state x_k through its feature vector φ_k(x_k).

[Figure 3.1.1 The structure of a linear feature-based architecture. We use a feature extraction mapping to generate an input φ_k(x_k) to a linear mapping defined by a weight vector r_k.]

Note that we are allowing for different features φ_k(x_k) and different parameter vectors r_k for each stage k. This is necessary for nonstationary problems (e.g., if the state space changes over time), and also to capture the effect of proximity to the end of the horizon. On the other hand, for stationary problems with a long or infinite horizon, where the state space does not change with k, it is common to use the same features and parameters for all stages. The subsequent discussion can easily be adapted to infinite horizon methods, as we will discuss later.

Features are often handcrafted, based on whatever human intelligence, insight, or experience is available, and are meant to capture the most important characteristics of the current state. There are also systematic ways to construct features, including the use of data and neural networks, which we will discuss shortly. In this section, we provide a brief and selective discussion of architectures, and we refer to the specialized literature (e.g., Bertsekas and Tsitsiklis [BeT96], Bishop [Bis95], Haykin [Hay08], Sutton and Barto [SuB18]), and the author's [Ber12], Section 6.1.1, for more detailed presentations.

One idea behind using features is that the optimal cost-to-go functions J*_k may be complicated nonlinear mappings, so it is sensible to try to break their complexity into smaller, less complex pieces. In particular, if the features encode much of the nonlinearity of J*_k, we may be able to use a relatively simple architecture J̃_k to approximate J*_k. For example, with a well-chosen feature vector φ_k(x_k), a good approximation to the cost-to-go is often provided by linearly weighting the features, i.e.,

$$\tilde J_k(x_k, r_k) = \tilde J_k\big(\phi_k(x_k), r_k\big) = \sum_{\ell=1}^{m_k} r_{\ell,k}\,\phi_{\ell,k}(x_k) = r_k'\phi_k(x_k), \qquad (3.1)$$

where r_{ℓ,k} and φ_{ℓ,k}(x_k) are the ℓth components of r_k and φ_k(x_k), respectively, and r_k′φ_k(x_k) denotes the inner product of r_k and φ_k(x_k), viewed as column vectors of ℜ^{m_k} (a prime denotes transposition, so r_k′ is a row vector); see Fig. 3.1.1. This is called a linear feature-based architecture, and the scalar parameters r_{ℓ,k} are also called weights. Among other advantages, these architectures admit simpler training algorithms than their nonlinear counterparts.

[Figure 3.1.2 Illustration of a piecewise constant architecture. The state space is partitioned into subsets S_1, . . . , S_m, with each subset S_ℓ defining the feature φ_ℓ(x), equal to 1 if x ∈ S_ℓ and 0 if x ∉ S_ℓ, with its own weight r_ℓ.]

Mathematically, the approximating function J̃_k(x_k, r_k) can be viewed as a member of the subspace spanned by the features φ_{ℓ,k}(x_k), ℓ = 1, . . . , m_k, which for this reason are also referred to as basis functions. We provide a few examples, where for simplicity we drop the index k.
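Throughout the examples, it may help to keep in mind how a linear feature-based architecture of the form (3.1) looks in code. The following is a minimal sketch (in Python with NumPy); the particular feature map and weights are hypothetical placeholders, not part of the text's examples.

```python
import numpy as np

def feature_vector(x):
    # Hypothetical feature extraction map phi(x); in practice it is handcrafted
    # or constructed from data, and is problem-dependent.
    return np.array([1.0, x, x ** 2])   # constant, linear, and quadratic terms

def cost_approximation(x, r):
    # Linear feature-based architecture J_tilde(x, r) = r' phi(x), cf. Eq. (3.1).
    return float(np.dot(r, feature_vector(x)))

# Weights r would normally come from a training procedure; here they are arbitrary.
r = np.array([0.5, -1.0, 2.0])
print(cost_approximation(1.5, r))      # 0.5 - 1.0*1.5 + 2.0*1.5**2 = 3.5
```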
Example 3.1.1 (Piecewise Constant Approximation)

Suppose that the state space is partitioned into subsets S_1, . . . , S_m, so that every state belongs to one and only one subset. Let the ℓth feature be defined by membership in the set S_ℓ, i.e.,

$$\phi_\ell(x) = \begin{cases} 1 & \text{if } x \in S_\ell, \\ 0 & \text{if } x \notin S_\ell. \end{cases}$$

Consider the architecture

$$\tilde J(x, r) = \sum_{\ell=1}^{m} r_\ell\,\phi_\ell(x),$$

where r is the vector consisting of the m scalar parameters r_1, . . . , r_m. It can be seen that J̃(x, r) is the piecewise constant function that has value r_ℓ for all states within the set S_ℓ; see Fig. 3.1.2.

The piecewise constant approximation is an example of a linear feature-based architecture that involves exclusively local features. These are features that take a nonzero value only for a relatively small subset of states. Thus a change of a single weight causes a change of the value of J̃(x, r) for relatively few states x. At the opposite end we have linear feature-based architectures that involve global features. These are features that take nonzero values for a large number of states. The following is an example.

Example 3.1.2 (Polynomial Approximation)

An important case of linear architecture is one that uses polynomial basis functions. Suppose that the state consists of n components x_1, . . . , x_n, each taking values within some range of integers. For example, in a queueing system, x_i may represent the number of customers in the ith queue, where i = 1, . . . , n. Suppose that we want to use an approximating function that is quadratic in the components x_i. Then we can define a total of 1 + n + n² basis functions that depend on the state x = (x_1, . . . , x_n) via

$$\phi_0(x) = 1, \qquad \phi_i(x) = x_i, \qquad \phi_{ij}(x) = x_i x_j, \qquad i, j = 1, \ldots, n.$$

A linear approximation architecture that uses these functions is given by

$$\tilde J(x, r) = r_0 + \sum_{i=1}^{n} r_i x_i + \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}\, x_i x_j,$$

where the parameter vector r has components r_0, r_i, and r_{ij}, with i, j = 1, . . . , n (a brief code sketch of this construction appears below). Indeed, any kind of approximating function that is polynomial in the components x_1, . . . , x_n can be constructed similarly.

A more general polynomial approximation may be based on some other known features of the state. For example, we may start with a feature vector

$$\phi(x) = \big(\phi_1(x), \ldots, \phi_m(x)\big)',$$

and transform it with a quadratic polynomial mapping. In this way we obtain approximating functions of the form

$$\tilde J(x, r) = r_0 + \sum_{i=1}^{m} r_i\,\phi_i(x) + \sum_{i=1}^{m}\sum_{j=1}^{m} r_{ij}\,\phi_i(x)\phi_j(x),$$

where the parameter r has components r_0, r_i, and r_{ij}, with i, j = 1, . . . , m. This can also be viewed as a linear architecture that uses the basis functions

$$w_0(x) = 1, \qquad w_i(x) = \phi_i(x), \qquad w_{ij}(x) = \phi_i(x)\phi_j(x), \qquad i, j = 1, \ldots, m.$$

The preceding example architectures are generic in the sense that they can be applied to many different types of problems. Other architectures rely on problem-specific insight to construct features, which are then combined into a relatively simple architecture. The following are two examples involving games.
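Before turning to the game examples, here is a hedged sketch of the quadratic polynomial feature construction of Example 3.1.2; the state is assumed to be a NumPy vector, and the ordering of the basis functions within the feature vector is one arbitrary choice.

```python
import numpy as np

def quadratic_features(x):
    # Basis functions phi_0(x) = 1, phi_i(x) = x_i, phi_ij(x) = x_i * x_j
    # of Example 3.1.2, stacked into a feature vector of length 1 + n + n^2.
    x = np.asarray(x, dtype=float)
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def quadratic_approximation(x, r):
    # Linear architecture over the quadratic features: J_tilde(x, r) = r' phi(x).
    return float(np.dot(r, quadratic_features(x)))

x = np.array([2.0, 3.0])       # n = 2 state components
r = np.ones(1 + 2 + 2 ** 2)    # 1 + n + n^2 weights, arbitrary values here
print(quadratic_approximation(x, r))   # 1 + (2 + 3) + (4 + 6 + 6 + 9) = 31
```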
The optimal cost-to-go function is a vector of huge dimension (there are 2^200 board positions in a "standard" tetris board of width 10 and height 20). However, it has been successfully approximated in practice by low-dimensional linear architectures. In particular, the following features have been proposed in [BeI96]: the heights of the columns, the height differentials of adjacent columns, the wall height (the maximum column height), the number of holes of the board, and the constant 1 (the unit is often included as a feature in cost approximation architectures, as it allows for a constant shift in the approximating function). These features are readily recognized by tetris players as capturing important aspects of the board position.†

There are a total of 22 features for a "standard" board with 10 columns. Of course the 2^200 × 22 matrix of feature values cannot be stored in a computer, but for any board position, the corresponding row of features can be easily generated, and this is sufficient for implementation of the associated approximate DP algorithms. For recent works involving approximate DP methods and the preceding 22 features, see [Sch13], [GGS13], and [SGG15], which reference several other related papers.

† The use of feature-based approximate DP methods for the game of tetris was first suggested in the paper [TsV96], which introduced just two features (in addition to the constant 1): the wall height and the number of holes of the board. Most studies have used the set of features of [BeI96] described here, but other sets of features have also been used; see [ThS09] and the discussion in [GGS13].

Example 3.1.4 (Computer Chess)

Computer chess programs that involve feature-based architectures have been available for many years, and are still used widely (they have been upstaged in the mid-2010s by alternative types of chess programs, which use neural network techniques that will be discussed later). These programs are based on approximate DP for minimax problems,‡ a feature-based parametric architecture, and multistep lookahead. For the most part, however, the computer chess training methodology has been qualitatively different from the parametric approximation methods that we consider in this book. In particular, with few exceptions, the training of chess architectures has been done with ad hoc hand-tuning techniques (as opposed to some form of optimization). Moreover, the features have traditionally been hand-crafted based on chess-specific knowledge (as opposed to automatically generated through a neural network or some other method).

‡ We have not discussed DP for minimax problems and two-player games, but the ideas of approximation in value space apply to these contexts as well; see [Ber17] for a discussion that is focused on computer chess.

[Figure 3.1.3 A feature-based architecture for computer chess: feature extraction (material balance, mobility, safety, etc.), followed by a weighting of the features to produce a score for the position evaluator.]
Indeed, it has been argued for a long time that the success of chess programs in outperforming the best humans can be more properly attributed to their use of long lookahead and the brute-force calculating power of modern computers, than to algorithmic approaches that can learn powerful playing strategies, which humans have difficulty conceiving or executing. Thus computer chess was viewed as an approximate DP context where a lot of computational power is available, but innovative and sophisticated algorithms are hard to apply. However, this assessment has changed radically following the impressive success of the AlphaZero chess program (Silver et al. [SHS17]).

The fundamental principles on which all computer chess programs are based were laid out by Shannon [Sha50]. He proposed limited lookahead and evaluation of the end positions by means of a "scoring function" (in our terminology this plays the role of a cost function approximation). This function may involve, for example, the calculation of a numerical value for each of a set of major features of a position that chess players easily recognize (such as material balance, mobility, pawn structure, and other positional factors), together with a method to combine these numerical values into a single score. Shannon then went on to describe various strategies of exhaustive and selective search over a multistep lookahead tree of moves.

We may view the scoring function as a feature-based architecture for evaluating a chess position/state (cf. Fig. 3.1.3). In most computer chess programs, the features are weighted linearly, i.e., the architecture J̃(x, r) that is used for limited lookahead is linear [cf. Eq. (3.1)]. In many cases, the weights are determined manually, by trial and error based on experience. However, in some programs, the weights are determined with supervised learning techniques that use examples of grandmaster play, i.e., by adjustment to bring the play of the program as close as possible to the play of chess grandmasters. This is a technique that applies more broadly in artificial intelligence; see Tesauro [Tes89b], [Tes01].

In a recent computer chess breakthrough, the entire idea of extracting features of a position through human expertise was abandoned in favor of feature discovery through self-play and the use of neural networks. The first program of this type to attain supremacy over humans, as well as over the best computer programs that use human expertise-based features, was AlphaZero (Silver et al. [SHS17]). This program is based on DP principles of policy iteration and Monte Carlo tree search. It will be discussed further later.

Our next example relates to a methodology for feature construction, where the number of features may increase as more data is collected. For a simple example, consider the piecewise constant approximation of Example 3.1.1, where more pieces are progressively added based on new data.

Example 3.1.5 (Feature Extraction from Data)

We have viewed so far feature vectors φ(x) as functions of x, obtained through some unspecified process that is based on prior knowledge about the cost function being approximated. On the other hand, features may also be extracted from data. For example, suppose that with some preliminary calculation using data, we have identified some suitable states x(ℓ), ℓ = 1, . . . , m, that can serve as "anchors" for the construction of Gaussian basis functions of the form

$$\phi_\ell(x) = e^{-\frac{\|x - x(\ell)\|^2}{2\sigma^2}}, \qquad \ell = 1, \ldots, m, \qquad (3.2)$$
where σ is a scalar "variance" parameter. This type of function is known as a radial basis function. It is concentrated around the state x(ℓ), and it is weighted with a scalar weight r_ℓ to form a parametric linear feature-based architecture, which can be trained using additional data. Several other types of data-dependent basis functions, such as support vector machines, are used in machine learning, where they are often referred to as kernels.

While it is possible to use a preliminary calculation to obtain the anchors x(ℓ) in Eq. (3.2), and then use additional data for training, one may also consider enrichment of the set of basis functions simultaneously with training. In this case the number of basis functions increases as the training data is collected. A motivation here is that the quality of the approximation may improve with additional basis functions. This idea underlies a field of machine learning known as kernel methods or sometimes nonparametric methods. A further discussion is outside our scope. We refer to the literature; see, e.g., books such as Cristianini and Shawe-Taylor [ChS00], [ShC04], Scholkopf and Smola [ScS02], Bishop [Bis06], Kung [Kun14], surveys such as Hofmann, Scholkopf, and Smola [HSS08], Pillonetto et al. [PDC14], and RL-related discussions such as Dietterich and Wang [DiW02], Ormoneit and Sen [OrS02], Engel, Mannor, and Meir [EMM05], Jung and Polani [JuP07], Reisinger, Stone, and Miikkulainen [RSM08], Busoniu et al. [BBS10], and Bethke [Bet10]. In what follows, we will focus on parametric architectures with a fixed and given feature vector.

The next example considers a feature extraction strategy that is particularly relevant to problems of partial state information.

Example 3.1.6 (Feature Extraction from Sufficient Statistics)

The concept of a sufficient statistic, which originated in inference methodologies, plays an important role in DP. As discussed in Section 1.3, it refers to quantities that summarize all the essential content of the state x_k for optimal control selection at time k.

In particular, consider a partial information context where at time k we have accumulated an information record (also called the past history)

$$I_k = (z_0, \ldots, z_k, u_0, \ldots, u_{k-1}),$$

which consists of the past controls u_0, . . . , u_{k−1} and the state-related measurements z_0, . . . , z_k obtained at the times 0, . . . , k. The control u_k is allowed to depend only on I_k, and the optimal policy is a sequence of the form {μ*_0(I_0), . . . , μ*_{N−1}(I_{N−1})}. We say that a function S_k(I_k) is a sufficient statistic at time k if the control function μ*_k depends on I_k only through S_k(I_k), i.e., for some function μ̄_k, we have

$$\mu^*_k(I_k) = \bar\mu_k\big(S_k(I_k)\big),$$

where μ*_k is optimal.

There are several examples of sufficient statistics, and they are typically problem-dependent. A trivial possibility is to view I_k itself as a sufficient statistic, and a more sophisticated possibility is to view the belief state b_k as a sufficient statistic (this is the conditional probability distribution of x_k given I_k; cf. Section 1.3.6). For a proof that b_k is indeed a sufficient statistic and for a more detailed discussion of other possible sufficient statistics, see [Ber17], Chapter 4. For a mathematically more advanced discussion, see [BeS78], Chapter 10.
Since a sufficient statistic contains all the relevant information for optimal control purposes, an idea that suggests itself is to introduce features of a given sufficient statistic and to train a corresponding approximation architecture accordingly. As examples of potentially good features, one may consider some special characteristic of I_k (such as whether some alarm-like "special" event has been observed), a partial history (such as the last m measurements and controls in I_k), or more sophisticated versions based on the concept of a finite-state controller proposed by White [Whi91], and White and Scherer [WhS94], and further discussed by Hansen [Han98], Kaelbling, Littman, and Cassandra [KLC98], Meuleau et al. [MPK99], Poupart and Boutilier [PoB04], Yu and Bertsekas [YuB06], and Saldi, Yuksel, and Linder [SYL17]. In the case where the belief state b_k is used as a sufficient statistic, examples of good features may be a point estimate based on b_k, the variance of this estimate, and other quantities that can be simply extracted from b_k.

Of course it is possible to supplement a sufficient statistic with features of other sufficient statistics, and thus obtain an enlarged/richer sufficient statistic. In problem-specific contexts, and in the presence of approximations, this may yield improved results.

Unfortunately, in many situations an adequate set of features is not known, so it is important to have methods that construct features automatically, to supplement whatever features may already be available. Indeed, there are architectures that do not rely on the knowledge of good features. We have noted the kernel methods of Example 3.1.5 in this connection. Another very popular possibility is neural networks, which we will describe in Section 3.2. Some of these architectures involve training that constructs simultaneously both the feature vectors φ_k(x_k) and the parameter vectors r_k that weigh them.

3.1.2 Training of Linear and Nonlinear Architectures

In what follows in this section we discuss the process of choosing the parameter vector r of a parametric architecture J̃(x, r). This is generally referred to as training. The most common type of training is based on a least squares optimization, also known as least squares regression. Here a set of state-cost training pairs (x^s, β^s), s = 1, . . . , q, called the training set, is collected, and r is determined by solving the problem

$$\min_r \sum_{s=1}^{q} \big(\tilde J(x^s, r) - \beta^s\big)^2. \qquad (3.3)$$

Thus r is chosen to minimize the sum of squared errors between the sample costs β^s and the architecture-predicted costs J̃(x^s, r). Here there is some "target" cost function J that we aim to approximate with J̃(·, r), and the sample cost β^s is the value J(x^s) plus perhaps some error or "noise."

The cost function of the training problem (3.3) is generally nonconvex, which may pose challenges, since there may exist multiple local minima. However, for a linear architecture the cost function is convex quadratic, and the training problem admits a closed-form solution. In particular, for the linear architecture J̃(x, r) = r′φ(x), the problem becomes

$$\min_r \sum_{s=1}^{q} \big(r'\phi(x^s) - \beta^s\big)^2.$$

By setting the gradient of the quadratic objective to 0, we obtain

$$\sum_{s=1}^{q} \phi(x^s)\big(r'\phi(x^s) - \beta^s\big) = 0,$$

or

$$\sum_{s=1}^{q} \phi(x^s)\phi(x^s)'\, r = \sum_{s=1}^{q} \phi(x^s)\beta^s.$$

Thus by matrix inversion we obtain the minimizing parameter vector

$$\hat r = \left(\sum_{s=1}^{q} \phi(x^s)\phi(x^s)'\right)^{-1} \sum_{s=1}^{q} \phi(x^s)\beta^s. \qquad (3.4)$$
If the inverse above does not exist, an additional quadratic in r, called a regularization function, is added to the least squares objective to deal with this, and also to help with other issues to be discussed later. A singular value decomposition approach may also be used to deal with the matrix inversion issue; see [BeT96], Section 3.2.2.

Thus a linear architecture has the important advantage that the training problem can be solved exactly and conveniently with the formula (3.4) (of course it may also be solved by any other algorithm that is suitable for linear least squares problems). By contrast, if we use a nonlinear architecture, such as a neural network, the associated least squares problem is nonquadratic and also nonconvex. Despite this fact, through a combination of sophisticated implementation of special gradient algorithms, called incremental, and powerful computational resources, neural network methods have been successful in practice, as we will discuss in Section 3.2.
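As an illustration of the linear case, the closed-form solution (3.4), with an optional regularization term of the kind just mentioned, might be computed as in the following sketch (Python with NumPy); the feature map and the sample data are hypothetical placeholders, not part of the text.

```python
import numpy as np

def train_linear_architecture(states, costs, feature_map, reg=0.0):
    """Least squares fit of r in J_tilde(x, r) = r' phi(x) to pairs (x^s, beta^s),
    cf. Eqs. (3.3)-(3.4). A quadratic regularization term reg * ||r||^2 is added
    to guard against a singular or ill-conditioned matrix sum phi(x^s) phi(x^s)'."""
    Phi = np.array([feature_map(x) for x in states])   # q x m feature matrix
    beta = np.asarray(costs, dtype=float)              # q sample costs
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    b = Phi.T @ beta
    return np.linalg.solve(A, b)                       # minimizing parameter vector

# Hypothetical usage with scalar states and quadratic features.
phi = lambda x: np.array([1.0, x, x ** 2])
states = [0.0, 1.0, 2.0, 3.0]
costs = [1.1, 2.0, 5.2, 9.8]
print(train_linear_architecture(states, costs, phi, reg=1e-6))
```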
3.1.3 Incremental Gradient and Newton Methods

We will now digress to discuss special methods for the solution of the least squares training problem (3.3), assuming a parametric architecture that is differentiable in the parameter vector. This methodology can be properly viewed as a subject in nonlinear programming and iterative algorithms, and as such it can be studied independently of the approximate DP methods of this book. Thus the reader who already has some exposure to the subject may skip to the next section, and return later as needed.

Incremental methods have a rich theory, and our presentation in this section is brief, focusing primarily on implementation and intuition. Some surveys and book references are provided at the end of the chapter, which include a more detailed treatment. In particular, the neuro-dynamic programming book [BeT96] contains convergence analysis for both deterministic and stochastic training problems, while the author's nonlinear programming book [Ber16a] contains an extensive discussion of incremental methods. Since we want to cover problems that are more general than the specific least squares training problem (3.3), we will adopt a broader formulation and notation that are standard in nonlinear programming.

Incremental Gradient Methods

We view the training problem (3.3) as a special case of the minimization of a sum of component functions

$$f(y) = \sum_{i=1}^{m} f_i(y), \qquad (3.5)$$

where each f_i is a differentiable scalar function of the n-dimensional column vector y (this is the parameter vector). Thus we use the more common symbols y and m in place of r and q, respectively, and we replace the squared error terms (J̃(x^s, r) − β^s)² in the training problem (3.3) with the generic terms f_i(y).

The (ordinary) gradient method for problem (3.5) generates a sequence {y_k} of iterates, starting from some initial guess y_0 for the minimum of the cost function f. It has the form†

$$y_{k+1} = y_k - \gamma_k \nabla f(y_k) = y_k - \gamma_k \sum_{i=1}^{m} \nabla f_i(y_k), \qquad (3.6)$$

where γ_k is a positive stepsize parameter. The incremental gradient method is similar to the ordinary gradient method, but uses the gradient of a single component of f at each iteration. It has the general form

$$y_{k+1} = y_k - \gamma_k \nabla f_{i_k}(y_k), \qquad (3.7)$$

where i_k is some index from the set {1, . . . , m}, chosen by some deterministic or randomized rule. Thus a single component function f_{i_k} is used at iteration k, with great economies in gradient calculation cost over the ordinary gradient method (3.6), particularly when m is large.

The method for selecting the index i_k of the component to be iterated on at iteration k is important for the performance of the method. Three common rules are:

(1) A cyclic order, the simplest rule, whereby the indexes are taken up in the fixed deterministic order 1, . . . , m, so that i_k is equal to (k modulo m) plus 1. A contiguous block of iterations involving the components f_1, . . . , f_m in this order and exactly once is called a cycle.

(2) A uniform random order, whereby the index i_k is chosen randomly by sampling over all indexes with a uniform distribution, independently of the past history of the algorithm. This rule may perform better than the cyclic rule in some circumstances.

(3) A cyclic order with random reshuffling, whereby the indexes are taken up one by one within each cycle, but their order after each cycle is reshuffled randomly (and independently of the past). This rule is used widely in practice, particularly when the number of components m is modest, for reasons to be discussed later.

Note that in the cyclic cases, it is essential to include all components in a cycle, for otherwise some components will be sampled more often than others, leading to a bias in the convergence process. Similarly, it is necessary to sample according to the uniform distribution in the random order case.

Focusing for the moment on the cyclic rule (with or without reshuffling), we note that the motivation for the incremental gradient method is faster convergence: we hope that far from the solution, a single cycle of the method will be as effective as several (as many as m) iterations of the ordinary gradient method (think of the case where the components f_i are similar in structure). Near a solution, however, the incremental method may not be as effective. To be more specific, we note that there are two complementary performance issues to consider in comparing incremental and nonincremental methods:

(a) Progress when far from convergence. Here the incremental method can be much faster. For an extreme case take m large and all components f_i identical to each other. Then an incremental iteration requires m times less computation than a classical gradient iteration, but gives exactly the same result, when the stepsize is scaled to be m times larger. While this is an extreme example, it reflects the essential mechanism by which incremental methods can be much superior: far from the minimum a single component gradient will point to "more or less" the right direction, at least most of the time; see the following example.

(b) Progress when close to convergence. Here the incremental method can be inferior. In particular, the ordinary gradient method (3.6) can be shown to converge with a constant stepsize under reasonable assumptions; see, e.g., [Ber16a], Chapter 1. However, the incremental method requires a diminishing stepsize, and its ultimate rate of convergence can be much slower.

This type of behavior is illustrated in the following example (preceded by a brief implementation sketch of the iteration (3.7) itself).

† We use standard calculus notation for gradients; see, e.g., [Ber16a], Appendix A. In particular, ∇f(y) denotes the n-dimensional column vector whose components are the first partial derivatives ∂f(y)/∂y_i of f with respect to the components y_1, . . . , y_n of the column vector y.
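Before turning to that example, the incremental iteration (3.7) can be sketched in code as follows. The component gradients, the stepsize rule, and the small test problem at the end are hypothetical placeholders; the test problem is of the scalar least squares form used in the example that follows.

```python
import numpy as np

def incremental_gradient(grad_component, y0, m, num_cycles=100,
                         stepsize=lambda k: 1.0 / (1 + k), reshuffle=True):
    """Incremental gradient method (3.7) for minimizing sum_i f_i(y).
    grad_component(i, y) returns grad f_i(y); with reshuffle=True the cyclic
    order is reshuffled after each cycle (rule (3) in the text)."""
    y = np.array(y0, dtype=float)
    order = np.arange(m)
    k = 0
    for _ in range(num_cycles):
        if reshuffle:
            np.random.shuffle(order)          # cyclic order with random reshuffling
        for i in order:                       # one cycle through all m components
            y -= stepsize(k) * grad_component(i, y)
            k += 1
    return y

# Hypothetical usage: scalar components f_i(y) = (1/2)(c_i y - b_i)^2, so that
# grad f_i(y) = c_i (c_i y - b_i), as in the example that follows.
c = np.array([1.0, 0.5, 0.8])
b = np.array([1.0, 2.0, 0.0])
grad_i = lambda i, y: c[i] * (c[i] * y - b[i])
print(incremental_gradient(grad_i, y0=0.0, m=3))
```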
Example 3.1.7

Assume that y is a scalar, and that the problem is

$$\text{minimize } f(y) = \frac{1}{2}\sum_{i=1}^{m} (c_i y - b_i)^2 \quad \text{subject to } y \in \Re,$$

where c_i and b_i are given scalars with c_i ≠ 0 for all i. The minimum of each of the components f_i(y) = ½(c_i y − b_i)² is

$$y^*_i = \frac{b_i}{c_i},$$

while the minimum of the least squares cost function f is

$$y^* = \frac{\sum_{i=1}^m c_i b_i}{\sum_{i=1}^m c_i^2}.$$

[Figure 3.1.4 Illustrating the advantage of incrementalism when far from the optimal solution. The region of component minima R = [min_i y*_i, max_i y*_i] is labeled as the "region of confusion": it is the region where the method does not have a clear direction towards the optimum. The ith step in an incremental gradient cycle is a gradient step for minimizing (c_i y − b_i)², so if y lies outside R (in the "farout region") and the stepsize is small enough, progress towards the solution y* is made.]

It can be seen that y* lies within the range of the component minima

$$R = \Big[\min_i y^*_i,\ \max_i y^*_i\Big],$$

and that for all y outside the range R, the gradient

$$\nabla f_i(y) = c_i(c_i y - b_i)$$

has the same sign as ∇f(y) (see Fig. 3.1.4). As a result, when outside the region R, the incremental gradient method

$$y_{k+1} = y_k - \gamma_k\, c_{i_k}(c_{i_k} y_k - b_{i_k})$$

approaches y* at each step, provided the stepsize γ_k is small enough. In fact it can be verified that it is sufficient that

$$\gamma_k \le \min_i \frac{1}{c_i^2}.$$

However, for y inside the region R, the ith step of a cycle of the incremental gradient method need not make progress. It will approach y* (for small enough stepsize γ_k) only if the current point y_k does not lie in the interval connecting y*_i and y*. This induces an oscillatory behavior within the region R, and as a result, the incremental gradient method will typically not converge to y* unless γ_k → 0.

By contrast, the ordinary gradient method, which takes the form

$$y_{k+1} = y_k - \gamma \sum_{i=1}^{m} c_i(c_i y_k - b_i),$$

can be verified to converge to y* for any constant stepsize γ with

$$0 < \gamma \le \frac{1}{\sum_{i=1}^m c_i^2}.$$

However, for y outside the region R, a full iteration of the ordinary gradient method need not make more progress towards the solution than a single step of the incremental gradient method. In other words, with comparably intelligent stepsize choices, far from the solution (outside R), a single pass through the entire set of cost components by incremental gradient is roughly as effective as m passes by ordinary gradient.

The preceding example assumes that each component function f_i has a minimum, so that the range of component minima is defined. In cases where the components f_i have no minima, a similar phenomenon may occur, as illustrated by the following example (the idea here is that we may combine several components into a single component that has a minimum).

Example 3.1.8

Consider the case where f is the sum of increasing and decreasing convex exponentials, i.e.,

$$f_i(y) = a_i e^{b_i y}, \qquad y \in \Re,$$

where a_i and b_i are scalars with a_i > 0 and b_i ≠ 0. Let

$$I^+ = \{i \mid b_i > 0\}, \qquad I^- = \{i \mid b_i < 0\},$$

and assume that I^+ and I^− have roughly equal numbers of components. Let also y* be the minimum of Σ_{i=1}^m f_i.

Consider the incremental gradient method that, given the current point, call it y_k, chooses some component f_{i_k} and iterates according to the incremental gradient iteration

$$y_{k+1} = y_k - \alpha_k \nabla f_{i_k}(y_k).$$

Then it can be seen that if y_k >> y*, y_{k+1} will be substantially closer to y*
if i_k ∈ I^+, and negligibly further away from y* if i_k ∈ I^−. The net effect, averaged over many incremental iterations, is that if y_k >> y*, an incremental gradient iteration makes roughly one half the progress of a full gradient iteration, with m times less overhead for calculating gradients. The same is true if y_k << y*. On the other hand, as y_k gets closer to y* the advantage of incrementalism is reduced, similar to the preceding example. In fact in order for the incremental method to converge, a diminishing stepsize is necessary, which will ultimately make the convergence slower than that of the nonincremental gradient method with a constant stepsize.

The discussion of the preceding examples relies on y being one-dimensional, but in many multidimensional problems the same qualitative behavior can be observed. In particular, a pass through the ith component f_i by the incremental gradient method can make progress towards the solution in the region where the component gradient ∇f_i(y_k) makes an angle less than 90 degrees with the cost function gradient ∇f(y_k). If the components f_i are not "too dissimilar," this is likely to happen in a region of points that are not too close to the optimal solution set. This behavior has been verified in many practical contexts, including the training of neural networks (cf. the next section), where incremental gradient methods have been used extensively, frequently under the name backpropagation methods.

Stepsize Choice and Diagonal Scaling

The choice of the stepsize γ_k plays an important role in the performance of incremental gradient methods. On close examination, it turns out that the direction used by the method differs from the gradient direction by an error that is proportional to the stepsize, and for this reason a diminishing stepsize is essential for convergence to a local minimum of f (this is the best we can hope for since f may not be convex). However, it turns out that for the cyclic rule without reshuffling a peculiar form of convergence also typically occurs for a constant but sufficiently small stepsize. In this case, the iterates converge to a "limit cycle," whereby the ith iterates within the cycles converge to a different limit than the jth iterates for i ≠ j. The sequence {y_k} of the iterates obtained at the end of cycles converges, except that the limit obtained need not be a minimum of f, even when f is convex. The limit tends to be close to the minimum point when the constant stepsize is small (see Section 2.4 of [Ber16a] for analysis and examples).

In practice, it is common to use a constant stepsize for a (possibly prespecified) number of iterations, then decrease the stepsize by a certain factor, and repeat, up to the point where the stepsize reaches a prespecified floor value. An alternative possibility is to use a diminishing stepsize rule of the form

$$\gamma_k = \min\left\{\gamma,\ \frac{\beta_1}{k + \beta_2}\right\},$$
Regardless of whether a constant or a diminishing stepsize is ultimately used, to maintain the advantage of faster convergence when far from the solution, the incremental method must use a much larger stepsize than the corresponding nonincremental gradient method (as much as m times larger so that the size of the incremental gradient step is comparable to the size of the nonincremental gradient step). One possibility is to use an adaptive stepsize rule, whereby, roughly speaking, the stepsize is reduced (or increased) when the progress of the method indicates that the algorithm is (or is not) oscillating. There are formal ways to implement such stepsize rules with sound convergence properties (see [Tse98], [MYF03]). The difficulty with stepsize selection may also be addressed with di- agonal scaling, i.e., using a stepsize 污k j that is different for each of the components yj of y. Second derivatives can be very useful for this purpose. In generic nonlinear programming problems of unconstrained minimization of a function f, it is common to use diagonal scaling with stepsizes 污k j = 污 ?2f(yk) ?2yj ?1 , j = 1, . . . , n, where 污 is a constant that is nearly equal 1 (the second derivatives may also be approximated by gradient difference approximations). However, in least squares training problems, this type of scaling is inconvenient because of the additive form of f as a sum of a large number of component functions: f(y) = m Xi=1 fi(y), cf. Eq. (3.5). The neural network literature includes a number of practical scaling schemes, some of which have been incorporated in publicly and commercially available software. Also, later in this section, when we discuss incremental Newton methods, we will provide another type of diagonal scaling that uses second derivatives and is well suited to the additive character of the cost function f. Stochastic Gradient Descent Incremental gradient methods are related to methods that aim to minimize an expected value f(y) = EF(y,w) , 142 Parametric Approximation Chap. 3 where w is a random variable, and F(﹞,w) : ?n 7↙ ? is a differentiable function for each possible value of w. The stochastic gradient method for minimizing f is given by yk+1 = yk ? 污k?yF(yk,wk), (3.8) where wk is a sample of w and ?yF denotes gradient of F with respect to y. This method has a rich theory and a long history, and it is strongly related to the classical algorithmic field of stochastic approximation; see the books [BeT96], [KuY03], [Spa03], [Mey07], [Bor08], [BPP13]. The method is also often referred to as stochastic gradient descent , particularly in the context of machine learning applications. If we view the expected value cost EF(y,w) as a weighted sum of cost function components, we see that the stochastic gradient method (3.8) is related to the incremental gradient method yk+1 = yk ? 污k?fik (yk) (3.9) for minimizing a finite sumPm i=1 fi, when randomization is used for component selection. An important difference is that the former method involves stochastic sampling of cost components F(y,w) from a possibly infinite population, under some probabilistic assumptions, while in the latter the set of cost components fi is predetermined and finite. However, it is possible to view the incremental gradient method (3.9), with uniform randomized selection of the component function fi (i.e., with ik chosen to be any one of the indexes 1, . . . ,m, with equal probability 1/m, and independently of preceding choices), as a stochastic gradient method. 
Despite the apparent similarity of the incremental and the stochastic gradient methods, the view that the problem

$$\text{minimize } f(y) = \sum_{i=1}^{m} f_i(y) \quad \text{subject to } y \in \Re^n \qquad (3.10)$$

can simply be treated as a special case of the problem

$$\text{minimize } f(y) = E\big\{F(y, w)\big\} \quad \text{subject to } y \in \Re^n$$

is questionable. One reason is that once we convert the finite sum problem to a stochastic problem, we preclude the use of methods that exploit the finite sum structure, such as the incremental aggregated gradient methods to be discussed in the next subsection. Another reason is that the finite-component problem (3.10) is often genuinely deterministic, and to view it as a stochastic problem at the outset may mask some of its important characteristics, such as the number m of cost components, or the sequence in which the components are ordered and processed. These characteristics may potentially be algorithmically exploited. For example, with insight into the problem's structure, one may be able to discover a special deterministic or partially randomized order of processing the component functions that is superior to a uniform randomized order. On the other hand, analysis indicates that in the absence of problem-specific knowledge that can be exploited to select a favorable deterministic order, a uniform randomized order (each component f_i chosen with equal probability 1/m at each iteration, independently of preceding choices) sometimes has superior worst-case complexity; see [NeB00], [NeB01], [BNO03], [Ber15a], [WaB16].

Finally, let us note the popular hybrid technique, which reshuffles randomly the order of the cost components after each cycle. Practical experience indicates that it has somewhat better performance than the uniform randomized order when m is large. One possible reason is that random reshuffling allocates exactly one computation slot to each component in an m-slot cycle, while uniform random sampling allocates one computation slot to each component only on the average. A nonzero variance in the number of slots that any fixed component gets within a cycle may be detrimental to performance. While it seems difficult to establish this fact analytically, a justification is suggested by the view of the incremental gradient method as a gradient method with error in the computation of the gradient. The error has apparently greater variance in the uniform random sampling method than in the random reshuffling method, and heuristically, if the variance of the error is larger, the direction of descent deteriorates, suggesting slower convergence.

Incremental Aggregated Gradient Methods

Another algorithm that is well suited for least squares training problems is the incremental aggregated gradient method, which has the form

$$y_{k+1} = y_k - \gamma_k \sum_{\ell=0}^{m-1} \nabla f_{i_{k-\ell}}(y_{k-\ell}), \qquad (3.11)$$

where f_{i_k} is the new component function selected for iteration k.† In the most common version of the method the component indexes i_k are selected in a cyclic order [i_k = (k modulo m) + 1]. Random selection of the index i_k has also been suggested.

From Eq. (3.11) it can be seen that the method computes the gradient incrementally, one component per iteration.

† In the case where k < m, the summation in Eq. (3.11) should go up to ℓ = k, and the stepsize should be replaced by a correspondingly larger value.
However, in place of the single component gradient ∇f_{i_k}(y_k) used in the incremental gradient method (3.7), it uses the sum of the component gradients computed in the past m iterations, which is an approximation to the total cost gradient ∇f(y_k). The idea of the method is that by aggregating the component gradients one may be able to reduce the error between the true gradient ∇f(y_k) and the incrementally computed approximation used in Eq. (3.11), and thus attain a faster asymptotic convergence rate. Indeed, it turns out that under certain conditions the method exhibits a linear convergence rate, just like the nonincremental gradient method, without incurring the cost of a full gradient evaluation at each iteration (a strongly convex cost function and a sufficiently small constant stepsize are required for this; see [Ber16a], Section 2.4.2, and the references quoted there). This is in contrast with the incremental gradient method (3.7), for which a linear convergence rate can be achieved only at the expense of asymptotic error, as discussed earlier.

A disadvantage of the aggregated gradient method (3.11) is that it requires that the most recent component gradients be kept in memory, so that when a component gradient is reevaluated at a new point, the preceding gradient of the same component is discarded from the sum of gradients of Eq. (3.11). There have been alternative implementations that ameliorate this memory issue, by recalculating the full gradient periodically and replacing an old component gradient by a new one; we refer to the specialized literature on the subject. Another potential drawback of the aggregated gradient method is that for a large number of terms m, one hopes to converge within the first cycle through the components f_i, thereby reducing the effect of aggregating the gradients of the components.

Incremental Newton Methods

We will now consider an incremental version of Newton's method for unconstrained minimization of an additive cost function of the form

$$f(y) = \sum_{i=1}^{m} f_i(y),$$

where the functions f_i are convex and twice continuously differentiable.†

† We will denote by ∇²f(y) the n × n Hessian matrix of f at y, i.e., the matrix whose (i, j)th component is the second partial derivative ∂²f(y)/∂y_i∂y_j. A beneficial consequence of assuming convexity of f_i is that the Hessian matrices ∇²f_i(y) are positive semidefinite, which facilitates the implementation of the algorithms to be described. On the other hand, the algorithmic ideas of this section may also be adapted for the case where the functions f_i are nonconvex.

The ordinary Newton's method, at the current iterate y_k, obtains the next iterate y_{k+1} by minimizing over y the quadratic approximation/second order expansion
is defined as the quadratic approximation (3.12). If all the functions fi are quadratic, it can be seen that the method finds the solution in a single cycle.? The reason is that when fi is quadratic, each fi(y) differs from ?fi(y; 肉) by a constant, which does not depend on y. Thus the difference m Xi=1 fi(y) ? m Xi=1 ? fi(y; 肉i?1,k) is a constant that is independent of y, and minimization of either sum in the above expression gives the same result. It is important to note that the computations of Eq. (3.13) can be carried out efficiently. For simplicity, let as assume that ? f1(y; 肉) is a positive definite quadratic, so that for all i, 肉i,k is well defined as the unique ? Here we assume that them quadratic minimizations (3.13) to generate 肉m,k have a solution. For this it is sufficient that the first Hessian matrix ?2f1(y0) be positive definite, in which case there is a unique solution at every iteration. A simple possibility to deal with this requirement is to add to f1 a small positive regularization term, such as ?ky ? y0k2. A more sound possibility is to lump together several of the component functions (enough to ensure that the sum of their quadratic approximations at y0 is positive definite), and to use them in place of f1. This is generally a good idea and leads to smoother initialization, as it ensures a relatively stable behavior of the algorithm for the initial iterations. 146 Parametric Approximation Chap. 3 solution of the minimization problem in Eq. (3.13). We will show that the incremental Newton method (3.13) can be implemented with the incremental update formula 肉i,k = 肉i?1,k ? Di,k?fi(肉i?1,k), (3.14) where Di,k is given by Di,k = i X?=1 ?2f?(肉??1,k)! ?1 , (3.15) and (assuming existence of the required inverses) is generated iteratively as Di,k = D?1 i?1,k + ?2fi(肉i?1,k)?1 . (3.16) Indeed, from the definition of the method (3.13), the quadratic function Pi?1 ?=1 ? f?(y; 肉??1,k) is minimized by 肉i?1,k and its Hessian matrix is D?1 i?1,k, so we have i?1 X?=1 ? f?(y; 肉??1,k) = 1 2 (y ? 肉i?1,k)∩D?1 i?1,k(y ? 肉i?1,k) + constant. Thus, by adding ? fi(y; 肉i?1,k) to both sides of this expression, we obtain i X?=1 ? f?(y; 肉??1,k) = 1 2 (y ? 肉i?1,k)∩D?1 i?1,k(y ? 肉i?1,k) + constant + 1 2 (y ? 肉i?1,k)∩?2fi(肉i?1,k)(y ? 肉i?1,k) + ?fi(肉i?1,k)∩(y ? 肉i?1,k). Since by definition 肉i,k minimizes this function, we obtain Eqs. (3.14)- (3.16). The recursion (3.16) for the matrix Di,k can often be efficiently implemented by using convenient formulas for the inverse of the sum of two matrices. In particular, if fi is given by fi(y) = hi(a∩ iy ? bi), for some twice differentiable convex function hi : ? 7↙ ?, vector ai, and scalar bi, we have ?2fi(肉i?1,k) = ?2hi(a∩ i肉i?1,k ? bi) aia∩ i, and the recursion (3.16) can be written as Di,k = Di?1,k ? Di?1,kaia∩ iDi?1,k ?2hi(a∩ i肉i?1,k ? bi)?1 + a∩ iDi?1,kai . Sec. 3.1 Approximation Architectures 147 This is the well-known Sherman-Morrison formula for the inverse of the sum of an invertible matrix and a rank-one matrix. We have considered so far a single cycle of the incremental Newton method. Similar to the case of the incremental gradient method, we may cycle through the component functions multiple times. In particular, we may apply the incremental Newton method to the extended set of components f1, f2, . . . , fm, f1, f2, . . . , fm, f1, f2, . . . . (3.17) The resulting method asymptotically resembles an incremental gradient method with diminishing stepsize of the type described earlier. Indeed, from Eq. 
Indeed, from Eq. (3.15), the matrix D_{i,k} diminishes roughly in proportion to 1/k. From this it follows that the asymptotic convergence properties of the incremental Newton method are similar to those of an incremental gradient method with diminishing stepsize of order O(1/k). Thus its convergence rate is slower than linear.

To accelerate the convergence of the method one may employ a form of restart, so that D_{i,k} does not converge to 0. For example, D_{i,k} may be reinitialized and increased in size at the beginning of each cycle. For problems where f has a unique nonsingular minimum y* [one for which ∇²f(y*) is nonsingular], one may design incremental Newton schemes with restart that converge linearly to within a neighborhood of y* (and even superlinearly if y* is also a minimum of all the functions f_i, so there is no region of confusion); see [Ber96a]. Alternatively, the update formula (3.16) may be modified to

$$D_{i,k} = \left(\beta_k D_{i-1,k}^{-1} + \nabla^2 f_i(\psi_{i-1,k})\right)^{-1}, \qquad (3.18)$$

by introducing a parameter β_k ∈ (0, 1), which can be used to accelerate the practical convergence rate of the method.

Incremental Newton Method with Diagonal Approximation

Generally, with proper implementation, the incremental Newton method is often substantially faster than the incremental gradient method, in terms of numbers of iterations (there are theoretical results suggesting this property for stochastic versions of the two methods; see the end-of-chapter references). However, in addition to the computation of second derivatives, the incremental Newton method involves greater overhead per iteration due to the matrix-vector calculations in Eqs. (3.14), (3.16), and (3.18), so it is suitable only for problems where n, the dimension of y, is relatively small.

A way to remedy in part this difficulty is to approximate ∇²f_i(ψ_{i,k}) by a diagonal matrix, and recursively update a diagonal approximation of D_{i,k} using Eqs. (3.16) or (3.18). In particular, one may set to 0 the off-diagonal components of ∇²f_i(ψ_{i,k}). In this case, the iteration (3.14) becomes a diagonally scaled version of the incremental gradient method, and involves comparable overhead per iteration (assuming the required diagonal second derivatives are easily computed or approximated). As an additional scaling option, one may multiply the diagonal components with a stepsize parameter that is close to 1 and add a small positive constant (to bound them away from 0). Ordinarily this method is easily implementable, and requires little experimentation with stepsize selection.

Example 3.1.9 (Diagonalized Newton Method for Linear Least Squares)

Consider the problem of minimizing the sum of squares of linear scalar functions:

$$\text{minimize } f(y) = \frac{1}{2}\sum_{i=1}^{m} (c_i'y - b_i)^2 \quad \text{subject to } y \in \Re^n,$$

where for all i, c_i is a given nonzero vector in ℜ^n and b_i is a given scalar. In this example, we will assume that we apply the incremental Newton method to the extended set of components f_1, f_2, . . . , f_m, f_1, f_2, . . . , f_m, f_1, f_2, . . . , cf. Eq. (3.17). The inverse Hessian matrix for the kth cycle is given by

$$D_{i,k} = \left((k-1)\sum_{\ell=1}^{m} c_\ell c_\ell' + \sum_{\ell=1}^{i} c_\ell c_\ell'\right)^{-1}, \qquad i = 1, \ldots, m,$$

and its jth diagonal term (the one that multiplies the partial derivative ∂f_i/∂y_j within cycle k + 1) is

$$\gamma^k_{i,j} = \frac{1}{(k-1)\sum_{\ell=1}^{m}(c^j_\ell)^2 + \sum_{\ell=1}^{i}(c^j_\ell)^2}. \qquad (3.19)$$

The diagonalized version of the incremental Newton method is given, for each of the n components of ψ_{i,k}, by

$$\psi^j_{i,k} = \psi^j_{i-1,k} - \gamma^k_{i,j}\, c^j_i\, (c_i'\psi_{i-1,k} - b_i), \qquad j = 1, \ldots, n,$$

where ψ^j_{i,k} and c^j_i are the jth components of the vectors ψ_{i,k} and c_i, respectively.
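A hedged sketch of this diagonalized iteration for the linear least squares problem of the example follows; the test data at the end are arbitrary, and the implementation simply accumulates the denominators of Eq. (3.19) coordinatewise across cycles.

```python
import numpy as np

def diagonalized_incremental_newton(C, b, num_cycles=100):
    """Diagonalized incremental Newton method of Example 3.1.9 for minimizing
    (1/2) * sum_i (c_i' y - b_i)^2, where C has the vectors c_i as rows.
    The inverse Hessian is approximated by its diagonal, cf. Eq. (3.19)."""
    m, n = C.shape
    y = np.zeros(n)
    S = np.zeros(n)                           # running sums of (c_l^j)^2 per coordinate
    for _ in range(num_cycles):
        for i in range(m):                    # one cycle over the components
            S += C[i] ** 2                    # denominator of the scaling (3.19)
            gamma = np.divide(1.0, S, out=np.zeros_like(S), where=S > 0)
            y -= gamma * C[i] * (C[i] @ y - b[i])   # diagonally scaled step
    return y

# Hypothetical usage on a small random least squares problem.
rng = np.random.default_rng(0)
C = rng.normal(size=(50, 3))
b = rng.normal(size=50)
print(diagonalized_incremental_newton(C, b))
print(np.linalg.lstsq(C, b, rcond=None)[0])   # exact solution, for comparison
```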
Contrary to the incremental Newton iteration, it does not involve matrix inversion. Note that the diagonal scaling term (3.19) tends to 0 as k increases, for each component j. As noted earlier, this diagonal term may be multiplied by a constant that is less than or equal to 1 for a better guarantee of asymptotic convergence.

In an alternative implementation of the diagonal Hessian approximation idea, we reinitialize the inverse Hessian matrix to 0 (or to a small multiple of the identity) at the beginning of each cycle. Then instead of the scalar γ^k_{i,j} of Eq. (3.19), the jth diagonal term of the inverse Hessian will be

$$\frac{1}{\sum_{\ell=1}^{i}(c^j_\ell)^2},$$

and will not converge to 0. It may be modified by multiplication with a suitable diminishing scalar for practical purposes.

3.2 NEURAL NETWORKS

There are several different types of neural networks that can be used for a variety of tasks, such as pattern recognition, classification, image and speech recognition, and others. In this section, we focus on our finite horizon DP context, and the role that neural networks can play in approximating the optimal cost-to-go functions J*_k. As an example within this context, we may first use a neural network to construct an approximation to J*_{N−1}. Then we may use this approximation to approximate J*_{N−2}, and continue this process backwards in time, to obtain approximations to all the optimal cost-to-go functions J*_k, k = 1, . . . , N − 1, as we will discuss in more detail in Section 3.3.

To describe the use of neural networks in finite horizon DP, let us consider the typical stage k, and for convenience drop the index k; the subsequent discussion applies to each value of k separately. We consider parametric architectures J̃(x, v, r) of the form

$$\tilde J(x, v, r) = r'\phi(x, v) \qquad (3.20)$$

that depend on two parameter vectors v and r. Our objective is to select v and r so that J̃(x, v, r) approximates some cost function that can be sampled (possibly with some error). The process is to collect a training set that consists of a large number of state-cost pairs (x^s, β^s), s = 1, . . . , q, and to find a function J̃(x, v, r) of the form (3.20) that matches the training set in a least squares sense, i.e., (v, r) minimizes

$$\sum_{s=1}^{q} \big(\tilde J(x^s, v, r) - \beta^s\big)^2.$$

We postpone for later the question of how the training pairs (x^s, β^s) are generated, and how the training problem is solved.†

† The least squares training problem used here is based on nonlinear regression. This is a classical method for approximating the expected value of a function with a parametric architecture, and involves a least squares fit of the architecture to simulation-generated samples of the expected value. We refer to machine learning and statistics textbooks for more discussion.

[Figure 3.2.1 A perceptron consisting of a linear layer and a nonlinear layer. It provides a way to compute features of the state, which can be used for cost function approximation. The state x is encoded as a vector of numerical values y(x), which is then transformed linearly as Ay(x) + b in the linear layer. The m scalar output components of the linear layer become the inputs to nonlinear functions that produce the m scalars φ_ℓ(x, v) = σ((Ay(x) + b)_ℓ), which can be viewed as features that are in turn linearly weighted with parameters r_ℓ.]
The state x is encoded as a vector of numerical values y(x), which is then transformed linearly as Ay(x) + b in the linear layer. The m scalar output components of the linear layer, become the inputs to nonlinear functions that produce the m scalars ?(x, v) = ??(Ay(x) + b)?, which can be viewed as features that are in turn linearly weighted with parameters r?. roles of the two parameter vectors here: v parametrizes 耳(x, v), which in some interpretation may be viewed as a feature vector, and r is a vector of linear weighting parameters for the components of 耳(x, v). A neural network architecture provides a parametric class of functions ? J(x, v, r) of the form (3.20) that can be used in the optimization framework just described. The simplest type of neural network is the single layer per- ceptron; see Fig. 3.2.1. Here the state x is encoded as a vector of numerical values y(x) with components y1(x), . . . , yn(x), which is then transformed linearly as Ay(x) + b, where A is an m℅n matrix and b is a vector in ?m.? This transformation is called the linear layer of the neural network. We view the components of A and b as parameters to be determined, and we group them together into the parameter vector v = (A, b). Each of the m scalar output components of the linear layer, ??Ay(x) + b?, ? = 1, . . . ,m, becomes the input to a nonlinear differentiable function 考 that maps scalars to scalars. Typically 考 is differentiable and monotonically increasing. A simple and popular possibility is the rectified linear unit (ReLU for short), ? The method of encoding x into the numerical vector y(x) is problemdependent, but it is important to note that some of the components of y(x) could be some known interesting features of x that can be designed based on problem-specific knowledge. Sec. 3.2 Neural Networks 151 ) ! 1 0 -1 ) ! 1 0 -1 1 0 -1 1 0 -1 Good approximation Poor Approximation !(") = ln(1 + e !) max{0, !} Figure 3.2.2 The rectified linear unit () = ln(1 + e缶). It is the function max{0, } with its corner ※smoothed out.§ Its derivative is ∩() = e缶/(1 + e缶), and approaches 0 and 1 as ! ?1 and ! 1, respectively. Selective Depth Lookahead Tree !(") Selective Depth Lookahead Tree !(") ) ! 1 0 -1 ) ! 1 0 -1 ! 1 0 -1 ! 1 0 -1 1 0 -1 1 0 -1 1 0 -1 Figure 3.2.3 Some examples of sigmoid functions. The hyperbolic tangent function is on the left, while the logistic function is on the right. which is simply the function max{0, 缶}, approximated by a differentiable function 考 by some form of smoothing operation; for example 考(缶) = ln(1+ e缶), which illustrated in Fig. 3.2.2. Other functions, used since the early days of neural networks, have the property ?﹢ < lim 缶↙?﹢ 考(缶) < lim 缶↙﹢ 考(缶) < ﹢; see Fig. 3.2.3. Such functions are called sigmoids, and some common choices are the hyperbolic tangent function 考(缶) = tanh(缶) = e缶 ? e?缶 e缶 + e?缶 , and the logistic function 考(缶) = 1 1 + e?缶 . In what follows, we will ignore the character of the function 考 (except for differentiability), and simply refer to it as a ※nonlinear unit§ and to the corresponding layer as a ※nonlinear layer.§ 152 Parametric Approximation Chap. 3 . . . State x 1 0 -1 Encoding 1 0 -1 Encoding y(x) State Cost Approximation Feature ExtractioCost 0 Cost g n Feature Extraction . . . . . . ) Neural Network ) Neural Network } ? J(x) Figure 3.2.4 Nonlinear architecture with a view of the state encoding process as a feature extraction mapping preceding the neural network. At the outputs of the nonlinear units, we obtain the scalars 耳?(x, v) = 考??(Ay(x) + b)?, ? 
= 1, . . . ,m. One possible interpretation is to view 耳?(x, v) as features of x, which are linearly combined using weights r?, ? = 1, . . . ,m, to produce the final output J?(x, v, r) = m X?=1 r?耳?(x, v) = m X?=1 r?考??(Ay(x) + b)?. Note that each value 耳?(x, v) depends on just the ?th row of A and the ?th component of b, not on the entire vector v. In some cases this motivates placing some constraints on individual components of A and b to achieve special problem-dependent ※handcrafted§ effects. State Encoding and Direct Feature Extraction The state encoding operation that transforms x into the neural network input y(x) can be instrumental in the success of the approximation scheme. Examples of possible state encodings are components of the state x, numerical representations of qualitative characteristics of x, and more generally features of x, i.e., functions of x that aim to capture ※important nonlinearities§ of the optimal cost-to-go function. With the latter view of state encoding, we may consider the approximation process as consisting of a feature extraction mapping, followed by a neural network with input the extracted features of x, and output the cost-to-go approximation; see Fig. 3.2.4. The idea here is that with a good feature extraction mapping, the neural network need not be very complicated (may involve few nonlinear units and corresponding parameters), and may be trained more easily. This intuition is borne out by simple examples and practical experience. However, as is often the case with neural networks, it is hard to support it with a quantitative analysis. Sec. 3.2 Neural Networks 153 Figure 3.2.5 Illustration of the graph of the cost function of a two-dimensional neural network training problem. The cost function is nonconvex, and it becomes flat for large values of the weights. In particular, the cost tends asymptotically to a constant as the vector of weights is changed towards 1 along rays emanating from the origin. The graph in the figure corresponds to an extremely simple training problem: a single input, a single nonlinear unit, two weights, and five input-output data pairs (corresponding to the five ※petals§ of the graph). 3.2.1 Training of Neural Networks Given a set of state-cost training pairs (xs, 汕s), s = 1, . . . , q, the parameters of the neural network A, b, and r are obtained by solving the training problem min A,b,r q Xs=1 m X?=1 r?考 ????Ay(xs) + b?? 汕s!2 . (3.21) Note that the cost function of this problem is generally nonconvex, so there may exist multiple local minima. In fact the graph of cost function of the above training problem can be very complicated; see Fig. 3.2.5, which illustrates a very simple special case. In practice it is common to augment the cost function of this problem with a regularization function, such as a quadratic in the parameters A, b, and r. This is customary in least squares problems in order to make the problem easier to solve algorithmically. However, in the context of neural network training, regularization is primarily important for a different reason: it helps to avoid overfitting, which occurs when the number of parameters of the neural network is relatively large (comparable to the 154 Parametric Approximation Chap. 3 size of the training set). In this case a neural network model matches the training data very well but may not do as well on new data. This is a well known difficulty in machine learning, which is the subject of much current research, particularly in the context of deep neural networks. 
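As an illustration of the architecture (3.20) and the training problem (3.21), here is a minimal Python sketch of a single layer perceptron with the smoothed rectifier σ(ξ) = ln(1 + e^ξ), trained by the incremental gradient method of Section 3.1.3 (one sample term per step, with the order randomized at each cycle). The state encoding y(x) is assumed to be supplied as a numeric vector; the names, the initialization, and the stepsize are illustrative, and the optional weight_decay term corresponds to the quadratic regularization just mentioned.

```python
import numpy as np

def softplus(z):                      # smoothed rectifier sigma(z) = ln(1 + e^z)
    return np.logaddexp(0.0, z)

def softplus_prime(z):                # its derivative e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_value(y, A, b, r):
    """Single layer perceptron: J(x, v, r) = sum_l r_l * sigma((A y(x) + b)_l)."""
    return r @ softplus(A @ y + b)

def incremental_gradient_step(y, beta, A, b, r, stepsize, weight_decay=0.0):
    """One incremental gradient step on a single sample term of problem (3.21)."""
    z = A @ y + b
    phi = softplus(z)                              # features phi_l(x, v)
    err = r @ phi - beta                           # residual on this sample
    grad_r = 2.0 * err * phi + weight_decay * r
    grad_b = 2.0 * err * r * softplus_prime(z) + weight_decay * b
    grad_A = np.outer(2.0 * err * r * softplus_prime(z), y) + weight_decay * A
    return A - stepsize * grad_A, b - stepsize * grad_b, r - stepsize * grad_r

def train(samples, m, stepsize=1e-3, cycles=100, weight_decay=0.0, seed=0):
    """samples: list of (y(x^s), beta^s) pairs; m: number of nonlinear units."""
    rng = np.random.default_rng(seed)
    n = len(samples[0][0])
    A = 0.1 * rng.standard_normal((m, n))
    b = np.zeros(m)
    r = 0.1 * rng.standard_normal(m)
    for _ in range(cycles):
        for idx in rng.permutation(len(samples)):  # randomize the order at each cycle
            y, beta = samples[idx]
            A, b, r = incremental_gradient_step(y, beta, A, b, r, stepsize, weight_decay)
    return A, b, r
```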
We refer to machine learning and neural network textbooks for a discussion of algorithmic questions regarding regularization and other issues that relate to the practical implementation of the training process. In any case, the training problem (3.21) is an unconstrained nonconvex differentiable optimization problem that can in principle be addressed with any of the standard gradient-type methods. Significantly, it is well-suited for the incremental methods discussed in Section 3.1.3 (a sense of this can be obtained by comparing the cost structure illustrated in Fig. 3.2.5 with the one of Fig. 3.1.4, which explains the advantage of incrementalism). Let us now consider a few issues regarding the neural network formulation and training process just described: (a) The first issue is to select a method to solve the training problem (3.21). While we can use any unconstrained optimization method that is based on gradients, in practice it is important to take into account the cost function structure of problem (3.21). The salient characteristic of this cost function is that it is the sum of a potentially very large number q of component functions. This makes the computation of the cost function value of the training problem and/or its gradient very costly. For this reason the incremental methods of Section 3.1.3 are universally used for training.? Experience has shown that these methods can be vastly superior to their nonincremental counterparts in the context of neural network training. The implementation of the training process has benefited from experience that has been accumulated over time, and has provided guidelines for scaling, regularization, initial parameter selection, and other practical issues; we refer to books on neural networks such as Bishop [Bis95], Goodfellow, Bengio, and Courville [GBC16], and Haykin [Hay08] for related accounts. Still, incremental methods can be quite slow, and training may be a time-consuming process. Fortunately, the training is ordinarily done off-line, in which case computation time may not be a serious issue. Moreover, in practice the neural network training problem typically need not be solved with great accuracy. ? The incremental methods are valid for an arbitrary order of component selection within the cycle, but it is common to randomize the order at the beginning of each cycle. Also, in a variation of the basic method, we may operate on a batch of several components at each iteration, called a minibatch, rather than a single component. This has an averaging effect, which reduces the tendency of the method to oscillate and allows for the use of a larger stepsize; see the end-of-chapter references. Sec. 3.2 Neural Networks 155 (b) Another important question is how well we can approximate the optimal cost-to-go functions J* k with a neural network architecture, assuming we can choose the number of the nonlinear units m to be as large as we want. The answer to this question is quite favorable and is provided by the so-called universal approximation theorem. Roughly, the theorem says that assuming that x is an element of a Euclidean space X and y(x) √ x, a neural network of the form described can approximate arbitrarily closely (in an appropriate mathematical sense), over a closed and bounded subset S ? X, any piecewise continuous function J : S 7↙ ?, provided the number m of nonlinear units is sufficiently large. 
For proofs of the theorem at different levels of generality, we refer to Cybenko [Cyb89], Funahashi [Fun89], Hornik, Stinchcombe, and White [HSW89], and Leshno et al. [LLP93]. For intuitive explanations we refer to Bishop ([Bis95], pp. 129-130) and Jones [Jon90]. (c) While the universal approximation theorem provides some assurance about the adequacy of the neural network structure, it does not predict how many nonlinear units we may need for ※good§ performance in a given problem. Unfortunately, this is a difficult question to even pose precisely, let alone to answer adequately. In practice, one is reduced to trying increasingly larger values of m until one is convinced that satisfactory performance has been obtained for the task at hand. Experience has shown that in many cases the number of required nonlinear units and corresponding dimension of A can be very large, adding significantly to the difficulty of solving the training problem. This has given rise to many suggestions for modifications of the neural network structure. One possibility is to concatenate multiple single layer perceptrons so that the output of the nonlinear layer of one perceptron becomes the input to the linear layer of the next, giving rise to deep neural networks, which we will discuss in Section 3.2.2. What are the Features that can be Produced by Neural Networks? The short answer is that just about any feature that can be of practical interest can be produced or be closely approximated by a neural network. What is needed is a single layer that consists of a sufficiently large number of nonlinear units, preceded and followed by a linear layer. This is a consequence of the universal approximation theorem. In particular it is not necessary to have more than one nonlinear layer (although it is possible that fewer nonlinear units may be needed with a neural network that involves multiple nonlinear layers). As an illustration, we will consider features of a scalar state variable x, and a neural network that uses the rectifier function 考(缶) = max{0, 缶} 156 Parametric Approximation Chap. 3 x } Linear Unit Rectifier x !(x ? ") max Linear Unit Rectifier ) max{0, "} Linear Unit Rectifier !!,"(x) Slope ! " x ) Slope ! " Cost 0 Cost Figure 3.2.6 The feature 汕,污(x) = max0, (x ? ) , produced by a rectifier preceded by the linear function L(x) = (x ? ). as the basic nonlinear unit. Thus a single rectifier preceded by a linear function L(x) = 污(x ? 汕), where 汕 and 污 are scalars, produces the feature 耳汕,污(x) = max 0, 污(x ? 汕) , (3.22) illustrated in Fig. 3.2.6. We can now construct more complex features by adding or subtracting single rectifier features of the form (3.22). In particular, subtracting two rectifier functions with the same slope but two different horizontal shifts 汕1 and 汕2, we obtain the feature 耳汕1,汕2,污(x) = 耳汕1,污(x) ? 耳汕2,污(x) shown in Fig. 3.2.7(a). By subtracting again two features of the preceding form, we obtain the ※pulse§ feature 耳汕1,汕2,汕3,汕4,污(x) = 耳汕1,汕2,污(x) ? 耳汕3,汕4,污(x), shown in Fig. 3.2.7(b). The ※pulse§ feature can be used in turn as a fundamental block to approximate any desired feature by a linear combination of ※pulses.§ This explains why neural networks are capable of producing features of any shape by using linear layers to precede and follow nonlinear layers, at least in the case of a scalar state x. 
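The construction just described is easy to reproduce in code. The sketch below, for a scalar state and the unsmoothed rectifier max{0, ξ}, builds the single-rectifier feature (3.22), the difference features of Fig. 3.2.7, and a linear combination of unit-height "pulses" that approximates a given target function on an interval; the function names and the choice of slope are ours.

```python
import numpy as np

def rectifier_feature(x, beta, gamma):
    # phi_{beta,gamma}(x) = max{0, gamma * (x - beta)}, cf. Eq. (3.22)
    return np.maximum(0.0, gamma * (x - beta))

def ramp_feature(x, beta1, beta2, gamma):
    # difference of two rectifiers with the same slope [Fig. 3.2.7(a)]:
    # rises from 0 at beta1 to gamma * (beta2 - beta1) at beta2, then stays constant
    return rectifier_feature(x, beta1, gamma) - rectifier_feature(x, beta2, gamma)

def pulse_feature(x, beta1, beta2, beta3, beta4, gamma):
    # difference of two ramps [Fig. 3.2.7(b)]: a "pulse" supported roughly on [beta1, beta4]
    return ramp_feature(x, beta1, beta2, gamma) - ramp_feature(x, beta3, beta4, gamma)

def pulse_approximation(target, breakpoints, gamma=100.0):
    """Approximate a scalar function by a linear combination of unit-height pulses,
    one per interval between consecutive breakpoints."""
    x = np.linspace(breakpoints[0], breakpoints[-1], 500)
    approx = np.zeros_like(x)
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        # a steep pulse that is (approximately) 1 on [lo, hi] and 0 elsewhere
        pulse = pulse_feature(x, lo, lo + 1.0 / gamma, hi, hi + 1.0 / gamma, gamma)
        approx += target(0.5 * (lo + hi)) * pulse
    return x, approx

# example: piecewise constant approximation of the sine function on [0, pi]
grid, approx = pulse_approximation(np.sin, np.linspace(0.0, np.pi, 21))
```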
The mechanism for feature formation just described can be extended to the case of a multidimensional x, and lies at the heart of the universal approximation theorem and its proof (see Cybenko [Cyb89]). Sec. 3.2 Neural Networks 157 x x } Linear Unit RectifieLinear Unit Recrtifier !) max{0, "} x ) Slope ! " Cost 0 Cost } LineaLrinUenaritURneicttiRfieecrtifier !) max{0, "} x !(x ? "1) ! " ) !(x ? "2) + } LineaLrinUenaritURneicttiRfieecrtifier !) max{0, "} } LineaLrinUenaritURneicttiRfieecrtifier !) max{0, "} x !(x ? "1) ! " ) !(x ? "2) + } LineaLrinUenaritURneicttiRfieecrtifier !) max{0, "} } LineaLrinUenaritURneicttiRfieecrtifier !) max{0, "} ) + ) + ) + ? ? ? !!1,!2,"(x) = ) !11 !2 x !(x ? "3) ? ) !(x ? "4) + x ) Slope ! " Cost 0 C)o!s11t !2 ) !3 3 !4 4 !!1,!2,!3,!4,"(x) 4 (a) (b) (a) (b) Figure 3.2.7 (a) Illustration of how the feature 汕1,汕2,污(x) = 汕1,污(x) ? 汕2,污(x) is formed by a neural network of a two-rectifier nonlinear layer, preceded by a linear layer. (b) Illustration of how the ※pulse§ feature 汕1,汕2,汕3,汕4,污(x) = 汕1,汕2,污(x) ? 汕3,汕4,污(x) is formed by a neural network of a four-rectifier nonlinear layer, preceded by a linear layer. 3.2.2 Multilayer and Deep Neural Networks An important generalization of the single layer perceptron architecture involves a concatenation of multiple layers of linear and nonlinear functions; see Fig. 3.2.8. In particular the outputs of each nonlinear layer become the inputs of the next linear layer. In some cases it may make sense to add as additional inputs some of the components of the state x or the state encoding y(x). 158 Parametric Approximation Chap. 3 . . . . . . . . . . . . . . . . . . 1 0 -1 Encoding State r x ) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting ) Sigmoidal Layer Linear Weighting Cost Approximation r ∩ !(x, v) Nonlinear Ay Nonlinear Ay Figure 3.2.8 A deep neural network, with multiple layers. Each nonlinear layer constructs the inputs of the next linear layer. There are a few questions to consider here. The first has to do with the reason for having multiple nonlinear layers, when a single one is sufficient to guarantee the universal approximation property. Here are some qualitative (and somewhat speculative) explanations: (a) If we view the outputs of each nonlinear layer as features, we see that the multilayer network produces a hierarchy of features, where each set of features is a function of the preceding set of features [except for the first set of features, which is a function of the encoding y(x) of the state x]. In the context of specific applications, this hierarchical structure can be exploited to specialize the role of some of the layers and to enhance some characteristics of the state. (b) Given the presence of multiple linear layers, one may consider the possibility of using matrices A with a particular sparsity pattern, or other structure that embodies special linear operations such as convolution. When such structures are used, the training problem often becomes easier, because the number of parameters in the linear layers is drastically decreased. (c) Overparametrization (more weights than data, as in a multilayer neural network) may help to mitigate the detrimental effects of overfitting, and the attendant need for regularization. 
The explanation for this fascinating phenomenon (observed as early as the late 90s) is the subject of much current research; see [ZBH16], [BMM18], [BRT18], [SJL18], [BLL19], [HMR19], [MVS19] for some representative works. Note that while in the early days of neural networks practitioners tended to use few nonlinear layers (say one to three), more recently a lot of success in certain problem domains (including image and speech processing, as well as approximate DP) has been achieved with so called deep neural networks, which involve a considerably larger number of layers. In particular, the use of deep neural networks has been an important factor for the success of the AlphaGo and AlphaZero programs that play Go and chess, respectively; see [SHM16], [SHS17]. By contrast, Tesauro*s backgammon program and its descendants (cf. Section 2.4.2) have performed well with one or two nonlinear layers [PaR12]. Sec. 3.2 Neural Networks 159 Training and Backpropagation Let us now consider the training problem for multilayer networks. It has the form min v,r q Xs=1 m X?=1 r?耳?(xs, v) ? 汕s!2 , where v represents the collection of all the parameters of the linear layers, and 耳?(x, v) is the ?th feature produced at the output of the final nonlinear layer. Various types of incremental gradient methods, which modify the weight vector in the direction opposite to the gradient of a single sample term m X?=1 r?耳?(xs, v) ? 汕s!2 , can also be applied here. They are the methods most commonly used in practice. An important fact is that the gradient with respect to v of each feature 耳?(x, v) can be efficiently calculated, as we will describe shortly. Multilayer perceptrons can be represented compactly by introducing certain mappings to describe the linear and the nonlinear layers. In particular, let L1, . . . ,Lm+1 denote the matrices representing the linear layers; that is, at the output of the 1st linear layer we obtain the vector L1x and at the output of the kth linear layer (k > 1) we obtain Lk缶, where 缶 is the output of the preceding nonlinear layer. Similarly, let 1, . . . ,m denote the mappings representing the nonlinear layers; that is, when the input of the kth nonlinear layer (k > 1) is the vector y with components y(j), we obtain at the output the vector ky with components 考??y(j). The output of the multilayer perceptron is F(L1, . . . ,Lm+1, x) = Lm+1mLm ﹞ ﹞ ﹞1L1x. The special nature of this formula has an important computational consequence: the gradient (with respect to the weights) of the squared error between the output and a desired output y, E(L1, . . . ,Lm+1) = 1 2 ??y ? F(L1, . . . ,Lm+1, x)2 , can be efficiently calculated using a special procedure known as backprop- agation, which is just an intelligent way of using the chain rule.? In particular, the partial derivative of the cost function E(L1, . . . ,Lm+1) with ? The name backpropagation is used in several different ways in the neural networks literature. For example feedforward neural networks of the type shown in Fig. 3.2.8 are sometimes referred to as backpropagation networks. The somewhat abstract derivation of the backpropagation formulas given here comes from Section 3.1.1 of the book [BeT96]. 160 Parametric Approximation Chap. 3 respect to Lk(i, j), the ijth component of the matrix Lk, is given by ?E(L1, . . . ,Lm+1) ?Lk(i, j) = ?e∩Lm+1mLm ﹞ ﹞ ﹞Lk+1kIijk?1Lk?1 ﹞ ﹞ ﹞1L1x, (3.23) where e is the error vector e = y ? F(L1, . . . ,Lm+1, x), n, n = 1, . . . 
,m, is the diagonal matrix with diagonal terms equal to the derivatives of the nonlinear functions 考 of the nth hidden layer evaluated at the appropriate points, and Iij is the matrix obtained from Lk by setting all of its components to 0 except for the ijth component which is set to 1. This formula can be used to obtain efficiently all of the terms needed in the partial derivatives (3.23) of E using a two-step calculation: (a) Use a forward pass through the network to calculate sequentially the outputs of the linear layers L1x,L21L1x, . . . ,Lm+1mLm ﹞ ﹞ ﹞1L1x. This is needed in order to obtain the points at which the derivatives in the matrices n are evaluated, and also in order to obtain the error vector e = y ? F(L1, . . . ,Lm+1, x). (b) Use a backward pass through the network to calculate sequentially the terms e∩Lm+1mLm ﹞ ﹞ ﹞Lk+1k in the derivative formulas (3.23), starting with e∩Lm+1m, proceeding to e∩Lm+1mLmm?1, and continuing to e∩Lm+1m ﹞ ﹞ ﹞L21. As a final remark, we mention that the ability to simultaneously extract features and optimize their linear combination is not unique to neural networks. Other approaches that use a multilayer architecture have been proposed (see the survey by Schmidhuber [Sch15]), and they admit similar training procedures based on appropriately modified forms of backpropagation. An example of an alternative multilayer architecture approach is the Group Method for Data Handling (GMDH), which is principally based on the use of polynomial (rather than sigmoidal) nonlinearities. The GMDH approach was investigated extensively in the Soviet Union starting with the work of Ivakhnenko in the late 60s; see e.g., [Iva68]. It has been used in a large variety of applications, and its similarities with the neural network methodology have been noted (see the survey by Ivakhnenko [Iva71], and the large literature summary at the web site http://www.gmdh.net). Most of the GMDH research relates to inference-type problems, and apparently there have not been any applications to approximate DP to date. However, this may be a fruitful area of investigation, since in some applications it may turn out that polynomial nonlinearities are more suitable than sigmoidal or rectified linear unit nonlinearities. Sec. 3.3 Sequential Dynamic Programming Approximation 161 3.3 SEQUENTIAL DYNAMIC PROGRAMMING APPROXIMATION Let us describe a popular approach for training an approximation architecture ? Jk(xk, rk) for a finite horizon DP problem. The parameter vectors rk are determined sequentially, with an algorithm known as fitted value iteration, starting from the end of the horizon, and proceeding backwards as in the DP algorithm: first rN?1 then rN?2, and so on. The algorithm samples the state space for each stage k, and generates a large number of states xs k, s = 1, . . . , q. It then determines sequentially the parameter vectors rk to obtain a good ※least squares fit§ to the DP algorithm. In particular, each rk is determined by generating a large number of sample states and solving a least squares problem that aims to minimize the error in satisfying the DP equation for these states at time k. At the typical stage k, having obtained rk+1, we determine rk from the least squares problem rk ﹋ argmin r q Xs=1 ? Jk(xs k, r) ? min u﹋Uk(xs k ) Eng(xs k, u,wk) + ? Jk+1??fk(xs k, u,wk), rk+1o!2 where xs k, i = 1, . . . , q, are the sample states that have been generated for the kth stage. 
Since rk+1 is assumed to be already known, the complicated minimization term in the right side of this equation is the known scalar 汕s k = min u﹋Uk(xs k ) Eng(xs k, u,wk) + ? Jk+1??fk(xs k, u,wk), rk+1o, (3.24) so that rk is obtained as rk ﹋ argmin r q Xs=1 ??? Jk(xs k, r) ? 汕s k2 . (3.25) The algorithm starts at stage N ? 1 with the minimization rN?1 ﹋ argmin r q Xs=1 ? JN?1(xs N?1, r) ? min u﹋UN?1(xs N?1 ) EngN?1(xs N?1, u,wN?1) + gN??fN?1(xs N?1, u,wN?1)o!2 and ends with the calculation of r0 at k = 0. 162 Parametric Approximation Chap. 3 In the case of a linear architecture, where the approximate cost-to-go functions are ? Jk(xk, rk) = r∩ k耳k(xk), k = 0, . . . ,N ? 1, the least squares problem (3.25) greatly simplifies, and admits the closed form solution rk = q Xs=1 耳k(xs k)耳k(xs k)∩!?1 q Xs=1 汕s k耳k(xs k); cf. Eq. (3.4). For a nonlinear architecture such as a neural network, incremental gradient algorithms may be used. An important implementation issue is how to select the sample states xs k, s = 1, . . . , q, k = 0, . . . ,N ? 1. In practice, they are typically obtained by some form of Monte Carlo simulation, but the distribution by which they are generated is important for the success of the method. In particular, it is important that the sample states are ※representative§ in the sense that they are visited often under a nearly optimal policy. More precisely, the frequencies with which various states appear in the sample should be roughly proportional to the probabilities of their occurrence under an optimal policy. This point will be discussed later, in the context of infinite horizon problems, and the notion of ※representative§ state will be better quantified in probabilistic terms. Aside from the issue of selection of the sampling distribution that we have just described, a difficulty with fitted value iteration arises when the horizon is very long, since then the number of parameters may become excessive. In this case, however, the problem is often stationary, in the sense that the system and cost per stage do not change as time progresses. Then it may be possible to treat the problem as one with an infinite horizon and bring to bear additional methods for training approximation architectures; see the relevant discussions in Chapters 5 and 6. 3.4 Q-FACTOR PARAMETRIC APPROXIMATION We will now consider an alternative form of approximation in value space and fitted value iteration. It involves approximation of the optimal Qfactors of state-control pairs (xk, uk) at time k, with no intermediate approximation of cost-to-go functions. The optimal Q-factors are defined by Q* k(xk, uk) = Engk(xk, uk,wk)+J* k+1??fk(xk, uk,wk)o, k = 0, . . . ,N?1, (3.26) where J* k+1 is the optimal cost-to-go function for stage k+1. Thus Q* k(xk, uk) is the cost attained by using uk at state xk, and subsequently using an optimal policy. Sec. 3.4 Q-Factor Parametric Approximation 163 As noted in Section 1.2, the DP algorithm can be written as J* k (xk) = min u﹋Uk(xk) Q* k(xk, uk), and by using this equation, we can write Eq. (3.26) in the following equivalent form that relates Q* k with Q* k+1: Q* k(xk, uk) = Engk(xk, uk,wk) + min u﹋Uk+1(fk(xk,uk,wk)) Q* k+1??fk(xk, uk,wk), uo. (3.27) This suggests that in place of the Q-factors Q* k(xk, uk), we may use Q-factor approximations and Eq. (3.27) as the basis for suboptimal control. 
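For a linear architecture, each backward step of the fitted value iteration just described is the least squares solve (3.25), with the closed-form solution given above. The sketch below outlines this backward pass; the computation of the targets β_k^s of Eq. (3.24) requires the problem data (or a simulator) and is left to a user-supplied function, and all names are illustrative. The same pattern, with state-control features, reappears in the Q-factor fits (3.30)-(3.31) below.

```python
import numpy as np

def fit_linear_weights(Phi, beta):
    # least squares fit min_r sum_s (r' phi_s - beta_s)^2; lstsq computes the same
    # solution as the closed-form normal equations [cf. Eq. (3.4)], more stably
    r, *_ = np.linalg.lstsq(Phi, beta, rcond=None)
    return r

def fitted_value_iteration(sample_states, features, bellman_target, N):
    """Backward pass of Eq. (3.25) for a linear architecture J_k(x, r) = r' phi_k(x).

    sample_states[k] : list of sample states x_k^s for stage k
    features[k]      : callable returning the feature vector phi_k(x)
    bellman_target   : callable (k, x, r_next) returning the scalar beta_k^s of
                       Eq. (3.24); at k = N-1 it should use the terminal cost g_N
                       in place of the approximation defined by r_next
    """
    r = [None] * N
    r_next = None
    for k in reversed(range(N)):
        Phi = np.array([features[k](x) for x in sample_states[k]])
        beta = np.array([bellman_target(k, x, r_next) for x in sample_states[k]])
        r[k] = fit_linear_weights(Phi, beta)
        r_next = r[k]
    return r
```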
We can obtain such approximations by using methods that are similar to the ones we have considered so far (parametric approximation, enforced decomposition, certainty equivalent control, etc). Parametric Q-factor approximations ?Q k(xk, uk, rk) may involve a neural network, or a featurebased linear architecture. The feature vector may depend on just the state, or on both the state and the control. In the former case, the architecture has the form ?Q k(xk, uk, rk) = rk(uk)∩耳k(xk), (3.28) where rk(uk) is a separate weight vector for each control uk. In the latter case, the architecture has the form ?Q k(xk, uk, rk) = r∩ k耳k(xk, uk), (3.29) where rk is a weight vector that is independent of uk. The architecture (3.28) is suitable for problems with a relatively small number of control options at each stage. In what follows, we will focus on the architecture (3.29), but the discussion with few modifications, also applies to the architecture (3.28) and to nonlinear architectures as well. We may adapt the fitted value iteration approach of the preceding section to compute sequentially the parameter vectors rk in Q-factor parametric approximations, starting from k = N ? 1. This algorithm is based on Eq. (3.27), with rk obtained by solving least squares problems similar to the ones of the cost function approximation case [cf. Eq. (3.25)]. As an example, the parameters rk of the architecture (3.29) are computed sequentially by collecting sample state-control pairs (xs k, us k), s = 1, . . . , q, and solving the linear least squares problems rk ﹋ argmin r q Xs=1 ??r∩耳k(xs k, us k) ? 汕s k2 , (3.30) 164 Parametric Approximation Chap. 3 where 汕s k = Engk(xs k, us k,wk)+ min u﹋Uk+1(fk(xs k ,us k ,wk)) r∩ k+1耳k+1??fk(xs k, us k,wk), uo. (3.31) Thus, having obtained rk+1, we obtain rk through a least squares fit that aims to minimize the sum of the squared errors in satisfying Eq. (3.27). Note that the solution of the least squares problem (3.30) can be obtained in closed form as rk = q Xs=1 耳k(xs k, us k)耳k(xs k, us k)∩!?1 q Xs=1 汕s k耳k(xs k, us k); [cf. Eq. (3.4)]. Once rk has been computed, the one-step lookahead control ?米k(xk) is obtained on-line as ?米k(xk) ﹋ arg min u﹋Uk(xk) ?Q k(xk, u, rk), (3.32) without the need to calculate any expected value. This latter property is a primary incentive for using Q-factors in approximate DP, particularly when there are tight constraints on the amount of on-line computation that is possible in the given practical setting. The samples 汕s k of Eq. (3.31) involve the computation of an expected value. In an alternative implementation, we may replace 汕s k with an average of just a few samples (even a single sample) of the random variable gk(xs k, us k,wk) + min u﹋Uk+1(fk(xs k ,us k ,wk)) r∩ k+1耳k+1??fk(xs k, us k,wk), u, (3.33) collected according to the probability distribution of wk. This distribution may either be known explicitly, or in a model-free situation, through a computer simulator; cf. the discussion of Section 2.1.4. In particular, to implement this scheme, we only need a simulator that for any pair (xk, uk) generates a sample of the stage cost gk(xk, uk,wk) and the next state fk(xk, uk,wk) according to the distribution of wk. Note that the samples of the random variable (3.33) do not require the computation of an expected value like the samples (3.24) in the cost approximation method of the preceding chapter. Moreover the samples of (3.33) involve a simpler minimization than the samples (3.24). 
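The two ingredients just described are simple to sketch in code, under the assumption that a simulator returning a sample stage cost and next state is available; the interfaces and names below are illustrative. The first function forms a single-sample target of the form (3.33), and the second implements the on-line one-step lookahead rule (3.32).

```python
import numpy as np

def single_sample_q_target(x, u, r_next, simulator, features_next, controls_next):
    # one sample of the random variable (3.33): simulate (g, x_next) for the pair
    # (x, u), then add the smallest approximate Q-factor at the next state
    g, x_next = simulator(x, u)
    q_next = [r_next @ features_next(x_next, v) for v in controls_next(x_next)]
    return g + min(q_next)

def one_step_lookahead_control(x, r, features, controls):
    # Eq. (3.32): minimize the trained Q-factor approximation r' phi_k(x, u)
    # over the admissible controls at x; no expected value is computed on-line
    return min(controls(x), key=lambda u: r @ features(x, u))
```

Each target uses a single simulated transition and a minimization over the controls admissible at the next state, with no expected value computation.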
This is an important advantage of working with Q-factors rather than state costs. Having obtained the weight vectors r0, . . . , rN?1, and hence the onestep lookahead policy ?羽 = {?米0, . . . , ?米N?1} through Eq. (3.32), a further possibility is to approximate this policy with a parametric architecture. This is approximation in policy space built on top of approximation in value space; see the discussion of Section 2.1.5. The idea here is to simplify even further the on-line computation of the suboptimal controls by avoiding the minimization of Eq. (3.32). Sec. 3.5 Parametric Approximation in Policy Space by Classification 165 Advantage Updating Let us finally note an alternative to computing Q-factor approximations. It is motivated by the potential benefit of approximating Q-factor differences rather than Q-factors. In this method, called advantage updating, instead of computing and comparing Q* k(xk, uk) for all uk ﹋ Uk(xk), we compute Ak(xk, uk) = Q* k(xk, uk) ? min uk﹋Uk(xk) Q* k(xk, uk). The function Ak(xk, uk) can serve just as well as Q* k(xk, uk) for the purpose of comparing controls, but may have a much smaller range of values than Q* k(xk, uk). In the absence of approximations, selecting controls by advantage updating is clearly equivalent to selecting controls by comparing their Qfactors. However, when approximation is involved, using Ak can be important, because Q* k may embody sizable quantities that are independent of u, and which may interfere with algorithms such as the fitted value iteration (3.30)-(3.31). In particular, when training an architecture to approximate Q* k, the training algorithm may naturally try to capture the large scale behavior of Q* k, which may be irrelevant because it may not be reflected in the Q-factor differences Ak. However, with advantage updating, we may instead focus the training process on finer scale variations of Q* k, which may be all that matters. This question is discussed further and is illustrated with an example in the neuro-dynamic programming book [BeT96], Section 6.6.2. The technique of subtracting a suitable constant (often called a base- line) from a quantity that is estimated by simulation is a useful idea; see Fig. 3.4.1 (in the case of advantage updating, the constants depend on xk, but the same general idea applies). It can also be used in the context of the sequential DP approximation method of Section 3.3, as well as in conjunction with other simulation-based methods in RL. 3.5 PARAMETRIC APPROXIMATION IN POLICY SPACE BY CLASSIFICATION We have focused so far on approximation in value space using parametric architectures. In this section we will discuss briefly how the cost function approximation methods of this chapter can be suitably adapted for the purpose of approximation in policy space, whereby we select the policy by using optimization over a parametric family of some form. In particular, suppose that for a given stage k, we have access to dataset of ※good§ sample state-control pairs (xs k, us k), s = 1, . . . , q, obtained through some unspecified process, such as rollout or problem approximation. We may then wish to ※learn§ this process by training the parameter 166 Parametric Approximation Chap. 3 1 . . . . . . b = 0 = 0 b ! b ! = Optimized b u u ) u u 1 = 0= 0 u 2 u q q u q?1 h(u) (a) (b) Category ? Figure 3.4.1. Illustration of the idea of subtracting a baseline constant from a cost or Q-factor approximation. Here we have samples h(u1), . . . , h(uq) of a scalar function h(u) at sample points u1, . . . 
, uq, and we want to approximate h(u) with a linear function ?h(u, r) = ru, where r is a scalar tunable weight. We subtract a baseline constant b from the samples, and we solve the problem ‘r 2 argmin r q Xs=1 ??h(us) ? b? rus2 . By properly adjusting b, we can improve the quality of the approximation, which after subtracting b from all the sample values, takes the form ?h(u, b, r) = b + ru. Conceptually, b serves as an additional weight (multiplying the basis function 1), which enriches the approximation architecture. vector rk of a parametric family of policies ?米k(xk, rk), using least squares minimization/regression: rk ﹋ argmin rk q Xs=1 us k ? ?米k(xs k, rk) 2 ; (3.34) cf. our discussion of approximation in policy space in Section 2.1.5. It is useful to make the connection of this regression approach with classification, an important problem in machine learning. This is the problem of constructing an algorithm, called a classifier , which assigns a given ※object§ to one of a finite number of ※categories§ based on its ※characteristics.§ Here we use the term ※object§ generically. In some cases, the classification may relate to persons or situations. In other cases, an object may represent a hypothesis, and the problem is to decide which of the hypotheses is true, based on some data. In the context of approximation in policy space, objects correspond to states, and categories correspond to Sec. 3.5 Parametric Approximation in Policy Space by Classification 167 controls to be applied at the different states. Thus in this case, we view each sample ??xs k, us k as an object-category pair. Generally, in (multiclass) classification we assume that we have a population of objects, each belonging to one of m categories c = 1, . . . ,m. We want to be able to assign a category to any object that is presented to us. Mathematically, we represent an object with a vector x (e.g., some raw description or a vector of features of the object), and we aim to construct a rule that assigns to every possible object x a unique category c. To illustrate a popular classification method, let us assume that if we draw an object x at random from this population, the conditional probability of the object being of category c is p(c | x). If we know the probabilities p(c | x), we can use a classical statistical approach, whereby we assign x to the category c?(x) that has maximal posterior probability, i.e., c?(x) ﹋ arg max c=1,...,m p(c | x). (3.35) This is called the Maximum a Posteriori rule (or MAP rule for short; see for example the book [BeT08], Section 8.2, for a discussion). When the probabilities p(c | x) are unknown, we may try to estimate them using a least squares optimization, based on the following property. Proposition 3.5.1: (Least Squares Property of Conditional Probabilities) Let 缶(x) be any prior distribution of x, so that the joint distribution of (c, x) is 汎(c, x) =Xx 缶(x) m Xc=1 p(c | x). Let z(c, x) be the function of (c, x) defined by z(c, x) = n1 if x is of category c, 0 otherwise. For any function h(c, x) of (c, x), consider En??z(c, x) ? h(c, x)2o, the expected value with respect to the distribution 汎(c, x) of the random variable ??z(c, x)?h(c, x)2 . Then p(c | x) minimizes this expected value over all functions h(c, x), i.e., for all functions h, we have En??z(c, x) ? p(c | x)2o ≒ En??z(c, x) ? h(c, x)2o. (3.36) 168 Parametric Approximation Chap. 3 The proof of the proposition may be found in textbooks that deal with Bayesian least squares estimation (see, e.g., [BeT08], Section 8.3).? 
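The proposition is also easy to check numerically. In the tiny simulation below, for a two-category example with made-up probabilities, the least squares minimizer of the empirical version of the expected value (3.36) over all functions h of x is the per-object sample average of z, and it recovers p(c = 1 | x) regardless of the prior placed on x.

```python
import numpy as np

rng = np.random.default_rng(0)

# three object values and two categories, with assumed conditional probabilities
xs = np.array([0, 1, 2])
p_c1_given_x = np.array([0.2, 0.5, 0.9])     # p(c = 1 | x); p(c = 0 | x) is the complement
prior_x = np.array([0.5, 0.3, 0.2])          # any prior on x will do

# draw (x, c) pairs and form the indicator z of category 1
N = 200_000
x_samples = rng.choice(xs, size=N, p=prior_x)
z = (rng.random(N) < p_c1_given_x[x_samples]).astype(float)

# unconstrained least squares fit of z as a function of x: the per-x sample mean
h_hat = np.array([z[x_samples == x].mean() for x in xs])
print(h_hat)    # approaches [0.2, 0.5, 0.9] = p(c = 1 | x), independently of prior_x
```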
The proposition states that p(c | x) is the function of (c, x) that minimizes En??z(c, x) ? h(c, x)2o (3.37) over all functions h of (c, x), independently of the prior distribution of x. This suggests that we can obtain approximations to the probabilities p(c | x), c = 1, . . . ,m, by minimizing an empirical/simulation based approximation of the expected value (3.37). More specifically, let us assume that we have a training set consisting of q object-category pairs (xs, cs), s = 1, . . . , q, and corresponding vectors zs(c) = 1 if cs = c, 0 otherwise, c = 1, . . . ,m, and adopt a parametric approach. In particular, for each category c = 1, . . . ,m, we approximate the probability p(c | x) with a function ? h(c, x, r) ? For a quick argument, consider for any pair (c, x) the conditional expected value En??z(c, x) ? y2 c, xo, where y is any scalar. Given (c, x), the random variable z(c, x) takes the value z(c, x) = 1 with probability p(c | x) and the value z(c, x) = 0 with probability 1 ? p(c | x), so we have En??z(c, x) ? y2 c, xo= p(c | x)(y ? 1)2 + ??1 ? p(c | x)y2. We minimize this expression with respect to y, by setting to 0 its derivative, i.e., 0 = 2p(c | x)(y ? 1) + 2??1 ? p(c | x)y = 2??? p(c | x) + y. We thus obtain the minimizing value of y, namely p(c | x), so that En??z(c, x) ? p(c | x)2 | c, xo≒ En??z(c, x) ? y2 | c, xo, for all scalars y. For any function h of (c, x) we set y = h(c, x) in the above expression and obtain En??z(c, x) ? p(c | x)2 | c, xo≒ En??z(c, x) ? h(c, x)2 | c, xo. Since this is true for all (c, x), we also have X(c,x) 汎(c, x)En??z(c, x)?p(c | x)2 c, xo ≒X(c,x) 汎(c, x)En??z(c, x)?h(c, x)2 c, xo, for all functions h, i.e., Eq. (3.36) holds. Sec. 3.5 Parametric Approximation in Policy Space by Classification 169 ... ... ! Object x ) p(c | x) (a) (b) Category c !(x) System PID Controller ... ... ! Object x (a) (b) Category ?c(x, Next Partial Tours, MAP Classifier Data-Trained Max MAX max r‘) Next Partial Tours, MAP Classifier Data-Trained Max MAX max Next Partial Tours, MAP Classifier Data-Trained Classifier ... (e.g., a NN) ) Randomized Policy Idealized ) ? h(c, x, r) Maxc Maxc Figure 3.5.1. Illustration of the MAP classifier c?(x) for the case where the probabilities p(c | x) are known [cf. Eq. (3.35)], and its data-trained version ?c(x, ‘r) [cf. Eq. (3.39)]. The classifier may be obtained by using the data set (xs k, us k), s = 1, . . . , q, and an approximation architecture such as a feature-based architecture or a neural network. that is parametrized by a vector r, and optimize over r the empirical approximation to the expected squared error of Eq. (3.37). In particular, we obtain r by the least squares regression: ‘r ﹋ argmin r q Xs=1 m Xc=1??zs(c) ??h(c, xs, r)2 . (3.38) The functions ?h(c, x, r) may be provided for example by a feature-based architecture or a neural network. Note that each training pair (xs, cs) is used to generate m examples for use in the regression problem (3.38): m ? 1 ※negative§ examples of the form (xs, 0), corresponding to the m ? 1 categories c 6= cs, and one ※positive§ example of the form (xs, 1), corresponding to c = cs. Note also that the incremental training methods described in Sections 3.1.3 and 3.2.1 can be applied to the solution of this problem. The regression problem (3.38) approximates the minimization of the expected value (3.37), so we conclude that its solution ?h(c, x, ‘r), c = 1, . . . ,m, approximates the probabilities p(c | x). 
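As a concrete instance of the regression (3.38), the sketch below uses a linear feature-based architecture ĥ(c, x, r) = r_c′φ(x), with a separate weight vector r_c for each category, so that the fit decomposes into linear least squares problems; the feature map and all names are illustrative, and the labels are assumed to be integer category indices.

```python
import numpy as np

def train_classifier(features, objects, labels, num_categories):
    """Least squares fit (3.38) of one-hot targets z^s(c) with the linear
    architecture h(c, x, r) = r_c' phi(x), one weight vector per category.

    features : callable returning the feature vector phi(x)
    objects  : sample objects x^s;  labels : their categories c^s in {0, ..., m-1}
    """
    Phi = np.array([features(x) for x in objects])    # q x d feature matrix
    Z = np.eye(num_categories)[labels]                # q x m one-hot targets z^s(c)
    R, *_ = np.linalg.lstsq(Phi, Z, rcond=None)       # d x m weights, one column per category
    return R                                          # h(c, x, R) = features(x) @ R[:, c]
```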
Once this solution is obtained, we may use it to classify a new object x according to the rule Estimated Object Category = ?c(x, ‘r) ﹋ arg max c=1,...,m ?h (c, x, ‘r), (3.39) which approximates the MAP rule (3.35); cf. Fig. 3.5.1. Returning to approximation in policy space, for a given training set (xs k, us k), s = 1, . . . , q, the classifier just described provides (approximations 170 Parametric Approximation Chap. 3 ... ... Next Partial Tours, MAP Classifier Data-Trained Max MAX max Next Partial Tours, MAP Classifier Data-Trained Max MAX max State xk k Policy ?米k(xk, rk) ) Randomized Policy ) Randomized Policy ) ? h(u, xk, rk) Maxu Figure 3.5.2 Illustration of classification-based approximation in policy space. The classifier, defined by the parameter rk, is constructed by using the training set (xs k, us k), s = 1, . . . , q. It yields a randomized policy that consists of the probability ?h(u, xk, rk) of using control u 2 Uk(xk) at state xk. This policy is approximated by the deterministic policy ?米k(xk, rk) that uses at state xk the control that maximizes over u 2 Uk(xk) the probability ? h(u, xk, rk) [cf. Eq. (3.39)]. to) the ※probabilities§ of using the controls uk ﹋ Uk(xk) at the states xk, so it yields a ※randomized§ policy ?h(u, xk, rk) for stage k [once the values ?h (u, xk, rk) are normalized so that, for any given xk, they add to 1]; cf. Fig. 3.5.2. In practice, this policy is usually approximated by the deterministic policy ?米k(xk, rk) that uses at state xk the control of maximal probability at that state; cf. Eq. (3.39). For the simpler case of a classification problem with just two categories, say A and B, a similar formulation is to hypothesize a relation of the following form between object x and its category: Object Category = A if ? h(x, r) = 1, B if ? h(x, r) = ?1, where ?h is a given function and r is the unknown parameter vector. Given a set of q object-category pairs (x1, z1), . . . , (xq, zq) where zs = 1 if x is of category A, ?1 if x is of category B, we obtain r by the least squares regression: ‘r ﹋ argmin r q Xs=1??zs ??h (xs, r)2 . The optimal parameter vector ‘r is used to classify a new object with data vector x according to the rule Estimated Object Category = A if ?h(x, ‘r) > 0, B if ?h(x, ‘r) < 0. In the context of DP, this classifier may be used, among others, in stopping problems where there are just two controls available at each state: stopping Sec. 3.6 Notes and Sources 171 (i.e., moving to a termination state) and continuing (i.e., moving to some nontermination state). There are several variations of the preceding classification schemes, for which we refer to the specialized literature. Moreover, there are several commercially and publicly available software packages for solving the associated regression problems and their variants. They can be brought to bear on the problem of parametric approximation in policy space using any training set of state-control pairs, regardless of how it was obtained. 3.6 NOTES AND SOURCES Section 3.1: Our discussion of approximation architectures, neural networks, and training has been limited, and aimed just to provide the connection with approximate DP and a starting point for further exploration. The literature on the subject is vast, and the textbooks mentioned in the references to Chapter 1 provide detailed accounts and many sources, in addition to the ones given in Sections 3.1.3 and 3.2.1. 
There are two broad directions of inquiry in parametric architectures: (1) The design of architectures, either in a general or a problem-specific context. Research in this area has intensified following the increased popularity of deep neural networks. (2) The training of neural networks as well as linear architectures. Research on both of these issues has been extensive and is continuing. An important direction, not discussed here, is how to take advantage of distributed computation, particularly in conjunction with partitioned architectures (see [BeT96], Section 3.1.3, [BeY10], [BeY12], [Ber19c]). Methods for selection of basis functions have received much attention, particularly in the context of neural network research and deep reinforcement learning (see e.g., the book by Goodfellow, Bengio, and Courville [GBC16]). For discussions that are focused outside the neural network area, see Bertsekas and Tsitsiklis [BeT96], Keller, Mannor, and Precup [KMP06], Jung and Polani [JuP07], Bertsekas and Yu [BeY09], and Bhatnagar, Borkar, and Prashanth [BBP13]. Moreover, there has been considerable research on the optimal feature selection within given parametric classes (see Menache, Mannor, and Shimkin [MMS05], Yu and Bertsekas [YuB09b], Busoniu et al. [BBD10], and Di Castro and Mannor [DiM10]). Incremental algorithms are supported by substantial theoretical analysis, which addresses issues of convergence, rate of convergence, stepsize selection, and component order selection. Moreover, incremental algorithms have been extended to constrained optimization settings, where the constraints are also treated incrementally, first by Nedi∩c [Ned11], and then by several other authors: Bertsekas [Ber11c], Wang and Bertsekas [WaB14], [WabB16], Bianchi [Bia16], Iusem, Jofre, and Thompson [IJT18]. It is beyond our scope to cover this analysis. The author*s surveys [Ber10] and 172 Parametric Approximation Chap. 3 [Ber15b], and convex optimization and nonlinear programming textbooks [Ber15a], [Ber16a], collectively contain an extensive account of incremental methods, including the Kaczmarz, and incremental gradient, subgradient, aggregated gradient, Newton, Gauss-Newton, and extended Kalman filtering methods, and give many references. We also refer to the book [BeT96] and paper [BeT00] by Bertsekas and Tsitsiklis, and the survey by Bottou, Curtis, and Nocedal [BCN18] for a theoretically oriented treatments. Section 3.2: The publicly and commercially available neural network training programs incorporate heuristics for scaling and preprocessing data, stepsize selection, initialization, etc, which can be very effective in specialized problem domains. We refer to books on neural networks such as Bishop [Bis95], Goodfellow, Bengio, and Courville [GBC16], and Haykin [Hay08]. Deep neural networks have created a lot of excitement in the machine learning field, in view of some high profile successes in image and speech recognition, and in RL with the AlphaGo and AlphaZero programs. One question is whether and for what classes of target functions we can enhance approximation power by increasing the number of layers while keeping the number of weights constant. For analysis and speculation around this question, see Bengio [Ben09], Liang and Srikant [LiS16], Yarotsky [Yar17], Daubechies et al. [DDF19], and the references quoted there. Another research question relates to the role of overparametrization in the success of deep neural networks. 
With more weights than training data, the training problem has infinitely many solutions. The question then is how to select a solution that works well on test data (i.e., data outside the training set); see [ZBH16], [BMM18], [BRT18], [SJL18], [BLL19], [HMR19], [MVS19]. Section 3.3: Fitted value iteration has a long history; it has been mentioned by Bellman among others. It has interesting properties, and at times exhibits pathological/unstable behavior due to accumulation of errors over a long horizon. We will discuss this behavior in Section 5.2. Section 3.4: Advantage updating was proposed by Baird [Bai93], [Bai94], and is discussed in Section 6.6 of the book [BeT96]. Section 3.5: Classification (sometimes called ※pattern classification§ or ※pattern recognition§) is a major subject in machine learning, for which there are many approaches and an extensive literature; see e.g. the textbooks by Bishop [Bis95], [Bis06], and Duda, Hart, and Stork [DHS12]. Approximation in policy space was formulated as a classification problem in the context of DP by Lagoudakis and Parr [LaP03], and was followed up by several other authors; see our subsequent discussion of rollout-based policy iteration for infinite horizon problems (cf. Section 5.7.3). While we have focused on a classification approach that makes use of least squares regression and a parametric architecture, other classification methods may also be used. For example the paper [LaP03] discusses the use of nearest neighbor schemes, support vector machines, as well as neural networks.