5
Adaptive and Model Predictive Control
Contents

5.1. Systems with Unknown Parameters - Robust and PID Control
5.2. Approximation in Value Space, Rollout, and Adaptive Control
5.3. Approximation in Value Space, Rollout, and Model Predictive Control
5.4. Terminal Cost Approximation - Stability Issues
5.5. Notes and Sources
In this chapter, we discuss some of the core control system design methodologies
within the context of our approximation in value space framework.
In particular, in the next two sections, we will discuss problems with unknown
or changing problem parameters, and briefly review some of the
principal types of adaptive control methods. We will then focus on schemes
that are based on on-line replanning, including the use of rollout. The idea
here is to use an approximation in value space scheme/Newton step in place
of a full reoptimization of the controller, in response to the changed system
parameters; we have noted this possibility in Chapter 1. Subsequently, in
Sections 5.3 and 5.4, we will discuss the model predictive control methodology,
and its connections with approximation in value space, Newton’s
method, adaptive control, and the attendant stability issues.
5.1 SYSTEMS WITH UNKNOWN PARAMETERS - ROBUST
AND PID CONTROL
Our discussion so far dealt with problems with a known and unchanging
mathematical model, i.e., one where the system equation, cost function,
control constraints, and probability distributions of disturbances are perfectly
known. The mathematical model may be available through explicit
mathematical formulas and assumptions, or through a computer program
that can emulate all of the mathematical operations involved in the model,
including Monte Carlo simulation for the calculation of expected values.
From our point of view, it makes no difference whether the mathematical
model is available through closed form mathematical expressions or
through a computer simulator: the methods that we discuss are valid either
way, only their suitability for a given problem may be affected by the
availability of mathematical formulas.
In practice, however, it is common that the system involves parameters
that are either not known exactly or may change over time. In such
cases it is important to design controllers that take the parameter changes
into account. The methodology for doing so is generally known as adaptive
control, an intricate and multifaceted subject, with many and diverse
applications, and a long history.†
We should note also that unknown problem environments are an integral
part of the artificial intelligence view of RL. In particular, to quote
† The difficulties of designing adaptive controllers are often underestimated.
Among others, they complicate the balance between off-line training and on-line
play, which we discussed in Chapter 1 in connection to AlphaZero. It is worth
keeping in mind that as much as learning to play high quality chess is a great
challenge, the rules of play are stable and do not change unpredictably in the
middle of a game! Problems with changing system parameters can be far more
challenging!
from the book by Sutton and Barto [SuB18], “learning from interaction
with the environment is a foundational idea underlying nearly all theories
of learning and intelligence.” The idea of interaction with the environment
is typically connected with the idea of exploring the environment to identify
its characteristics. In control theory this is often viewed as part of
the system identification methodology, which aims to construct mathematical
models of dynamic systems by using data. The system identification
process is often combined with the control process to deal with unknown
or changing problem parameters. This is one of the most challenging areas
of stochastic optimal and suboptimal control, and has been studied
extensively since the early 1960s.
Robust and PID Control
Given a controller design that has been obtained assuming a nominal DP
problem model, one possibility is to simply ignore changes in problem parameters.
We may then try to design a controller that is adequate for the
entire range of the changing parameters. This is sometimes called a robust
controller. A robust controller makes no effort to keep track of changing
problem parameters. It is just designed so that it is resilient to parameter
changes, and in practice, it often tends to be biased towards addressing the
worst case.
An important and time-honored robust control approach for continuous-state
problems is the PID (Proportional-Integral-Derivative) controller;
see e.g., the books by Åström and Hägglund [AsH95], [AsH06]. In particular,
PID control aims to maintain the output of a single-input single-output
dynamic system around a set point or to follow a given trajectory, as the
system parameters change within a relatively broad range. In its simplest
form, the PID controller is parametrized by three scalar parameters,
which may be determined by a variety of methods, some of them manual/
heuristic. PID control is used widely and with success, although its
range of application is mainly restricted to relatively simple, single-input
and single-output continuous-state control systems.
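To make the three-parameter structure concrete, here is a minimal sketch (not taken from the text) of a discrete-time PID controller regulating a hypothetical first-order plant toward a set point; the gains, the plant coefficients, and the helper make_pid are illustrative assumptions (Python).

# Minimal discrete-time PID sketch (illustrative assumptions throughout).
def make_pid(kp, ki, kd, dt):
    """Return a stateful control law u = pid(error) with gains kp, ki, kd."""
    state = {"integral": 0.0, "prev_error": None}
    def pid(error):
        state["integral"] += error * dt
        derivative = 0.0 if state["prev_error"] is None else (error - state["prev_error"]) / dt
        state["prev_error"] = error
        return kp * error + ki * state["integral"] + kd * derivative
    return pid

# Regulate a first-order plant x_{k+1} = a*x_k + b*u_k toward the set point 0.
# The controller does not use a and b, which is the point: it tolerates a range
# of plant parameters without tracking them.
a, b, dt = 1.05, 0.5, 1.0
pid = make_pid(kp=1.2, ki=0.1, kd=0.05, dt=dt)
x = 5.0
for k in range(30):
    u = pid(0.0 - x)          # error = set point - measured output
    x = a * x + b * u
print(f"state after 30 steps: {x:.4f}")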
Combined System Identification and Control
In robust control schemes, such as PID control, no attempt is made to maintain
a mathematical model and to track unknown model parameters as they
change. Alternatively we may introduce into the controller a mechanism
for measuring or estimating the unknown or changing system parameters,
and make suitable control adaptations in response.†
† In the adaptive control literature, schemes that involve parameter estimation
are sometimes called indirect, while schemes that do not involve parameter
estimation (like PID control) are called direct. To quote from the book by Åström
and Wittenmark [AsW08], “indirect methods are those in which the estimated
parameters are used to calculate required controller parameters” (see Fig. 5.1.1).
The methods subsequently described in this section, and the rollout-based adaptive
control methods discussed in the next section, should be viewed as indirect.
Figure 5.1.1 Schematic illustration of concurrent parameter estimation and system
control. The system parameters are estimated on-line and the estimates are
passed on to the controller whenever this is desirable (e.g., after the estimates
change substantially). This structure is also known as indirect adaptive control.
Let us note here that updating problem parameters need not require
an elaborate algorithm. In many cases the set of problem parameters may
take a known finite set of values (for example each set of parameter values
may correspond to a distinct maneuver of a vehicle, motion of a robotic
arm, flying regime of an aircraft, etc). Once the control scheme detects
a change in problem parameters, it can incorporate the change into the
approximation in value space scheme, and in the case of policy rollout, it
may switch to a corresponding predesigned base policy.
In what follows in this chapter (including our discussion of MPC in
Section 5.3), we will assume that there is a mechanism to learn (perhaps
imperfectly and by some unspecified procedure) the model of the system
as it evolves over time. We will loosely refer to this learning process with
the classical name system identification, but we will not go into specific
identification methods, keeping in mind that such methods could be imprecise
and challenging, but could also be fast and simple, depending on
the problem at hand.
An apparently reasonable scheme is to separate the control process
into two phases, a system identification phase and a control phase. In
the first phase the unknown parameters are estimated, while the control
takes no account of the interim results of estimation. The final parameter
estimates from the first phase are then used to implement an optimal or
suboptimal policy in the second phase.
This alternation of estimation and control phases may be repeated
several times during the system’s operation in order to take into account
subsequent changes of the parameters. Note that it is not necessary to introduce
a hard separation between the identification and the control phases.
They may be going on simultaneously, with new parameter estimates being
generated in the background, and introduced into the control process,
whenever this is thought to be desirable; see Fig. 5.1.1.
One drawback of this approach is that it is not always easy to determine
when to terminate one phase and start the other. A second difficulty,
of a more fundamental nature, is that the control process may make some of
the unknown parameters invisible to the estimation process. This is known
as the problem of parameter identifiability, which is discussed in the context
of adaptive control in several sources. On-line parameter estimation algorithms,
which address among others the issue of identifiability, have been
discussed extensively in the control theory literature, but the corresponding
methodology is complex and beyond our scope in this book. However,
assuming that we can make the estimation phase work somehow, we are
free to reoptimize the controller using the newly estimated parameters, in
a form of on-line replanning process.
Unfortunately, there is still another difficulty with this type of on-line
replanning: it may be hard to recompute an optimal or near-optimal
policy on-line, using a newly identified system model. In particular, it may
be impossible to use time-consuming and/or data-intensive methods that
involve for example the training of a neural network, or discrete/integer
control constraints. A simpler possibility is to use rollout, which we discuss
in the next section.
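As a minimal sketch of such a combined identification and control scheme (under an assumed scalar linear quadratic structure; the numbers, the recursive least squares update, and the helper riccati_gain are illustrative choices, not from the text), the unknown gain b of the system x_{k+1} = a x_k + b u_k is tracked on-line and the feedback gain is recomputed from the Riccati equation after each new estimate; a small probing term is added to the control to keep the parameter identifiable, in the spirit of the identifiability discussion above (Python).

# Sketch of indirect adaptive control for a scalar linear quadratic problem:
# the gain b of x_{k+1} = a*x_k + b*u_k is unknown and changes at some point;
# it is re-estimated on-line by recursive least squares (RLS), and the controller
# is replanned by re-solving the Riccati equation.  All values are illustrative.

def riccati_gain(a, b, q, r, iters=200):
    """Fixed-point iteration on K = q + a^2 r K / (r + b^2 K); return the gain L."""
    K = q
    for _ in range(iters):
        K = q + a**2 * r * K / (r + b**2 * K)
    return -a * b * K / (r + b**2 * K)

a, q, r = 1.0, 1.0, 0.5
b_true, b_est = 2.0, 2.0          # the true gain will change; the estimate tracks it
P, lam = 1.0, 0.95                # RLS "covariance" and forgetting factor
x = 5.0
L = riccati_gain(a, b_est, q, r)

for k in range(60):
    if k == 20:
        b_true = 0.8              # unannounced change in the system
    u = L * x + 0.1 * (-1.0) ** k # small probing term for identifiability
    x_next = a * x + b_true * u   # observed transition
    # RLS update for the regression x_next - a*x = b*u
    y = x_next - a * x
    gain = P * u / (lam + u * P * u)
    b_est += gain * (y - b_est * u)
    P = (P - gain * u * P) / lam
    L = riccati_gain(a, b_est, q, r)   # on-line replanning with the new estimate
    x = x_next

print(f"final estimate of b: {b_est:.3f}, final state: {x:.4f}")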
5.2 APPROXIMATION IN VALUE SPACE, ROLLOUT, AND
ADAPTIVE CONTROL
We will now consider an approach for dealing with unknown or changing
parameters, which is based on rollout and on-line replanning. We have
already noted this approach in Chapter 1, where we stressed the importance
of fast on-line policy improvement.
Let us assume that some problem parameters change over time, while
the controller estimates the changes on-line, perhaps after a suitable delay
for data collection. The method by which the problem parameters are recalculated
or become known is immaterial for the purposes of the following
discussion. It may involve a limited form of parameter estimation, whereby
the unknown parameters are “tracked” by data collection over a few time
stages, with due attention paid to issues of parameter identifiability; or it
may involve new features of the control environment, such as a changing
number of servers and/or tasks in a service system.
We thus assume away/ignore the detailed issues of parameter estimation,
and focus on revising the controller by on-line replanning based on the
newly obtained parameters. This revision may be based on any suboptimal
method, but rollout with some base policy is particularly attractive. The
base policy may be either a fixed robust controller (such as some form of
PID control) or it may be updated over time (in the background, on the
basis of some unspecified rationale), in which case the rollout policy will
be revised both in response to the changed base policy and in response to
the changing parameters.
Here the advantage of rollout is that it is simple, reliable, and relatively
fast. In particular, it does not require a complicated training procedure,
based for example on the use of neural networks or other approximation
architectures, so no new policy is explicitly computed in response to the
parameter changes. Instead the available controls at the current state are
compared through a one-step or multistep minimization, with cost function
approximation provided by the base policy (cf. Fig. 5.2.1).
Another issue to consider is the stability and robustness properties
of the rollout policy. In this connection, it can be generally proved, under
mild conditions, that if the base policy is stable within a range of parameter
values, the same is true for the rollout policy; this can also be inferred
from Fig. 3.4.3. Related ideas have a long history in the control theory
literature; see Beard [Bea95], Beard, Saridis, and Wen [BSW99], Jiang and
Jiang [JiJ17], Kalise, Kundu, Kunisch [KKK20], Pang and Jiang [PaJ21].
The principal requirement for using rollout in an adaptive control
context is that the rollout control computation should be fast enough to
be performed between stages. In this connection, we note that accelerated/
truncated or simplified versions of rollout, as well as parallel computation,
can be used to meet this time constraint.
Generally, adaptive control by rollout and on-line replanning makes
sense in situations where the calculation of the rollout controls for a given
set of problem parameters is faster and/or more convenient than the calculation
of the optimal controls for the same set of parameter values. These
problems include cases involving nonlinear systems and/or difficult (e.g.,
integer) constraints.
The following example illustrates on-line replanning with the use of
rollout in the context of the simple one-dimensional linear quadratic problem
that we discussed earlier. The purpose of the example is to show analytically
how rollout with a base policy that is optimal for a nominal set
of problem parameters works well when the parameters change from their
nominal values. This property is not practically useful in linear quadratic
problems because when the parameters change, it is possible to calculate
the new optimal policy in closed form, but it is indicative of the performance
robustness of rollout in other contexts; for example linear quadratic
problems with constraints.
Figure 5.2.1 Schematic illustration of adaptive control by on-line replanning
based on rollout. One-step lookahead minimization is followed by simulation with
the base policy, which stays fixed. The system, cost, and constraint parameters
are changing over time, and the most recent estimates of their values are incorporated
into the lookahead minimization and rollout operations. Truncated rollout
with multistep lookahead minimization and terminal cost approximation is also
possible. The base policy may also be revised based on various criteria. For the
discussion of this section, we may assume that all the changing parameter information
is provided by some computation and sensor “cloud” that is beyond our
control.
Example 5.2.1 (On-Line Replanning for Linear Quadratic
Problems Based on Rollout)
Consider a deterministic undiscounted infinite horizon linear quadratic problem
involving the linear system

x_{k+1} = x_k + b u_k,

and the quadratic cost function

\lim_{N \to \infty} \sum_{k=0}^{N-1} (x_k^2 + r u_k^2).

This is the one-dimensional problem of the preceding section for the special
case where a = 1 and q = 1. The optimal cost function is given by

J^*(x) = K^* x^2,
where K^* is the unique positive solution of the Riccati equation

K = \frac{r K}{r + b^2 K} + 1.   (5.1)

The optimal policy has the form

\mu^*(x) = L^* x,   (5.2)

where

L^* = -\frac{b K^*}{r + b^2 K^*}.   (5.3)
As an example, consider the optimal policy that corresponds to the
nominal problem parameters b = 2 and r = 0.5: this is the policy (5.2)-(5.3),
with K^* computed as the positive solution of the quadratic Riccati Eq. (5.1)
for b = 2 and r = 0.5. For these nominal parameter values, we have

K^* = \frac{2 + \sqrt{6}}{4} \approx 1.11.

From Eq. (5.3) we then also obtain

L^* = -\frac{2 + \sqrt{6}}{5 + 2\sqrt{6}}.   (5.4)
We will now consider changes of the values of b and r while keeping L constant
at the preceding value, and we will compare the quadratic cost coefficients of
the following three cost functions as b and r vary:

(a) The optimal cost function K^* x^2, where K^* is given by the positive
solution of the Riccati Eq. (5.1).

(b) The cost function K_L x^2 that corresponds to the base policy

\mu_L(x) = L x,

where L is given by Eq. (5.4). Here, we have (cf. Section 4.1)

K_L = \frac{1 + r L^2}{1 - (1 + bL)^2}.   (5.5)

(c) The cost function \tilde{K}_L x^2 that corresponds to the rollout policy

\tilde{\mu}_L(x) = \tilde{L} x,

obtained by using the policy \mu_L as base policy. Using the formulas
derived earlier, we have [cf. Eq. (5.5)]

\tilde{L} = -\frac{b K_L}{r + b^2 K_L},
and (cf. Section 4.1)

\tilde{K}_L = \frac{1 + r \tilde{L}^2}{1 - (1 + b\tilde{L})^2}.
Figure 5.2.2 shows the coefficients K^*, K_L, and \tilde{K}_L for a range of values
of r and b. We have

K^* \le \tilde{K}_L \le K_L.

The difference K_L - K^* is indicative of the robustness of the policy \mu_L, i.e.,
the performance loss incurred by ignoring the changes in the values of b and
r, and continuing to use the policy \mu_L, which is optimal for the nominal
values b = 2 and r = 0.5, but suboptimal for other values of b and r. The
difference \tilde{K}_L - K^* is indicative of the performance loss due to using on-line
replanning by rollout rather than using optimal replanning. Finally, the
difference K_L - \tilde{K}_L is indicative of the performance improvement due to on-line
replanning using rollout rather than keeping the policy \mu_L unchanged.

Note that Fig. 5.2.2 illustrates the behavior of the error ratio

\frac{\tilde{J} - J^*}{J - J^*},

where for a given initial state, \tilde{J} is the rollout performance, J^* is the optimal
performance, and J is the base policy performance. This ratio approaches 0
as J - J^* becomes smaller because of the superlinear/quadratic convergence
rate of Newton's method that underlies the rollout algorithm.
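The comparison in the example is easy to reproduce numerically. The following sketch (Python; the grid of parameter values is an arbitrary illustrative choice) computes K^*, K_L, and \tilde{K}_L with L fixed at its nominal-optimal value (5.4), mirroring the quantities plotted in Fig. 5.2.2.

# Numerical sketch of Example 5.2.1: the base policy gain L is fixed at its optimal
# value for the nominal parameters b = 2, r = 0.5, and we compare the optimal,
# base policy, and rollout cost coefficients as b varies (r kept at 0.5).
def k_star(b, r, iters=200):
    # positive solution of the Riccati equation K = r*K/(r + b^2*K) + 1, cf. Eq. (5.1)
    K = 1.0
    for _ in range(iters):
        K = r * K / (r + b**2 * K) + 1.0
    return K

def k_of_policy(L, b, r):
    # cost coefficient of the linear policy mu(x) = L*x, cf. Eq. (5.5)
    return (1.0 + r * L**2) / (1.0 - (1.0 + b * L)**2)

b0, r0 = 2.0, 0.5                            # nominal parameters
K0 = k_star(b0, r0)
L = -b0 * K0 / (r0 + b0**2 * K0)             # base policy gain, cf. Eqs. (5.3)-(5.4)

for b in (1.0, 1.5, 2.0, 2.5, 3.0):
    r = 0.5
    K_opt = k_star(b, r)
    K_base = k_of_policy(L, b, r)            # cost of the (now suboptimal) base policy
    L_roll = -b * K_base / (r + b**2 * K_base)   # rollout gain: one Newton step
    K_roll = k_of_policy(L_roll, b, r)
    print(f"b = {b:3.1f}:  K* = {K_opt:5.3f}   rollout = {K_roll:5.3f}   base = {K_base:5.3f}")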
5.3 APPROXIMATION IN VALUE SPACE, ROLLOUT, AND
MODEL PREDICTIVE CONTROL
In this section, we briefly discuss the MPC methodology, with a view towards
its connection with approximation in value space and the rollout
algorithm. We will focus on the undiscounted infinite horizon deterministic
problem, which involves the system
x_{k+1} = f(x_k, u_k),

whose state x_k and control u_k are finite-dimensional vectors. The cost per
stage is assumed nonnegative,

g(x_k, u_k) \ge 0,   for all (x_k, u_k),

(e.g., a positive definite quadratic cost). There are control constraints u_k \in
U(x_k), and to simplify the following discussion, we will initially consider
no state constraints. We assume that the system can be kept at the origin
at zero cost, i.e.,

f(0, u_k) = 0,  g(0, u_k) = 0   for some control u_k \in U(0).
Figure 5.2.2 Illustration of control by rollout under changing problem parameters.
The quadratic cost coefficients K^* (optimal, denoted by solid line),
K_L (base policy, denoted by circles), and \tilde{K}_L (rollout policy, denoted by asterisks)
are shown for the two cases where r = 0.5 and b varies, and b = 2
and r varies. The value of L is fixed at the value that is optimal for b = 2
and r = 0.5 [cf. Eq. (5.4)]. The rollout policy performance is very close to
optimal, even when the base policy is far from optimal.

Note that, as the figure illustrates, we have

\lim_{J \to J^*} \frac{\tilde{J} - J^*}{J - J^*} = 0,

where for a given initial state, \tilde{J} is the rollout performance, J^* is the optimal
performance, and J is the base policy performance. This is a consequence of
the superlinear/quadratic convergence rate of Newton's method that underlies
rollout, and guarantees that the rollout performance approaches the optimal
much faster than the base policy performance does.
For a given initial state x_0, we want to obtain a sequence {u_0, u_1, ...} that
satisfies the control constraints, while minimizing the total cost.
Figure 5.3.1 Illustration of the problem solved by a classical form of MPC at
state x_k. We minimize the cost function over the next \ell stages while imposing
the requirement that x_{k+\ell} = 0. We then apply the first control of the optimizing
sequence. In the context of rollout, the minimization over u_k is the one-step
lookahead, while the minimization over u_{k+1}, ..., u_{k+\ell-1} that drives x_{k+\ell} to 0
is the base heuristic.
This is a classical problem in control system design, known as the
regulation problem, where the aim is to keep the state of the system near the
origin (or more generally some desired set point), in the face of disturbances
and/or parameter changes. In an important variant of the problem, there
are additional state constraints of the form x_k \in X, and there arises the
issue of maintaining the state within X, not just at the present time but
also in future times. We will address this issue later in this section.
The Classical Form of MPC - View as a Rollout Algorithm
We will first focus on a classical form of the MPC algorithm, proposed in
the form given here by Keerthi and Gilbert [KeG88]. In this algorithm,
at each encountered state x_k, we apply a control \tilde{u}_k that is computed as
follows; see Fig. 5.3.1:

(a) We solve an \ell-stage optimal control problem involving the same cost
function and the requirement that the state after \ell steps is driven to
0, i.e., x_{k+\ell} = 0. This is the problem

\min_{u_t,\, t=k,\ldots,k+\ell-1} \sum_{t=k}^{k+\ell-1} g(x_t, u_t),   (5.6)

subject to the system equation constraints

x_{t+1} = f(x_t, u_t),   t = k, \ldots, k+\ell-1,   (5.7)

the control constraints

u_t \in U(x_t),   t = k, \ldots, k+\ell-1,   (5.8)

and the terminal state constraint

x_{k+\ell} = 0.   (5.9)

Here \ell is an integer with \ell > 1, which is chosen in some largely
empirical way.

(b) If \{\tilde{u}_k, \ldots, \tilde{u}_{k+\ell-1}\} is the optimal control sequence of this problem,
we apply \tilde{u}_k and we discard the other controls \tilde{u}_{k+1}, \ldots, \tilde{u}_{k+\ell-1}.

(c) At the next stage, we repeat this process, once the next state x_{k+1} is
revealed.
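The scheme (a)-(c) is compactly expressed in code. The following sketch (Python, using SciPy's SLSQP solver) applies it to an illustrative scalar system with quadratic cost per stage; the system, the horizon, and all numerical values are assumptions made for the example, not part of the text.

# Sketch of the classical MPC scheme (a)-(c): at each state we solve the ell-stage
# problem (5.6)-(5.9) with terminal constraint x_{k+ell} = 0, apply the first control,
# and repeat.  The scalar system and all numbers are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

r, ell = 0.5, 5

def f(x, u):                     # system equation (mildly nonlinear, for illustration)
    return x + 2.0 * u + 0.1 * x**2 * np.tanh(u)

def g(x, u):                     # nonnegative cost per stage
    return x**2 + r * u**2

def trajectory(x0, u_seq):
    xs = [x0]
    for u in u_seq:
        xs.append(f(xs[-1], u))
    return xs

def mpc_control(xk):
    """Solve the ell-stage problem with x_{k+ell} = 0; return the first control."""
    cost = lambda u: sum(g(x, ui) for x, ui in zip(trajectory(xk, u)[:-1], u))
    terminal = {"type": "eq", "fun": lambda u: trajectory(xk, u)[-1]}
    res = minimize(cost, np.zeros(ell), method="SLSQP", constraints=[terminal])
    return res.x[0]              # apply the first control, discard the rest

x = 4.0
for k in range(10):              # receding-horizon loop: re-solve at every stage
    x = f(x, mpc_control(x))
print(f"state after 10 MPC steps: {x:.5f}")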
To make the connection of the preceding MPC algorithm with rollout,
we note that the one-step lookahead function \tilde{J} implicitly used by MPC [cf.
Eq. (5.6)] is the cost function of a certain stable base policy. This is the
policy that drives to 0 the state after \ell - 1 stages (not \ell stages) and keeps
the state at 0 thereafter, while observing the state and control constraints,
and minimizing the associated (\ell-1)-stage cost. This rollout view of MPC
was first discussed in the author's paper [Ber05]. It is useful for making
a connection with approximate DP/RL, rollout, and its interpretation
in terms of Newton's method. In particular, an important consequence is
that the MPC policy is stable, since rollout with a stable base policy yields
a stable policy, as we have discussed in Section 3.2.

We may also equivalently view the preceding MPC algorithm as rollout
with \bar{\ell}-step lookahead, where 1 < \bar{\ell} < \ell, with the base policy that drives
to 0 the state after \ell - \bar{\ell} stages and keeps the state at 0 thereafter. This
suggests variations of MPC that involve truncated rollout with terminal
cost function approximation, which we will discuss shortly.

Note also that when faced with changing problem parameters, it is
natural to consider on-line replanning as per our earlier discussion. In
particular, once new estimates of system and/or cost function parameters
become available, MPC can adapt accordingly by introducing the new parameter
estimates into the \ell-stage optimization problem in (a) above.
5.4 TERMINAL COST APPROXIMATION - STABILITY ISSUES
In a common variant of MPC, the requirement of driving the system state to
0 in \ell steps in the \ell-stage MPC problem (5.6) is replaced by a nonnegative
terminal cost G(x_{k+\ell}). Thus at state x_k, we solve the problem

\min_{u_t,\, t=k,\ldots,k+\ell-1} \Big[ G(x_{k+\ell}) + \sum_{t=k}^{k+\ell-1} g(x_t, u_t) \Big],   (5.10)
instead of problem (5.6) where we require that x_{k+\ell} = 0. This variant can
be viewed as rollout with one-step lookahead, and a base policy, which at
state x_{k+1} applies the first control \tilde{u}_{k+1} of the sequence \{\tilde{u}_{k+1}, \ldots, \tilde{u}_{k+\ell-1}\}
that minimizes

G(x_{k+\ell}) + \sum_{t=k+1}^{k+\ell-1} g(x_t, u_t).

It can also be viewed outside the context of rollout, as approximation in
value space with \ell-step lookahead minimization and terminal cost approximation
given by G. Thus the preceding MPC controller may have a cost
function that is much closer to J^* than G is. This is due to the superlinear/quadratic
convergence rate of Newton's method that underlies
approximation in value space, as we have discussed in Chapter 3.
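In code, the only change relative to the classical scheme sketched earlier is that the terminal equality constraint is replaced by the terminal penalty G. A minimal sketch under assumed scalar linear quadratic data (all values illustrative, Python):

# Sketch of the terminal cost variant (5.10): the terminal constraint x_{k+ell} = 0
# is replaced by a quadratic terminal penalty G(x) = beta*x^2.  Illustrative values.
import numpy as np
from scipy.optimize import minimize

b, r, ell, beta = 2.0, 0.5, 5, 2.0

f = lambda x, u: x + b * u              # system equation
g = lambda x, u: x**2 + r * u**2        # cost per stage
G = lambda x: beta * x**2               # terminal cost approximation

def mpc_control(xk):
    def total_cost(u_seq):
        x, c = xk, 0.0
        for u in u_seq:
            c += g(x, u)
            x = f(x, u)
        return c + G(x)                 # cf. problem (5.10)
    res = minimize(total_cost, np.zeros(ell), method="SLSQP")
    return res.x[0]

x = 4.0
for k in range(10):
    x = f(x, mpc_control(x))
print(f"state after 10 steps: {x:.5f}")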
An important question is how to choose the terminal cost approximation so
that the resulting MPC controller is stable. Our discussion of Section 3.3 on
the region of stability of approximation in value space schemes applies here.
In particular, under the nonnegative cost assumption of this section, the
MPC controller will be stable if TG \le G (using the abstract DP notation
introduced in Chapter 3), or equivalently

(TG)(x) = \min_{u \in U(x)} \big[ g(x, u) + G\big(f(x, u)\big) \big] \le G(x),   for all x,   (5.11)

as noted in Section 3.2. This condition is sufficient for stability of the
MPC controller, but it is not necessary. Figure 5.4.1 provides a graphical
illustration. It shows that the condition TG \le G implies that J^* \le T^\ell G \le
T^{\ell-1} G for all \ell \ge 1 (the books [Ber12] and [Ber18a] provide mathematical
proofs of this fact). This in turn implies that T^\ell G lies within the region of
stability for all \ell \ge 0.
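For the scalar linear quadratic problem of Example 5.2.1 the condition (5.11) can be checked in closed form: with G(x) = \beta x^2 one finds (TG)(x) = (1 + \beta r/(r + b^2 \beta)) x^2, so TG \le G reduces to a scalar inequality on \beta, which holds exactly when \beta is at least the Riccati solution K^*. A small sketch (Python; the values of \beta are arbitrary illustrative choices):

# Sketch checking the sufficient stability condition TG <= G of Eq. (5.11) in the
# scalar linear quadratic case with terminal cost G(x) = beta*x^2, where
# (TG)(x) = (1 + beta*r/(r + b^2*beta))*x^2.  Parameter values are illustrative.
b, r = 2.0, 0.5

def TG_coeff(beta):
    """Quadratic coefficient of (TG)(x) when G(x) = beta*x^2."""
    return 1.0 + beta * r / (r + b**2 * beta)

K_star = 0.5 * (1.0 + (1.0 + 4.0 * r / b**2) ** 0.5)   # positive Riccati solution

for beta in (0.2, 0.6, K_star, 2.0, 5.0):
    holds = TG_coeff(beta) <= beta + 1e-12
    print(f"beta = {beta:5.3f}:  (TG) coefficient = {TG_coeff(beta):5.3f}   TG <= G: {holds}")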
We also expect that as the length \ell of the lookahead minimization
is increased, the stability properties of the MPC controller are improved.
In particular, given G \ge 0, the resulting MPC controller is likely to be
stable for \ell sufficiently large, since T^\ell G ordinarily converges to J^*, which
lies within the region of stability. Results of this type are known within
the MPC framework under various conditions (see the papers by Mayne
et al. [MRR00], Magni et al. [MDM01], the MPC book [RMD17], and
the author's book [Ber20a], Section 3.1.2). Our discussion of stability in
Sections 4.4 and 4.6 is also relevant within this context.
In another variant of MPC, in addition to the terminal cost function
approximation G, we use truncated rollout, which involves running
some stable base policy μ for a number of steps m; see Fig. 5.4.2. This
is quite similar to standard truncated rollout, except that the computational
solution of the lookahead minimization problem (5.10) may become
complicated when the control space is infinite. As discussed in Section 3.3,
increasing the length of the truncated rollout enlarges the region of stability
of the MPC controller. The reason is that by increasing the length of the
Figure 5.4.1 Illustration of the condition TG \le G, or equivalently

(TG)(x) = \min_{u \in U(x)} \big[ g(x, u) + G\big(f(x, u)\big) \big] \le G(x),   for all x.

When satisfied by the terminal cost function approximation G, it guarantees the
stability of the MPC policy \tilde{\mu} with \ell-step lookahead minimization, defined by

T_{\tilde{\mu}} T^{\ell-1} G = T^\ell G,

where for a generic policy \mu, T_\mu is defined (using the abstract DP notation of
Chapter 3) by

(T_\mu J)(x) = g\big(x, \mu(x)\big) + J\big(f\big(x, \mu(x)\big)\big),   for all x.

In this figure, \ell = 3.
truncated rollout, we push the start of the Newton step towards the cost
function J_\mu of the stable policy, which lies within the region of stability
since T J_\mu \le T_\mu J_\mu = J_\mu; see also the discussion on linear quadratic problems
in Section 4.7. The base policy may also be used to address state
constraints; see the papers by Rosolia and Borrelli [RoB17], [RoB19], and
the discussion in the author's RL book [Ber20a].
Figure 5.4.2 An MPC scheme with \ell-step lookahead minimization, m-step truncated
rollout with a stable base policy \mu, and a terminal cost function approximation
G, together with its interpretation as a Newton step. In this figure, \ell = 2
and m = 4. As m increases, T_\mu^m G moves closer to J_\mu, which lies within the region
of stability.
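In the same scalar linear quadratic setting used earlier, the effect of the m rollout steps is easy to visualize: for a linear base policy \mu(x) = Lx, the operator T_\mu maps a quadratic cost coefficient \beta to 1 + rL^2 + (1 + bL)^2 \beta, so T_\mu^m G approaches J_\mu geometrically as m grows. A small sketch (Python; the gain L and the terminal cost are arbitrary illustrative choices):

# Sketch of m-step truncated rollout with a stable linear base policy mu(x) = L*x in
# the scalar linear quadratic setting: T_mu maps a quadratic coefficient beta to
# 1 + r*L^2 + (1 + b*L)^2 * beta, so T_mu^m G moves geometrically toward J_mu.
b, r, L = 2.0, 0.5, -0.3          # stable but suboptimal base policy: |1 + b*L| < 1

def T_mu(beta):
    return 1.0 + r * L**2 + (1.0 + b * L)**2 * beta

K_mu = (1.0 + r * L**2) / (1.0 - (1.0 + b * L)**2)   # fixed point of T_mu, i.e. J_mu

beta = 0.0                        # a poor terminal cost approximation: G = 0
for m in range(9):
    print(f"m = {m}:  coefficient of T_mu^m G = {beta:7.4f}   (J_mu coefficient = {K_mu:.4f})")
    beta = T_mu(beta)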
A Rollout Variant of MPC with Multiple Terminal States and
Base Policies
In another variation of MPC, proposed in the paper by Li et al. [LJM21],
instead of driving the state to 0 at the end of \ell steps, we consider multiple
terminal system states at the end of the \ell-step horizon, as well as the use
of multiple base policies for rollout. In particular, in this scheme we have
a finite set of states X and a finite set of stable base policies M, and we
assume that we have computed off-line the cost function values J_\mu(x) for
all x \in X and \mu \in M. At state x_k, to compute the MPC control \tilde{u}_k, we
solve for each x \in X a problem that is the same as the problem (5.6)-(5.9),
which is solved by the classical form of MPC, except that the terminal state
x_{k+\ell} is equal to x instead of x_{k+\ell} = 0. This is the problem

\min_{u_t,\, t=k,\ldots,k+\ell-1} \sum_{t=k}^{k+\ell-1} g(x_t, u_t),   (5.12)
subject to the system equation constraints

x_{t+1} = f(x_t, u_t),   t = k, \ldots, k+\ell-1,   (5.13)

the control constraints

u_t \in U(x_t),   t = k, \ldots, k+\ell-1,   (5.14)

and the terminal state constraint

x_{k+\ell} = x.   (5.15)
Let V(x_k; x) be the optimal value of this problem. Having computed
V(x_k; x) for all x \in X, we compare all values

V(x_k; x) + J_\mu(x),   x \in X, \mu \in M,

and find the pair (\bar{x}, \bar{\mu}) that yields the minimal value of V(x_k; x) + J_\mu(x).
We then define the MPC control \tilde{u}_k to be the control u_k that attains the
minimum in the corresponding problem (5.12)-(5.15) with x = \bar{x}.
Thus, in this variant of MPC we solve multiple problems of the type
that is solved in the classical form of MPC, for multiple values of the
terminal state x_{k+\ell}, and we then compute the MPC control based on the
"best" terminal state \bar{x} \in X, assuming that the "best" base policy \bar{\mu} will be
used after stage k+\ell. It is possible to show, under appropriate conditions,†
that the cost function J_{\tilde{\mu}} of the MPC policy \tilde{\mu}, which applies \tilde{\mu}(x_k) = \tilde{u}_k
as described above, has the cost improvement property

J_{\tilde{\mu}}(x) \le J_\mu(x),   for all x \in X, \mu \in M;   (5.16)

see [LJM21]. Moreover, based on this property and the assumption that
the base policies \mu \in M are stable, it follows that the MPC policy \tilde{\mu} thus
obtained is also stable.
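A minimal sketch of this variant, in the same illustrative scalar linear quadratic setting used earlier (the candidate terminal states, the base policy gains, and the use of SciPy's SLSQP solver are all assumptions made for the example, Python):

# Sketch of the MPC variant with multiple terminal states and base policies [LJM21]:
# for each candidate terminal state x in X we solve problem (5.12)-(5.15), add the
# best precomputed base policy cost J_mu(x), and apply the first control of the best
# combination.  The scalar linear quadratic data below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

b, r, ell = 2.0, 0.5, 4

f = lambda x, u: x + b * u
g = lambda x, u: x**2 + r * u**2

def policy_cost_coeff(L):
    # J_mu coefficient of the linear base policy mu(x) = L*x (stable if |1 + b*L| < 1)
    return (1.0 + r * L**2) / (1.0 - (1.0 + b * L)**2)

base_gains = [-0.30, -0.45]                      # the set M of base policies
terminal_states = [-1.0, -0.5, 0.0, 0.5, 1.0]    # the set X of terminal states
# best base policy cost at each candidate terminal state (computed off-line)
J_best = {x: min(policy_cost_coeff(L) * x**2 for L in base_gains) for x in terminal_states}

def ell_stage_problem(xk, x_target):
    """Optimal value and first control of the ell-stage problem with x_{k+ell} = x_target."""
    def path(u_seq):
        xs = [xk]
        for u in u_seq:
            xs.append(f(xs[-1], u))
        return xs
    cost = lambda u: sum(g(x, ui) for x, ui in zip(path(u)[:-1], u))
    con = {"type": "eq", "fun": lambda u: path(u)[-1] - x_target}
    res = minimize(cost, np.zeros(ell), method="SLSQP", constraints=[con])
    return res.fun, res.x[0]

xk = 3.0
results = {x: ell_stage_problem(xk, x) for x in terminal_states}
best_x = min(terminal_states, key=lambda x: results[x][0] + J_best[x])
u_mpc = results[best_x][1]
best_value = results[best_x][0] + J_best[best_x]
print(f"best terminal state: {best_x}, value: {best_value:.4f}, MPC control: {u_mpc:.4f}")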
The preceding variation can also be used for systems with arbitrary
state and control spaces, continuous as well as discrete. It is also well-suited
for addressing state constraints, provided the base policies are designed to
satisfy these constraints. In this case, the state constraints are included in
the constraints of the \ell-step problems (5.12)-(5.15). We refer to the paper
[LJM21], which provides supporting analysis, extensions to the case where
X is an infinite set, as well as computational results involving several types
of problems, with both discrete and continuous state and control spaces.
We mention that the idea of using multiple base policies to evaluate
the available controls at a given state, and selecting the control that yields
the least cost, has been known since the original proposal of the paper
[BTW97]. The main result for such schemes is a cost improvement property,
whereby the rollout policy outperforms simultaneously all the base policies;
cf. Eq. (5.16). This property is also discussed in Sections 6.3 and 6.4, as
well as the books [Ber17a], [Ber19a], [Ber20a].
† These conditions include that for every x \in X, we have f\big(x, \mu(x)\big) \in X for
some \mu \in M, which plays the same role as the assumption that the origin is cost-free
and absorbing in the classical form of MPC.
Stochastic MPC by Certainty Equivalence
Let us mention that while in this section we have focused on deterministic
problems, there are variants of MPC, which include the treatment
of uncertainty. The books and papers cited earlier contain several ideas
along these lines; see for example the books by Kouvaritakis and Cannon
[KoC16], Rawlings, Mayne, and Diehl [RMD17], and the survey by Mesbah
[Mes16].
In this connection it is also worth mentioning the certainty equivalence
approach that we discussed briefly in Section 3.2. As noted in that
section, upon reaching state x_k we may perform the MPC calculations
after replacing the uncertain quantities w_{k+1}, w_{k+2}, \ldots with deterministic
quantities \bar{w}_{k+1}, \bar{w}_{k+2}, \ldots, while allowing for the stochastic character of
the disturbance w_k of just the current stage k. This MPC calculation is
not much more difficult than the one for deterministic problems, while still
implementing a Newton step for solving the associated Bellman equation;
see the discussion of Section 3.2, and also Section 2.5.3 of the RL book
[Ber19a].
State Constraints, Target Tubes, and Off-Line Training
Our discussion so far has skirted a major issue in MPC, which is that there
may be additional state constraints of the form x_k \in X, for all k, where X
is some subset of the true state space. Indeed much of the original work on
MPC was motivated by control problems with state constraints, imposed
by the physics of the problem, which could not be handled effectively with
the nice unconstrained framework of the linear quadratic problem that we
have discussed in Chapter 4.
The treatment of state constraints is connected to the theory of reachability
of target tubes, first formulated and studied by the author in his
Ph.D. thesis [Ber71], and subsequent papers [BeR71], [Ber72]; see the books
[Ber17a], [Ber19a], [Ber20a] for a discussion that is consistent with the viewpoint
of this section. A target tube is a subset \bar{X} of the state constraint
set X, within which the state can be kept indefinitely with feasible control
choices, assuming that the initial state belongs to \bar{X}. In other words, the
problem (5.10) may not be feasible for every x_k \in X, once the constraint
x_t \in X for all t = k+1, k+2, \ldots, is added in problem (5.10). However, a
suitable target tube is one specified by a subset \bar{X} \subset X such that the problem
(5.10) is feasible under the constraint x_t \in \bar{X} for all t = k+1, k+2, \ldots,
provided x_k \in \bar{X}.
There are several ways to compute sets \bar{X} with this property, for
which we refer to the aforementioned author's work and the MPC literature;
see e.g., the book by Rawlings, Mayne, and Diehl [RMD17], and the survey
by Mayne [May14]. The important point here is that the computation of a
target tube must be done off-line with one of several available algorithmic
approaches, so it becomes part of the off-line training (in addition to the
terminal cost function G).
Given an off-line training process, which provides a target tube constraint
x_k \in \bar{X} for all k, a terminal cost function G, and possibly one or
more base policies for truncated rollout, MPC becomes an on-line play algorithm
for which our earlier discussion applies. Note, however, that in an
indirect adaptive control context, where a model is estimated on-line as it
is changing, it may be difficult to recompute on-line a target tube that can
be used to enforce the state constraints of the problem, particularly if the
state constraints themselves change as part of the changing problem data.
This is a problem-dependent issue that deserves further attention.
5.5 NOTES AND SOURCES
The literature for PID control is extensive and includes the books by
Åström and Hägglund [AsH95], [AsH06]. For detailed accounts of adaptive
control, we refer to the books by Åström and Wittenmark [AsW08],
Bodson [Bod20], Goodwin and Sin [GoS84], Ioannou and Sun [IoS96], Jiang
and Jiang [JiJ17], Krstic, Kanellakopoulos, and Kokotovic [KKK95], Kokotovic
[Kok91], Kumar and Varaiya [KuV86], Liu et al. [LWW17], Lavretsky
and Wise [LaW13], Narendra and Annaswamy [NaA12], Sastry and Bodson
[SaB11], Slotine and Li [SlL91], and Vrabie, Vamvoudakis, and Lewis
[VVL13].
The literature on MPC is voluminous, and has grown over time to
include problem and algorithm variations and extensions. For detailed accounts,
we refer to the textbooks by Maciejowski [Mac02], Goodwin, Seron,
and De Dona [GSD06], Camacho and Bordons [CaB07], Kouvaritakis and
Cannon [KoC16], Borrelli, Bemporad, and Morari [BBM17], and Rawlings,
Mayne, and Diehl [RMD17].
Deterministic optimal control with infinite state and control spaces
can exhibit unusual/pathological behavior. For the case of nonnegative cost
per stage, an analysis of the exact value and policy iteration algorithms,
including convergence issues and counterexamples, is given in the author’s
paper [Ber17b] and abstract DP book [Ber22a]. The case of nonpositive
cost per stage has been addressed in classical analyses, beginning with the
work of Blackwell [Bla65]; see also [Str66], [BeS78], [YuB15].