Dynamic programming
Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment.
Policy evaluation (prediction)
Policy evaluation is the task of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$.
The iterative update is
$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]$$
for all $s \in \mathcal{S}$. $v_\pi$ is a fixed point of this update rule because the Bellman equation for $v_\pi$ assures us of equality in this case. This algorithm is called iterative policy evaluation.
This kind of update is called an expected update because it is based on an expectation over all possible next states rather than on a sample next state.
A complete in-place version of iterative policy evaluation is sketched below.
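The following is a minimal Python sketch, not the book's pseudocode verbatim. It assumes a hypothetical tabular model `P[s][a]` returning a list of `(prob, next_state, reward, done)` tuples and a stochastic policy array `pi[s][a]`; those names are illustrative.

```python
import numpy as np

def iterative_policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation for a finite MDP.

    P[s][a]  : list of (prob, next_state, reward, done) tuples (assumed model format)
    pi[s][a] : probability of taking action a in state s
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # Expected update: average over the policy's actions and the model's outcomes.
            V[s] = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2] * (not done))
                               for p, s2, r, done in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:   # stop when the largest change in a sweep is small
            return V
```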
Policy improvement
The reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \neq \pi(s)$. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of this way of behaving is
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr].$$
If it is greater than $v_\pi(s)$, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
That this is true is a special case of a general result called the policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
$$q_\pi(s, \pi'(s)) \ge v_\pi(s).$$
Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states: $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$.
So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$. In other words, to consider the new greedy policy $\pi'$, given by
$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr].$$
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
In the general case, a stochastic policy specifies probabilities, $\pi(a \mid s)$, for taking each action, $a$, in each state, $s$. The idea of improvement of deterministic policies extends to stochastic policies.
Policy iteration
Once a policy $\pi$ has been improved using $v_\pi$ to yield a better policy $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. This way of finding an optimal policy is called policy iteration.
Policy iteration often converges in surprisingly few iterations.
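As an illustration, here is a sketch of policy iteration built on the evaluation routine sketched above. It uses the same hypothetical model format `P[s][a]`; `iterative_policy_evaluation` refers to the earlier sketch.

```python
import numpy as np

def q_from_v(P, V, s, gamma):
    """One-step lookahead: action values at state s given state values V."""
    return np.array([
        sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
        for a in range(len(P[s]))
    ])

def policy_iteration(P, n_actions, gamma=0.9):
    n_states = len(P)
    # Start from the uniform random policy.
    pi = np.ones((n_states, n_actions)) / n_actions
    while True:
        V = iterative_policy_evaluation(P, pi, gamma)       # policy evaluation
        stable = True
        for s in range(n_states):                           # policy improvement
            best_a = np.argmax(q_from_v(P, V, s, gamma))
            new_row = np.eye(n_actions)[best_a]
            if not np.array_equal(new_row, pi[s]):
                stable = False
            pi[s] = new_row
        if stable:                                           # greedy w.r.t. its own value function
            return pi, V
```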
Value iteration
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to $v_\pi$ occurs only in the limit.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]$$
for all $s \in \mathcal{S}$.
Another way of understanding value iteration is by reference to the Bellman optimality equation. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration update is identical to the policy evaluation update except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
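For comparison with the policy iteration sketch, here is a sketch of value iteration over the same hypothetical model format `P[s][a]`:

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration: each sweep combines truncated evaluation and improvement."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            v_new = max(q)                       # max over actions instead of averaging under pi
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the deterministic greedy policy from the converged values.
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
             for a in range(n_actions)]
        pi[s] = int(np.argmax(q))
    return pi, V
```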
Generalized policy iteration
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy. It is easy to see that if both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus, both processes stabilize only when a policy has been found that is greedy with respect to its own evaluation function. This implies that the Bellman optimality equation holds, and thus that the policy and the value function are optimal.
Monte Carlo methods
Monte Carlo methods don’t require a model and are conceptually simple, but are not well suited for step-by-step incremental computation.
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks.
Monte Carlo methods sample and average returns for each state–action pair much like the bandit methods sample and average rewards for each action. The main difference is that now there are multiple states, each acting like a different bandit problem (like an associative-search or contextual bandit) and the different bandit problems are interrelated. That is, the return after taking an action in one state depends on the actions taken in later states in the same episode. Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.
To handle the nonstationarity, we adapt the idea of generalized policy iteration (GPI) developed for DP. Whereas there we computed value functions from knowledge of the MDP, here we learn value functions from sample returns with the MDP. The value functions and corresponding policies still interact to attain optimality in essentially the same way (GPI).
Monte Carlo prediction
Suppose we wish to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. Of course, $s$ may be visited multiple times in the same episode; let us call the first time it is visited in an episode the first visit to $s$. The first-visit MC method estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the every-visit MC method averages the returns following all visits to $s$.
First-visit MC prediction is sketched below.
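This is a minimal sketch rather than the book's pseudocode. It assumes a hypothetical episodic environment with `reset() -> state` and `step(action) -> (next_state, reward, done)`, and a callable `policy(state) -> action`.

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, n_episodes=10_000, gamma=1.0):
    """First-visit Monte Carlo prediction of v_pi (tabular, sample-average version)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        # Generate an episode following pi; store (S_t, R_{t+1}) pairs.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:              # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```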
If a model is not available, then it is particularly useful to estimate action values rather than state values. With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_*$. Monte Carlo estimation of action values $q_\pi(s, a)$ is essentially the same as for state values, except now we use visits to a state–action pair rather than to a state.
The only complication is that many state–action pairs may never be visited. To compare alternatives we need to estimate the value of all the actions from each state. This is the general problem of maintaining exploration. One way to assure continual exploration is by specifying that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start. This guarantees that all state–action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call this the assumption of exploring starts.
Monte Carlo control
Let’s consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. In the Monte Carlo version of classical policy iteration, we perform alternating complete steps of policy evaluation and policy improvement. Policy evaluation is done by Monte Carlo estimation: many episodes are experienced, with the approximate action-value function approaching the true function asymptotically. Policy improvement is done by making the policy greedy with respect to the current value function.
In order to easily obtain the guarantee of convergence for the Monte Carlo method, we make two unlikely assumptions. One is that the episodes have exploring starts, and the other is that policy evaluation is done with an infinite number of episodes.
A complete Monte Carlo control algorithm with exploring starts (Monte Carlo ES) is sketched below.
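The sketch below assumes, in addition to the interface above, a hypothetical `env.start_pairs` (all state–action pairs eligible as exploring starts), `env.actions`, and `env.reset_to(s, a) -> (next_state, reward, done)` that starts an episode from an arbitrary pair; these are assumptions, not part of the original text.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, n_episodes=100_000, gamma=1.0):
    """Monte Carlo ES (exploring starts): estimates pi ~ pi* and Q ~ q*."""
    Q = defaultdict(lambda: defaultdict(float))
    returns_count = defaultdict(lambda: defaultdict(int))
    policy = {}                                   # deterministic greedy policy

    for _ in range(n_episodes):
        # Exploring start: every state-action pair has nonzero probability of starting.
        s0, a0 = random.choice(env.start_pairs)
        state, reward, done = env.reset_to(s0, a0)
        episode = [(s0, a0, reward)]
        while not done:
            # Follow the current greedy policy (random for states not yet seen).
            action = policy.get(state, random.choice(env.actions))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit updates, walking backwards through the episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_count[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / returns_count[s][a]   # incremental sample average
                policy[s] = max(Q[s], key=Q[s].get)              # greedy improvement
    return policy, Q
```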
On-policy Monte Carlo control without exploring starts
To avoid the unlikely assumption of exploring starts, the only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data. The Monte Carlo ES method shown above is an example of an on-policy method.
In on-policy control methods the policy is generally soft, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$, but it is gradually shifted closer to a deterministic optimal policy.
The overall idea of on-policy Monte Carlo control (with $\varepsilon$-soft policies) is still that of GPI. As in Monte Carlo ES, we use first-visit MC methods to estimate the action-value function for the current policy. Without the assumption of exploring starts, however, we cannot simply improve the policy by making it greedy with respect to the current value function, because that would prevent further exploration of nongreedy actions. Fortunately, GPI does not require that the policy be taken all the way to a greedy policy, only that it be moved toward a greedy policy.
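A sketch of on-policy first-visit MC control with an $\varepsilon$-greedy (hence $\varepsilon$-soft) policy, under the same hypothetical environment interface as the earlier sketches:

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q_s, actions, epsilon):
    """Pick an action epsilon-greedily from the action-value row Q_s."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q_s[a])

def on_policy_mc_control(env, epsilon=0.1, n_episodes=100_000, gamma=1.0):
    """On-policy first-visit MC control with an epsilon-soft policy."""
    Q = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(n_episodes):
        # Generate an episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q[state], env.actions, epsilon)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # First-visit updates on the action values; the policy improves implicitly
        # because action selection is always epsilon-greedy w.r.t. the latest Q.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q
```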
Off-policy prediction via importance sampling
All learning control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behaviour, but they need to behave non-optimally in order to explore all actions. The on-policy approach learns action values not for the optimal policy, but for a near-optimal policy that still explores. Off-policy learning instead uses two policies: one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behaviour. The policy being learned about is called the target policy, and the policy used to generate behaviour is called the behaviour policy.
In the prediction problem for off-policy methods, both target and behaviour policies are fixed. Suppose we want to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $b$, where $b \neq \pi$. In order to use episodes from $b$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, we require that $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$. This is called the assumption of coverage. It follows from coverage that $b$ must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic, and in fact this is a case of particular interest in control problems. In control, the target policy is typically the deterministic greedy policy with respect to the current action-value function estimate. This policy becomes a deterministic optimal policy while the behaviour policy remains stochastic and more exploratory, for example, an $\varepsilon$-greedy policy.
Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behaviour policies, called the importance-sampling ratio.
Given a starting state $S_t$, the probability of the subsequent state–action trajectory, $A_t, S_{t+1}, A_{t+1}, \ldots, S_T$, occurring under any policy $\pi$ is
$$\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$
where $p$ here is the state-transition probability function. Thus the relative probability of the trajectory under the target and behaviour policies (the importance-sampling ratio) is
$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Although the trajectory probabilities depend on the MDP’s transition probabilities, which are generally unknown, they appear identically in both the numerator and denominator, and thus cancel.
We define $\mathcal{T}(s)$ as the set of all time steps in which state $s$ is visited (for the every-visit method; for the first-visit method it would contain only time steps that were first visits to $s$ within their episodes). We define $T(t)$ as the first time of termination following time $t$, and $G_t$ as the return after $t$ up through $T(t)$. Then $\{G_t\}_{t \in \mathcal{T}(s)}$ are the returns that pertain to state $s$, and $\{\rho_{t:T(t)-1}\}_{t \in \mathcal{T}(s)}$ are the corresponding importance-sampling ratios. To estimate $v_\pi(s)$, we scale the returns by the ratios and average the results:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}.$$
When importance sampling is done as a simple average in this way it is called ordinary importance sampling.
An important alternative is weighted importance sampling, which uses a weighted average, defined as
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},$$
or zero if the denominator is zero.
Consider the estimates after observing a single return. In the weighted-average estimate, the ratio $\rho_{t:T(t)-1}$ for the single return cancels in the numerator and denominator, so that the estimate is equal to the observed return independent of the ratio. As a result, its expectation is $v_b(s)$ rather than $v_\pi(s)$, and in this statistical sense it is biased. In contrast, the ordinary importance-sampling estimate is always $v_\pi(s)$ in expectation, but it can be extreme.
The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased. On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero even if the variance of the ratios themselves is infinite.
In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred. Ordinary importance sampling, however, is easier to extend to approximate methods using function approximation.
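A tiny numerical illustration of the two estimators just defined, using made-up returns and ratios (not data from the text):

```python
import numpy as np

# Suppose state s was visited at four time steps with these (made-up) returns and ratios.
returns = np.array([1.0, 0.0, 2.0, 1.0])   # G_t for t in T(s)
rhos    = np.array([0.5, 3.0, 0.0, 1.2])   # rho_{t:T(t)-1} for the same time steps

ordinary = (rhos * returns).sum() / len(returns)                        # simple average of scaled returns
weighted = (rhos * returns).sum() / rhos.sum() if rhos.sum() else 0.0   # weighted average (0 if denominator is 0)

print(ordinary, weighted)   # the two estimates generally differ
```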
Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis. For ordinary importance sampling, the incremental methods are the same sample-average updates shown earlier, applied to the ratio-scaled returns. For weighted importance sampling, the incremental algorithm is sketched below.
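Below is a sketch of incremental off-policy prediction with weighted importance sampling, here estimating $q_\pi$. The target-policy probability `pi(a, s)` and the behaviour-policy object `b` (with `sample` and `prob` methods) are placeholder assumptions, as is the environment interface.

```python
from collections import defaultdict

def off_policy_mc_prediction(env, pi, b, n_episodes=100_000, gamma=1.0):
    """Incremental off-policy every-visit MC prediction of q_pi (weighted importance sampling)."""
    Q = defaultdict(lambda: defaultdict(float))
    C = defaultdict(lambda: defaultdict(float))   # cumulative sums of the weights W
    for _ in range(n_episodes):
        # Generate an episode under the behaviour policy b.
        episode, state, done = [], env.reset(), False
        while not done:
            action = b.sample(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Walk backwards, maintaining the return G and the importance-sampling weight W.
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])   # incremental weighted-IS update
            W *= pi(a, s) / b.prob(a, s)               # extend the ratio one step earlier
            if W == 0.0:                               # zero weight: earlier steps contribute nothing
                break
    return Q
```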
Off-policy Monte Carlo control
Based on GPI and weighted importance sampling, the off-policy Monte Carlo control procedure for estimating $\pi \approx \pi_*$ and $Q \approx q_*$ is sketched below.
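A sketch with a greedy target policy and an $\varepsilon$-greedy behaviour policy, using the same hypothetical environment interface as above:

```python
from collections import defaultdict
import random

def off_policy_mc_control(env, epsilon=0.1, n_episodes=200_000, gamma=1.0):
    """Off-policy MC control: greedy target policy, epsilon-greedy behaviour policy,
    weighted importance sampling."""
    Q = defaultdict(lambda: defaultdict(float))
    C = defaultdict(lambda: defaultdict(float))
    n_actions = len(env.actions)
    target = {}                                            # greedy target policy pi

    for _ in range(n_episodes):
        episode, state, done = [], env.reset(), False
        while not done:
            greedy = target.get(state, env.actions[0])
            # Behaviour policy b: epsilon-greedy with respect to the current Q.
            action = random.choice(env.actions) if random.random() < epsilon else greedy
            b_prob = (1 - epsilon) * (action == greedy) + epsilon / n_actions
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state

        G, W = 0.0, 1.0
        for s, a, r, b_prob in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            target[s] = max(Q[s], key=Q[s].get)            # keep pi greedy w.r.t. Q
            if a != target[s]:                             # pi(a|s) = 0, so the weight becomes 0:
                break                                      # learning continues only from episode tails
            W *= 1.0 / b_prob                              # pi(a|s) = 1 for the greedy action
    return target, Q
```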
The behaviour policy $b$ can be anything, but in order to assure convergence of $\pi$ to the optimal policy, an infinite number of returns must be obtained for each pair of state and action. This can be assured by choosing $b$ to be $\varepsilon$-soft. A potential problem is that this method learns only from the tails of episodes, when all of the remaining actions in the episode are greedy. If nongreedy actions are common, then learning will be slow, particularly for states appearing in the early portions of long episodes. There has been insufficient experience with off-policy Monte Carlo methods to assess how serious this problem is. If it is serious, the most important way to address it is probably by incorporating temporal-difference learning.
Temporal-difference learning
Temporal-difference methods require no model and are fully incremental, but are more complex to analyze. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome.
TD prediction
Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for $V(S_t)$. A simple every-visit Monte Carlo method suitable for nonstationary environments is
$$V(S_t) \leftarrow V(S_t) + \alpha\bigl[G_t - V(S_t)\bigr],$$
where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter. This method is called constant-$\alpha$ MC. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update
$$V(S_t) \leftarrow V(S_t) + \alpha\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]$$
immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. In effect, the target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called TD(0), or one-step TD, because it is a special case of the TD($\lambda$) and $n$-step TD methods.
The procedure for TD(0) is sketched below.
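A minimal sketch of tabular TD(0) prediction, using the same hypothetical `env`/`policy` interface as the Monte Carlo sketches:

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=1.0, n_episodes=10_000):
    """Tabular TD(0) prediction of v_pi."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step bootstrapped target; terminal states have value zero.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])     # V(S) <- V(S) + alpha * TD error
            state = next_state
    return V
```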
Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP.
Roughly speaking, Monte Carlo methods use an estimate of $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$ as a target, whereas DP methods use an estimate of $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ as a target. The Monte Carlo target is an estimate because the expected value is not known; a sample return is used in place of the real expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because $v_\pi(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is used instead. The TD target is an estimate for both reasons: it samples the expected values and it uses the current estimate $V$ instead of the true $v_\pi$. Thus, TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity is called the TD error:
$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$
Notice that the TD error at each time is the error in the estimate made at that time. Because the TD error depends on the next state and next reward, it is not actually available until one time step later. Also note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:
$$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t}\,\delta_k.$$
This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately.
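The identity follows by adding and subtracting $\gamma V(S_{t+1})$ and unrolling, using $G_T = 0$ and the convention that the terminal state has value zero:
$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
             &= \delta_t + \gamma\bigl(G_{t+1} - V(S_{t+1})\bigr) \\
             &= \delta_t + \gamma\delta_{t+1} + \gamma^2\bigl(G_{t+2} - V(S_{t+2})\bigr) \\
             &= \delta_t + \gamma\delta_{t+1} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}\bigl(G_T - V(S_T)\bigr) \\
             &= \sum_{k=t}^{T-1} \gamma^{k-t}\,\delta_k .
\end{aligned}
$$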
SARSA: on-policy TD control
The SARSA action-value update rule is
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr].$$
The general form of the SARSA control algorithm is sketched below.
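A minimal sketch with an $\varepsilon$-greedy behaviour/target policy, using the same hypothetical environment interface as the earlier sketches:

```python
from collections import defaultdict
import random

def sarsa(env, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=10_000):
    """Tabular SARSA (on-policy TD control)."""
    Q = defaultdict(lambda: defaultdict(float))

    def eps_greedy(s):
        if random.random() < epsilon or not Q[s]:
            return random.choice(env.actions)
        return max(Q[s], key=Q[s].get)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            # Update uses the action actually selected at the next step (S, A, R, S', A').
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```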
Q-learning: off-policy TD control
Q-learning is defined by the update
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\bigr].$$
In this case, the learned action-value function, $Q$, directly approximates the optimal action-value function $q_*$, independent of the policy being followed.
The procedure for Q-learning is sketched below.
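The same hypothetical interface, but now the bootstrap target uses the greedy (max) action value of the next state regardless of the action actually taken:

```python
from collections import defaultdict
import random

def q_learning(env, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=10_000):
    """Tabular Q-learning (off-policy TD control)."""
    Q = defaultdict(lambda: defaultdict(float))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily ...
            if random.random() < epsilon or not Q[state]:
                action = random.choice(env.actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # ... but bootstrap from the maximum action value of the next state.
            best_next = max(Q[next_state].values()) if Q[next_state] else 0.0
            target = reward + (0.0 if done else gamma * best_next)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```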
Expected SARSA
Expected SARSA follows the schema of Q-learning, except that instead of the maximum over next-state action values it uses the expected value under the current policy; it moves deterministically in the same direction as SARSA moves in expectation:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Bigl[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr].$$
Expected SARSA is more complex computationally than SARSA, but it eliminates the variance due to the random selection of $A_{t+1}$.
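As an illustration of the target computation only, here is a hypothetical helper that forms the Expected SARSA target under an $\varepsilon$-greedy current policy (the function name and arguments are illustrative):

```python
def expected_sarsa_target(Q_next, actions, reward, gamma, epsilon, done):
    """Expected SARSA target: expectation of Q(S', a) under the epsilon-greedy policy.

    Q_next maps actions to their current estimates at the next state S'.
    """
    if done:
        return reward
    greedy = max(actions, key=lambda a: Q_next.get(a, 0.0))
    # Policy probabilities: (1 - eps) extra mass on the greedy action, eps spread uniformly.
    expected_q = sum(
        ((1 - epsilon) * (a == greedy) + epsilon / len(actions)) * Q_next.get(a, 0.0)
        for a in actions
    )
    return reward + gamma * expected_q
```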
Maximization bias and double learning
In Q-learning the target policy is the greedy policy given the current action values, which is defined with a max, and in SARSA the policy is often $\varepsilon$-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias. Consider a single state where there are many actions whose true values, $q(s, a)$, are all zero but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive: a positive bias. This is called maximization bias.
The maximization bias is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two independent estimates, call them $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$, for all $a \in \mathcal{A}$. We could then use one estimate, say $Q_1$, to determine the maximizing action $A^* = \arg\max_a Q_1(a)$, and the other, $Q_2$, to provide the estimate of its value, $Q_2(A^*) = Q_2(\arg\max_a Q_1(a))$. This estimate will then be unbiased in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$. We can also repeat the process with the roles of the two estimates reversed to yield a second unbiased estimate, $Q_1(\arg\max_a Q_2(a))$. This is the idea of double learning. Note that although we learn two estimates, only one estimate is updated on each play; double learning doubles the memory requirements, but does not increase the amount of computation per step.
Double Q-learning divides the time steps in two, perhaps by flipping a coin on each step. If the coin comes up heads, the update is
$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha\Bigl[R_{t+1} + \gamma\, Q_2\bigl(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\bigr) - Q_1(S_t, A_t)\Bigr].$$
If the coin comes up tails, then the same update is done with $Q_1$ and $Q_2$ switched, so that $Q_2$ is updated.
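A minimal sketch of tabular double Q-learning under the same hypothetical environment interface; here the behaviour policy is $\varepsilon$-greedy with respect to the sum of the two estimates (one reasonable choice among several):

```python
from collections import defaultdict
import random

def double_q_learning(env, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=10_000):
    """Tabular double Q-learning."""
    Q1 = defaultdict(lambda: defaultdict(float))
    Q2 = defaultdict(lambda: defaultdict(float))

    def eps_greedy(s):
        # Behave with respect to the sum of the two estimates.
        combined = {a: Q1[s][a] + Q2[s][a] for a in env.actions}
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(combined, key=combined.get)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = eps_greedy(state)
            next_state, reward, done = env.step(action)
            # Flip a coin: update one estimate, using the other to evaluate the argmax action.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                best = max(env.actions, key=lambda a: A[next_state][a])
                target = reward + gamma * B[next_state][best]
            A[state][action] += alpha * (target - A[state][action])
            state = next_state
    return Q1, Q2
```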