In many reinforcement learning tasks the state space is combinatorial and enormous. The problem with large state spaces is not just the memory needed for large tables, but the time and data needed to fill them accurately. In many tasks, almost every state encountered will never have been seen before. To make sensible decisions in such states it is necessary to generalize from previous encounters with different states that are in some sense similar to the current one.

On-policy prediction with approximation

To use function approximation in estimating the state-value function from on-policy data, the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$. We will write $\hat{v}(s,\mathbf{w}) \approx v_\pi(s)$ for the approximate value of state $s$ given weight vector $\mathbf{w}$. Typically, the number of weights (the dimensionality $d$ of $\mathbf{w}$) is much less than the number of states ($d \ll |\mathcal{S}|$), and changing one weight changes the estimated value of many states. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states.

Let us refer to an individual update by the notation $s \mapsto u$, where $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.

In the tabular case a continuous measure of prediction quality was not necessary because the learned value function could come to equal the true value function exactly. Moreover, the learned values at each state were decoupled: an update at one state affected no other. But with genuine approximation, an update at one state affects many others, and it is not possible to get the values of all states exactly correct. By assumption we have far more states than weights, so making one state's estimate more accurate invariably means making others' less accurate. We are obligated, then, to say which states we care most about. We must specify a state weighting or distribution $\mu(s) \ge 0$, $\sum_s \mu(s) = 1$, representing how much we care about the error in each state $s$. By the error in a state we mean the square of the difference between the approximate value $\hat{v}(s,\mathbf{w})$ and the true value $v_\pi(s)$. Weighting this over the state space by $\mu$, we obtain a natural objective function, the Mean Squared Value Error, denoted $\overline{VE}$:

$\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \big[ v_\pi(s) - \hat{v}(s,\mathbf{w}) \big]^2$

Often $\mu(s)$ is chosen to be the fraction of time spent in $s$. Under on-policy training this is called the on-policy distribution.
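
As a concrete illustration of these definitions, here is a minimal Python sketch of a linear value-function approximator and the $\overline{VE}$ objective. The feature matrix, the "true" values, and the state weighting are all made up for the example; only the $\overline{VE}$ formula itself comes from the text above.

    import numpy as np

    # Hypothetical setup: 100 states, 8 features, linear approximation v_hat(s, w) = w . x(s).
    num_states, num_features = 100, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(num_states, num_features))   # row s is the feature vector x(s)
    v_pi = rng.normal(size=num_states)                # "true" values, assumed known for illustration
    mu = np.full(num_states, 1.0 / num_states)        # state weighting mu(s), summing to 1

    w = np.zeros(num_features)                        # weight vector, d << |S|
    v_hat = X @ w                                     # approximate values of all states at once

    # Mean Squared Value Error: sum over states of mu(s) * [v_pi(s) - v_hat(s, w)]^2
    ve = np.sum(mu * (v_pi - v_hat) ** 2)
    print(ve)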

But it is not completely clear that the $\overline{VE}$ is the right performance objective for reinforcement learning. Remember that our ultimate purpose, the reason we are learning a value function, is to find a better policy. The best value function for this purpose is not necessarily the best for minimizing $\overline{VE}$.

Stochastic-gradient and semi-gradient methods

SGD methods minimize error by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \big[ v_\pi(S_t) - \hat{v}(S_t,\mathbf{w}_t) \big]^2 = \mathbf{w}_t + \alpha \big[ v_\pi(S_t) - \hat{v}(S_t,\mathbf{w}_t) \big] \nabla \hat{v}(S_t,\mathbf{w}_t)$

where $\alpha$ is a positive step-size parameter, and $\nabla f(\mathbf{w})$, for any scalar expression $f(\mathbf{w})$, denotes the vector of partial derivatives with respect to the components of the weight vector.

We turn now to the case in which the target output, here denoted $U_t$, of the $t$th training example, $S_t \mapsto U_t$, is not the true value, $v_\pi(S_t)$, but some, possibly random, approximation to it. For example, $U_t$ might be a noise-corrupted version of $v_\pi(S_t)$, or it might be one of the bootstrapping targets using $\hat{v}$. In these cases we cannot perform the exact update because $v_\pi(S_t)$ is unknown, but we can approximate it by substituting $U_t$ in place of $v_\pi(S_t)$. This yields the following general SGD method for state-value prediction:

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ U_t - \hat{v}(S_t,\mathbf{w}_t) \big] \nabla \hat{v}(S_t,\mathbf{w}_t)$

The gradient Monte Carlo algorithm uses the Monte Carlo return $G_t$ as the update target $U_t$.
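
A minimal Python sketch of this procedure with a linear approximator might look as follows; it is not a definitive implementation, and the helpers `generate_episode()` (returning a list of (state, reward) pairs, the reward being the one received on leaving that state) and `features(state)` (returning a NumPy feature vector) are hypothetical stand-ins for whatever environment and representation are in use.

    import numpy as np

    def gradient_mc_prediction(generate_episode, features, num_features,
                               num_episodes=1000, alpha=0.01, gamma=1.0):
        """Gradient Monte Carlo state-value prediction with v_hat(s, w) = w . x(s)."""
        w = np.zeros(num_features)
        for _ in range(num_episodes):
            episode = generate_episode()          # [(S_0, R_1), (S_1, R_2), ...]
            G = 0.0
            # Work backwards so that G is the return from each visited state onward.
            for state, reward in reversed(episode):
                G = reward + gamma * G
                x = features(state)               # feature vector x(S_t)
                # SGD step: w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w);
                # for a linear approximator the gradient is simply x.
                w += alpha * (G - w @ x) * x
        return w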

Bootstrapping methods are not in fact instances of true gradient descent. They take into account the effect of changing the weight vector on the estimate, but ignore its effect on the target. Because they include only a part of the gradient, we call them semi-gradient methods.

The procedure for semi-gradient TD(0) uses the bootstrapping target $U_t \doteq R_{t+1} + \gamma\,\hat{v}(S_{t+1},\mathbf{w})$.
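
A minimal Python sketch of semi-gradient TD(0) with linear features follows. The environment interface (`env.reset()` and `env.step(action)` returning next state, reward, and a done flag), the `policy`, and `features` are assumptions made for the example.

    import numpy as np

    def semi_gradient_td0(env, policy, features, num_features,
                          num_episodes=1000, alpha=0.01, gamma=1.0):
        """Semi-gradient TD(0) state-value prediction with v_hat(s, w) = w . x(s)."""
        w = np.zeros(num_features)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                next_state, reward, done = env.step(policy(state))
                x = features(state)
                # Bootstrapped target R + gamma * v_hat(S', w); the next-state value is
                # treated as a constant, which is what makes the method *semi*-gradient.
                target = reward if done else reward + gamma * (w @ features(next_state))
                w += alpha * (target - w @ x) * x
                state = next_state
        return w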

Tips on artificial neural networks

The backpropagation algorithm can produce good results for shallow networks having one or two hidden layers, but it may not work well for deeper ANNs. Explaining results like these is not easy, but several factors are important. First, the large number of weights in a typical deep ANN makes it difficult to avoid the problem of overfitting, that is, the problem of failing to generalize correctly to cases on which the network has not been trained. Second, backpropagation does not work well for deep ANNs because the partial derivatives computed by its backward passes either decay rapidly toward the input side of the network, making learning by deep layers extremely slow, or grow rapidly toward the input side of the network, making learning unstable.

A particularly effective method for reducing overfitting by deep ANNs is the dropout method. During training, units are randomly removed from the network (dropped out) along with their connections. This can be thought of as training a large number of “thinned” networks. Combining the results of these thinned networks at test time is a way to improve generalization performance. The dropout method efficiently approximates this combination by multiplying each outgoing weight of a unit by the probability that unit was retained during training. The dropout method significantly improves generalization performance. It encourages individual hidden units to learn features that work well with random collections of other features. This increases the versatility of the features formed by the hidden units so that the network does not overly specialize to rarely-occurring cases.
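
A minimal sketch of the dropout idea for a single layer of hidden activations is given below. The shapes and keep probability are made up, and real implementations differ in details (many scale activations during training instead of at test time); this only illustrates the mechanism described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, keep_prob, training):
        """Randomly drop hidden units during training; at test time, approximate
        the average of the thinned networks by scaling each unit's output by the
        probability it was retained (equivalent to scaling its outgoing weights)."""
        if training:
            mask = rng.random(h.shape) < keep_prob   # True where the unit is retained
            return h * mask
        return h * keep_prob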

Another method is to train the deepest layers one at a time using an unsupervised learning algorithm. Without relying on the overall objective function, unsupervised learning can extract features that capture statistical regularities of the input stream. The deepest layer is trained first, then with input provided by this trained layer, the next deepest layer is trained, and so on, until the weights in all, or many, of the network’s layers are set to values that now act as initial values for supervised learning. The network is then fine-tuned by backpropagation with respect to the overall objective function. Studies show that this approach generally works much better than backpropagation with weights initialized with random values. The better performance of networks trained with weights initialized this way could be due to many factors, but one idea is that this method places the network in a region of weight space from which a gradient-based algorithm can make good progress.

Batch normalization is another technique that makes it easier to train deep ANNs. It has long been known that ANN learning is easier if the network input is normalized, for example, by adjusting each input variable to have zero mean and unit variance. Batch normalization for training deep ANNs normalizes the output of deep layers before they feed into the following layer. The method uses statistics from subsets, or "mini-batches," of training examples to normalize these between-layer signals, which improves the learning rate of deep ANNs.
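
The core normalization step for one layer's outputs over a mini-batch can be sketched as below; the learnable per-unit scale and shift and the running statistics used at test time, which full batch normalization also maintains, are simplified to fixed scalars here.

    import numpy as np

    def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
        """Normalize a mini-batch of layer outputs z (shape: batch_size x num_units)
        to zero mean and unit variance per unit, then rescale and shift."""
        mean = z.mean(axis=0)
        var = z.var(axis=0)
        z_hat = (z - mean) / np.sqrt(var + eps)
        return gamma * z_hat + beta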

Another technique useful for training deep ANNs is deep residual learning. Sometimes it is easier to learn how a function differs from the identity function than to learn the function itself. Then adding this difference, or residual function, to the input produces the desired function. In deep ANNs, a block of layers can be made to learn a residual function simply by adding shortcut, or skip, connections around the block. These connections add the input to the block to its output, and no additional weights are needed.
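
A sketch of the shortcut connection: if `block(x)` stands for whatever stack of layers sits inside the block (a placeholder here), the skip connection simply adds the block's input to its output, so the layers only have to learn the residual.

    def residual_block(x, block):
        """The block's layers learn the residual F(x) = block(x); the shortcut
        (skip) connection adds the input back, giving x + F(x), with no extra
        weights needed when the input and output shapes match."""
        return x + block(x)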

On-policy control with approximation

Episodic semi-gradient control

The approximate action-value function $\hat{q} \approx q_\pi$ is represented as a parameterized functional form with weight vector $\mathbf{w}$. Consider random training examples of the form $S_t, A_t \mapsto U_t$. The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo return, $G_t$, or any of the n-step SARSA returns. The general gradient-descent update for action-value prediction is

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ U_t - \hat{q}(S_t,A_t,\mathbf{w}_t) \big] \nabla \hat{q}(S_t,A_t,\mathbf{w}_t)$

For example, the update for the one-step SARSA method is

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \big[ R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat{q}(S_t,A_t,\mathbf{w}_t) \big] \nabla \hat{q}(S_t,A_t,\mathbf{w}_t)$

We call this method episodic semi-gradient one-step SARSA.

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection.
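
To make this coupling concrete, here is a minimal Python sketch of episodic semi-gradient one-step SARSA with a linear action-value approximator and ε-greedy action selection. The environment interface and the `features(state, action)` map are hypothetical assumptions, not part of the text.

    import numpy as np

    def semi_gradient_sarsa(env, features, num_features, num_actions,
                            num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Episodic semi-gradient one-step SARSA with q_hat(s, a, w) = w . x(s, a)."""
        rng = np.random.default_rng(0)
        w = np.zeros(num_features)

        def q_hat(state, action):
            return w @ features(state, action)

        def epsilon_greedy(state):
            if rng.random() < epsilon:
                return int(rng.integers(num_actions))
            return int(np.argmax([q_hat(state, a) for a in range(num_actions)]))

        for _ in range(num_episodes):
            state = env.reset()
            action = epsilon_greedy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                x = features(state, action)
                if done:
                    target = reward                      # no bootstrap from terminal states
                else:
                    next_action = epsilon_greedy(next_state)
                    target = reward + gamma * q_hat(next_state, next_action)
                w += alpha * (target - w @ x) * x        # semi-gradient SARSA update
                if not done:
                    state, action = next_state, next_action
        return w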

n-step semi-gradient SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target in the semi-gradient SARSA update equation. The n-step return immediately generalizes from its tabular form to a function approximation form:

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} \hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}), \quad t+n < T,$

with $G_{t:t+n} \doteq G_t$ if $t+n \ge T$, as usual. The n-step update equation is

$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \big[ G_{t:t+n} - \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}), \quad 0 \le t < T$
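
As a small illustration of how the n-step target might be assembled in code, the helper below folds a buffer of the n stored rewards into the discounted sum and adds the discounted bootstrap value; the argument names are made up for the sketch.

    def n_step_sarsa_return(rewards, q_bootstrap, gamma):
        """rewards = [R_{t+1}, ..., R_{t+n}]; q_bootstrap = q_hat(S_{t+n}, A_{t+n}, w),
        or 0.0 if the episode terminated within the n steps."""
        G = q_bootstrap
        for r in reversed(rewards):     # accumulate from the last reward backwards
            G = r + gamma * G
        return G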

n-step differential semi-gradient SARSA

In order to generalize to n-step bootstrapping, we need an n-step version of the TD error,

$\delta_t \doteq G_{t:t+n} - \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}),$

where $G_{t:t+n}$ is the n-step differential return: the sum of the $n$ differential rewards $R_{t+i} - \bar{R}$ (each reward minus the current estimate of the average reward) plus the bootstrap value $\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1})$.

Off-policy methods with approximation

Semi-gradient methods

The one-step state-value algorithm is semi-gradient off-policy TD(0), which is just like the corresponding on-policy algorithm except for the addition of $\rho_t$:

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t,\mathbf{w}_t)$

where $\rho_t \doteq \pi(A_t \mid S_t) / b(A_t \mid S_t)$ is the per-step importance sampling ratio and $\delta_t$ is defined appropriately depending on whether the problem is episodic and discounted, or continuing and undiscounted using average reward:

$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t)$ (episodic, discounted), or

$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t)$ (continuing, undiscounted, average reward).
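
A minimal Python sketch of semi-gradient off-policy TD(0) for the episodic, discounted case follows. The environment interface, the behavior policy, and the functions returning the two policies' action probabilities are hypothetical stand-ins.

    import numpy as np

    def off_policy_td0(env, behavior_policy, pi_prob, b_prob, features, num_features,
                       num_episodes=1000, alpha=0.01, gamma=1.0):
        """Semi-gradient off-policy TD(0): actions come from the behavior policy b,
        and each update is weighted by rho_t = pi(A_t|S_t) / b(A_t|S_t)."""
        w = np.zeros(num_features)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = behavior_policy(state)
                next_state, reward, done = env.step(action)
                rho = pi_prob(state, action) / b_prob(state, action)
                x = features(state)
                v_next = 0.0 if done else w @ features(next_state)
                delta = reward + gamma * v_next - w @ x   # TD error (episodic, discounted)
                w += alpha * rho * delta * x              # importance-weighted semi-gradient step
                state = next_state
        return w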

For action-values, the one-step algorithm is semi-gradient expected SARSA:

$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t,A_t,\mathbf{w}_t), \quad \text{with} \quad \delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\,\hat{q}(S_{t+1},a,\mathbf{w}_t) - \hat{q}(S_t,A_t,\mathbf{w}_t)$

in the episodic case (in the continuing, average-reward case the discounting is replaced by subtracting $\bar{R}_t$ from the reward).

Note that this algorithm does not use importance sampling. In the tabular case it is clear that this is appropriate because the only sample action is $A_t$, and in learning its value we do not have to consider any other actions. With function approximation it is less clear, because we might want to weight different state–action pairs differently once they all contribute to the same overall approximation.

In the multi-step generalizations of these algorithms, both the state-value and action-value algorithms involve importance sampling. For example, the n-step version of semi-gradient expected SARSA is

$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n-1} \big[ G_{t:t+n} - \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})$

The semi-gradient version of the n-step tree-backup algorithm, which does not involve importance sampling, has the same form but drops the importance sampling ratios and uses the tree-backup n-step return as its target:

$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \big[ G_{t:t+n} - \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1}) \big] \nabla \hat{q}(S_t,A_t,\mathbf{w}_{t+n-1})$

The deadly triad

The danger of instability and divergence arises whenever we combine all of the following three elements, making up what we call the deadly triad:

  • Function approximation: A powerful, scalable way of generalizing from a state space much larger than the memory and computational resources can handle

  • Bootstrapping: Update targets that include existing estimates (as in dynamic programming or TD methods) rather than relying exclusively on actual rewards and complete returns (as in MC methods)

  • Off-policy training: Training on a distribution of transitions other than that produced by the target policy. Sweeping through the state space and updating all states uniformly, as in dynamic programming, does not respect the target policy and is an example of off-policy training

In particular, note that the danger is not due to control, or to generalized policy iteration. Those cases are more complex to analyze, but the instability arises in the simpler prediction case whenever it includes all three elements of the deadly triad. The danger is also not due to learning or to uncertainties about the environment, because it occurs just as strongly in planning methods, such as dynamic programming, in which the environment is completely known.

If any two elements of the deadly triad are present, but not all three, then instability can be avoided. It is natural, then, to go through the three and see if there is any one that can be given up.

Of the three, function approximation most clearly cannot be given up. We need methods that scale to large problems and to great expressive power.

Doing without bootstrapping is possible, at the cost of computational and data efficiency. Perhaps most important are the losses in computational efficiency. The losses in data efficiency from giving up bootstrapping are also significant. Bootstrapping often results in faster learning because it allows learning to take advantage of the state property, the ability to recognize a state upon returning to it. On the other hand, bootstrapping can impair learning on problems where the state representation is poor and causes poor generalization.

Finally, there is off-policy learning. On-policy methods are often adequate. For model-free reinforcement learning, one can simply use SARSA rather than Q-learning. Off-policy methods free behavior from the target policy; this could be considered an appealing convenience but not a necessity. However, off-policy learning is essential to other anticipated use cases, in which the agent learns not just a single value function and single policy, but large numbers of them in parallel.

Bellman error

The Bellman equation for the value function $v_\pi$ is

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \big[ r + \gamma v_\pi(s') \big], \quad \text{for all } s \in \mathcal{S}.$

$v_\pi$ is the only value function that solves this equation exactly. If an approximate value function $v_\mathbf{w}$ were substituted for $v_\pi$, the difference between the right and left sides of the modified equation could be used as a measure of how far off $v_\mathbf{w}$ is from $v_\pi$. We call this the Bellman error at state $s$:

$\bar{\delta}_\mathbf{w}(s) \doteq \Big( \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \big[ r + \gamma v_\mathbf{w}(s') \big] \Big) - v_\mathbf{w}(s)$

We can see that the Bellman error is the expectation of the TD error $\delta_t$.

The vector of all the Bellman errors, at all states, $\bar{\delta}_\mathbf{w} \in \mathbb{R}^{|\mathcal{S}|}$, is called the Bellman error vector. The overall size of this vector, in the norm $\|\cdot\|^2_\mu$, is an overall measure of the error in the value function, called the Mean Squared Bellman Error:

$\overline{BE}(\mathbf{w}) = \| \bar{\delta}_\mathbf{w} \|^2_\mu = \sum_{s} \mu(s)\, \bar{\delta}_\mathbf{w}(s)^2$

The Bellman operator $B_\pi : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ is defined by

$(B_\pi v)(s) \doteq \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \big[ r + \gamma v(s') \big]$

for all $s \in \mathcal{S}$ and $v \in \mathbb{R}^{|\mathcal{S}|}$. The Bellman error vector for $v_\mathbf{w}$ can be written $\bar{\delta}_\mathbf{w} = B_\pi v_\mathbf{w} - v_\mathbf{w}$.
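
For a concrete check of these definitions, the sketch below builds a tiny made-up MDP, applies the Bellman operator to an arbitrary approximate value function, and computes the Bellman error vector and the $\overline{BE}$. All numbers are invented purely for illustration.

    import numpy as np

    num_states, num_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a, s']
    R = rng.normal(size=(num_states, num_actions))                          # expected reward r(s, a)
    pi = np.full((num_states, num_actions), 1.0 / num_actions)              # target policy pi(a|s)
    mu = np.full(num_states, 1.0 / num_states)                              # state weighting mu(s)

    def bellman_operator(v):
        """(B_pi v)(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_s' p(s'|s, a) v(s') ]"""
        return np.einsum('sa,sa->s', pi, R + gamma * P @ v)

    v_w = rng.normal(size=num_states)                 # some approximate value function
    bellman_error = bellman_operator(v_w) - v_w       # Bellman error vector
    be = np.sum(mu * bellman_error ** 2)              # Mean Squared Bellman Error
    print(bellman_error, be)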

If the Bellman operator is applied to a value function in the representable subspace, then, in general, it will produce a new value function that is outside the subspace. In dynamic programming (without function approximation), this operator is applied repeatedly to points outside the representable space. Eventually that process converges to the true value function $v_\pi$, the only fixed point of the Bellman operator, the only value function for which

$v_\pi = B_\pi v_\pi,$

which is just another way of writing the Bellman equation for $v_\pi$.

Attempts to avoid instability in the deadly triad

The most popular approach has been to seek to perform true stochastic gradient descent in the Bellman error. However, this turns out not to be an appealing goal in many cases, and in any case it is impossible to achieve with a learning algorithm: the gradient of the $\overline{BE}$ is not learnable from experience that reveals only feature vectors and not underlying states. Another approach, gradient-TD methods, performs SGD in the projected Bellman error. The gradient of the $\overline{PBE}$ is learnable with $O(d)$ complexity, but at the cost of a second parameter vector with a second step size. The newest family of methods, emphatic-TD methods, refines an old idea for reweighting updates, emphasizing some and de-emphasizing others. In this way they restore the special properties that make on-policy learning stable with computationally simple semi-gradient methods.