Bandit problems
action selection
For a simple non-associative bandit algorithm, the averaging method is as
where we denote the estimated value of action  as ,
which approaches to  (expected or mean reward given 
). We denote the number of action  has been selected 
as . And  is the step size.
upper-confidence-bound action selection
action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. The upper-confidence-bound action selection selects action according to
where denotes the natural logarithm of , denotes the number of times that action has been selected prior to time , and the number controls the degree of exploration. If , then is considered to be a maximizing action.
The idea of UCB action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of ’s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action , with determining the confidence level. Each time is selected the uncertainty is presumably reduced: increments, and, as it appears in the denominator, the uncertainty term decreases. On the other hand, each time an action other than is selected, increases but does not; because appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
Gradient bandit algorithm
Both the and UCB estimate action values and use those estimates to select actions. We can also learn a numerical preference for each action , which we denote . The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over anohter is important, the action probabilities are determined according to a soft-max distribution:
where denotes the probability of taking action at time . Initially all preferences are the same (e.g., for all ) so that all actions have an equal probability of being selected.
On each step, after selecting action and receiving the reward , preferences are updated by
where is a step-size parameter, and is the average of all the rewards up through and including time , which can be computed incrementally. The term serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking in the future is increased, and if the reward is below baseline, then probability is decreased. The non-selected actions move in the opposite direction.
Associative search (contextual bandits)
In non-associative tasks, the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is non-stationary. However, in a general reinforcement learning task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations.
Associative search tasks are often now called contextual bandits in the literature. Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
Bayesian methods
Bayesian methods assume a known initial distribution over the action values and then update the distribution exactly after each step (assuming that the true action values are stationary). In general, the update computations can be very complex, but for certain special distributions (called conjugate priors) they are easy. One possibility is to then select actions at each step according to their posterior probability of being the best action. This method, sometimes called posterior sampling or Thompson sampling, often performs similarly to the best of the distribution-free methods described above.
In the Bayesian setting it is even conceivable to compute the optimal balance between exploration and exploitation. One can compute for any possible action the probability of each possible immediate reward and the resultant posterior distributions over action values. This evolving distribution becomes the information state of the problem. Given a horizon, say of 1000 steps, one can consider all possible actions, all possible resulting rewards, all possible next actions, all next rewards, and so on for all 1000 steps. Given the assumptions, the rewards and probabilities of each possible chain of events can be determined, and one need only pick the best. But the tree of possibilities grows extremely rapidly; even if there were only two actions and two rewards, the tree would have leaves. It is generally not feasible to perform this immense computation exactly, but perhaps it could be approximated efficiently. This approach would effectively turn the bandit problem into an instance of the full reinforcement learning problem. In the end, we may be able to use approximate reinforcement learning methods to approach this optimal solution.
Finite Markov decision process
MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made.
Finite MDP involves evaluative feedback, as in bandits, but also an associative aspect—choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to tradeoff immediate and delayed reward. Whereas in bandit problems we estimated the value of each action , in MDPs we estimate the value of each action in each state , or we estimate the value of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.
In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables and have well defined discrete probability distributions dependent only on the preceding state and action. That is, for particular values of these random variables, and , there is a probability of those values occurring at time , given particular values of the preceding state and action:
for all , and . The function is an ordinary deterministic function of four arguments, and it specifies a probability distribution for each choice of and ,
The probabilities given by the four-argument function completely characterize the dynamics of a finite MDP. The state-transition probabilities is
and the expected rewards for state-action pairs is
and the expectd rewards for state-action-next-state is
Return
For MDP, we seek to maximize the expected return, where the return, denoted , is defined as some specific function of the reward sequence.
In the simplest case the return is the sum of the rewards:
where is a final time step. This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks. In this case, using the sum of the rewards as the return is problematic.
The discounted return is defined as
where is a parameter, , called the discount rate.
Policies and value functions
Almost all reinforcement learning algorithms involve estimating value functions – functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The value functions are defined with respect to particular ways of acting, called polices.
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy at time , then is the probability that if .
The value of a state under a policy , denoted , is the expected return when starting in and following thereafter.
We call function the state-value function for policy .
Similarly, we define the value of taking action in state under a policy , denoted , as the expected return starting from , taking the action , and thereafter following policy
We call the action-value function for policy .
For any policy and any state , the following consistency condition holds between the value of and the value of its possible successor states
The above equation is the Bellman equation for . It expresses a relationship between the value of a state and the values of its successor states. The value function is the unique solution to its Bellman equation.
Optimal policies and optimal value functions
A policy is defined to be better than or equal to a policy if its expected return is greater than or equal to that of for all states, if and only if for all . There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by . They share the same state-value function, called the optimal state-value function, denoted , and defined as
for all .
Optimal policies also share the same optimal action-value function, denoted , and defined as
for all and . For the state-action pair , this function gives the expected return for taking action in state and thereafter following an optimal policy. Thus, we can write in terms of as follows
Because is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, ’s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman equation for , or the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
The last two equations are two forms of the Bellman optimality equation for . The Bellman optimality equation for is
For finite MDPs, the Bellman optimality equation for has a unique solution independent of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so if there are states, then there are equations in unknowns. If the dynamics of the environment are known, then in principle one can solve this system of equations for using any one of a variety of methods for solving systems of nonlinear equations. One can solve a related set of equations for .
Once one has , it is relatively easy to determine an optimal policy. For each state , there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. You can think of this as a one-step search.
Having makes choosing optimal actions even easier. With , the agent does not even have to do a one-step-ahead search: for any state , it can simply find any action that maximizes . The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state–action pair. Hence, at the cost of representing a function of state–action pairs, instead of just of states, the optimal action-value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment’s dynamics.
