
Dynamic-Programming Approach

If the agent is going to be acting for a total of $h$ steps, it can use basic Bayesian reasoning to solve for an optimal strategy [12]. This requires an assumed prior joint distribution for the parameters $p_i$, the most natural of which is that each $p_i$ is independently and uniformly distributed between 0 and 1. We compute a mapping from belief states (summaries of the agent's experiences during this run) to actions. Here, a belief state can be represented as a tabulation of action choices and payoffs: $\langle n_1, w_1, n_2, w_2, \ldots, n_k, w_k \rangle$ denotes a state of play in which each arm $i$ has been pulled $n_i$ times, yielding $w_i$ payoffs. We write $V^*(n_1, w_1, \ldots, n_k, w_k)$ for the expected payoff remaining, given that a total of $h$ pulls are available and that the remaining pulls are used optimally.

If $\sum_i n_i = h$, then there are no remaining pulls, and $V^*(n_1, w_1, \ldots, n_k, w_k) = 0$. This is the basis of a recursive definition. If we know the $V^*$ value for all belief states with $t$ pulls remaining, we can compute the $V^*$ value of any belief state with $t+1$ pulls remaining:

\begin{displaymath}
V^*(n_1, w_1, \ldots, n_k, w_k) = \max_i \Bigl[\, \rho_i \bigl(1 + V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k)\bigr) + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \,\Bigr]
\end{displaymath}

where $\rho_i$ is the posterior subjective probability of action $i$ paying off given $n_i$, $w_i$, and our prior probability. For the uniform priors, which result in a beta distribution, $\rho_i = (w_i + 1)/(n_i + 2)$.
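
To make the recursion concrete, the following is a minimal sketch of this dynamic program in Python, assuming Bernoulli arms with independent uniform (Beta(1,1)) priors; the function and variable names (optimal_remaining_payoff, v_star) are illustrative, not taken from the literature.

from functools import lru_cache


def optimal_remaining_payoff(belief, horizon):
    """V*(n_1, w_1, ..., n_k, w_k): expected payoff from the pulls that remain.

    `belief` is a tuple of (n_i, w_i) pairs: arm i has been pulled n_i times,
    paying off w_i times.  `horizon` is h, the total number of pulls in the run.
    """

    @lru_cache(maxsize=None)
    def v_star(state):
        pulls_used = sum(n for n, _ in state)
        if pulls_used == horizon:
            return 0.0  # no pulls remain
        best = 0.0
        for i, (n_i, w_i) in enumerate(state):
            # Posterior probability that arm i pays off, given n_i, w_i and the
            # uniform prior: the mean of the Beta(w_i + 1, n_i - w_i + 1) posterior.
            rho_i = (w_i + 1) / (n_i + 2)
            success = state[:i] + ((n_i + 1, w_i + 1),) + state[i + 1:]
            failure = state[:i] + ((n_i + 1, w_i),) + state[i + 1:]
            value = rho_i * (1.0 + v_star(success)) + (1.0 - rho_i) * v_star(failure)
            best = max(best, value)
        return best

    return v_star(tuple(belief))

For example, optimal_remaining_payoff(((0, 0), (0, 0)), 3) gives the expected number of payoffs an optimal strategy collects from three pulls of a fresh two-armed bandit.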

The expense of filling in the table of $V^*$ values in this way for all attainable belief states is linear in the number of belief states times the number of actions, and thus exponential in the horizon.
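
To see how the table grows with the horizon, the following self-contained sketch counts the belief states attainable within $h$ pulls of a $k$-armed bandit, i.e., the entries whose $V^*$ values would have to be filled in; count_belief_states is a hypothetical helper written for this illustration, not part of the survey.

def count_belief_states(k, h):
    """Count the distinct belief states ((n_1, w_1), ..., (n_k, w_k)) reachable in at most h pulls."""
    frontier = {tuple((0, 0) for _ in range(k))}
    seen = set(frontier)
    for _ in range(h):
        next_frontier = set()
        for state in frontier:
            for i, (n_i, w_i) in enumerate(state):
                for payoff in (0, 1):  # arm i either fails or pays off
                    child = state[:i] + ((n_i + 1, w_i + payoff),) + state[i + 1:]
                    if child not in seen:
                        seen.add(child)
                        next_frontier.add(child)
        frontier = next_frontier
    return len(seen)


if __name__ == "__main__":
    for h in range(1, 9):
        print(h, count_belief_states(2, h))  # table size for a two-armed bandit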


