
Dynamic-Programming Approach

If the agent is going to be acting for a total of $h$ steps, it can use basic Bayesian reasoning to solve for an optimal strategy [12]. This requires an assumed prior joint distribution for the parameters $p_i$, the most natural of which is that each $p_i$ is independently and uniformly distributed between 0 and 1. We compute a mapping from belief states (summaries of the agent's experiences during this run) to actions. Here, a belief state can be represented as a tabulation of action choices and payoffs: $\langle n_1, w_1, n_2, w_2, \ldots, n_k, w_k \rangle$ denotes a state of play in which each arm $i$ has been pulled $n_i$ times, yielding $w_i$ payoffs. We write $V^*(n_1, w_1, \ldots, n_k, w_k)$ for the expected payoff remaining, given that a total of $h$ pulls are available and that the remaining pulls are used optimally.

If $\sum_i n_i = h$, then there are no remaining pulls, and $V^*(n_1, w_1, \ldots, n_k, w_k) = 0$. This is the basis of a recursive definition. If we know the $V^*$ value for all belief states with $t$ pulls remaining, we can compute the $V^*$ value of any belief state with $t+1$ pulls remaining:

\begin{displaymath}
V^*(n_1, w_1, \ldots, n_k, w_k) = \max_i \Bigl[\, \rho_i \bigl(1 + V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k)\bigr) + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \,\Bigr]
\end{displaymath}

where $\rho_i$ is the posterior subjective probability of action $i$ paying off given $n_i$, $w_i$, and our prior probability. For the uniform priors, which result in a beta distribution, $\rho_i = (w_i + 1)/(n_i + 2)$.
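
To make the recursion concrete, the following is a minimal sketch of this dynamic program in Python, assuming Bernoulli arms with independent uniform (Beta(1,1)) priors; the function and variable names (optimal_remaining_payoff, v_star) are illustrative, not taken from the literature.

from functools import lru_cache


def optimal_remaining_payoff(belief, horizon):
    """V*(n_1, w_1, ..., n_k, w_k): expected payoff from the pulls that remain.

    `belief` is a tuple of (n_i, w_i) pairs: arm i has been pulled n_i times,
    paying off w_i times.  `horizon` is h, the total number of pulls in the run.
    """

    @lru_cache(maxsize=None)
    def v_star(state):
        pulls_used = sum(n for n, _ in state)
        if pulls_used == horizon:
            return 0.0  # no pulls remain
        best = 0.0
        for i, (n_i, w_i) in enumerate(state):
            # Posterior probability that arm i pays off, given n_i, w_i and the
            # uniform prior: the mean of the Beta(w_i + 1, n_i - w_i + 1) posterior.
            rho_i = (w_i + 1) / (n_i + 2)
            success = state[:i] + ((n_i + 1, w_i + 1),) + state[i + 1:]
            failure = state[:i] + ((n_i + 1, w_i),) + state[i + 1:]
            value = rho_i * (1.0 + v_star(success)) + (1.0 - rho_i) * v_star(failure)
            best = max(best, value)
        return best

    return v_star(tuple(belief))

For example, optimal_remaining_payoff(((0, 0), (0, 0)), 3) gives the expected number of payoffs an optimal strategy collects from three pulls of a fresh two-armed bandit.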

The expense of filling in the table of $V^*$ values in this way for all attainable belief states is linear in the number of belief states times the number of actions, and thus exponential in the horizon.
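
To see how the table grows with the horizon, the following self-contained sketch counts the belief states attainable within $h$ pulls of a $k$-armed bandit, i.e., the entries whose $V^*$ values would have to be filled in; count_belief_states is a hypothetical helper written for this illustration, not part of the survey.

def count_belief_states(k, h):
    """Count the distinct belief states ((n_1, w_1), ..., (n_k, w_k)) reachable in at most h pulls."""
    frontier = {tuple((0, 0) for _ in range(k))}
    seen = set(frontier)
    for _ in range(h):
        next_frontier = set()
        for state in frontier:
            for i, (n_i, w_i) in enumerate(state):
                for payoff in (0, 1):  # arm i either fails or pays off
                    child = state[:i] + ((n_i + 1, w_i + payoff),) + state[i + 1:]
                    if child not in seen:
                        seen.add(child)
                        next_frontier.add(child)
        frontier = next_frontier
    return len(seen)


if __name__ == "__main__":
    for h in range(1, 9):
        print(h, count_belief_states(2, h))  # table size for a two-armed bandit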


