Multi-Armed Bandits (MABs)

The multi-armed bandit problem is a one-state MDP where the agent has \(k\) actions, each corresponding to some reward distribution \(R_i\) with mean \(\mu_i\).

The actions can be thought of as selecting a slot machine, thus the term bandit. The distributions can be stationary or non-stationary, below we discuss the stationary case.

Each action \(A_t\) at time \(t\) corresponds to an arm chosen, and there is a value function \(q_*(a) =\mathbb{E}[R_t|A_t=a]\) that represents the expected reward for each arm, in this case action, \(a\). If we happen to know \(q_*\) then the problem is solved, since we can simply choose the \(a = argmax_{a} q_*(a)\) however we do not have access to this, and we wish to estimate it. The estimated value of action \(a\) at time \(t\) is denoted \(Q_t(a)\) and we want this to be as close as possible to \(q_*(a)\).

A natural estimate for \(Q_t(a)\) is the following: \[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{I}_{A_i=a}} \] Where \(\mathbb{I}_P\) is the indicator variable for proposition \(P\). From here, the greedy action selection method would be to choose \(A_t = argmax_{a}Q_t(a)\), however one can see that assuming we selected some each action exactly once, this method exploits its current knowledge but does not explore further. Exploration is important since there is uncertainty in the reward distribution.

A strategy here is to use ε-greedy exploration.

An incremental approach to estimating \(Q_t(a)\) maybe favourable since then we would only have keep \(Q_{t-1}(a)\) in memory. This involves rewriting the formula above as \[ Q_t(a) = Q_{t-1}+\frac{1}{t}[R_t - Q_{t-1}(a)] \]

A similar update rule also occurs in Temporal-Difference (TD) learning. Here, \(R_t - Q_{t-1}(a)\) is called the error in the estimate.

Multi-Armed Bandits (MABs)

Thoughts