Contextual Bandits

Contextual bandits generalize the Multi-Armed Bandit (MAB) setting: alongside a reward \(R_t\) at time \(t\), the agent also observes some state \(X_t\), called the context, which can be thought of as the situation the agent is currently in.

The agent thus has to learn how contexts and rewards are intertwined, so it can use the context information to select the best action.

This problem is also called associative search (Sutton and Barto 2018), since the agent has to search for the best actions and associate them with the contexts in which they work best.
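To make this concrete, here is a minimal sketch of an \(\epsilon\)-greedy agent that keeps a separate sample-average value estimate for every (context, action) pair. The environment is a hypothetical toy setup I am assuming for illustration: a handful of discrete contexts and Gaussian rewards whose means depend on the (context, action) pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: discrete contexts, Gaussian rewards.
n_contexts, n_actions = 3, 4
true_means = rng.normal(size=(n_contexts, n_actions))  # unknown to the agent

Q = np.zeros((n_contexts, n_actions))  # value estimates, one row per context
N = np.zeros((n_contexts, n_actions))  # visit counts for incremental averaging
epsilon = 0.1

for t in range(10_000):
    x = rng.integers(n_contexts)          # observe context X_t
    if rng.random() < epsilon:
        a = rng.integers(n_actions)       # explore: random action
    else:
        a = int(np.argmax(Q[x]))          # exploit: best action for this context
    r = rng.normal(true_means[x, a])      # receive reward R_t
    N[x, a] += 1
    Q[x, a] += (r - Q[x, a]) / N[x, a]    # incremental sample-average update

# The learned greedy policy maps each context to its estimated best action.
print("learned:", np.argmax(Q, axis=1), "true:", np.argmax(true_means, axis=1))
```

The key difference from a plain MAB agent is that the estimates are indexed by context as well as action, so each context effectively gets its own bandit problem.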

Similar to how contextual bandits extend MABs, once the selected actions are allowed to influence not just the reward but also, in this case, the subsequent contexts, we have the full reinforcement learning problem.

Thoughts

References

Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. MIT Press.

Author: Nazaal

Created: 2022-04-04 Mon 23:39
