Reinforcement learning (RL)
A Markov Decision Process (MDP) with unknown dynamics, i.e. unknown state-transition and reward functions, is a reinforcement learning problem. Learning through trial and error, and the concept of delayed rewards, are important features of RL problems.
There are two main problems in RL:
- The prediction problem, i.e. estimating the value function for a given policy.
- The control problem, i.e. finding an optimal policy.
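The prediction problem can be illustrated with tabular TD(0) policy evaluation. The two-state chain below is a made-up example (the `step` dynamics and all constants are assumptions, not from the source): the agent observes only sampled transitions and estimates $V(s)$ for its (single-action) policy.

```python
import random

# Hypothetical two-state chain, seen only through samples; the learner
# never inspects the transition probabilities directly.
# step(s) returns (next_state, reward).
def step(s):
    if s == 0:
        # From state 0: move to state 1 w.p. 0.9, else stay; reward 0.
        return (1, 0.0) if random.random() < 0.9 else (0, 0.0)
    # From state 1: always return to state 0 with reward 1.
    return (0, 1.0)

# TD(0) prediction: incrementally estimate V(s) for the given policy
# from a single long stream of experience.
def td0_prediction(steps=10000, alpha=0.05, gamma=0.9):
    V = [0.0, 0.0]
    s = 0
    for _ in range(steps):
        s_next, r = step(s)
        # TD(0) update: move V(s) toward the bootstrapped target.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V
```

Here state 1 is worth more than state 0 (its reward arrives one step sooner), and the estimates converge near the analytic values $V(1) \approx 5.0$, $V(0) \approx 4.5$.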
Below are some general methods to approach RL problems:
- Model-based methods: Estimate the MDP dynamics, then apply MDP methods.
- Model-free methods: Methods based on learning from trial-and-error.
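A minimal sketch of the model-based route, assuming a made-up two-state, two-action simulator (the `sample` dynamics and all constants below are illustrative assumptions): estimate the transition probabilities and rewards from counts, then apply a standard MDP method (value iteration) to the estimated model.

```python
import random
from collections import defaultdict

# Hypothetical two-state, two-action MDP used only as a black-box
# simulator; the learner sees samples, never the true dynamics.
def sample(s, a):
    if a == 0:                       # "stay": small chance of moving
        s2 = s if random.random() < 0.8 else 1 - s
    else:                            # "switch": deterministic move
        s2 = 1 - s
    r = 1.0 if s2 == 1 else 0.0      # reward for landing in state 1
    return s2, r

# Model-based method: estimate P(s'|s,a) and R(s,a) from counts,
# then solve the *estimated* MDP with value iteration.
def model_based(n_samples=2000, gamma=0.9):
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    for _ in range(n_samples):
        s, a = random.randrange(2), random.randrange(2)
        s2, r = sample(s, a)
        counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
    # Empirical transition probabilities and mean rewards.
    P = {sa: {s2: c / sum(cs.values()) for s2, c in cs.items()}
         for sa, cs in counts.items()}
    R = {sa: reward_sums[sa] / sum(counts[sa].values()) for sa in counts}
    # Value iteration on the estimated model.
    V = [0.0, 0.0]
    for _ in range(200):
        V = [max(R[(s, a)] + gamma * sum(p * V[s2]
                 for s2, p in P[(s, a)].items())
                 for a in range(2))
             for s in range(2)]
    return V
```

Model-free methods (e.g. Q-learning below) skip the estimation step entirely and learn values or policies directly from the same stream of samples.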
Other taxonomies include:
- On-policy learning, which uses a policy while learning to optimize it and can be thought of as learning on the job, and off-policy methods, which have two policies: a behaviour policy that generates experience (e.g. some other agent's policy or an $ε$-greedy policy) and a target policy that is learned to be optimal. Here, one can learn the optimal policy without necessarily following it, i.e. by using suboptimal policies intermediately.
- Episodic tasks, which have a finite horizon \(T\) (often a random variable), and non-episodic/continuing tasks, which continue without limit.
- Value-based methods, which learn state/action-value functions that define an implicit policy; policy-based methods, which learn an explicit policy without relying on an intermediate value function; and actor-critic methods, which learn and use both an explicit value function and a policy.
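Tabular Q-learning ties several of the labels above together: it is model-free, value-based, and off-policy (the behaviour policy is $ε$-greedy, but the update bootstraps from the greedy target policy). The 5-state chain environment and all constants below are illustrative assumptions, not from the source.

```python
import random

# Hypothetical 5-state chain MDP: action 0 moves left, action 1 moves
# right; reaching the rightmost state ends the episode with reward 1.
N = 5

def step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N - 1
    return s2, (1.0 if done else 0.0), done

# Tabular Q-learning: the behaviour policy is epsilon-greedy, but the
# target in the update is max_a Q(s', a), i.e. the greedy target
# policy -- hence "off-policy".
def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < eps:
                a = random.randrange(2)             # explore
            else:
                a = 1 if Q[s][1] >= Q[s][0] else 0  # exploit (ties go right)
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

The learned Q-table defines an implicit policy (act greedily with respect to it), which in this chain prefers moving right in every non-terminal state.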
Thoughts
- Put backup diagrams, as discussed on pp. 59-60 of (Sutton and Barto 2018).