# Reinforcement learning (RL)

A Markov Decision Process (MDP) with unknown dynamics, i.e. unknown state-transition and reward functions, is a Reinforcement learning problem. Learning through trial and error, and the concept of delayed rewards, are important features of RL problems.

There are 2 main problems in RL:

- The **prediction** problem, i.e. estimating the value function for a given policy.
- The **control** problem, i.e. finding an optimal policy.
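As a minimal sketch of the prediction problem, tabular TD(0) evaluation of a fixed policy on a toy 5-state chain (the environment, step size, and episode count below are illustrative assumptions, not from the text):

```python
# Prediction sketch: tabular TD(0) evaluation of a fixed policy on a toy
# 5-state chain (the policy always moves right; reward 1 on reaching the end).
# The environment and hyperparameters are assumptions for illustration.

N_STATES = 5
GAMMA = 0.9   # discount factor
ALPHA = 0.1   # TD step size

def step(s):
    """Dynamics under the fixed policy: deterministic move right."""
    s_next = s + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1  # (s', r, done)

V = [0.0] * N_STATES  # value estimates V(s) for the given policy
for _ in range(1000):
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        # TD(0) update: move V(s) toward the bootstrapped target r + γ V(s')
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next
```

For this toy chain the estimates approach the discounted returns, $V(s) = γ^{3-s}$ for the non-terminal states.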

Below are some general methods to approach RL problems:

- Model-based methods: Estimate the MDP dynamics, then apply MDP methods.
- Model-free methods: Learn value functions or policies directly from trial and error, without estimating the dynamics.
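A sketch of the model-free approach: tabular Q-learning on the same kind of toy 5-state chain (the environment, exploration rate, and all hyperparameters are hypothetical choices for illustration):

```python
import random

# Model-free control sketch: tabular Q-learning on a toy 5-state chain
# (action 0 = left, action 1 = right; reward 1 on reaching the last state).
# No model of the dynamics is built; Q is learned purely from sampled steps.

N_STATES, N_ACTIONS = 5, 2
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.3  # discount, step size, exploration rate

def step(s, a):
    """Environment (unknown to the agent): deterministic chain dynamics."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

random.seed(0)
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(2000):
    s, done = 0, False
    while not done:
        # ε-greedy behaviour policy: explore with probability EPS
        if random.random() < EPS:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda a_: Q[s][a_])
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value at s'
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

# Greedy policy implied by the learned action values
greedy = [max(range(N_ACTIONS), key=lambda a_: Q[s][a_]) for s in range(N_STATES)]
```

On this chain the learned greedy policy moves right from every non-terminal state, illustrating how model-free trial and error recovers good behaviour without ever estimating the transition function.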

Other taxonomies include:

- **On-policy** learning uses a policy while learning how to optimize it, and can be thought of as learning on the job. **Off-policy** methods have two policies: a behaviour policy that generates experience (e.g. some other agent's policy or an $ε$-greedy policy) and a target policy, which is learned to be optimal. Here, one can learn the optimal policy without necessarily following it, i.e. by acting with suboptimal policies in the meantime.
- **Episodic** tasks have a finite horizon $T$, which is often a random variable; **non-episodic/continuing** tasks continue without limit.
- **Value-based** methods learn state/action-value functions, which give an implicit policy; **policy-based** methods do not rely on an intermediate value function and learn an explicit policy; **actor-critic** methods learn and use both an explicit value function and a policy.
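The on-policy/off-policy distinction shows up concretely in the bootstrap target of temporal-difference control: SARSA (the standard on-policy method) bootstraps from the action the behaviour policy actually takes, while Q-learning (off-policy) bootstraps from the greedy target policy. A sketch with hypothetical helper names:

```python
import random

# Sketch of the two kinds of TD control targets. Q is a table of
# action values, e.g. a list of per-state lists; helper names are
# illustrative, not a fixed API.

GAMMA, EPS = 0.9, 0.1

def eps_greedy(Q, s, n_actions):
    """Behaviour policy: random action with probability EPS, else greedy."""
    if random.random() < EPS:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def sarsa_target(Q, s_next, a_next, r):
    # On-policy (SARSA): bootstrap from the action a' the behaviour
    # policy actually takes in s'.
    return r + GAMMA * Q[s_next][a_next]

def q_learning_target(Q, s_next, r):
    # Off-policy (Q-learning): bootstrap from the greedy target policy,
    # regardless of what the behaviour policy does next.
    return r + GAMMA * max(Q[s_next])
```

The two targets coincide whenever the behaviour policy happens to pick the greedy action; they differ exactly on exploratory steps, which is why Q-learning can learn the optimal policy while following a suboptimal one.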

## Thoughts

- Add backup diagrams, as discussed on pp. 59-60 of (Sutton and Barto 2018).