# Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a 5-tuple $$(S,A,P_a,R_a,\gamma)$$ where

• $$S$$ is the state space.
• $$A$$ is the action space.
• $$P_a(s,s') = \mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)$$, a map $$S \times A \times S \rightarrow [0,1]$$, is the transition probability of moving to state $$s'$$ from the current state $$s$$ under action $$a$$; it obeys the Markov property.
• $$R_a(s) = \mathbb{E}[R_{t+1}|S_t=s,A_t=a]$$, a map $$S \times A \rightarrow \mathbb{R}$$, is the expected immediate reward received after taking action $$a$$ in the current state $$s$$.
• $$\gamma \in [0,1]$$ is the discount factor for rewards.

An MDP is finite if the state, action and reward spaces are finite.

The above definition is from David Silver's UCL RL course.
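
As a concrete illustration, a finite MDP of this kind can be stored as plain tabular arrays. The sketch below assumes a made-up 2-state, 2-action example; the numbers carry no meaning beyond illustration.

```python
# A minimal sketch of a finite MDP as tabular arrays (hypothetical example).
import numpy as np

# P[s, a, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a); each P[s, a] is a distribution over s'.
P = np.array([
    [[0.9, 0.1],    # state 0, action 0
     [0.5, 0.5]],   # state 0, action 1
    [[0.2, 0.8],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a], the expected immediate reward.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # every transition row is a probability distribution
```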

In Sutton-Barto (Sutton and Barto 2018, p. 48), the MDP is pictured as a feedback loop between the agent and the environment: at time $$t$$ the agent receives a state $$s_t$$ and reward $$r_t$$ from the environment and then selects an action $$a_t$$. The MDP and agent thus give rise to a trajectory $$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots$$. The dynamics are defined by the distribution $$p(s',r|s,a) = \mathbb{P}(S_{t}=s', R_{t}=r|S_{t-1}=s,A_{t-1}=a)$$ over next states and rewards given the current state and the action taken; from it one derives the state-transition probabilities and the expected rewards for state-action pairs and for state-action-next-state triples via the product rule and marginalization.
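
A minimal sketch of that marginalization, assuming a separate made-up tabular array `dynamics[s, a, s', r_idx]` for $$p(s',r|s,a)$$ over a small finite reward set (all numbers illustrative):

```python
# Recovering p(s'|s,a), r(s,a), and r(s,a,s') from the four-argument dynamics.
import numpy as np

rewards = np.array([0.0, 1.0])            # the (assumed finite) set of reward values
dynamics = np.zeros((2, 2, 2, 2))         # dynamics[s, a, s', r_idx] = p(s', r | s, a)
dynamics[0, 0, 0, 0] = 0.9                # from s=0, a=0: stay in 0 with r=0 ...
dynamics[0, 0, 1, 1] = 0.1                # ... or move to 1 with r=1
dynamics[0, 1] = [[0.5, 0.0], [0.0, 0.5]]
dynamics[1, 0] = [[0.0, 0.2], [0.8, 0.0]]
dynamics[1, 1] = [[0.0, 0.0], [0.5, 0.5]]

# p(s' | s, a) = sum_r p(s', r | s, a)          (marginalize out the reward)
P = dynamics.sum(axis=3)

# r(s, a) = E[R_{t+1} | s, a] = sum_{s', r} r * p(s', r | s, a)
R_sa = (dynamics * rewards).sum(axis=(2, 3))

# r(s, a, s') = E[R_{t+1} | s, a, s'], defined where p(s' | s, a) > 0
with np.errstate(divide="ignore", invalid="ignore"):
    R_sas = np.where(P > 0, (dynamics * rewards).sum(axis=3) / P, 0.0)
```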

An MDP is considered solved if we have an optimal policy $$\pi^*(a|s) = \mathbb{P}(A_t=a|S_t=s)$$ that maximizes the expected discounted return $$\sum_{t=0}^\infty \gamma^t r_{t+1}$$, where $$r_t$$ is the reward received after the action taken at time $$t-1$$, i.e. $$a_{t-1}$$ (convention from Sutton and Barto 2018, p. 48). For any MDP, there always exists a deterministic optimal policy.

Denote by $$G_t$$ the reward accumulated from time $$t$$ onward, i.e. $$G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}$$, where $$H$$ is the horizon (the number of remaining steps), which may be infinite and may itself be a random variable. Note that $$G_t = R_{t+1} + \gamma G_{t+1}$$. For finite $$H$$, the agent-environment interaction breaks into subsequences called episodes, e.g. trips through a maze (Sutton and Barto 2018).
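
A small sketch of this return and its recursion for a finite, made-up reward sequence:

```python
# G_t = sum_{k=0}^{H} gamma^k R_{t+k+1}, and the recursion G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # R_1, R_2, R_3, R_4 for one short episode (made up)

def discounted_return(rewards, gamma, t=0):
    """Return G_t for a finite-horizon reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

G0 = discounted_return(rewards, gamma, t=0)
G1 = discounted_return(rewards, gamma, t=1)
assert abs(G0 - (rewards[0] + gamma * G1)) < 1e-12   # G_0 = R_1 + gamma * G_1
```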

There are two important value functions for MDPs, defined below.

1. State-value function: The expected cumulative rewards with policy $$\pi$$ starting from state $$s$$, $$V_\pi(s)=\mathbb{E}_{\pi}[G_t|S_t=s]$$.
2. Action-value function: (Q-function) The expected cumulative rewards with policy $$\pi$$ starting from state $$s$$, taking action $$a$$, $$Q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]$$.
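
Both expectations can be approximated by rolling out the policy and averaging sampled returns; a minimal Monte Carlo sketch for $$V_\pi$$ on the made-up 2-state MDP above, with an (assumed) uniform policy and truncated horizon:

```python
# Monte Carlo estimate of V_pi(s) = E_pi[G_t | S_t = s] (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]: a uniform policy, chosen arbitrarily
gamma, horizon, n_episodes = 0.9, 60, 3000

def rollout(s):
    """Sample one truncated episode starting from s; return its discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])      # A_t ~ pi(. | s)
        g += discount * R[s, a]         # accumulate the (expected) immediate reward
        s = rng.choice(2, p=P[s, a])    # S_{t+1} ~ p(. | s, a)
        discount *= gamma
    return g

V_est = np.array([np.mean([rollout(s) for _ in range(n_episodes)]) for s in range(2)])
```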

Importantly, we can write each in terms of the other.

$$V_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s] = \sum_g g\,p(g|s) = \sum_g \sum_a g\,p(g,a|s) = \sum_g \sum_a g\,p(g|s,a)\,p(a|s) = \sum_a p(a|s) \sum_g g\,p(g|s,a) = \sum_a \pi(a|s)\,Q_\pi(s,a)$$
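
In tabular form this is a policy-weighted average of Q-values over actions; a tiny sketch with made-up numbers:

```python
# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a), written as an array operation.
import numpy as np

pi = np.array([[0.5, 0.5], [0.2, 0.8]])   # pi[s, a] (made up)
Q  = np.array([[1.0, 3.0], [0.0, 2.0]])   # Q_pi[s, a] (made up)

V = (pi * Q).sum(axis=1)                  # V_pi[s]
```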

In the other direction, $$Q_\pi(s,a) = \mathbb{E}_\pi[G_t|S_t=s,A_t=a] = \mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a]$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g,s'|s,a)$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g|s',s,a)\,p(s'|s,a)$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_g g\,p(g|s')$$ (using the Markov property, so that the future return depends only on $$s'$$) $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_{s'} \Big(\sum_r p(s',r|s,a)\Big) V_\pi(s')$$ $$=\sum_{s'} \sum_r p(s',r|s,a)\,[r + \gamma V_\pi(s')]$$
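
With the reward summarized by its expectation $$R(s,a)$$ and transitions by $$p(s'|s,a)$$, the same identity reads $$Q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a)\,V_\pi(s')$$; a sketch with made-up values:

```python
# Q_pi(s, a) = R(s, a) + gamma * sum_{s'} p(s' | s, a) * V_pi(s').
import numpy as np

P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a] = E[R_{t+1} | s, a]
V = np.array([3.0, 5.0])                   # some value function V_pi (made up)
gamma = 0.9

Q = R + gamma * np.einsum("san,n->sa", P, V)   # Q[s, a]
```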

The optimal state-value function $$V^*(s)$$ and optimal action-value function $$Q^*(s,a)$$ are the maxima of these functions over the policy space. Every optimal policy achieves the optimal state-value and action-value functions.

One key tool for solving MDPs is the set of Bellman equations, which express the state/action value functions and the optimal state/action value functions in terms of themselves.
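
For example, the Bellman optimality equation $$V^*(s) = \max_a \big[R(s,a) + \gamma \sum_{s'} p(s'|s,a)\,V^*(s')\big]$$ can be iterated to a fixed point. A minimal value-iteration sketch on the made-up arrays from above, an illustration rather than a reference implementation:

```python
# Value iteration: repeatedly apply the Bellman optimality backup until convergence.
import numpy as np

P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum("san,n->sa", P, V)   # Q[s, a] under the current V
    V_new = Q.max(axis=1)                          # greedy maximization over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)   # a deterministic optimal policy (argmax per state)
```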

## Thoughts

• Value functions deserve their own node; mention how they induce a partial ordering on policies.
• Silver's course starts out with Markov processes, then Markov reward processes, then MDPs, which I think is a great way to lay out the ideas.
• Need to look at precise sigma-algebras and formal details when decomposing the value functions into immediate and future components by the Bellman expectation equation.
• Sutton-Barto (Sutton and Barto 2018, p. 26) defines $$R_t$$ as the reward from the action taken at time $$t$$, i.e. $$A_t$$, yet on p. 48, $$R_t$$ denotes the reward from the action taken at time $$t-1$$, i.e. $$A_{t-1}$$.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Created: 2022-03-13 Sun 21:44
