# Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a 5-tuple \((S,A,P_a,R_a,\gamma)\) where

- \(S\) is the **state space**.
- \(A\) is the **action space**.
- \(P_a(s,s') : S \times A \times S \rightarrow [0,1]\), with \(P_a(s,s') = \mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)\), is the **transition probability** of the next state \(s'\) given the current state \(s\) under action \(a\); it obeys the Markov property.
- \(R_a(s) : S \times A \rightarrow \mathbb{R}\), with \(R_a(s) = \mathbb{E}[R_{t+1}|S_t=s,A_t=a]\), is the (expected) immediate **reward** for taking action \(a\) in state \(s\).
- \(\gamma \in [0,1]\) is the **discount factor** for rewards.

An MDP is finite if the state, action and reward spaces are finite.
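As a concrete illustration of the tuple above, a finite MDP can be written down explicitly as tables. The two-state, two-action example below (including the state and action names) is entirely made up:

```python
# A tiny finite MDP written out explicitly; all names and numbers are illustrative.
S = ["s0", "s1"]          # state space
A = ["stay", "move"]      # action space
gamma = 0.9               # discount factor

# P[a][(s, s_next)] = P(S_{t+1} = s_next | S_t = s, A_t = a)
P = {
    "stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
    "move": {("s0", "s1"): 0.9, ("s0", "s0"): 0.1,
             ("s1", "s0"): 0.9, ("s1", "s1"): 0.1},
}

# R[a][s] = expected immediate reward for taking action a in state s
R = {
    "stay": {"s0": 0.0, "s1": 1.0},
    "move": {"s0": 0.0, "s1": 0.0},
}

# Sanity check: for each (state, action), the next-state distribution sums to 1
for a in A:
    for s in S:
        total = sum(p for (s_from, _), p in P[a].items() if s_from == s)
        assert abs(total - 1.0) < 1e-12
```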

The above definition is from David Silver's UCL RL course.

In Sutton-Barto (Sutton and Barto 2018) p.48, the MDP is imagined as a feedback loop between the agent and environment: at time \(t\) the agent receives a state \(s_t\) and reward \(r_t\) from the environment, then generates an action \(a_t\). The MDP and agent thus give rise to a **trajectory** \(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots\). The **dynamics** is defined as the distribution \(p(s', r \mid s, a) = \mathbb{P}(S_{t}=s', R_{t}=r|S_{t-1}=s,A_{t-1}=a)\) over states and rewards given the current state and the action taken; from it one derives the state transition probabilities and the expected rewards for state-action pairs and state-action-next-state triples via the product rule and marginalization.

An MDP is considered solved if we have an **optimal policy** \(\pi^*(a|s) = \mathbb{P}(a_t=a|s_t=s)\) that maximizes the cumulative discounted future reward \(\sum_{t=0}^\infty \gamma^t r_{t+1}\), where \(r_t\) is the reward **following the action taken at time** \(t-1\), i.e. \(a_{t-1}\) (convention from Sutton-Barto (Sutton and Barto 2018) p.48). For any MDP, there always exists a deterministic optimal policy.

Denote by \(G_t\) the reward accumulated from time \(t\), i.e. \(G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}\), where \(H\) is the horizon, i.e. the number of steps, which could be infinite. Note that \(G_t = R_{t+1} + \gamma G_{t+1}\). For finite \(H\), the agent-environment interaction breaks into subsequences called **episodes**, e.g. trips through a maze (Sutton and Barto 2018); \(H\) could itself be a random variable.

There are two important concepts for MDPs, defined below.

**State-value function**: the expected cumulative reward under policy \(\pi\) starting from state \(s\), \(V_\pi(s)=\mathbb{E}_{\pi}[G_t|S_t=s]\). **Action-value function** (Q-function): the expected cumulative reward under policy \(\pi\) starting from state \(s\) and taking action \(a\), \(Q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]\).
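The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) gives the standard backward pass for computing the return at every step of a finished episode; the reward sequence in the example below is made up:

```python
def discounted_returns(rewards, gamma):
    """Given the rewards [r_1, ..., r_H] of an episode, return [G_0, ..., G_{H-1}]
    computed via the recursion G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# Example: a three-step episode with gamma = 0.5
print(discounted_returns([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```

Averaging such sampled returns over many episodes that start from a given state is precisely the Monte Carlo estimate of \(V_\pi(s)\).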

Importantly, we can write each in terms of the other.

\[ V_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s] = \sum_g g\,p(g|s) = \sum_g \sum_a g\,p(g,a|s) = \sum_g \sum_a g\,p(g|s,a)\,p(a|s) = \sum_a p(a|s) \sum_g g\,p(g|s,a) = \sum_a \pi(a|s)\,Q_\pi(s,a) \]
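Numerically, the identity \(V_\pi(s) = \sum_a \pi(a|s)\,Q_\pi(s,a)\) is just a policy-weighted average over each row of the Q-table. The tables below are made-up numbers, not derived from any particular MDP:

```python
import numpy as np

pi = np.array([[0.5, 0.5],   # pi[s, a] = probability of action a in state s
               [0.2, 0.8]])
Q = np.array([[1.0, 3.0],    # Q[s, a], hypothetical action values
              [0.0, 2.0]])

# V(s) = sum_a pi(a|s) Q(s, a): elementwise product, then sum over actions
V = (pi * Q).sum(axis=1)
print(V)  # [2.  1.6]
```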

Similarly,

\[
\begin{aligned}
Q_\pi(s,a) &= \mathbb{E}_\pi[G_t|S_t=s,A_t=a] = \mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a] \\
&= \sum_r \sum_{s'} r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g,s'|s,a) \\
&= \sum_r \sum_{s'} r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g|s')\,p(s'|s,a) && \text{(Markov property)} \\
&= \sum_r \sum_{s'} r\,p(s',r|s,a) + \gamma \sum_{s'} \Big(\sum_r p(s',r|s,a)\Big) \sum_g g\,p(g|s') \\
&= \sum_r \sum_{s'} r\,p(s',r|s,a) + \gamma \sum_{s'} \sum_r p(s',r|s,a)\,V_\pi(s') \\
&= \sum_{s'} \sum_r p(s',r|s,a)\,[r + \gamma V_\pi(s')]
\end{aligned}
\]
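Both identities can be checked numerically: solve the Bellman expectation equation \(V_\pi = r_\pi + \gamma P_\pi V_\pi\) as a linear system, recover \(Q_\pi\) from the final expression for \(Q_\pi\) above, and confirm \(V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s,a)\). The two-state MDP below is a made-up example:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP: P[a] is the transition matrix under
# action a, and R[a][s] is the expected immediate reward for (s, a)
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.7, 0.3]])}
R = {0: np.array([0.0, 1.0]),
     1: np.array([0.5, 0.0])}
pi = np.array([[0.5, 0.5],   # pi[s, a]
               [0.3, 0.7]])

# Policy-averaged model: P_pi[s, s'] = sum_a pi(a|s) P[a][s, s'], similarly r_pi
P_pi = sum(pi[:, a][:, None] * P[a] for a in (0, 1))
r_pi = sum(pi[:, a] * R[a] for a in (0, 1))

# Bellman expectation equation V = r_pi + gamma P_pi V  <=>  (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q_pi(s, a) = R_a(s) + gamma * sum_{s'} P_a(s, s') V_pi(s')
Q = np.stack([R[a] + gamma * P[a] @ V for a in (0, 1)], axis=1)

# Consistency check: V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
assert np.allclose(V, (pi * Q).sum(axis=1))
```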

The optimal state-value function \(V^*(s)\) and optimal action-value function \(Q^*(s,a)\) are the maxima of \(V_\pi\) and \(Q_\pi\) over the policy space. All optimal policies achieve the optimal state-value and action-value functions.

One key tool for solving MDPs is the set of Bellman equations, which express the state/action-value functions and the optimal state/action-value functions recursively in terms of themselves.
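As a preview of how the Bellman optimality equation is used in practice, value iteration repeatedly applies the backup \(V(s) \leftarrow \max_a \big[\,r(s,a) + \gamma \sum_{s'} p(s'|s,a)\,V(s')\,\big]\) until convergence. The small MDP below is again a made-up example:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.7, 0.3]]])
R = np.array([[0.0, 0.5],
              [1.0, 0.0]])

V = np.zeros(2)
for _ in range(10_000):
    # Bellman optimality backup over all states at once:
    # (P @ V)[s, a] = sum_s' P[s, a, s'] V[s']
    V_new = (R + gamma * (P @ V)).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# A greedy (deterministic) optimal policy extracted from the converged V
policy = (R + gamma * (P @ V)).argmax(axis=1)
print(V, policy)
```

Because \(\gamma < 1\), the backup is a contraction in the sup norm, so the iteration converges to \(V^*\) from any initialization.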

## Thoughts

- Value functions deserve their own node; mention how they induce a partial ordering on policies.
- The course starts out with Markov processes, then Markov reward processes, then MDPs, which I think is a great way to lay out the ideas.
- Need to look at precise sigma-algebras and formal details when decomposing the value functions into immediate and future components by the Bellman expectation equation.
- Sutton-Barto (Sutton and Barto 2018) p.26 states that \(R_t\) is the reward from the action taken at time \(t\), i.e. \(A_t\), yet on p.48, \(R_t\) denotes the reward from the action taken at time \(t-1\), i.e. \(A_{t-1}\).