# Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a 5-tuple $$(S,A,P_a,R_a,\gamma)$$ where

• $$S$$ is the state space.
• $$A$$ is the action space.
• $$P_a(s,s') = \mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)$$, a map $$S \times A \times S \rightarrow [0,1]$$, is the transition probability of moving to state $$s'$$ from the current state $$s$$ under action $$a$$; it obeys the Markov property.
• $$R_a(s) = \mathbb{E}[R_{t+1}|S_t=s,A_t=a]$$, a map $$S \times A \rightarrow \mathbb{R}$$, is the expected immediate reward received after taking action $$a$$ in the current state $$s$$.
• $$\gamma \in [0,1]$$ is the discount factor for rewards.

An MDP is finite if the state, action and reward spaces are finite.

The above definition is from David Silver's UCL RL course.
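
As a concrete illustration, a finite MDP of this kind can be stored as plain tabular arrays. The sketch below assumes a made-up 2-state, 2-action example; the numbers carry no meaning beyond illustration.

```python
# A minimal sketch of a finite MDP as tabular arrays (hypothetical example).
import numpy as np

# P[s, a, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a); each P[s, a] is a distribution over s'.
P = np.array([
    [[0.9, 0.1],    # state 0, action 0
     [0.5, 0.5]],   # state 0, action 1
    [[0.2, 0.8],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a], the expected immediate reward.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # every transition row is a probability distribution
```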

In Sutton-Barto (Sutton and Barto 2018, p. 48), the MDP is pictured as a feedback loop between the agent and the environment: at time $$t$$ the agent receives a state $$s_t$$ and reward $$r_t$$ from the environment and then selects an action $$a_t$$. The MDP and agent thus give rise to a trajectory $$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots$$. The dynamics are defined by the distribution $$p(s',r|s,a) = \mathbb{P}(S_{t}=s', R_{t}=r|S_{t-1}=s,A_{t-1}=a)$$ over next states and rewards given the current state and the action taken; from it one derives the state-transition probabilities and the expected rewards for state-action pairs and for state-action-next-state triples via the product rule and marginalization.
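
A minimal sketch of that marginalization, assuming a separate made-up tabular array `dynamics[s, a, s', r_idx]` for $$p(s',r|s,a)$$ over a small finite reward set (all numbers illustrative):

```python
# Recovering p(s'|s,a), r(s,a), and r(s,a,s') from the four-argument dynamics.
import numpy as np

rewards = np.array([0.0, 1.0])            # the (assumed finite) set of reward values
dynamics = np.zeros((2, 2, 2, 2))         # dynamics[s, a, s', r_idx] = p(s', r | s, a)
dynamics[0, 0, 0, 0] = 0.9                # from s=0, a=0: stay in 0 with r=0 ...
dynamics[0, 0, 1, 1] = 0.1                # ... or move to 1 with r=1
dynamics[0, 1] = [[0.5, 0.0], [0.0, 0.5]]
dynamics[1, 0] = [[0.0, 0.2], [0.8, 0.0]]
dynamics[1, 1] = [[0.0, 0.0], [0.5, 0.5]]

# p(s' | s, a) = sum_r p(s', r | s, a)          (marginalize out the reward)
P = dynamics.sum(axis=3)

# r(s, a) = E[R_{t+1} | s, a] = sum_{s', r} r * p(s', r | s, a)
R_sa = (dynamics * rewards).sum(axis=(2, 3))

# r(s, a, s') = E[R_{t+1} | s, a, s'], defined where p(s' | s, a) > 0
with np.errstate(divide="ignore", invalid="ignore"):
    R_sas = np.where(P > 0, (dynamics * rewards).sum(axis=3) / P, 0.0)
```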

An MDP is considered solved if we have an optimal policy $$\pi^*(a|s) = \mathbb{P}(A_t=a|S_t=s)$$ that maximizes the expected discounted return $$\sum_{t=0}^\infty \gamma^t r_{t+1}$$, where $$r_t$$ is the reward received after the action taken at time $$t-1$$, i.e. $$a_{t-1}$$ (convention from Sutton and Barto 2018, p. 48). For any MDP, there always exists a deterministic optimal policy.

Denote by $$G_t$$ the reward accumulated from time $$t$$ onward, i.e. $$G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}$$, where $$H$$ is the horizon (the number of remaining steps), which may be infinite and may itself be a random variable. Note that $$G_t = R_{t+1} + \gamma G_{t+1}$$. For finite $$H$$, the agent-environment interaction breaks into subsequences called episodes, e.g. trips through a maze (Sutton and Barto 2018).
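
A small sketch of this return and its recursion for a finite, made-up reward sequence:

```python
# G_t = sum_{k=0}^{H} gamma^k R_{t+k+1}, and the recursion G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # R_1, R_2, R_3, R_4 for one short episode (made up)

def discounted_return(rewards, gamma, t=0):
    """Return G_t for a finite-horizon reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

G0 = discounted_return(rewards, gamma, t=0)
G1 = discounted_return(rewards, gamma, t=1)
assert abs(G0 - (rewards[0] + gamma * G1)) < 1e-12   # G_0 = R_1 + gamma * G_1
```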

There are two important value functions for MDPs, defined below.

1. State-value function: The expected cumulative rewards with policy $$\pi$$ starting from state $$s$$, $$V_\pi(s)=\mathbb{E}_{\pi}[G_t|S_t=s]$$.
2. Action-value function: (Q-function) The expected cumulative rewards with policy $$\pi$$ starting from state $$s$$, taking action $$a$$, $$Q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]$$.
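
Both expectations can be approximated by rolling out the policy and averaging sampled returns; a minimal Monte Carlo sketch for $$V_\pi$$ on the made-up 2-state MDP above, with an (assumed) uniform policy and truncated horizon:

```python
# Monte Carlo estimate of V_pi(s) = E_pi[G_t | S_t = s] (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s, a]: a uniform policy, chosen arbitrarily
gamma, horizon, n_episodes = 0.9, 60, 3000

def rollout(s):
    """Sample one truncated episode starting from s; return its discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])      # A_t ~ pi(. | s)
        g += discount * R[s, a]         # accumulate the (expected) immediate reward
        s = rng.choice(2, p=P[s, a])    # S_{t+1} ~ p(. | s, a)
        discount *= gamma
    return g

V_est = np.array([np.mean([rollout(s) for _ in range(n_episodes)]) for s in range(2)])
```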

Importantly, we can write each in terms of the other.

$$V_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s] = \sum_g g\,p(g|s) = \sum_g \sum_a g\,p(g,a|s) = \sum_g \sum_a g\,p(g|s,a)\,p(a|s) = \sum_a p(a|s) \sum_g g\,p(g|s,a) = \sum_a \pi(a|s)\,Q_\pi(s,a)$$
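
In tabular form this is a policy-weighted average of Q-values over actions; a tiny sketch with made-up numbers:

```python
# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a), written as an array operation.
import numpy as np

pi = np.array([[0.5, 0.5], [0.2, 0.8]])   # pi[s, a] (made up)
Q  = np.array([[1.0, 3.0], [0.0, 2.0]])   # Q_pi[s, a] (made up)

V = (pi * Q).sum(axis=1)                  # V_pi[s]
```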

In the other direction, $$Q_\pi(s,a) = \mathbb{E}_\pi[G_t|S_t=s,A_t=a] = \mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a]$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g,s'|s,a)$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_g \sum_{s'} g\,p(g|s',s,a)\,p(s'|s,a)$$ $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_g g\,p(g|s')$$ (using the Markov property, so that the future return depends only on $$s'$$) $$=\sum_{s'} \sum_r r\,p(s',r|s,a) + \gamma \sum_{s'} \Big(\sum_r p(s',r|s,a)\Big) V_\pi(s')$$ $$=\sum_{s'} \sum_r p(s',r|s,a)\,[r + \gamma V_\pi(s')]$$
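
With the reward summarized by its expectation $$R(s,a)$$ and transitions by $$p(s'|s,a)$$, the same identity reads $$Q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} p(s'|s,a)\,V_\pi(s')$$; a sketch with made-up values:

```python
# Q_pi(s, a) = R(s, a) + gamma * sum_{s'} p(s' | s, a) * V_pi(s').
import numpy as np

P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a] = E[R_{t+1} | s, a]
V = np.array([3.0, 5.0])                   # some value function V_pi (made up)
gamma = 0.9

Q = R + gamma * np.einsum("san,n->sa", P, V)   # Q[s, a]
```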

The optimal state-value function $$V^*(s)$$ and optimal action-value function $$Q^*(s,a)$$ are the maxima of these functions over the policy space. Every optimal policy achieves the optimal state-value and action-value functions.

One key tool for solving MDPs is the set of Bellman equations, which express the state/action value functions and the optimal state/action value functions in terms of themselves.
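
For example, the Bellman optimality equation $$V^*(s) = \max_a \big[R(s,a) + \gamma \sum_{s'} p(s'|s,a)\,V^*(s')\big]$$ can be iterated to a fixed point. A minimal value-iteration sketch on the made-up arrays from above, an illustration rather than a reference implementation:

```python
# Value iteration: repeatedly apply the Bellman optimality backup until convergence.
import numpy as np

P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum("san,n->sa", P, V)   # Q[s, a] under the current V
    V_new = Q.max(axis=1)                          # greedy maximization over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)   # a deterministic optimal policy (argmax per state)
```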

## Thoughts

• Value functions deserve their own node; mention how they induce a partial ordering on policies.
• Silver's course starts out with Markov processes, then Markov reward processes, then MDPs, which I think is a great way to lay out the ideas.
• Need to look at precise sigma-algebras and formal details when decomposing the value functions into immediate and future components by the Bellman expectation equation.
• Sutton-Barto (Sutton and Barto 2018, p. 26) defines $$R_t$$ as the reward from the action taken at time $$t$$, i.e. $$A_t$$, yet on p. 48, $$R_t$$ denotes the reward from the action taken at time $$t-1$$, i.e. $$A_{t-1}$$.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Created: 2022-03-13 Sun 21:44
