Partially Observable Markov Decision Processes (POMDPs)

A Partially Observable Markov Decision Process (POMDP) is a 7-tuple \((S,A,O,P_a,R_a,Z_a,\gamma)\), where

\(S\) is the state space.
\(A\) is the action space.
\(O\) is the observation space.
\(P_a(s,s') = \mathbb{P}(S_{t+1}=s' \mid S_t=s, A_t=a)\), with \(P : S \times A \times S \rightarrow [0,1]\), is the probability of transitioning to the next state \(s'\) given the current state \(s\) and action \(a\). The transitions obey the Markov property.
\(R_a(s) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]\), with \(R : S \times A \rightarrow \mathbb{R}\), is the expected immediate reward for taking action \(a\) in the current state \(s\). The expectation averages over the next state \(s'\), so \(R_a(s)\) does not depend on it.
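\(Z_a(o,s') = \mathbb{P}(O_{t+1}=o \mid S_{t+1}=s', A_t=a)\), with \(Z : O \times S \times A \rightarrow [0,1]\), is the observation model: the probability of observing \(o\) once action \(a\) has led to the new state \(s'\).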
\(\gamma \in [0,1]\) is the discount factor for rewards.
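
For finite \(S\), \(A\) and \(O\), the components above can be stored as arrays. The following is a minimal sketch, not a reference implementation: the two-state toy numbers, the array layouts `P[a, s, s']`, `R[a, s]`, `Z[a, s', o]` and the helper name `step` are all assumptions made for illustration. It samples one transition and one observation exactly as the definitions prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 2 states, 2 actions, 2 observations.
n_states, n_actions, n_obs = 2, 2, 2

# P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a); each row sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])

# R[a, s] = E[R_{t+1} | S_t=s, A_t=a]
R = np.array([
    [1.0, 0.0],                 # action 0
    [0.0, 2.0],                 # action 1
])

# Z[a, s', o] = P(O_{t+1}=o | S_{t+1}=s', A_t=a); each row sums to 1.
Z = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])

gamma = 0.95  # discount factor


def step(s, a):
    """Sample (s', o) and return the expected reward R_a(s).

    The agent would only see `o` and the reward; `s_next` stays hidden.
    """
    s_next = int(rng.choice(n_states, p=P[a, s]))
    o = int(rng.choice(n_obs, p=Z[a, s_next]))
    r = float(R[a, s])  # expected reward, used here in place of a sampled one
    return s_next, o, r
```

Calling `step(0, 0)` returns a triple `(s_next, o, r)`. In a real interaction loop only `o` and `r` would be passed to the agent.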

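A POMDP is an MDP whose state is hidden: the agent never sees \(S_t\) directly and receives only the observations \(O_t\). Equivalently, it is a Hidden Markov Model (HMM) extended with actions and rewards.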
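
One way to make the HMM connection concrete is the forward (filtering) step, with the transition and observation matrices selected by the action taken. The sketch below is an illustration that goes beyond the original text: the belief vector `b` (a distribution over \(S\)), the function name `belief_update` and the toy arrays are assumptions. It computes \(b'(s') \propto Z_a(o,s') \sum_s P_a(s,s')\, b(s)\).

```python
import numpy as np

# Hypothetical toy model, laid out as P[a, s, s'] and Z[a, s', o].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])
Z = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])


def belief_update(b, a, o, P, Z):
    """One HMM-style forward step, conditioned on action a and observation o."""
    predicted = b @ P[a]                   # sum_s b(s) * P_a(s, s')
    unnormalized = Z[a][:, o] * predicted  # weight by Z_a(o, s')
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total


b0 = np.array([0.5, 0.5])                  # uniform prior over the two states
b1 = belief_update(b0, a=0, o=0, P=P, Z=Z)
```

With a single action this reduces to the ordinary HMM forward recursion. The only change in the POMDP setting is that `P[a]` and `Z[a]` depend on the chosen action.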

Thoughts