# Value iteration

An algorithm to solve Markov Decision Processes (MDPs). Initialize the value function at iteration 0 to be zero at all states, i.e. $$\forall s \; V_0^*(s)=0$$. Then, iterate as follows: $$V_{i+1}^*(s) = \max_a \mathbb{E}[R_{t+1}+\gamma V_i^*(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s'} p(s'|s,a)[r(s,a,s')+\gamma V_i^*(s')]$$.

In the above formulation the reward is assumed to be a deterministic function of the transition $(s, a, s')$.

This algorithm converges to the optimal value function $$V^*(s)$$, because the Bellman optimality operator is a $\gamma$-contraction in the sup norm.
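The update above can be sketched in NumPy. The MDP below (2 states, 2 actions, with hypothetical transition probabilities `P[a, s, s']` and deterministic rewards `R[a, s, s']`) is a made-up toy example, not from the note; the stopping tolerance is likewise an illustrative choice.

```python
import numpy as np

# Hypothetical toy MDP: P[a, s, s'] = p(s'|s,a), R[a, s, s'] = r(s,a,s').
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],   # action 0
    [[0.0, 0.0], [0.0, 1.0]],   # action 1
])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)               # V_0(s) = 0 for all s
    while True:
        # Q[a, s] = sum_{s'} p(s'|s,a) [r(s,a,s') + gamma * V_i(s')]
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)            # V_{i+1}(s) = max_a Q[a, s]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(P, R, gamma)
```

At convergence, `V_star` satisfies the Bellman optimality equation: applying the update once more leaves it (numerically) unchanged.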

## Thoughts

• Look into convergence details and the uniqueness of $$V^*(s)$$.

Created: 2022-03-13 Sun 21:45
