Policy iteration

When using value iteration to solve an MDP, one can notice that the optimal action for each state often becomes fixed well before the value function itself converges.

Policy iteration thus alternates between policy evaluation (i.e. computing the value function of the policy we currently have) and policy improvement (i.e. updating the policy using a one-step look-ahead on that value function, which has converged for the current policy but is not yet optimal), until the policy itself converges.

In the initial step, one can start from an arbitrary value function (e.g. 0 for all states) or an arbitrary policy (e.g. a random policy, or the same action in every state).
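To make the loop above concrete, here is a minimal tabular sketch in Python. The representation of the transition model as P[s][a] = list of (probability, next state, reward) tuples, as well as the function and parameter names, are assumptions made for illustration rather than anything fixed by this post.

    import numpy as np

    def policy_iteration(P, n_states, n_actions, gamma=0.9, eval_tol=1e-8):
        """Tabular policy iteration (sketch).

        P[s][a] is assumed to be a list of (prob, next_state, reward)
        tuples describing the MDP's transition model.
        """
        # Initialisation: an arbitrary policy (here: action 0 everywhere)
        # and a zero value function.
        policy = np.zeros(n_states, dtype=int)
        V = np.zeros(n_states)

        def q_value(s, a, V):
            # One-step look-ahead: expected reward plus discounted value
            # of the successor state.
            return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

        while True:
            # Policy evaluation: sweep the Bellman expectation backup
            # until V converges for the current (fixed) policy.
            while True:
                delta = 0.0
                for s in range(n_states):
                    v_new = q_value(s, policy[s], V)
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < eval_tol:
                    break

            # Policy improvement: act greedily w.r.t. the evaluated V.
            stable = True
            for s in range(n_states):
                best_a = max(range(n_actions), key=lambda a: q_value(s, a, V))
                if best_a != policy[s]:
                    policy[s] = best_a
                    stable = False

            # Stop once the policy no longer changes.
            if stable:
                return policy, V

The outer loop terminates because each improvement step either changes the policy to a strictly better one or leaves it fixed, and there are only finitely many deterministic policies in a finite MDP.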

Thoughts

  • Second paragraph is paraphrased from around 52:00 in here.

Author: Nazaal

Created: 2022-03-13 Sun 21:45
