Actor-Critic Methods

A policy-gradient method that makes use of the state value function to "critique" the policy being learnt. Here, the state value function is called the critic, and the policy is called the actor.

In comparison to REINFORCE with baseline, actor-critic methods make use of short term (often 1-step) returns instead of the true return which requires each episode to run until it terminates.

Thoughts

What are the relations to the Bias-Variance tradeoff?