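Unlike SARSA, which is on-policy, Q-learning is off-policy: its update bootstraps from the greedy action in the next state under the current estimate of the action-value function (Q-function), regardless of which action the behaviour policy actually takes next: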

\(Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left(R_t+\gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right)\)
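
As a minimal sketch of the update above for the tabular case, assuming the Q-function is stored as a NumPy array indexed by (state, action): the function name `q_learning_update` and the `done` flag (which drops the bootstrap term at terminal states) are illustrative assumptions, not part of the note.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the greedy action in the next state.
    # Assumption: the value of a terminal state is taken to be zero.
    bootstrap = 0.0 if done else np.max(Q[s_next])
    td_error = r + gamma * bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Toy usage with 2 states and 2 actions (values chosen only for illustration).
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1], td)
```

The only difference from a SARSA update in this sketch would be the bootstrap term: SARSA would use `Q[s_next, a_next]` for the action actually chosen next, rather than `np.max(Q[s_next])`.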


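  • Look into using an epsilon-greedy behaviour policy to choose actions, while the target policy stays greedy with respect to the Q-function.
  • Note the similarity with TD: the update can be rewritten as \(Q(S_t,A_t) \leftarrow (1-\alpha)Q(S_t,A_t) + \alpha(R_t+\gamma \max_a Q(S_{t+1},a))\), an exponential moving average of the targets. Note also the similarity with value iteration, which likewise takes the max over Q-values.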
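
The split between behaviour and target policy can be sketched as follows. This is only an illustration under stated assumptions: the chain environment in `step`, the random tie-breaking in `epsilon_greedy`, and all hyperparameter values are hypothetical choices made for the example, not something the note specifies.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Break ties at random so an all-zero table does not always pick action 0.
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

def step(s, a, n_states):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

def train(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Actions are chosen by the epsilon-greedy behaviour policy...
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = step(s, a, n_states)
            # ...but the target uses the greedy (max) value of the next state.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q = train()
greedy_policy = np.argmax(Q[:-1], axis=1)  # greedy action per non-terminal state
print(greedy_policy)
```

On this toy chain one would expect the learned greedy policy to pick action 1 (move right) in every non-terminal state, even though the data was collected with exploratory moves.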

Author: Nazaal

Created: 2022-03-13 Sun 21:44