Q-learning

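Unlike SARSA, which is on-policy and bootstraps from the action actually taken next, Q-learning is off-policy: its update target bootstraps from the greedy action under the current estimate of the action-value function (Q-function), regardless of which action the behaviour policy goes on to take: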

\(Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left(R_t + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right)\)
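
A minimal tabular sketch of this update may make the roles of the two policies easier to see. It assumes states and actions are small integers and that `Q` is a list of lists indexed as `Q[state][action]`; the function names, the `terminal` flag, and the parameter names are illustrative choices, not part of the notes above.

```python
import random


def epsilon_greedy(Q, state, n_actions, epsilon, rng=random):
    """Behaviour policy (assumed): random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Greedy choice: the action with the largest current Q-value in this state.
    return max(range(n_actions), key=lambda a: Q[state][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one Q-learning update to Q[s][a] in place and return the TD error."""
    # Target policy is greedy: bootstrap from max_a Q(s_next, a),
    # whatever action the behaviour policy will actually take in s_next.
    bootstrap = 0.0 if terminal else max(Q[s_next])
    td_error = r + gamma * bootstrap - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Toy table with two states and two actions (made-up values).
Q = [[0.0, 0.0], [1.0, 2.0]]
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

In this sketch the action is chosen by `epsilon_greedy`, but the update never looks at that choice for the next state. That is the difference from SARSA, which would use the Q-value of the next action actually selected instead of the max.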

Thoughts

Look into how an epsilon-greedy behaviour policy chooses the actions while the target policy stays greedy with respect to the Q-function.
Note the similarity with TD learning (the update is an exponential moving average of the targets, with step size \(\alpha\)) and with value iteration (the max over Q-values in the target).
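
Both similarities can be made explicit by rearranging the update. Collecting the \(Q(S_t,A_t)\) terms gives, as a sketch:

\(Q(S_t,A_t) \leftarrow (1-\alpha)\,Q(S_t,A_t) + \alpha\left(R_t + \gamma \max_a Q(S_{t+1},a)\right)\)

This is an exponential moving average of the sampled targets \(R_t + \gamma \max_a Q(S_{t+1},a)\) with weight \(\alpha\). For comparison, the value-iteration backup on Q-values, written here with an assumed transition model \(p(s',r \mid s,a)\) that does not appear elsewhere in these notes, is:

\(Q(s,a) \leftarrow \sum_{s',r} p(s',r \mid s,a)\left(r + \gamma \max_{a'} Q(s',a')\right)\)

The two targets have the same form. Value iteration takes the full expectation over next states and rewards, whereas Q-learning uses a single sampled transition and averages over samples through \(\alpha\).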