# Q-learning

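Unlike SARSA, which bootstraps from the action actually taken in the next state, Q-learning is off-policy: it updates the action-value function (Q-function) toward a target that takes the greedy (maximum) value over next actions under the current Q-estimates: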

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left(R_t+\gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\right)$$
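As a rough illustration, the update can be sketched for a tabular Q-function stored as a NumPy array of shape `(n_states, n_actions)`. The function name `q_learning_update`, the argument names, and the `done` flag for terminal transitions are assumptions made for this sketch and are not part of the note above; `r` plays the role of $R_t$.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q[s, a].

    Assumption: Q is a 2-D float array indexed as Q[state, action].
    """
    # Greedy bootstrap: max over next actions, regardless of which action
    # the behaviour policy will actually take in s_next.
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    td_target = r + bootstrap
    # Move the current estimate a fraction alpha toward the target.
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q[s, a]
```

For example, with `Q[1] = [1.0, 3.0]`, reward `1.0`, `alpha=0.5` and `gamma=0.9`, the target is `1 + 0.9 * 3 = 3.7`, so a zero-initialised `Q[0, 1]` moves halfway to `1.85`.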

## Thoughts

• Look into using an epsilon-greedy behaviour policy to choose actions while the target policy stays greedy with respect to the Q-function.
• Note the similarity with TD (the update is an exponential moving average of targets) and with value iteration (the target takes the max over Q-values).
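Both points can be made concrete with a small sketch. The helper names `epsilon_greedy`, `greedy`, and `ema_update` are hypothetical and chosen only for illustration; `Q` is assumed to be a NumPy array of shape `(n_states, n_actions)` and `rng` a `numpy.random.Generator`.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # uniform random action
    return int(np.argmax(Q[s]))

def greedy(Q, s):
    """Target policy: always the action with the highest current Q-value."""
    return int(np.argmax(Q[s]))

def ema_update(q_old, td_target, alpha):
    """Exponential-moving-average form of the update.

    Algebraically equal to q_old + alpha * (td_target - q_old).
    """
    return (1.0 - alpha) * q_old + alpha * td_target
```

Written this way, the learning rate `alpha` acts as the smoothing factor of a moving average over targets, and the `max` inside the target mirrors the maximisation in the value-iteration backup, here applied to sampled transitions rather than a known model.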

Created: 2022-03-13 Sun 21:44
