# Temporal-Difference (TD) learning

TD (specifically, TD(0)) allows us to learn value functions without full episodes via the following estimate: $$V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_t))$$

In the above, $$R_{t+1}+\gamma V(S_{t+1})$$ is called the target, $$R_{t+1}+\gamma V(S_{t+1})-V(S_t)$$ is referred to as the TD error.

This is called the TD(0) algorithm, and $$0<\alpha<1$$.

Recall that given a list $$[a_1,...,a_n]$$, the running average can be computed recursively as $$\mu_k = \frac{(k-1)\mu_{k-1} + a_k}{k} = \mu_{k-1} + \frac{1}{k}(a_k - \mu_{k-1})$$.

In TD learning we apriori do not know how many steps are in the episode, or whether the task is even episodic at all, so the $$\frac{1}{k}$$ is subtituted with $$\alpha$$, where $$0<\alpha <1$$.

However, information could take a long time to propagate to the actual states in TD learning, so instead of using 1 step, use $$n$$ steps, such that the target is now $$R_{t+1}+\sum_{i=2}^{n-1} \gamma^{i-1} R_{t+i}+ \gamma^n V(S_{t+n})$$. Namely, we only bootstrap the previous value function estimate after taking $$n$$ steps forward rather than 1.

Here, taking $$n \rightarrow \infty$$ returns the full Monte-Carlo approach, where the value functions are updated once we finish one whole episode, giving the following update equation: $$V(S_t) = V(S_t)+\alpha(G_t - V(S_t))$$.

Note that the target updates for TD and $$n$$ -step TD are as follows:

• $$G_t^{(1)} = R_{t+1}+\gamma V(S_t)$$ for TD learning.
• $$G_t^{(n)} = R_{t+1}+\gamma G_{t+1}^{(n-1)}$$ for $$n$$ -step TD learning.

TD($$\lambda$$) aims for a compromise between the TD and Monte-Carlo approach above, by taking an average of the two targets above using the following target: $$G^{\lambda}_t = R_{t+1}+\gamma ((1-\lambda)V(S_{t+1})+\lambda G^{\lambda}_{t+1})$$

In the above equation, setting $$\lambda=0,1$$ recovers the TD learning and Monte-Carlo approach respectively. Intuitively, $$\frac{1}{1-\lambda}$$ is the horizon, i.e. the $$n$$ in the $$n$$ -step TD learning algorithm .

TD($$\lambda$$) an be written in the following manner: $$G_t^\lambda= (1-\lambda)\sum_{k=1}^\infty \lambda^{k-1}G_t^{(k)}$$ $$G^{(k)}_t = \sum_{i=1}^k \gamma^{i-1}R_{t+i} + \gamma^kV(S_{t+k})$$ $$V(S_t) = V(S_t) +\alpha(G_t^\lambda -V(S_t))$$

Note however TD($$\lambda$$) requires full episodes to learn, hence eligibility traces can be used.

## Thoughts

Created: 2022-03-13 Sun 21:45

Validate