TD($\lambda$) is a generalisation of one-step TD reinforcement learning algorithms: it introduces a trace-decay parameter $\lambda$ and the corresponding $\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the start of each episode; on every time step it is incremented by the value gradient and then decays by $\gamma\lambda$:
$$ \mathbf{z}_{-1} = \mathbf{0} $$ $$ \mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right), \quad 0 \leq t \leq T $$
The eligibility trace keeps track of which components of the weight vector have contributed to recent state valuations. With linear function approximation, the gradient $\nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ is simply the feature vector $\mathbf{x}(S_{t})$.
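As a minimal sketch, the trace update above can be written for the linear case, where the gradient is the feature vector itself (the function name and feature values here are illustrative, not from the source):

```python
import numpy as np

def update_trace(z, x, gamma, lam):
    """Decay the trace by gamma * lambda, then add the current gradient.

    With linear features, grad v_hat(S_t, w_t) is just x(S_t)."""
    return gamma * lam * z + x

z = np.zeros(3)                   # z_{-1} = 0
x_t = np.array([1.0, 0.0, 0.5])   # feature vector of S_t (made up for illustration)
z = update_trace(z, x_t, gamma=0.9, lam=0.8)
```

Because the trace starts at zero, the first update leaves `z` equal to the first feature vector; later updates blend in older gradients with exponentially fading weight.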
The TD error for state-value prediction is:
$$ \delta_{t} = R_{t+1} + \gamma\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right) - \hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$
In TD($\lambda$), the weight vector is updated on each step in proportion to the scalar TD error and the vector eligibility trace:
$$ \mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta_{t}\mathbf{z}_{t} $$
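Putting the trace update, the TD error, and the weight update together gives semi-gradient TD($\lambda$) for linear state-value prediction. The following is a sketch under assumed conventions (the episode is a list of `(x_t, r, x_next)` transitions with `x_next = None` at termination; names are hypothetical):

```python
import numpy as np

def td_lambda_episode(episode, w, alpha, gamma, lam):
    """Run one episode of semi-gradient TD(lambda) with linear features.

    episode: list of (x_t, r, x_next) transitions, x_next is None when
    S_{t+1} is terminal (terminal value is defined to be 0)."""
    z = np.zeros_like(w)                              # z_{-1} = 0
    for x_t, r, x_next in episode:
        v_t = w @ x_t
        v_next = 0.0 if x_next is None else w @ x_next
        delta = r + gamma * v_next - v_t              # TD error delta_t
        z = gamma * lam * z + x_t                     # trace update (grad = x_t)
        w = w + alpha * delta * z                     # w_{t+1} = w_t + alpha * delta_t * z_t
    return w

# Toy two-state episode with one-hot features (values chosen for illustration).
episode = [
    (np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 1.0, None),
]
w = td_lambda_episode(episode, w=np.zeros(2), alpha=0.1, gamma=1.0, lam=0.5)
```

Note that the trace lets the TD error at the second step also adjust the weight component activated at the first step, scaled down by $\gamma\lambda$.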
Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
Task | Papers | Share |
---|---|---|
Starcraft II | 7 | 29.17% |
Starcraft | 6 | 25.00% |
Reinforcement Learning (RL) | 4 | 16.67% |
Decision Making | 2 | 8.33% |
Language Modelling | 1 | 4.17% |
Large Language Model | 1 | 4.17% |
Offline RL | 1 | 4.17% |
Imitation Learning | 1 | 4.17% |
Hierarchical Reinforcement Learning | 1 | 4.17% |