One can further write this recursively using per-decision importance sampling (Sutton and Barto, 2018; Precup, 2000), but it is not essential to our derivations.

Note that we use $\rho_{t+1:t+n-1}$, not $\rho_{t+1:t+n}$: as in any n-step Expected SARSA such as this one, all possible actions are taken into account in the last state, so the one actually taken has no effect and does not have to be corrected for (Sutton and · 2018
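To make the index range of the ratio concrete, here is a minimal sketch of a standard (not entropy-regularised) off-policy n-step Expected SARSA target in plain Python. The function name, the tabular `q`, `pi`, and `b` lookups, and the argument layout are all assumptions made for illustration and are not taken from the cited paper. The point it shows is that the ratio is a product over $A_{t+1}, \dots, A_{t+n-1}$ only, because the bootstrap term already averages over every action in $S_{t+n}$.

```python
def n_step_expected_sarsa_target(rewards, states, actions, q, pi, b, gamma):
    """Return (target, rho) for one off-policy n-step Expected SARSA update.

    Hypothetical layout, assumed for this sketch:
      rewards = [R_{t+1}, ..., R_{t+n}]      (length n)
      states  = [S_t, ..., S_{t+n}]          (length n + 1)
      actions = [A_t, ..., A_{t+n-1}]        (length n)
      q, pi, b are nested lists indexed as table[state][action];
      pi is the target policy and b the behaviour policy.
    """
    n = len(rewards)

    # Discounted sum of the n observed rewards.
    target = sum(gamma ** k * r for k, r in enumerate(rewards))

    # Bootstrap with the expectation over ALL actions in S_{t+n} under pi.
    # The action actually taken in S_{t+n} never enters the target.
    s_last = states[n]
    expected_q = sum(p * v for p, v in zip(pi[s_last], q[s_last]))
    target += gamma ** n * expected_q

    # rho_{t+1:t+n-1}: correct only for A_{t+1}, ..., A_{t+n-1}.
    # A_t is the action being evaluated, and A_{t+n} is averaged out above.
    rho = 1.0
    for k in range(1, n):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]

    return target, rho
```

Under these assumptions the tabular update would weight the error by `rho`, as in `q[s][a] += alpha * rho * (target - q[s][a])` for `s, a = states[0], actions[0]`. For `n = 1` the loop is empty and `rho` stays at 1, which matches the fact that one-step Expected SARSA needs no importance sampling correction.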

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.


