Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
Hyperspherical normalization for scalable deep reinforcement learning. arXiv preprint arXiv:2502.15280
7 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 7
representative citing papers
A generalization of differential TD extends it to episodic settings while preserving policy ordering, inheriting linear TD guarantees, and improving sample efficiency.
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
Flow matching critics outperform monolithic ones in RL, with 2x the performance and 5x the sample efficiency, via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.
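The MINTO summary above describes a bootstrapped target built from the minimum of the online and target network estimates. A minimal sketch of that idea for a single transition follows, assuming scalar next-state value estimates are already available. The function name `minto_target`, its parameters, and the handling of terminal transitions are illustrative assumptions, not taken from the paper, and the sketch does not specify how actions are chosen in the discrete or continuous case.

```python
def minto_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Return a one-step bootstrapped target for a single transition.

    Assumption (hypothetical interface): q_online_next and q_target_next are
    scalar estimates of the next-state value from the online network and the
    target network, already evaluated at the same next-state action.
    """
    # Take the smaller of the two estimates as the bootstrap value.
    bootstrap = min(q_online_next, q_target_next)
    # Drop the bootstrap term entirely on terminal transitions.
    return reward + gamma * (1.0 - float(done)) * bootstrap


if __name__ == "__main__":
    # Non-terminal transition: the online estimate (2.0) is the smaller one.
    print(minto_target(1.0, False, q_online_next=2.0, q_target_next=3.0, gamma=0.9))
    # Terminal transition: the target reduces to the reward.
    print(minto_target(1.0, True, q_online_next=2.0, q_target_next=3.0, gamma=0.9))
```

With these inputs the first call bootstraps from the online estimate because it is the smaller of the two; swapping the two values would bootstrap from the target network instead.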
citing papers explorer
What Does Flow Matching Bring To TD Learning?