QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
arXiv preprint arXiv:2006.02243 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2years
2026 2representative citing papers
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
citing papers explorer
-
Behavior-Consistent Deep Reinforcement Learning
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
-
What Does Flow Matching Bring To TD Learning?
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.