Generalization and Regularization in DQN
6 Pith papers cite this work.
citing papers explorer
- Behavior-Consistent Deep Reinforcement Learning
  QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
- Scaling Laws for Reward Model Overoptimization
  Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.
- Generalizing from a few environments in safety-critical reinforcement learning
  RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
- Reasoning and Generalization in RL: A Tool Use Perspective
  Proposes a tool-use inspired framework with multiple test sets to measure specified types of generalization in RL.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- In Hindsight: A Smooth Reward for Steady Exploration
  Adding a hindsight factor that integrates historic temporal differences into the Q-learning loss reduces overestimation and yields higher average scores than DQN, DDQN and dueling networks on ATARI games after 10 million frames.