A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
Reinforcement learning: An introduction
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
MORL generates more diverse requirement-violation scenarios while SORL produces higher-severity violations when testing interdependent requirements in an end-to-end AV controller.
Combines D-optimality and diversity-maximizing selection in an epsilon-greedy loop to create compact training sets for sparse group additivity and kernel ridge regression models of molecular properties.
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
-
Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
-
Reinforcement Learning for Testing Interdependent Requirements in Autonomous Vehicles: An Empirical Study
MORL generates more diverse requirement-violation scenarios while SORL produces higher-severity violations when testing interdependent requirements in an end-to-end AV controller.
-
Designing compact training sets for data-driven molecular property prediction
Combines D-optimality and diversity-maximizing selection in an epsilon-greedy loop to create compact training sets for sparse group additivity and kernel ridge regression models of molecular properties.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.