Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
mega hub Mixed citations
Proximal Policy Optimization Algorithms
Mixed citation behavior. Most common role is background (52%).
abstract
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more ge
authors
mega hub controls
Recognition alignment
counterfactual ablation
co-cited works
representative citing papers
GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.
DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.
Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.
A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.
Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.
AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
citing papers explorer
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
-
Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration
QOED selects identifiable parameter directions via Fisher matrix eigenspace analysis and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% performance gains in robotic tasks.
-
Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion
Explicit conditioning of a PPO policy on interpretable stair parameters (height, depth, yaw) yields improved generalization to unseen stairs and reliable real-world traversal on the Unitree G1, including 33 consecutive outdoor steps.
-
Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching
DRIS improves zero-shot sim-to-real transfer for reactive catching by maintaining and acting on sets of randomized dynamics instances instead of single instances per episode.
-
LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts
LineRides enables commandable bicycle robot stunts via line-guided RL that uses spatial guidelines, a tracking margin for feasibility, distance-based progress, and sparse key-orientations.
-
SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision
SigLoMa enables dynamic loco-manipulation on quadrupeds from ego-centric 5 Hz vision alone by using Sigma Points for scalable exteroception, an ego-centric Kalman Filter for high-rate state estimation, and an active sampling curriculum, matching expert human teleoperation performance.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 objects with hundreds of trajectories per task.
-
Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots
DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb success versus 10% for baselines.
-
Vision-Based End-to-End Learning for UAV Traversal of Irregular Gaps via Differentiable Simulation
An end-to-end vision-based framework enables UAVs to traverse complex irregular gaps in unseen environments by mapping depth images to SE(3) control commands using differentiable simulation.
-
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
DP3 uses compact 3D representations from sparse point clouds inside diffusion policies to learn generalizable visuomotor skills from few demonstrations, reporting 24% gains in simulation and 85% success on real robots.
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
-
Learning Versatile Humanoid Manipulation with Touch Dreaming
HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-rich humanoid loco-manipulation tasks.
-
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
-
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gain over evolutionary strategies.
- Learning Dexterous Grasping from Sparse Taxonomy Guidance