Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
Proximal Policy Optimization Algorithms
Mixed citation behavior. Most common role is background (52%).
abstract
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
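The abstract describes a "surrogate" objective and repeated minibatch passes over the same sampled data, but does not spell out either. A minimal sketch follows, assuming the clipped form of the surrogate from the PPO paper, L^CLIP = E[min(r_t A_t, clip(r_t, 1-eps, 1+eps) A_t)], where r_t is the probability ratio between the new and old policy. The function names, the `eps=0.2` default, and the shuffling scheme are illustrative assumptions, not taken from this page.

```python
import math
import random


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a minibatch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. eps=0.2 is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - eps, 1 + eps] in the direction that raises the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def minibatch_schedule(n_samples, batch_size, epochs, seed=0):
    """Yield index minibatches that reuse one sampled batch for several epochs.

    Hypothetical helper: reshuffles the same n_samples indices each epoch.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for start in range(0, n_samples, batch_size):
            yield indices[start:start + batch_size]
```

In a training loop, each minibatch yielded by `minibatch_schedule` would be scored with `clipped_surrogate`, followed by a gradient ascent step on the policy parameters. The gradient step itself is omitted here because it depends on the autodiff framework in use.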
representative citing papers
GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.
IRumAI is the first RL agent for Indian Rummy, trained on weak heuristics to beat strong search opponents at 7000x speed.
DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.
Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.
A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.
Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.
AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
citing papers explorer
-
Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
K-nearest neighbor from a knowledge graph beats most methods on out-of-distribution transcriptomic perturbation prediction, and an RL-trained reasoning LLM matches SOTA on Replogle et al. (2022) cell lines while improving downstream differential expression prediction.
-
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
-
IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking
IR-SIM is a YAML-defined simulator for mobile robot navigation that supports text-prompt scenario creation, policy training, benchmarking, and bridging to higher-fidelity or real-world settings.
-
Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation
AdaGRPO gates GRPO reinforcement learning with supervised NLL using per-sample binary clips based on policy difficulty and reward discriminability, raising HR@10 from 11.01% to 12.18% while keeping hallucination below 0.22% on large-scale e-commerce data and showing A/B gains.
-
Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking
A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.
-
Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation
A three-level hierarchical RL framework uses pose affordances to guide navigation and interaction-point affordances to guide pedipulation, enabling autonomous object manipulation by quadrupeds in simulation and real-world tests.
-
QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation
QuadVerse integrates 3D Gaussian Splatting scene reconstruction, friction calibration via trajectory search, and a residual dynamics compensator to improve quadruped simulation fidelity and enable zero-shot policy transfer.
-
T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion
T-GMP learns a terrain-conditioned latent motion manifold via CVAE from demonstrations and integrates it into an adversarial pipeline with a foothold penalty for versatile, natural humanoid locomotion.
-
AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO
AdaGRPO enhances GRPO for flow models via online curriculum filtering of prompts and cross-level advantage fusion, yielding performance gains and training stability.
-
Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards
Progress-SQL introduces a multi-turn RL framework with ODT-based structural alignment and progressive rewards that measure improvement across refinement turns, yielding gains on BIRD, Spider, and robustness benchmarks.
-
Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension
Reinforcement learning produces a single unified controller that lets an actively suspended planetary rover autonomously cross heterogeneous rough terrains after sim training and zero-shot hardware transfer.
-
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
HANDOFF is a distilled mixture-of-experts humanoid whole-body controller that follows a compact task-space interface, matches SOTA velocity tracking, provides large manipulation workspace on Unitree G1, and supports VLM-driven agentic planning with no task-specific data.
-
EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading
EDIT improves LLM rubric grading faithfulness by diagnosing problematic reasoning steps via posterior belief and grounding scores then applying local SFT revisions and belief-penalizing RL.
-
L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation
L-SDPPO optimizes a spiking diffusion policy with RL and adds SDLI to handle microgravity dynamics, reporting higher success rates and lower energy use than prior methods on five intra-vehicular tasks.
-
LadderMan: Learning Humanoid Perceptive Ladder Climbing
A hybrid motion-tracking and imitation-reinforcement pipeline produces a depth-based visuomotor policy that lets humanoids climb varied ladders zero-shot on hardware and perform teleoperated manipulation while climbing.
-
EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction
EEGDancer integrates VQ-VAE latent space learning, masked Transformer modeling, and SAC reinforcement learning to improve continuous EEG emotion prediction over prior methods on SEED datasets.
-
BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection
BMCR uses RL to adaptively compose modules from CNN and ViT backbones with an OT alignment interface, reporting mAP gains of up to 2.5 points on DOTA and DIOR-R datasets.
-
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
MR.Q combines predictive auxiliary tasks with high-capacity value functions in a model-free architecture to achieve strong multitask RL performance without planning.
-
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
SA-AH-GRPO applies asymmetric entropy-based discounting only to negative-advantage trajectories in GRPO, yielding similar peak Pass@1 accuracy with 3.6x lower training variance on GSM8K for Qwen 2.5 models.
-
GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
GRAIL creates over 20,000 synthetic loco-manipulation sequences from known 3D configurations and video priors, then trains policies that achieve 84% pick-up and 90% stair-climbing success on a real Unitree G1 humanoid using only the generated data.
-
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.
-
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
CoRe-MoE uses a two-stage RL framework with contrastive reweighting in a Mixture-of-Experts architecture to enable gait transitions and multi-terrain adaptation for humanoid locomotion.
-
Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
DIA is a training-free method that dynamically adjusts anchor positions in diffusion LLMs to improve format compliance and accuracy on reasoning benchmarks like GSM8K and MATH.
-
Self-Optimizing Control of Continuous Processes Based on Reinforcement Learning
Reinforcement learning optimizes controlled variable selection for self-optimizing control by embedding the structure in an actor network and using economic rewards, showing better dynamic performance than a steady-state baseline in a CSTR simulation under disturbances.
-
CoPark: Learning Reactive Parking via Self-Play
CoPark uses multi-agent self-play RL with a residual policy and threat-modulated asymmetric prior release to achieve 70-85% success and 3-6% collision rates in reactive parking benchmarks.
-
QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards
QUBRIC co-designs queries and rubrics via teacher key points, contrastive generation, and learnability filtering to support GRPO training, yielding +5.5 on ArenaHard and +6.3 average transfer to legal/moral/narrative benchmarks.
-
Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning
TAO-RL improves agentic RL by filtering degenerate trajectories and reshaping advantages with tool-aware entropy bonuses, yielding better performance on reasoning benchmarks.
-
PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion
PerchRL applies two-stage RL with randomized trajectories, temporal augmentation, and visibility-aware rewards to achieve vision-based perching on irregularly moving inclined platforms.
-
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.
-
When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
An empirical study of RLHF pipelines classifies failure modes such as reward hacking by analyzing directions of change in learned reward and judge scores across training checkpoints and shows they can be localized and partially predicted.
-
ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control
ConTrack introduces a constrained RL method with online dual-variable adaptation and adaptive resets for improved long-horizon hand tracking in simulation and on real robots.
-
Constitutional On-Policy Safe Distillation
COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.
-
Efficient Hyperparameter Optimization for LLM Reinforcement Learning
JF-HPO jointly adapts model size and training budget as fidelity for efficient HPO in LLM RL, reporting up to 14.9x trial speedup and performance gains of 5.8-111.6% over the VeRL recipe.
-
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.
-
MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching
MT-EditFlow applies flow-matching RL with multi-reward aggregation to improve multi-turn image editing performance on models like FLUX.1-Kontext-dev by 6.85 points at turn-3.
-
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.
-
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
POPO uses recency-based prioritized group replay and decoupled off-policy optimization to avoid zero-variance ineffective samples in RLVR, accelerating LLM reasoning finetuning with fewer rollouts.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
-
COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space
COLLIE constructs a semantically coherent skill latent space from unsupervised data to enable training-free guidance with sparse online feedback in guided skill discovery.
-
Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance
A hybrid framework uses MARL value guidance to steer diffusion-generated trajectories for coordinated multi-robot planning, cutting interference from 55.4% to 41.8% in a 4-robot maze simulation.
-
Certificate-Guided Evaluation of Reinforcement Learning Generalization
A logic-driven framework defines inductive reach-avoid tasks and uses neural certificates to certify RL generalization, with empirical results linking fewer violations to more solved test tasks.
-
MESA: Improving MoE Safety Alignment via Decentralized Expertise
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
-
Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion
GLAD decomposes terrain encoding via coarse-to-fine attention on elevation maps to separate broad awareness from precise foothold selection in perceptive humanoid locomotion.
-
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
SPADER is an RL method for multi-answer QA that claims better recall and F1 via peer-aligned step-level advantages and diversity rewards on four benchmarks.
-
Improving Visual Representation Alignment Generation with GRPO
VRPO applies generative representation policy optimization to dynamically align diffusion features with pretrained visual encoders, claiming +1.8 FID gains and 2.3x faster training versus REPA.
-
CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.
-
Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning
Linear recurrent filters exactly reproduce HMM belief logits under deterministic transitions and achieve near-zero decoding error under nearly deterministic ones, extending to action-controlled cases.
-
Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning
PVPO is a sample-efficient RL method that improves semantic, geometric, and physical quality in LLM LEGO assembly generation by mitigating the PhysHack failure mode where validity alone fails to ensure fidelity.
-
Automating Formal Verification with Reinforcement Learning and Recursive Inference
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.
-
Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments
Policy gradient methods suffer from zero collapse in discontinuous reward environments such as first-price auctions, where exploration causes policies to enter flat zero-reward regions from which recovery is sample-inefficient due to absent gradient signals.