Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
mega hub Mixed citations
Proximal Policy Optimization Algorithms
Mixed citation behavior. Most common role is background (52%).
abstract
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more ge
authors
mega hub controls
Recognition alignment
counterfactual ablation
co-cited works
representative citing papers
GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.
Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.
AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
citing papers explorer
-
Reinforcement learning for ion shuttling on trapped-ion quantum computers
Reinforcement learning optimizes ion shuttling on trapped-ion quantum chips and reduces operations by up to 36.3% versus heuristics across multiple architectures.
-
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.
-
Unified Data Selection for LLM Reasoning
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
-
Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors
Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking only base/hands/object keypoints inside a BFM latent space, and training with progressive simple rewards for mocap deployment.
-
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
-
Emergence of agriculture in an artificial society of reinforcement learning agents
Agriculture emerges spontaneously in an RL agent society through planning for delayed rewards, social learning that counters cheaters, and an irreversible lock-in effect.
-
CLORE: Content-Level Optimization for Reasoning Efficiency
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
-
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
-
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation
CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta
-
Token-weighted Direct Preference Optimization with Attention
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.
-
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
-
Value-Gradient Hypothesis of RL for LLMs
Shows that under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs are value-gradient-like in expectation, motivating a decomposition into value signal and reward headroom for when RL is most effective.
-
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
-
DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning
DeCoR co-optimizes crosswalk placement and signal control via reinforcement learning on a real 750 m urban corridor, reporting 23% faster pedestrian access to crossings and 79%/65% reductions in pedestrian/vehicle wait times versus fixed-time baselines.
-
Behavior-Consistent Deep Reinforcement Learning
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
-
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
ScenePilot uses RSS-derived physical feasibility score and online-learned AV-risk predictor in constrained RL with feasibility-aware shielding to generate boundary-band scenarios, yielding +6.2 pp higher collision rates on SafeBench while preserving validity.
-
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
GRPO suffers advantage collapse on uniform-reward groups; ACR quantifies it and AVSPO adds virtual samples to restore gradients, yielding 4-6% accuracy gains on math benchmarks across 0.5B-14B models.
-
Towards Context-Invariant Safety Alignment for Large Language Models
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
-
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
-
SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front
SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.
-
Spectral Souping: A Unified Framework for Online Preference Alignment
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
-
SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework
SUGAR turns diverse human videos into deployable humanoid loco-manipulation policies via automated prior extraction, physics refinement, and hierarchical distillation, showing scaling with data volume and zero-shot real-world transfer on six tasks.
-
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
-
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
-
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
-
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
-
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
-
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
-
General Preference Reinforcement Learning
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
-
Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors
State-dependent adversarial motion priors with a gravity-threshold gate enable one frozen policy to unify walking, running, and recovery on humanoid hardware.
-
DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization
DiPRL trains nearly discrete programmatic policies in RL by adding architecture entropy regularization to gradient-based optimization, avoiding performance collapse from post-hoc discretization.
-
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights
A maximum entropy reinforcement learning framework generates realistic customer trajectories in retail spaces that match real data better than TSP or PNN heuristics and support more accurate layout optimization decisions.
-
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
-
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
-
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
-
Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization
DEPPA reformulates the denoising process of pocket-aware diffusion models as a multi-step MDP and applies RL fine-tuning with a coarse scheduler to optimize ligands for binding affinity, drug-likeness, synthesizability and diversity on CrossDocked2020.
-
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
-
Self-Supervised On-Policy Distillation for Reasoning Language Models
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
-
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
-
QPLEX Decision Processes: Formulation via Nonlinear Markov Chains and Optimization via Policy Gradients
QPLEX Decision Processes embed QPLEX transient approximations into a nonlinear MDP framework and optimize policies via deterministic gradients and natural-gradient methods, demonstrated on a dynamic pricing problem with waiting costs or chance constraints.
-
A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems
A hybrid LLM-RL red teaming framework generates adaptive attack campaigns in simulated enterprise networks to evaluate the robustness of AI-enabled SOAR systems.
-
Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning
XDiffuser combines extrinsic graph planning with diffusion models to guide denoising and improve performance on long-horizon robotic tasks including multi-agent coordination and TSP-style problems.
-
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
-
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
-
Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.
-
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
-
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
-
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
-
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
-
Offline Reinforcement Learning with Universal Horizon Models
Universal horizon models extend geometric horizon models to arbitrary horizons and apply winsorized distributions for stable offline RL value learning, outperforming baselines on 100 OGBench tasks.