Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
Proximal Policy Optimization Algorithms
Mixed citation behavior. Most common role is background (52%).
abstract
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
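The abstract refers to a "surrogate" objective without writing it out. As a rough illustration, the sketch below implements the clipped form that is most commonly associated with PPO: the probability ratio between the new and old policy is clipped to an interval around 1, and the smaller of the clipped and unclipped terms is kept. The function name `clipped_surrogate`, the argument names, and the value `clip_eps=0.2` are assumptions chosen for this example; they do not come from the page above.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a minibatch (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the data-collecting policy. clip_eps is an assumed value.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum keeps a pessimistic estimate of the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# When the two policies agree, the objective is just the mean advantage.
print(clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, -1.0]))
```

Because the clipped term stops rewarding ratios that move far from 1, the same batch of samples can be reused for several epochs of minibatch updates, which is the property the abstract highlights.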
representative citing papers
GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.
IRumAI is the first RL agent for Indian Rummy, trained on weak heuristics to beat strong search opponents at 7000x speed.
DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.
Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.
A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.
Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.
AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.
A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
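Several of the summaries above mention GRPO and its group-mean baseline. As a hedged sketch of that idea only: each sampled response in a group receives one scalar reward, the group mean is subtracted, and every token of a response shares the resulting advantage, so the advantages sum to zero across the group. The helper name `group_mean_advantages` and its arguments are hypothetical, and the standard-deviation normalization often used alongside the mean baseline is deliberately left out here.

```python
def group_mean_advantages(rewards, token_counts):
    """Group-relative advantages with a mean baseline (no std normalization).

    rewards: one output-level reward per sampled response in the group.
    token_counts: number of tokens in each response.
    """
    baseline = sum(rewards) / len(rewards)
    per_response = [r - baseline for r in rewards]
    # Every token of a response is assigned that response's advantage.
    per_token = [[a] * n for a, n in zip(per_response, token_counts)]
    return per_response, per_token

per_response, per_token = group_mean_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
print(per_response, sum(per_response))
```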
citing papers explorer
- Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences
  FedVPA-GP applies variational preference learning in a federated setting with a mixture prior and orthogonal loss to disentangle user preferences on the HH-RLHF dataset.
- DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
  DARTS accelerates LLM RL training up to 1.77x by distribution-aware trajectory sampling and adaptive redundancy allocation that shapes rollouts toward conciseness without performance loss.
- SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World
  SSR is an end-to-end vision-based framework for humanoid traversal that learns imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to enable safe locomotion on diverse real-world terrains.
- Representation Collapse in Sequential Post-Training of Large Language Models
  Sequential post-training of LLMs induces representation collapse that correlates with reduced plasticity, weaker generalization, and poorer calibration, with lightweight interventions tested to mitigate it.
- Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
  Introduces SVEB benchmark and Numca/Hista methods claiming more accurate state value estimates and better RL training performance for LLMs.
- Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models
  An iterative writer-editor multi-agent LLM process improves perceived story quality in simulations of child collaborative storytelling.
- Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory
  PPRO improves user-aware memory retrieval in conversational agents by using derived user profiles for ranking and training a query rewriter via Group Relative Policy Optimization, with reported gains on LoCoMo and LongMemEval-S benchmarks.
- Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design
  SkillPCF is a closed-loop agent framework with a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded evolution that improves design quality and efficiency for photonic crystal fiber inverse design under limited simulation budgets.
- SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints
  SPRINT generates sprint trajectories for humanoids via spectral priors from five human motion sequences, achieving 6 m/s peak velocity with zero-shot sim-to-real transfer on Unitree G1.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
  Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
- Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
  COSE uses LLM intrinsic confidence to weight PPO updates and prioritize replay, yielding better average performance than base models on reasoning and math benchmarks across multiple small backbones.
- ABot-OCR Technical Report
  ABot-OCR is a new end-to-end VLM for direct image-to-Markdown transcription using a custom data engine and structure-constrained RL optimization, reporting SOTA scores of 92.81/93.30 on OmniDocBench v1.5/v1.6.
- SANTS: A State-Adaptive Scheduler for World Action Models
  SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.
- Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
  MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
- Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment
  Introduces kernel contracts framework with derived bounds on divergence from logit drift to reward drift, specialized for RL post-training under support and norm assumptions.
- Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation
  Proposes Bayesian posterior inference on probabilistic landing capability to enable sequential approve/reject/continue decisions for RL landing controllers under finite validation evidence.
- Quantifying Uncertainty in Space Debris Capture with Active Tether-Net Systems Caused by Noisy Observations
  Presents a UQ pipeline applying Sobol sensitivity analysis and perturbation methods to quantify noisy-observation effects on Capture Quality Index for fixed-control and neuro-controlled active tether-net systems, using high- and low-fidelity simulators.
- Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
  NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.
- LitSeg: Narrative-Aware Document Segmentation for Literary RAG
  LitSeg segments literary texts using narrative analysis via multi-stage prompting and offers a distilled lightweight version for efficient use in RAG systems.
- Ratio-Variance Regularized Policy Optimization
  R²VPO uses ratio-variance regularization as a distributional soft brake on policy updates, claiming better performance than PPO on math reasoning and robotic control without hard clipping.
- KARMA: Karma-Aligned Reward Model Adaptation
  KARMA adapts reward models from Reddit karma data to align LLMs with conversational pragmatics, finding that context-only rewards outperform karma-predictive ones downstream while reducing factuality across conditions.
- Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
  A GRPO-based RL framework with probabilistic risk minimization, disagreement-aware synergy rewards, and entropy-guided sampling enables instance-level tool selection that closes the single-oracle risk gap on medical benchmarks.
- Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
  MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
- Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation
  The paper releases two adversarial malware datasets (44k family-labelled, 33k type-labelled) with high evasion rates and demonstrates that 0.5% poisoning injection raises evasion from 26.1% to 92.8%.
- ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour Locomotion
  ParkourFormer achieves 93.85% average success on multi-terrain humanoid parkour by fusing Transformer sequence modeling with supervised future-state prediction.
- Reinforcement Learning from Denoising Feedback
  RLDF is a new RL paradigm for diffusion language models that optimizes toward clipped clean states with weighted timestep sampling and reports substantial gains on reasoning benchmarks for LLaDA and Dream.
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
  DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.
- GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
  GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
- Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion
  Targeted changes to policy initialization, critic targets, and return estimation let SAC match PPO performance across legged locomotion tasks in massively parallel simulation.
- Integrated Sensing, Communication, and Computing for NR-V2X: A Cross-Layer Resource Allocation Framework Using Multi-Agent Reinforcement Learning
  MAPPO-SPS applies multi-agent proximal policy optimization to a cooperative partially observable Markov game formulation of ISCC-aware SB-SPS scheduling in NR-V2X, yielding balanced simulation tradeoffs across CRLB sensing accuracy, PRR, throughput, energy, and delay.
- DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting
  DTO is a new differentiable objective combining fidelity to reference rewrites and semantic consistency that outperforms MLE and preference baselines while matching LLMs on TimeTravel and ART datasets.
- Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems
  MRC computes coalition Shapley credits from performance histories to weight three LLM agents, stabilized by Bayesian mixture and regime multipliers, achieving SR 1.51 and 440.1% cumulative return over 1037 days on 13 crypto assets.
- Vision-Guided Outdoor Flight and Obstacle Evasion via Reinforcement Learning
  A sensorimotor policy with a pre-trained autoencoder perception head and LSTM controller, trained in two stages via privileged learning and curriculum reinforcement learning with domain randomization, achieves zero-shot transfer for outdoor obstacle evasion on unseen environments and platforms.
- SafeSABR: Risk-Calibrated Adaptive Bitrate Streaming over Starlink Networks
  SafeSABR cuts severe-stall sessions in Starlink video streaming from 22.8% to 7.2% and worst-5% rebuffering from 54.30 s to 22.68 s at 1.8% QoE cost via behavior-cloning pretraining, risk-calibrated RL, and safe-capacity auditing.
- StepAudio 2.5 Technical Report
  StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
- MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization
  MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.
- TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
  TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
- Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
  A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.
- Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
  Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
- Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems
  Meta-learning framework adapting iMAML for rapid controller tuning on uncertain nonlinear systems via offline source data and limited online target adaptation, shown with neural state-space and DQN variants.
- ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps
  ACCoRD trains an ANN with PPO-Clip reinforcement learning to select conflict resolution actions in O-RAN, reducing negative network events versus rule-based methods in medium and high traffic simulations.
- Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs
  RGoT uses RL to adaptively generate task-specific graphs of operations for GoT-style LLM prompting from a human-provided set, with results suggesting feasibility under constraints.
- One-Way Policy Optimization for Self-Evolving LLMs
  OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.
- OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization
  OPERA jointly optimizes restoration planning via RL over tool compositions and execution via agent-guided co-training of tools, claiming consistent gains over all-in-one models and prior agent methods on multi-degradation benchmarks.
- ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking
  ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.
- stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
  The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
- Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control
  A closed-loop sim-to-real RL policy trained in a simplified frictionless simulator achieves sub-millimeter microfiber shape control on physical hardware via visual feedback without retraining.
- torchtune: PyTorch native post-training library
  torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
- Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
  Stochastic MeanFlow Policies enable one-step generative control in off-policy mirror descent by mapping noise through a MeanFlow transform, yielding tractable entropy and improved MuJoCo performance over Gaussian and generative baselines.
- LamPO: A Lambda Style Policy Optimization for Reasoning Language Models
  LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.