Proximal policy optimization algorithms
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
- Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
- PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
- Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.
- Distributional Reinforcement Learning via the Cramér Distance
C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.
- COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams
COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.
- RTMC: Step-Level Credit Assignment via Rollout Trees
RTMC aggregates returns across rollout trees to produce step-level Q-values and advantages, improving pass@1 by 3.2 points over GRPO on SWE-bench Verified.
- RAGEN-2: Reasoning Collapse in Agentic RL
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
- PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.
- SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning
SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.
- RewardBench 2: Advancing Reward Model Evaluation
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
- MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
- Generalizing from a few environments in safety-critical reinforcement learning
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
- Efficient 3D Content Reconstruction and Generation
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
- Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
- AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
AdaGamma stabilizes state-dependent discounting in deep actor-critic RL by adding a return-consistency regularizer, delivering gains on continuous-control benchmarks and a real-world logistics A/B test.
- PokeRL: Reinforcement Learning for Pokemon Red
PokeRL trains PPO agents to finish early Pokemon Red tasks using a loop-aware environment wrapper, multi-layer anti-loop mechanisms, and dense hierarchical rewards.
- Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization
Introduces an off-policy adversarial imitation learning method with double Q stabilization that reduces samples required to match expert behavior.
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.
- Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
- Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.
- InvDesFlow-AL: active learning-based workflow for inverse design of functional materials
InvDesFlow-AL combines active learning with diffusion generative models to improve crystal structure prediction accuracy by 33% and identifies Li2AuH6 as a candidate BCS superconductor with 140 K transition temperature.
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Rule-based RL on 5K logic puzzles induces advanced reasoning in a 7B model that transfers to AIME and AMC.
- Reinforcement Learning for LLM Post-Training: A Survey
A survey that derives a unified policy gradient framework for LLM post-training methods and provides technical comparisons of PPO, GRPO, and DPO variants.
- The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
- Hint Tuning: Less Data Makes Better Reasoners