DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu et al · 2025 · cs.LG · arXiv 2503.14476

Mixed citation behavior. Most common role is background (56%).

579 Pith papers citing it

Background 56% of classified citations

abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

citation-role summary

background 61 · method 26 · baseline 11 · dataset 10 · other 1
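
The role counts above can be tied back to the 56% background figure with a short calculation. This is a minimal sketch, assuming the percentage is the background count divided by the sum of the five role counts; the page does not state the denominator, and `role_shares` is a hypothetical helper name used only for illustration.

```python
# Role counts copied from the citation-role summary.
role_counts = {
    "background": 61,
    "method": 26,
    "baseline": 11,
    "dataset": 10,
    "other": 1,
}


def role_shares(counts):
    """Return each role's share of all classified citations, in whole percent.

    Assumption: the denominator is the sum of the listed role counts.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


total_classified = sum(role_counts.values())
shares = role_shares(role_counts)

print(total_classified)
print(shares["background"])
```

Under that assumption the five counts sum to 109 classified citations, and 61 of 109 rounds to the 56% reported for the background role.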

citation-polarity summary

background 61 · use method 24 · baseline 11 · use dataset 10 · unclear 3

authors

Qiying Yu et al

representative citing papers

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0 · 2 refs

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

cs.AI · 2026-06-04 · conditional · novelty 8.0

Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Verifiable Rewards for Calibrated Probabilistic Forecasting

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A verifiable empirical win rate reward combined with gradient masking enables RL training of a 7B model to reach betting-market calibration on NFL win probabilities using only outcome data.

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

GRPO, Dr. GRPO, and DAPO are three settings of one dial on the group standard deviation of binary rewards, unified by the group-standard-deviation identity where disagreement equals update magnitude.

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

cs.LG · 2026-06-24 · unverdicted · novelty 7.0

Introduces the Generalization Spectrum evaluation framework to track per-example generalization across transfer distances in competitive programming tasks.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

Attackers can force LLM guardrails into extended reasoning loops via optimized payloads, causing 13-63x token amplification and up to 148x latency in agent systems.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

cs.DC · 2026-06-10 · unverdicted · novelty 7.0

ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

CERO uses Beta posteriors and Fenchel-dual online optimization to adaptively allocate a fixed rollout budget across prompts and epochs in LLM RL, outperforming fixed-allocation GRPO on math reasoning benchmarks.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · none · ref 12 · 2 links · internal anchor

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · none · ref 23 · internal anchor

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Teaching Language Models to Think in Code

cs.CL · 2026-05-08 · unverdicted · none · ref 21 · 2 links · internal anchor

ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · none · ref 46 · internal anchor

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

cs.CL · 2026-04-27 · unverdicted · none · ref 66 · 2 links · internal anchor

DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.

S-GRPO: Unified Post-Training for Large Vision-Language Models

cs.LG · 2026-04-17 · unverdicted · none · ref 51 · internal anchor

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · none · ref 59 · internal anchor

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

cs.CV · 2026-04-06 · unverdicted · none · ref 47 · internal anchor

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · none · ref 121 · internal anchor

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · none · ref 58 · internal anchor

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · none · ref 76 · internal anchor

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

cs.LG · 2026-05-12 · unverdicted · none · ref 31 · internal anchor

Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

cs.LG · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor

GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05-09 · unverdicted · none · ref 38 · internal anchor

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · none · ref 73 · 2 links · internal anchor

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

cs.CV · 2026-05-07 · unverdicted · none · ref 46 · internal anchor

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

cs.LG · 2026-05-06 · unverdicted · none · ref 2 · 3 links · internal anchor

Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates near 50% in binary-reward RL, delivering wall-clock speedups and modest performance gains on SWE-bench Verified and AIME tasks.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · none · ref 3 · internal anchor

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

cs.IR · 2026-04-28 · unverdicted · none · ref 46 · internal anchor

GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baselines on benchmarks and industrial data.

Watch Before You Answer: Learning from Visually Grounded Post-Training

cs.CV · 2026-04-06 · unverdicted · none · ref 54 · internal anchor

Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

cs.LG · 2026-05-07 · unverdicted · none · ref 49 · internal anchor

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · none · ref 35 · internal anchor

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

cs.RO · 2026-04-09 · unverdicted · none · ref 130 · internal anchor

RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · none · ref 57 · internal anchor

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · none · ref 47 · internal anchor

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11 · unreviewed · ref 37 · internal anchor
