super hub Mixed citations

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu et al · 2025 · cs.LG · arXiv 2503.14476

Mixed citation behavior. Most common role is background (55%).

549 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 549 citing papers more from Qiying Yu et al arXiv PDF

abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 59 method 26 baseline 11 dataset 10 other 1

citation-polarity summary

background 59 use method 24 baseline 11 use dataset 10 unclear 3

claims ledger

abstract Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50

authors

Qiying Yu et al

co-cited works

representative citing papers

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

cs.AI · 2026-06-04 · conditional · novelty 8.0

Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Verifiable Rewards for Calibrated Probabilistic Forecasting

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A verifiable empirical win rate reward combined with gradient masking enables RL training of a 7B model to reach betting-market calibration on NFL win probabilities using only outcome data.

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

GRPO, Dr. GRPO, and DAPO are three settings of one dial on the group standard deviation of binary rewards, unified by the group-standard-deviation identity where disagreement equals update magnitude.

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

CERO uses Beta posteriors and Fenchel-dual online optimization to adaptively allocate a fixed rollout budget across prompts and epochs in LLM RL, outperforming fixed-allocation GRPO on math reasoning benchmarks.

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

SAGC dynamically adjusts group sizes in synchronous GRPO and DAPO via online constrained optimization to cut stragglers, improve wall-clock speed, and maintain or improve rewards and downstream reasoning performance.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

RL2ML introduces a parameterized family of surrogate objectives bridging RL and ML with unbiased gradient estimators, group-level update-scale analysis, and metric-dependent optimization for finite-rollout LLM training.

citing papers explorer

Showing 50 of 455 citing papers after filters.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider cs.LG · 2026-06-22 · conditional · none · ref 24 · internal anchor
RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.
A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR cs.AI · 2026-06-04 · conditional · none · ref 17 · internal anchor
Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing cs.DC · 2026-06-02 · unverdicted · none · ref 71 · internal anchor
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning cs.LG · 2026-05-09 · conditional · none · ref 21 · internal anchor
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Verifiable Rewards for Calibrated Probabilistic Forecasting cs.LG · 2026-06-30 · unverdicted · none · ref 17 · internal anchor
A verifiable empirical win rate reward combined with gradient masking enables RL training of a 7B model to reach betting-market calibration on NFL win probabilities using only outcome data.
GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity cs.LG · 2026-06-30 · unverdicted · none · ref 8 · internal anchor
GRPO, Dr. GRPO, and DAPO are three settings of one dial on the group standard deviation of binary rewards, unified by the group-standard-deviation identity where disagreement equals update magnitude.
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning cs.LG · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.
When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon cs.LG · 2026-06-29 · unverdicted · none · ref 66 · internal anchor
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents cs.AI · 2026-06-29 · unverdicted · none · ref 43 · internal anchor
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning cs.LG · 2026-06-28 · unverdicted · none · ref 14 · internal anchor
Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.
Tandem Reinforcement Learning with Verifiable Rewards cs.AI · 2026-06-26 · unverdicted · none · ref 20 · internal anchor
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR cs.AI · 2026-06-23 · unverdicted · none · ref 60 · internal anchor
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs cs.CL · 2026-06-10 · unverdicted · none · ref 79 · internal anchor
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training cs.LG · 2026-06-10 · unverdicted · none · ref 21 · internal anchor
ART optimizes visual pixel inputs to frozen MLLMs to achieve LoRA-competitive accuracy on math and structured tool-use benchmarks without modifying computational graphs.
On the Geometry of On-Policy Distillation cs.LG · 2026-06-05 · unverdicted · none · ref 20 · internal anchor
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training cs.LG · 2026-06-04 · unverdicted · none · ref 15 · internal anchor
CERO uses Beta posteriors and Fenchel-dual online optimization to adaptively allocate a fixed rollout budget across prompts and epochs in LLM RL, outperforming fixed-allocation GRPO on math reasoning benchmarks.
Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces cs.AI · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents cs.LG · 2026-06-03 · unverdicted · none · ref 30 · internal anchor
CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification cs.LG · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
TTRL-CoCoV is a confidence-conditioned test-time RL framework that selectively applies verification to address pseudo-label errors and diversity collapse, yielding +9.8% Pass@1 and +18.7% Pass@16 gains over prior TTRL on reasoning benchmarks.
Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing cs.LG · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
SAGC dynamically adjusts group sizes in synchronous GRPO and DAPO via online constrained optimization to cut stragglers, improve wall-clock speed, and maintain or improve rewards and downstream reasoning performance.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 35 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood cs.LG · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
RL2ML introduces a parameterized family of surrogate objectives bridging RL and ML with unbiased gradient estimators, group-level update-scale analysis, and metric-dependent optimization for finite-rollout LLM training.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL cs.LG · 2026-05-27 · conditional · none · ref 47 · internal anchor
Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor cs.AI · 2026-05-27 · unverdicted · none · ref 43 · internal anchor
Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.
Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use cs.LG · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
CARL trains a critic for segment-level credit assignment from binary outcomes in LLM tool-use trajectories, yielding 6.7-9.7 point accuracy gains and 53% fewer calls on solvable questions across five benchmarks.
Touch-R1: Reinforcing Touch Reasoning in MLLMs cs.CV · 2026-05-26 · unverdicted · none · ref 31 · internal anchor
Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data cs.LG · 2026-05-26 · unverdicted · none · ref 40 · internal anchor
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models cs.CL · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.
CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning cs.LG · 2026-05-23 · unverdicted · none · ref 34 · internal anchor
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval cs.CV · 2026-05-22 · unverdicted · none · ref 24 · internal anchor
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection cs.CV · 2026-05-22 · unverdicted · none · ref 75 · internal anchor
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Learnability-Informed Fine-Tuning of Diffusion Language Models cs.CL · 2026-05-21 · unverdicted · none · ref 16 · internal anchor
LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
Unlocking Proactivity in Task-Oriented Dialogue cs.AI · 2026-05-21 · unverdicted · none · ref 11 · 2 links · internal anchor
Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks cs.CV · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
Scalable Reinforcement Learning via Adaptive Batch Scaling stat.ML · 2026-05-20 · unverdicted · none · ref 20 · 2 links · internal anchor
ABS uses Behavioral Divergence to adaptively scale batch sizes in RL according to policy volatility, enabling effective large-batch large-network training on ALE benchmarks.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 39 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization cs.LG · 2026-05-19 · conditional · none · ref 22 · internal anchor
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 38 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation cs.AI · 2026-05-18 · unverdicted · none · ref 22 · internal anchor
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG · 2026-05-17 · unverdicted · none · ref 61 · internal anchor
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
Weak-to-Strong Elicitation via Mismatched Wrong Drafts cs.CL · 2026-05-17 · unverdicted · none · ref 5 · 2 links · internal anchor
Mismatched wrong drafts from Qwen2.5-Math-1.5B improve Mathstral-7B GRPO training, reaching 71.98% greedy pass@1 on MATH-500 and lifting AIME 2025/2026 pass@k over baselines and other draft variants.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG · 2026-05-17 · unverdicted · none · ref 9 · internal anchor
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking cs.LG · 2026-05-15 · unverdicted · none · ref 30 · internal anchor
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs cs.LG · 2026-05-15 · unverdicted · none · ref 42 · internal anchor
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 43 · internal anchor
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training cs.LG · 2026-05-14 · unverdicted · none · ref 59 · internal anchor
PODS is a plug-and-play oscillatory data-volume scheduler that alternates low-ratio regularization phases with high-ratio recovery phases to improve data selection efficiency across training tasks.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents cs.AI · 2026-05-13 · unverdicted · none · ref 55 · 2 links · internal anchor
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 12 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer