Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

2025 · arXiv 2508.08221

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

What are Key Factors for Updates in RL for LLM Reasoning?

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

Theoretical analysis of RLVR update dynamics leads to ACPO, an adaptive clipping method that outperforms DAPO and CISPO on reasoning benchmarks with 3B and 7B models.

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable training and higher benchmark scores.

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation

cs.CV · 2026-02-17 · unverdicted · novelty 6.0

MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

cs.CL · 2025-12-08 · unverdicted · novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

cs.LG · 2026-06-20 · unverdicted · novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.

TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

q-bio.QM · 2025-11-20 · unverdicted · novelty 5.0

TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.

citing papers explorer

KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG · 2026-05-08 · unverdicted · none · ref 24

What are Key Factors for Updates in RL for LLM Reasoning?
cs.CL · 2026-06-21 · unverdicted · none · ref 40

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
cs.LG · 2026-06-10 · unverdicted · none · ref 128

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
cs.CV · 2026-05-21 · unverdicted · none · ref 30

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
cs.LG · 2026-04-14 · unverdicted · none · ref 11

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
cs.CV · 2026-02-17 · unverdicted · none · ref 78

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
cs.CL · 2025-12-08 · unverdicted · none · ref 6

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning
cs.LG · 2026-06-20 · unverdicted · none · ref 123

Trust Region On-Policy Distillation
cs.LG · 2026-05-31 · unverdicted · none · ref 213

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
cs.CL · 2026-05-25 · unverdicted · none · ref 15

TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
q-bio.QM · 2025-11-20 · unverdicted · none · ref 61
