vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
Part i: Tricks or traps? a deep dive into rl for llm reasoning
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable training and higher benchmark scores.
MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
citing papers explorer
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable training and higher benchmark scores.
-
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
-
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
-
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.