hub Mixed citations

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu · 2025

Mixed citation behavior. Most common role is method (38%).

15 Pith papers citing it

Method 38% of classified citations

browse 15 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

method 4 baseline 2 other 2

citation-polarity summary

use method 3 baseline 2 unclear 2 background 1

representative citing papers

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower attack success while preserving task performance.

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

cs.CL · 2026-03-09 · unverdicted · novelty 6.0

CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

citing papers explorer

Showing 15 of 15 citing papers.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation cs.LG · 2026-05-12 · conditional · none · ref 27
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 29 · 2 links
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 48 · 2 links
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation cs.LG · 2026-05-08 · unverdicted · none · ref 27
PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower attack success while preserving task performance.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 30 · 2 links
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 28
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models cs.AI · 2026-05-13 · unverdicted · none · ref 11 · 2 links
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction cs.LG · 2026-05-12 · unverdicted · none · ref 21
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information cs.LG · 2026-05-12 · unverdicted · none · ref 24
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training cs.AI · 2026-05-08 · unverdicted · none · ref 30
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 86 · 2 links
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora cs.SE · 2026-04-20 · unverdicted · none · ref 57
CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning cs.CL · 2026-03-09 · unverdicted · none · ref 30
CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR cs.LG · 2026-05-07 · unverdicted · none · ref 32
RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation cs.CV · 2026-05-08 · unverdicted · none · ref 21
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

Hybridflow: A flexible and efficient rlhf framework

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer