Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
hub Mixed citations
Hybridflow: A flexible and efficient rlhf framework
Mixed citation behavior. Most common role is method (38%).
hub tools
citation-role summary
citation-polarity summary
years
2026 15representative citing papers
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower attack success while preserving task performance.
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.
RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
citing papers explorer
No citing papers match the current filters.