Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.
hub
In Practice and Experience in Advanced Research Comput- ing 2019: Rise of the Machines (Learning) , PEARC ’19
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Eval-Skill synthesizes reusable domain-level evaluation skills from 100 cases via two-stage exploration-guided evolution and injects them into judge context, improving LLM judges on RewardBench 2 by 13-18%.
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, reporting SOTA results on AIME benchmarks.
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
EntroRouter applies entropy regulation in a single-round routing framework to decouple reasoning from routing, retaining 98.3% of top expert accuracy at 48.25% lower compute cost.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
Agentic misalignment in multi-agent systems arises from generic utilities causing posterior collapse; Agentic Evidence Attribution using self-reflection or weak-to-strong generalization provides context-specific evidence to align agent posteriors.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
-
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
-
Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.
-
Self-Aligned Reward: Towards Effective and Efficient Reasoners
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
- VRPRM: Process Reward Modeling via Visual Reasoning