Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.
hub
In Practice and Experience in Advanced Research Comput- ing 2019: Rise of the Machines (Learning) , PEARC ’19
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Eval-Skill synthesizes reusable domain-level evaluation skills from 100 cases via two-stage exploration-guided evolution and injects them into judge context, improving LLM judges on RewardBench 2 by 13-18%.
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, reporting SOTA results on AIME benchmarks.
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
EntroRouter applies entropy regulation in a single-round routing framework to decouple reasoning from routing, retaining 98.3% of top expert accuracy at 48.25% lower compute cost.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
Agentic misalignment in multi-agent systems arises from generic utilities causing posterior collapse; Agentic Evidence Attribution using self-reflection or weak-to-strong generalization provides context-specific evidence to align agent posteriors.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Counsel: A Meta-Evaluation Dataset for Agentic Tasks
Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.
-
Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling
Eval-Skill synthesizes reusable domain-level evaluation skills from 100 cases via two-stage exploration-guided evolution and injects them into judge context, improving LLM judges on RewardBench 2 by 13-18%.
-
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
-
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
-
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
-
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
-
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, reporting SOTA results on AIME benchmarks.
-
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
-
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
-
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
-
Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
-
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
-
EntroRouter: Learning Efficient Model Routing via Entropy Regulation
EntroRouter applies entropy regulation in a single-round routing framework to decouple reasoning from routing, retaining 98.3% of top expert accuracy at 48.25% lower compute cost.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
-
CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
-
A Sober Look at Agentic Misalignment in Automated Workflows
Agentic misalignment in multi-agent systems arises from generic utilities causing posterior collapse; Agentic Evidence Attribution using self-reflection or weak-to-strong generalization provides context-specific evidence to align agent posteriors.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.
-
Self-Aligned Reward: Towards Effective and Efficient Reasoners
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
-
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- VRPRM: Process Reward Modeling via Visual Reasoning