ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
Advances in Neural Information Processing Systems , volume=
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
citing papers explorer
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
-
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.