StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
5 Pith papers cite this work.
Citation summary
Verdicts: UNVERDICTED 5
Roles: background 2
Polarities: background 2
Citing papers
- Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
  Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
- MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
  MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
- PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
  PilotBench reveals that LLMs follow safety instructions well in flight trajectory prediction but deliver lower numerical precision than traditional forecasters, exposing a precision-controllability tradeoff.
- PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
  PHMForge benchmark shows LLM agents achieve 80.8% pass@1 on prognostic tasks with native MCP tools but performance collapses from 100% to 20% when using text RAG instead.
- A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
  A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.