StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li · 2024 · DOI 10.18653/v1/2024.findings-acl.664

5 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

PilotBench reveals that LLMs follow safety instructions well in flight trajectory prediction but deliver lower numerical precision than traditional forecasters, exposing a precision-controllability tradeoff.

PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools

cs.AI · 2026-04-02 · unverdicted · novelty 7.0

PHMForge benchmark shows LLM agents achieve 80.8% pass@1 on prognostic tasks with native MCP tools but performance collapses from 100% to 20% when using text RAG instead.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

cs.AI · 2025-01-27 · unverdicted · novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

citing papers explorer


Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents · cs.LG · 2026-05-22 · unverdicted · none · ref 17
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory · cs.AI · 2026-05-08 · unverdicted · none · ref 9
PilotBench: A Benchmark for General Aviation Agents with Safety Constraints · cs.AI · 2026-04-10 · unverdicted · none · ref 16
PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools · cs.AI · 2026-04-02 · unverdicted · none · ref 7
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions · cs.AI · 2025-01-27 · unverdicted · none · ref 49
