hub Mixed citations

Metatool benchmark for large language models: Deciding whether to use tools and which to use

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al · 2023 · arXiv 2310.03128

Mixed citation behavior. Most common role is background (67%).

17 Pith papers citing it

Background 67% of classified citations

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 dataset 2

citation-polarity summary

background 6 use dataset 2 unclear 1

representative citing papers

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

cs.AI · 2025-06-09 · unverdicted · novelty 7.0

τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

Prompt Injection Attack to Tool Selection in LLM Agents

cs.CR · 2025-04-28 · conditional · novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

cs.AI · 2024-06-17 · unverdicted · novelty 7.0

τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

Memory in the Age of AI Agents

cs.CL · 2025-12-15 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

cs.AI · 2025-07-28 · unverdicted · novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

cs.AI · 2026-05-05 · unverdicted · novelty 5.0

A framework automates multi-agent system creation via LLM planning and two-stage agent recommendation, claiming higher recall than prior methods.

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

A hybrid deterministic-plus-semantic interception layer for continuous task-based authorization of multi-turn LLM agent tool invocations, with new multi-turn datasets and initial experiments.

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

cs.CL · 2025-10-27 · unverdicted · novelty 5.0

LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

cs.CV · 2024-02-27 · unverdicted · novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

cs.AI · 2026-05-01

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

cs.AI · 2025-10-03

citing papers explorer

Showing 17 of 17 citing papers.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 164 · 2 links
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use cs.AI · 2026-05-13 · unverdicted · none · ref 8 · 2 links
Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 9
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment cs.AI · 2025-06-09 · unverdicted · none · ref 9
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
Prompt Injection Attack to Tool Selection in LLM Agents cs.CR · 2025-04-28 · conditional · none · ref 22
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains cs.AI · 2024-06-17 · unverdicted · none · ref 11
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles cs.LG · 2026-05-21 · unverdicted · none · ref 15
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval cs.AI · 2026-05-04 · unverdicted · none · ref 14
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 261
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 50
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
From Intent to Execution: Composing Agentic Workflows with Agent Recommendation cs.AI · 2026-05-05 · unverdicted · none · ref 4
A framework automates multi-agent system creation via LLM planning and two-stage agent recommendation, claiming higher recall than prior methods.
Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI cs.AI · 2026-05-04 · unverdicted · none · ref 18
A hybrid deterministic-plus-semantic interception layer for continuous task-based authorization of multi-turn LLM agent tool invocations, with new multi-turn datasets and initial experiments.
Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception cs.CL · 2025-10-27 · unverdicted · none · ref 13
LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.
TrustLLM: Trustworthiness in Large Language Models cs.CL · 2024-01-10 · unverdicted · none · ref 140
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models cs.CV · 2024-02-27 · unverdicted · none · ref 122
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling cs.AI · 2026-05-01 · unreviewed · ref 39
Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents cs.AI · 2025-10-03 · unreviewed · ref 6

Metatool benchmark for large language models: Deciding whether to use tools and which to use

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer