On the tool manipulation capability of open-source large language models

2023 · arXiv 2305.16504

Canonical reference. 100% of citing Pith papers cite this work as background.

11 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

Intent2Tx shows that LLMs often generate syntactically valid but functionally incorrect Ethereum transactions, especially on multi-step and out-of-distribution intents, despite gains from scaling and retrieval augmentation.

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

cs.GR · 2026-04-28 · unverdicted · novelty 7.0

Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.

Memory in the Age of AI Agents

cs.CL · 2025-12-15 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

cs.AI · 2025-07-28 · unverdicted · novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

cs.CL · 2023-06-08 · conditional · novelty 6.0

ToolAlpaca trains 7B and 13B models on 3938 simulated tool-use cases to reach generalized tool-use performance comparable to GPT-3.5 on unseen APIs.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

cs.AI · 2025-08-10 · unverdicted · novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

A Comprehensive Overview of Large Language Models

cs.CL · 2023-07-12 · unverdicted · novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

citing papers explorer

Showing 11 of 11 citing papers.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents
cs.IR · 2026-05-11 · unverdicted · none · ref 5

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
cs.AI · 2026-05-11 · unverdicted · none · ref 15

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions
cs.AI · 2026-04-30 · unverdicted · none · ref 23

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
cs.GR · 2026-04-28 · unverdicted · none · ref 38

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA · 2025-06-05 · accept · none · ref 203

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
cs.AI · 2026-04-07 · unverdicted · none · ref 46

Memory in the Age of AI Agents
cs.CL · 2025-12-15 · unverdicted · none · ref 260

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI · 2025-07-28 · unverdicted · none · ref 140

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
cs.CL · 2023-06-08 · conditional · none · ref 3

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI · 2025-08-10 · unverdicted · none · ref 106

A Comprehensive Overview of Large Language Models
cs.CL · 2023-07-12 · unverdicted · none · ref 222
