Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
hub
Tool Learning with Foundation Models
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
-
FactoryBench: Evaluating Industrial Machine Understanding
FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
-
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.