This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
hub Canonical reference
A Survey on Large Language Model based Autonomous Agents
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at https://github.com/Paitesanshi/LLM-Agent-Survey.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
Introduces Sentinel Challenge benchmark and CoSaR framework for cooperative spatial reasoning and planning among 3-5 decentralized embodied agents across 14 city-scale scenes.
Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.
The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.
The OpenHands Software Agent SDK provides a composable architecture for production software agents featuring native sandboxed execution, multi-LLM routing, and built-in security analysis.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
Visual Inception poisons images to hijack long-term memory in agentic recommenders and steer planning, while CognitiveGuard reduces success to about 10% via perceptual sanitization and reasoning verification.
A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
Introduces the concept of agentic inequality and develops a three-dimensional framework (availability, quality, quantity) to analyze how autonomous AI agents could deepen or mitigate existing divides through scalable goal delegation.
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
PRIMA introduces resilience layers, sub-agent operating disciplines, multi-phase patterns, prime-power identities, and a protocol for convergent multi-agent LLM research, demonstrated in a graph isomorphism case study.
HarnessAPI derives streaming HTTP endpoints, OpenAPI UI, and MCP tools from a single handler.py plus Pydantic schemas, cutting framework boilerplate by 74%.
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.
citing papers explorer
-
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
-
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
-
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.