Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT
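The assembly-line idea in the abstract, where each role hands an intermediate result to the next and that result is verified before moving on, can be illustrated with a minimal sketch. This is not MetaGPT's actual API: the `Role` dataclass, the `run_sop` function, the role names, and the stubbed `act` callables (standing in for LLM calls) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Role:
    """Hypothetical role in an SOP pipeline (not MetaGPT's real class)."""
    name: str
    act: Callable[[str], str]      # stand-in for a prompted LLM call
    verify: Callable[[str], bool]  # check on the intermediate result


def run_sop(roles: List[Role], task: str) -> List[str]:
    """Pass the task through each role in order, verifying every handoff."""
    artifact = task
    trail = []
    for role in roles:
        artifact = role.act(artifact)
        if not role.verify(artifact):
            # Stop early so an invalid result does not cascade downstream.
            raise ValueError(f"{role.name} produced an invalid artifact")
        trail.append(f"{role.name}: {artifact}")
    return trail


def stub(prefix: str) -> Callable[[str], str]:
    """Assumed placeholder for an LLM: wraps the input in a labeled call."""
    return lambda text: f"{prefix}({text})"


# Hypothetical three-step SOP; role names are illustrative only.
roles = [
    Role("ProductManager", stub("requirements"), lambda out: out.startswith("requirements")),
    Role("Architect", stub("design"), lambda out: out.startswith("design")),
    Role("Engineer", stub("code"), lambda out: out.startswith("code")),
]

trail = run_sop(roles, "build a todo app")
for step in trail:
    print(step)
```

The point of the per-step `verify` hook is that an error is caught at the handoff where it occurs, rather than propagating through later roles, which is the failure mode the abstract attributes to naively chained LLMs.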
representative citing papers
The khipu problem frames a governance failure in distributed AI where interpretive continuity is lost even when traces remain, requiring infrastructure to preserve reading practices rather than only data retention.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.
RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.
Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.
EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
TianJi-Environ is a WRF-Chem-based multi-agent AI framework for autonomous validation of atmospheric chemistry mechanisms through executable experiments and evidence assessment.
Introduces a 3-axis taxonomy (what info, alignment, fusion) for latent communication in multi-agent LLMs and identifies five design patterns from 18 methods.
ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.
OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
citing papers explorer
- ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
  ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
- Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets
  A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
  AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
- Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities
  LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.
- Prompt Injection Attack to Tool Selection in LLM Agents
  ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
- MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems
  MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when monitoring the top 10%.
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
  The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
- Device-Native Autonomous Agents for Privacy-Preserving Negotiations
  A device-native autonomous agent system using zero-knowledge proofs and distilled world models achieves 87% negotiation success, 2.4x lower latency than cloud systems, and 27% higher user trust in privacy-sensitive scenarios.
- Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
  A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet suite.
- Towards Cybersecurity SuperIntelligence (CSI): What's the best harness for cybersecurity?
  CSI meta-scaffold unifies five LLM agent harnesses; a blackboard multi-agent system solves 19/33 cybench challenges (57.6%) versus 15/33 for the best single scaffold.