Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
super hub Canonical reference
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for mor
authors
co-cited works
representative citing papers
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
citing papers explorer
-
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
-
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
-
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination
PSMAS reduces token use in LLM multi-agent systems by 27.3% on average via phase-based temporal scheduling and context compression, with task performance staying within 2.1 points of full activation.
-
Preregistered Belief Revision Contracts
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
-
AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power
AgentCity introduces a Separation of Power constitutional architecture on blockchain for governing autonomous agent economies through agent legislation, automated execution, and human accountability.
-
PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
-
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
-
Collective AI can amplify tiny perturbations into divergent decisions
Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
LLM agents iteratively generate and optimize data processing strategies for fine-tuning, delivering over 80% win rates versus unprocessed data and 65% versus LLM-based AutoML baselines while cutting search time by up to 10x.
-
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web
Holos is a five-layer LLM-based multi-agent system architecture using the Nuwa engine for agent generation, a market-driven Orchestrator for coordination, and an endogenous value cycle for incentive-compatible persistence in the Agentic Web.
-
Device-Native Autonomous Agents for Privacy-Preserving Negotiations
A device-native autonomous agent system using zero-knowledge proofs and distilled world models achieves 87% negotiation success, 2.4x lower latency than cloud systems, and 27% higher user trust in privacy-sensitive scenarios.
-
TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM on representative benchmarks.
-
Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.
-
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
-
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society
AgentSociety is a large-scale LLM agent-based social simulator validated on polarization, UBI, disasters, and sustainability issues with alignment to real experiments.
-
RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models
RankFlow deploys four LLM roles in sequence to rewrite queries, generate pseudo-answers, summarize passages, and rerank candidates, outperforming prior methods on TREC-DL, BEIR, and NovelEval.
-
Cognitive Architectures for Language Agents
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
-
ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy
ATOM uses a nucleus-electron hierarchy and task-driven RL to generate budget-controllable multi-agent collaboration graphs for LLMs, claiming SOTA performance with up to 30% better token efficiency on six benchmarks.
-
Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities
Agent Manufacturing is defined as manufacturing systems whose principal coordination uses foundation-model agents for open-ended goal interpretation, long-horizon planning, tool invocation, and negotiation.
-
PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback
PRIMA introduces resilience layers, sub-agent operating disciplines, multi-phase patterns, prime-power identities, and a protocol for convergent multi-agent LLM research, demonstrated in a graph isomorphism case study.
-
Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains
Introduces Augment Engineering as a six-phase multi-tool orchestration methodology, supported by exploratory statistics from a single-practitioner case study across seven domains.
-
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
-
MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation
MetaCogAgent equips multi-agent LLMs with metacognitive self-assessment, adaptive delegation, and capability learning to reach 82.4% accuracy on a 700-task benchmark while using fewer API calls than baselines.
-
TopoClaw: A Human-Centric and Topology-Aware Agent Operating System
TopoClaw is a human-centric Agent OS that uses physical and social topology modeling to enable cross-boundary execution with identity attribution and context-aware governance.
-
SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring
SHM-Agents is an LLM-plus-specialist-agent framework that claims to execute a wide range of SHM tasks end-to-end via natural language on data from a long-span cable-stayed bridge.
-
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.
-
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt-Engineering Quality Assurance
A case study of iterative auditing on AEGIS LLM multi-agent specs found 51 defects over nine rounds with non-monotonic convergence, a post-hoc seven-category taxonomy, and replications showing category kappa of 0.80.
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
-
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet suite.
-
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
-
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
-
Knowledge Compounding: An Empirical Economic Analysis of Self-Evolving Knowledge Wikis under the Agentic ROI Framework
A four-query experiment demonstrates 84.6% token savings through knowledge compounding in self-evolving wikis compared to standard RAG, by amortizing ingestion costs and reusing synthesized knowledge over time.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
-
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.
-
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
-
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
-
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
-
Semantic-Aware Logical Reasoning via a Semiotic Framework
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
-
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
-
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.