Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
super hub Canonical reference
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for mor
authors
co-cited works
representative citing papers
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
citing papers explorer
-
Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
-
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
-
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
-
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
-
SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
SetupX presents an experiential learning framework for LLM agents that reaches 92% pass rate on functionality-correct repository setup by transferring verified fixes across repositories via XPU representations, LIFO Docker snapshots, and Prosecutor-Judge verification.
-
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study
An empirical evaluation of philosophical dispositions constraining AI code review on 50 PRs shows 46% human convergence, 75% unique findings, zero author-judged false positives, and 51% findings absent from generic prompting.
-
Tail-aware N-version Machine Learning Models for Reliable API Recommendation
NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
-
Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains
Introduces Augment Engineering as a six-phase multi-tool orchestration methodology, supported by exploratory statistics from a single-practitioner case study across seven domains.
-
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt-Engineering Quality Assurance
A case study of iterative auditing on AEGIS LLM multi-agent specs found 51 defects over nine rounds with non-monotonic convergence, a post-hoc seven-category taxonomy, and replications showing category kappa of 0.80.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development
A multi-case study plus survey produces seven actionable recommendations for efficient and responsible LLM use in industrial software engineering.
-
Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
-
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
-
Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report
Presents a contract-driven adversarial verification architecture for AI-native software production with early deployment observations from 17 features.
-
Code Broker: A Multi-Agent System for Automated Code Quality Assessment
Code Broker deploys a five-agent hierarchy that combines LLM semantic analysis with static linting to generate actionable Python code quality reports.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
-
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.