MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Ceyao Zhang, Jiaqi Chen, Mingchen Zhuge, Sirui Hong, Xiawu Zheng, Yuheng Cheng · 2023 · cs.AI · arXiv 2308.00352

Canonical reference. 92% of citing Pith papers cite this work as background.

157 Pith papers citing it

abstract

Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT
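The assembly-line idea in the abstract can be illustrated with a minimal sketch. This is not MetaGPT's actual API: the `Role` class, the `run_pipeline` function, the role names, and the `fake_llm` stub are all hypothetical names chosen for illustration. The sketch assumes each SOP step is a fixed instruction that one role applies to the previous role's output, and that each intermediate artifact is checked before it is handed on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def non_empty(output: str) -> bool:
    """Simplest possible intermediate check: the artifact must not be blank."""
    return bool(output.strip())


@dataclass(frozen=True)
class Role:
    """Hypothetical role: a name, an SOP instruction, and an output check."""
    name: str
    instruction: str
    check: Callable[[str], bool] = non_empty


def run_pipeline(
    roles: List[Role], task: str, llm: Callable[[str], str]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Pass the task through each role in order, verifying every handoff."""
    artifact = task
    trace: List[Tuple[str, str]] = []
    for role in roles:
        # The SOP step becomes part of the prompt; the previous artifact is the input.
        prompt = f"{role.name}: {role.instruction}\n---\n{artifact}"
        output = llm(prompt)
        # Stop early instead of letting a bad artifact cascade downstream.
        if not role.check(output):
            raise ValueError(f"{role.name} produced an invalid artifact")
        trace.append((role.name, output))
        artifact = output
    return artifact, trace


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (an assumption of this sketch)."""
    role_name = prompt.split("\n", 1)[0].split(":", 1)[0]
    return f"{role_name} output"


# Hypothetical roles, loosely modeled on a software team.
ROLES = [
    Role("ProductManager", "Write requirements for the task."),
    Role("Architect", "Design the system from the requirements."),
    Role("Engineer", "Implement the design."),
]

if __name__ == "__main__":
    final, trace = run_pipeline(ROLES, "Build a to-do list app", fake_llm)
    print(final)
    for name, output in trace:
        print(name, "->", output)
```

Swapping `fake_llm` for a real model call and giving each role a stricter `check` (for example, validating a structured document) would be the natural next steps. The point of the sketch is only that a failed check halts the chain at the step where the error occurred.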

citation-role summary

background 35 · method 2 · dataset 1 · other 1

citation-polarity summary

background 36 · unclear 1 · use dataset 1 · use method 1

authors

Ceyao Zhang, Jiaqi Chen, Mingchen Zhuge, Sirui Hong, Xiawu Zheng, Yuheng Cheng

representative citing papers

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

The Khipu Problem: Institutional Legibility Under Distributed Cognition

cs.CY · 2026-05-06 · unverdicted · novelty 8.0

The khipu problem frames a governance failure in distributed AI where interpretive continuity is lost even when traces remain, requiring infrastructure to preserve reading practices rather than only data retention.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

cs.DC · 2026-06-30 · unverdicted · novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research

physics.ao-ph · 2026-06-05 · unverdicted · novelty 7.0

TianJi-Environ is a WRF-Chem-based multi-agent AI framework for autonomous validation of atmospheric chemistry mechanisms through executable experiments and evidence assessment.

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Introduces a 3-axis taxonomy (what info, alignment, fusion) for latent communication in multi-agent LLMs and identifies five design patterns from 18 methods.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

cs.CY · 2026-05-11 · unverdicted · novelty 7.0

TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.

Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets

cs.CR · 2026-04-28 · unverdicted · novelty 7.0

A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.

citing papers explorer

Showing 50 of 157 citing papers.

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints cs.AI · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
The Khipu Problem: Institutional Legibility Under Distributed Cognition cs.CY · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
The khipu problem frames a governance failure in distributed AI where interpretive continuity is lost even when traces remain, requiring infrastructure to preserve reading practices rather than only data retention.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation cs.CR · 2025-07-14 · unverdicted · none · ref 14 · internal anchor
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 54 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering cs.DC · 2026-06-30 · unverdicted · none · ref 24 · internal anchor
SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 21 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents cs.MA · 2026-06-25 · accept · none · ref 44 · internal anchor
Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.
Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries cs.CL · 2026-06-09 · unverdicted · none · ref 7 · internal anchor
EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems cs.AI · 2026-06-05 · unverdicted · none · ref 31 · internal anchor
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research physics.ao-ph · 2026-06-05 · unverdicted · none · ref 39 · internal anchor
TianJi-Environ is a WRF-Chem-based multi-agent AI framework for autonomous validation of atmospheric chemistry mechanisms through executable experiments and evidence assessment.
Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems cs.CL · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
Introduces a 3-axis taxonomy (what info, alignment, fusion) for latent communication in multi-agent LLMs and identifies five design patterns from 18 methods.
ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer cs.SE · 2026-06-04 · unverdicted · none · ref 6 · internal anchor
ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.
OctoT2I: A Self-Evolving Agentic Text-to-Image Router cs.AI · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 41 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents cs.CY · 2026-05-11 · unverdicted · none · ref 32 · internal anchor
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding cs.CV · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.
Social Bias in LLM-Generated Code: Benchmark and Mitigation cs.SE · 2026-05-01 · unverdicted · none · ref 93 · internal anchor
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves cs.SE · 2026-04-29 · unverdicted · none · ref 16 · internal anchor
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates cs.SE · 2026-04-29 · unverdicted · none · ref 15 · internal anchor
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets cs.CR · 2026-04-28 · unverdicted · none · ref 13 · internal anchor
A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery cs.SE · 2026-04-27 · unverdicted · none · ref 14 · internal anchor
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents cs.CV · 2026-04-26 · unverdicted · none · ref 19 · internal anchor
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 107 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery cs.CR · 2026-04-22 · unverdicted · none · ref 20 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data cs.AI · 2026-04-22 · unverdicted · none · ref 64 · internal anchor
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration cs.AI · 2026-04-17 · unverdicted · none · ref 24 · internal anchor
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities cs.CR · 2026-04-15 · conditional · none · ref 12 · internal anchor
LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability cs.AI · 2026-04-01 · conditional · none · ref 11 · internal anchor
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization? cs.AI · 2026-03-26 · conditional · none · ref 14 · internal anchor
An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network cs.CL · 2026-03-09 · unverdicted · none · ref 26 · internal anchor
Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.
Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems cs.MA · 2026-02-23 · unverdicted · none · ref 14 · internal anchor
Agentic Hives apply dynamic general equilibrium theory to variable populations of language-model agents, proving existence of equilibria, Pareto optimality, multiplicity, comparative-statics analogs, Hopf bifurcations, and stability conditions.
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation cs.SE · 2026-02-06 · conditional · none · ref 19 · internal anchor
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications cs.SE · 2025-09-23 · conditional · none · ref 18 · internal anchor
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems cs.MA · 2025-06-05 · accept · none · ref 57 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Prompt Injection Attack to Tool Selection in LLM Agents cs.CR · 2025-04-28 · conditional · none · ref 4 · internal anchor
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 161 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation cs.CL · 2023-12-20 · accept · none · ref 17 · internal anchor
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 51 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems cs.CR · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when monitoring the top 10%.
Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting cs.SE · 2026-06-28 · unverdicted · none · ref 16 · internal anchor
Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.
The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables cs.LG · 2026-06-27 · unverdicted · none · ref 23 · internal anchor
Introduces the Contagion Tensor and CAF metrics to measure output-distribution coupling in multi-agent LLM systems, with simulation ablations showing artifact removal and real-API tests on GPT-4o-mini and DeepSeek validating modality effects.
LLM-as-Code: Agentic Programming for Agent Harness cs.AI · 2026-06-14 · unverdicted · none · ref 5 · internal anchor
Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.
Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection cs.CL · 2026-06-10 · unverdicted · none · ref 25 · internal anchor
A manager-worker multi-agent framework adaptively allocates reasoning-only agents and synthesizes their explanations to improve stance detection on implicit cases across three datasets.
PDE-Agents: An LLM-Orchestrated Multi-Agent Framework for Automated Finite Element Simulations with Knowledge Graph-Augmented Reasoning physics.comp-ph · 2026-06-05 · unverdicted · none · ref 41 · internal anchor
PDE-Agents shows a LangGraph-orchestrated multi-agent LLM framework with GraphRAG that reaches 100% task success and perfect material fidelity on novel materials in ablation tests, with 97.8% success across 1369 production runs.
Parthenon Law: A Self-Evolving Legal-Agent Framework cs.AI · 2026-06-03 · unverdicted · none · ref 18 · internal anchor
Parthenon is a self-evolving legal-agent framework that factors components for traceability and uses an anti-leakage learning loop to improve from scored failures on legal matters.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories cs.CL · 2026-05-31 · unverdicted · none · ref 42 · internal anchor
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
