
arxiv: 2308.08155 · v2 · submitted 2023-08-16 · 💻 cs.AI · cs.CL

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Pith reviewed 2026-05-24 07:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: applications, autogen, agents, build, conversation, developers, framework, various

The pith

AutoGen provides an open-source framework for multi-agent LLM conversations that support customizable interactions across diverse applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes AutoGen as a system where developers create teams of AI agents. Each agent can use large language models, accept human input, or call tools. Agents talk to each other in patterns that can be defined in natural language or code. The framework is shown through examples in math, coding, question answering, and decision-making tasks.
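The idea of agents backed by different sources (an LLM, a human, or a tool) that exchange messages can be illustrated with a small, self-contained sketch. Everything below is hypothetical and written for illustration only: `SketchAgent`, `run_chat`, the `CALL`/`RESULT:` message convention, and the scripted stand-in for an LLM are assumptions, not AutoGen's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
# A reply function maps the conversation so far to the next message,
# or returns None when the agent has nothing more to say.
ReplyFn = Callable[[List[Message]], Optional[str]]


@dataclass
class SketchAgent:
    """Hypothetical conversable agent: a name plus a pluggable reply backend."""
    name: str
    reply_fn: ReplyFn


def scripted_assistant(history: List[Message]) -> Optional[str]:
    # Stand-in for an LLM call: request a tool, then phrase the result.
    last = history[-1]["content"]
    if last.startswith("RESULT:"):
        return "The answer is " + last[len("RESULT:"):].strip() + "."
    return "CALL add 2 3"


def tool_executor(history: List[Message]) -> Optional[str]:
    # Tool-backed agent: runs the requested tool, stays silent otherwise.
    last = history[-1]["content"]
    if not last.startswith("CALL add "):
        return None
    _, _, a, b = last.split()
    return f"RESULT: {int(a) + int(b)}"


def run_chat(initiator: SketchAgent, responder: SketchAgent,
             task: str, max_turns: int = 10) -> List[Message]:
    """Alternate replies between two agents until one returns None."""
    history = [{"sender": initiator.name, "content": task}]
    speaker, listener = responder, initiator
    for _ in range(max_turns):
        message = speaker.reply_fn(history)
        if message is None:
            break
        history.append({"sender": speaker.name, "content": message})
        speaker, listener = listener, speaker
    return history


executor = SketchAgent("executor", tool_executor)
assistant = SketchAgent("assistant", scripted_assistant)
history = run_chat(executor, assistant, "What is 2 + 3?")
for m in history:
    print(f'{m["sender"]}: {m["content"]}')
```

Under the same assumptions, a human-in-the-loop agent would only need a reply function that reads from `input()`. The conversation loop itself does not care which backend produced a message.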

Core claim

AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks.

Load-bearing premise

That multi-agent conversation patterns, when flexibly defined, will reliably enable effective task completion across varied domains and LLM capacities as claimed in the empirical studies.

Figures

Figures reproduced from arXiv: 2308.08155 by Ahmed Hassan Awadallah, Beibin Li, Chi Wang, Doug Burger, Erkang Zhu, Gagan Bansal, Jiale Liu, Jieyu Zhang, Li Jiang, Qingyun Wu, Ryen W White, Shaokun Zhang, Xiaoyun Zhang, Yiran Wu.

Figure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left) AutoGen agents are conversable, customizable, and can be based on LLMs, tools, humans, or even a combination of them. (Top-middle) Agents can converse to solve tasks. (Right) They can form a chat, potentially with humans in the loop. (Bottom-middle) The framework supports flexible conversation patterns.
Figure 2: Illustration of how to use AutoGen to program a multi-agent conversation. The top sub-figure illustrates the built-in agents provided by AutoGen, which have unified conversation interfaces and can be customized. The middle sub-figure shows an example of using AutoGen to develop a two-agent system with a custom reply function. The bottom sub-figure illustrates the resulting automated agent chat from the two…
Figure 3: Six examples of diverse applications built using …
Figure 4: Performance on four applications A1-A4. (a) shows that …
Figure 5: Default system message for the built-in assistant agent in …
Figure 6: Examples of three settings utilized to solve math problems using …
Figure 7: Overview of Retrieval-augmented Chat which involves two agents, including a Retrieval …
Figure 8: Retrieval-augmented Chat without (W/O) and with (W/) …
Figure 9: We use AutoGen to solve tasks in the ALFWorld benchmark, which contains household tasks described in natural language. We propose two designs: a two-agent design where the assistant agent suggests the next step, and the Executor executes actions and provides feedback. The three-agent design adds a grounding agent that supplies commonsense facts to the executor when needed.
Figure 10: Comparison of results from two designs: (a) Two-agent design which consists of an …
Figure 11: Our re-implementation of OptiGuide with AutoGen streamlining agents’ interactions. The Commander receives user questions (e.g., What if we prohibit shipping from supplier 1 to roastery 2?) and coordinates with the Writer and Safeguard. The Writer crafts the code and interpretation, the Safeguard ensures safety (e.g., not leaking information, no malicious code), and the Commander executes the code. If iss…
Figure 12: A5: Dynamic Group Chat: Overview of how AutoGen enables dynamic group chats to solve tasks. The Manager agent, which is an instance of the GroupChatManager class, performs the following three steps: select a single speaker (in this case Bob), ask the speaker to respond, and broadcast the selected speaker’s message to all other agents.
Figure 13: Comparison of two-agent chat (a) and group chat (b) on a given task. The group chat …
Figure 14: A6: Conversational Chess: Our conversational chess application can support various …
Figure 15: Example conversations during a game involving two AI player agents and a board agent. …
Figure 16: Comparison of two designs, (a) without a board agent and (b) with a board agent, in …
Figure 17: We use AutoGen to build MiniWobChat, which solves tasks in the MiniWob++ benchmark. MiniWobChat consists of two agents: an assistant agent and an executor agent. The assistant agent suggests actions to manipulate the browser while the executor executes the suggested actions and returns rewards/feedback. The assistant agent records the feedback and continues until the feedback indicates task success or f…
Figure 18: Comparisons between RCI (state-of-the-art prior work) and MiniWobChat on the Mini…
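
The three manager steps named in the Figure 12 caption (select a speaker, ask it to respond, broadcast its message to the other agents) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `Participant`, `run_group_chat`, and the round-robin selection policy are hypothetical placeholders. The sketch does not use the GroupChatManager class and makes no claim about how the paper's manager actually chooses the next speaker.

```python
from typing import Callable, List, Tuple


class Participant:
    """Hypothetical group-chat member with an inbox of broadcast messages."""

    def __init__(self, name: str, respond: Callable[[List[str]], str]):
        self.name = name
        self.respond = respond
        self.inbox: List[str] = []


def round_robin(participants: List[Participant], turn: int) -> Participant:
    # Placeholder selection policy (an assumption): cycle through members.
    return participants[turn % len(participants)]


def run_group_chat(participants: List[Participant], task: str, rounds: int,
                   select=round_robin) -> List[Tuple[str, str]]:
    transcript = [("user", task)]
    for p in participants:
        p.inbox.append(f"user: {task}")
    for turn in range(rounds):
        speaker = select(participants, turn)        # step 1: select a speaker
        message = speaker.respond(speaker.inbox)    # step 2: ask it to respond
        transcript.append((speaker.name, message))
        for p in participants:                      # step 3: broadcast to others
            if p is not speaker:
                p.inbox.append(f"{speaker.name}: {message}")
    return transcript


def count_seen(inbox: List[str]) -> str:
    # Toy responder that reports how many messages it has received.
    return f"seen {len(inbox)} messages"


alice = Participant("alice", count_seen)
bob = Participant("bob", count_seen)
carol = Participant("carol", count_seen)
transcript = run_group_chat([alice, bob, carol], "Plan the experiment", rounds=3)
for name, text in transcript:
    print(f"{name}: {text}")
```

Passing a different `select` function is the only change needed to swap the turn-taking policy, which is the sense in which a manager can make the conversation pattern dynamic.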
Original abstract

AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AutoGen, an open-source framework for building LLM applications via multiple customizable, conversable agents that interact to complete tasks. Agents support various modes combining LLMs, human inputs, and tools; interaction behaviors can be defined flexibly in natural language or code. The work positions AutoGen as generic infrastructure for applications of varying complexity and claims that empirical studies demonstrate its effectiveness across domains including mathematics, coding, question answering, operations research, online decision-making, and entertainment.

Significance. If the framework functions as described, it offers a practical, extensible infrastructure that could reduce the engineering effort required to prototype multi-agent LLM systems. The open-source release supports reproducibility and community adoption. The contribution is primarily infrastructural rather than theoretical, with significance hinging on whether the reported examples generalize beyond the specific cases shown.

major comments (2)
  1. [Abstract and Empirical Studies] Abstract and sections describing empirical studies: the claim that 'empirical studies demonstrate the effectiveness of the framework in many example applications' is supported only by reported examples; without methods, data, controls, baselines, or quantitative metrics, the support remains at the level of illustration rather than rigorous verification, which is load-bearing for the effectiveness assertion.
  2. [Framework Description] Sections on agent modes and conversation patterns: while the framework is described as supporting flexible definition of behaviors, the manuscript provides no formal characterization (e.g., termination guarantees, consistency properties, or complexity bounds) of the conversation patterns, leaving open whether the claimed generality holds for arbitrary LLM capacities.
minor comments (1)
  1. [Introduction] Notation for agent roles and conversation modes could be introduced with a small table or diagram on first use to improve readability for readers unfamiliar with multi-agent setups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the strength of the empirical claims and the formal characterization of the framework. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Empirical Studies] Abstract and sections describing empirical studies: the claim that 'empirical studies demonstrate the effectiveness of the framework in many example applications' is supported only by reported examples; without methods, data, controls, baselines, or quantitative metrics, the support remains at the level of illustration rather than rigorous verification, which is load-bearing for the effectiveness assertion.

    Authors: We agree that the reported applications function as illustrative case studies rather than controlled experiments with baselines, quantitative metrics, or statistical analysis. The manuscript's primary contribution is the open-source framework and its design for flexible multi-agent interactions; the examples demonstrate how the framework can be applied across domains but do not constitute rigorous verification of effectiveness. We will revise the abstract and the empirical studies sections to replace the phrasing 'empirical studies demonstrate the effectiveness' with 'case studies illustrate the applicability and utility' and will add explicit language clarifying the illustrative nature of the examples. revision: yes

  2. Referee: [Framework Description] Sections on agent modes and conversation patterns: while the framework is described as supporting flexible definition of behaviors, the manuscript provides no formal characterization (e.g., termination guarantees, consistency properties, or complexity bounds) of the conversation patterns, leaving open whether the claimed generality holds for arbitrary LLM capacities.

    Authors: The framework's generality stems from its support for defining interaction behaviors in natural language or code, allowing customization for different LLM capacities as shown in the examples. Because agent behaviors depend on the stochastic outputs of underlying LLMs, providing general formal guarantees such as termination or consistency bounds is not feasible without strong assumptions that do not hold across arbitrary models. We will add a new subsection discussing practical mechanisms already present in the framework (e.g., configurable termination conditions) and limitations arising from LLM variability, but a full theoretical analysis lies outside the scope of this systems-oriented paper. revision: partial
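
The rebuttal's point about configurable termination conditions can be made concrete with a small sketch: even when replies are unpredictable, a hard cap on turns bounds the conversation, while a keyword rule allows an early stop. All names here (`max_turns`, `contains_keyword`, `any_of`, `converse`) and the `TERMINATE` keyword are assumptions chosen for illustration, not the framework's actual configuration options.

```python
import itertools
from typing import Callable, Iterable, List

# A stop rule inspects the message history and decides whether to end the chat.
StopRule = Callable[[List[str]], bool]


def max_turns(limit: int) -> StopRule:
    # Hard cap: holds regardless of what the messages contain.
    return lambda history: len(history) >= limit


def contains_keyword(keyword: str) -> StopRule:
    # Soft rule: depends on a reply actually emitting the keyword.
    return lambda history: bool(history) and keyword in history[-1]


def any_of(*rules: StopRule) -> StopRule:
    return lambda history: any(rule(history) for rule in rules)


def converse(replies: Iterable[str], should_stop: StopRule) -> List[str]:
    """Consume replies one at a time until the stop rule fires."""
    history: List[str] = []
    for message in replies:
        history.append(message)
        if should_stop(history):
            break
    return history


stop = any_of(max_turns(5), contains_keyword("TERMINATE"))

# A reply stream that never emits the keyword is still cut off by the cap.
capped = converse(itertools.repeat("still working"), stop)

# A stream that emits the keyword stops early.
early = converse(iter(["working", "done TERMINATE", "unreachable"]), stop)

print(len(capped), early)
```

The cap gives a termination bound that does not depend on model behaviour. It says nothing about whether the task was completed when the cap fired, which is the limitation the authors attribute to LLM variability.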

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper is a framework description for AutoGen, an open-source system enabling multi-agent LLM conversations, with no derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. Claims rest on customizable agent behaviors and example applications across domains, presented as empirical demonstrations rather than derived results. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The structure is self-contained as a software contribution with no internal reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that LLMs can function as effective conversable agents when augmented with tools and interaction patterns; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Large language models can serve as effective conversable agents when combined with human inputs and tools in multi-agent settings.
    This premise is required for the framework to deliver on its stated purpose of enabling task accomplishment through conversation.

pith-pipeline@v0.9.0 · 5691 in / 1058 out tokens · 37880 ms · 2026-05-24T07:43:53.022732+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

    cs.AI 2026-05 unverdicted novelty 8.0

    Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experimen...

  2. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  3. Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

    quant-ph 2025-10 accept novelty 8.0 full

    A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...

  4. ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

    cs.CR 2025-07 unverdicted novelty 8.0

    ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

  5. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  6. Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

    cs.MA 2024-10 unverdicted novelty 8.0

    Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

  7. AgentReview: Exploring Peer Review Dynamics with LLM Agents

    cs.CL 2024-06 unverdicted novelty 8.0

    AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.

  8. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    cs.CL 2026-05 unverdicted novelty 7.0

    Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...

  9. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    cs.CL 2026-05 unverdicted novelty 7.0

    Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

  10. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

  11. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

  12. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...

  13. S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

    cs.LG 2026-05 unverdicted novelty 7.0 partial

    S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.

  14. S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

    cs.LG 2026-05 unverdicted novelty 7.0 partial

    S-Bus uses a DeliveryLog to reconstruct read sets from HTTP traffic and enforce Observable-Read Isolation, preventing structural race conditions in multi-agent LLM coordination.

  15. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  16. Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries

    cs.CR 2026-05 unverdicted novelty 7.0

    Identifies concrete attacks from a malicious Provider on SAGA and proposes SAGA-BFT, SAGA-MON, SAGA-AUD, and SAGA-HYB mitigations offering different security-performance trade-offs.

  17. SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on S...

  18. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 7.0

    Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...

  19. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  20. TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

    cs.CY 2026-05 unverdicted novelty 7.0

    TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.

  21. TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

  22. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  23. TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

    cs.AI 2026-05 conditional novelty 7.0

    TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.

  24. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...

  25. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...

  26. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  27. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  28. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  29. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI 2026-05 unverdicted novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  30. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  31. QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing

    cs.CR 2026-05 unverdicted novelty 7.0

    A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.

  32. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  33. Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

    cs.AI 2026-04 unverdicted novelty 7.0

    Current AI agents achieve only 26% success on SciCrafter's redstone tasks requiring causal discovery and application, indicating the discovery-to-application loop remains challenging with shifting bottlenecks.

  34. Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

    cs.DC 2026-04 unverdicted novelty 7.0

    Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...

  35. PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

    cs.RO 2026-04 unverdicted novelty 7.0

    PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.

  36. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  37. Dr.Sai: An agentic AI for real-world physics analysis at BESIII

    hep-ex 2026-04 unverdicted novelty 7.0

    Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.

  38. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  39. ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

    cs.MA 2026-04 unverdicted novelty 7.0

    ClawCoin is a compute-cost-indexed token with oracle, vault, and settlement layers that stabilizes multi-agent workflows under cost shocks better than fiat baselines in simulator tests.

  40. Provable Coordination for LLM Agents via Message Sequence Charts

    cs.PL 2026-04 unverdicted novelty 7.0

    A message sequence chart language for LLM agents enables provable deadlock-free coordination by projecting global specifications to local programs independent of LLM nondeterminism.

  41. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  42. Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

    cs.AI 2026-04 unverdicted novelty 7.0

    Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

  43. Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities

    cs.CR 2026-04 conditional novelty 7.0

    LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained...

  44. The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.

  45. SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SemiFA is a four-agent LangGraph pipeline that combines DINOv2 and LLaVA image analysis with SECS/GEM telemetry and vector retrieval to produce complete FA reports in 48 seconds.

  46. MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration

    cs.MA 2026-04 accept novelty 7.0

    MPAC defines a multi-principal agent coordination protocol across Session, Intent, Operation, Conflict, and Governance layers, with 21 message types and state machines, delivering 95% lower coordination overhead in a ...

  47. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  48. Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

    cs.MA 2026-04 unverdicted novelty 7.0

    Multi-agent LLM simulations with trait-conditioned agents and a reinforcement-learning orchestrator show heterogeneous teams and dynamic trait selection outperform static configurations in simulated legal argumentation.

  49. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  50. Architecture Without Architects: How AI Coding Agents Shape Software Architecture

    cs.SE 2026-04 unverdicted novelty 7.0

    AI coding agents perform vibe architecting by making prompt-driven architectural choices that produce structurally different systems for identical tasks.

  51. Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

    cs.DC 2026-03 unverdicted novelty 7.0

    This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.

  52. What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

    cs.CL 2026-03 unverdicted novelty 7.0

    Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.

  53. Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems

    cs.MA 2026-02 unverdicted novelty 7.0

    Agentic Hives apply dynamic general equilibrium theory to variable populations of language-model agents, proving existence of equilibria, Pareto optimality, multiplicity, comparative-statics analogs, Hopf bifurcations...

  54. Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

    cs.SE 2026-02 conditional novelty 7.0

    SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

  55. Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

    cs.HC 2026-01 unverdicted novelty 7.0

    Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

  56. Emergent Coordination in Multi-Agent Language Models

    cs.MA 2025-10 unverdicted novelty 7.0

    Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information deco...

  57. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

    cs.SE 2025-09 conditional novelty 7.0

    Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.

  58. AlphaEvolve: A coding agent for scientific and algorithmic discovery

    cs.AI 2025-06 unverdicted novelty 7.0

    AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...

  59. From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

    cs.MA 2025-06 accept novelty 7.0

    A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

  60. SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

    cs.CV 2025-05 conditional novelty 7.0

    Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 211 Pith papers

  1. [1]

    Self-collaboration code generation via chatgpt

    Association for Computational Linguistics. Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. arXiv preprint arXiv:2304.07590, 2023. Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2...

  2. [2]

    For example, AssistantAgent is pre-configured to be backed by GPT-4, with a carefully designed system message for generic problem-solving via code

    Consider using built-in agents first. For example, AssistantAgent is pre-configured to be backed by GPT-4, with a carefully designed system message for generic problem-solving via code. The UserProxyAgent is configured to solicit human inputs and perform tool execution. Many problems can be solved by simply combining these two agents. When customizing age...

  3. [3]

    Consider using the two-agent chat or the group chat setup first, as they can often be extended with the least code

    Start with a simple conversation topology. Consider using the two-agent chat or the group chat setup first, as they can often be extended with the least code. Note that the two-agent chat can be easily extended to involve more than two agents by using LLM-consumable functions in a dynamic way

  4. [4]

    A5 in Section 3)

    Try to reuse built-in reply methods based on LLM, tool, or human before implementing a custom reply method because they can often be reused to achieve the goal in a simple way (e.g., the built-in agent GroupChatManager’s reply method reuses the built-in LLM-based reply function when selecting the next speaker, ref. A5 in Section 3)

  5. [5]

    This helps evaluate the effectiveness of AssistantAgent, tuning the prompt, discovering corner cases, and debugging

    When developing a new application with UserProxyAgent, start with humans always in the loop, i.e., human_input_mode=‘ALWAYS’, even if the target operation mode is more autonomous. This helps evaluate the effectiveness of AssistantAgent, tuning the prompt, discovering corner cases, and debugging. Once confident with small-scale success, consider sett...

  6. [6]

    TERMINATE

    Despite the numerous advantages of AutoGen agents, there could be cases/scenarios where other libraries/packages could help. For example: (1) For (sub)tasks that do not have requirements for back-and-forth trouble-shooting, multi-agent interaction, etc., a unidirectional (no back-and-forth message exchange) pipeline can also be orchestrated with LangChain...

  7. [7]

    Enter your answer in the form Ax + By + Cz + D = 0, where A, B, C, D are integers such that A > 0 and gcd(|A|, |B|, |C|, |D|) = 1

    Input the problem: Find the equation of the plane which bisects the angle between the planes 3x − 6y + 2z + 5 = 0 and 4x − 12y + 3z − 3 = 0 , and which contains the point (−5, −1, −5). Enter your answer in the form Ax + By + Cz + D = 0, where A, B, C, D are integers such that A > 0 and gcd(|A|, |B|, |C|, |D|) = 1

  8. [8]

    We then give a hint to the model: Your idea is not correct

    The response from the system does not solve the problem correctly. We then give a hint to the model: Your idea is not correct. Let’s solve this together. Suppose P = ( x, y, z) is a point that lies on a plane that bisects the angle, the distance from P to the two planes is the same. Please set up this equation first

  9. [9]

    Since the equation involves an absolute sign that is hard to solve, we would give the next hint: Consider the two cases to remove the abs sign and get two possible solutions

    We expect the system to give the correct distance equation. Since the equation involves an absolute sign that is hard to solve, we would give the next hint: Consider the two cases to remove the abs sign and get two possible solutions

  10. [10]

    If the system returns the two possible solutions and doesn’t continue to the next step, we give the last hint: Use point (-5,-1,-5) to determine which is correct and give the final answer

  11. [11]

    We observed that AutoGen consistently solved the problem across all three trials

    The final answer is 11x + 6y + 5z + 86 = 0. We observed that AutoGen consistently solved the problem across all three trials. ChatGPT+Code Interpreter and ChatGPT+Plugin managed to solve the problem in two out of three trials, while AutoGPT failed to solve it in all three attempts. In its unsuccessful attempt, ChatGPT+Code Interpreter failed to adhere to human hin...
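
    The hint sequence in entries 8 to 10 (equal distances, two sign cases, pick the case containing the point) can be checked with a few lines of exact integer arithmetic. The sketch below is an illustration only; the helper names `normal_length`, `signed_distance`, and `bisector_through` are hypothetical and are not part of AutoGen or the paper.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, isqrt

# Planes as integer tuples (A, B, C, D) meaning Ax + By + Cz + D = 0.
P1 = (3, -6, 2, 5)
P2 = (4, -12, 3, -3)
POINT = (-5, -1, -5)


def normal_length(plane):
    """Length of the normal vector; assumed to be an integer (7 and 13 here)."""
    a, b, c, _ = plane
    squared = a * a + b * b + c * c
    root = isqrt(squared)
    if root * root != squared:
        raise ValueError("normal length is not an integer")
    return root


def signed_distance(plane, point):
    """Signed distance from the point to the plane, as an exact fraction."""
    a, b, c, d = plane
    x, y, z = point
    return Fraction(a * x + b * y + c * z + d, normal_length(plane))


def bisector_through(p1, p2, point):
    """Return the angle-bisecting plane of p1 and p2 that contains the point."""
    n1, n2 = normal_length(p1), normal_length(p2)
    x, y, z = point
    # Removing the absolute value leaves two cases: d1 = +d2 or d1 = -d2.
    for sign in (1, -1):
        # Clearing denominators of p1/n1 = sign * p2/n2.
        coeffs = [n2 * u - sign * n1 * v for u, v in zip(p1, p2)]
        a, b, c, d = coeffs
        # Keep the case whose plane contains the given point.
        if a * x + b * y + c * z + d == 0:
            g = reduce(gcd, (abs(k) for k in coeffs))
            coeffs = [k // g for k in coeffs]
            if coeffs[0] < 0:
                coeffs = [-k for k in coeffs]
            return tuple(coeffs)
    raise ValueError("point lies on neither bisector")


print(bisector_through(P1, P2, POINT))
```

    The point has the same signed distance to both planes, so the first case applies, and the result agrees with the final answer quoted in entry 11.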

  12. [12]

    Question and Contexts

  13. [13]

    Satisfied Answers or Terminate

    Figure 7 labels: Terminate, feedback, or `Update Context`; 4. Satisfied Answers or Terminate

  14. [14]

    Given a set of documents, the Retrieval-augmented User Proxy first automatically processes documents—splits, chunks, and stores them in a vector database

    Figure 7: Overview of Retrieval-augmented Chat, which involves two agents: a Retrieval-augmented User Proxy and a Retrieval-augmented Assistant. Given a set of documents, the Retrieval-augmented User Proxy first automatically processes documents—splits, chunks, and stores them in a vector database. Then for ...

  15. [15]

    The Retrieval-Augmented User Proxy retrieves document chunks based on the embedding similarity, and sends them along with the question to the Retrieval-Augmented Assistant

  16. [16]

    Update Context

    The Retrieval-Augmented Assistant employs an LLM to generate code or text as answers based on the question and context provided. If the LLM is unable to produce a satisfactory response, it is instructed to reply with “Update Context” to the Retrieval-Augmented User Proxy

  17. [17]

    If there are no code blocks or instructions to update the context, it terminates the conversation

    If a response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback. If there are no code blocks or instructions to update the context, it terminates the conversation. Otherwise, it updates the context and forwards the question along with the new context to the Retrieval-Augmented Assistant. Note that ...

  18. [18]

    Update Context

    If the Retrieval-Augmented Assistant receives “Update Context”, it requests the next most similar chunks of documents as new context from the Retrieval-Augmented User Proxy. Otherwise, it generates new code or text based on the feedback and chat history. If the LLM fails to generate an answer, it replies with “Update Context” again. This process can be re...
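
    The retrieve / answer / “Update Context” loop described in entries 14 to 18 can be sketched in plain Python. This is a toy model under stated assumptions, not AutoGen's implementation: word overlap stands in for embedding similarity, `toy_assistant` stands in for the LLM-backed assistant, the code-execution branch is omitted, and every name below is hypothetical.

```python
UPDATE_CONTEXT = "UPDATE CONTEXT"


def tokenize(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def rank_chunks(question, chunks):
    """Order chunks from most to least similar to the question."""
    q = tokenize(question)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)


def toy_assistant(question, context):
    """Answer only when the context holds the needed fact (assumed rule)."""
    for chunk in context:
        if "hours" in chunk.lower():
            return f"Answer based on: {chunk}"
    return UPDATE_CONTEXT


def retrieval_chat(question, chunks, assistant, chunks_per_round=1):
    """User-proxy side: send context, and send the next chunks on request."""
    ranked = rank_chunks(question, chunks)
    rounds = 0
    for start in range(0, len(ranked), chunks_per_round):
        rounds += 1
        context = ranked[start:start + chunks_per_round]
        reply = assistant(question, context)
        if reply != UPDATE_CONTEXT:
            return reply, rounds  # satisfied answer: terminate
    return None, rounds  # ran out of chunks without an answer


CHUNKS = [
    "Depot opening hours are 9 am to 5 pm.",
    "The depot does not sell tickets.",
    "When in doubt, ask the staff.",
]
reply, rounds = retrieval_chat("When does the depot open?", CHUNKS, toy_assistant)
print(rounds, reply)
```

    In this toy data the most similar chunk by word overlap is not the one holding the answer, so the assistant asks for new context twice before it can reply.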

  19. [19]

    What if we prohibit shipping from supplier 1 to roastery 2?

    is an open-source Python library designed for efficient AutoML and tuning. It was open-sourced in December 2020, and is included in the training data of GPT-4. However, the question necessitates the use of Spark-related APIs, which were added in December 2022 and are not encompassed in the GPT-4 training data. Consequently, the original GPT-4 model is ...

  20. [20]

    Broadcast (figure labels: Alice, Bob, User Proxy)

  21. [21]

    How much money would I earn if I bought 200 $AAPL stocks at the lowest price in the last 30 days and sold them at the highest price? Save the results into a file

    Figure labels: Select a Speaker (Alice, Bob, User Proxy; Bob selected); 2. Ask the Speaker to Respond (Manager, Response). Figure 12: A5: Dynamic Group Chat: Overview of how AutoGen enables dynamic group chats to solve tasks. The Manager agent, which is an instance of the GroupChatManager class, performs the following three steps–select a single speaker (in this case Bob), ask the sp...
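
    The manager steps named in the Figure 12 labels (select a speaker, ask the speaker to respond, broadcast) can be sketched without any LLM. This is a minimal illustration, not the GroupChatManager implementation: `ToyAgent`, `ToyManager`, and `round_robin` are hypothetical names, and round-robin selection is an assumed placeholder for the LLM-based speaker selection mentioned in entry 4.

```python
class ToyAgent:
    """Agent with a fixed reply rule and a record of messages it has seen."""

    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn
        self.inbox = []

    def receive(self, sender, text):
        self.inbox.append((sender, text))

    def reply(self):
        return self.reply_fn(self.inbox)


class ToyManager:
    """Runs the select / respond / broadcast loop over a list of agents."""

    def __init__(self, agents, select_speaker):
        self.agents = agents
        self.select_speaker = select_speaker
        self.history = []

    def run(self, task, max_rounds):
        self.history.append(("User", task))
        for agent in self.agents:
            agent.receive("User", task)
        for _ in range(max_rounds):
            # Step 1: select a single speaker.
            speaker = self.select_speaker(self.agents, self.history)
            # Step 2: ask the speaker to respond.
            text = speaker.reply()
            self.history.append((speaker.name, text))
            # Step 3: broadcast the response to every other agent.
            for agent in self.agents:
                if agent is not speaker:
                    agent.receive(speaker.name, text)
            if "TERMINATE" in text:
                break
        return self.history


def round_robin(agents, history):
    """Placeholder selection policy: take turns in list order."""
    return agents[(len(history) - 1) % len(agents)]


alice = ToyAgent("Alice", lambda inbox: f"Alice saw {len(inbox)} message(s)")
bob = ToyAgent("Bob", lambda inbox: "Done. TERMINATE")
history = ToyManager([alice, bob], round_robin).run("Plot a chart", max_rounds=5)
for sender, text in history:
    print(f"{sender}: {text}")
```

    Swapping `round_robin` for a function that inspects `history` is where a learned or LLM-driven selection policy would plug in.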

  22. [22]

    What if the roasting cost is increased by 5% because of the potential salary increase?

    The negative side shows a better understanding of the simplification process. Table 13: Application A3. ChatGPT + Code Interpreter for OptiGuide. A sample question “What if the roasting cost is increased by 5% because of the potential salary increase?” is asked. Action ChatGPT + Code Interpreter /usr Prompt Writer Customer open Web browser. For the source...

  23. [23]

    Math Solver

    Simplify and rationalize the denominator for the expression $\frac{\sqrt{225}}{\sqrt{45}} \times \frac{\sqrt{200}}{\sqrt{125}}$ 2. Simplify and rationalize the denominator for the expression $\frac{\sqrt{289}}{\sqrt{361}} \times \frac{\sqrt{100}}{\sqrt{72}}$ ...Until 10 Adding new tasks to task storage ‘task name’: ‘Simplify and rationalize the denominator for the expression \frac{\sqrt{225}}{\sqrt{45}}\times\frac{\sqrt{200}}{\sqrt{125}}’, ‘taskid’: 2 ‘task name’:...

  24. [25]

    //button[@id=‘subbtn2’]

    Click the button with xpath “//button[@id=‘subbtn2’]”. Current task: Click button ONE, then click button TWO. plan: *************************************************************** AssistantAgent to Executor agent:

  25. [27]

    //button[@id=‘subbtn2’]

    Click the button with xpath “//button[@id=‘subbtn2’]”. *************************************************************** Executor agent to AssistantAgent: Below is the HTML code of the webpage where the agent should solve a task. <div id="wrap" data-wob_ref="2" data-wob_eps="e0"> <div id="query">Click button ONE, then click but...

  26. [29]

    //button[@id=‘subbtn2’]

    Click the button with xpath “//button[@id=‘subbtn2’]”. We have a history of instructions that have been already executed by the autonomous agent so far. No instruction has been executed yet. Based on the plan and the history of instructions executed so far, the first instruction should be ‘ *************************************************************** A...

  27. [30]

    //button[@id=‘subbtn’]

    Click the button with xpath “//button[@id=‘subbtn’]”

  28. [31]

    //button[@id=‘subbtn2’]

    Click the button with xpath “//button[@id=‘subbtn2’]”. We have a history of instructions that have been already executed by the autonomous agent so far. 1: clickxpath //button[@id=‘subbtn’] Based on the plan and the history of instructions executed so far, the next proper instruction should be ‘ ************************************************************...