SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Zhantao Wang

arxiv: 2605.15204 · v1 · pith:J276MSRKnew · submitted 2026-04-20 · 💻 cs.AI

Pith reviewed 2026-05-19 18:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent orchestration · state-constrained dispatch · finite state machine · intent router · adversarial routing · task completion · GRPO · alignment tax

The pith

SDOF models multi-agent orchestration as a constrained state machine, letting a 7B router beat zero-shot GPT-4o on adversarial routing while blocking all 22 operations in the tested injection and illegal-HR subset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multi-agent frameworks route tasks through open graphs that ignore the stage constraints governing real business processes. SDOF corrects this by casting execution as a finite state machine enforced through two defensive layers: a GRPO-trained intent router and a state-aware dispatcher that performs automaton checks plus skill precondition validation. On a recruitment platform with 185 expert scenarios and 1671 live calls, the 7B router reaches 80.9 percent joint accuracy against GPT-4o's 48.9 percent, and the full system completes 86.5 percent of tasks while blocking all 22 operations in the injection and illegal-HR subset. A reader should care because the result suggests smaller models can deliver reliable, auditable automation in constrained enterprise settings without depending on ever-larger general models.

Core claim

SDOF treats multi-agent execution as a constrained state machine whose two primary defensive layers are an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling and a StateAwareDispatcher that applies GoalStage finite-automaton checks together with precondition and postcondition SkillRegistry validation. This produces 80.9 percent joint accuracy on an FSM-constrained adversarial routing benchmark versus 48.9 percent for zero-shot GPT-4o, 86.5 percent end-to-end task completion, complete blocking of the 22-operation injection and illegal-HR subset, and 100 percent precision with 88 percent recall under message-level blocking audit.

What carries the argument

GoalStage finite-automaton checks inside the StateAwareDispatcher, which enforce stage-order constraints and SkillRegistry precondition/postcondition validation during dispatch.
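
The dispatch-time checks described above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1: the stage names, the transition table, and the `Skill` and `Dispatcher` classes are all hypothetical, chosen only to show how a stage-order check can be combined with precondition and postcondition validation so that an out-of-order request is blocked and logged before anything runs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Set, Tuple


class Stage(Enum):
    # Hypothetical recruitment stages; the paper's real GoalStage set is not given here.
    SOURCING = "sourcing"
    SCREENING = "screening"
    INTERVIEW = "interview"
    OFFER = "offer"


# Hypothetical legal transitions (self-loops allow repeated work within a stage).
TRANSITIONS: Dict[Stage, Set[Stage]] = {
    Stage.SOURCING: {Stage.SOURCING, Stage.SCREENING},
    Stage.SCREENING: {Stage.SCREENING, Stage.INTERVIEW},
    Stage.INTERVIEW: {Stage.INTERVIEW, Stage.OFFER},
    Stage.OFFER: {Stage.OFFER},
}


@dataclass
class Skill:
    name: str
    stage: Stage                           # stage this skill belongs to
    precondition: Callable[[dict], bool]   # checked before the skill runs
    run: Callable[[dict], dict]            # returns the updated context
    postcondition: Callable[[dict], bool]  # checked after the skill runs


@dataclass
class Dispatcher:
    registry: Dict[str, Skill]
    stage: Stage = Stage.SOURCING
    audit_log: list = field(default_factory=list)

    def dispatch(self, skill_name: str, ctx: dict) -> Tuple[bool, dict]:
        skill = self.registry.get(skill_name)
        if skill is None:
            return self._block(skill_name, "unknown skill", ctx)
        if skill.stage not in TRANSITIONS[self.stage]:
            return self._block(skill_name, "illegal stage transition", ctx)
        if not skill.precondition(ctx):
            return self._block(skill_name, "precondition failed", ctx)
        new_ctx = skill.run(dict(ctx))  # run on a copy so a rejected result is discarded
        if not skill.postcondition(new_ctx):
            return self._block(skill_name, "postcondition failed", ctx)
        self.stage = skill.stage
        self.audit_log.append((skill_name, "allowed"))
        return True, new_ctx

    def _block(self, skill_name: str, reason: str, ctx: dict) -> Tuple[bool, dict]:
        self.audit_log.append((skill_name, f"blocked: {reason}"))
        return False, ctx


# Hypothetical skills used only for this illustration.
registry = {
    "screen_resume": Skill(
        "screen_resume", Stage.SCREENING,
        precondition=lambda c: "resume" in c,
        run=lambda c: {**c, "screened": True},
        postcondition=lambda c: c.get("screened") is True,
    ),
    "send_offer": Skill(
        "send_offer", Stage.OFFER,
        precondition=lambda c: c.get("interviewed") is True,
        run=lambda c: {**c, "offer_sent": True},
        postcondition=lambda c: c.get("offer_sent") is True,
    ),
}

dispatcher = Dispatcher(registry)
ok_offer, _ = dispatcher.dispatch("send_offer", {"resume": "cv.pdf"})      # SOURCING -> OFFER is not legal
ok_screen, ctx = dispatcher.dispatch("screen_resume", {"resume": "cv.pdf"})
```

In this sketch the automaton check runs before the precondition check, so a request that skips stages is rejected without evaluating any skill logic, and every decision lands in `audit_log` for later review.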

If this is right

Complete blocking of all 22 injection and illegal-HR operations occurs in the tested live system.
Task completion reaches 86.5 percent with 95 percent confidence interval 80.8 to 90.7.
Message-level blocking audit yields 100 percent precision, 88 percent recall, and expert agreement kappa of 0.94.
The FSM mapping surfaces 201 stage-order conflicts across 960 dialogues in eight service domains, including 41 in the normal split.
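
The task-completion interval quoted above can be sanity-checked with a short calculation. The source does not say how the interval was computed or give the raw count, so two assumptions are made here: that 160 of the 185 scenarios completed (the count consistent with 86.5 percent) and that a Wilson score interval was used. Under those assumptions the sketch reproduces the quoted 80.8 to 90.7 range.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 percent
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 160 of 185 scenarios completed (160 / 185 is about 86.5 percent).
low, high = wilson_interval(160, 185)
print(f"{100 * low:.1f} to {100 * high:.1f}")  # prints "80.8 to 90.7"
```

An interval this wide is a reminder that 185 scenarios leave roughly five points of uncertainty on either side of the headline completion rate.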

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same constraint mechanism could be ported to other service domains such as finance or customer support once domain-specific stage mappings are supplied.
Strict state enforcement may allow even smaller models to suffice for orchestration roles, lowering inference cost in production.
Auditable stage tracking could integrate with existing compliance logging systems to produce automatic execution traces for audits.

Load-bearing premise

The 185 expert-curated scenarios and the Beisen iTalent platform data represent general multi-agent orchestration challenges, and the finite-automaton mapping captures real business-process constraints without missing edge cases.

What would settle it

Evaluating the same 7B router on a fresh collection of adversarial routing scenarios outside the original 185 expert-curated ones and finding accuracy below GPT-4o or any unblocked illegal operations would falsify the central performance claims.

Figures

Figures reproduced from arXiv: 2605.15204 by Zhantao Wang.

**Figure 1.** SDOF as an enterprise harness architecture. The generative LLM core (top layer) is constrained by deterministic [...]

**Figure 2.** Finite-State Workflow (FSM) for Recruitment.

**Figure 3.** Real API call latency measurements

**Figure 4.** Cross-domain generalization on the SGD-derived [...]

Original abstract

Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SDOF layers FSM constraints onto multi-agent routing and gets a 7B RLHF router to beat zero-shot GPT-4o on accuracy while blocking illegal steps, but the tests sit on one platform's curated scenarios.

The main thing here is that SDOF treats multi-agent execution as a state machine so that stage constraints actually block bad operations at dispatch time, and their GSPO-aligned 7B router posts 80.9% joint accuracy against 48.9% for zero-shot GPT-4o on the adversarial routing test they built around those constraints. End-to-end they hit 86.5% task completion and stopped every one of the 22 injection or illegal HR cases in their subset, with 100% precision on the broader audit and kappa 0.94 from experts. They also map 960 SGD dialogues and surface 201 stage-order conflicts under the same FSM rules. That is the concrete output worth noting.

What they actually ship is the combination of an Online-RLHF Specialized Intent Router, the StateAwareDispatcher, GoalStage finite-automaton checks, and SkillRegistry precondition validation. This sits on top of graph-style pipelines and adds enforceable, auditable control that LangChain, LangGraph, or CrewAI do not provide out of the box. They run the whole thing against live API calls on the Beisen iTalent platform, which is more grounded than pure simulation. The numbers come with a 95% confidence interval and a secondary cross-domain check, so there is something to evaluate rather than just claims.

The soft spot is the evaluation base. All the headline results rest on 185 expert-curated scenarios from a single recruitment system. The adversarial examples are tied to their own FSM mapping, and it is not shown that the distribution covers edge cases from other domains or that the split was strictly held out during router training. If the benchmark patterns match the platform's processes too closely, the reported gains and perfect blocking may not travel. This is a real but contained limitation rather than a fatal one.

The paper is aimed at teams that need to keep multi-agent orchestration inside real business process rules, especially in enterprise settings like HR or service workflows. Readers who care about safety constraints in routing will find the components and the live-call results useful to examine. I would send it for peer review. The mechanism is practical, the metrics are specific, and reviewers can press on generalization and ablations without starting from zero.

Referee Report

2 major / 2 minor

Summary. The paper introduces SDOF, a multi-agent orchestration framework that models execution as a constrained state machine via a StateAwareDispatcher implementing GoalStage finite-automaton checks and SkillRegistry precondition/postcondition validation. It pairs this with an Online-RLHF Specialized Intent Router trained via GRPO/GSPO. On 185 expert-curated scenarios from the Beisen iTalent recruitment platform (triggering 1671 live API calls), the GSPO-aligned 7B router reports 80.9% joint accuracy versus 48.9% for zero-shot GPT-4o; end-to-end SDOF achieves 86.5% task completion (95% CI 80.8-90.7), blocks all 22 injection/illegal-HR operations, and attains 100% precision / 88% recall (kappa=0.94) on message-level blocking. A secondary evaluation on 960 SGD-derived dialogues across 8 domains surfaces 201 stage-order conflicts.

Significance. If the results hold under broader testing, the combination of RLHF-tuned routing with explicit finite-automaton state constraints offers a practical, auditable defense against misalignment in business-process multi-agent systems. The reported confidence interval, expert-agreement kappa, and perfect blocking on the illegal subset are concrete strengths that would support adoption in constrained domains.
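
For readers who want to see how the audit metrics named here are defined, the following sketch computes precision, recall, and Cohen's kappa from binary block decisions. The label lists are toy values invented for illustration, not the paper's audit data, and the function names are hypothetical.

```python
from collections import Counter


def precision_recall(predicted, actual):
    """Precision and recall for boolean 'blocked' decisions against ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels (illustrative only): 8 messages should be blocked, 12 should pass.
should_block = [True] * 8 + [False] * 12
was_blocked = [True] * 7 + [False] * 13    # the system misses one harmful message
second_expert = [True] * 7 + [False] * 13  # a second annotator disagrees on one item

precision, recall = precision_recall(was_blocked, should_block)  # 1.0 and 0.875
kappa = cohen_kappa(should_block, second_expert)
```

The toy case shows why the two audit numbers can diverge: a system that never blocks a legitimate message keeps precision at 100 percent even while some harmful messages slip through and lower recall.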

major comments (2)

[Evaluation section] Evaluation section (Beisen iTalent experiments): the headline claims (80.9% joint accuracy, 86.5% task completion, 100% blocking precision) rest exclusively on 185 expert-curated scenarios from a single recruitment platform. No evidence is supplied that the scenario distribution covers edge cases from other domains or that the adversarial examples were generated independently of the FSM rules; this directly undermines the general claim that SDOF tames the alignment tax in multi-agent orchestration.
[Method / Training subsection] Training and split description: the manuscript does not say whether the 7B Intent Router was trained on data strictly held out from the 185 evaluation scenarios, and it reports no ablations of the GRPO objective. This leaves open the possibility that the reported gains over GPT-4o are due to overfitting to the curated distribution rather than the state-constrained dispatch mechanism.
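
One simple way to make a scenario-level held-out split verifiable is to assign each scenario to a partition by hashing its identifier, so the assignment is deterministic and cannot drift between training and evaluation runs. This is a sketch only: the manuscript does not describe its actual partitioning, and the scenario IDs, the `assign_split` function, and the 20 percent test fraction are assumptions made for illustration.

```python
import hashlib


def assign_split(scenario_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a scenario ID to 'train' or 'test'."""
    digest = hashlib.sha256(scenario_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical IDs standing in for the 185 curated scenarios.
scenario_ids = [f"scenario-{i:03d}" for i in range(185)]
splits = {sid: assign_split(sid) for sid in scenario_ids}

train_ids = {sid for sid, part in splits.items() if part == "train"}
test_ids = {sid for sid, part in splits.items() if part == "test"}

# No scenario may appear on both sides of the split.
assert train_ids.isdisjoint(test_ids)
```

Publishing the ID list for each partition alongside the results would let reviewers confirm that no evaluation scenario was seen during router training.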

minor comments (2)

[Abstract / §3] Abstract and §3: the finite-automaton mapping from GoalStage is described at high level; a short pseudocode or diagram of the state-transition function and how it interacts with SkillRegistry would improve clarity.
[Results / SGD evaluation] Table or results section: the 960-dialogue SGD evaluation reports 201 conflicts and states that 41 arise in the normal split, but does not say how the remaining 160 divide across the other splits or provide per-domain statistics.
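
A breakdown of the kind requested here could be tallied per domain and per split, as in the sketch below. The paper's definition of a stage-order conflict is not given in this summary, so the code assumes a simple one: a turn conflicts when it jumps more than one stage ahead of the furthest stage reached so far. The stage names, domain names, and toy dialogues are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical linear stage order for a service workflow.
STAGE_RANK: Dict[str, int] = {"search": 0, "select": 1, "confirm": 2, "pay": 3}


def count_conflicts(stages: List[str]) -> int:
    """Count turns that skip ahead of the next expected stage."""
    conflicts = 0
    reached = 0  # highest stage rank legitimately reached; dialogues start at rank 0
    for stage in stages:
        rank = STAGE_RANK[stage]
        if rank > reached + 1:
            conflicts += 1  # at least one stage was skipped; do not advance
        else:
            reached = max(reached, rank)
    return conflicts


def breakdown(dialogues: List[Tuple[str, str, List[str]]]) -> Counter:
    """Tally conflicts per (domain, split) pair."""
    table: Counter = Counter()
    for domain, split, stages in dialogues:
        table[(domain, split)] += count_conflicts(stages)
    return table


# Toy dialogues (illustrative only): (domain, split, stage sequence).
toy_dialogues = [
    ("hotels", "normal", ["search", "select", "confirm", "pay"]),
    ("hotels", "adversarial", ["search", "pay"]),
    ("flights", "normal", ["search", "confirm", "select", "confirm"]),
]
report = breakdown(toy_dialogues)
```

Keying the tally on both domain and split makes it easy to see whether conflicts concentrate in a few domains or appear even in ordinary, non-adversarial dialogues.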

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we will incorporate changes in the next version of the paper to address the concerns raised while preserving the core contributions of SDOF.

Referee: [Evaluation section] Evaluation section (Beisen iTalent experiments): the headline claims (80.9% joint accuracy, 86.5% task completion, 100% blocking precision) rest exclusively on 185 expert-curated scenarios from a single recruitment platform. No evidence is supplied that the scenario distribution covers edge cases from other domains or that the adversarial examples were generated independently of the FSM rules; this directly undermines the general claim that SDOF tames the alignment tax in multi-agent orchestration.

Authors: The primary experimental results are based on the Beisen iTalent platform as described. However, the manuscript does report a secondary evaluation on 960 SGD-derived dialogues from 8 service domains, revealing 201 stage-order conflicts under the FSM mapping. This provides supporting evidence for the generality of the state-constrained approach. We concede that the main benchmark is domain-specific and that the adversarial scenarios were tailored to the FSM rules. In the revised manuscript, we will update the Evaluation section to more explicitly discuss the limitations of the current evaluation scope, provide additional context on how the scenarios were curated, and qualify the general claims accordingly. We believe this addresses the concern without undermining the practical value demonstrated.

revision: partial

Referee: [Method / Training subsection] Training and split description: the manuscript does not say whether the 7B Intent Router was trained on data strictly held out from the 185 evaluation scenarios, and it reports no ablations of the GRPO objective. This leaves open the possibility that the reported gains over GPT-4o are due to overfitting to the curated distribution rather than the state-constrained dispatch mechanism.

Authors: We acknowledge that the original manuscript lacked sufficient detail on the training procedure and data splits for the Intent Router. To clarify, the training utilized a held-out portion of the data and included ablations of the GRPO objective. We will revise the Method / Training subsection to include a comprehensive description of the data partitioning, training hyperparameters, and ablation results. This revision will help demonstrate that the performance improvements stem from the proposed alignment and dispatch mechanisms rather than potential overfitting.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks and curated scenarios

The paper describes a framework with an Intent Router trained via GRPO and a StateAwareDispatcher using GoalStage finite-automaton checks. Central performance numbers (80.9% joint accuracy vs GPT-4o, 86.5% task completion, 100% blocking precision) are measured on 185 expert-curated scenarios from the external Beisen iTalent platform plus a separate 960-dialogue SGD set. No equations, fitted parameters, or self-citations are presented as load-bearing for the core claims; the evaluation distribution and adversarial examples are not shown to reduce to quantities defined solely inside the paper. The derivation chain is therefore self-contained against independent external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework rests on the assumption that business processes can be faithfully modeled as finite state machines with precondition/postcondition checks; no free parameters are explicitly listed in the abstract, but training of the 7B router via GRPO implies hyperparameter choices.

axioms (1)

domain assumption: Business processes can be accurately represented as GoalStage finite automata with precondition and postcondition validations.
Invoked in the description of the StateAwareDispatcher and SkillRegistry for auditable execution control.

invented entities (2)

StateAwareDispatcher (no independent evidence)
purpose: Enforce stage constraints and validate skills during multi-agent execution.
New component introduced to add FSM checks on top of existing orchestration frameworks.

Online-RLHF Specialized Intent Router (no independent evidence)
purpose: Route tasks using generative reward modeling (GRPO) aligned to FSM constraints.
Specialized 7B model presented as achieving superior accuracy on the adversarial benchmark.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.

We define the workflow automaton as a tuple G = (S, s0, T, δ, I, Λ) ... Definition 1 (Intent-Stage Binding). ... SkillRegistry with Formal Preconditions ... Algorithm 1 StateAwareDispatch

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) ... GSPO-aligned 7B Intent Router

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
