
arxiv: 2605.15204 · v1 · pith:J276MSRKnew · submitted 2026-04-20 · 💻 cs.AI

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Pith reviewed 2026-05-19 18:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent orchestration · state-constrained dispatch · finite state machine · intent router · adversarial routing · task completion · GRPO · alignment tax

The pith

SDOF models multi-agent orchestration as a constrained state machine, letting a 7B router beat zero-shot GPT-4o on adversarial routing while blocking all 22 tested injection and illegal-HR operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multi-agent frameworks route tasks through open graphs that ignore the stage constraints governing real business processes. SDOF corrects this by casting execution as a finite state machine enforced through two defensive layers: a GRPO-trained intent router and a state-aware dispatcher that performs automaton checks plus skill precondition validation. On a recruitment platform, 185 expert-curated scenarios trigger 1671 live API calls. The 7B router reaches 80.9 percent joint accuracy against 48.9 percent for zero-shot GPT-4o, and the full system completes 86.5 percent of tasks while stopping every one of 22 injection attempts. A reader should care because the result suggests smaller models can deliver reliable, auditable automation in constrained enterprise settings without depending on ever-larger general models.

Core claim

SDOF treats multi-agent execution as a constrained state machine with two primary defensive layers: an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling, and a StateAwareDispatcher that applies GoalStage finite-automaton checks together with precondition and postcondition SkillRegistry validation. The paper reports 80.9 percent joint accuracy on an FSM-constrained adversarial routing benchmark versus 48.9 percent for zero-shot GPT-4o, 86.5 percent end-to-end task completion, complete blocking of the 22-operation injection and illegal-HR subset, and 100 percent precision with 88 percent recall under a message-level blocking audit.

What carries the argument

GoalStage finite-automaton checks inside the StateAwareDispatcher, which enforce stage-order constraints and SkillRegistry precondition/postcondition validation during dispatch.
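
The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the stage names, the `TRANSITIONS` table, the `Skill` fields, and the `dispatch` signature are all hypothetical, and only the class names `GoalStage` and `StateAwareDispatcher` are borrowed from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple

Context = Dict[str, object]


class GoalStage(Enum):
    # Hypothetical stages; the paper's real stage set is not reproduced here.
    SOURCING = "sourcing"
    SCREENING = "screening"
    INTERVIEW = "interview"
    OFFER = "offer"


# Assumed legal transitions of the finite automaton.
TRANSITIONS: Dict[GoalStage, Set[GoalStage]] = {
    GoalStage.SOURCING: {GoalStage.SCREENING},
    GoalStage.SCREENING: {GoalStage.INTERVIEW},
    GoalStage.INTERVIEW: {GoalStage.OFFER},
    GoalStage.OFFER: set(),
}


@dataclass
class Skill:
    name: str
    target_stage: GoalStage
    precondition: Callable[[Context], bool]
    postcondition: Callable[[Context], bool]
    run: Callable[[Context], Context]


@dataclass
class StateAwareDispatcher:
    stage: GoalStage
    registry: Dict[str, Skill] = field(default_factory=dict)
    audit_log: List[Tuple[str, str]] = field(default_factory=list)

    def register(self, skill: Skill) -> None:
        self.registry[skill.name] = skill

    def dispatch(self, skill_name: str, ctx: Context) -> Tuple[bool, Context]:
        skill = self.registry.get(skill_name)
        if skill is None:
            return self._deny(skill_name, "unknown skill", ctx)
        # Automaton check: stay in the current stage or take one legal transition.
        legal = {self.stage} | TRANSITIONS[self.stage]
        if skill.target_stage not in legal:
            return self._deny(skill_name, "illegal stage transition", ctx)
        if not skill.precondition(ctx):
            return self._deny(skill_name, "precondition failed", ctx)
        new_ctx = skill.run(dict(ctx))  # run on a copy so a denial leaves ctx untouched
        if not skill.postcondition(new_ctx):
            return self._deny(skill_name, "postcondition failed", ctx)
        self.stage = skill.target_stage
        self.audit_log.append((skill_name, "allowed"))
        return True, new_ctx

    def _deny(self, skill_name: str, reason: str, ctx: Context) -> Tuple[bool, Context]:
        self.audit_log.append((skill_name, f"blocked: {reason}"))
        return False, ctx


dispatcher = StateAwareDispatcher(stage=GoalStage.SOURCING)
dispatcher.register(Skill(
    name="screen_resume",
    target_stage=GoalStage.SCREENING,
    precondition=lambda c: "resume" in c,
    postcondition=lambda c: "score" in c,
    run=lambda c: {**c, "score": 0.8},
))
dispatcher.register(Skill(
    name="send_offer",
    target_stage=GoalStage.OFFER,
    precondition=lambda c: True,
    postcondition=lambda c: True,
    run=lambda c: {**c, "offer_sent": True},
))

# An offer requested during sourcing skips two stages and is refused.
offer_ok, _ = dispatcher.dispatch("send_offer", {"resume": "cv.pdf"})
# Screening is one legal transition away and its precondition holds.
screen_ok, ctx = dispatcher.dispatch("screen_resume", {"resume": "cv.pdf"})
```

In this sketch every decision, allowed or blocked, is appended to `audit_log`, which is one plausible way to obtain the auditable trace the paper emphasizes.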

If this is right

  • Complete blocking of all 22 injection and illegal-HR operations occurs in the tested live system.
  • Task completion reaches 86.5 percent with 95 percent confidence interval 80.8 to 90.7.
  • Message-level blocking audit yields 100 percent precision, 88 percent recall, and expert agreement kappa of 0.94.
  • The FSM mapping surfaces 201 stage-order conflicts across 960 dialogues in eight service domains, including 41 in the normal split.
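
As a sanity check on the task-completion bullet, the reported interval can be reproduced with a Wilson score interval. Two assumptions are made here and neither is stated in the source: that 86.5 percent corresponds to 160 completed scenarios out of 185 (since 0.865 × 185 ≈ 160), and that the authors used the Wilson method. The sketch only shows that the numbers are consistent under those assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 160 of 185 scenarios completed (inferred, not given in the source).
low, high = wilson_interval(160, 185)
print(f"{160 / 185:.1%} [{low:.1%}, {high:.1%}]")  # prints: 86.5% [80.8%, 90.7%]
```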

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint mechanism could be ported to other service domains such as finance or customer support once domain-specific stage mappings are supplied.
  • Strict state enforcement may allow even smaller models to suffice for orchestration roles, lowering inference cost in production.
  • Auditable stage tracking could integrate with existing compliance logging systems to produce automatic execution traces for audits.

Load-bearing premise

The 185 expert-curated scenarios and the Beisen iTalent platform data represent general multi-agent orchestration challenges, and the finite-automaton mapping captures real business-process constraints without missing edge cases.

What would settle it

Evaluating the same 7B router on a fresh collection of adversarial routing scenarios outside the original 185 expert-curated ones and finding accuracy below GPT-4o or any unblocked illegal operations would falsify the central performance claims.
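
A check of this kind could be scripted roughly as follows. The sketch assumes that "joint accuracy" means the predicted intent and the predicted stage must both match the gold label; the source does not define the metric, and the `RoutingCase` structure and label strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Label = Tuple[str, str]  # (intent, stage); an assumed label format


@dataclass(frozen=True)
class RoutingCase:
    gold: Label
    router_pred: Label
    baseline_pred: Label


def joint_accuracy(golds: Sequence[Label], preds: Sequence[Label]) -> float:
    """Fraction of cases where intent AND stage both match (assumed definition)."""
    if not golds:
        raise ValueError("no cases to score")
    hits = sum(1 for g, p in zip(golds, preds) if g == p)
    return hits / len(golds)


def claims_falsified(cases: Sequence[RoutingCase], unblocked_illegal_ops: int) -> bool:
    """True if the router scores below the baseline or any illegal operation got through."""
    golds = [c.gold for c in cases]
    router = joint_accuracy(golds, [c.router_pred for c in cases])
    baseline = joint_accuracy(golds, [c.baseline_pred for c in cases])
    return router < baseline or unblocked_illegal_ops > 0


# Two made-up fresh scenarios, for illustration only.
fresh_cases = [
    RoutingCase(("schedule_interview", "INTERVIEW"),
                ("schedule_interview", "INTERVIEW"),
                ("schedule_interview", "SCREENING")),
    RoutingCase(("send_offer", "OFFER"),
                ("send_offer", "OFFER"),
                ("send_offer", "OFFER")),
]
```

With these toy cases the router matches both labels and the baseline matches one, so `claims_falsified(fresh_cases, 0)` is false, while a single unblocked illegal operation flips it to true.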

Figures

Figures reproduced from arXiv: 2605.15204 by Zhantao Wang.

Figure 1: SDOF as an enterprise harness architecture. The generative LLM core (top layer) is constrained by deterministic …
Figure 2: Finite-State Workflow (FSM) for Recruitment.
Figure 3: Real API call latency measurements
Figure 4: Cross-domain generalization on the SGD-derived …
Original abstract

Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SDOF, a multi-agent orchestration framework that models execution as a constrained state machine via a StateAwareDispatcher implementing GoalStage finite-automaton checks and SkillRegistry precondition/postcondition validation. It pairs this with an Online-RLHF Specialized Intent Router trained via GRPO/GSPO. On 185 expert-curated scenarios from the Beisen iTalent recruitment platform (triggering 1671 live API calls), the GSPO-aligned 7B router reports 80.9% joint accuracy versus 48.9% for zero-shot GPT-4o; end-to-end SDOF achieves 86.5% task completion (95% CI 80.8-90.7), blocks all 22 injection/illegal-HR operations, and attains 100% precision / 88% recall (kappa=0.94) on message-level blocking. A secondary evaluation on 960 SGD-derived dialogues across 8 domains surfaces 201 stage-order conflicts.

Significance. If the results hold under broader testing, the combination of RLHF-tuned routing with explicit finite-automaton state constraints offers a practical, auditable defense against misalignment in business-process multi-agent systems. The reported confidence interval, expert-agreement kappa, and perfect blocking on the illegal subset are concrete strengths that would support adoption in constrained domains.
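
For readers less familiar with the audit metrics cited here, the following sketch shows how precision, recall, and Cohen's kappa are computed for a binary block/allow decision. The labels are toy data chosen for illustration, not the paper's audit data, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (1 = message should be blocked)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels: the system never blocks a legitimate message but misses one bad one.
gold_labels = [1, 1, 1, 1, 0, 0, 0, 0]
system_blocks = [1, 1, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(gold_labels, system_blocks)
kappa = cohen_kappa(gold_labels, system_blocks)
```

The toy data has the same shape as the reported result: perfect precision means no legitimate message was blocked, while recall below 100 percent means some messages that should have been blocked were let through.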

major comments (2)
  1. [Evaluation section] Beisen iTalent experiments: the headline claims (80.9% joint accuracy, 86.5% task completion, 100% blocking precision) rest exclusively on 185 expert-curated scenarios from a single recruitment platform. No evidence is supplied that the scenario distribution covers edge cases from other domains or that the adversarial examples were generated independently of the FSM rules; this directly undermines the general claim that SDOF tames the alignment tax in multi-agent orchestration.
  2. [Method / Training subsection] Training and split description: the manuscript does not say whether the 7B Intent Router was trained on data strictly held out from the 185 evaluation scenarios, and it reports no ablations of the GRPO objective. This leaves open the possibility that the reported gains over GPT-4o are due to overfitting to the curated distribution rather than the state-constrained dispatch mechanism.
minor comments (2)
  1. [Abstract / §3] The finite-automaton mapping from GoalStage is described only at a high level; short pseudocode or a diagram of the state-transition function and how it interacts with SkillRegistry would improve clarity.
  2. [Results / SGD evaluation] The 960-dialogue SGD evaluation reports 201 conflicts, 41 of them in the normal split, but gives no full per-split breakdown and no per-domain statistics.
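
The breakdown requested in the second minor comment is cheap to produce once each dialogue is mapped to a stage sequence. The sketch below is an illustration under assumptions: the `ALLOWED` transition table, the stage names, and the choice to count one conflict per illegal transition (rather than per dialogue) are hypothetical, and the sample dialogues are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical FSM: for each stage, the stages that may legally follow it.
ALLOWED: Dict[str, Set[str]] = {
    "search": {"search", "select"},
    "select": {"select", "book"},
    "book": {"book"},
}

Dialogue = Tuple[str, str, List[str]]  # (domain, split, stage sequence)


def count_conflicts(stages: List[str], allowed: Dict[str, Set[str]]) -> int:
    """Number of consecutive stage pairs that violate the assumed automaton."""
    return sum(
        1 for prev, nxt in zip(stages, stages[1:])
        if nxt not in allowed.get(prev, set())
    )


def conflict_breakdown(dialogues: Iterable[Dialogue],
                       allowed: Dict[str, Set[str]]) -> Counter:
    """Stage-order conflicts keyed by (domain, split)."""
    table: Counter = Counter()
    for domain, split, stages in dialogues:
        table[(domain, split)] += count_conflicts(stages, allowed)
    return table


# Made-up dialogues for illustration only.
sample = [
    ("Hotels", "normal", ["search", "select", "book"]),
    ("Hotels", "adversarial", ["search", "book"]),            # skips "select"
    ("Flights", "adversarial", ["book", "search", "select"]),  # moves backwards
]
breakdown = conflict_breakdown(sample, ALLOWED)
total_conflicts = sum(breakdown.values())
```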

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we will incorporate changes in the next version of the paper to address the concerns raised while preserving the core contributions of SDOF.

  1. Referee: [Evaluation section] Beisen iTalent experiments: the headline claims (80.9% joint accuracy, 86.5% task completion, 100% blocking precision) rest exclusively on 185 expert-curated scenarios from a single recruitment platform. No evidence is supplied that the scenario distribution covers edge cases from other domains or that the adversarial examples were generated independently of the FSM rules; this directly undermines the general claim that SDOF tames the alignment tax in multi-agent orchestration.

    Authors: The primary experimental results are based on the Beisen iTalent platform as described. However, the manuscript does report a secondary evaluation on 960 SGD-derived dialogues from 8 service domains, revealing 201 stage-order conflicts under the FSM mapping. This provides supporting evidence for the generality of the state-constrained approach. We concede that the main benchmark is domain-specific and that the adversarial scenarios were tailored to the FSM rules. In the revised manuscript, we will update the Evaluation section to more explicitly discuss the limitations of the current evaluation scope, provide additional context on how the scenarios were curated, and qualify the general claims accordingly. We believe this addresses the concern without undermining the practical value demonstrated. revision: partial

  2. Referee: [Method / Training subsection] Training and split description: the manuscript does not say whether the 7B Intent Router was trained on data strictly held out from the 185 evaluation scenarios, and it reports no ablations of the GRPO objective. This leaves open the possibility that the reported gains over GPT-4o are due to overfitting to the curated distribution rather than the state-constrained dispatch mechanism.

    Authors: We acknowledge that the original manuscript lacked sufficient detail on the training procedure and data splits for the Intent Router. To clarify, the training utilized a held-out portion of the data and included ablations of the GRPO objective. We will revise the Method / Training subsection to include a comprehensive description of the data partitioning, training hyperparameters, and ablation results. This revision will help demonstrate that the performance improvements stem from the proposed alignment and dispatch mechanisms rather than potential overfitting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks and curated scenarios


The paper describes a framework with an Intent Router trained via GRPO and a StateAwareDispatcher using GoalStage finite-automaton checks. Central performance numbers (80.9% joint accuracy vs GPT-4o, 86.5% task completion, 100% blocking precision) are measured on 185 expert-curated scenarios from the external Beisen iTalent platform plus a separate 960-dialogue SGD set. No equations, fitted parameters, or self-citations are presented as load-bearing for the core claims; the evaluation distribution and adversarial examples are not shown to reduce to quantities defined solely inside the paper. The reported results therefore rest on data external to the paper rather than on its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the assumption that business processes can be faithfully modeled as finite state machines with precondition/postcondition checks; no free parameters are explicitly listed in the abstract, but training of the 7B router via GRPO implies hyperparameter choices.

axioms (1)
  • Domain assumption: business processes can be accurately represented as GoalStage finite automata with precondition and postcondition validations.
    Invoked in the description of the StateAwareDispatcher and SkillRegistry for auditable execution control.
invented entities (2)
  • StateAwareDispatcher (no independent evidence)
    Purpose: enforce stage constraints and validate skills during multi-agent execution.
    New component introduced to add FSM checks on top of existing orchestration frameworks.
  • Online-RLHF Specialized Intent Router (no independent evidence)
    Purpose: route tasks using generative reward modeling (GRPO) aligned to FSM constraints.
    Specialized 7B model presented as achieving superior accuracy on the adversarial benchmark.

pith-pipeline@v0.9.0 · 5824 in / 1376 out tokens · 38611 ms · 2026-05-19T18:01:50.686244+00:00 · methodology


