
arxiv: 2605.20173 · v1 · pith:5CVCJHOFnew · submitted 2026-05-19 · 💻 cs.AI · cs.SE

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Pith reviewed 2026-05-20 04:48 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LLM agents · runtime architecture · stochastic-deterministic boundary · architecture patterns · production systems · agent reliability · pattern selection · replay divergence

The pith

The stochastic-deterministic boundary serves as the load-bearing primitive for runtime architectures in production LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the stochastic-deterministic boundary, or SDB, as a four-part contract between a proposer, verifier, commit step, and reject signal that turns an LLM output into a reliable system action. It organizes runtime design around three concerns—Coordination, State, and Control—and catalogs six patterns that compose the SDB differently for conversational, autonomous, and long-horizon agents. The work supplies a five-step methodology for selecting and composing these patterns, a diagnostic that links production failures to pattern weaknesses, and a reliability decomposition that separates model variance from architectural momentum. It also identifies replay divergence as a failure mode in which deterministic event logs produce inconsistent downstream outputs when models or prompts change.

Core claim

The central claim is that naming and treating the SDB as a first-class contract allows systematic composition of runtime patterns across the three concerns, that a stylized reliability decomposition shows architectural momentum growing in importance as model variance falls, and that a five-step methodology plus failure diagnostics can guide pattern selection for production workloads.
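One way to see why architecture could matter more as per-call variance falls is a toy compounding model. The sketch below is not the paper's decomposition, which this summary does not reproduce. It assumes independent calls, a per-call success probability `p`, and a hypothetical perfect verifier that lets a rejected proposal be retried. The names `gated_step` and `run_success` are illustrative only.

```python
def gated_step(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent proposals passes.

    Assumes a perfect verifier: bad proposals are always rejected and good
    ones are always committed. Real verifiers are weaker than this.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p) ** attempts


def run_success(p: float, steps: int, attempts: int = 1) -> float:
    """Chance that every one of `steps` sequential steps succeeds."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return gated_step(p, attempts) ** steps


if __name__ == "__main__":
    # Compare a 100-step run with no gate against one that allows a single retry.
    for p in (0.95, 0.99, 0.999):
        ungated = run_success(p, steps=100)
        gated = run_success(p, steps=100, attempts=2)
        print(f"p={p}: ungated={ungated:.3f} gated={gated:.3f}")
```

Under these assumptions, a 100-step run at p = 0.99 completes about 37% of the time without a gate and about 99% of the time with one retry. The per-call rate and the boundary design both move the long-run number, which is the intuition behind the claim and nothing more.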

What carries the argument

The stochastic-deterministic boundary (SDB), a four-part contract consisting of proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action.
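The four parts can be sketched as plain callables. This is a minimal illustration under stated assumptions, not the paper's reference implementation: `run_boundary`, `Outcome`, the retry loop, and the choice to feed the reject reason back to the proposer are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

P = TypeVar("P")  # type of the proposed action


@dataclass
class Outcome(Generic[P]):
    committed: bool
    proposal: Optional[P]
    rejections: List[str] = field(default_factory=list)


def run_boundary(
    propose: Callable[[Optional[str]], P],  # proposer: stochastic, e.g. an LLM call
    verify: Callable[[P], Optional[str]],   # verifier: None means accept, else a reason
    commit: Callable[[P], None],            # commit step: the only place with side effects
    max_attempts: int = 3,
) -> Outcome[P]:
    """Pass proposals through the boundary until one commits or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rejections: List[str] = []
    feedback: Optional[str] = None
    proposal: Optional[P] = None
    for _ in range(max_attempts):
        proposal = propose(feedback)
        feedback = verify(proposal)
        if feedback is None:
            commit(proposal)
            return Outcome(True, proposal, rejections)
        # Reject signal: recorded here and passed to the next proposal attempt.
        rejections.append(feedback)
    return Outcome(False, proposal, rejections)


if __name__ == "__main__":
    ledger: List[int] = []
    drafts = iter([500, 80])  # canned stand-ins for LLM outputs

    def propose(feedback: Optional[str]) -> int:
        return next(drafts)

    def verify(amount: int) -> Optional[str]:
        return None if amount <= 100 else "refund exceeds the 100 limit"

    print(run_boundary(propose, verify, ledger.append), ledger)
```

Keeping side effects inside `commit` means a rejected proposal leaves no trace in system state. The reject reason is an explicit value rather than a silent drop.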

If this is right

  • As model variance decreases, pattern choice and SDB strength become primary levers for long-run reliability rather than model improvements alone.
  • Production failures can be systematically mapped to weaknesses in specific runtime patterns through the diagnostic procedure.
  • Replay divergence appears when LLM consumers of a deterministic event log produce different outputs under model-version or prompt changes.
  • The six patterns—hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop—apply across conversational, autonomous, and long-horizon agents.
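Replay divergence, as described in the third bullet, can be checked by replaying one fixed log through two consumer versions and comparing outputs position by position. The sketch below is an assumption-laden illustration: the event shape, the helper names `replay` and `divergences`, and the two stub consumers (deterministic stand-ins for an LLM consumer before and after a model or prompt change) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Event = Dict[str, int]
Consumer = Callable[[Event], str]


def replay(log: Sequence[Event], consumer: Consumer) -> List[str]:
    """Feed a fixed event log to a consumer and collect its outputs in order."""
    return [consumer(event) for event in log]


def divergences(
    log: Sequence[Event], baseline: Consumer, candidate: Consumer
) -> List[Tuple[int, str, str]]:
    """Return (index, baseline_output, candidate_output) wherever the two differ."""
    old = replay(log, baseline)
    new = replay(log, candidate)
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Stub consumers: same log, different decision thresholds, standing in for
# the behavior shift that a model-version or prompt change might cause.
def consumer_v1(event: Event) -> str:
    return "escalate" if event["days_left"] <= 30 else "wait"


def consumer_v2(event: Event) -> str:
    return "escalate" if event["days_left"] <= 45 else "wait"


LOG: List[Event] = [{"days_left": 60}, {"days_left": 40}, {"days_left": 20}]

if __name__ == "__main__":
    for index, old, new in divergences(LOG, consumer_v1, consumer_v2):
        print(f"event {index}: {old!r} -> {new!r}")
```

A real LLM consumer is not deterministic even at a fixed version, so a practical check would likely sample each version several times and compare output distributions rather than single outputs.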

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the SDB contract is load-bearing, many existing agent frameworks may require explicit redesign around visible proposer-verifier-commit-reject interfaces instead of implicit integrations.
  • Replay divergence could function as a practical benchmark for evaluating architectural robustness when models are updated in deployed systems.
  • Extending the methodology to additional workloads might surface recurring failure signatures that cut across different industries and agent types.

Load-bearing premise

The four-part contract adequately captures the boundary between stochastic and deterministic components in a way that enables effective pattern composition across Coordination, State, and Control.

What would settle it

An experiment that runs the same set of workloads under two different boundary definitions, and shows measurably lower failure rates or less replay divergence under the alternative contract, would falsify the claim that this specific SDB is the load-bearing primitive.

Figures

Figures reproduced from arXiv: 2605.20173 by Vasundra Srinivasan.

Figure 1: The 3 by 6 catalog. Three concerns. Two patterns in each. The framework is the ...
Figure 2: The methodology’s geometry. Three concerns overlap. The production runtime is ...

Original abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the stochastic-deterministic boundary (SDB) as a four-part contract (proposer, verifier, commit step, reject signal) that serves as the load-bearing primitive for production LLM agent runtimes. It organizes runtime design into three concerns—Coordination, State, and Control—and catalogs six patterns (hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, human in the loop), each traced to distributed-systems concepts with adaptations for stochastic workers. Contributions include a five-step pattern selection methodology, a diagnostic procedure mapping failures to pattern weaknesses, identification of replay divergence as a failure mode, and a stylized reliability decomposition separating per-call model variance from architectural momentum. The framework is applied to five workloads with one runnable 90-day contract-renewal reference implementation.

Significance. If the central claims hold, the work could meaningfully advance reliable LLM agent engineering by treating the SDB as a first-class architectural boundary and supplying a pattern catalog and selection methodology grounded in distributed systems. The runnable reference implementation and replay-divergence concept provide practical anchors. The emphasis on how decreasing model variance elevates the importance of pattern choice and SDB strength offers a forward-looking perspective on long-horizon reliability.

major comments (1)
  1. [Application to workloads and reference implementation] Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.
minor comments (1)
  1. [Pattern catalog] The descriptions of the six patterns would benefit from additional concrete pseudocode or annotated diagrams explicitly marking the proposer-verifier-commit-reject elements of the SDB in each case.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying this important gap in the presentation of our empirical application. We address the comment below and commit to a substantive revision that strengthens the validation of the SDB and pattern catalog.

Point-by-point responses
  1. Referee: Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.

    Authors: We agree that the current version of the manuscript presents the five workloads and the 90-day contract-renewal reference implementation primarily as illustrative case studies that demonstrate the five-step selection methodology and the diagnostic procedure. While the section traces pattern choices back to the SDB contract and notes observed failure modes (including replay divergence), it does not supply the quantitative metrics, before/after error rates, or comparative reliability measurements that would allow direct assessment of the claimed benefits. This is a genuine limitation in the submitted draft. In the revised manuscript we will expand the section to include: (1) concrete failure counts and divergence incidents recorded during the 90-day run, (2) explicit mapping of each failure to weaknesses in the chosen pattern or SDB implementation, (3) any available quantitative indicators of reliability improvement attributable to the architectural choices, and (4) additional excerpts from the runnable implementation showing the proposer-verifier-commit-reject contract in operation. These additions will directly support the central claim that the SDB functions as the load-bearing primitive.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is a methodological contribution that defines the stochastic-deterministic boundary (SDB) as a four-part contract and uses it to organize a catalog of six runtime patterns across Coordination, State, and Control. It traces lineages to distributed-systems concepts, presents a five-step selection methodology, a diagnostic procedure, and applies the framework to five workloads plus a reference implementation. No equations, fitted parameters, predictions, or self-referential derivations appear; the central claim that the SDB is load-bearing rests on conceptual organization and external literature rather than reducing to its own inputs by construction. The derivation chain is self-contained as a design lens.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on domain assumptions about stochastic vs deterministic components and introduces a conceptual entity without mathematical free parameters or additional invented entities beyond the named boundary.

axioms (1)
  • domain assumption: Production LLM agents combine stochastic model outputs with deterministic software systems.
    This distinction is the basis for defining the SDB as stated in the abstract.
invented entities (1)
  • stochastic-deterministic boundary (SDB) · no independent evidence
    purpose: Specifies how an LLM output becomes a system action through a four-part contract among proposer, verifier, commit step, and reject signal.
    Newly named primitive to organize agent runtime design; no external falsifiable prediction is provided in the abstract.


