A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
Pith reviewed 2026-05-20 04:48 UTC · model grok-4.3
pith:5CVCJHOF
The pith
The stochastic-deterministic boundary serves as the load-bearing primitive for runtime architectures in production LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has three parts. First, naming the SDB and treating it as a first-class contract allows systematic composition of runtime patterns across the three concerns (Coordination, State, and Control). Second, a stylized reliability decomposition shows architectural momentum growing in importance as model variance falls. Third, a five-step methodology plus failure diagnostics can guide pattern selection for production workloads.
What carries the argument
The stochastic-deterministic boundary (SDB), a four-part contract consisting of proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action.
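To make the four-part contract concrete, the following is a minimal sketch of how such a boundary might look in code. It is not taken from the paper or its reference implementation. The names `Boundary`, `Rejection`, `max_attempts`, and the discount-cap rule are all assumptions introduced for illustration, and the proposer is a scripted stand-in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Rejection:
    """Reject signal: a structured reason that is fed back to the proposer."""
    reason: str


@dataclass
class Boundary(Generic[T]):
    """Hypothetical SDB wrapper; every name here is illustrative."""
    propose: Callable[[List[Rejection]], T]      # proposer: stochastic, e.g. an LLM call
    verify: Callable[[T], Optional[Rejection]]   # verifier: deterministic, None means accept
    commit: Callable[[T], None]                  # commit step: the only place with side effects
    max_attempts: int = 3
    rejections: List[Rejection] = field(default_factory=list)

    def run(self) -> bool:
        """Return True once a proposal is verified and committed."""
        for _ in range(self.max_attempts):
            proposal = self.propose(self.rejections)
            rejection = self.verify(proposal)
            if rejection is None:
                self.commit(proposal)
                return True
            # Record the reject signal so the next proposal can take it into account.
            self.rejections.append(rejection)
        return False


def demo_boundary():
    scripted = iter([40, 10])  # stand-in for two successive model outputs
    ledger: List[int] = []

    def propose(history: List[Rejection]) -> int:
        return next(scripted)

    def verify(discount: int) -> Optional[Rejection]:
        if 0 <= discount <= 15:
            return None
        return Rejection(f"discount {discount}% exceeds 15% cap")

    boundary = Boundary(propose=propose, verify=verify, commit=ledger.append)
    ok = boundary.run()
    return ok, ledger, [r.reason for r in boundary.rejections]
```

In this sketch the first scripted proposal is rejected and never reaches the ledger. Only the second, verified proposal is committed, which is the property the contract is meant to make explicit.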
If this is right
- As model variance decreases, pattern choice and SDB strength become primary levers for long-run reliability rather than model improvements alone.
- Production failures can be systematically mapped to weaknesses in specific runtime patterns through the diagnostic procedure.
- Replay divergence appears when LLM consumers of a deterministic event log produce different outputs under model-version or prompt changes.
- The six patterns—hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop—apply across conversational, autonomous, and long-horizon agents.
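Replay divergence, as described in the list above, can be illustrated with a small sketch. The event log is fixed, but two versions of a consumer derive different outputs from it. Everything here is an assumption made for illustration: the two consumer functions are deterministic stand-ins for an LLM before and after a model or prompt change, and the `fingerprint` helper is one possible way to tag derived outputs. The paper does not prescribe it.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Event = Dict[str, int]
Consumer = Callable[[Event], str]

# A deterministic event log: replaying it always yields the same events.
LOG: List[Event] = [
    {"id": 1, "days_to_renewal": 90},
    {"id": 2, "days_to_renewal": 40},
    {"id": 3, "days_to_renewal": 20},
]


def consumer_v1(event: Event) -> str:
    """Stand-in for the LLM consumer before a model or prompt change."""
    return "escalate" if event["days_to_renewal"] <= 30 else "wait"


def consumer_v2(event: Event) -> str:
    """Stand-in for the same consumer after the change."""
    return "escalate" if event["days_to_renewal"] <= 45 else "wait"


def fingerprint(model_version: str, prompt: str) -> str:
    """Hypothetical tag identifying which model and prompt produced an output."""
    return hashlib.sha256(f"{model_version}\n{prompt}".encode()).hexdigest()[:12]


def replay(log: List[Event], consumer: Consumer) -> List[str]:
    return [consumer(event) for event in log]


def divergence(log: List[Event], old: Consumer, new: Consumer) -> List[Tuple[int, str, str]]:
    """Return (log index, old output, new output) wherever the two replays disagree."""
    pairs = zip(replay(log, old), replay(log, new))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]
```

Calling `divergence(LOG, consumer_v1, consumer_v2)` reports the one event on which the two versions disagree, even though the log itself never changed. Storing a fingerprint next to each derived output is one way a system could detect that a replay is about to run under a different model or prompt.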
Where Pith is reading between the lines
- If the SDB contract is load-bearing, many existing agent frameworks may require explicit redesign around visible proposer-verifier-commit-reject interfaces instead of implicit integrations.
- Replay divergence could function as a practical benchmark for evaluating architectural robustness when models are updated in deployed systems.
- Extending the methodology to additional workloads might surface recurring failure signatures that cut across different industries and agent types.
Load-bearing premise
The four-part contract adequately captures the boundary between stochastic and deterministic components in a way that enables effective pattern composition across Coordination, State, and Control.
What would settle it
An experiment that runs the same set of workloads under two different boundary definitions and finds measurably lower failure rates or less replay divergence with the alternative contract would falsify the claim that this specific SDB is the load-bearing primitive.
Original abstract
Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the stochastic-deterministic boundary (SDB) as a four-part contract (proposer, verifier, commit step, reject signal) that serves as the load-bearing primitive for production LLM agent runtimes. It organizes runtime design into three concerns—Coordination, State, and Control—and catalogs six patterns (hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, human in the loop), each traced to distributed-systems concepts with adaptations for stochastic workers. Contributions include a five-step pattern selection methodology, a diagnostic procedure mapping failures to pattern weaknesses, identification of replay divergence as a failure mode, and a stylized reliability decomposition separating per-call model variance from architectural momentum. The framework is applied to five workloads with one runnable 90-day contract-renewal reference implementation.
Significance. If the central claims hold, the work could meaningfully advance reliable LLM agent engineering by treating the SDB as a first-class architectural boundary and supplying a pattern catalog and selection methodology grounded in distributed systems. The runnable reference implementation and replay-divergence concept provide practical anchors. The emphasis on how decreasing model variance elevates the importance of pattern choice and SDB strength offers a forward-looking perspective on long-horizon reliability.
major comments (1)
- [Application to workloads and reference implementation] Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.
minor comments (1)
- [Pattern catalog] The descriptions of the six patterns would benefit from additional concrete pseudocode or annotated diagrams explicitly marking the proposer-verifier-commit-reject elements of the SDB in each case.
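As an illustration of the kind of annotation this comment asks for, the sketch below marks the four SDB elements inside a simplified scatter-gather plus saga flow. It is not the paper's pseudocode. `SagaStep`, `run_saga`, and the two example steps are hypothetical, the scatter phase runs sequentially for brevity, and the proposers are fixed lambdas standing in for LLM workers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SagaStep:
    name: str
    propose: Callable[[], str]          # PROPOSER: stochastic worker (stubbed here)
    verify: Callable[[str], bool]       # VERIFIER: deterministic acceptance check
    commit: Callable[[str], None]       # COMMIT STEP: applies the side effect
    compensate: Callable[[str], None]   # saga compensation for an already committed step


def run_saga(steps: List[SagaStep]) -> Tuple[bool, str]:
    # Scatter and gather: collect one proposal per step before committing anything.
    proposals = [(step, step.propose()) for step in steps]
    committed: List[Tuple[SagaStep, str]] = []
    for step, proposal in proposals:
        if not step.verify(proposal):
            # REJECT SIGNAL: undo earlier commits in reverse order and report why.
            for done, value in reversed(committed):
                done.compensate(value)
            return False, f"{step.name} rejected {proposal!r}"
        step.commit(proposal)
        committed.append((step, proposal))
    return True, "all steps committed"


def demo_saga():
    trail: List[str] = []
    steps = [
        SagaStep(
            "reserve_discount",
            lambda: "10%",
            lambda p: p.endswith("%"),
            lambda p: trail.append(f"commit reserve_discount {p}"),
            lambda p: trail.append(f"undo reserve_discount {p}"),
        ),
        SagaStep(
            "send_offer",
            lambda: "",  # an empty draft, which the verifier rejects
            lambda p: len(p) > 0,
            lambda p: trail.append(f"commit send_offer {p}"),
            lambda p: trail.append(f"undo send_offer {p}"),
        ),
    ]
    ok, message = run_saga(steps)
    return ok, message, trail
```

In the demo the second step is rejected, so the first step's commit is compensated and the trail records both the commit and its undo.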
Simulated Author's Rebuttal
We thank the referee for the detailed review and for identifying this important gap in the presentation of our empirical application. We address the comment below and commit to a substantive revision that strengthens the validation of the SDB and pattern catalog.
Point-by-point responses
Referee: Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.
Authors: We agree that the current version of the manuscript presents the five workloads and the 90-day contract-renewal reference implementation primarily as illustrative case studies that demonstrate the five-step selection methodology and the diagnostic procedure. While the section traces pattern choices back to the SDB contract and notes observed failure modes (including replay divergence), it does not supply the quantitative metrics, before/after error rates, or comparative reliability measurements that would allow direct assessment of the claimed benefits. This is a genuine limitation in the submitted draft. In the revised manuscript we will expand the section to include: (1) concrete failure counts and divergence incidents recorded during the 90-day run, (2) explicit mapping of each failure to weaknesses in the chosen pattern or SDB implementation, (3) any available quantitative indicators of reliability improvement attributable to the architectural choices, and (4) additional excerpts from the runnable implementation showing the proposer-verifier-commit-reject contract in operation. These additions will directly support the central claim that the SDB functions as the load-bearing primitive.
Revision: yes
Circularity Check
No significant circularity
The paper is a methodological contribution that defines the stochastic-deterministic boundary (SDB) as a four-part contract and uses it to organize a catalog of six runtime patterns across Coordination, State, and Control. It traces lineages to distributed-systems concepts, presents a five-step selection methodology, a diagnostic procedure, and applies the framework to five workloads plus a reference implementation. No equations, fitted parameters, predictions, or self-referential derivations appear; the central claim that the SDB is load-bearing rests on conceptual organization and external literature rather than reducing to its own inputs by construction. The derivation chain is self-contained as a design lens.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Production LLM agents combine stochastic model outputs with deterministic software systems
invented entities (1)
- stochastic-deterministic boundary (SDB): no independent evidence