A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

REVIEW · 1 major objection · 1 minor · 35 references

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

The stochastic-deterministic boundary serves as the load-bearing primitive for runtime architectures in production LLM agents.

2026-05-20 04:48 UTC pith:5CVCJHOF

Load-bearing objection: The paper frames the stochastic-deterministic boundary as a four-part contract and catalogs six patterns for LLM agent runtimes, but the evidence for its practical value stays thin.

arXiv 2605.20173 v1 · pith:5CVCJHOF · submitted 2026-05-19 · cs.AI cs.SE

keywords: LLM agents · runtime architecture · stochastic-deterministic boundary · architecture patterns · production systems · agent reliability · pattern selection · replay divergence

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the stochastic-deterministic boundary, or SDB, as a four-part contract between a proposer, verifier, commit step, and reject signal that turns an LLM output into a reliable system action. It organizes runtime design around three concerns—Coordination, State, and Control—and catalogs six patterns that compose the SDB differently for conversational, autonomous, and long-horizon agents. The work supplies a five-step methodology for selecting and composing these patterns, a diagnostic that links production failures to pattern weaknesses, and a reliability decomposition that separates model variance from architectural momentum. It also identifies replay divergence as a failure mode in which deterministic event logs produce inconsistent downstream outputs when models or prompts change.

Core claim

The central claim is that naming and treating the SDB as a first-class contract allows systematic composition of runtime patterns across the three concerns, that a stylized reliability decomposition shows architectural momentum growing in importance as model variance falls, and that a five-step methodology plus failure diagnostics can guide pattern selection for production workloads.

What carries the argument

The stochastic-deterministic boundary (SDB), a four-part contract consisting of proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action.
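
The four-part contract can be made concrete with a minimal sketch. The class and function names below (`Boundary`, `Reject`, `run`, the retry limit, and the toy amount check) are illustrative assumptions, not taken from the paper; the proposer is a scripted stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Reject:
    """Reject signal: an explicit reason that can be fed back to the proposer."""
    reason: str


@dataclass
class Boundary(Generic[T]):
    """Hypothetical SDB wrapper; field names are assumptions of this sketch."""
    propose: Callable[[Optional[Reject]], T]  # stochastic side (e.g. an LLM call)
    verify: Callable[[T], Optional[Reject]]   # deterministic check; None means accept
    commit: Callable[[T], None]               # the only place side effects happen
    max_attempts: int = 3

    def run(self) -> Optional[T]:
        feedback: Optional[Reject] = None
        for _ in range(self.max_attempts):
            proposal = self.propose(feedback)
            feedback = self.verify(proposal)
            if feedback is None:
                self.commit(proposal)
                return proposal
        return None  # nothing was committed; the caller decides how to escalate


# Toy usage: a scripted proposer stands in for stochastic model outputs.
ledger: List[int] = []
scripted = iter([-5, 120, 40])


def propose(feedback: Optional[Reject]) -> int:
    return next(scripted)


def verify(amount: int) -> Optional[Reject]:
    if not 0 <= amount <= 100:
        return Reject(f"amount {amount} outside 0..100")
    return None


boundary = Boundary(propose=propose, verify=verify, commit=ledger.append)
result = boundary.run()
```

In this sketch the first two proposals are rejected with an explicit reason and only the third reaches the commit step, so the ledger never sees an unverified value.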

Load-bearing premise

The four-part contract adequately captures the boundary between stochastic and deterministic components in a way that enables effective pattern composition across Coordination, State, and Control.

What would settle it

An experiment that applies the same set of workloads to two different boundary definitions and shows measurably lower failure rates or replay divergence with the alternative contract would falsify the claim that this specific SDB is the load-bearing primitive.

If this is right

As model variance decreases, pattern choice and SDB strength become primary levers for long-run reliability rather than model improvements alone.
Production failures can be systematically mapped to weaknesses in specific runtime patterns through the diagnostic procedure.
Replay divergence appears when LLM consumers of a deterministic event log produce different outputs under model-version or prompt changes.
The six patterns—hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop—apply across conversational, autonomous, and long-horizon agents.
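
Replay divergence, as described in the list above, can be illustrated by replaying one fixed event log through two versions of a consumer and recording where their outputs differ. This is a minimal sketch under stated assumptions: the event log, the two rule-based consumers (standing in for an LLM consumer before and after a model or prompt change), and the function `replay_divergence` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Consumer = Callable[[Event], str]

# A deterministic event log: the same events in the same order on every replay.
EVENT_LOG: List[Event] = [
    {"id": "e1", "text": "customer asks to renew"},
    {"id": "e2", "text": "customer disputes invoice"},
    {"id": "e3", "text": "customer asks to cancel"},
]


def consumer_v1(event: Event) -> str:
    """Stand-in for the LLM consumer under the original model or prompt."""
    text = event["text"]
    if "renew" in text:
        return "RENEW"
    if "cancel" in text:
        return "CANCEL"
    return "ESCALATE"


def consumer_v2(event: Event) -> str:
    """Stand-in for the same consumer after a model or prompt change."""
    text = event["text"]
    if "renew" in text:
        return "RENEW"
    if "cancel" in text or "disputes" in text:
        return "CANCEL"
    return "ESCALATE"


def replay_divergence(
    log: List[Event], old: Consumer, new: Consumer
) -> List[Tuple[str, str, str]]:
    """Return (event id, old output, new output) for every event that diverges."""
    diverged = []
    for event in log:
        before, after = old(event), new(event)
        if before != after:
            diverged.append((event["id"], before, after))
    return diverged


divergences = replay_divergence(EVENT_LOG, consumer_v1, consumer_v2)
```

The log itself is unchanged between replays; only the consumer differs, which is why a deterministic log alone does not guarantee consistent downstream outputs.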

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the SDB contract is load-bearing, many existing agent frameworks may require explicit redesign around visible proposer-verifier-commit-reject interfaces instead of implicit integrations.
Replay divergence could function as a practical benchmark for evaluating architectural robustness when models are updated in deployed systems.
Extending the methodology to additional workloads might surface recurring failure signatures that cut across different industries and agent types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

1 major / 1 minor

Summary. The paper introduces the stochastic-deterministic boundary (SDB) as a four-part contract (proposer, verifier, commit step, reject signal) that serves as the load-bearing primitive for production LLM agent runtimes. It organizes runtime design into three concerns—Coordination, State, and Control—and catalogs six patterns (hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, human in the loop), each traced to distributed-systems concepts with adaptations for stochastic workers. Contributions include a five-step pattern selection methodology, a diagnostic procedure mapping failures to pattern weaknesses, identification of replay divergence as a failure mode, and a stylized reliability decomposition separating per-call model variance from architectural momentum. The framework is applied to five workloads with one runnable 90-day contract-renewal reference implementation.

Significance. If the central claims hold, the work could meaningfully advance reliable LLM agent engineering by treating the SDB as a first-class architectural boundary and supplying a pattern catalog and selection methodology grounded in distributed systems. The runnable reference implementation and replay-divergence concept provide practical anchors. The emphasis on how decreasing model variance elevates the importance of pattern choice and SDB strength offers a forward-looking perspective on long-horizon reliability.
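
The paper's reliability decomposition is not reproduced on this page. As a rough intuition only, the toy model below shows how per-call error and boundary strength can both enter long-horizon reliability. The formula, the independence assumption, and the parameter names are assumptions of this sketch, not the paper's decomposition.

```python
def run_success(per_call_error: float, catch_rate: float, steps: int) -> float:
    """Toy probability that a run of `steps` calls finishes with no uncaught error.

    Assumptions of this sketch (not from the paper): errors are independent
    across calls, and an error caught at the boundary does not fail the run.
    """
    uncaught = per_call_error * (1.0 - catch_rate)
    return (1.0 - uncaught) ** steps


# Compare a run without a verifier (catch_rate=0.0) to one with a strong verifier.
for per_call_error in (0.05, 0.01):
    for catch_rate in (0.0, 0.9):
        p = run_success(per_call_error, catch_rate, steps=200)
        print(f"error={per_call_error:.2f} catch={catch_rate:.1f} success={p:.4f}")
```

Even in this simplified model, small per-call error rates compound over many steps, so the fraction of errors stopped at the boundary has a large effect on whether a long run completes cleanly.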

major comments (1)

[Application to workloads and reference implementation] Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.

minor comments (1)

[Pattern catalog] The descriptions of the six patterns would benefit from additional concrete pseudocode or annotated diagrams explicitly marking the proposer-verifier-commit-reject elements of the SDB in each case.
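
One way such an annotation could look is sketched below for a saga-style composition, with comments marking the proposer, verifier, commit, and reject elements. This is an illustrative reading built on the general idea of compensating actions; `SagaStep`, `run_saga`, and the booking example are hypothetical and are not the paper's pattern definitions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    propose: Callable[[], str]       # PROPOSER: stochastic worker suggests an action
    verify: Callable[[str], bool]    # VERIFIER: deterministic check on the proposal
    compensate: Callable[[], None]   # undo for this step if a later step is rejected


def run_saga(steps: List[SagaStep], committed: List[str]) -> bool:
    done: List[SagaStep] = []
    for step in steps:
        proposal = step.propose()
        if not step.verify(proposal):
            # REJECT SIGNAL: stop and compensate earlier commits in reverse order.
            for prior in reversed(done):
                prior.compensate()
            return False
        committed.append(proposal)   # COMMIT: the only state-changing line
        done.append(step)
    return True


# Toy usage: the second proposal is invalid, so the first commit is undone.
state: List[str] = []
steps = [
    SagaStep(
        name="reserve",
        propose=lambda: "reserve:A1",
        verify=lambda p: p.startswith("reserve:"),
        compensate=lambda: state.remove("reserve:A1"),
    ),
    SagaStep(
        name="charge",
        propose=lambda: "charge:-10",
        verify=lambda p: int(p.split(":")[1]) >= 0,
        compensate=lambda: None,
    ),
]
ok = run_saga(steps, state)
```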

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying this important gap in the presentation of our empirical application. We address the comment below and commit to a substantive revision that strengthens the validation of the SDB and pattern catalog.

Referee: Application to workloads section: The abstract and introduction state that the methodology is applied to five workloads plus a runnable reference implementation, yet no detailed evidence, error analysis, quantitative metrics, or validation results are visible. This gap directly affects assessment of whether the SDB contract and pattern compositions deliver the claimed reliability and composition benefits, which is load-bearing for the central argument that the SDB is the load-bearing primitive.

Authors: We agree that the current version of the manuscript presents the five workloads and the 90-day contract-renewal reference implementation primarily as illustrative case studies that demonstrate the five-step selection methodology and the diagnostic procedure. While the section traces pattern choices back to the SDB contract and notes observed failure modes (including replay divergence), it does not supply the quantitative metrics, before/after error rates, or comparative reliability measurements that would allow direct assessment of the claimed benefits. This is a genuine limitation in the submitted draft. In the revised manuscript we will expand the section to include: (1) concrete failure counts and divergence incidents recorded during the 90-day run, (2) explicit mapping of each failure to weaknesses in the chosen pattern or SDB implementation, (3) any available quantitative indicators of reliability improvement attributable to the architectural choices, and (4) additional excerpts from the runnable implementation showing the proposer-verifier-commit-reject contract in operation. These additions will directly support the central claim that the SDB functions as the load-bearing primitive.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a methodological contribution that defines the stochastic-deterministic boundary (SDB) as a four-part contract and uses it to organize a catalog of six runtime patterns across Coordination, State, and Control. It traces lineages to distributed-systems concepts, presents a five-step selection methodology, a diagnostic procedure, and applies the framework to five workloads plus a reference implementation. No equations, fitted parameters, predictions, or self-referential derivations appear; the central claim that the SDB is load-bearing rests on conceptual organization and external literature rather than reducing to its own inputs by construction. The derivation chain is self-contained as a design lens.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on domain assumptions about stochastic vs deterministic components and introduces a conceptual entity without mathematical free parameters or additional invented entities beyond the named boundary.

axioms (1)

domain assumption: Production LLM agents combine stochastic model outputs with deterministic software systems.
This distinction is the basis for defining the SDB as stated in the abstract.

invented entities (1)

stochastic-deterministic boundary (SDB) · no independent evidence
purpose: Specifies how an LLM output becomes a system action through a four-part contract among proposer, verifier, commit step, and reject signal
Newly named primitive to organize agent runtime design; no external falsifiable prediction provided in abstract.

pith-pipeline@v0.9.0 · 5790 in / 1463 out tokens · 81029 ms · 2026-05-20T04:48:54.880594+00:00 · methodology

Original abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.

Figures

Figures reproduced from arXiv: 2605.20173 by Vasundra Srinivasan.

**Figure 2.** The methodology’s geometry. Three concerns overlap. The production runtime is ...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 12 internal anchors


[1] [1]

Pourreza, M

Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández- Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing.Proceedings of the VLDB Endowment,...

work page doi:10.14778/2824032.2824076 2015

[2] [2]

PhD thesis, Royal Institute of Technology (KTH), 2003

Joe Armstrong.Making Reliable Distributed Systems in the Presence of Software Errors. PhD thesis, Royal Institute of Technology (KTH), 2003. URLhttp://kth.diva-portal. org/smash/get/diva2:9492/FULLTEXT01.pdf

work page 2003

[3] [3]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page

[4] [4]

URLhttps://arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Eric A. Brewer. Towards robust distributed systems (invited talk). InProceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC), 2000. doi: 10.1145/343477.343502. URLhttps://doi.org/10.1145/343477.343502

work page doi:10.1145/343477.343502 2000

[6] [6]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail?arXiv preprint, 2025. URLhttps://arxiv.org/abs/2503.13657

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen-Ming Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors.arXiv preprint, 2023. URL https://arxiv.org/abs/2308.10848. Later published...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, 22 David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymani...

work page 2012

[9] [9]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate.arXiv preprint,

work page

[10] [10]

URLhttps://arxiv.org/abs/2305.14325

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Hector Garcia-Molina and Kenneth Salem. Sagas. InProceedings of the ACM SIGMOD International Conference on Management of Data, pages 249–259, 1987. doi: 10.1145/38713. 38742. URLhttps://doi.org/10.1145/38713.38742

work page doi:10.1145/38713 1987

[12] [12]

Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services

Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services.ACM SIGACT News, 33(2):51–59, 2002. doi: 10.1145/564585.564601. URLhttps://doi.org/10.1145/564585.564601

work page doi:10.1145/564585.564601 2002

[13] [13]

Life beyond distributed transactions: An apostate’s opinion

Pat Helland. Life beyond distributed transactions: An apostate’s opinion. InConference on Innovative Data Systems Research (CIDR), 2007. URL https://www.cidrdb.org/ cidr2007/papers/cidr07p15.pdf

work page 2007

[14] [14]

A universal modular ACTOR formalism for artificial intelligence

Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. InProceedings of the 3rd International Joint Conference on Artificial Intelligence (IJCAI), pages 235–245. Morgan Kaufmann, 1973. URLhttps: //dl.acm.org/doi/10.5555/1624775.1624804

work page doi:10.5555/1624775.1624804 1973

[15] [15]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, ZiliWang, Steven KaShing Yau, Zijuan Lin, Liyang Zhou, ChenyuRan, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint, 2023. URLhttps://arxiv.org/abs/2308.00352. Later published at ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Telco customer churn

IBM Cognos Analytics Sample Data. Telco customer churn. IBM Watson Analytics Community sample dataset, redistributed on Kaggle and GitHub, 2018. URLhttps:// www.kaggle.com/datasets/blastchar/telco-customer-churn. Publicly redistributable sample dataset under IBM Sample Data terms. 7,043 customer records

work page 2018

[17] [17]

Rudolf E. Kalman. On the general theory of control systems. InProceedings of the First International Congress of the IFAC, Moscow, pages 481–492. Butterworths, London, 1960. Reprinted in IFAC Proceedings Volumes 1(1), pp. 491–502, doi:10.1016/S1474-6670(17)70094- 8

work page doi:10.1016/s1474-6670(17)70094- 1960

[18] [18]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint, 2023. URLhttps://arxiv.org/abs/ 2310.03714. 23

[19]

The log: What every software engineer should know about real-time data’s unifying abstraction

Jay Kreps. The log: What every software engineer should know about real-time data’s unifying abstraction. LinkedIn Engineering Blog,

[20]

URL https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

[21]

The part-time parliament

Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998. doi: 10.1145/279227.279229. URL https://doi.org/10.1145/279227.279229

[22]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2303.17760

[23]

In search of an understandable consensus algorithm

Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference (USENIX ATC), pages 305–319, 2014. URL https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro

[24]

Rejected tool calls use status: ’completed’ in function_call_result, causing model hallucinations

OpenAI Agents JS contributors. Rejected tool calls use status: ’completed’ in function_call_result, causing model hallucinations. GitHub issue #1104, openai/openai-agents-js, 2024. URL https://github.com/openai/openai-agents-js/issues/1104. Documents a reject-signal bug in which rejected tool calls were reported back to the model with status: ’completed’ ...

[25]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023. doi: 10.1145/3586183.3606763. URL https://arxiv.org/abs/2304.03442

[26]

When model upgrades break your agents

Promptfoo. When model upgrades break your agents. Promptfoo engineering blog, 2024. URL https://www.promptfoo.dev/blog/model-upgrades-break-agent-safety/. Documents a 23-point drop (94% to 71%) in prompt-injection resistance after upgrading a production agent from GPT-4o to GPT-4.1 on an identical evaluation harness; recommended fix is an output classif...

[27]

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv preprint, 2023. URL https://arxiv.org/abs/2310.10501

[28]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2303.17580

[29]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2303.11366.

[30]

Wil M. P. van der Aalst. The application of Petri nets to workflow management. Journal of Circuits, Systems and Computers, 8(1):21–66, 1998. doi: 10.1142/S0218126698000043. URL https://doi.org/10.1142/S0218126698000043

[31]

Eventually consistent

Werner Vogels. Eventually consistent. ACM Queue, 6(6):14–19, 2008. doi: 10.1145/1466443.1466448. URL https://doi.org/10.1145/1466443.1466448

[32]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint, 2023. URL https://arxiv.org/abs/2308.08155

[33]

A survey of human-in-the-loop for machine learning

Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381, 2022. doi: 10.1016/j.future.2022.05.014. URL https://arxiv.org/abs/2108.00941

[34]

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, and Bo Li. GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning. arXiv preprint, 2024. URL https://arxiv.org/abs/2406.09187

[35]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. URL https://arxiv.org/a...
