Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

Hai-Duong Nguyen; Xuan-The Tran

arxiv: 2605.17998 · v2 · pith:OKS6V23Dnew · submitted 2026-05-18 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-22 10:05 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords multi-agent systems · verify-gated completion · admission control · fail-closed · audit traces · governed runtime · bounded architecture · software engineering

The pith

Read-only verify gate plus packetized records make multi-agent completion decisions inspectable and fail-closed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies verify-gated completion as an admission-control pattern for governed multi-agent runtimes. Agents propose completions while a read-only verifier decides admission, ambiguous cases resolve fail-closed, and packetized traces preserve an audit path. In one bounded reference implementation, the released data showed 99.5 percent verify success among invoked events and 98.58 percent rule agreement in a shadow evaluation, which supports the claim that decisions were inspectable and fail-closed under those observed conditions. The approach matters because it reframes completion as a runtime-control issue for tool-using workflows with roles and persistent state, rather than a purely generative step.

Core claim

Under observed conditions, a read-only verify gate plus packetized admission records made completion decisions inspectable and fail-closed. In the released verify-completed slice, the known-outcome invoked-event verify success share was 1,791 of 1,800 (99.5 percent). A shadow Policy/Governance Verifier showed 98.58 percent rule agreement with zero false successes among safe-to-proceed predictions. Task-level coverage remains uncomputable, however, and 1,762 of 1,801 rows came from a single high-volume reporting cluster.

What carries the argument

Read-only verify gate combined with packetized state and event traces for admission control and audit.

If this is right

Completion decisions shift from generative to runtime-control mechanisms in workflows with specialized roles and persistent state.
Ambiguous or weakly evidenced cases resolve fail-closed to preserve governance.
Packetized state and event traces create an explicit audit path for every admission decision.
The shadow Policy/Governance Verifier remains advisory while its blocked precision stays low (2/518 = 0.39 percent in the reported evaluation).
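The admission pattern in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Decision`, `AdmissionRecord`, `check_evidence`, and `verify_gate`, the record fields, and the `"tests_passed"` evidence value are all hypothetical. The sketch shows the two properties the paper claims: the verifier only reads state, and any outcome other than a clear confirmation resolves to rejection while still leaving a record.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Mapping, Optional


class Decision(Enum):
    ADMIT = "admit"
    REJECT = "reject"


@dataclass(frozen=True)
class AdmissionRecord:
    """One packetized admission record (field names are hypothetical)."""
    task_id: str
    proposer: str
    decision: Decision
    reason: str


def check_evidence(state: Mapping[str, str], task_id: str) -> Optional[bool]:
    """Hypothetical check: True or False when evidence is clear, None when ambiguous."""
    evidence = state.get(task_id)
    if evidence is None:
        return None  # nothing recorded for this task: ambiguous
    return evidence == "tests_passed"


def verify_gate(state: Mapping[str, str], task_id: str, proposer: str,
                trace: List[AdmissionRecord]) -> Decision:
    """Decide admission by reading state only; ambiguity resolves to REJECT."""
    outcome = check_evidence(state, task_id)
    if outcome is True:
        decision, reason = Decision.ADMIT, "evidence confirmed"
    elif outcome is False:
        decision, reason = Decision.REJECT, "evidence contradicts claim"
    else:
        decision, reason = Decision.REJECT, "ambiguous evidence (fail-closed)"
    # Every decision, admitted or not, is appended to the audit trace.
    trace.append(AdmissionRecord(task_id, proposer, decision, reason))
    return decision


trace: List[AdmissionRecord] = []
state = {"t1": "tests_passed", "t2": "tests_failed"}
verify_gate(state, "t1", "agent_a", trace)  # clear evidence
verify_gate(state, "t3", "agent_b", trace)  # no evidence recorded
```

In this sketch the agent's proposal never sets the decision itself; it only triggers a gate call, and the trace holds one record per call regardless of outcome.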

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Testing the same gate on datasets with balanced cluster distribution could check whether the inspectability property holds more generally.
Measuring task-level verify coverage directly would clarify gaps hidden by event-level accounting.
Pairing the read-only gate with additional policy verifiers might raise blocked precision without losing the fail-closed property.

Load-bearing premise

The observed verification events, heavily skewed toward one reporting cluster with only seventeen production-classified cases, are representative enough to support the narrow claim of inspectability and fail-closed behavior.

What would settle it

A larger or more balanced set of verification events from the same bounded architecture that includes non-inspectable or non-fail-closed completion decisions would disprove the narrow conclusion.

Figures

Figures reproduced from arXiv: 2605.17998 by Hai-Duong Nguyen, Xuan-The Tran.

**Figure 2.** Packet and decision flow for verify-gated completion. Solid arrows show the …

Original abstract

As multi-agent systems move from short interactions to tool-using workflows with specialized roles and persistent state, completion becomes a runtime-control problem rather than a purely generative one. This preprint studies verify-gated completion as an admission-control pattern for governed multi-agent runtimes: agents may propose completion, but a read-only verifier decides whether the claim is admitted. Ambiguous or weakly evidenced cases resolve fail-closed, while packetized state and event traces preserve an audit path. We examine one bounded reference implementation and ask what the released evidence can support about auditable, verify-gated completion. In the released verify-completed slice, the known-outcome invoked-event verify success share was 1,791/1,800 = 99.5%. This is an accounting measure over invoked verification events, not a task-completion, production-reliability, or benchmark-success rate. Task-level verify coverage is not computable; 1,762/1,801 rows came from one high-volume reporting cluster; and only 17 events were production-classified. A shadow Policy/Governance Verifier evaluation showed 1,526/1,548 = 98.58% rule agreement, 0/1,526 false-success among safe-to-proceed predictions, and blocked precision of 2/518 = 0.39%, so it remains advisory. The evidence supports a narrow conclusion: under observed conditions, a read-only verify gate plus packetized admission records made completion decisions inspectable and fail-closed. Claims about deployed operation, safety guarantees, outcome gains, task-level coverage, recovery effectiveness, or external validity remain outside scope.
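The ratios quoted in the abstract can be rechecked directly from the raw counts. The sketch below recomputes them; the dictionary keys are shorthand labels chosen here, not names from the paper, and the single-cluster row share is derived from the stated 1,762/1,801 count rather than quoted from the abstract.

```python
from fractions import Fraction

# Counts copied from the abstract; the labels are shorthand for this sketch.
COUNTS = {
    "verify_success_share": (1791, 1800),
    "shadow_rule_agreement": (1526, 1548),
    "shadow_false_success": (0, 1526),
    "shadow_blocked_precision": (2, 518),
    "single_cluster_row_share": (1762, 1801),
}


def percent(numerator: int, denominator: int, digits: int = 2) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(float(Fraction(numerator, denominator) * 100), digits)


shares = {name: percent(n, d) for name, (n, d) in COUNTS.items()}
for name, value in shares.items():
    print(f"{name}: {value}%")
```

All of these are event-level or row-level ratios. None has a task count in its denominator, which is why the abstract treats the 99.5% figure as an accounting measure rather than a task-completion rate.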

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a narrow case study of verify-gated completion that honestly reports inspectability in one bounded setup but rests on heavily skewed data with almost no production events.


The main thing to know is that the paper walks through a verify-gated admission pattern for multi-agent completion and releases some direct counts from their reference implementation. Under the conditions they observed, the read-only gate plus packetized records made decisions inspectable and fail-closed, with 99.5 percent verify success on invoked events and zero false successes in the safe predictions they checked. That is the extent of the supported claim, and they say so plainly.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a narrowly scoped case study of verify-gated completion as an admission-control pattern in a governed multi-agent runtime. Agents may propose task completion, but a read-only verifier decides admission; ambiguous or weakly evidenced cases resolve fail-closed. Packetized state and event traces are used to preserve audit paths. In the released verify-completed slice from a bounded reference implementation, the authors report an invoked-event verify success share of 1,791/1,800 = 99.5%, a shadow Policy/Governance Verifier agreement rate of 1,526/1,548 = 98.58% with zero false-successes among safe-to-proceed predictions, and blocked precision of 2/518 = 0.39%. They explicitly flag that task-level coverage is not computable, that 1,762/1,801 rows came from one high-volume cluster, and that only 17 events were production-classified, limiting the conclusion to inspectability and fail-closed behavior under the observed conditions while disclaiming broader claims about safety, reliability, or external validity.

Significance. If the narrow claim holds, the paper supplies a concrete, transparent architectural example of how a read-only verification gate plus packetized records can render completion decisions inspectable and fail-closed in multi-agent systems. The work's strengths include its explicit qualification of scope, direct reporting of raw counts rather than fitted parameters, and absence of post-hoc adjustments. The stress-test concern regarding representativeness of the skewed, low-production-event sample does not undermine the central claim because the manuscript consistently conditions its conclusion on 'under observed conditions' and does not extrapolate beyond the released slice.

major comments (1)

Abstract and results discussion: the statement that 'task-level verify coverage is not computable' is load-bearing for the narrow inspectability conclusion yet is asserted without a brief supporting argument or example (e.g., why the packetized traces do not permit even an approximate coverage estimate). Adding one sentence of justification would make the scope limitation fully rigorous.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The recommendation for minor revision is noted, and we agree that a brief justification for the non-computability of task-level verify coverage will make the scope limitation more rigorous. We address the single major comment below.


Referee: Abstract and results discussion: the statement that 'task-level verify coverage is not computable' is load-bearing for the narrow inspectability conclusion yet is asserted without a brief supporting argument or example (e.g., why the packetized traces do not permit even an approximate coverage estimate). Adding one sentence of justification would make the scope limitation fully rigorous.

Authors: We thank the referee for this observation. The packetized traces record only those completion proposals that reached the verify gate (i.e., invoked verification events) together with their outcomes and state. The released slice contains no information on the total population of tasks initiated in the runtime, on tasks whose completion was never proposed, or on tasks routed outside the verifier. Without a denominator for the full task set, neither an exact nor an approximate coverage ratio can be computed. We will add one sentence of justification to this effect in both the abstract and the results discussion. revision: yes
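The denominator argument in this response can be made concrete with a small sketch. It assumes a hypothetical trace layout (`task_id` and `outcome` fields) and an invented total of 8 tasks for contrast; neither comes from the released data. The point is that a trace of invoked events yields a numerator only, so coverage is undefined rather than zero when the task population is unknown.

```python
from typing import Optional, Set


def task_level_coverage(verified_task_ids: Set[str],
                        total_tasks: Optional[int]) -> Optional[float]:
    """Return verified/total, or None when the task population is unknown."""
    if total_tasks is None or total_tasks <= 0:
        return None  # no denominator: coverage is undefined, not zero
    return len(verified_task_ids) / total_tasks


# Hypothetical trace rows: only proposals that reached the gate are recorded.
invoked_events = [
    {"task_id": "t1", "outcome": "success"},
    {"task_id": "t1", "outcome": "success"},  # one task may produce several events
    {"task_id": "t2", "outcome": "failure"},
]
verified = {row["task_id"] for row in invoked_events}

unknown = task_level_coverage(verified, total_tasks=None)  # the released slice's situation
known = task_level_coverage(verified, total_tasks=8)       # 8 is an invented figure
```

The duplicate `t1` row also illustrates why event counts cannot stand in for task counts: three events here correspond to only two distinct tasks.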

Circularity Check

0 steps flagged

No significant circularity identified


The preprint is a bounded case study that reports direct observational counts and agreement percentages from a running reference implementation rather than deriving predictions or first-principles results. The reported figures (1,791/1,800 success share, 1,526/1,548 rule agreement) are presented explicitly as raw accounting measures over invoked events, with task coverage and external validity disclaimed. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the chain; the narrow claim about inspectability and fail-closed behavior under observed conditions is supported by the data without reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on standard assumptions about multi-agent state management and introduces the verify gate as an architectural component without independent falsifiable evidence outside the case study.

axioms (1)

Domain assumption: A read-only verifier can be implemented without side effects on agent state.
Required for the admission-control pattern to remain non-interfering.
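One way to approximate this assumption in Python is to hand the verifier a read-only view of agent state. The sketch below uses the standard-library `types.MappingProxyType`; the helper names `read_only_view` and `verifier_reads` and the `"done"` status value are hypothetical, and nothing here is taken from the paper's implementation.

```python
from types import MappingProxyType
from typing import Mapping


def read_only_view(agent_state: dict) -> Mapping:
    """Wrap agent state so the verifier can read keys but not assign them."""
    return MappingProxyType(agent_state)


def verifier_reads(view: Mapping, key: str) -> bool:
    """Hypothetical verifier check that only reads from the view."""
    return view.get(key) == "done"


agent_state = {"task-1": "done"}
view = read_only_view(agent_state)

try:
    view["task-1"] = "tampered"  # item assignment through the proxy is rejected
    write_blocked = False
except TypeError:
    write_blocked = True
```

The protection is shallow: a mutable value stored inside the mapping could still be changed through the view, so a stricter design would also copy or freeze nested values before passing them to the verifier.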

invented entities (1)

Verify gate (no independent evidence)
Purpose: To enforce admission control on agent-proposed completions.
New architectural component introduced for the governed runtime.

pith-pipeline@v0.9.0 · 5826 in / 1354 out tokens · 41661 ms · 2026-05-22T10:05:33.229121+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: five-plane admission-control design with explicit acceptance semantics, packetized state, read-only verification, fail-closed completion rules

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: $\mathrm{Accept}(c_t) = \mathbb{1}[\phi_1 \wedge \phi_2 \wedge \dots \wedge \phi_{11}]$
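The acceptance rule quoted above is an indicator over a conjunction of eleven predicates. The sketch below shows that shape with three hypothetical predicates standing in for the paper's eleven, whose definitions are not given on this page. Two design choices are assumptions made here: an undetermined predicate (returning `None`) counts as a failure, and an empty predicate list rejects rather than accepting vacuously.

```python
from typing import Callable, Optional, Sequence

Predicate = Callable[[dict], Optional[bool]]


def accept(claim: dict, predicates: Sequence[Predicate]) -> int:
    """Indicator of the conjunction: 1 only if every predicate is exactly True."""
    if not predicates:
        return 0  # assumption: an empty rule set rejects instead of accepting vacuously
    return int(all(p(claim) is True for p in predicates))


# Three hypothetical predicates stand in for the paper's eleven.
def has_evidence(claim: dict) -> Optional[bool]:
    return bool(claim.get("evidence"))


def within_scope(claim: dict) -> Optional[bool]:
    return claim.get("scope") == "assigned"


def policy_ok(claim: dict) -> Optional[bool]:
    return claim.get("policy")  # None when the policy check is undetermined


checks = [has_evidence, within_scope, policy_ok]
complete = {"evidence": ["log"], "scope": "assigned", "policy": True}
undetermined = {"evidence": ["log"], "scope": "assigned"}

admitted = accept(complete, checks)
blocked = accept(undetermined, checks)
```

Comparing each result with `is True` rather than relying on truthiness is what makes the conjunction fail-closed in this sketch: a missing or non-boolean result can never be mistaken for a pass.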

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 6 internal anchors

[1] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models”, in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

[2] G. Li, A. Hammoud, B. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for mind exploration of large language model society”, arXiv preprint arXiv:2303.17760, 2023. DOI: https://doi.org/10.48550/arXiv.2303.17760

[3] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next-gen LLM applications via multi-agent conversations”, in Conference on Language Modeling (COLM), 2024. [Online]. Available: https://openreview.net/forum?id=BAakY1hNKS

[4] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework”, in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=VtmBAGCN7o

[5] N. Shinn, F. Cassano, A. Berman, and A. Goyal, “Reflexion: Language agents with verbal reinforcement learning”, arXiv preprint arXiv:2303.11366, 2023. DOI: https://doi.org/10.48550/arXiv.2303.11366

[6] A. Madaan, N. Tandon, P. Clark, Y. Yang, M. Faruqui, P. Parikh, Y. Zhang, V. Nangia, S. Fried, and I. Celikyilmaz, “Self-Refine: Iterative refinement with self-feedback”, arXiv preprint arXiv:2303.17651, 2023. DOI: https://doi.org/10.48550/arXiv.2303.17651

[7] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering”, in Advances in Neural Information Processing Systems (NeurIPS), 2024. [Online]. Available: https://openreview.net/forum?id=mXpq6ut8J3

[8] S. Dhuliawala, M. Karpinska, P. R. Gribovskaya, V. Stoyanov, and A. Agha, “Chain-of-Verification reduces hallucination in large language models”, arXiv preprint arXiv:2309.11495, 2023. DOI: https://doi.org/10.48550/arXiv.2309.11495

[9] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Ganguli, T. Henighan, N. Joseph, B. Mann, A. Olsson, C. Olsson, B. Pursell, J. Skalse, E. Perez, and J. Kaplan, “Constitutional AI: Harmlessness from AI feedback”, arXiv preprint arXiv:2212.08073, 2022. DOI: https://doi.org/10.48550/arXiv.2212.08073

[10] C. Packer, S. Fang, K. Patel, Y. Lin, S. Wooders, and V. K. Kuleshov, “MemGPT: Towards LLMs as operating systems”, arXiv preprint arXiv:2310.08560, 2023. DOI: https://doi.org/10.48550/arXiv.2310.08560

[11] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, and others, “AgentBench: Evaluating LLMs as agents”, in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=zAdUB0aCTQ

[12] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: a benchmark for General AI Assistants”, in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=fibxvahvs3

[13] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena”, Advances in Neural Information Processing Systems, vol. 36, 2023, Datasets and Benchmarks Track. [Online]. Available: https://proceedings.neurips.cc/paper_file...

[14] R. Stureborg, D. Alikaniotis, and Y. Suhara, “Large language models are inconsistent and biased evaluators”, arXiv preprint arXiv:2405.01724, 2024. DOI: https://doi.org/10.48550/arXiv.2405.01724

[15] LangChain, “LangGraph overview”, online documentation. Available: https://docs.langchain.com/oss/python/langgraph/overview

[16] CrewAI, “Introduction”, online documentation. Available: https://docs.crewai.com/en/introduction

[17] OpenAI, “Guardrails and human review”, online documentation. Available: https://developers.openai.com/api/docs/guides/agents/guardrails-approvals

[18] OpenAI, “Running agents”, online documentation. Available: https://developers.openai.com/api/docs/guides/agents/running-agents

[19] Microsoft, “Semantic Kernel agent framework”, online documentation. Available: https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/

[20] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper, a large-scale distributed systems tracing infrastructure”, Google Research, technical report, 2010.

[21] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Boston, MA, USA: Addison-Wesley, 2010.

[22] N. Forsgren, J. Humble, and G. Kim, Accelerate: The Science of Lean Software and DevOps. Portland, OR, USA: IT Revolution, 2018.

[23] P. Weill and J. W. Ross, IT Governance: How Top Performers Manage IT Decision Rights for Superior Results. Boston, MA, USA: Harvard Business School Press, 2004.
