pith. machine review for the scientific record.

arxiv: 2604.19818 · v1 · submitted 2026-04-18 · 💻 cs.SE · cs.HC · cs.MA

Recognition: unknown

Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:59 UTC · model grok-4.3

classification 💻 cs.SE · cs.HC · cs.MA
keywords agentic AI · evaluation · governance · orchestration · runtime assurance · evidence synthesis · action evidence · closure gap

The pith

Agentic AI requires a linked four-layer framework because evaluation and governance alone cannot bind obligations to concrete actions or prove compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a bounded evidence synthesis across twenty-four recent sources to identify a governance-to-action closure gap in agentic AI systems. Evaluation measures whether outcomes are good and governance states what should be allowed, yet neither connects specific obligations to the actual steps an agent takes nor supplies a way to demonstrate compliance afterward. The authors introduce three linked artifacts: a four-layer framework that spans evaluation, governance, orchestration, and assurance; an ODTA test using observability, decidability, timeliness, and attestability to place controls at runtime; and a minimum action-evidence bundle for state-changing actions. They illustrate the artifacts with an enterprise procurement-agent scenario that draws existing findings together without new experiments. The work matters because agentic systems plan, use tools, maintain state, and produce external effects, so task success alone cannot establish trustworthiness.

Core claim

The central claim is that current approaches leave a governance-to-action closure gap: evaluation tells us whether outcomes were good, governance defines what should be allowed, but neither identifies where obligations bind to concrete actions or how compliance can later be proven. To close the gap the paper introduces a four-layer framework spanning evaluation, governance, orchestration, and assurance, an ODTA runtime-placement test based on observability, decidability, timeliness, and attestability, and a minimum action-evidence bundle for state-changing actions, shown through a worked enterprise procurement-agent example.
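
A minimal sketch of how the ODTA placement test might look in code, assuming a candidate control point is simply scored on the four criteria; the class, field, and example names below are illustrative assumptions, not definitions taken from the paper.

# Illustrative sketch only: hypothetical names, not the paper's API.
from dataclasses import dataclass

@dataclass
class ControlPoint:
    """A candidate location in an agent workflow where a runtime control could sit."""
    name: str
    observable: bool   # can the relevant action and its context be seen here?
    decidable: bool    # can a policy decision be computed here, before the action runs?
    timely: bool       # does the decision land before external effects occur?
    attestable: bool   # can the decision and its inputs be recorded verifiably?

def odta_placement_ok(point: ControlPoint) -> bool:
    """Place a runtime control only where all four ODTA criteria hold."""
    return point.observable and point.decidable and point.timely and point.attestable

# Example: a tool-call boundary in a procurement agent typically passes,
# while a weekly log review fails timeliness.
tool_call_gate = ControlPoint("purchase-order tool call", True, True, True, True)
weekly_audit = ControlPoint("weekly transcript review", True, True, False, True)
assert odta_placement_ok(tool_call_gate) and not odta_placement_ok(weekly_audit)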

What carries the argument

The four-layer framework that integrates evaluation, governance, orchestration, and assurance, using the ODTA test to decide runtime control placement and an action-evidence bundle to record state-changing actions.
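
A hedged sketch of what a minimum action-evidence bundle could contain for one state-changing action in the procurement scenario; the field names, helper function, and hashing choice are assumptions for illustration, since the paper's exact required fields are not reproduced here.

# Illustrative sketch only: field names and hashing choice are assumptions.
import hashlib
import json
from datetime import datetime, timezone

def action_evidence_bundle(agent_id: str, action: str, params: dict,
                           obligation_id: str, decision: str) -> dict:
    """Record one state-changing action together with the obligation it was
    checked against and enough context to demonstrate compliance later."""
    record = {
        "agent_id": agent_id,            # which agent acted
        "action": action,                # which tool call / state change
        "parameters": params,            # with what inputs
        "obligation_id": obligation_id,  # which governance rule applied
        "decision": decision,            # allow / deny / escalate
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # A content digest gives the record a tamper-evident handle for attestation.
    record["digest"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

bundle = action_evidence_bundle(
    agent_id="procurement-agent-01",
    action="create_purchase_order",
    params={"vendor": "ACME GmbH", "amount_eur": 4200},
    obligation_id="PO-approval-threshold",
    decision="allow",
)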

If this is right

  • Evaluation papers must expand to trajectory-level measurement of safety and robustness rather than task outcomes alone.
  • Governance frameworks must incorporate execution-time control logic instead of relying only on static obligation definitions.
  • Orchestration research must treat the control plane as the locus for policy mediation, identity, and telemetry.
  • Runtime governance cannot rely on prompts or static permissions to handle path-dependent agent behavior.
  • Action-safety studies must recognize that text alignment does not reliably transfer to tool-using actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The framework could be applied to existing agent platforms to test whether the ODTA criteria identify workable control placements and produce usable compliance evidence.
  • Similar closure gaps may appear in other autonomous systems that perform sequenced actions with external effects.
  • The action-evidence bundle suggests new logging standards may be needed for tool use in agents to support attestability.

Load-bearing premise

A manually coded synthesis of twenty-four recent sources is sufficient both to establish the governance-to-action closure gap and to show that the proposed four-layer framework, ODTA test, and action-evidence bundle can be implemented without additional empirical validation.

What would settle it

A concrete deployment of the ODTA test and action-evidence bundle in the procurement-agent scenario that fails to produce verifiable binding of obligations to actions or that cannot generate later compliance proofs.
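
A rough sketch of the post-hoc check such a deployment would have to pass, assuming bundles shaped like the earlier sketch: every recorded action must bind to a known obligation and carry an intact digest. All names here are hypothetical.

# Illustrative sketch only: assumes bundles shaped like the earlier example.
import hashlib
import json

def verify_compliance(bundles: list[dict], known_obligations: set[str]) -> bool:
    """Post-hoc check: every recorded action must bind to a known obligation
    and its digest must still match the recorded content."""
    for b in bundles:
        body = {k: v for k, v in b.items() if k != "digest"}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if b.get("digest") != recomputed:
            return False  # record altered after the fact
        if b.get("obligation_id") not in known_obligations:
            return False  # action never bound to an obligation in force
    return True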

Figures

Figures reproduced from arXiv: 2604.19818 by Christopher Koch, Joshua Andreas Wellbrock.

Figure 1. The Beyond Task Success framework. The governance-to-action closure … view at source ↗
read the original abstract

Agentic AI systems plan, use tools, maintain state, and act across multi-step workflows with external effects, meaning trustworthy deployment can no longer be judged by task completion alone. The current literature remains fragmented across benchmark-centered evaluation, standards-based governance, orchestration architectures, and runtime assurance mechanisms. This paper contributes a bounded evidence synthesis across a manually coded corpus of twenty-four recent sources. The core finding is a governance-to-action closure gap: evaluation tells us whether outcomes were good, governance defines what should be allowed, but neither identifies where obligations bind to concrete actions or how compliance can later be proven. To close that gap, the paper introduces three linked artifacts: (1) a four-layer framework spanning evaluation, governance, orchestration, and assurance; (2) an ODTA runtime-placement test based on observability, decidability, timeliness, and attestability; and (3) a minimum action-evidence bundle for state-changing actions. Across sources, evaluation papers identify safety, robustness, and trajectory-level measurement as open gaps; governance frameworks define obligations but omit execution-time control logic; orchestration research positions the control plane as the locus of policy mediation, identity, and telemetry; runtime-governance work shows path-dependent behavior cannot be governed through prompts or static permissions alone; and action-safety studies show text alignment does not reliably transfer to tool actions. A worked enterprise procurement-agent scenario illustrates how these artifacts consolidate existing evidence without introducing new experimental data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts a bounded evidence synthesis across a manually coded corpus of 24 recent sources on agentic AI. It identifies a governance-to-action closure gap: evaluation assesses whether outcomes were good and governance defines allowable actions, but neither specifies where obligations bind to concrete runtime actions nor how compliance can be attested afterward. To close the gap, the paper proposes three artifacts—a four-layer framework (evaluation, governance, orchestration, assurance), an ODTA runtime-placement test (observability, decidability, timeliness, attestability), and a minimum action-evidence bundle—illustrated via a single enterprise procurement-agent scenario without new experimental data.

Significance. If the synthesis is representative, the work usefully organizes fragmented threads in agentic AI research and names a concrete integration problem that existing benchmark, standards, and orchestration literatures leave open. The proposed artifacts supply a conceptual vocabulary for binding obligations to attestable controls. Credit is due for the explicit cross-source mapping of open gaps (safety/trajectory measurement, execution-time control logic, path-dependent behavior) and for avoiding new experiments while still producing actionable constructs.

major comments (3)
  1. [section introducing the three linked artifacts] The central claim that the four-layer framework, ODTA test, and action-evidence bundle close the governance-to-action gap is not demonstrated. No system from the 24 reviewed sources is re-analyzed with the ODTA criteria to show that previously unbindable obligations become concrete and attestable; the enterprise procurement scenario remains purely illustrative (section introducing the three linked artifacts).
  2. [evidence synthesis methodology section] The gap identification rests on qualitative coding of 24 sources without reported selection criteria, coding protocol, or inter-coder reliability metrics. This is load-bearing for the claim that the closure gap is both present and central across the literature (evidence synthesis methodology section).
  3. [results of the synthesis section] A table or figure summarizing how each of the 24 sources maps onto the four layers or fails the ODTA test is absent; without it, readers cannot verify the cross-source consolidation that underpins the framework (results of the synthesis section).
minor comments (2)
  1. [ODTA test definition] The ODTA acronym and four criteria are defined in prose but would benefit from a compact table with one literature-derived example per criterion to aid readability.
  2. [action-evidence bundle description] The minimum action-evidence bundle is described at a high level; adding a short pseudocode or JSON schema example would clarify the required fields for state-changing actions.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and precise comments, which identify key opportunities to strengthen the transparency and demonstrative power of our bounded evidence synthesis. We respond to each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [section introducing the three linked artifacts] The central claim that the four-layer framework, ODTA test, and action-evidence bundle close the governance-to-action gap is not demonstrated. No system from the 24 reviewed sources is re-analyzed with the ODTA criteria to show that previously unbindable obligations become concrete and attestable; the enterprise procurement scenario remains purely illustrative (section introducing the three linked artifacts).

    Authors: We acknowledge that the manuscript does not re-analyze the 24 sources using the ODTA criteria to empirically demonstrate gap closure. The contribution is a synthesis that extracts the governance-to-action gap from the corpus and proposes the three artifacts as a conceptual response, with the procurement scenario serving only as illustration. A full re-application of ODTA to each source would require new empirical work beyond the paper's bounded synthesis scope. In revision we will add explicit cross-references in the artifacts section, showing how each component directly addresses specific gaps already coded from representative sources (e.g., trajectory measurement from evaluation papers, execution-time control from governance papers), thereby making the linkage more traceable without new data. revision: partial

  2. Referee: [evidence synthesis methodology section] The gap identification rests on qualitative coding of 24 sources without reported selection criteria, coding protocol, or inter-coder reliability metrics. This is load-bearing for the claim that the closure gap is both present and central across the literature (evidence synthesis methodology section).

    Authors: The referee is correct that the methodology section lacks sufficient detail on source selection and coding. We will expand it to describe the inclusion criteria (recency, topical focus on agentic AI evaluation/governance/orchestration/assurance, and peer-reviewed or high-impact preprints from 2023–2024), the qualitative coding protocol used to extract the four-layer mapping and ODTA-relevant gaps, and the collaborative process employed for consistency. Although formal inter-coder reliability statistics were not computed, we will report the steps taken to reduce subjectivity. revision: yes

  3. Referee: [results of the synthesis section] Table or figure summarizing how each of the 24 sources maps onto the four layers or fails the ODTA test is absent; without it, readers cannot verify the cross-source consolidation that underpins the framework (results of the synthesis section).

    Authors: We agree that a consolidated mapping is necessary for verifiability. We will add a table to the results section that lists each of the 24 sources, indicates their primary alignment with the four layers, and notes the ODTA criteria or gaps identified during coding. This will allow readers to directly inspect the evidence base for the framework. revision: yes

standing simulated objections not resolved
  • Re-analyzing the 24 sources with the newly proposed ODTA criteria to empirically prove gap closure, as this would constitute new empirical evaluation work outside the scope of the bounded evidence synthesis.

Circularity Check

0 steps flagged

No significant circularity: interpretive synthesis with independent proposals

full rationale

The paper performs a bounded evidence synthesis of twenty-four external sources to identify the governance-to-action closure gap, then proposes three new artifacts (four-layer framework, ODTA test, minimum action-evidence bundle) as interpretive responses to the synthesized findings. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims do not reduce by construction to the inputs; the gap is extracted from cross-source patterns while the artifacts are constructed proposals without self-definitional loops or renaming of known results. This is a standard non-circular synthesis structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a conceptual synthesis; it introduces no free parameters, mathematical axioms, or postulated physical entities. The framework layers and ODTA criteria are definitional constructs rather than derived quantities.

axioms (1)
  • domain assumption Agentic AI systems produce state-changing actions with external effects that require runtime governance beyond prompt-level alignment.
    Stated in the abstract and introduction as the premise motivating the closure gap.

pith-pipeline@v0.9.0 · 5566 in / 1343 out tokens · 35960 ms · 2026-05-10T05:59:38.675578+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification

    cs.CY · 2026-04 · unverdicted · novelty 4.0

    DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.

Reference graph

Works this paper leans on

24 extracted references · 16 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Inside the AI Index: 12 Takeaways from the 2026 Report,

    Stanford HAI, “Inside the AI Index: 12 Takeaways from the 2026 Report,” Apr. 2026. [Online]. Available: https://hai.stanford.edu/news/inside-the-ai-index-12-takeaways-from-the-2026-report

  2. [2]

    Survey on Evaluation of LLM-based Agents

    A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, and M. Shmueli-Scheuer, “Survey on Evaluation of LLM-based Agents,” arXiv preprint arXiv:2503.16416, 2025. [Online]. Available: https://arxiv.org/abs/2503.16416

  3. [3]

    Establishing Best Practices for Building Rigorous Agentic Benchmarks

    Y. Zhu et al., “Establishing Best Practices for Building Rigorous Agentic Benchmarks,” arXiv preprint arXiv:2507.02825, 2025. [Online]. Available: https://arxiv.org/abs/2507.02825

  4. [4]

    General agent evaluation

    E. Bandel et al., “General Agent Evaluation,” arXiv preprint arXiv:2602.22953, 2026. [Online]. Available: https://arxiv.org/abs/2602.22953

  5. [5]

    MultiAgentBench : Evaluating the collaboration and competition of LLM agents

    K. Zhu, H. Du, Z. Hong, X. Yang, S. Guo, Z. Wang, Z. Wang, C. Qian, X. Tang, H. Ji, and J. You, “MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents,” in Proc. 63rd Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Vienna, Austria, Jul. 2025, pp. 8580–8622, doi: 10.18653/v1/2025.acl-long.421

  6. [6]

    Agent-SafetyBench: Evaluating the Safety of LLM Agents

    Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang, “Agent-SafetyBench: Evaluating the Safety of LLM Agents,” arXiv preprint arXiv:2412.14470, 2024. [Online]. Available: https://arxiv.org/abs/2412.14470

  7. [8]
  8. [9]

    ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations,

    Y. Xie, Y. Yuan, W. Wang, F. Mo, J. Guo, and P. He, “ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations,” in Proc. 2025 Conf. Empirical Methods in Natural Language Processing, Suzhou, China, Nov. 2025, pp. 14135–14156, doi: 10.18653/v1/2025.emnlp-main.714

  9. [10]

    Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

    A. Cartagena and A. Teixeira, “Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents,” arXiv preprint arXiv:2602.16943, 2026. [Online]. Available: https://arxiv.org/abs/2602.16943

  10. [11]

    WebGuard: Building a Generalizable Guardrail for Web Agents,

    B. Zheng et al., “WebGuard: Building a Generalizable Guardrail for Web Agents,” arXiv preprint arXiv:2507.14293, 2025. [Online]. Available: https://arxiv.org/abs/2507.14293

  11. [12]

    ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning,

    Z. Chen, M. Kang, and B. Li, “ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning,” in Proc. 42nd Int. Conf. Machine Learning, PMLR, vol. 267, 2025, pp. 8313–8344. [Online]. Available: https://proceedings.mlr.press/v267/chen25ae.html

  12. [13]

    AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

    D. Liu et al., “AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security,” arXiv preprint arXiv:2601.18491, 2026. [Online]. Available: https://arxiv.org/abs/2601.18491

  13. [14]

    Proof-of-Guardrail in AI Agents and What (Not) to Trust from It,

    X. Jin, M. Duan, Q. Lin, A. Chan, Z. Chen, J. Du, and X. Ren, “Proof-of-Guardrail in AI Agents and What (Not) to Trust from It,” arXiv preprint arXiv:2603.05786, 2026. [Online]. Available: https://arxiv.org/abs/2603.05786

  14. [15]

    The orchestration of multi-agent systems: Architectures, protocols, and enterprise adoption

    A. Adimulam, R. Gupta, and S. Kumar, “The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption,” arXiv preprint arXiv:2601.13671, 2026. [Online]. Available: https://arxiv.org/abs/2601.13671

  15. [16]

    Runtime Governance for AI Agents: Policies on Paths

    M. Kaptein, V.-J. Khan, and A. Podstavnychy, “Runtime Governance for AI Agents: Policies on Paths,” arXiv preprint arXiv:2603.16586, 2026. [Online]. Available: https://arxiv.org/abs/2603.16586

  16. [17]

    A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance,

    C. Paduraru, P.-L. Bouruc, and A. Stefanescu, “A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance,” arXiv preprint arXiv:2603.18096, 2026. [Online]. Available: https://arxiv.org/abs/2603.18096

  17. [18]

    ISO/IEC 42001:2023 - Information technology - Artificial intelligence - Management system,

    ISO, “ISO/IEC 42001:2023 - Information technology - Artificial intelligence - Management system,” 2023. [Online]. Available: https://www.iso.org/standard/42001

  18. [19]

    ISO/IEC 23894:2023 - Information technology - Artificial intelligence - Guidance on risk management,

    ISO, “ISO/IEC 23894:2023 - Information technology - Artificial intelligence - Guidance on risk management,” 2023. [Online]. Available: https://www.iso.org/standard/77304.html

  19. [20]

    ISO/IEC 42005:2025 - Information technology - Artificial intelligence - AI system impact assessment,

    ISO, “ISO/IEC 42005:2025 - Information technology - Artificial intelligence - AI system impact assessment,” 2025. [Online]. Available: https://www.iso.org/standard/42005

  20. [21]

    ISO/IEC 5338:2023 - Information technology - Artificial intelligence - AI system life cycle processes,

    ISO, “ISO/IEC 5338:2023 - Information technology - Artificial intelligence - AI system life cycle processes,” 2023. [Online]. Available: https://www.iso.org/standard/81118.html

  21. [22]

    ISO/IEC 38507:2022 - Information technology - Governance of IT - Governance implications of the use of artificial intelligence by organizations,

    ISO, “ISO/IEC 38507:2022 - Information technology - Governance of IT - Governance implications of the use of artificial intelligence by organizations,” 2022. [Online]. Available: https://www.iso.org/standard/56641.html

  22. [23]

    Artificial Intelligence Risk Management Framework (AI RMF 1.0)

    NIST, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” NIST AI 100-1, Jan. 2023. doi: 10.6028/NIST.AI.100-1

  23. [24]

    Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile

    NIST, “Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile,” NIST AI 600-1, Jul. 2024. doi: 10.6028/NIST.AI.600-1

  24. [25]

    AI Agent Standards Initiative,

    NIST, “AI Agent Standards Initiative,” Feb. 2026. [Online]. Available: https://www.nist.gov/caisi/ai-agent-standards-initiative