Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3
The pith
A sequence-fidelity metric shows that ten of eighteen tested LLM payment agents skip a required confirmation step even when final outcomes succeed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Agentic Success Rate metric compares observed agent execution sequences against an expected sequence at the transition level, decomposing performance into transition recall and precision. Applied to the Hierarchical Multi-Agent System for Payments, it reveals that ten of eighteen LLMs systematically skip a confirmation checkpoint during checkout despite achieving high task success rates, while eight models enforce the checkpoint without deviation. Models such as GPT-4.1 show hidden shortcuts despite perfect scores on prior metrics, and prompt refinements guided by the new diagnostics raise task success rates by up to 93.8 percentage points.
What carries the argument
Agentic Success Rate (ASR), a trajectory-fidelity metric that scores how closely an agent's observed sequence of actions matches the expected sequence by measuring transition-level recall and precision.
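The paper's exact formulas are not reproduced in this review, but the idea of transition-level recall and precision can be sketched as a multiset overlap between consecutive-step pairs of the observed and expected traces. The step names and the multiset treatment below are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def transitions(seq):
    """Consecutive-step pairs (transitions) of an execution trace."""
    return list(zip(seq, seq[1:]))

def asr_components(observed, expected):
    """Transition Recall and Transition Precision as multiset overlap
    between observed and expected transitions (a sketch of ASR's parts)."""
    obs = Counter(transitions(observed))
    exp = Counter(transitions(expected))
    overlap = sum((obs & exp).values())  # min-count intersection
    recall = overlap / sum(exp.values()) if exp else 1.0
    precision = overlap / sum(obs.values()) if obs else 1.0
    return recall, precision

# Hypothetical checkout workflow: the observed trace skips "confirm".
expected = ["cart", "confirm", "checkout", "receipt"]
observed = ["cart", "checkout", "receipt"]
recall, precision = asr_components(observed, expected)
# recall = 1/3 (only checkout->receipt matches), precision = 1/2
```

On this toy trace a skipped checkpoint shows up immediately: two of three expected transitions are missing, even though the final "receipt" state (and hence a task-success metric) looks fine.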
If this is right
- Models can reach perfect task success and handoff scores while still omitting mandatory safety or compliance steps.
- Diagnostics from sequence comparison can direct targeted prompt changes that improve final outcomes by nearly 94 percentage points in some cases.
- In regulated domains, evaluation must move beyond final results to verify that every required transition occurs in order.
- Eight of the eighteen models already maintain perfect sequence fidelity, showing that consistent workflow adherence is achievable with current models.
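Verifying that every required transition occurs in order reduces to an ordered-subsequence check over the trace. A minimal sketch, with step names assumed for illustration:

```python
def follows_required_order(observed, required):
    """True iff every required step appears in the observed trace,
    in the required relative order (extra steps are allowed)."""
    it = iter(observed)
    # `step in it` consumes the iterator up to the match, so later
    # required steps must occur after earlier ones.
    return all(step in it for step in required)

# A trace with an extra logging step still passes; one that skips
# the confirmation checkpoint fails.
ok = follows_required_order(["cart", "log", "confirm", "checkout"],
                            ["cart", "confirm", "checkout"])
bad = follows_required_order(["cart", "checkout"],
                             ["cart", "confirm", "checkout"])
```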
Where Pith is reading between the lines
- Workflow-fidelity checks could become standard for any multi-step agent task where order affects legal or safety outcomes, such as medical or financial processes.
- Deterministic routing guards combined with the metric might reduce reliance on prompt engineering alone to enforce critical steps.
- The same sequence-comparison approach might expose similar hidden shortcuts in non-payment domains where agents handle chained decisions.
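The deterministic-guard idea above can be sketched as a thin wrapper that overrides the LLM router's proposed next step whenever a mandatory checkpoint is missing from the trace. The step names and the guard's interface are hypothetical, not taken from the paper:

```python
def guarded_route(state, proposed_step):
    """Deterministic routing guard: refuse to enter checkout unless the
    confirmation checkpoint already appears in the execution trace,
    regardless of what the LLM router proposed."""
    if proposed_step == "checkout" and "confirm" not in state["trace"]:
        return "confirm"  # force the mandatory checkpoint first
    return proposed_step

# The router tries to jump straight to checkout; the guard redirects.
state = {"trace": ["cart"]}
next_step = guarded_route(state, "checkout")  # -> "confirm"
```

Because the guard is a deterministic check outside the model, it enforces the checkpoint even when prompt engineering fails, which is the combination the review suggests.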
Load-bearing premise
That a single objectively correct execution sequence exists for every payment workflow and that any deviation from it is always undesirable rather than potentially adaptive.
What would settle it
A controlled test in which an agent that skips the confirmation checkpoint completes payments faster or with lower error rates than agents that follow the full sequence, without increasing fraud or compliance violations.
read the original abstract
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed agent execution sequences against a predefined expected workflow at the transition level (decomposed into Transition Recall and Transition Precision). Applied to a Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR identifies that 10 models systematically skip a confirmation checkpoint (invisible to TSR and HF1), while 8 enforce it; prompt refinements guided by ASR yield TSR gains up to +93.8 percentage points. The work argues that trajectory-level evaluation is essential in regulated domains.
Significance. If the central claims hold, the work provides a concrete demonstration that outcome-only metrics (TSR, HF1) can mask workflow deviations in payment agents, with a large-scale empirical evaluation (18 models, 90k instances) showing both diagnostic value and actionable improvements via ASR-guided refinements. This strengthens the case for sequence-aware metrics in safety-critical agent deployments.
major comments (2)
- [Abstract] The abstract states that ASR 'compares observed and expected agent execution sequences' and reports that 10 of 18 models 'systematically skip a confirmation checkpoint,' yet supplies no description of how the expected sequence is constructed, validated (e.g., by domain experts or regulators), or shown to be the unique correct path rather than one of several permissible adaptive workflows. This definition is load-bearing for all reported deviations and the claim that they are undesirable.
- [Abstract] The abstract reports concrete findings on 90,000 instances (e.g., 10/18 models skip checkpoint, TSR gains up to +93.8 pp) but provides no error bars, confidence intervals, or statistical tests for the model-wise differences or improvement claims. Without these, it is impossible to determine whether the observed patterns exceed sampling variability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the Agentic Success Rate (ASR) metric. The comments highlight important areas for improving clarity around workflow definition and statistical reporting. We respond to each major comment below and indicate planned revisions.
read point-by-point responses
Referee: [Abstract] The abstract states that ASR 'compares observed and expected agent execution sequences' and reports that 10 of 18 models 'systematically skip a confirmation checkpoint,' yet supplies no description of how the expected sequence is constructed, validated (e.g., by domain experts or regulators), or shown to be the unique correct path rather than one of several permissible adaptive workflows. This definition is load-bearing for all reported deviations and the claim that they are undesirable.
Authors: We agree that the construction and validation of the expected workflow is foundational. Section 3.2 of the manuscript details that the expected sequence is derived directly from regulatory requirements for payment systems (e.g., mandatory confirmation steps under standard financial compliance protocols) and cross-validated against the HMASP system design with input from domain experts in regulated transaction processing. The abstract omits this for brevity, but we will revise it to include a concise statement on the regulatory basis. Regarding uniqueness, the paper positions the confirmation checkpoint as a non-negotiable compliance requirement in this domain to mitigate risks such as unauthorized payments; while adaptive workflows may exist in less regulated settings, deviations here are fidelity violations by design. We will add clarifying language in the abstract and expand the discussion in Section 4 to distinguish regulated vs. permissible paths. revision: yes
Referee: [Abstract] The abstract reports concrete findings on 90,000 instances (e.g., 10/18 models skip checkpoint, TSR gains up to +93.8 pp) but provides no error bars, confidence intervals, or statistical tests for the model-wise differences or improvement claims. Without these, it is impossible to determine whether the observed patterns exceed sampling variability.
Authors: The evaluation scale (90,000 instances, 5,000 per model per task category) yields highly consistent patterns, with the ten-model skipping behavior replicated across independent runs. The full manuscript (Section 5) reports per-model standard deviations and notes the uniformity of results; we acknowledge, however, that the abstract omits explicit statistical qualifiers due to length limits. We will revise the abstract to state that differences are consistent across large samples and to reference the statistical robustness shown in the main text (including paired comparisons). We will also add error bars to the primary result figures and briefly report confidence intervals for the stated gains. revision: yes
Circularity Check
No circularity: ASR is defined directly against an externally specified expected workflow.
full rationale
The paper defines ASR explicitly as a comparison of observed agent execution sequences against a predefined expected workflow (an external input, not derived from the data or results). No equations, self-citations, or derivations reduce the reported findings (e.g., models skipping checkpoints or TSR gains from refinements) back to fitted parameters or the inputs by construction. The metric and empirical observations are self-contained applications of this definition, with the expected sequence serving as an independent reference rather than a self-referential quantity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Payment workflows possess a single, well-defined expected sequence of agent transitions that should be followed exactly.
Reference graph
Works this paper leans on
- [1] van der Aalst, W.M.P., Adriansyah, A., van Dongen, B.F.: Replaying history on process models for conformance checking and performance analysis. WIREs Data Mining and Knowledge Discovery 2(2), 182–192 (2012)
- [2] Carmona, J., van Dongen, B., Solti, A., Weidlich, M.: Conformance Checking: Relating Processes and Models. Springer (2018)
- [3] Cemri, M., Pan, M.Z., Yang, S., Agrawal, L.A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et al.: Why do multi-agent LLM systems fail? (2025)
- [4] Chen, Z., Bhatia, A., Zhang, S.L., Choi, S., Saar-Tsechansky, M., Ghassemi, M.: Standard benchmarks fail – auditing LLM agents in finance must prioritize risk. arXiv preprint arXiv:2502.15865 (2025)
- [5] Chua, J.K., Huang, D., Wang, Z.: A novel hierarchical multi-agent system for payments using LLMs. In: Proceedings of the 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), main conference (2026)
- [6] Dahiphale, D., Madiraju, N., Lin, J., Karve, R., Agrawal, M., Modwal, A., Balakrishnan, R., Shah, S., Kaushal, G., Mandawat, P., et al.: Enhancing trust and safety in digital payments: An LLM-powered approach. In: 2024 IEEE International Conference on Big Data (BigData), pp. 4854–4863. IEEE (2024)
- [7] Guan, Y., Wang, D., Chu, Z., Wang, S., Ni, F., Song, R., Zhuang, C.: Intelligent agents with LLM-based process automation. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5018–5027. KDD '24, Association for Computing Machinery (2024). https://doi.org/10.1145/3637528.3671646
- [8] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S.K.S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., Schmidhuber, J.: MetaGPT: Meta programming for a multi-agent collaborative framework. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=VtmBAGCN7o
- [9] Kapoor, S., Narayanan, A., et al.: AI agents that matter. arXiv preprint arXiv:2407.01502 (2024)
- [10] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al.: AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688 (2023)
- [11] Ma, C., Zhang, J., Long, Z., Xie, Z., Chen, J., Zheng, J., et al.: AgentBoard: An analytical evaluation board of multi-turn LLM agents. In: Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track (2024)
- [12] Mastercard: Mastercard agent pay – powering the next frontier of commerce. https://www.mastercard.com/global/en/business/artificial-intelligence/mastercard-agent-pay.html (2025), accessed: 2025-11-15
- [13] Mohammadi, M., Rahmani, H.A., Nguyen, T.T., Ranasinghe, T., Macdonald, C., Ounis, I.: Evaluation and benchmarking of LLM agents: A survey. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2025). https://doi.org/10.1145/3711896.3736570
- [14] PCI Security Standards Council: AI principles: Securing the use of AI in payment environments. https://blog.pcisecuritystandards.org/ai-principles-securing-the-use-of-ai-in-payment-environments (2025), accessed: 2025-11-15
- [15] Qiao, S., Fang, R., Qiu, Z., Liu, X., Cheng, J., Chen, H., Zhang, N.: Benchmarking agentic workflow generation. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=vunPXOFmoi
- [16] Schumacher, K., Roberts, R., Giebel, K.: The agentic commerce opportunity: How AI agents are ushering in a new era for consumers and merchants. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-agentic-commerce-opportunity-how-ai-agents-are-ushering-in-a-new-era-for-consumers-and-merchants (2025), accessed: 2025-11-15
- [17] Visa: Enabling AI agents to buy securely and seamlessly. https://corporate.visa.com/en/products/intelligent-commerce.html (2025), accessed: 2025-11-15
- [18] Xiao, Y., Zhao, M.K., Zhou, R., Boen, J.: TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138 (2024)
- [19] Yao, S., Narang, R., Bhatt, A., Chen, W., Bisk, Y.: τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045 (2024)
- [20] Yu, Y., Yao, Z., Li, H., Deng, Z., Jiang, Y., Cao, Y., Chen, Z., Suchow, J., Cui, Z., Liu, R., et al.: FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems 37, 137010–137045 (2024)