Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3
The pith
A sequence-fidelity metric shows that ten of eighteen tested LLM payment agents skip a required confirmation step even when final outcomes succeed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Agentic Success Rate metric compares observed agent execution sequences against an expected sequence at the transition level, decomposing performance into transition recall and precision. Applied to the Hierarchical Multi-Agent System for Payments, it reveals that ten of eighteen LLMs systematically skip a confirmation checkpoint during checkout despite achieving high task success rates, while eight models enforce the checkpoint without deviation. Models such as GPT-4.1 show hidden shortcuts despite perfect scores on prior metrics, and prompt refinements guided by the new diagnostics raise task success rates by up to 93.8 percentage points.
What carries the argument
Agentic Success Rate (ASR), a trajectory-fidelity metric that scores how closely an agent's observed sequence of actions matches the expected sequence by measuring transition-level recall and precision.
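The paper's exact formulas are not reproduced in this review, but the idea of transition-level recall and precision can be sketched as a multiset overlap between consecutive-step pairs of the observed and expected traces. The step names and the multiset treatment below are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def transitions(seq):
    """Consecutive-step pairs (transitions) of an execution trace."""
    return list(zip(seq, seq[1:]))

def asr_components(observed, expected):
    """Transition Recall and Transition Precision as multiset overlap
    between observed and expected transitions (a sketch of ASR's parts)."""
    obs = Counter(transitions(observed))
    exp = Counter(transitions(expected))
    overlap = sum((obs & exp).values())  # min-count intersection
    recall = overlap / sum(exp.values()) if exp else 1.0
    precision = overlap / sum(obs.values()) if obs else 1.0
    return recall, precision

# Hypothetical checkout workflow: the observed trace skips "confirm".
expected = ["cart", "confirm", "checkout", "receipt"]
observed = ["cart", "checkout", "receipt"]
recall, precision = asr_components(observed, expected)
# recall = 1/3 (only checkout->receipt matches), precision = 1/2
```

On this toy trace a skipped checkpoint shows up immediately: two of three expected transitions are missing, even though the final "receipt" state (and hence a task-success metric) looks fine.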
If this is right
- Models can reach perfect task success and handoff scores while still omitting mandatory safety or compliance steps.
- Diagnostics from sequence comparison can direct targeted prompt changes that improve final outcomes by nearly 94 percentage points in some cases.
- In regulated domains, evaluation must move beyond final results to verify that every required transition occurs in order.
- Eight of the eighteen models already maintain perfect sequence fidelity, showing that consistent workflow adherence is achievable with current models.
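Verifying that every required transition occurs in order reduces to an ordered-subsequence check over the trace. A minimal sketch, with step names assumed for illustration:

```python
def follows_required_order(observed, required):
    """True iff every required step appears in the observed trace,
    in the required relative order (extra steps are allowed)."""
    it = iter(observed)
    # `step in it` consumes the iterator up to the match, so later
    # required steps must occur after earlier ones.
    return all(step in it for step in required)

# A trace with an extra logging step still passes; one that skips
# the confirmation checkpoint fails.
ok = follows_required_order(["cart", "log", "confirm", "checkout"],
                            ["cart", "confirm", "checkout"])
bad = follows_required_order(["cart", "checkout"],
                             ["cart", "confirm", "checkout"])
```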
Where Pith is reading between the lines
- Workflow-fidelity checks could become standard for any multi-step agent task where order affects legal or safety outcomes, such as medical or financial processes.
- Deterministic routing guards combined with the metric might reduce reliance on prompt engineering alone to enforce critical steps.
- The same sequence-comparison approach might expose similar hidden shortcuts in non-payment domains where agents handle chained decisions.
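The deterministic-guard idea above can be sketched as a thin wrapper that overrides the LLM router's proposed next step whenever a mandatory checkpoint is missing from the trace. The step names and the guard's interface are hypothetical, not taken from the paper:

```python
def guarded_route(state, proposed_step):
    """Deterministic routing guard: refuse to enter checkout unless the
    confirmation checkpoint already appears in the execution trace,
    regardless of what the LLM router proposed."""
    if proposed_step == "checkout" and "confirm" not in state["trace"]:
        return "confirm"  # force the mandatory checkpoint first
    return proposed_step

# The router tries to jump straight to checkout; the guard redirects.
state = {"trace": ["cart"]}
next_step = guarded_route(state, "checkout")  # -> "confirm"
```

Because the guard is a deterministic check outside the model, it enforces the checkpoint even when prompt engineering fails, which is the combination the review suggests.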
Load-bearing premise
That a single objectively correct execution sequence exists for every payment workflow and that any deviation from it is always undesirable rather than potentially adaptive.
What would settle it
A controlled test in which an agent that skips the confirmation checkpoint completes payments faster or with lower error rates than agents that follow the full sequence, without increasing fraud or compliance violations.
read the original abstract
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed agent execution sequences against a predefined expected workflow at the transition level (decomposed into Transition Recall and Transition Precision). Applied to a Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR identifies that 10 models systematically skip a confirmation checkpoint (invisible to TSR and HF1), while 8 enforce it; prompt refinements guided by ASR yield TSR gains up to +93.8 percentage points. The work argues that trajectory-level evaluation is essential in regulated domains.
Significance. If the central claims hold, the work provides a concrete demonstration that outcome-only metrics (TSR, HF1) can mask workflow deviations in payment agents, with a large-scale empirical evaluation (18 models, 90k instances) showing both diagnostic value and actionable improvements via ASR-guided refinements. This strengthens the case for sequence-aware metrics in safety-critical agent deployments.
major comments (2)
- [Abstract] The abstract states that ASR 'compares observed and expected agent execution sequences' and reports that 10 of 18 models 'systematically skip a confirmation checkpoint,' yet supplies no description of how the expected sequence is constructed, validated (e.g., by domain experts or regulators), or shown to be the unique correct path rather than one of several permissible adaptive workflows. This definition is load-bearing for all reported deviations and the claim that they are undesirable.
- [Abstract] The abstract reports concrete findings on 90,000 instances (e.g., 10/18 models skip checkpoint, TSR gains up to +93.8 pp) but provides no error bars, confidence intervals, or statistical tests for the model-wise differences or improvement claims. Without these, it is impossible to determine whether the observed patterns exceed sampling variability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the Agentic Success Rate (ASR) metric. The comments highlight important areas for improving clarity around workflow definition and statistical reporting. We respond to each major comment below and indicate planned revisions.
read point-by-point responses
Referee: [Abstract] The abstract states that ASR 'compares observed and expected agent execution sequences' and reports that 10 of 18 models 'systematically skip a confirmation checkpoint,' yet supplies no description of how the expected sequence is constructed, validated (e.g., by domain experts or regulators), or shown to be the unique correct path rather than one of several permissible adaptive workflows. This definition is load-bearing for all reported deviations and the claim that they are undesirable.
Authors: We agree that the construction and validation of the expected workflow is foundational. Section 3.2 of the manuscript details that the expected sequence is derived directly from regulatory requirements for payment systems (e.g., mandatory confirmation steps under standard financial compliance protocols) and cross-validated against the HMASP system design with input from domain experts in regulated transaction processing. The abstract omits this for brevity, but we will revise it to include a concise statement on the regulatory basis. Regarding uniqueness, the paper positions the confirmation checkpoint as a non-negotiable compliance requirement in this domain to mitigate risks such as unauthorized payments; while adaptive workflows may exist in less regulated settings, deviations here are fidelity violations by design. We will add clarifying language in the abstract and expand the discussion in Section 4 to distinguish regulated vs. permissible paths. revision: yes
Referee: [Abstract] The abstract reports concrete findings on 90,000 instances (e.g., 10/18 models skip checkpoint, TSR gains up to +93.8 pp) but provides no error bars, confidence intervals, or statistical tests for the model-wise differences or improvement claims. Without these, it is impossible to determine whether the observed patterns exceed sampling variability.
Authors: The evaluation scale (90,000 instances, 5,000 per model per task category) yields highly consistent patterns, with the ten-model skipping behavior replicated across independent runs. The full manuscript (Section 5) reports per-model standard deviations and notes the uniformity of results; we acknowledge, however, that the abstract omits explicit statistical qualifiers due to length limits. We will revise the abstract to state that differences are consistent across large samples and to reference the statistical robustness shown in the main text (including paired comparisons). We will also add error bars to the primary result figures and briefly report confidence intervals for the stated gains. revision: yes
Circularity Check
No circularity: ASR is defined directly against an externally specified expected workflow.
full rationale
The paper defines ASR explicitly as a comparison of observed agent execution sequences against a predefined expected workflow (an external input, not derived from the data or results). No equations, self-citations, or derivations reduce the reported findings (e.g., models skipping checkpoints or TSR gains from refinements) back to fitted parameters or the inputs by construction. The metric and empirical observations are self-contained applications of this definition, with the expected sequence serving as an independent reference rather than a self-referential quantity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Payment workflows possess a single, well-defined expected sequence of agent transitions that should be followed exactly.
Reference graph
Works this paper leans on
- [1] van der Aalst, W.M.P., Adriansyah, A., van Dongen, B.F.: Replaying history on process models for conformance checking and performance analysis. WIREs Data Mining and Knowledge Discovery 2(2), 182–192 (2012)
- [2] Carmona, J., van Dongen, B., Solti, A., Weidlich, M.: Conformance Checking: Relating Processes and Models. Springer (2018)
- [3] Cemri, M., Pan, M.Z., Yang, S., Agrawal, L.A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et al.: Why do multi-agent LLM systems fail? (2025)
- [4] Chen, Z., Bhatia, A., Zhang, S.L., Choi, S., Saar-Tsechansky, M., Ghassemi, M.: Standard benchmarks fail – auditing LLM agents in finance must prioritize risk. arXiv preprint arXiv:2502.15865 (2025)
- [5] Chua, J.K., Huang, D., Wang, Z.: A novel hierarchical multi-agent system for payments using LLMs. In: Proceedings of the 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), main conference (2026)
- [6] Dahiphale, D., Madiraju, N., Lin, J., Karve, R., Agrawal, M., Modwal, A., Balakrishnan, R., Shah, S., Kaushal, G., Mandawat, P., et al.: Enhancing trust and safety in digital payments: An LLM-powered approach. In: 2024 IEEE International Conference on Big Data (BigData), pp. 4854–4863. IEEE (2024)
- [7] Guan, Y., Wang, D., Chu, Z., Wang, S., Ni, F., Song, R., Zhuang, C.: Intelligent agents with LLM-based process automation. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5018–5027. KDD '24, Association for Computing Machinery (2024). https://doi.org/10.1145/3637528.3671646
- [8] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S.K.S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., Schmidhuber, J.: MetaGPT: Meta programming for a multi-agent collaborative framework. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=VtmBAGCN7o
- [9] Kapoor, S., Narayanan, A., et al.: AI agents that matter. arXiv preprint arXiv:2407.01502 (2024)
- [10] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al.: AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688 (2023)
- [11] Ma, C., Zhang, J., Long, Z., Xie, Z., Chen, J., Zheng, J., et al.: AgentBoard: An analytical evaluation board of multi-turn LLM agents. In: Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track (2024)
- [12] Mastercard: Mastercard agent pay – powering the next frontier of commerce. https://www.mastercard.com/global/en/business/artificial-intelligence/mastercard-agent-pay.html (2025), accessed: 2025-11-15
- [13] Mohammadi, M., Rahmani, H.A., Nguyen, T.T., Ranasinghe, T., Macdonald, C., Ounis, I.: Evaluation and benchmarking of LLM agents: A survey. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2025). https://doi.org/10.1145/3711896.3736570
- [14] PCI Security Standards Council: AI principles: Securing the use of AI in payment environments. https://blog.pcisecuritystandards.org/ai-principles-securing-the-use-of-ai-in-payment-environments (2025), accessed: 2025-11-15
- [15] Qiao, S., Fang, R., Qiu, Z., Liu, X., Cheng, J., Chen, H., Zhang, N.: Benchmarking agentic workflow generation. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=vunPXOFmoi
- [16] Schumacher, K., Roberts, R., Giebel, K.: The agentic commerce opportunity: How AI agents are ushering in a new era for consumers and merchants. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-agentic-commerce-opportunity-how-ai-agents-are-ushering-in-a-new-era-for-consumers-and-merchants (2025), accessed: 2025-11-15
- [17] Visa: Enabling AI agents to buy securely and seamlessly. https://corporate.visa.com/en/products/intelligent-commerce.html (2025), accessed: 2025-11-15
- [18] Xiao, Y., Zhao, M.K., Zhou, R., Boen, J.: TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138 (2024)
- [19] Yao, S., Narang, R., Bhatt, A., Chen, W., Bisk, Y.: τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045 (2024)
- [20] Yu, Y., Yao, Z., Li, H., Deng, Z., Jiang, Y., Cao, Y., Chen, Z., Suchow, J., Cui, Z., Liu, R., et al.: FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems 37, 137010–137045 (2024)