pith. machine review for the scientific record.

arxiv: 2604.09285 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: unknown

SAGE: A Service Agent Graph-guided Evaluation Benchmark

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · service agents · dialogue graphs · SOP compliance · adversarial testing · execution gap · customer service · multi-agent benchmark

The pith

A graph-guided benchmark shows LLMs classify customer intents correctly but fail to select the right subsequent actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SAGE, a benchmark that converts standard operating procedures into dynamic dialogue graphs to test how well AI service agents follow rules while responding to varied user behaviors. It moves beyond static tests by adding adversarial scenarios and using judge agents plus a rule engine to produce automatic, verifiable outcomes. Experiments with 27 LLMs across six industrial settings identify an execution gap: models classify customer intents accurately but fail to choose the correct subsequent actions. The models also show empathy resilience, staying polite even as their logic fails under adversarial pressure. This matters for building customer service automation that reliably follows company procedures rather than merely sounding reasonable.

Core claim

SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and comprehensive path coverage in agent-user interactions. Using an Adversarial Intent Taxonomy and modular extension mechanism, the benchmark generates deterministic ground truth via Judge Agents and a Rule Engine. Evaluations on 27 LLMs across 6 industrial scenarios demonstrate a significant Execution Gap, where models accurately classify intents but fail to derive correct subsequent actions, and an Empathy Resilience phenomenon, where polite conversational facades persist despite underlying logical failures under high adversarial intensity.

What carries the argument

Dynamic Dialogue Graphs derived from SOPs that support path verification and compliance checking, paired with Judge Agents and a Rule Engine for deterministic analysis.
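
As an editorial sketch only (not the paper's implementation, whose schema lives in the linked repository), the machinery can be pictured as follows: nodes are SOP stages, edges carry jump conditions over classified dialogue fields, and a rule engine checks whether the agent's chosen next step matches the edge the graph prescribes. Every identifier below (DialogueGraph, check_transition, ConsumptionType) is a hypothetical illustration.

```python
# Hypothetical sketch of a graph-guided SOP compliance check.
# Not the SAGE implementation; names and structure are assumptions.
from dataclasses import dataclass, field

@dataclass
class Edge:
    condition: dict  # field values that must hold, e.g. {"ConsumptionType": "Change"}
    target: str      # next stage prescribed when the condition holds

@dataclass
class DialogueGraph:
    edges: dict = field(default_factory=dict)  # stage -> list[Edge]

    def expected_next(self, stage: str, state: dict):
        """Return the stage the SOP prescribes given the current dialogue state."""
        for edge in self.edges.get(stage, []):
            if all(state.get(k) == v for k, v in edge.condition.items()):
                return edge.target
        return None

def check_transition(graph: DialogueGraph, stage: str, state: dict, agent_choice: str) -> bool:
    """Deterministic rule-engine check: did the agent pick the prescribed next stage?"""
    return graph.expected_next(stage, state) == agent_choice

# Toy case: the intent field is classified correctly, but the agent jumps to
# the wrong stage, which the rule engine flags as a compliance violation.
graph = DialogueGraph(edges={
    "intent_judgment": [
        Edge({"ConsumptionType": "Enquiry"}, "profile_check"),
        Edge({"ConsumptionType": "Change"}, "package_status"),
    ],
})
state = {"ConsumptionType": "Change"}  # intent classified correctly
print(check_transition(graph, "intent_judgment", state, "profile_check"))  # False: wrong action
```

A per-dialogue compliance score could then aggregate such per-transition checks along the traversed path.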

If this is right

  • Service agents need improved mechanisms to convert recognized intents into procedurally correct next steps.
  • Graph-based verification provides deterministic measurement of SOP adherence that single-metric tests miss.
  • The modular extension mechanism and automated data synthesis allow low-cost scaling to additional domains.
  • Adversarial intensity testing exposes logical weaknesses hidden by surface-level politeness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Incorporating explicit graph-path prediction into model training could reduce the execution gap.
  • Surface politeness that masks logical errors implies satisfaction surveys alone are unreliable for agent quality.
  • The graph formalization approach could transfer to evaluating AI in other procedure-heavy areas such as technical troubleshooting or compliance checks.
  • Adding consistency tracking across multiple turns might strengthen the benchmark's ability to catch cumulative failures.

Load-bearing premise

The Dynamic Dialogue Graphs constructed from unstructured SOPs accurately capture real-world logical compliance requirements and diverse user behaviors without significant bias or oversimplification.

What would settle it

Direct comparison of the same LLMs in live customer interactions versus the SAGE graph-constrained tests to check whether the execution gap and empathy resilience appear at similar rates outside the benchmark.

Figures

Figures reproduced from arXiv: 2604.09285 by Chaozheng Wang, Deyi Xiong, Jinpeng Wang, Ling Shi, Ning Gao, Wei He, Wei Zhang, Yujie Wang, Yuqin Dai, Ziyin Wang.

Figure 1
Figure 1: Service Agent SOP Example (Telecom Scenario). [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2
Figure 2: Overview of the SAGE evaluation framework, comprising coverage of all potential scenarios, Graph-Guided Multi-Agent Evaluation that rigorously assesses the resulting trajectories, and a Scenario Extension Mechanism that leverages user intents and SOPs for rapid adaptation to arbitrary scenarios through modular configuration.
Figure 3
Figure 3: Logic performance gap analysis across six scenarios (per-turn OA, Logic, and Chat scores and chat length for the AR, ER, LD, OE, PS, and TP scenarios). [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]
Figure 6
Figure 6: Correlation analysis between the Chat Quality Score [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]
Figure 7
Figure 7: Overall Average Score (OA) Heatmap. Rows represent Judge models, columns represent Agent models (Qwen2.5-7B/14B/32B and Qwen3-8B/14B/32B).
Figure 10
Figure 10: Standard Operating Procedures (SOPs) for the six industrial scenarios evaluated in SAGE, shown as directed graphs. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png]
Figure 10
Figure 10: Standard Operating Procedures (SOPs) for Six Industrial Scenarios. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png]
read the original abstract

The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant ``Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe ``Empathy Resilience'', a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SAGE, a multi-agent benchmark for LLM service agents that formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and path coverage. It adds an Adversarial Intent Taxonomy and modular Extension Mechanism for domain adaptation and automated data synthesis. Evaluation uses Judge Agents plus a Rule Engine to produce deterministic ground truth. Experiments on 27 LLMs across 6 industrial scenarios report an Execution Gap (strong intent classification but weak subsequent action derivation) and Empathy Resilience (polite facades persisting under high-adversarial logical failures).

Significance. If the graph construction faithfully encodes real SOP logic and user behavior diversity, SAGE offers a dynamic, dual-axis alternative to static benchmarks and could surface actionable limitations for deploying LLMs in regulated service domains. The automated synthesis and extension features are practical strengths for reproducibility and reuse.

major comments (3)
  1. [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.
  2. [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.
  3. [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.
minor comments (2)
  1. [Abstract] The anonymous code link is noted; ensure the final version includes a permanent, non-anonymous repository with exact graph-construction scripts and evaluation prompts.
  2. [Evaluation Framework] Clarify the precise definition of 'intent classification success' versus 'action derivation failure' with an example from one of the six scenarios.
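
As an editorial illustration of the distinction the comment above asks for (not the paper's scoring code), a per-turn check could record intent accuracy and action accuracy separately and flag the Execution Gap case where the first succeeds and the second fails; all identifiers and values below are hypothetical.

```python
# Hypothetical per-turn scoring that separates intent accuracy from action accuracy.
# Illustrative only; SAGE's own metrics may be defined differently.
def score_turn(gold_intent, gold_action, predicted_intent, predicted_action):
    intent_correct = predicted_intent == gold_intent
    action_correct = predicted_action == gold_action
    return {
        "intent_correct": intent_correct,
        "action_correct": action_correct,
        # The "Execution Gap" pattern: intent right, next action wrong.
        "execution_gap": intent_correct and not action_correct,
    }

# Example in the spirit of the telecom scenario: the model recognizes a
# package-change request but skips a check the SOP requires before ordering.
print(score_turn(
    gold_intent="Change", gold_action="check_contract_penalty",
    predicted_intent="Change", predicted_action="ChangeOrder",
))  # {'intent_correct': True, 'action_correct': False, 'execution_gap': True}
```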

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications on design choices. We will incorporate several suggested improvements in the revised version.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.

    Authors: We agree that validating the Dynamic Dialogue Graph construction is essential given its central role. The graphs were built by domain experts using a structured protocol (detailed in Section 3) that explicitly encodes branching paths, edge conditions, and SOP logic from the six industrial scenarios. In the revised manuscript, we will report inter-annotator agreement (Cohen's kappa) and quantitative coverage metrics for paths and conditions. Direct comparison to live service logs is not possible here due to privacy and proprietary constraints, but the Extension Mechanism is explicitly designed to let practitioners validate and augment graphs with their own logs. revision: partial

  2. Referee: [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.

    Authors: We concur that stronger statistical presentation is needed. Experiments used fixed random seeds, consistent prompting, and over 5,000 dialogues across the 27 models and 6 scenarios. The revised manuscript will add error bars (standard deviation over multiple runs), statistical significance tests (paired t-tests and Wilcoxon signed-rank for the Execution Gap; an illustrative sketch of such a paired test follows these responses), explicit per-scenario sample sizes in tables, and a dedicated paragraph on controls against post-hoc selection, including pre-specified metrics. revision: yes

  3. Referee: [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.

    Authors: The deterministic ground truth is an intentional design decision to guarantee objectivity and reproducibility, avoiding the variability of purely LLM-based judging. The Adversarial Intent Taxonomy was developed precisely to capture diverse and high-intensity user behaviors, and these are encoded as explicit branches during graph construction and automated synthesis. The modular Extension Mechanism directly supports adding missing adversarial branches or user-behavior variants. We will revise the text to more clearly explain this dependency and how the taxonomy plus extension features reduce propagation risk. revision: partial
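
For illustration only, the paired tests named in response 2 could be applied to per-scenario scores along these lines; the numbers are placeholders, not the paper's data, and the actual aggregation is the authors' to specify.

```python
# Hypothetical paired comparison of intent-classification vs. action-derivation
# accuracy across scenarios, using the tests named in the rebuttal.
# Placeholder values only; not results from the paper.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Assumed shape: one accuracy value per industrial scenario for a single model.
intent_acc = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92])
action_acc = np.array([0.74, 0.70, 0.78, 0.72, 0.69, 0.75])

t_stat, t_p = ttest_rel(intent_acc, action_acc)   # paired t-test
w_stat, w_p = wilcoxon(intent_acc, action_acc)    # Wilcoxon signed-rank test
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```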

standing simulated objections not resolved
  • Direct comparison of Dynamic Dialogue Graphs against live service logs due to data privacy and proprietary restrictions

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction or claims

full rationale

This paper proposes an empirical evaluation benchmark by formalizing SOPs into Dynamic Dialogue Graphs, introducing an Adversarial Intent Taxonomy, and running experiments on 27 external LLMs to observe Execution Gap and Empathy Resilience. No mathematical derivations, predictions, or first-principles results are present that reduce to fitted parameters or self-referential inputs by construction. The ground truth via Judge Agents and Rule Engine is defined externally to the tested models, with no load-bearing self-citations, uniqueness theorems, or ansatzes invoked. The framework is self-contained as a benchmark proposal without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumptions that SOPs can be losslessly converted to dynamic graphs and that Judge Agents plus a Rule Engine produce reliable ground truth. No free parameters are evident from the abstract; the two invented entities are evaluative constructs introduced by the benchmark rather than physical posits.

axioms (2)
  • domain assumption: Unstructured SOPs can be formalized into Dynamic Dialogue Graphs that enable precise verification of logical compliance and path coverage
    This is the foundational step of the SAGE method described in the abstract.
  • domain assumption: Judge Agents and Rule Engine can generate deterministic ground truth for agent interactions
    Required for the automated evaluation framework.
invented entities (2)
  • Dynamic Dialogue Graph (no independent evidence)
    purpose: Formal representation of SOPs for compliance checking and path coverage
    New construct introduced to structure the evaluation
  • Adversarial Intent Taxonomy (no independent evidence)
    purpose: To enable generation of tricky user behaviors for testing
    New taxonomy proposed for the benchmark

pith-pipeline@v0.9.0 · 5539 in / 1566 out tokens · 31857 ms · 2026-05-10T16:37:28.232834+00:00 · methodology

discussion (0)

