pith. machine review for the scientific record.

arxiv: 2604.09285 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: unknown

SAGE: A Service Agent Graph-guided Evaluation Benchmark

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · service agents · dialogue graphs · SOP compliance · adversarial testing · execution gap · customer service · multi-agent benchmark

The pith

A graph-guided benchmark shows LLMs classify customer intents correctly but fail to select the right subsequent actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SAGE, a benchmark that converts standard operating procedures into dynamic dialogue graphs to test how well AI service agents follow rules while responding to varied user behaviors. It moves beyond static tests by adding adversarial scenarios and using judge agents plus a rule engine to produce automatic, verifiable outcomes. Experiments with 27 LLMs across six industrial settings identify an execution gap: models classify customer intents accurately but fail to choose the correct subsequent actions. The models also show empathy resilience, staying polite even as their logic fails under adversarial pressure. This matters for building customer service automation that reliably follows company procedures rather than merely sounding reasonable.

Core claim

SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and comprehensive path coverage in agent-user interactions. Using an Adversarial Intent Taxonomy and modular extension mechanism, the benchmark generates deterministic ground truth via Judge Agents and a Rule Engine. Evaluations on 27 LLMs across 6 industrial scenarios demonstrate a significant Execution Gap, where models accurately classify intents but fail to derive correct subsequent actions, and an Empathy Resilience phenomenon, where polite conversational facades persist despite underlying logical failures under high adversarial intensity.

What carries the argument

Dynamic Dialogue Graphs derived from SOPs that support path verification and compliance checking, paired with Judge Agents and a Rule Engine for deterministic analysis.
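
As an editorial sketch only (not the paper's implementation, whose schema lives in the linked repository), the machinery can be pictured as follows: nodes are SOP stages, edges carry jump conditions over classified dialogue fields, and a rule engine checks whether the agent's chosen next step matches the edge the graph prescribes. Every identifier below (DialogueGraph, check_transition, ConsumptionType) is a hypothetical illustration.

```python
# Hypothetical sketch of a graph-guided SOP compliance check.
# Not the SAGE implementation; names and structure are assumptions.
from dataclasses import dataclass, field

@dataclass
class Edge:
    condition: dict  # field values that must hold, e.g. {"ConsumptionType": "Change"}
    target: str      # next stage prescribed when the condition holds

@dataclass
class DialogueGraph:
    edges: dict = field(default_factory=dict)  # stage -> list[Edge]

    def expected_next(self, stage: str, state: dict):
        """Return the stage the SOP prescribes given the current dialogue state."""
        for edge in self.edges.get(stage, []):
            if all(state.get(k) == v for k, v in edge.condition.items()):
                return edge.target
        return None

def check_transition(graph: DialogueGraph, stage: str, state: dict, agent_choice: str) -> bool:
    """Deterministic rule-engine check: did the agent pick the prescribed next stage?"""
    return graph.expected_next(stage, state) == agent_choice

# Toy case: the intent field is classified correctly, but the agent jumps to
# the wrong stage, which the rule engine flags as a compliance violation.
graph = DialogueGraph(edges={
    "intent_judgment": [
        Edge({"ConsumptionType": "Enquiry"}, "profile_check"),
        Edge({"ConsumptionType": "Change"}, "package_status"),
    ],
})
state = {"ConsumptionType": "Change"}  # intent classified correctly
print(check_transition(graph, "intent_judgment", state, "profile_check"))  # False: wrong action
```

A per-dialogue compliance score could then aggregate such per-transition checks along the traversed path.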

If this is right

  • Service agents need improved mechanisms to convert recognized intents into procedurally correct next steps.
  • Graph-based verification provides deterministic measurement of SOP adherence that single-metric tests miss.
  • The modular extension mechanism and automated data synthesis allow low-cost scaling to additional domains.
  • Adversarial intensity testing exposes logical weaknesses hidden by surface-level politeness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Incorporating explicit graph-path prediction into model training could reduce the execution gap.
  • Surface politeness that masks logical errors implies satisfaction surveys alone are unreliable for agent quality.
  • The graph formalization approach could transfer to evaluating AI in other procedure-heavy areas such as technical troubleshooting or compliance checks.
  • Adding consistency tracking across multiple turns might strengthen the benchmark's ability to catch cumulative failures.

Load-bearing premise

The Dynamic Dialogue Graphs constructed from unstructured SOPs accurately capture real-world logical compliance requirements and diverse user behaviors without significant bias or oversimplification.

What would settle it

Direct comparison of the same LLMs in live customer interactions versus the SAGE graph-constrained tests to check whether the execution gap and empathy resilience appear at similar rates outside the benchmark.

Figures

Figures reproduced from arXiv: 2604.09285 by Chaozheng Wang, Deyi Xiong, Jinpeng Wang, Ling Shi, Ning Gao, Wei He, Wei Zhang, Yujie Wang, Yuqin Dai, Ziyin Wang.

Figure 1
Figure 1: Service Agent SOP Example (Telecom Scenario). [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2
Figure 2: Overview of the SAGE evaluation framework, comprising coverage of all potential scenarios, Graph-Guided Multi-Agent Evaluation that rigorously assesses the resulting trajectories, and a Scenario Extension Mechanism that leverages user intents and SOPs for rapid adaptation to arbitrary scenarios through modular configuration.
Figure 3
Figure 3: Logic performance gap analysis across six scenarios (per-turn OA, Logic, and Chat scores and chat length for the AR, ER, LD, OE, PS, and TP scenarios). [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]
Figure 6
Figure 6: Correlation analysis between the Chat Quality Score [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]
Figure 7
Figure 7: Overall Average Score (OA) Heatmap. Rows represent Judge models, columns represent Agent models (Qwen2.5-7B/14B/32B and Qwen3-8B/14B/32B).
Figure 10
Figure 10: Standard Operating Procedures (SOPs) for the six industrial scenarios evaluated in SAGE, shown as directed graphs. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png]
Figure 10
Figure 10: Standard Operating Procedures (SOPs) for Six Industrial Scenarios. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png]
read the original abstract

The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant ``Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe ``Empathy Resilience'', a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SAGE, a multi-agent benchmark for LLM service agents that formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and path coverage. It adds an Adversarial Intent Taxonomy and modular Extension Mechanism for domain adaptation and automated data synthesis. Evaluation uses Judge Agents plus a Rule Engine to produce deterministic ground truth. Experiments on 27 LLMs across 6 industrial scenarios report an Execution Gap (strong intent classification but weak subsequent action derivation) and Empathy Resilience (polite facades persisting under high-adversarial logical failures).

Significance. If the graph construction faithfully encodes real SOP logic and user behavior diversity, SAGE offers a dynamic, dual-axis alternative to static benchmarks and could surface actionable limitations for deploying LLMs in regulated service domains. The automated synthesis and extension features are practical strengths for reproducibility and reuse.

major comments (3)
  1. [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.
  2. [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.
  3. [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.
minor comments (2)
  1. [Abstract] The anonymous code link is noted; ensure the final version includes a permanent, non-anonymous repository with exact graph-construction scripts and evaluation prompts.
  2. [Evaluation Framework] Clarify the precise definition of 'intent classification success' versus 'action derivation failure' with an example from one of the six scenarios.
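
As an editorial illustration of the distinction the comment above asks for (not the paper's scoring code), a per-turn check could record intent accuracy and action accuracy separately and flag the Execution Gap case where the first succeeds and the second fails; all identifiers and values below are hypothetical.

```python
# Hypothetical per-turn scoring that separates intent accuracy from action accuracy.
# Illustrative only; SAGE's own metrics may be defined differently.
def score_turn(gold_intent, gold_action, predicted_intent, predicted_action):
    intent_correct = predicted_intent == gold_intent
    action_correct = predicted_action == gold_action
    return {
        "intent_correct": intent_correct,
        "action_correct": action_correct,
        # The "Execution Gap" pattern: intent right, next action wrong.
        "execution_gap": intent_correct and not action_correct,
    }

# Example in the spirit of the telecom scenario: the model recognizes a
# package-change request but skips a check the SOP requires before ordering.
print(score_turn(
    gold_intent="Change", gold_action="check_contract_penalty",
    predicted_intent="Change", predicted_action="ChangeOrder",
))  # {'intent_correct': True, 'action_correct': False, 'execution_gap': True}
```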

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications on design choices. We will incorporate several suggested improvements in the revised version.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.

    Authors: We agree that validating the Dynamic Dialogue Graph construction is essential given its central role. The graphs were built by domain experts using a structured protocol (detailed in Section 3) that explicitly encodes branching paths, edge conditions, and SOP logic from the six industrial scenarios. In the revised manuscript, we will report inter-annotator agreement (Cohen's kappa) and quantitative coverage metrics for paths and conditions. Direct comparison to live service logs is not possible here due to privacy and proprietary constraints, but the Extension Mechanism is explicitly designed to let practitioners validate and augment graphs with their own logs. revision: partial

  2. Referee: [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.

    Authors: We concur that stronger statistical presentation is needed. Experiments used fixed random seeds, consistent prompting, and over 5,000 dialogues across the 27 models and 6 scenarios. The revised manuscript will add error bars (standard deviation over multiple runs), statistical significance tests (paired t-tests and Wilcoxon signed-rank for the Execution Gap; an illustrative sketch of such a paired test follows these responses), explicit per-scenario sample sizes in tables, and a dedicated paragraph on controls against post-hoc selection, including pre-specified metrics. revision: yes

  3. Referee: [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.

    Authors: The deterministic ground truth is an intentional design decision to guarantee objectivity and reproducibility, avoiding the variability of purely LLM-based judging. The Adversarial Intent Taxonomy was developed precisely to capture diverse and high-intensity user behaviors, and these are encoded as explicit branches during graph construction and automated synthesis. The modular Extension Mechanism directly supports adding missing adversarial branches or user-behavior variants. We will revise the text to more clearly explain this dependency and how the taxonomy plus extension features reduce propagation risk. revision: partial
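
For illustration only, the paired tests named in response 2 could be applied to per-scenario scores along these lines; the numbers are placeholders, not the paper's data, and the actual aggregation is the authors' to specify.

```python
# Hypothetical paired comparison of intent-classification vs. action-derivation
# accuracy across scenarios, using the tests named in the rebuttal.
# Placeholder values only; not results from the paper.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Assumed shape: one accuracy value per industrial scenario for a single model.
intent_acc = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92])
action_acc = np.array([0.74, 0.70, 0.78, 0.72, 0.69, 0.75])

t_stat, t_p = ttest_rel(intent_acc, action_acc)   # paired t-test
w_stat, w_p = wilcoxon(intent_acc, action_acc)    # Wilcoxon signed-rank test
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```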

standing simulated objections not resolved
  • Direct comparison of Dynamic Dialogue Graphs against live service logs due to data privacy and proprietary restrictions

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction or claims

full rationale

This paper proposes an empirical evaluation benchmark by formalizing SOPs into Dynamic Dialogue Graphs, introducing an Adversarial Intent Taxonomy, and running experiments on 27 external LLMs to observe Execution Gap and Empathy Resilience. No mathematical derivations, predictions, or first-principles results are present that reduce to fitted parameters or self-referential inputs by construction. The ground truth via Judge Agents and Rule Engine is defined externally to the tested models, with no load-bearing self-citations, uniqueness theorems, or ansatzes invoked. The framework is self-contained as a benchmark proposal without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumptions that SOPs can be losslessly converted to dynamic graphs and that Judge Agents plus a Rule Engine produce reliable ground truth. No free parameters are evident from the abstract; the two invented entities are evaluative constructs introduced by the benchmark rather than physical posits.

axioms (2)
  • domain assumption: Unstructured SOPs can be formalized into Dynamic Dialogue Graphs that enable precise verification of logical compliance and path coverage
    This is the foundational step of the SAGE method described in the abstract.
  • domain assumption: Judge Agents and Rule Engine can generate deterministic ground truth for agent interactions
    Required for the automated evaluation framework.
invented entities (2)
  • Dynamic Dialogue Graph (no independent evidence)
    purpose: Formal representation of SOPs for compliance checking and path coverage
    New construct introduced to structure the evaluation
  • Adversarial Intent Taxonomy (no independent evidence)
    purpose: To enable generation of tricky user behaviors for testing
    New taxonomy proposed for the benchmark

pith-pipeline@v0.9.0 · 5539 in / 1566 out tokens · 31857 ms · 2026-05-10T16:37:28.232834+00:00 · methodology

discussion (0)

