SAGE: A Service Agent Graph-guided Evaluation Benchmark
Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3
The pith
A graph-guided benchmark shows LLMs classify customer intents correctly but fail to select the right subsequent actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and comprehensive path coverage in agent-user interactions. Using an Adversarial Intent Taxonomy and modular extension mechanism, the benchmark generates deterministic ground truth via Judge Agents and a Rule Engine. Evaluations on 27 LLMs across 6 industrial scenarios demonstrate a significant Execution Gap, where models accurately classify intents but fail to derive correct subsequent actions, and an Empathy Resilience phenomenon, where polite conversational facades persist despite underlying logical failures under high adversarial intensity.
What carries the argument
Dynamic Dialogue Graphs derived from SOPs that support path verification and compliance checking, paired with Judge Agents and a Rule Engine for deterministic analysis.
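The core machinery can be sketched in a few lines. The stage names and jump conditions below follow the package-change SOP excerpted in the paper's appendix; the dictionary encoding and traversal function are an illustrative sketch, not the authors' implementation.

```python
# Dynamic Dialogue Graph as a transition table: each stage maps the current
# dialogue state to the next stage or a terminal ACTION. Stage names and jump
# logic follow the package-change SOP in the paper's appendix; the encoding
# itself is illustrative, not the authors' code.
GRAPH = {
    "stage2": lambda s: {"Enquiry": "stage3", "Change": "stage4",
                         "Cancel": "stage5"}[s["ConsumptionType"]],
    "stage3": lambda s: "stage6",
    "stage4": lambda s: ("stage5" if s["PackageStatus"] == "Contracted"
                         else "ACTION=ChangeOrder"),
    "stage5": lambda s: ("ACTION=ChangeOrder" if s["Penalty"] == 0
                         else "stage7"),
    "stage6": lambda s: ("stage4" if s["ApplicationTendency"] == "Agree"
                         else "ACTION=GoodBye"),
    "stage7": lambda s: ("ACTION=ChangeOrder" if s["EmotionTag"] == "Calm"
                         else "ACTION=TransHuman"),
}

def ground_truth_path(state, start="stage2"):
    """Rule-engine style traversal: derive the deterministic correct path."""
    path, node = [start], start
    while not node.startswith("ACTION="):
        node = GRAPH[node](state)
        path.append(node)
    return path

state = {"ConsumptionType": "Change", "PackageStatus": "Contracted",
         "Penalty": 100, "EmotionTag": "Discontent"}
print(ground_truth_path(state))
# ['stage2', 'stage4', 'stage5', 'stage7', 'ACTION=TransHuman']
```

A Judge Agent's turn-by-turn labels can then be checked against this deterministic path, which is what makes the reported ground truth reproducible rather than judge-dependent.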
If this is right
- Service agents need improved mechanisms to convert recognized intents into procedurally correct next steps.
- Graph-based verification provides deterministic measurement of SOP adherence that single-metric tests miss.
- The modular extension mechanism and automated data synthesis allow low-cost scaling to additional domains.
- Adversarial intensity testing exposes logical weaknesses hidden by surface-level politeness.
Where Pith is reading between the lines
- Incorporating explicit graph-path prediction into model training could reduce the execution gap.
- Surface politeness that masks logical errors implies satisfaction surveys alone are unreliable for agent quality.
- The graph formalization approach could transfer to evaluating AI in other procedure-heavy areas such as technical troubleshooting or compliance checks.
- Adding consistency tracking across multiple turns might strengthen the benchmark's ability to catch cumulative failures.
Load-bearing premise
The Dynamic Dialogue Graphs constructed from unstructured SOPs accurately capture real-world logical compliance requirements and diverse user behaviors without significant bias or oversimplification.
What would settle it
Direct comparison of the same LLMs in live customer interactions versus the SAGE graph-constrained tests to check whether the execution gap and empathy resilience appear at similar rates outside the benchmark.
Original abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap" where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SAGE, a multi-agent benchmark for LLM service agents that formalizes unstructured SOPs into Dynamic Dialogue Graphs to enable precise verification of logical compliance and path coverage. It adds an Adversarial Intent Taxonomy and modular Extension Mechanism for domain adaptation and automated data synthesis. Evaluation uses Judge Agents plus a Rule Engine to produce deterministic ground truth. Experiments on 27 LLMs across 6 industrial scenarios report an Execution Gap (strong intent classification but weak subsequent action derivation) and Empathy Resilience (polite facades persisting under high-adversarial logical failures).
Significance. If the graph construction faithfully encodes real SOP logic and user behavior diversity, SAGE offers a dynamic, dual-axis alternative to static benchmarks and could surface actionable limitations for deploying LLMs in regulated service domains. The automated synthesis and extension features are practical strengths for reproducibility and reuse.
major comments (3)
- [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.
- [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.
- [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.
minor comments (2)
- [Abstract] The anonymous code link is noted; ensure the final version includes a permanent, non-anonymous repository with exact graph-construction scripts and evaluation prompts.
- [Evaluation Framework] Clarify the precise definition of 'intent classification success' versus 'action derivation failure' with an example from one of the six scenarios.
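The distinction requested in the second minor comment can be made concrete with a toy scorer. The per-turn records below are invented for illustration; only the Enquiry/Change/Cancel intents echo the paper's telecom scenario.

```python
# Hypothetical per-turn records contrasting intent classification with action
# derivation. Field names and values are invented placeholders; the pattern
# "intent right, next action wrong" is the Execution Gap the paper reports.
turns = [
    {"gold_intent": "Change",  "pred_intent": "Change",
     "gold_action": "stage4",  "pred_action": "stage4"},
    {"gold_intent": "Cancel",  "pred_intent": "Cancel",
     "gold_action": "stage5",  "pred_action": "stage3"},  # intent right, action wrong
    {"gold_intent": "Enquiry", "pred_intent": "Enquiry",
     "gold_action": "stage3",  "pred_action": "stage6"},  # intent right, action wrong
]

intent_acc = sum(t["pred_intent"] == t["gold_intent"] for t in turns) / len(turns)
action_acc = sum(t["pred_action"] == t["gold_action"] for t in turns) / len(turns)
execution_gap = intent_acc - action_acc
print(intent_acc, round(action_acc, 2), round(execution_gap, 2))
# 1.0 0.33 0.67
```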
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are warranted and providing clarifications on design choices. We will incorporate several suggested improvements in the revised version.
Point-by-point responses
Referee: [Section 3] Section 3 (Dynamic Dialogue Graph construction): the process of converting unstructured SOPs into graphs is load-bearing for both the Execution Gap and Empathy Resilience claims, yet no validation is reported (e.g., inter-annotator agreement, coverage metrics for branching paths/edge conditions, or comparison against live service logs). Without this, observed failures may reflect graph incompleteness rather than model limitations.
Authors: We agree that validating the Dynamic Dialogue Graph construction is essential given its central role. The graphs were built by domain experts using a structured protocol (detailed in Section 3) that explicitly encodes branching paths, edge conditions, and SOP logic from the six industrial scenarios. In the revised manuscript, we will report inter-annotator agreement (Cohen's kappa) and quantitative coverage metrics for paths and conditions. Direct comparison to live service logs is not possible here due to privacy and proprietary constraints, but the Extension Mechanism is explicitly designed to let practitioners validate and augment graphs with their own logs. revision: partial
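The promised agreement statistic needs nothing beyond the standard library. The sketch below uses invented annotator labels over hypothetical graph edges; it is a generic Cohen's kappa, not the authors' validation code.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)        # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling the same eight graph edges (invented labels).
ann1 = ["branch", "branch", "leaf", "branch", "leaf", "leaf", "branch", "leaf"]
ann2 = ["branch", "branch", "leaf", "leaf",   "leaf", "leaf", "branch", "leaf"]
print(round(cohens_kappa(ann1, ann2), 3))
# 0.75
```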
Referee: [Experimental Results] Experimental section (results on 27 LLMs): the headline quantitative observations lack statistical tests, error bars, per-scenario sample sizes, or explicit controls against post-hoc selection, making it impossible to assess whether the Execution Gap and Empathy Resilience are robust or sensitive to implementation choices.
Authors: We concur that stronger statistical presentation is needed. Experiments used fixed random seeds, consistent prompting, and over 5,000 dialogues across the 27 models and 6 scenarios. The revised manuscript will add error bars (standard deviation over multiple runs), statistical significance tests (paired t-tests and Wilcoxon signed-rank for the Execution Gap), explicit per-scenario sample sizes in tables, and a dedicated paragraph on controls against post-hoc selection, including pre-specified metrics. revision: yes
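As a sanity check on the promised statistics, a paired t statistic over matched per-scenario scores is a few lines of standard library code. The accuracy numbers below are invented placeholders, not the paper's reported results.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t statistic over matched per-scenario scores of one model."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Invented per-scenario intent vs. action accuracies across the 6 scenarios;
# placeholders for illustration only.
intent_acc = [0.92, 0.88, 0.95, 0.90, 0.87, 0.93]
action_acc = [0.61, 0.55, 0.70, 0.58, 0.52, 0.66]
t = paired_t(intent_acc, action_acc)
print(round(t, 2))  # a large positive t indicates a consistent Execution Gap
```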
Referee: [Judge Agent and Rule Engine] Judge Agent + Rule Engine (ground-truth generation): because deterministic labels inherit directly from the Dynamic Dialogue Graphs, any oversimplification in SOP formalization (e.g., missing adversarial branches or user-behavior diversity) propagates to all reported gaps; the modular Extension Mechanism does not mitigate this dependency.
Authors: The deterministic ground truth is an intentional design decision to guarantee objectivity and reproducibility, avoiding the variability of purely LLM-based judging. The Adversarial Intent Taxonomy was developed precisely to capture diverse and high-intensity user behaviors, and these are encoded as explicit branches during graph construction and automated synthesis. The modular Extension Mechanism directly supports adding missing adversarial branches or user-behavior variants. We will revise the text to more clearly explain this dependency and how the taxonomy plus extension features reduce propagation risk. revision: partial
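The claimed extension path can be illustrated with a minimal registration helper. The `extend` function and its validation rule are hypothetical; only the stage-7 emotion branches echo the paper's appendix SOP.

```python
# Sketch of a modular extension: registering a new adversarial branch in a
# graph encoded as {stage: {condition_value: next_node}}. The extend() API is
# hypothetical; stage-7 branches follow the paper's appendix SOP.
graph = {
    "stage7": {"Calm": "ACTION=ChangeOrder", "Discontent": "ACTION=TransHuman"},
}

def extend(graph, stage, condition, target):
    """Add a branch, rejecting edges that would dangle on undefined stages."""
    if not target.startswith("ACTION=") and target not in graph:
        raise ValueError(f"dangling edge: {target}")
    graph.setdefault(stage, {})[condition] = target
    return graph

# A new adversarial emotion tag routes straight to a human agent.
extend(graph, "stage7", "Abusive", "ACTION=TransHuman")
print(sorted(graph["stage7"]))
# ['Abusive', 'Calm', 'Discontent']
```

Because every added branch is checked against the existing node set, extensions cannot silently break the determinism of the Rule Engine's traversal.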
- Acknowledged limitation: direct comparison of Dynamic Dialogue Graphs against live service logs was not performed, due to data privacy and proprietary restrictions.
Circularity Check
No significant circularity in empirical benchmark construction or claims
full rationale
This paper proposes an empirical evaluation benchmark by formalizing SOPs into Dynamic Dialogue Graphs, introducing an Adversarial Intent Taxonomy, and running experiments on 27 external LLMs to observe Execution Gap and Empathy Resilience. No mathematical derivations, predictions, or first-principles results are present that reduce to fitted parameters or self-referential inputs by construction. The ground truth via Judge Agents and Rule Engine is defined externally to the tested models, with no load-bearing self-citations, uniqueness theorems, or ansatzes invoked. The framework is self-contained as a benchmark proposal without circular reductions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Unstructured SOPs can be formalized into Dynamic Dialogue Graphs that enable precise verification of logical compliance and path coverage
- domain assumption: Judge Agents and a Rule Engine can generate deterministic ground truth for agent interactions
invented entities (2)
- Dynamic Dialogue Graph (no independent evidence)
- Adversarial Intent Taxonomy (no independent evidence)