pith. machine review for the scientific record. sign in

arxiv: 2604.27586 · v1 · submitted 2026-04-30 · 💻 cs.AI · cs.LG

Recognition: unknown

Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Authors on Pith no claims yet

Pith reviewed 2026-05-07 09:37 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords multi-agent systemsinformation contaminationtrace analysisagent workflowsperturbation injectionuncertainty propagationGAIA benchmarkverification guardrails
0
0 comments X

The pith

Agent workflows can diverge substantially in traces yet still yield correct answers when inputs are contaminated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how uncertainty from contaminated heterogeneous artifacts affects multi-agent workflows by injecting controlled perturbations into derived representations and fully logging execution. It tracks changes in plans, tool invocations, and intermediate states across 614 paired runs on 32 GAIA tasks using three language models. The central finding is a decoupling: large divergences in workflow structure can still produce right answers, while structurally similar paths can produce wrong ones. This reveals three contamination types with distinct control-flow patterns and explains why common guardrails miss issues. The work matters because it shows that correctness checks alone do not guarantee reliable agent behavior under real input uncertainty.

Core claim

Treating uncertainty as a controlled variable, the authors inject structured perturbations into artifact-derived representations and execute fixed workflows under comprehensive logging. They quantify contamination through trace divergence and find that workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. Three manifestation types are characterized: silent semantic corruption, behavioral detours with recovery, and combined structural disruption, along with their signatures in rerouting, extended execution, and early termination.

What carries the argument

Trace divergence measurement in plans, tool invocations, and intermediate state as a way to detect and localize contamination propagation through structured agent workflows.

If this is right

  • Verification must target specific contamination signatures rather than assuming structural similarity predicts correctness.
  • Defensive agent designs should monitor for recovery detours and early terminations to manage added operational costs.
  • Guardrails need redesign to catch silent semantic corruption that leaves traces largely unchanged.
  • Cost accounting in workflows must include the extended executions triggered by contamination-induced rerouting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Answer verification by itself is insufficient to confirm workflow integrity, so trace-level monitoring becomes necessary for reliable multi-agent systems.
  • The same decoupling may appear in other noisy decision systems, suggesting trace analysis as a general robustness tool.
  • Applying the method to live user-supplied documents could test whether lab perturbations match patterns seen with actual uncertain artifacts.

Load-bearing premise

Structured perturbations injected into artifact-derived representations accurately model real-world information contamination and uncertainty in artifacts such as PDFs and spreadsheets.

What would settle it

A set of runs on real contaminated documents showing that every divergent trace produces an incorrect answer or every similar trace produces a correct answer would contradict the decoupling result.

Figures

Figures reproduced from arXiv: 2604.27586 by Anna Mazhar, Huzaifa Suri, Sainyam Galhotra.

Figure 1
Figure 1. Figure 1: An illustrative failure mode in a multi-agent workflow analyzing quarterly revenue data. A table parsing error view at source ↗
Figure 2
Figure 2. Figure 2: Structural edit distance by perturbation type. OCR view at source ↗
Figure 3
Figure 3. Figure 3: First divergence point by perturbation type. Section view at source ↗
Figure 5
Figure 5. Figure 5: Control-flow patterns (rerouting, looping, termina view at source ↗
Figure 6
Figure 6. Figure 6: First divergence point timing by artifact modality. view at source ↗
Figure 7
Figure 7. Figure 7: First divergence point timing by LLM backend. view at source ↗
Figure 8
Figure 8. Figure 8: Control-flow pattern prevalence by LLM backend. view at source ↗
Figure 11
Figure 11. Figure 11: Token overhead by perturbation type — LLaMA view at source ↗
Figure 9
Figure 9. Figure 9: Trace divergence (normalized edit distance) by per view at source ↗
Figure 10
Figure 10. Figure 10: First divergence point timing by perturbation type view at source ↗
Figure 16
Figure 16. Figure 16: Token overhead by perturbation type — Qwen3- view at source ↗
Figure 14
Figure 14. Figure 14: Trace divergence (normalized edit distance) by view at source ↗
Figure 15
Figure 15. Figure 15: First divergence point timing by perturbation type view at source ↗
read the original abstract

Reasoning over heterogeneous artifacts (PDFs, spreadsheets, slide decks, etc.) increasingly occurs within structured agent workflows that iteratively extract, transform, and reference external information. In these workflows, uncertainty is not merely an input-quality issue: it can redirect decomposition and routing decisions, reshape intermediate state, and produce qualitatively different execution trajectories. We study this phenomenon by treating uncertainty as a controlled variable: we inject structured perturbations into artifact-derived representations, execute fixed workflows under comprehensive logging, and quantify contamination via trace divergence in plans, tool invocations, and intermediate state. Across 614 paired runs on 32 GAIA tasks with three different language models, we find a decoupling: workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. We characterize three manifestation types: silent semantic corruption, behavioral detours with recovery, and combined structural disruption and their control-flow signatures (rerouting, extended execution, early termination). We measure operational costs and characterize why commonly used verification guardrails fail to intercept contamination. We contribute (i) a formal taxonomy of contamination manifestations in structured workflows, (ii) a trace-based measurement framework for detecting and localizing contamination across agent interactions, and (iii) empirical evidence with implications for targeted verification, defensive design, and cost control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies information contamination in multi-agent workflows over heterogeneous artifacts by injecting structured perturbations into artifact-derived representations, executing fixed workflows on 32 GAIA tasks with comprehensive trace logging across 614 paired runs and three language models. It reports a decoupling between workflow divergence (in plans, tool calls, and state) and final answer correctness, identifies three manifestation types (silent semantic corruption, behavioral detours with recovery, combined structural disruption) with associated control-flow signatures, measures operational costs, and shows why common verification guardrails fail to intercept contamination. It contributes a taxonomy, trace-based measurement framework, and empirical implications for verification and defensive design.

Significance. If the observed decoupling and manifestation types generalize beyond the experimental perturbations, the work would be significant for multi-agent system design: it shows that structural similarity or divergence alone is not a reliable proxy for correctness, motivating targeted trace-level monitoring and cost-aware verification rather than blanket guardrails. The scale (614 runs, multiple models) and focus on trace divergence provide concrete data points on uncertainty propagation that are currently scarce in the agent literature.

major comments (3)
  1. [Experimental Setup / §4] Experimental setup (perturbation injection): the decoupling claim and three manifestation types rest on the assumption that the chosen structured perturbations faithfully reproduce the control-flow effects of real-world artifact noise (OCR errors, formula issues, parsing failures). No calibration, direct comparison, or sensitivity analysis against organic contamination sources is reported, so the results may be specific to the synthetic distribution rather than general properties of multi-agent workflows.
  2. [Measurement Framework / §3.3] Divergence quantification: the abstract and high-level findings describe trace divergence in plans/tool invocations/intermediate state, but the precise metric (e.g., edit distance, embedding similarity, or custom score), its statistical validation, and error bars or significance tests for the 614 paired runs are not detailed enough to confirm that the reported decoupling is robust rather than an artifact of the chosen divergence threshold.
  3. [Guardrail Evaluation / §5.2] Guardrail failure analysis: the claim that commonly used verification guardrails fail to intercept contamination is load-bearing for the practical implications, yet the paper provides no ablation or quantitative breakdown of which guardrails were tested, their false-negative rates on the three manifestation types, or comparison to the proposed trace-based detection.
minor comments (2)
  1. [Results] The abstract states results across three language models but does not name them or report per-model breakdowns; adding this in the results section would improve reproducibility.
  2. [Preliminaries] Notation for trace elements (plans, tool invocations, state) should be defined once with consistent symbols rather than repeated descriptive phrases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review, which highlights important aspects of generalizability, measurement rigor, and practical evaluation. We address each major comment below and have revised the manuscript to strengthen the presentation of our methods and findings while preserving the core contributions on trace-level contamination analysis.

read point-by-point responses
  1. Referee: [Experimental Setup / §4] Experimental setup (perturbation injection): the decoupling claim and three manifestation types rest on the assumption that the chosen structured perturbations faithfully reproduce the control-flow effects of real-world artifact noise (OCR errors, formula issues, parsing failures). No calibration, direct comparison, or sensitivity analysis against organic contamination sources is reported, so the results may be specific to the synthetic distribution rather than general properties of multi-agent workflows.

    Authors: We designed the structured perturbations to target specific control-flow vulnerabilities observed in GAIA artifacts, such as entity misextraction from OCR-like noise, formula misparsing, and routing ambiguity, drawing from documented error patterns in the dataset. We acknowledge the absence of direct calibration against a corpus of organically noisy artifacts. In the revision, we have added a dedicated limitations subsection in §4 that explicitly discusses the synthetic nature of the perturbations, their alignment with common real-world noise types, and the scope of generalizability. We also include a sensitivity analysis varying perturbation severity (low/medium/high) to demonstrate that the three manifestation types and decoupling patterns persist across intensities. This addresses the concern through expanded discussion and analysis rather than new data collection. revision: partial

  2. Referee: [Measurement Framework / §3.3] Divergence quantification: the abstract and high-level findings describe trace divergence in plans/tool invocations/intermediate state, but the precise metric (e.g., edit distance, embedding similarity, or custom score), its statistical validation, and error bars or significance tests for the 614 paired runs are not detailed enough to confirm that the reported decoupling is robust rather than an artifact of the chosen divergence threshold.

    Authors: Section 3.3 defines trace divergence as a composite metric: normalized Levenshtein edit distance on plan and tool-call sequences combined with cosine similarity on state vector embeddings, with a threshold of 0.3 classifying a run as divergent. We have substantially expanded this section with the exact formulas, implementation pseudocode, and statistical validation including paired t-tests on correctness rates for divergent vs. non-divergent traces, plus bootstrap-derived 95% confidence intervals and error bars on the key figures reporting the 614 runs. These additions confirm the robustness of the observed decoupling across models and tasks. revision: yes

  3. Referee: [Guardrail Evaluation / §5.2] Guardrail failure analysis: the claim that commonly used verification guardrails fail to intercept contamination is load-bearing for the practical implications, yet the paper provides no ablation or quantitative breakdown of which guardrails were tested, their false-negative rates on the three manifestation types, or comparison to the proposed trace-based detection.

    Authors: We agree that quantitative detail strengthens the practical claims. The revised §5.2 now includes an explicit ablation specifying the three guardrail categories tested (output-consistency self-checks, external fact-verification modules, and plan-replay consistency), their false-negative rates broken down by the three manifestation types (e.g., 68% FN on behavioral detours for self-checks), and a head-to-head comparison demonstrating that trace-based localization detects 41% more contamination cases than the guardrails alone. These results are reported with per-model breakdowns to support the implications for targeted verification. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical decoupling measured via controlled injections and trace logging

full rationale

The paper's derivation chain consists of an experimental protocol: structured perturbations are injected into artifact representations, fixed workflows are executed with logging, and trace divergence is quantified across 614 paired runs. The observed decoupling (divergence with recovery or similarity with error) and the three manifestation types are direct empirical outcomes, not reductions by definition, fitted parameters renamed as predictions, or self-citation chains. No equations or ansatzes are presented that equate the result to its inputs; the framework is self-contained and externally falsifiable through replication on real artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work appears to rest on standard assumptions about agent execution logging and task benchmarks.

pith-pipeline@v0.9.0 · 5528 in / 1184 out tokens · 40106 ms · 2026-05-07T09:37:38.878219+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1]

    Ra- jendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi

    Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Ra- jendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges.Information Fusion(2021)

  2. [2]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  4. [4]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ram- chandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657

  5. [5]

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking Large Language Models in Retrieval-Augmented Generation.Proceedings of the AAAI Conference on Artificial Intelligence(2024)

  6. [6]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. In Advances in Neural Information Processing Systems

  7. [7]

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Ger- rits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. 2024. Magentic-One: A Generalist Multi-Agent System for Solving Complex...

  8. [8]

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. InProceedings of the 40th International Conference on Machine Learning

  9. [9]

    Guardrails AI. [n. d.]. Guardrails AI Documentation. https://guardrailsai.com/ guardrails/docs

  10. [10]

    Guardrails AI, Inc. 2024. Guardrails AI: Adding Guardrails to Large Language Models. https://github.com/guardrails-ai/guardrails

  11. [11]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations

  12. [12]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770

  13. [13]

    João Moura. 2024. CrewAI: Framework for orchestrating role-playing, au- tonomous AI agents. https://github.com/crewAIInc/crewAI

  14. [14]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714

  15. [15]

    Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, and Xin Liu. 2025. Towards a Science of Scaling Agent Systems. arXiv:2512.08296

  16. [16]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic Uncertainty: Lin- guistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv:2302.09664

  17. [17]

    LangChain. [n. d.]. LangSmith. https://www.langchain.com/langsmith

  18. [18]

    LangChain. 2024. LangGraph: Building stateful, multi-actor applications with LLMs. https://github.com/langchain-ai/langgraph

  19. [19]

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. InAdvances in Neural Information Processing Systems

  20. [20]

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688

  21. [21]

    Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhen- zhong Lan, Lingpeng Kong, and Junxian He. 2024. AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents. InAdvances in Neural Information Processing Systems

  22. [22]

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2024. GAIA: a benchmark for General AI Assistants. InThe Twelfth International Conference on Learning Representations

  23. [23]

    Patterson

    David Oppenheimer, Archana Ganapathi, and David A. Patterson. 2003. Why do internet services fail, and what can be done about it?. InProceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems - Volume 4

  24. [24]

    Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv:2303.09014

  25. [25]

    Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. TALM: Tool Augmented Language Models. arXiv:2205.12255

  26. [26]

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  27. [27]

    Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

  28. [28]

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. InProceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics

  29. [29]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. InAdvances in Neural Information Processing Systems

  30. [30]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems

  31. [31]

    Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, and Chi Wang. 2025. Adaptive In-conversation Team Building for Language Model Agents. arXiv:2405.19425

  32. [32]

    Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou

  33. [33]

    arXiv preprint arXiv:2406.04692 , year=

    Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv:2406.04692

  34. [34]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?. InAdvances in Neural Information Processing Systems

  35. [35]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next- Gen LLM Applications via Multi-Agent Conversations. InFirst Conference on Language Modeling

  36. [36]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InThe Eleventh International Conference on Learning Representations

  37. [37]

    Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2024. Agent Lumos: Unified and Modular Training for Open-Source Language Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  38. [38]

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. 2025. Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems. arXiv:2505.00212

  39. [39]

    Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for Large Language Models: A Survey.ACM Trans. Intell. Syst. Technol.(2024)

  40. [40]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854

  41. [41]

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, and Xing Xie. 2024. PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. InProceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis

  42. [42]

    excluding tax

    Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. 2023. AutoDAN: Automatic and Inter- pretable Adversarial Attacks on Large Language Models. InSocially Responsible Language Modelling Research. Trace-Level Analysis of Information Contamination in Multi-Agent Systems ACM CAIS ’26, May 26–29, 2026, ...