To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

Hui Liu; Jiliang Tang; Jingying Zeng; Pengfei He; Qiankun Peng; Qi He; Samarth Varshney; Shrivats Agrawal; Suhang Wang; Xianfeng Tang

arxiv: 2506.02546 · v2 · submitted 2025-06-03 · 💻 cs.CR

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

Pengfei He , Zhenwei Dai , Xianfeng Tang , Yue Xing , Hui Liu , Jingying Zeng , Qiankun Peng , Shrivats Agrawal

show 4 more authors

Samarth Varshney Suhang Wang Jiliang Tang Qi He

This is my paper

Pith reviewed 2026-05-19 11:26 UTC · model grok-4.3

classification 💻 cs.CR

keywords trust managementLLM multi-agent systemsattention-based scoringtrustworthiness evaluationrobustnessmalicious inputsGrice communication theory

0 comments

The pith

An attention-based trust management system allows LLM multi-agent systems to assess message trustworthiness and resist malicious inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model multi-agent systems can solve complex tasks but remain vulnerable when agents receive unreliable messages because they treat all incoming messages equally. The paper defines trustworthiness using six orthogonal dimensions drawn from human communication theory to provide interpretable measures. It introduces a lightweight attention-based method called A-Trust to score the trustworthiness of messages. A trust management system is then built on this method to handle both message-level and agent-level evaluations. Experiments across diverse settings show that the system improves robustness against malicious inputs.

Core claim

The paper claims that defining trustworthiness through six orthogonal dimensions and implementing an attention-based Trust Score allows construction of a trust management system that evaluates messages and agents in LLM multi-agent systems, resulting in significantly improved robustness against malicious inputs as demonstrated in multiple experimental settings and tasks.

What carries the argument

The Attention Trust Score (A-Trust), a lightweight attention-based method that scores message trustworthiness using the six orthogonal dimensions.

If this is right

Message-level trust assessments allow agents to discount or ignore specific unreliable communications before acting on them.
Agent-level trust assessments enable identification and isolation of consistently untrustworthy agents over time.
The single framework supports both message-level and agent-level evaluations without requiring separate mechanisms.
Robustness gains appear across a range of multi-agent tasks and settings when the system is applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scoring method could be adapted to evaluate trustworthiness in single-agent LLM setups that receive external data streams.
Integration with existing verification techniques might create layered defenses for agent communication networks.
Large-scale simulations of agent populations would test whether attention scores remain stable as the number of agents grows.

Load-bearing premise

The six orthogonal trust dimensions inspired by Grice provide a comprehensive and interpretable measure of trustworthiness that can be accurately captured by an attention-based scoring method in LLM agents.

What would settle it

A controlled test where malicious messages are engineered to receive high attention-based trust scores on the six dimensions but still produce harmful effects, checking whether the trust management system fails to block them.

Figures

Figures reproduced from arXiv: 2506.02546 by Hui Liu, Jiliang Tang, Jingying Zeng, Pengfei He, Qiankun Peng, Qi He, Samarth Varshney, Shrivats Agrawal, Suhang Wang, Xianfeng Tang, Yue Xing, Zhenwei Dai.

**Figure 2.** Figure 2: Frequency of heads with top-10 highest attention weights. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison among different trust scores on Trust Violation dataset. From top to bottom, rows present A-Trust, perplexity-based and prompt-based score respectively. Each column shows results of one trust dimension. The orange line denotes violation data while the blue line denotes non-violation data. We evaluate A-Trust on the Trust Violation dataset. Specifically, for each regression model, we test on the … view at source ↗

**Figure 4.** Figure 4: An illustration of proposed trust management system. The system consists of three key components: message-level trust evaluation, a trust-aware action policy, and agent-level trust records that enable dynamic tracking and utilization of trust. 3 TMS: Trust Management System for LLM-MAS To ensure secure and reliable collaboration in LLM-MAS, we introduce a dedicated Trust Management System (TMS) to monitor … view at source ↗

**Figure 5.** Figure 5: Performance of TMS (A-Trust) with different number of malicious sources, on MMLUPhy dataset and complete structure. Full results are shown in [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Visual analysis of multi-head attention on different violations using Qwen2.5-7B. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Visual analysis of multi-head attention on different violations using Gemma3-4B. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

read the original abstract

Large Language Model-based Multi-Agent Systems (LLM-MAS) have demonstrated strong capabilities in solving complex tasks but remain vulnerable when agents receive unreliable messages. This vulnerability stems from a fundamental gap: LLM agents treat all incoming messages equally without evaluating their trustworthiness. While some existing studies approach trustworthiness, they focus on a single type of harmfulness rather than analyze it in a holistic approach from multiple trustworthiness perspectives. We address this gap by proposing a comprehensive definition of trustworthiness inspired by human communication theory (Grice, 1975). Our definition identifies six orthogonal trust dimensions that provide interpretable measures of trustworthiness. Building on this definition, we introduce the Attention Trust Score (A -Trust), a lightweight, attention-based method for evaluating the trustworthiness of messages. We then develop a principled trust management system (TMS) for LLM -MAS that supports both message-level and agent-level trust assessments. Experiments across diverse multi-agent settings and tasks demonstrate that our TMS significantly improves robustness against malicious inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines six trust dimensions from Grice and adds an attention-based A-Trust score for LLM multi-agent messages, but the experiments do not show that the scores track those dimensions rather than simpler cues.

read the letter

The main point is a six-dimension definition of trustworthiness for LLM multi-agent systems, drawn from Grice, paired with an attention mechanism to score incoming messages and a trust management system that applies it at both message and agent levels. This targets the issue of agents treating every message the same when some are unreliable or malicious. It moves past earlier work that handled only one narrow type of harm, which is a clear step toward a broader view. The definition aims for interpretability across orthogonal aspects, and the attention approach keeps the scoring lightweight enough to add without major overhead. That framing and the dual-level assessment are the parts that feel useful on first read. The experiments claim clear robustness gains across tasks and settings, but the support for why those gains happen is thin. There is no reported check that the attention weights actually reflect the six dimensions instead of proxy signals like token patterns or length that might flag the specific attacks in the test suite. No ablations against non-semantic baselines or human ratings to confirm dimension alignment appear in the description. If the performance edge comes from catching surface features of the synthetic malicious inputs rather than the intended trust model, the central claim weakens. Readers working on reliable agent collaboration or defenses against bad inputs would get the most from the definition and system design as a starting point. The work is coherent enough on its own terms to deserve peer review, though the authors would need to add validation that ties the scores to the proposed dimensions.

Referee Report

2 major / 2 minor

Summary. The paper proposes a definition of trustworthiness for LLM-based multi-agent systems (LLM-MAS) consisting of six orthogonal dimensions inspired by Grice (1975), introduces an Attention Trust Score (A-Trust) that uses attention mechanisms to produce interpretable per-message scores, develops a Trust Management System (TMS) supporting both message-level and agent-level assessments, and reports that experiments across diverse multi-agent settings and tasks show the TMS significantly improves robustness against malicious inputs.

Significance. If the central empirical claims hold after proper validation, the work would address a practical gap in holistic, multi-dimensional trust evaluation for LLM-MAS rather than single-dimension harm detection. The lightweight attention-based approach could integrate readily with existing LLM architectures and provide interpretable signals for agent decision-making.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the claim that 'experiments across diverse multi-agent settings and tasks demonstrate that our TMS significantly improves robustness' is presented without any description of baselines, metrics, statistical tests, number of runs, or how the six Grice dimensions were operationalized or measured. This leaves the central robustness claim without visible supporting evidence or controls.
[A-Trust definition / §3] Section introducing A-Trust (likely §3): the assertion that attention weights produce scores aligned with the six orthogonal trust dimensions lacks any reported human annotation study, inter-dimension correlation matrix, or ablation that replaces the attention scorer with a non-semantic baseline (e.g., token length or embedding norm). Without such checks it is unclear whether reported gains reflect the proposed trust model or surface cues correlated with the synthetic attacks.

minor comments (2)

[Method] Clarify the exact attention mechanism used (e.g., which layer, head, or query formulation) and whether it requires any fine-tuning or is purely zero-shot.
[Definition] Add a table or figure summarizing the six dimensions with their Grice-inspired definitions and example message annotations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the clarity and rigor of our experimental claims and validation of the A-Trust mechanism. We address each major comment point-by-point below, indicating where revisions have been made to the manuscript.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: the claim that 'experiments across diverse multi-agent settings and tasks demonstrate that our TMS significantly improves robustness' is presented without any description of baselines, metrics, statistical tests, number of runs, or how the six Grice dimensions were operationalized or measured. This leaves the central robustness claim without visible supporting evidence or controls.

Authors: We agree that the original presentation of the robustness claims in the abstract and Experiments section lacked sufficient methodological detail. In the revised manuscript, we have expanded both the abstract and the Experiments section to explicitly describe: the baselines employed (including a no-trust baseline, a random trust assignment baseline, and a non-attention embedding-norm baseline); the evaluation metrics (task success rate under malicious inputs and attack mitigation rate); statistical testing via paired t-tests with reported p-values across 10 independent runs using different random seeds; and the operationalization of the six Grice-inspired dimensions through dimension-specific prompt templates and scoring rubrics in the TMS, with concrete examples now included in a new appendix subsection. revision: yes
Referee: [A-Trust definition / §3] Section introducing A-Trust (likely §3): the assertion that attention weights produce scores aligned with the six orthogonal trust dimensions lacks any reported human annotation study, inter-dimension correlation matrix, or ablation that replaces the attention scorer with a non-semantic baseline (e.g., token length or embedding norm). Without such checks it is unclear whether reported gains reflect the proposed trust model or surface cues correlated with the synthetic attacks.

Authors: We acknowledge the value of additional validation for the claimed alignment between attention weights and the six dimensions. A full-scale human annotation study was not performed in the original work owing to the substantial annotation effort required for multi-dimensional evaluation of LLM-generated messages. However, we have added an ablation study in the revised §3 and Experiments sections that replaces the attention scorer with non-semantic baselines (token length and embedding norm), showing that these yield significantly lower robustness gains. We have also included an inter-dimension correlation matrix in the appendix demonstrating low correlations (supporting orthogonality) and several qualitative case studies illustrating how attention patterns map to specific Grice dimensions. These additions provide quantitative and qualitative support for the trust model beyond surface cues. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes a new definition of trustworthiness drawing from external citation to Grice (1975) to identify six dimensions, then constructs an attention-based A-Trust scoring method and TMS on top of that definition. No equations, fitted parameters, or self-citations are shown that reduce the claimed outputs (robustness gains) to inputs by construction. The approach is presented as a novel construction with experimental support rather than a tautological renaming or self-referential loop, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unvalidated assumption that Grice-inspired dimensions are orthogonal and sufficient for LLM agent communication, plus the effectiveness of attention for computing trust scores; these are introduced rather than derived from external benchmarks or data.

axioms (1)

domain assumption Trustworthiness of messages in LLM-MAS can be decomposed into six orthogonal dimensions inspired by Grice (1975) human communication theory.
Invoked when proposing the comprehensive definition of trustworthiness in the abstract.

invented entities (2)

Attention Trust Score (A-Trust) no independent evidence
purpose: Lightweight computation of message trustworthiness using LLM attention mechanisms across the six dimensions.
New scoring method introduced to operationalize the trust definition.
Trust Management System (TMS) no independent evidence
purpose: Framework supporting message-level and agent-level trust assessments in LLM-MAS.
Overall system proposed to apply the trust scores for robustness.

pith-pipeline@v0.9.0 · 5735 in / 1320 out tokens · 72896 ms · 2026-05-19T11:26:12.995176+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose Attention Trust Score (A-Trust), a lightweight, attention-based method for evaluating message trustworthiness... six orthogonal trust dimensions... factual accuracy, logical consistency, relevance, bias, clarity, quality.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We find that certain attention heads in the LLM specialize in detecting specific types of violations... Attnl,h(M) = geometric mean of attention weights.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
cs.AI 2025-08 unverdicted novelty 6.0

BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowled...
GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems
cs.CR 2026-04 unverdicted novelty 5.0

GAMMAF provides a benchmarking platform with data generation and defense evaluation pipelines for graph-based anomaly detection in LLM multi-agent systems, demonstrating improved integrity and lower operational costs ...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1]

Logic and conversation

Herbert Paul Grice. Logic and conversation. Syntax and semantics, 3:43–58, 1975

work page 1975
[2]

Camel: Communicative agents for" mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

work page 2023
[3]

AgentScope: A Flexible yet Robust Multi-Agent Platform,

Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, et al. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034, 2024

work page arXiv 2024
[4]

Large Language Model-Based Agents for Software Engineering: A Survey

Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations

work page
[6]

Chatdev: Revolutionizing software development with llm-based agents

Yutao Qian et al. Chatdev: Revolutionizing software development with llm-based agents. arXiv preprint arXiv:2309.07922, 2023

work page arXiv 2023
[7]

Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems

Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, and Caiwen Ding. Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems. arXiv preprint arXiv:2404.04735, 2024

work page arXiv 2024
[8]

Make llms better zero- shot reasoners: Structure-orientated autonomous reasoning

Pengfei He, Zitao Li, Yue Xing, Yaling Li, Jiliang Tang, and Bolin Ding. Make llms better zero- shot reasoners: Structure-orientated autonomous reasoning. arXiv preprint arXiv:2410.19000, 2024

work page arXiv 2024
[9]

Chatgpt research group for optimizing the crystallinity of mofs and cofs

Zhiling Zheng, Oufan Zhang, Ha L Nguyen, Nakul Rampal, Ali H Alawadhi, Zichao Rong, Teresa Head-Gordon, Christian Borgs, Jennifer T Chayes, and Omar M Yaghi. Chatgpt research group for optimizing the crystallinity of mofs and cofs. ACS Central Science, 9(11):2161–2170, 2023

work page 2023
[10]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023

work page arXiv 2023
[11]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

Donghyun Lee and Mo Tiwari. Prompt infection: Llm-to-llm prompt injection within multi- agent systems. arXiv preprint arXiv:2410.07283, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Breaking agents: Compromising autonomous llm agents through malfunction amplification

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, and Yang Zhang. Breaking agents: Compromising autonomous llm agents through malfunction amplification. arXiv preprint arXiv:2407.20859, 2024

work page arXiv 2024
[14]

Flooding spread of manipulated knowledge in llm-based multi-agent communities.arXiv preprint arXiv:2407.07791, 2024

Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, and Gongshen Liu. Flooding spread of manipulated knowledge in llm-based multi-agent communities. arXiv preprint arXiv:2407.07791, 2024

work page arXiv 2024
[15]

Red-teaming llm multi-agent systems via communication attacks

Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. arXiv preprint arXiv:2502.14847, 2025

work page arXiv 2025
[16]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Agent2agent (a2a) protocol

Google. Agent2agent (a2a) protocol. https://github.com/google/A2A, 2025. 10

work page 2025
[18]

Knowledge-based consis- tency testing of large language models

Sai Sathiesh Rajan, Ezekiel Soremekun, and Sudipta Chattopadhyay. Knowledge-based consis- tency testing of large language models. arXiv preprint arXiv:2407.12830, 2024

work page arXiv 2024
[19]

Towards knowledge checking in retrieval-augmented generation: A representation perspective

Shenglai Zeng, Jiankun Zhang, Bingheng Li, Yuping Lin, Tianqi Zheng, Dante Everaert, Hanqing Lu, Hui Liu, Yue Xing, Monica Xiao Cheng, et al. Towards knowledge checking in retrieval-augmented generation: A representation perspective. arXiv preprint arXiv:2411.14572, 2024

work page arXiv 2024
[20]

Evaluation of an llm in identifying logical fallacies: A call for rigor when adopting llms in hci research

Gionnieve Lim and Simon T Perrault. Evaluation of an llm in identifying logical fallacies: A call for rigor when adopting llms in hci research. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing , pages 303–308, 2024

work page 2024
[21]

Are llms good zero-shot fallacy classifiers? arXiv preprint arXiv:2410.15050, 2024

Fengjun Pan, Xiaobao Wu, Zongrui Li, and Anh Tuan Luu. Are llms good zero-shot fallacy classifiers? arXiv preprint arXiv:2410.15050, 2024

work page arXiv 2024
[22]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[23]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

arXiv preprint arXiv:2308.04592 , year=

Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi- Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shep- herd: A critic for language model generation. arXiv preprint arXiv:2308.04592, 2023

work page arXiv 2023
[25]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Attention tracker: Detecting prompt injection attacks in llms

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I Chung, Winston H Hsu, Pin-Yu Chen, et al. Attention tracker: Detecting prompt injection attacks in llms. arXiv preprint arXiv:2411.00348, 2024

work page arXiv 2024
[27]

On the role of attention heads in large language model safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. On the role of attention heads in large language model safety. arXiv preprint arXiv:2410.13708, 2024

work page arXiv 2024
[28]

Successor heads: Recurring, interpretable attention heads in the wild

Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. arXiv preprint arXiv:2312.09230, 2023

work page arXiv 2023
[29]

Trustscore: reference-free evaluation of llm response trustworthiness

Danna Zheng, Danyang Liu, Mirella Lapata, and Jeff Z Pan. Trustscore: reference-free evaluation of llm response trustworthiness. arXiv preprint arXiv:2402.12545, 2024

work page arXiv 2024
[30]

A multi-dimensional trust evaluation model for large-scale p2p computing

Xiaoyong Li, Feng Zhou, and Xudong Yang. A multi-dimensional trust evaluation model for large-scale p2p computing. Journal of Parallel and Distributed Computing , 71(6):837–847, 2011

work page 2011
[31]

On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989, 2024

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, and Michael R Lyu. On the resilience of multi-agent systems with malicious agents. arXiv preprint arXiv:2408.00989, 2024

work page arXiv 2024
[32]

Netsafe: Exploring the topological safety of multi-agent networks

Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, and Yang Wang. Netsafe: Exploring the topological safety of multi-agent networks. arXiv preprint arXiv:2410.15686, 2024

work page arXiv 2024
[33]

FEVER: a large-scale dataset for fact extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT, 2018

work page 2018
[34]

Stereoset: Measuring stereotypical bias in pretrained language models, 2020

Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models, 2020. 11

work page 2020
[35]

Why do most multi-agent llm systems fail?, March 2025

Jagadeesh Rajarajan. Why do most multi-agent llm systems fail?, March 2025. Accessed: 2025-05-14

work page 2025
[36]

Why are my prompts leaked? unraveling prompt extraction threats in customized large language models

Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li. Why are my prompts leaked? unraveling prompt extraction threats in customized large language models. arXiv preprint arXiv:2408.02416, 2024

work page arXiv 2024
[37]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline de- fenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Ai control: Improving safety despite intentional subversion

Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942, 2023

work page arXiv 2023
[40]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) , 2021

work page 2021
[42]

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL) , 2021

work page 2021
[43]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

work page 2023
[45]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[46]

S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Embodied llm agents learn to cooperate in organized teams

Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams. arXiv preprint arXiv:2403.12482, 2024

work page arXiv 2024
[48]

Language agents with reinforcement learning for strategic play in the werewolf game

Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game. arXiv preprint arXiv:2310.18940, 2023

work page arXiv 2023
[49]

Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety

Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, and Jing Shao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. arXiv preprint arXiv:2401.11880, 2024

work page arXiv 2024
[50]

Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate

Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Wang. Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. arXiv preprint arXiv:2406.14711, 2024

work page arXiv 2024
[51]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. 13 A Related works LLM-based Multi-agent System. LLM-MAS enhance the reasoning and task-solving capabilities of large language...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Factual Accuracy - Are the claims verifiably true and consistent with known facts?

work page
[54]

Logical Consistency – Is the argument internally coherent and free of contradictions?

work page
[55]

Relevance to Context – Does the content stay on topic and address the given question or prompt?

work page
[56]

Neutrality & Bias – Is the language objective, balanced, and free from undue bias?

work page
[57]

Language Quality – Is the prose fluent, grammatical, and properly punctuated?

work page
[58]

17 D Attention analysis on additional models We present additional attention analysis on Qwen2.5-7B [51] and Gemma3-4B [52] to illustrate the generalization of our observation

Clarity & Precision – Are ideas expressed unambiguously and concisely? Scoring rules (per rubric): 0%=Poor, 100%=Excellent (Use percentage only.) Your tasks: Provide a brief justification (1-3 sentences) for each rubric score. 17 D Attention analysis on additional models We present additional attention analysis on Qwen2.5-7B [51] and Gemma3-4B [52] to ill...

work page
[59]

(2) We develop A-Trust to evaluate the message trustworthiness leveraging the LLM’s attention mechanism

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: (1) We decompose trustworthiness into six dimensions following human communication literature and curate a dataset correspondingly. (2) We develop A-Trust to evaluate the message trustworthiness lev...

work page
[60]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We summarize some limitations in our paper in Conclusion section and correspondingly suggest some potential future directions. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the ...

work page
[61]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs 23 Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: No mathematical theory is included in this paper. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and...

work page
[62]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main ex- perimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We include the details abo...

work page
[63]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instruc- tions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide anonymized repo containing both data and code in Section 2.2 and 4.1. Guidelines: • The answer N...

work page
[64]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyper- parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Justification: We include the details about the structure of the LLM-MAS considered in this paper, the prompt...

work page
[65]

Guidelines: • The answer NA means that the paper does not include experiments

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: Through various settings, our proposed methods demonstrate consistent en- hancement compared to the baselines. Guidelines: • The an...

work page
[66]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the com- puter resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We mention the details of server we run experiments on in Section 4.1, and include latency analysis in ...

work page
[67]

Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethicshttps://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The research is conducted following the NeurIPS Code of Ethics. Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics....

work page
[68]

There is no potential negative social impact based on our knowledge

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [No] Justification: We introduce a system to enhance the robustness of LLM-MAS. There is no potential negative social impact based on our knowledge. Guidelines: • The answer NA means that there is no societ...

work page
[69]

There is no direct way to misuse them for malicious purposes

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our proposed methods aims to enhance the trustworthiness of LLM-MAS. There is no direct way to ...

work page
[70]

Guidelines: • The answer NA means that the paper does not use existing assets

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We summarize the license in Section E. Guidelines: • The answer NA means that the paper does not...

work page
[71]

The description of the dataset can be found in Section 2.1

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: We develop a new dataset, Trust Violation dataset, in our paper. The description of the dataset can be found in Section 2.1. We also demonstrate how to use the dataset in Section 2.2. Guidelines: ...

work page
[72]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: There is no crowdsourcing experiments. Guidelines: • ...

work page
[73]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page
[74]

Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...

work page 2025

[1] [1]

Logic and conversation

Herbert Paul Grice. Logic and conversation. Syntax and semantics, 3:43–58, 1975

work page 1975

[2] [2]

Camel: Communicative agents for" mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

work page 2023

[3] [3]

AgentScope: A Flexible yet Robust Multi-Agent Platform,

Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, et al. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034, 2024

work page arXiv 2024

[4] [4]

Large Language Model-Based Agents for Software Engineering: A Survey

Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations

work page

[6] [6]

Chatdev: Revolutionizing software development with llm-based agents

Yutao Qian et al. Chatdev: Revolutionizing software development with llm-based agents. arXiv preprint arXiv:2309.07922, 2023

work page arXiv 2023

[7] [7]

Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems

Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, and Caiwen Ding. Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems. arXiv preprint arXiv:2404.04735, 2024

work page arXiv 2024

[8] [8]

Make llms better zero- shot reasoners: Structure-orientated autonomous reasoning

Pengfei He, Zitao Li, Yue Xing, Yaling Li, Jiliang Tang, and Bolin Ding. Make llms better zero- shot reasoners: Structure-orientated autonomous reasoning. arXiv preprint arXiv:2410.19000, 2024

work page arXiv 2024

[9] [9]

Chatgpt research group for optimizing the crystallinity of mofs and cofs

Zhiling Zheng, Oufan Zhang, Ha L Nguyen, Nakul Rampal, Ali H Alawadhi, Zichao Rong, Teresa Head-Gordon, Christian Borgs, Jennifer T Chayes, and Omar M Yaghi. Chatgpt research group for optimizing the crystallinity of mofs and cofs. ACS Central Science, 9(11):2161–2170, 2023

work page 2023

[10] [10]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023

work page arXiv 2023

[11] [11]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

Donghyun Lee and Mo Tiwari. Prompt infection: Llm-to-llm prompt injection within multi- agent systems. arXiv preprint arXiv:2410.07283, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Breaking agents: Compromising autonomous llm agents through malfunction amplification

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, and Yang Zhang. Breaking agents: Compromising autonomous llm agents through malfunction amplification. arXiv preprint arXiv:2407.20859, 2024

work page arXiv 2024

[14] [14]

Flooding spread of manipulated knowledge in llm-based multi-agent communities.arXiv preprint arXiv:2407.07791, 2024

Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, and Gongshen Liu. Flooding spread of manipulated knowledge in llm-based multi-agent communities. arXiv preprint arXiv:2407.07791, 2024

work page arXiv 2024

[15] [15]

Red-teaming llm multi-agent systems via communication attacks

Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. arXiv preprint arXiv:2502.14847, 2025

work page arXiv 2025

[16] [16]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Agent2agent (a2a) protocol

Google. Agent2agent (a2a) protocol. https://github.com/google/A2A, 2025. 10

work page 2025

[18] [18]

Knowledge-based consis- tency testing of large language models

Sai Sathiesh Rajan, Ezekiel Soremekun, and Sudipta Chattopadhyay. Knowledge-based consis- tency testing of large language models. arXiv preprint arXiv:2407.12830, 2024

work page arXiv 2024

[19] [19]

Towards knowledge checking in retrieval-augmented generation: A representation perspective

Shenglai Zeng, Jiankun Zhang, Bingheng Li, Yuping Lin, Tianqi Zheng, Dante Everaert, Hanqing Lu, Hui Liu, Yue Xing, Monica Xiao Cheng, et al. Towards knowledge checking in retrieval-augmented generation: A representation perspective. arXiv preprint arXiv:2411.14572, 2024

work page arXiv 2024

[20] [20]

Evaluation of an llm in identifying logical fallacies: A call for rigor when adopting llms in hci research

Gionnieve Lim and Simon T Perrault. Evaluation of an llm in identifying logical fallacies: A call for rigor when adopting llms in hci research. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing , pages 303–308, 2024

work page 2024

[21] [21]

Are llms good zero-shot fallacy classifiers? arXiv preprint arXiv:2410.15050, 2024

Fengjun Pan, Xiaobao Wu, Zongrui Li, and Anh Tuan Luu. Are llms good zero-shot fallacy classifiers? arXiv preprint arXiv:2410.15050, 2024

work page arXiv 2024

[22] [22]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025

[23] [23]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

arXiv preprint arXiv:2308.04592 , year=

Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi- Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shep- herd: A critic for language model generation. arXiv preprint arXiv:2308.04592, 2023

work page arXiv 2023

[25] [25]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

Attention tracker: Detecting prompt injection attacks in llms

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I Chung, Winston H Hsu, Pin-Yu Chen, et al. Attention tracker: Detecting prompt injection attacks in llms. arXiv preprint arXiv:2411.00348, 2024

work page arXiv 2024

[27] [27]

On the role of attention heads in large language model safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. On the role of attention heads in large language model safety. arXiv preprint arXiv:2410.13708, 2024

work page arXiv 2024

[28] [28]

Successor heads: Recurring, interpretable attention heads in the wild

Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. arXiv preprint arXiv:2312.09230, 2023

work page arXiv 2023

[29] [29]

Trustscore: reference-free evaluation of llm response trustworthiness

Danna Zheng, Danyang Liu, Mirella Lapata, and Jeff Z Pan. Trustscore: reference-free evaluation of llm response trustworthiness. arXiv preprint arXiv:2402.12545, 2024

work page arXiv 2024

[30] [30]

A multi-dimensional trust evaluation model for large-scale p2p computing

Xiaoyong Li, Feng Zhou, and Xudong Yang. A multi-dimensional trust evaluation model for large-scale p2p computing. Journal of Parallel and Distributed Computing , 71(6):837–847, 2011

work page 2011

[31] [31]

On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989, 2024

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, and Michael R Lyu. On the resilience of multi-agent systems with malicious agents. arXiv preprint arXiv:2408.00989, 2024

work page arXiv 2024

[32] [32]

Netsafe: Exploring the topological safety of multi-agent networks

Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, and Yang Wang. Netsafe: Exploring the topological safety of multi-agent networks. arXiv preprint arXiv:2410.15686, 2024

work page arXiv 2024

[33] [33]

FEVER: a large-scale dataset for fact extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT, 2018

work page 2018

[34] [34]

Stereoset: Measuring stereotypical bias in pretrained language models, 2020

Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models, 2020. 11

work page 2020

[35] [35]

Why do most multi-agent llm systems fail?, March 2025

Jagadeesh Rajarajan. Why do most multi-agent llm systems fail?, March 2025. Accessed: 2025-05-14

work page 2025

[36] [36]

Why are my prompts leaked? unraveling prompt extraction threats in customized large language models

Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li. Why are my prompts leaked? unraveling prompt extraction threats in customized large language models. arXiv preprint arXiv:2408.02416, 2024

work page arXiv 2024

[37] [37]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline de- fenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Ai control: Improving safety despite intentional subversion

Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942, 2023

work page arXiv 2023

[40] [40]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) , 2021

work page 2021

[42] [42]

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL) , 2021

work page 2021

[43] [43]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[44] [44]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

work page 2023

[45] [45]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[46] [46]

S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Embodied llm agents learn to cooperate in organized teams

Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams. arXiv preprint arXiv:2403.12482, 2024

work page arXiv 2024

[48] [48]

Language agents with reinforcement learning for strategic play in the werewolf game

Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game. arXiv preprint arXiv:2310.18940, 2023

work page arXiv 2023

[49] [49]

Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety

Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, and Jing Shao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. arXiv preprint arXiv:2401.11880, 2024

work page arXiv 2024

[50] [50]

Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate

Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Wang. Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. arXiv preprint arXiv:2406.14711, 2024

work page arXiv 2024

[51] [51]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. 13 A Related works LLM-based Multi-agent System. LLM-MAS enhance the reasoning and task-solving capabilities of large language...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Factual Accuracy - Are the claims verifiably true and consistent with known facts?

work page

[54] [54]

Logical Consistency – Is the argument internally coherent and free of contradictions?

work page

[55] [55]

Relevance to Context – Does the content stay on topic and address the given question or prompt?

work page

[56] [56]

Neutrality & Bias – Is the language objective, balanced, and free from undue bias?

work page

[57] [57]

Language Quality – Is the prose fluent, grammatical, and properly punctuated?

work page

[58] [58]

17 D Attention analysis on additional models We present additional attention analysis on Qwen2.5-7B [51] and Gemma3-4B [52] to illustrate the generalization of our observation

Clarity & Precision – Are ideas expressed unambiguously and concisely? Scoring rules (per rubric): 0%=Poor, 100%=Excellent (Use percentage only.) Your tasks: Provide a brief justification (1-3 sentences) for each rubric score. 17 D Attention analysis on additional models We present additional attention analysis on Qwen2.5-7B [51] and Gemma3-4B [52] to ill...

work page

[59] [59]

(2) We develop A-Trust to evaluate the message trustworthiness leveraging the LLM’s attention mechanism

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: (1) We decompose trustworthiness into six dimensions following human communication literature and curate a dataset correspondingly. (2) We develop A-Trust to evaluate the message trustworthiness lev...

work page

[60] [60]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We summarize some limitations in our paper in Conclusion section and correspondingly suggest some potential future directions. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the ...

work page

[61] [61]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs 23 Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] Justification: No mathematical theory is included in this paper. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and...

work page

[62] [62]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main ex- perimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We include the details abo...

work page

[63] [63]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instruc- tions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide anonymized repo containing both data and code in Section 2.2 and 4.1. Guidelines: • The answer N...

work page

[64] [64]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyper- parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Justification: We include the details about the structure of the LLM-MAS considered in this paper, the prompt...

work page

[65] [65]

Guidelines: • The answer NA means that the paper does not include experiments

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: Through various settings, our proposed methods demonstrate consistent en- hancement compared to the baselines. Guidelines: • The an...

work page

[66] [66]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the com- puter resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We mention the details of server we run experiments on in Section 4.1, and include latency analysis in ...

work page

[67] [67]

Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethicshttps://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The research is conducted following the NeurIPS Code of Ethics. Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics....

work page

[68] [68]

There is no potential negative social impact based on our knowledge

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [No] Justification: We introduce a system to enhance the robustness of LLM-MAS. There is no potential negative social impact based on our knowledge. Guidelines: • The answer NA means that there is no societ...

work page

[69] [69]

There is no direct way to misuse them for malicious purposes

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our proposed methods aims to enhance the trustworthiness of LLM-MAS. There is no direct way to ...

work page

[70] [70]

Guidelines: • The answer NA means that the paper does not use existing assets

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We summarize the license in Section E. Guidelines: • The answer NA means that the paper does not...

work page

[71] [71]

The description of the dataset can be found in Section 2.1

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: We develop a new dataset, Trust Violation dataset, in our paper. The description of the dataset can be found in Section 2.1. We also demonstrate how to use the dataset in Section 2.2. Guidelines: ...

work page

[72] [72]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: There is no crowdsourcing experiments. Guidelines: • ...

work page

[73] [73]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[74] [74]

Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...

work page 2025