PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

Bingyu Yan; Chaozhuo Li; Jinyu Hou; Litian Zhang; Xiaoming Zhang; Xiaozhe Zhang; Ziyi Zhou

REVIEW 2 major objections 2 minor 1 cited by

PropGuard defends LLM multi-agent systems by tracing malicious instruction paths with a dual-view graph and trained inspector.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-20 22:29 UTC pith:RJ67T2UP

load-bearing objection PropGuard sketches a graph-based way to trace and remediate attack spread in LLM multi-agent systems, but the abstract supplies no numbers or implementation details to back the claims. the 2 major comments →

arxiv 2605.16346 v1 pith:RJ67T2UP submitted 2026-05-08 cs.LG cs.AIcs.CR

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

Bingyu Yan , Xiaoming Zhang , Jinyu Hou , Chaozhuo Li , Ziyi Zhou , Xiaozhe Zhang , Litian Zhang This is my paper

classification cs.LG cs.AIcs.CR

keywords LLM-MASmulti-agent systemspropagation defensesecuritygraph explorationremediationAI safety

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that LLM-based multi-agent systems can be protected from attacks in which harmful instructions spread through messages, tools, and memory across agents and interaction rounds. Existing local filters and anomaly detectors often miss the fine-grained paths or break normal collaboration when they try to fix problems. PropGuard builds a dual-view spatio-temporal graph that pairs response risk estimates with complete state records, then uses a GE-GRPO trained inspector to locate compact suspicious subgraphs. After verifying the harm, it applies source-guided fixes that clean the origin and replay the affected later steps. A reader would care if the method works because it would let complex collaborative AI tasks run reliably even when some inputs carry hidden threats.

Core claim

PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions.

What carries the argument

The dual-view spatio-temporal graph that merges response risk estimates with preserved full-state evidence, which guides the GE-GRPO inspector in recovering compact suspicious propagation subgraphs.

Load-bearing premise

The dual-view spatio-temporal graph and GE-GRPO inspector can accurately recover compact suspicious propagation subgraphs and enable effective source-guided remediation without materially disrupting benign agent collaboration or introducing unacceptable overhead.

What would settle it

A controlled experiment that injects known malicious instructions into a multi-agent system, then measures whether the recovered subgraphs exactly match the actual contamination paths and whether remediation restores correct task outputs without introducing new errors.

Watch this falsifier — get emailed when new claim-graph text bears on it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

PropGuard sketches a graph-based way to trace and remediate attack spread in LLM multi-agent systems, but the abstract supplies no numbers or implementation details to back the claims.

read the letter

PropGuard targets how malicious prompts can move through messages, tools, and memory in LLM multi-agent setups. It builds a dual-view spatio-temporal graph that mixes response risk scores with full state records, then trains a GE-GRPO inspector to walk the graph and pull out compact suspicious subgraphs before applying source fixes and downstream replays. That combination of risk-guided exploration and upstream remediation is the main new piece compared with simple local filters or off-the-shelf graph anomaly detectors. The paper does a clean job naming the gap in existing work and laying out a concrete pipeline that tries to preserve benign collaboration while cutting attack paths. Experiments are said to run across four architectures and five attack types, which at least shows an attempt at breadth. The soft spot is the complete absence of any quantitative results, baselines, precision numbers on subgraph recovery, or overhead measurements in the abstract. Without those, it is impossible to tell whether the risk priors hold up under LLM stochasticity or whether the inspector actually returns useful subgraphs instead of noisy ones. The stress-test concern about missed upstream nodes or disrupted benign flows therefore lands directly on the current text. This is for readers working on security for agent systems who want a propagation-focused alternative to local defenses. A person already thinking about graph methods in multi-agent safety could pull ideas from the framework description. The work deserves peer review because the problem is timely and the approach is described in enough detail to be evaluated and improved, even if the experimental section needs substantial strengthening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PropGuard, a propagation-aware defense framework for LLM-based multi-agent systems (LLM-MAS). It builds a dual-view spatio-temporal graph that fuses response-centric risk estimation with full-state evidence preservation. A GE-GRPO-trained inspector then sequentially explores this graph to extract compact suspicious propagation subgraphs. Subgraph-aware diagnosis verifies harmful propagation, after which source-guided remediation corrects upstream contamination and replays affected downstream interactions. Experiments across four communication architectures and five attack settings are reported to show reduced attack success rates while preserving high task-level defense success and a favorable effectiveness-efficiency trade-off.

Significance. If the empirical claims hold under rigorous controls, the work would be significant for AI safety and multi-agent systems research. It targets the under-addressed problem of fine-grained malicious propagation across agents and rounds, where local filtering and standard graph anomaly detection fall short. The integration of dual-view graph construction with reinforcement-learning-based sequential exploration and source-guided remediation offers a structured approach that could generalize beyond the tested settings. Credit is due for the explicit multi-architecture, multi-attack evaluation design, which supports claims of broad applicability if quantitative details are provided.

major comments (2)

[Abstract and §5] Abstract and §5 (Experiments): The central claim that PropGuard 'consistently lowers attack success while maintaining high task-level defense success' is load-bearing for the paper's contribution, yet the abstract supplies no numerical results, baselines, error bars, or statistical tests. Without these, it is impossible to assess effect sizes or rule out that improvements are marginal or architecture-specific.
[§4.2] §4.2 (Inspector and risk priors): The claim that risk priors from the dual-view graph enable the GE-GRPO inspector to recover compact suspicious subgraphs rests on the assumption that response-centric risk estimation remains stable under LLM stochasticity. No ablation or sensitivity analysis on temperature, prompt variation, or risk-threshold choice is described; if these priors shift, the sequential exploration order and termination could omit upstream nodes or include excessive benign edges, directly undermining both detection and remediation efficacy.

minor comments (2)

[§3] The acronym GE-GRPO is introduced without expansion or reference to its base algorithm; a brief definition or citation on first use would improve readability.
[Figures] Figure 3 (or equivalent architecture diagram) would benefit from explicit labels for the four communication topologies tested so readers can map results to specific interaction patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects for improving the clarity and robustness of our claims. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): The central claim that PropGuard 'consistently lowers attack success while maintaining high task-level defense success' is load-bearing for the paper's contribution, yet the abstract supplies no numerical results, baselines, error bars, or statistical tests. Without these, it is impossible to assess effect sizes or rule out that improvements are marginal or architecture-specific.

Authors: We agree that including quantitative details in the abstract would better support the central claim. The experiments in §5 report results across four architectures and five attack settings, with comparisons to baselines. In the revised manuscript, we have updated the abstract to include key numerical findings, such as the average reduction in attack success rates and maintained task success rates, along with references to error bars and statistical tests presented in the experimental section. revision: yes
Referee: [§4.2] §4.2 (Inspector and risk priors): The claim that risk priors from the dual-view graph enable the GE-GRPO inspector to recover compact suspicious subgraphs rests on the assumption that response-centric risk estimation remains stable under LLM stochasticity. No ablation or sensitivity analysis on temperature, prompt variation, or risk-threshold choice is described; if these priors shift, the sequential exploration order and termination could omit upstream nodes or include excessive benign edges, directly undermining both detection and remediation efficacy.

Authors: This is a valid concern regarding the robustness of the risk priors. The dual-view graph combines response-centric estimation with full-state evidence preservation to enhance stability. While the original submission used consistent LLM configurations, we acknowledge the lack of explicit sensitivity analysis. We have added an ablation study in the revised version varying temperature settings and prompt variations, demonstrating that the subgraph recovery remains effective and the performance metrics are stable within typical operational ranges. revision: yes

Circularity Check

0 steps flagged

No circularity: PropGuard is an empirical framework proposal with no derivation chain or equations that reduce outputs to inputs by construction.

full rationale

The paper introduces PropGuard as a novel propagation-aware defense for LLM-MAS, describing the construction of a dual-view spatio-temporal graph, a GE-GRPO trained inspector for sequential exploration of suspicious subgraphs, subgraph-aware diagnosis, and source-guided remediation. These elements are presented as original components whose effectiveness is assessed via experiments on four architectures and five attack settings. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text that would make any claimed result equivalent to its inputs by construction. The framework is self-contained as a proposed system evaluated empirically rather than derived from prior results in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

Abstract-only review yields limited visibility into parameters or proofs; the ledger records the core domain assumption and newly introduced components described at high level.

axioms (1)

domain assumption Malicious instructions injected through messages, tools, or memories can propagate across agents and rounds in LLM-MAS, causing system-level compromise.
This premise is stated directly as the motivating security risk in the abstract.

invented entities (3)

dual-view spatio-temporal graph no independent evidence
purpose: Combines response-centric risk estimation with full-state evidence preservation to guide propagation tracking.
Newly proposed structure in the PropGuard framework.
GE-GRPO trained inspector no independent evidence
purpose: Sequentially explores the full-state graph to recover compact suspicious propagation subgraphs.
Trained component introduced for guided exploration.
source-guided remediation no independent evidence
purpose: Corrects upstream contamination and replays affected downstream interactions.
Remediation strategy presented as part of the framework.

pith-pipeline@v0.9.0 · 5752 in / 1542 out tokens · 139934 ms · 2026-05-20T22:29:34.286679+00:00 · methodology

0 comments

read the original abstract

LLM-based multi-agent systems (LLM-MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks that malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system-level compromise. Existing defenses largely rely on local filtering or graph-based anomaly detection, but they often fail to trace fine-grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation-aware framework for safeguarding LLM-MAS. PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task-level defense success, achieving a favorable effectiveness--efficiency trade-off.

Figures

Figures reproduced from arXiv: 2605.16346 by Bingyu Yan, Chaozhuo Li, Jinyu Hou, Litian Zhang, Xiaoming Zhang, Xiaozhe Zhang, Ziyi Zhou.

**Figure 2.** Figure 2: Scalability analysis on the Random-CSQA setting. Left: varying the number of agents. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Effectiveness–efficiency tradeoff under the random architecture. We evaluate the efficiency of PropGuard under the random architecture across five attack settings. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Sensitivity analysis on the number of top-K seeds under the Random-CSQA. Bars denote MDSR and the line denotes ASR. We study the sensitivity of PropGuard to the number of top-K seeds used for suspicious subgraph exploration. The seed number controls the initial coverage of high-risk regions: using too few seeds may miss malicious propagation traces, while using too many seeds may introduce irrelevant ben… view at source ↗

**Figure 5.** Figure 5: Attack source coverage and matching quality across attack settings. Each value is averaged [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement
cs.MA 2026-06 unverdicted novelty 7.0

Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placemen...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Easytool: Enhancing llm-based agents with concise tool instruction

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages ...

work page 2025
[2]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, and Xiaoming Zhang. Beyond self-talk: A communication-centric survey of llm-based multi-agent systems.arXiv preprint arXiv:2502.14321, 2025

work page internal anchor Pith review arXiv 2025
[4]

Deep Research Agents: A Systematic Examination And Roadmap

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025

work page Pith review arXiv 2025
[5]

Codecor: An llm-based self-reflective multi-agent framework for code generation (2025).arXiv preprint arXiv:2501.07811

Ruwei Pan, Hongyu Zhang, and Chao Liu. Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811, 2025

work page arXiv 2025
[6]

Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

work page 2026
[7]

Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas

Bingyu Yan, Xiaoming Zhang, Ziyi Zhou, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, and Litian Zhang. Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29784–29792, 2026

work page 2026
[8]

A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025

work page 2025
[9]

Cheryl Lee, Chunqiu Steven Xia, Longji Yang, Jen- tse Huang, Zhouruixing Zhu, Lingming Zhang, and Michael R Lyu

Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, and Meng Han. Web fraud attacks against llm-driven multi-agent systems.arXiv preprint arXiv:2509.01211, 2025

work page arXiv 2025
[10]

Lyu, and Maarten Sap

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989, 2024

work page arXiv 2024
[11]

Sentinelnet: Safeguarding multi-agent collaboration through credit-based dynamic threat detection.arXiv preprint arXiv:2510.16219, 2025

Yang Feng and Xudong Pan. Sentinelnet: Safeguarding multi-agent collaboration through credit-based dynamic threat detection.arXiv preprint arXiv:2510.16219, 2025

work page arXiv 2025
[12]

Who’s the mole? modeling and detecting intention-hiding malicious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, and Chi Liu. Who’s the mole? modeling and detecting intention-hiding malicious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

work page arXiv 2025
[13]

G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems

Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7261–7276, 2025

work page 2025
[14]

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, and Xin Wang. Blindguard: Safeguarding llm-based multi-agent systems under unknown attacks.arXiv preprint arXiv:2508.08127, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

arXiv preprint arXiv:2512.18733 , year=

Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, and Shirui Pan. Explainable and fine-grained safeguarding of llm multi-agent systems via bi-level graph anomaly detection.arXiv preprint arXiv:2512.18733, 2025. 11

work page arXiv 2025
[16]

Infa-guard: Mitigating malicious propagation via infection-aware safeguarding in llm-based multi-agent systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, and Jing Shao. Infa-guard: Mitigating malicious propagation via infection-aware safeguarding in llm-based multi-agent systems. arXiv preprint arXiv:2601.14667, 2026

work page arXiv 2026
[17]

arXiv preprint arXiv:2505.19234 , year=

Jialong Zhou, Lichao Wang, and Xiao Yang. Guardian: Safeguarding llm multi-agent collabora- tions with temporal graph modeling.arXiv preprint arXiv:2505.19234, 2025

work page arXiv 2025
[18]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

work page 2019
[19]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[20]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

work page 2024
[22]

Poison-rag: Adversarial data poisoning attacks on retrieval-augmented generation in recommender systems

Fatemeh Nazary, Yashar Deldjoo, and Tommaso di Noia. Poison-rag: Adversarial data poisoning attacks on retrieval-augmented generation in recommender systems. InEuropean Conference on Information Retrieval, pages 239–251. Springer, 2025

work page 2025
[23]

AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, and Qingsong Wen. Agentsafe: Safeguarding large language model-based multi- agent systems via hierarchical data management.arXiv preprint arXiv:2503.04392, 2025

work page Pith review arXiv 2025
[24]

Gpt-4o mini: Advancing cost-efficient intelligence

OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/ , 2024. Accessed: 2026-05- 01

work page 2024
[25]

Qwen3.5-4b

Qwen Team. Qwen3.5-4b. https://huggingface.co/Qwen/Qwen3.5-4B, 2026. Accessed: 2026-05-01

work page 2026
[26]

arXiv preprint arXiv:2407.07791 , year =

Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, and Gongshen Liu. Flooding spread of manipulated knowledge in llm-based multi-agent communities.arXiv preprint arXiv:2407.07791, 2024

work page arXiv 2024
[27]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

work page 2025
[28]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

work page 2024
[30]

Thinkguard: Deliberative slow thinking leads to cautious guardrails

Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. Thinkguard: Deliberative slow thinking leads to cautious guardrails. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13698–13713, 2025

work page 2025
[31]

Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025. 12 Appendix Contents A Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

The current suspicious subgraph, containing nodes already selected as suspicious

work page
[33]

The frontier, i.e., visible nodes adjacent to the suspicious subgraph that can be selected for expansion

work page
[34]

Full-state information for visible nodes, including responses, incoming messages, memory contents, and tool outputs

work page
[35]

Risk prior scores for each node, where higher scores indicate higher suspiciousness

work page
[36]

thinking

Edge connections in the visible region. Exploration objective.Your goal is to thoroughly trace the full propagation chain of malicious influence across all interaction rounds. You should select frontier nodes that may have been influenced by compromised nodes, even when their risk prior scores are only moderate. Pay special attention to later rounds, such...

work page
[37]

The corresponding risk source ismemory

The agent’s memory contains injected contextual passages, such as fabricated references or authoritative-sounding but false contexts, that other agents do not have, and the agent’s answer follows these injected contexts. The corresponding risk source ismemory. 26 Diagnosis Prompt (continued)

work page
[38]

The corresponding risk source isself

The agent is actively trying to manipulate other agents’ answers or reasoning, using persuasion tactics or asserting answers without evidence. The corresponding risk source isself

work page
[39]

The corresponding risk source ismsg

The agent’s incoming messages contain adversarial content from a compromised neighbor, and the agent has adopted the manipulated position. The corresponding risk source ismsg

work page
[40]

The corresponding risk source istool

The agent receives corrupted tool outputs and blindly follows them. The corresponding risk source istool. Non-compromise criteria.A node should not be marked as compromised if: •The agent shows genuine uncertainty or discusses multiple viewpoints without pushing a specific wrong answer. •The agent’s memory is clean, containing only the task query, and its...

work page
[41]

is_compromised

failure_mode: the type of compromise, selected from answer_manipulation, instruction_injection, reasoning_corruption,information_leak, andtask_deviation. Output format.Respond only in the following JSON format: { "is_compromised": true/false, "risk_source": "msg | tool | memory | self", "failure_mode": "answer_manipulation | instruction_injection | reason...

work page
[42]

Do not mention in thenew_responsethat an attack was detected or that remediation occurred

work page
[43]

Do not quote, summarize, or preserve any malicious instructions, poisoned passages, or corrupted tool content

work page
[44]

Preserve benign task-relevant information whenever possible

work page
[45]

Respond as if you are a normal, helpful agent completing the user’s original task

work page
[46]

Keep thenew_responseconcise and task-focused

work page
[47]

remediation_action

Thenew_responseis mandatory for allrisk_sourcetypes. Output format.Respond only in the following JSON format: { "remediation_action": "regenerate_response | sanitize_memory_then_regenerate | discard_tool_output_then_regenerate", "clean_memory": "sanitized memory, or null if risk_source is not memory", "clean_tool_output": "clean tool output, null, or NEED...

work page

[1] [1]

Easytool: Enhancing llm-based agents with concise tool instruction

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages ...

work page 2025

[2] [2]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, and Xiaoming Zhang. Beyond self-talk: A communication-centric survey of llm-based multi-agent systems.arXiv preprint arXiv:2502.14321, 2025

work page internal anchor Pith review arXiv 2025

[4] [4]

Deep Research Agents: A Systematic Examination And Roadmap

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025

work page Pith review arXiv 2025

[5] [5]

Codecor: An llm-based self-reflective multi-agent framework for code generation (2025).arXiv preprint arXiv:2501.07811

Ruwei Pan, Hongyu Zhang, and Chao Liu. Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811, 2025

work page arXiv 2025

[6] [6]

Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

work page 2026

[7] [7]

Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas

Bingyu Yan, Xiaoming Zhang, Ziyi Zhou, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, and Litian Zhang. Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29784–29792, 2026

work page 2026

[8] [8]

A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025

work page 2025

[9] [9]

Cheryl Lee, Chunqiu Steven Xia, Longji Yang, Jen- tse Huang, Zhouruixing Zhu, Lingming Zhang, and Michael R Lyu

Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, and Meng Han. Web fraud attacks against llm-driven multi-agent systems.arXiv preprint arXiv:2509.01211, 2025

work page arXiv 2025

[10] [10]

Lyu, and Maarten Sap

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989, 2024

work page arXiv 2024

[11] [11]

Sentinelnet: Safeguarding multi-agent collaboration through credit-based dynamic threat detection.arXiv preprint arXiv:2510.16219, 2025

Yang Feng and Xudong Pan. Sentinelnet: Safeguarding multi-agent collaboration through credit-based dynamic threat detection.arXiv preprint arXiv:2510.16219, 2025

work page arXiv 2025

[12] [12]

Who’s the mole? modeling and detecting intention-hiding malicious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, and Chi Liu. Who’s the mole? modeling and detecting intention-hiding malicious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025

work page arXiv 2025

[13] [13]

G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems

Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7261–7276, 2025

work page 2025

[14] [14]

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, and Xin Wang. Blindguard: Safeguarding llm-based multi-agent systems under unknown attacks.arXiv preprint arXiv:2508.08127, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

arXiv preprint arXiv:2512.18733 , year=

Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, and Shirui Pan. Explainable and fine-grained safeguarding of llm multi-agent systems via bi-level graph anomaly detection.arXiv preprint arXiv:2512.18733, 2025. 11

work page arXiv 2025

[16] [16]

Infa-guard: Mitigating malicious propagation via infection-aware safeguarding in llm-based multi-agent systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, and Jing Shao. Infa-guard: Mitigating malicious propagation via infection-aware safeguarding in llm-based multi-agent systems. arXiv preprint arXiv:2601.14667, 2026

work page arXiv 2026

[17] [17]

arXiv preprint arXiv:2505.19234 , year=

Jialong Zhou, Lichao Wang, and Xiao Yang. Guardian: Safeguarding llm multi-agent collabora- tions with temporal graph modeling.arXiv preprint arXiv:2505.19234, 2025

work page arXiv 2025

[18] [18]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

work page 2019

[19] [19]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[20] [20]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[21] [21]

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

work page 2024

[22] [22]

Poison-rag: Adversarial data poisoning attacks on retrieval-augmented generation in recommender systems

Fatemeh Nazary, Yashar Deldjoo, and Tommaso di Noia. Poison-rag: Adversarial data poisoning attacks on retrieval-augmented generation in recommender systems. InEuropean Conference on Information Retrieval, pages 239–251. Springer, 2025

work page 2025

[23] [23]

AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, and Qingsong Wen. Agentsafe: Safeguarding large language model-based multi- agent systems via hierarchical data management.arXiv preprint arXiv:2503.04392, 2025

work page Pith review arXiv 2025

[24] [24]

Gpt-4o mini: Advancing cost-efficient intelligence

OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/ , 2024. Accessed: 2026-05- 01

work page 2024

[25] [25]

Qwen3.5-4b

Qwen Team. Qwen3.5-4b. https://huggingface.co/Qwen/Qwen3.5-4B, 2026. Accessed: 2026-05-01

work page 2026

[26] [26]

arXiv preprint arXiv:2407.07791 , year =

Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, and Gongshen Liu. Flooding spread of manipulated knowledge in llm-based multi-agent communities.arXiv preprint arXiv:2407.07791, 2024

work page arXiv 2024

[27] [27]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

work page 2025

[28] [28]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

work page 2024

[30] [30]

Thinkguard: Deliberative slow thinking leads to cautious guardrails

Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. Thinkguard: Deliberative slow thinking leads to cautious guardrails. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13698–13713, 2025

work page 2025

[31] [31]

Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025. 12 Appendix Contents A Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

The current suspicious subgraph, containing nodes already selected as suspicious

work page

[33] [33]

The frontier, i.e., visible nodes adjacent to the suspicious subgraph that can be selected for expansion

work page

[34] [34]

Full-state information for visible nodes, including responses, incoming messages, memory contents, and tool outputs

work page

[35] [35]

Risk prior scores for each node, where higher scores indicate higher suspiciousness

work page

[36] [36]

thinking

Edge connections in the visible region. Exploration objective.Your goal is to thoroughly trace the full propagation chain of malicious influence across all interaction rounds. You should select frontier nodes that may have been influenced by compromised nodes, even when their risk prior scores are only moderate. Pay special attention to later rounds, such...

work page

[37] [37]

The corresponding risk source ismemory

The agent’s memory contains injected contextual passages, such as fabricated references or authoritative-sounding but false contexts, that other agents do not have, and the agent’s answer follows these injected contexts. The corresponding risk source ismemory. 26 Diagnosis Prompt (continued)

work page

[38] [38]

The corresponding risk source isself

The agent is actively trying to manipulate other agents’ answers or reasoning, using persuasion tactics or asserting answers without evidence. The corresponding risk source isself

work page

[39] [39]

The corresponding risk source ismsg

The agent’s incoming messages contain adversarial content from a compromised neighbor, and the agent has adopted the manipulated position. The corresponding risk source ismsg

work page

[40] [40]

The corresponding risk source istool

The agent receives corrupted tool outputs and blindly follows them. The corresponding risk source istool. Non-compromise criteria.A node should not be marked as compromised if: •The agent shows genuine uncertainty or discusses multiple viewpoints without pushing a specific wrong answer. •The agent’s memory is clean, containing only the task query, and its...

work page

[41] [41]

is_compromised

failure_mode: the type of compromise, selected from answer_manipulation, instruction_injection, reasoning_corruption,information_leak, andtask_deviation. Output format.Respond only in the following JSON format: { "is_compromised": true/false, "risk_source": "msg | tool | memory | self", "failure_mode": "answer_manipulation | instruction_injection | reason...

work page

[42] [42]

Do not mention in thenew_responsethat an attack was detected or that remediation occurred

work page

[43] [43]

Do not quote, summarize, or preserve any malicious instructions, poisoned passages, or corrupted tool content

work page

[44] [44]

Preserve benign task-relevant information whenever possible

work page

[45] [45]

Respond as if you are a normal, helpful agent completing the user’s original task

work page

[46] [46]

Keep thenew_responseconcise and task-focused

work page

[47] [47]

remediation_action

Thenew_responseis mandatory for allrisk_sourcetypes. Output format.Respond only in the following JSON format: { "remediation_action": "regenerate_response | sanitize_memory_then_regenerate | discard_tool_output_then_regenerate", "clean_memory": "sanitized memory, or null if risk_source is not memory", "clean_tool_output": "clean tool output, null, or NEED...

work page