PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
Pith reviewed 2026-05-20 22:29 UTC · model grok-4.3
The pith
PropGuard defends LLM multi-agent systems by tracing malicious instruction paths with a dual-view graph and trained inspector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions.
What carries the argument
The dual-view spatio-temporal graph that merges response risk estimates with preserved full-state evidence, which guides the GE-GRPO inspector in recovering compact suspicious propagation subgraphs.
Load-bearing premise
The dual-view spatio-temporal graph and GE-GRPO inspector can accurately recover compact suspicious propagation subgraphs and enable effective source-guided remediation without materially disrupting benign agent collaboration or introducing unacceptable overhead.
What would settle it
A controlled experiment that injects known malicious instructions into a multi-agent system, then measures whether the recovered subgraphs exactly match the actual contamination paths and whether remediation restores correct task outputs without introducing new errors.
Figures
read the original abstract
LLM-based multi-agent systems (LLM-MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks that malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system-level compromise. Existing defenses largely rely on local filtering or graph-based anomaly detection, but they often fail to trace fine-grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation-aware framework for safeguarding LLM-MAS. PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task-level defense success, achieving a favorable effectiveness--efficiency trade-off.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PropGuard, a propagation-aware defense framework for LLM-based multi-agent systems (LLM-MAS). It builds a dual-view spatio-temporal graph that fuses response-centric risk estimation with full-state evidence preservation. A GE-GRPO-trained inspector then sequentially explores this graph to extract compact suspicious propagation subgraphs. Subgraph-aware diagnosis verifies harmful propagation, after which source-guided remediation corrects upstream contamination and replays affected downstream interactions. Experiments across four communication architectures and five attack settings are reported to show reduced attack success rates while preserving high task-level defense success and a favorable effectiveness-efficiency trade-off.
Significance. If the empirical claims hold under rigorous controls, the work would be significant for AI safety and multi-agent systems research. It targets the under-addressed problem of fine-grained malicious propagation across agents and rounds, where local filtering and standard graph anomaly detection fall short. The integration of dual-view graph construction with reinforcement-learning-based sequential exploration and source-guided remediation offers a structured approach that could generalize beyond the tested settings. Credit is due for the explicit multi-architecture, multi-attack evaluation design, which supports claims of broad applicability if quantitative details are provided.
major comments (2)
- [Abstract and §5] Abstract and §5 (Experiments): The central claim that PropGuard 'consistently lowers attack success while maintaining high task-level defense success' is load-bearing for the paper's contribution, yet the abstract supplies no numerical results, baselines, error bars, or statistical tests. Without these, it is impossible to assess effect sizes or rule out that improvements are marginal or architecture-specific.
- [§4.2] §4.2 (Inspector and risk priors): The claim that risk priors from the dual-view graph enable the GE-GRPO inspector to recover compact suspicious subgraphs rests on the assumption that response-centric risk estimation remains stable under LLM stochasticity. No ablation or sensitivity analysis on temperature, prompt variation, or risk-threshold choice is described; if these priors shift, the sequential exploration order and termination could omit upstream nodes or include excessive benign edges, directly undermining both detection and remediation efficacy.
minor comments (2)
- [§3] The acronym GE-GRPO is introduced without expansion or reference to its base algorithm; a brief definition or citation on first use would improve readability.
- [Figures] Figure 3 (or equivalent architecture diagram) would benefit from explicit labels for the four communication topologies tested so readers can map results to specific interaction patterns.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects for improving the clarity and robustness of our claims. We address each major comment below and indicate the revisions made to the manuscript.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (Experiments): The central claim that PropGuard 'consistently lowers attack success while maintaining high task-level defense success' is load-bearing for the paper's contribution, yet the abstract supplies no numerical results, baselines, error bars, or statistical tests. Without these, it is impossible to assess effect sizes or rule out that improvements are marginal or architecture-specific.
Authors: We agree that including quantitative details in the abstract would better support the central claim. The experiments in §5 report results across four architectures and five attack settings, with comparisons to baselines. In the revised manuscript, we have updated the abstract to include key numerical findings, such as the average reduction in attack success rates and maintained task success rates, along with references to error bars and statistical tests presented in the experimental section. revision: yes
-
Referee: [§4.2] §4.2 (Inspector and risk priors): The claim that risk priors from the dual-view graph enable the GE-GRPO inspector to recover compact suspicious subgraphs rests on the assumption that response-centric risk estimation remains stable under LLM stochasticity. No ablation or sensitivity analysis on temperature, prompt variation, or risk-threshold choice is described; if these priors shift, the sequential exploration order and termination could omit upstream nodes or include excessive benign edges, directly undermining both detection and remediation efficacy.
Authors: This is a valid concern regarding the robustness of the risk priors. The dual-view graph combines response-centric estimation with full-state evidence preservation to enhance stability. While the original submission used consistent LLM configurations, we acknowledge the lack of explicit sensitivity analysis. We have added an ablation study in the revised version varying temperature settings and prompt variations, demonstrating that the subgraph recovery remains effective and the performance metrics are stable within typical operational ranges. revision: yes
Circularity Check
No circularity: PropGuard is an empirical framework proposal with no derivation chain or equations that reduce outputs to inputs by construction.
full rationale
The paper introduces PropGuard as a novel propagation-aware defense for LLM-MAS, describing the construction of a dual-view spatio-temporal graph, a GE-GRPO trained inspector for sequential exploration of suspicious subgraphs, subgraph-aware diagnosis, and source-guided remediation. These elements are presented as original components whose effectiveness is assessed via experiments on four architectures and five attack settings. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text that would make any claimed result equivalent to its inputs by construction. The framework is self-contained as a proposed system evaluated empirically rather than derived from prior results in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Malicious instructions injected through messages, tools, or memories can propagate across agents and rounds in LLM-MAS, causing system-level compromise.
invented entities (3)
-
dual-view spatio-temporal graph
no independent evidence
-
GE-GRPO trained inspector
no independent evidence
-
source-guided remediation
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Easytool: Enhancing llm-based agents with concise tool instruction
Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages ...
work page 2025
-
[2]
Memory in the Age of AI Agents
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent S ystems
Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, and Xiaoming Zhang. Beyond self-talk: A communication-centric survey of llm-based multi-agent systems.arXiv preprint arXiv:2502.14321, 2025
-
[4]
Deep Research Agents: A Systematic Examination And Roadmap, September 2025
Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.arXiv preprint arXiv:2506.18096, 2025
-
[5]
Ruwei Pan, Hongyu Zhang, and Chao Liu. Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811, 2025
-
[6]
Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026
work page 2026
-
[7]
Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas
Bingyu Yan, Xiaoming Zhang, Ziyi Zhou, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, and Litian Zhang. Attack the messages, not the agents: A multi-round adaptive stealthy tampering framework for llm-mas. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29784–29792, 2026
work page 2026
-
[8]
A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025
Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. A practical memory injection attack against llm agents.arXiv e-prints, pages arXiv–2503, 2025
work page 2025
-
[9]
Web fraud attacks against llm-driven multi-agent systems.arXiv preprint arXiv:2509.01211, 2025
Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, and Meng Han. Web fraud attacks against llm-driven multi-agent systems.arXiv preprint arXiv:2509.01211, 2025
-
[10]
Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989, 2024
-
[11]
Yang Feng and Xudong Pan. Sentinelnet: Safeguarding multi-agent collaboration through credit-based dynamic threat detection.arXiv preprint arXiv:2510.16219, 2025
-
[12]
Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, and Chi Liu. Who’s the mole? modeling and detecting intention-hiding malicious agents in llm-based multi-agent systems.arXiv preprint arXiv:2507.04724, 2025
-
[13]
G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems
Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. G-safeguard: A topology-guided security lens and treatment on llm- based multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7261–7276, 2025
work page 2025
-
[14]
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, and Xin Wang. Blindguard: Safeguarding llm-based multi-agent systems under unknown attacks.arXiv preprint arXiv:2508.08127, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, and Shirui Pan. Explainable and fine-grained safeguarding of llm multi-agent systems via bi-level graph anomaly detection.arXiv preprint arXiv:2512.18733, 2025. 11
-
[16]
Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, and Jing Shao. Infa-guard: Mitigating malicious propagation via infection-aware safeguarding in llm-based multi-agent systems. arXiv preprint arXiv:2601.14667, 2026
-
[17]
Jialong Zhou, Lichao Wang, and Xiao Yang. Guardian: Safeguarding llm multi-agent collabora- tions with temporal graph modeling.arXiv preprint arXiv:2505.19234, 2025
-
[18]
Commonsenseqa: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019
work page 2019
-
[19]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[20]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024
work page 2024
-
[22]
Fatemeh Nazary, Yashar Deldjoo, and Tommaso di Noia. Poison-rag: Adversarial data poisoning attacks on retrieval-augmented generation in recommender systems. InEuropean Conference on Information Retrieval, pages 239–251. Springer, 2025
work page 2025
-
[23]
Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, and Qingsong Wen. Agentsafe: Safeguarding large language model-based multi- agent systems via hierarchical data management.arXiv preprint arXiv:2503.04392, 2025
-
[24]
Gpt-4o mini: Advancing cost-efficient intelligence
OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/ , 2024. Accessed: 2026-05- 01
work page 2024
-
[25]
Qwen Team. Qwen3.5-4b. https://huggingface.co/Qwen/Qwen3.5-4B, 2026. Accessed: 2026-05-01
work page 2026
-
[26]
Jocher, G.; Chaurasia, A.; and Qiu, J
Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, and Gongshen Liu. Flooding spread of manipulated knowledge in llm-based multi-agent communities.arXiv preprint arXiv:2407.07791, 2024
-
[27]
Hybridflow: A flexible and efficient rlhf framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025
work page 2025
-
[28]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024
work page 2024
-
[30]
Thinkguard: Deliberative slow thinking leads to cautious guardrails
Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. Thinkguard: Deliberative slow thinking leads to cautious guardrails. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13698–13713, 2025
work page 2025
-
[31]
Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025. 12 Appendix Contents A Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
The current suspicious subgraph, containing nodes already selected as suspicious
-
[33]
The frontier, i.e., visible nodes adjacent to the suspicious subgraph that can be selected for expansion
-
[34]
Full-state information for visible nodes, including responses, incoming messages, memory contents, and tool outputs
-
[35]
Risk prior scores for each node, where higher scores indicate higher suspiciousness
-
[36]
Edge connections in the visible region. Exploration objective.Your goal is to thoroughly trace the full propagation chain of malicious influence across all interaction rounds. You should select frontier nodes that may have been influenced by compromised nodes, even when their risk prior scores are only moderate. Pay special attention to later rounds, such...
-
[37]
The corresponding risk source ismemory
The agent’s memory contains injected contextual passages, such as fabricated references or authoritative-sounding but false contexts, that other agents do not have, and the agent’s answer follows these injected contexts. The corresponding risk source ismemory. 26 Diagnosis Prompt (continued)
-
[38]
The corresponding risk source isself
The agent is actively trying to manipulate other agents’ answers or reasoning, using persuasion tactics or asserting answers without evidence. The corresponding risk source isself
-
[39]
The corresponding risk source ismsg
The agent’s incoming messages contain adversarial content from a compromised neighbor, and the agent has adopted the manipulated position. The corresponding risk source ismsg
-
[40]
The corresponding risk source istool
The agent receives corrupted tool outputs and blindly follows them. The corresponding risk source istool. Non-compromise criteria.A node should not be marked as compromised if: •The agent shows genuine uncertainty or discusses multiple viewpoints without pushing a specific wrong answer. •The agent’s memory is clean, containing only the task query, and its...
-
[41]
failure_mode: the type of compromise, selected from answer_manipulation, instruction_injection, reasoning_corruption,information_leak, andtask_deviation. Output format.Respond only in the following JSON format: { "is_compromised": true/false, "risk_source": "msg | tool | memory | self", "failure_mode": "answer_manipulation | instruction_injection | reason...
-
[42]
Do not mention in thenew_responsethat an attack was detected or that remediation occurred
-
[43]
Do not quote, summarize, or preserve any malicious instructions, poisoned passages, or corrupted tool content
-
[44]
Preserve benign task-relevant information whenever possible
-
[45]
Respond as if you are a normal, helpful agent completing the user’s original task
-
[46]
Keep thenew_responseconcise and task-focused
-
[47]
Thenew_responseis mandatory for allrisk_sourcetypes. Output format.Respond only in the following JSON format: { "remediation_action": "regenerate_response | sanitize_memory_then_regenerate | discard_tool_output_then_regenerate", "clean_memory": "sanitized memory, or null if risk_source is not memory", "clean_tool_output": "clean tool output, null, or NEED...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.