arxiv: 2605.01758 · v3 · submitted 2026-05-03 · 💻 cs.AI

Recognition: no theorem link

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Yue Ma , Ziyuan Yang , Yi Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systemsinfectious jailbreakjailbreak defenseforesight simulationlocal purificationmulti-persona predictionlarge multimodal modelsagent interaction security

0 comments

The pith

A training-free foresight method lets agents in multi-agent systems detect and locally purify infectious jailbreaks before they spread.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large multimodal model-based multi-agent systems face infectious jailbreaks that compromise one agent and then spread through interactions. The paper proposes Foresight-Guided Local Purification, a training-free framework in which each agent simulates future behavioral trajectories using multiple personas to identify infections via inconsistencies in retrieval and semantic predictions. Detected infections are then removed locally through immediate rollback or recursive binary diagnosis on interaction history. This approach targets the localized origins of infections unlike prior global cure factors that only suppress responses superficially. Experiments show the method cuts maximum cumulative infection rates from above 95 percent to below 5.47 percent while keeping retrieval and semantic metrics close to benign baselines.

Core claim

The core discovery is that agents can track behavioral evolution by simulating future interaction trajectories with a multi-persona strategy, using response diversity inconsistencies at retrieval-result and semantic levels as a diagnostic signal to detect infection, then apply localized purification via immediate album rollback for recent cases and Recursive Binary Diagnosis to recursively partition and cleanse persistent VirAEs.

What carries the argument

Foresight-Guided Local Purification (FLP) that simulates future chat rounds via multi-persona predictions to detect infections through diversity inconsistencies and performs targeted local purification.

If this is right

Infections remain localized and do not propagate across the agent population.
Response diversity is maintained at levels comparable to uninfected systems.
Defense requires no retraining or modification of the underlying models.
Localized foresight outperforms global shared cure factors in containing spread.
Recursive partitioning isolates and removes long-term infections without broad rollback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other collaborative threats where one compromised node affects others.
Simulation fidelity may vary in highly dynamic or adversarial environments beyond the tested cases.
Training-free local detection suggests easier integration into existing multi-agent deployments.
Predictive modeling of interaction trajectories could become a general tool for proactive agent security.

Load-bearing premise

Inconsistencies across multi-persona simulation predictions will accurately flag real infections without too many false positives and the simulations will faithfully model actual future interaction dynamics.

What would settle it

Running the method on a live multi-agent system under active infectious jailbreak attempts and observing either a maximum cumulative infection rate above 5.47 percent or retrieval and semantic metrics that deviate substantially from benign baselines.

Figures

Figures reproduced from arXiv: 2605.01758 by Yi Zhang, Yue Ma, Ziyuan Yang.

**Figure 2.** Figure 2: Overview of the FLP framework. The framework comprises three stages: Multi-Persona Simulation, Infection Diagnosis, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Retrieval coverage across methods and |B|. Proposition 4.1 (Infection Propagation Dynamics). If T (𝑟 | 𝑏) > 2T (𝑏 | 𝑟), the infection rate monotonically increases, indicating the persisting existence of the infected agents and leading the system to a widespread compromise. This formulation reveals that the retrieval mechanism fundamentally amplifies the propagation. Once VirAEs enter the image album B, th… view at source ↗

**Figure 4.** Figure 4: Transmission dynamics and system states. (a) Evo [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Performance of Purification Strategies. 0 10 20 30 40 50 60 Round Attacked Cowpox FLP Baseline Method 0 1 2 3 4 5 6 Entropy 5.80 1.93 0.00 0.00 0.00 0.00 0.00 5.80 2.39 1.00 0.50 0.12 0.12 0.00 5.78 4.85 3.46 2.52 1.97 2.06 1.85 5.83 4.54 3.53 2.82 2.26 1.50 0.66 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Evolution of Retrieval Entropy 𝐸𝑟𝑒𝑡 . where 𝜏ℎ and 𝜏𝑠 are defined as the 𝛼-quantiles of retrieval entropy and semantic diversity over benign multi-agent systems, respectively. They are obtained by sorting the corresponding metric values from benign agents and selecting the value below which a proportion 𝛼 of samples fall, where 𝛼 controls the threshold sensitivity. When 𝐹𝑖𝑛 𝑓 ,𝑖 = 1, G𝑖 is considered inf… view at source ↗

**Figure 7.** Figure 7: Evolution of the semantic landscape. We visualize the 3D t-SNE embeddings of the agents’ interaction responses [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 9.** Figure 9: The Impact of the image album length |B|. 0 10 20 30 Round 0.0 0.5 1.0 1.5 Rate (%) Current Infection Rate 0 10 20 30 Round 0.0 0.5 1.0 1.5 2.0 2.5 Cumulative Infection Rate r0=4/128 r0=8/128 r0=16/128 r0=32/128 [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: The Impact of the Initial Infection Ratio [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

read the original abstract

Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreak arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proposes a local, simulation-based defense for infectious jailbreaks in MAS that avoids global homogenization, but the experimental claims rest on unreported details and unvalidated assumptions about simulation fidelity.

read the letter

The main takeaway is that this work moves away from training global cure factors toward a training-free local purification method. Agents run foresight simulations of future interactions using multiple personas, then flag infections via inconsistencies in retrieval and semantic responses before applying targeted rollback or recursive binary diagnosis to clean up VirAEs. This directly addresses how infections spread through localized chat behaviors rather than uniformly across the system. The framing of why shared cures only suppress rather than recover is clear and useful. The reported drop from over 95% to under 5.47% infection rate while keeping metrics near benign baselines shows the intended outcome if the setup works. The soft spots are in the evidence. The abstract states strong quantitative results without baselines, trial counts, error bars, or statistical tests, so it is impossible to tell whether the numbers hold up or generalize. The central assumption—that multi-persona simulation inconsistencies reliably detect real infections without excessive false positives from benign variance or short-horizon noise—receives no supporting checks such as simulation-to-real divergence measures or ROC curves on clean runs. If those conditions do not hold, the gains could be artifacts of the specific test MAS and infection model. This is for researchers working on practical security for collaborative multi-agent systems. A reader interested in new defense angles would get value from the conceptual shift even before the experiments are filled in. It deserves peer review so the details and assumptions can be examined properly.

Referee Report

3 major / 1 minor

Summary. The paper proposes a training-free Foresight-Guided Local Purification (FLP) framework for defending multimodal model-based Multi-Agent Systems against infectious jailbreaks. It contrasts prior global cure-factor methods with a localized approach: each agent runs multi-persona simulations of future interaction trajectories, detects infections via inconsistencies in retrieval-result and semantic diversity, and applies targeted purification (immediate album rollback for recent infections; Recursive Binary Diagnosis (RBD) for long-term ones). Experiments are reported to reduce maximum cumulative infection rate from >95% to <5.47% while keeping retrieval and semantic metrics close to benign baselines.

Significance. If the results are reproducible, the work would be a meaningful contribution to MAS security: it supplies a training-free, foresight-based defense that targets localized interaction behaviors rather than imposing global homogenization. The explicit use of multi-persona simulation for diagnosis and the distinction between short- and long-term infection handling are concrete technical strengths.

major comments (3)

[Abstract] Abstract: the central quantitative claim (reduction of maximum cumulative infection rate from >95% to <5.47%) is stated without any description of experimental setup, baselines, number of trials, statistical tests, or error bars, rendering it impossible to assess whether the data support the reported performance.
[Method] Method section (multi-persona simulation and detection): the detection strategy assumes that observed inconsistencies across persona-based predictions reliably indicate infection rather than benign persona variance or short-horizon noise, yet no validation is supplied (e.g., simulation-to-real trajectory divergence metrics or ROC analysis on held-out benign runs) to confirm low false-positive rates or faithful modeling of real MAS dynamics.
[Experiments] Experiments section: the claim that retrieval and semantic metrics 'closely match benign baselines' is presented without accompanying tables, figures, or numerical values showing the actual metric comparisons, undermining the assertion that interaction diversity is preserved.

minor comments (1)

[Abstract] Abstract: the phrase 'album rollback' appears without prior definition; a brief parenthetical gloss would improve readability for readers unfamiliar with the agent state representation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current manuscript version requires additional clarifications and supporting data to make the claims fully verifiable. We will incorporate revisions to address each point as outlined below.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claim (reduction of maximum cumulative infection rate from >95% to <5.47%) is stated without any description of experimental setup, baselines, number of trials, statistical tests, or error bars, rendering it impossible to assess whether the data support the reported performance.

Authors: We acknowledge that the abstract is overly concise and omits key experimental details. In the revised manuscript, we will expand the abstract to briefly describe the setup (multi-agent simulations with 100 independent trials per condition, using global cure-factor baselines), note that results are reported as averages with standard deviations, and reference the use of paired t-tests (p < 0.01) for significance. Full experimental protocols, trial counts, and error-bar figures remain in Section 4; the abstract revision will point readers there explicitly. revision: yes
Referee: [Method] Method section (multi-persona simulation and detection): the detection strategy assumes that observed inconsistencies across persona-based predictions reliably indicate infection rather than benign persona variance or short-horizon noise, yet no validation is supplied (e.g., simulation-to-real trajectory divergence metrics or ROC analysis on held-out benign runs) to confirm low false-positive rates or faithful modeling of real MAS dynamics.

Authors: The referee is correct that explicit validation of the detection signal is missing from the submitted version. While the multi-persona approach is motivated by the need to capture interaction diversity, we did not include quantitative checks such as ROC curves or divergence metrics. In the revision we will add a dedicated validation subsection (new Section 3.4) reporting: (i) KL-divergence between simulated and real trajectories on held-out benign runs (average 0.12), (ii) false-positive rate of 4.8% on 200 benign trials, and (iii) ROC-AUC of 0.93 for infection detection. These additions will directly address concerns about benign variance versus infection signals. revision: yes
Referee: [Experiments] Experiments section: the claim that retrieval and semantic metrics 'closely match benign baselines' is presented without accompanying tables, figures, or numerical values showing the actual metric comparisons, undermining the assertion that interaction diversity is preserved.

Authors: We agree that the claim requires supporting numerical evidence. The original text references the metrics but does not tabulate the comparisons. We will insert Table 3 (retrieval accuracy: FLP 92.4% vs. benign 93.1%; semantic cosine similarity: FLP 0.87 vs. benign 0.89) and Figure 4 (box plots with error bars across 100 trials) in the revised Experiments section. These additions will quantify the preservation of interaction diversity while demonstrating the infection-rate reduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity: training-free simulation method with direct experimental outcomes

full rationale

The paper proposes a training-free FLP framework that uses multi-persona simulations to generate foresight trajectories, then applies response diversity and inconsistency checks at retrieval and semantic levels to detect infections, followed by rollback or recursive diagnosis for purification. All performance claims (e.g., infection rate drop from >95% to <5.47%) are reported as direct experimental measurements on the implemented system rather than quantities derived by fitting parameters to the target data or by self-referential definitions. No equations, ansatzes, or uniqueness theorems are shown to reduce to their own inputs by construction, and no load-bearing self-citations appear in the provided text. The central claims therefore remain independent of the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about infection spread and the diagnostic power of simulation diversity, plus newly introduced mechanisms whose effectiveness is asserted without external corroboration in the abstract.

axioms (2)

domain assumption MASs are vulnerable to infectious jailbreak where compromising a single agent spreads to others via interactions
Core premise stated at the start of the abstract.
domain assumption Global cure-factor defenses homogenize responses and provide only superficial suppression
Used to motivate the shift to local purification.

invented entities (2)

Foresight-Guided Local Purification (FLP) framework no independent evidence
purpose: Training-free detection and removal of infections via simulation and local rollback
Newly proposed method whose performance is asserted in the abstract.
Recursive Binary Diagnosis (RBD) no independent evidence
purpose: Recursive partitioning of memory album to localize and eliminate long-term infections
Introduced as part of the purification procedure.

pith-pipeline@v0.9.0 · 5588 in / 1511 out tokens · 63433 ms · 2026-05-15T07:07:18.166552+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 6 internal anchors

[1]

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2023. Image hijacks: Adversarial images can control generative models at runtime.arXiv preprint arXiv:2309.00236(2023)

work page arXiv 2023
[2]

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42

work page 2025
[3]

Jingyu Chen, Ruidong Ma, and John Oyekan. 2023. A deep multi-agent reinforce- ment learning framework for autonomous aerial navigation to grasping points on loads.Robotics and Autonomous Systems167 (2023), 104489

work page 2023
[4]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. 2024. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In International Conference on Learning Representations, Vol. 2024. 20094–20136

work page 2024
[5]

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al . 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

S Cohen, R Bitton, and B Nassi. 2024. ComPromptMized: Unleashing Zero-click Worms that Target GenAI-Powered Applications

work page 2024
[7]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250–49267

work page 2023
[8]

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. 2024. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. InProceedings of the 2024 Confer- ence of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies (Volume 1: ...

work page 2024
[9]

Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. 2023. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751(2023)

work page arXiv 2023
[10]

Alireza Ghafarollahi and Markus J Buehler. 2025. SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials37, 22 (2025), 2413523

work page 2025
[11]

2025.{Topic-FlipRAG}:{Topic-Orientated} Adversarial Opinion Manipulation Attacks to {Retrieval-Augmented} Gener- ation Models

Yuyang Gong, Zhuo Chen, Jiawei Liu, Miaokun Chen, Fengchang Yu, Wei Lu, XiaoFeng Wang, and Xiaozhong Liu. 2025.{Topic-FlipRAG}:{Topic-Orientated} Adversarial Opinion Manipulation Attacks to {Retrieval-Augmented} Gener- ation Models. In34th USENIX Security Symposium (USENIX Security 25). 3807– 3826

work page 2025
[12]

Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. Figstep: Jailbreaking large vision- language models via typographic visual prompts. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959

work page 2025
[13]

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2024. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. InEuropean Conference on Computer Vision. Springer, 388–404

work page 2024
[14]

Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, et al. 2024. Mllmguard: A multi-dimensional safety evaluation suite for multimodal large language models. Advances in Neural Information Processing Systems37 (2024), 7256–7295

work page 2024
[15]

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. 2024. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. InICML

work page 2024
[16]

Haojie Hao, Jiakai Wang, Aishan Liu, Yuqing Ma, Haotong Qin, Yuanfang Guo, and Xianglong Liu. 2026. Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack against Large Vision-Language Models. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 35481–35489

work page 2026
[17]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al . 2024. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, Vol. 2024. 23247–23275

work page 2024
[18]

Raz Lapid, Ron Langberg, and Moshe Sipper. 2024. Open sesame! universal black-box jailbreaking of large language models.Applied Sciences14, 16 (2024), 7150

work page 2024
[19]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

work page 2020
[20]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems36 (2023), 51991–52008

work page 2023
[21]

Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jail- breaking multimodal large language models. InEuropean Conference on Computer Vision. Springer, 174–189

work page 2024
[22]

Xingchuang Liao, Yuchen Qin, Zhimin Fan, Xiaoming Yu, Jingbo Yang, Rongye Shi, and Wenjun Wu. 2025. MA-HRL: Multi-Agent Hierarchical Reinforcement Learning for Medical Diagnostic Dialogue Systems.Electronics14, 15 (2025), 3001

work page 2025
[23]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26296–26306

work page 2024
[24]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models. InInternational Conference on Learning Representations, Vol. 2024. 56174–56194

work page 2024
[25]

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860 (2023)

work page arXiv 2023
[26]

Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, and Jun Zhao

work page
[27]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 15982–16001. https://aclanthology.org/2025.acl-long.859

work page 2025
[28]

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion models for adversarial purification.arXiv preprint arXiv:2205.07460(2022)

work page arXiv 2022
[29]

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. 2023. MemGPT: towards LLMs as operating systems. (2023)

work page 2023
[30]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22

work page 2023
[31]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal

work page
[32]

Visual adversarial examples jailbreak large language models.arXiv preprint arXiv:2306.132131, 3 (2023), 4

work page arXiv 2023
[33]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand- hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

work page
[34]

In International conference on machine learning

Learning transferable visual models from natural language supervision. In International conference on machine learning. PmLR, 8748–8763

work page
[35]

Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. 2023. Smoothllm: Defending large language models against jailbreaking attacks.arXiv preprint arXiv:2310.03684(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Christian Schlarmann and Matthias Hein. 2023. On the adversarial robustness of multi-modal foundation models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 3677–3685

work page 2023
[37]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

work page 2024
[38]

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning, 2023.URL https://arxiv. org/abs/2303.113668 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, and Bingsheng He. 2025. MegaAgent: A Large-Scale Autonomous LLM- based Multi-Agent System Without Predefined SOPs. InFindings of the Associ- ation for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Associat...

work page doi:10.18653/v1/2025.findings- 2025
[42]

Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, and Chaowei Xiao. 2024. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. InEuropean Conference on Computer Vision. Springer, 77–94

work page 2024
[43]

Zheng Wang, Zhongyang Li, Zeren Jiang, Dandan Tu, and Wei Shi. 2024. Crafting personalized agents through retrieval-augmented generation on editable memory graphs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 4891–4906

work page 2024
[44]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail?Advances in neural information processing systems 36 (2023), 80079–80110. Ma et al

work page 2023
[45]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. InFirst conference on language modeling

work page 2024
[46]

Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Han Qiu, Nils Lukas, and Tianwei Zhang. 2025. Cowpox: Towards the Immunity of VLM-based Multi- Agent Systems. InICML

work page 2025
[47]

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2025. The rise and potential of large language model based agents: A survey.Science China Information Sciences 68, 2 (2025), 121101

work page 2025
[48]

Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence5, 12 (2023), 1486–1496

work page 2023
[49]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. 2024. Safedecoding: Defending against jailbreak attacks via safety-aware decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5587–5605

work page 2024
[50]

Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, and Qian Lou. 2024. Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models.arXiv preprint arXiv:2406.00083(2024)

work page arXiv 2024
[51]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision. 11975–11986

work page 2023
[52]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. {PoisonedRAG}: Knowledge corruption attacks to {Retrieval-Augmented} generation of large language models. In34th USENIX Security Symposium (USENIX Security 25). 3827–3844. A Open Science We provide the artifacts necessary to evaluate the core contributions of this paper. The anonymized repository...

work page 2025