
arxiv: 2605.03228 · v1 · submitted 2026-05-04 · cs.CR · cs.AI · cs.CL

MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.CL
keywords: LLM agents, long-horizon threats, shadow memory, agentic memory, attack detection, defensive framework, security guardrails, multi-step attacks

The pith

MAGE uses a dedicated shadow memory to detect and mitigate long-horizon threats against LLM agents by retaining safety-critical context across full execution trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAGE as a framework that equips LLM agents with a safety-focused memory mechanism to handle attacks which unfold gradually through many steps of user-agent-environment interaction. Drawing from the shadow stack concept in systems security, it distills and preserves only the information relevant to potential risks rather than storing everything. This memory is consulted before any action executes to judge whether the step advances a hidden malicious goal. A sympathetic reader would care because LLM agents are moving into complex real-world tasks where single-turn defenses leave gaps that extended attacks can exploit.
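
For readers unfamiliar with the shadow stack that the paper cites as inspiration, the toy model below may help. It is a simplified sketch, not anything taken from the paper: the class `ShadowStack`, the helpers `call` and `ret`, and the addresses are all hypothetical names chosen for illustration. The idea it shows is that a second, protected copy of each return address is kept, and a return is trusted only if the two copies agree.

```python
class ShadowStack:
    """Toy shadow stack: a separate copy of return addresses (illustrative only)."""

    def __init__(self):
        self._shadow = []

    def on_call(self, return_address):
        # Record the return address in the protected copy.
        self._shadow.append(return_address)

    def on_return(self, return_address):
        # A return is trusted only if it matches the protected copy.
        expected = self._shadow.pop()
        return expected == return_address


main_stack = []          # the ordinary stack, which an attacker may overwrite
shadow = ShadowStack()   # the protected copy


def call(return_address):
    main_stack.append(return_address)
    shadow.on_call(return_address)


def ret():
    return shadow.on_return(main_stack.pop())


call(0x401000)
call(0x402000)
main_stack[-1] = 0xDEADBEEF   # simulated corruption of the ordinary stack
tampered_ok = ret()           # mismatch with the shadow copy
clean_ok = ret()              # the untouched frame still matches
print(tampered_ok, clean_ok)
```

The analogy to MAGE, as the summary describes it, is that the shadow memory plays the role of the protected copy: a compact record kept apart from the agent's ordinary context and consulted before a step is allowed to proceed.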

Core claim

MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. This enables detection and mitigation of a wide range of long-horizon threats that exploit extended interactions and are improbable in single-turn settings.

What carries the argument

The shadow memory, a distilled safety-focused agentic memory that captures and retains safety-critical context from the full execution trajectory to evaluate risks of pending actions before they occur.
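
The check-before-execute loop can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: per the Figure 2 caption, the paper's memory manager and judge are driven by an LLM, whereas here they are replaced by hand-written rules. All names (`Action`, `ShadowMemory`, `update_memory`, `judge`, `run_guarded`), the tool names, and the example data are hypothetical. The scenario loosely follows the tool-chaining example in the Figure 8 caption.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class ShadowMemory:
    # Only safety-relevant facts are retained, not the full trajectory.
    sensitive_items: set = field(default_factory=set)
    external_parties: set = field(default_factory=set)


def update_memory(memory, observation):
    """Rule-based stand-in for the memory manager: distill safety-relevant context."""
    if observation.get("confidential"):
        memory.sensitive_items.add(observation["name"])
    if observation.get("external"):
        memory.external_parties.add(observation["address"])


def judge(memory, pending):
    """Rule-based stand-in for the judge: assess a pending action before it runs."""
    if pending.tool != "send_email":
        return True
    leaks = pending.args.get("attachment") in memory.sensitive_items
    outside = pending.args.get("to") in memory.external_parties
    return not (leaks and outside)


def run_guarded(trajectory):
    memory = ShadowMemory()
    decisions = []
    for action, observation in trajectory:
        allowed = judge(memory, action)        # consult memory first
        decisions.append((action.tool, allowed))
        if not allowed:
            break                              # blocked actions never execute
        update_memory(memory, observation)     # then record what was learned
    return decisions


send = Action("send_email", {"to": "bob@partner.example", "attachment": "q3_plan.pdf"})
trajectory = [
    (Action("search_files", {"query": "q3 plan"}),
     {"name": "q3_plan.pdf", "confidential": True}),
    (Action("lookup_contact", {"name": "Bob"}),
     {"address": "bob@partner.example", "external": True}),
    (send, {}),
]
decisions = run_guarded(trajectory)
stateless_verdict = judge(ShadowMemory(), send)  # same action, no retained context
print(decisions, stateless_verdict)
```

In this toy run the two preparatory steps are allowed and the final email is blocked, while the same email judged against an empty memory passes. That contrast is the gap the summary attributes to single-turn defenses.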

Load-bearing premise

That a distilled shadow memory can reliably capture and retain all safety-critical context across arbitrary execution trajectories without missing threats or generating excessive false alarms, and that this generalizes to the full range of long-horizon attacks.

What would settle it

A documented case of a multi-step attack sequence that completes its malicious objective without the shadow memory raising an alert before execution, or that triggers so many false alarms that normal agent tasks become unusable.
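
Both failure conditions can be expressed as simple rates over logged trials. The sketch below assumes a hypothetical log format (the field names `objective_completed` and `blocked` are invented here) and uses made-up records purely to show the arithmetic; none of the numbers come from the paper.

```python
def attack_success_rate(attack_trials):
    """Fraction of attack trials in which the malicious objective completed."""
    return sum(t["objective_completed"] for t in attack_trials) / len(attack_trials)


def false_alarm_rate(benign_trials):
    """Fraction of benign trials in which the guard blocked a legitimate task."""
    return sum(t["blocked"] for t in benign_trials) / len(benign_trials)


# Made-up records for illustration only.
attack_trials = [
    {"objective_completed": False},
    {"objective_completed": False},
    {"objective_completed": True},
    {"objective_completed": False},
]
benign_trials = [
    {"blocked": False},
    {"blocked": True},
    {"blocked": False},
    {"blocked": False},
    {"blocked": False},
]

asr = attack_success_rate(attack_trials)
far = false_alarm_rate(benign_trials)
print(asr, far)
```

A single attack trial with `objective_completed` set and no prior alert would be the documented counterexample described above; a high false-alarm rate on benign trials would be the usability failure.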

Figures

Figures reproduced from arXiv: 2605.03228 by Charles Fleming, Jiacheng Liang, Tanqiu Jiang, Ting Wang, Yuhui Wang.

Figure 1: (a) Tool-chaining attack; (b) MAGE defends against long-horizon attacks by distilling and retaining safety-relevant context in shadow memory when evaluating the risks of potential actions.
… as keyword filters, per-turn classifiers, and safety prompting, are engineered for short contexts and degrade substantially under the distributional shift to long-horizon interactions [2]. A handful of recent studies att…
Figure 2: (a) Architecture of MAGE: at each turn, the memory manager M updates security-critical context, and the judge J assesses the risk of the pending action. (b) Agentic workflow of M_θ and J_θ (with the same underlying LLM θ) within MAGE. (c) Turn-wise RL procedure for optimizing θ.
… To resolve this tension, we introduce agentic shadow memory, a compact, security-focused representation of the agent’s execution tr…
Figure 3: (a) Impact of attack horizon: without defense, ASR rises
Figure 4: Sensitivity analysis for reward hyper-parameters
Figure 5: Security-utility-overhead trade-off among different defense
Figure 6: (a) Impact of attack horizon: without defense, ASR rises
Figure 7: Sensitivity analysis for γ and λ. Top: benign utility and attack success rate. Bottom: token cost (K tokens per task).
… and crafts injections specifically designed to subvert the defense. Building upon PI2’s optimization-based attack pipeline, we adapt its attacker and rewriter modules to target MAGE’s components. We consider three strategies: i) M-adaptive: the injection includes instructions to corrupt …
Figure 8: Flowchart of MAGE’s defense against tool-chaining attacks. The agent receives a legitimate request to locate a confidential internal document and email it to an external contact. MAGE approves the preparatory file search and contact lookup, but rejects the final email action that would exfiltrate sensitive data.
… reasoning behind the decision. Harmful instructions are highlighted in red. Key security-criti…
Figure 9: Visual overview of Example 2.1. The agent receives a legiti
Figure 10: Training data size ablation. Left column (Evaluation I,
Figure 11: Training dynamics on Evaluation I (SHADE-Arena). From left to right: step-wise average reward, validation ASR (%), validation BU
Figure 12: Training dynamics on Evaluation II (AgentDojo). From left to right: step-wise average reward, validation ASR (%), validation BU (%),
Original abstract

As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious objectives improbable in single-turn settings. Such long-horizon threats pose significant risks to the safe deployment of LLM agents in critical domains. In this paper, we present MAGE (Memory As Guardrail Enforcement), a novel defensive framework designed to counter a wide range of long-horizon threats. Inspired by the "shadow stack" abstraction in systems security, MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. Extensive evaluation demonstrates that MAGE substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy, achieves early-stage detection for the majority of attacks, and introduces only negligible overhead to agent utility. To our best knowledge, MAGE represents the first framework to detect and mitigate long-horizon threats using an agentic memory approach, establishing a new paradigm for this critical challenge and opening promising directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces MAGE (Memory As Guardrail Enforcement), a defensive framework for LLM agents facing long-horizon threats that exploit extended multi-turn interactions. Drawing an analogy to the shadow stack in systems security, MAGE maintains a dedicated safety-focused agentic 'shadow memory' that distills and retains safety-critical context across the full execution trajectory; this memory is then used to proactively evaluate the risk of pending actions before execution. The authors claim that extensive evaluations show MAGE substantially outperforms existing defenses in detection accuracy, achieves early-stage detection for the majority of attacks, incurs only negligible overhead on agent utility, and constitutes the first agentic-memory approach to this problem.

Significance. If the empirical results hold under rigorous scrutiny, MAGE could meaningfully advance LLM-agent security by shifting from reactive, single-turn defenses to a proactive, trajectory-aware memory guardrail. The shadow-memory construction is a coherent and novel application of a systems concept to agentic settings, and the emphasis on early detection plus low utility overhead addresses practical deployment concerns. The work opens clear follow-on directions in memory-augmented agent safety.

major comments (1)
  1. [Abstract] The central claims of 'substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy' and 'achieves early-stage detection for the majority of attacks' are asserted without any metrics, baselines, threat models, attack taxonomies, or evaluation protocol. Because these quantitative results are load-bearing for the paper's contribution, their absence prevents verification that the shadow-memory mechanism actually delivers the stated gains rather than relying on an untested assumption that distilled context will reliably surface all safety-critical signals.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for highlighting this important point about the abstract. We address the comment directly below and will make the requested changes to strengthen the manuscript.

  1. Referee: [Abstract] The central claims of 'substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy' and 'achieves early-stage detection for the majority of attacks' are asserted without any metrics, baselines, threat models, attack taxonomies, or evaluation protocol. Because these quantitative results are load-bearing for the paper's contribution, their absence prevents verification that the shadow-memory mechanism actually delivers the stated gains rather than relying on an untested assumption that distilled context will reliably surface all safety-critical signals.

    Authors: We agree that the abstract, as currently written, presents the performance claims at a high level without quantitative anchors or explicit references to the evaluation setup. The full manuscript contains the supporting details: threat models and attack taxonomy are defined in the problem formulation and threat model section, the evaluation protocol (including datasets, metrics, and early-detection criteria) is specified in the experimental setup, baselines are compared in the results section, and concrete detection accuracy figures plus early-stage detection rates are reported with statistical detail. Nevertheless, the referee is correct that these elements are not summarized in the abstract itself, which can make the claims harder to assess at first reading. We will revise the abstract to incorporate key quantitative results (specific accuracy improvements and early-detection percentages), a concise statement of the evaluation scope, and references to the main baselines and threat categories. The revision will stay within the abstract's length limit and will not introduce unsubstantiated assertions.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper introduces MAGE as an original defensive framework for LLM agents, drawing high-level inspiration from the shadow-stack abstraction in systems security but without any equations, fitted parameters, predictions, or first-principles derivations that reduce to the inputs by construction. No self-citations are invoked as load-bearing support for core claims, and the central design (distilled safety-focused agentic memory for risk assessment) is presented as a new construction rather than a renaming or definitional equivalence of prior results. Claims of outperformance and early detection rest on external evaluation rather than internal reduction, leaving the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the newly introduced shadow memory construct and the assumption that long-horizon threats can be mitigated by proactive risk assessment from distilled context; no free parameters or external benchmarks are specified.

axioms (1)
  • Domain assumption: LLM agents face long-horizon threats that exploit extended interactions and are improbable in single-turn settings.
    Stated in the opening of the abstract as the motivating premise.
invented entities (1)
  • Shadow memory (no independent evidence)
    Purpose: dedicated safety-focused agentic memory that distills and retains safety-critical context across the full execution trajectory.
    Core novel component of MAGE, introduced without independent evidence outside the framework itself.

pith-pipeline@v0.9.0 · 5516 in / 1165 out tokens · 48186 ms · 2026-05-08T17:46:04.092776+00:00 · methodology



Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages

  1. [1]

    Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. 2025. IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP)

  2. [2]

    Cem Anil, Esin Durmus, and others. 2024. Many-shot Jailbreaking. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  3. [3]

    Rauno Arike, Elizabeth Donoway, Henning Bartsch, and Marius Hobbhahn. 2025. Technical Report: Evaluating Goal Drift in Language Model Agents. ArXiv e-prints (2025)

  4. [4]

    Nathan Burow, Xinping Zhang, and Mathias Payer. 2019. SoK: Shining Light on Shadow Stacks. In Proceedings of IEEE Symposium on Security and Privacy (S&P)

  5. [5]

    Hwan Chang, Yonghyun Jun, and Hwanhee Lee. 2025. ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents. ArXiv e-prints (2025)

  6. [6]

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2023. Jailbreaking Black Box Large Language Models in Twenty Queries. ArXiv e-prints (2023)

  7. [7]

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: Defending Against Prompt Injection with Structured Queries. In Proceedings of USENIX Security Symposium (SEC)

  8. [8]

    Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, and David Wagner

  9. [9]

    Defending Against Prompt Injection With a Few Defensive Tokens. ArXiv e-prints (2025)

  10. [10]

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. 2025. SecAlign: Defending Against Prompt Injection with Preference Optimization. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS)

  11. [11]

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  12. [12]

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav

  13. [13]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. ArXiv e-prints (2025)

  14. [14]

    Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin

  15. [15]

    Securing AI Agents with Information-Flow Control. ArXiv e-prints (2025)

  16. [16]

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  17. [17]

    Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. 2025. Memory Injection Attacks on LLM Agents via Query-Only Interaction. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  18. [18]

    Zetong Guan, Enliang Li, Yihan Wang, Ziyu Wang, and Yisong Chen. 2026. GPU-Accelerated Robust Graph Neural Networks against Adversarial Cyber Attacks. In Proceedings of International Conference on Digital Management and Information Technology (DMIT)

  19. [19]

    Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. 2024. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  20. [20]

    Weiyang Guo, Jing Li, Wenya Wang, Yu Li, Daojing He, Jun Yu, and Min Zhang

  21. [21]

    MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming. In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL)

  22. [22]

    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending Against Indirect Prompt Injection Attacks With Spotlighting. ArXiv e-prints (2024)

  23. [23]

    Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, et al. 2025. Memory in the Age of AI Agents: A Survey. ArXiv e-prints (2025)

  24. [24]

    Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, and Zili Shao. 2025. Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak. ArXiv e-prints (2025)

  25. [25]

    Yuelyu Ji, Zhuochun Li, Rui Meng, and Daqing He. 2025. Reason-to-Rank: Distilling Direct and Comparative Reasoning from Large Language Models for Document Reranking. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

  26. [26]

    Yuelyu Ji, Zhuochun Li, Rui Meng, and Daqing He. 2026. Retrieval–Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends. ArXiv e-prints (2026)

  27. [27]

    Zimo Ji, Xunguang Wang, Zongjie Li, Pingchuan Ma, Yudong Gao, Daoyuan Wu, Xincheng Yan, Tian Tian, and Shuai Wang. 2025. Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks. ArXiv e-prints (2025)

  28. [28]

    Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. 2025. PromptLocate: Localizing Prompt Injection Attacks. ArXiv e-prints (2025)

  29. [29]

    Tanqiu Jiang, Yuhui Wang, Jiacheng Liang, and Ting Wang. 2026. AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks. ArXiv e-prints (2026)

  30. [30]

    Tanqiu Jiang, Zian Wang, Jiacheng Liang, Changjiang Li, Yuhui Wang, and Ting Wang. 2024. RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction. ArXiv e-prints (2024)

  31. [31]

    Xiaochong Jiang, Shiqi Yang, Wenting Yang, Yichen Liu, and Cheng Ji. 2026. Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains. ArXiv e-prints (2026)

  32. [32]

    Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, and Edward Suh. 2025. Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis. ArXiv e-prints (2025)

  33. [33]

    Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, and Joe Benton. 2025. SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents. ArXiv e-prints (2025)

  34. [34]

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2026. LLMs Get Lost In Multi-Turn Conversation. In Proceedings of International Conference on Learning Representations (ICLR)

  35. [35]

    Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, and Yanjun Qi. 2025. STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents. ArXiv e-prints (2025)

  36. [36]

    Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, and Yew-Soon Ong. 2025. MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models. ArXiv e-prints (2025)

  37. [37]

    Yuqi Li, Hansheng Zeng, Fuyan Zhang, Chuanguang Yang, Yanli Li, and Weiping Ding. 2025. Efficient Medical Image Segmentation via Reinforcement Learning-Driven K-Space Sampling. IEEE Transactions on Emerging Topics in Computational Intelligence (2025)

  38. [38]

    Yuqi Li, Zijie Zhou, Zhiyuan Peng, Junhao Dong, Haochen You, Renye Yan, Shiping Wen, Yingli Tian, and Tingwen Huang. 2025. A preference-driven methodology for efficient code generation. IEEE Transactions on Artificial Intelligence (2025)

  39. [39]

    Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, and Ting Wang. 2025. AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models. ArXiv e-prints (2025)

  40. [40]

    Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, and Ting Wang. 2026. RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models. ArXiv e-prints (2026)

  41. [41]

    Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, and Ting Wang. 2025. GraphRAG under Fire. ArXiv e-prints (2025)

  42. [42]

    Weisi Liu, Guangzeng Han, and Xiaolei Huang. 2025. Examining and Adapting Time for Multilingual Classification via Mixture of Temporal Experts. In Proceedings of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

  43. [43]

    Weisi Liu, Zhe He, and Xiaolei Huang. 2025. Time matters: Examine temporal effects on biomedical language models. In Proceedings of AMIA Annual Symposium (AMIA)

  44. [44]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In Proceedings of International Conference on Learning Representations (ICLR)

  45. [45]

    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024. Automatic and Universal Prompt Injection Attacks against Large Language Models. ArXiv e-prints (2024)

  46. [46]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In Proceedings of USENIX Security Symposium (SEC)

  47. [47]

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. 2025. DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks. In Proceedings of IEEE Symposium on Security and Privacy (S&P)

  48. [48]

    Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, and Neil Zhenqiang Gong. 2025. SecInfer: Preventing Prompt Injection via Inference-time Scaling. ArXiv e-prints (2025)

  49. [49]

    Huaiying Luo and Cheng Ji. 2025. Cross-cloud data privacy protection: Optimizing collaborative mechanisms of AI systems by integrating federated learning and LLMs. In Proceedings of IEEE International Conference on Communications, Information System and Computer Engineering (CISCE)

  50. [50]

    Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. 2024. Fight Back Against Jailbreaking via Prompt Adversarial Tuning. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  51. [51]

    Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr

  52. [52]

    The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections. ArXiv e-prints (2025)

  53. [53]

    OpenAI. 2025. Introducing deep research. https://openai.com/index/introducing-deep-research/

  54. [54]

    OpenManus. 2025. OpenManus: Open-source AI agent framework. https://github.com/mannaandpoem/OpenManus

  55. [55]

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. ArXiv e-prints (2023)

  56. [56]

    Shilong Pan, Zhiliang Tian, Zhen Huang, Wanlong Yu, Zhihua Wen, Xinwang Liu, Kai Lu, Minlie Huang, and Dongsheng Li. 2025. AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models. In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL)

  57. [57]

    Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel. 2025. X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents. In Proceedings of Conference on Language Modeling (COLM)

  58. [58]

    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. In Proceedings of International Conference on Learning Representations (ICLR)

  59. [59]

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. In Proceedings of USENIX Security Symposium (SEC)

  60. [60]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS)

  61. [61]

    Sander Schulhoff. 2024. The Sandwich Defense: Strengthening AI Prompt Security. https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense

  62. [62]

    Zeyang Sha, Hanling Tian, Zhuoer Xu, Shiwen Cui, Changhua Meng, and Weiqiang Wang. 2025. Agent Safety Alignment via Reinforcement Learning. ArXiv e-prints (2025)

  63. [63]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. ArXiv e-prints (2024)

  64. [64]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework. ArXiv e-prints (2024)

  65. [65]

    Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. 2025. Prompt Injection Attack to Tool Selection in LLM Agents. ArXiv e-prints (2025)

  66. [66]

    Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. 2025. Progent: Programmable Privilege Control for LLM Agents. ArXiv e-prints (2025)

  67. [67]

    Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. rLLM: A Framework for Post-Training Language Agents. https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31

  68. [68]

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. ArXiv e-prints (2024)

  69. [69]

    Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. 2025. AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection. ArXiv e-prints (2025)

  70. [70]

    Reachal Wang, Yuqi Jia, and Neil Zhenqiang Gong. 2026. ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data. In Proceedings of Network and Distributed System Security Symposium (NDSS)

  71. [71]

    Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, and Ting Wang

  72. [72]

    Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models. ArXiv e-prints (2025)

  73. [73]

    Yuhui Wang, Rongyi Zhu, and Ting Wang. 2025. Self-Destructive Language Model. ArXiv e-prints (2025)

  74. [74]

    Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, and Lukas Rutishauser. 2025. Adversarial Reinforcement Learning for Large Language Model Agent Safety. ArXiv e-prints (2025)

  75. [75]

    Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP)

  76. [76]

    Ruoyao Wen, Hao Li, Chaowei Xiao, and Ning Zhang. 2026. AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management. ArXiv e-prints (2026)

  77. [77]

    Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, and Chuan Guo. 2025. RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection. ArXiv e-prints (2025)

  78. [78]

    Simon Willison. 2023. Delimiters Won’t Save You from Prompt Injection. https://simonwillison.net/2023/May/11/delimiters-wont-save-you/

  79. [79]

    Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. 2024. System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective. ArXiv e-prints (2024)

  80. [80]

    Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. 2025. Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy. In Proceedings of International Conference on Learning Representations (ICLR)

Showing first 80 references.