MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory
Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3
The pith
MAGE uses a dedicated shadow memory to detect and mitigate long-horizon threats against LLM agents by retaining safety-critical context across full execution trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. This enables detection and mitigation of a wide range of long-horizon threats that exploit extended interactions and are improbable in single-turn settings.
What carries the argument
The shadow memory, a distilled safety-focused agentic memory that captures and retains safety-critical context from the full execution trajectory to evaluate risks of pending actions before they occur.
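The mechanism described above can be sketched as a check that runs before each pending action. This is a minimal illustration under stated assumptions, not the paper's implementation: `ShadowMemory`, `guarded_step`, `toy_distill`, `toy_judge`, the keyword list, and the 0.5 threshold are all hypothetical stand-ins for the learned distillation and risk-assessment components the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ShadowMemory:
    """Hypothetical store of distilled, safety-relevant notes."""
    notes: List[str] = field(default_factory=list)

    def update(self, observation: str, distill: Callable[[str], str]) -> None:
        # Keep only what the distiller judges safety-relevant.
        note = distill(observation)
        if note:
            self.notes.append(note)


def guarded_step(
    memory: ShadowMemory,
    action: str,
    judge: Callable[[List[str], str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> bool:
    """Return True if the pending action may run, False if it is blocked."""
    risk = judge(memory.notes, action)
    return risk < threshold


def toy_distill(observation: str) -> str:
    # Toy stand-in: retain observations that mention credentials.
    keywords = ("password", "api key")
    lowered = observation.lower()
    return observation if any(k in lowered for k in keywords) else ""


def toy_judge(notes: List[str], action: str) -> float:
    # Toy stand-in: sending email after credentials were seen is risky.
    saw_secret = len(notes) > 0
    exfiltrates = action.startswith("send_email")
    return 1.0 if (saw_secret and exfiltrates) else 0.0


memory = ShadowMemory()
memory.update("Weather report: sunny", toy_distill)
allowed_before = guarded_step(memory, "send_email(to='a@b.c')", toy_judge)
memory.update("File contents: password=hunter2", toy_distill)
allowed_after = guarded_step(memory, "send_email(to='a@b.c')", toy_judge)
```

In this toy run the same action is allowed before the credential is observed and blocked afterwards, which is the trajectory-dependent behaviour a single-turn check would not capture.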
Load-bearing premise
That a distilled shadow memory can reliably capture and retain all safety-critical context across arbitrary execution trajectories without missing threats or generating excessive false alarms, and that this generalizes to the full range of long-horizon attacks.
What would settle it
A documented case of a multi-step attack sequence that completes its malicious objective without the shadow memory raising an alert before execution, or that triggers so many false alarms that normal agent tasks become unusable.
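One way to make this criterion measurable is to log, for each evaluated trajectory, when the first alert fired and when the harmful action was attempted. The sketch below is an assumption about how such a tally could be kept, not the paper's evaluation protocol; the `Trajectory` record and its field names are hypothetical. It treats an alert at or before the harmful step as a catch, since the check is meant to run before execution.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    is_attack: bool
    alert_step: Optional[int]  # first step at which an alert fired, if any
    harm_step: Optional[int]   # step at which the harmful action was attempted


def missed_and_false_alarm_rates(runs: Sequence[Trajectory]) -> Tuple[float, float]:
    """Return (missed-attack rate, false-alarm rate) over the given runs."""
    attacks = [r for r in runs if r.is_attack]
    benign = [r for r in runs if not r.is_attack]
    # Missed: the harmful step was reached with no alert at or before it.
    missed = sum(
        1
        for r in attacks
        if r.harm_step is not None
        and (r.alert_step is None or r.alert_step > r.harm_step)
    )
    # False alarm: any alert on a benign trajectory.
    false_alarms = sum(1 for r in benign if r.alert_step is not None)
    missed_rate = missed / len(attacks) if attacks else 0.0
    false_alarm_rate = false_alarms / len(benign) if benign else 0.0
    return missed_rate, false_alarm_rate


runs = [
    Trajectory(is_attack=True, alert_step=2, harm_step=4),      # caught early
    Trajectory(is_attack=True, alert_step=None, harm_step=3),   # missed
    Trajectory(is_attack=False, alert_step=None, harm_step=None),
    Trajectory(is_attack=False, alert_step=1, harm_step=None),  # false alarm
]
rates = missed_and_false_alarm_rates(runs)
```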
Original abstract
As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious objectives improbable in single-turn settings. Such long-horizon threats pose significant risks to the safe deployment of LLM agents in critical domains. In this paper, we present MAGE (Memory As Guardrail Enforcement), a novel defensive framework designed to counter a wide range of long-horizon threats. Inspired by the "shadow stack" abstraction in systems security, MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. Extensive evaluation demonstrates that MAGE substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy, achieves early-stage detection for the majority of attacks, and introduces only negligible overhead to agent utility. To our best knowledge, MAGE represents the first framework to detect and mitigate long-horizon threats using an agentic memory approach, establishing a new paradigm for this critical challenge and opening promising directions for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MAGE (Memory As Guardrail Enforcement), a defensive framework for LLM agents facing long-horizon threats that exploit extended multi-turn interactions. Drawing an analogy to the shadow stack in systems security, MAGE maintains a dedicated safety-focused agentic 'shadow memory' that distills and retains safety-critical context across the full execution trajectory; this memory is then used to proactively evaluate the risk of pending actions before execution. The authors claim that extensive evaluations show MAGE substantially outperforms existing defenses in detection accuracy, achieves early-stage detection for the majority of attacks, incurs only negligible overhead on agent utility, and constitutes the first agentic-memory approach to this problem.
Significance. If the empirical results hold under rigorous scrutiny, MAGE could meaningfully advance LLM-agent security by shifting from reactive, single-turn defenses to a proactive, trajectory-aware memory guardrail. The shadow-memory construction is a coherent and novel application of a systems concept to agentic settings, and the emphasis on early detection plus low utility overhead addresses practical deployment concerns. The work opens clear follow-on directions in memory-augmented agent safety.
major comments (1)
- [Abstract] The central claims of 'substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy' and 'achieves early-stage detection for the majority of attacks' are asserted without any metrics, baselines, threat models, attack taxonomies, or evaluation protocol. Because these quantitative results are load-bearing for the paper's contribution, their absence prevents verification that the shadow-memory mechanism actually delivers the stated gains rather than relying on an untested assumption that distilled context will reliably surface all safety-critical signals.
Simulated Author's Rebuttal
We thank the referee for the careful review and for highlighting this important point about the abstract. We address the comment directly below and will make the requested changes to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The central claims of 'substantially outperforms existing defenses across diverse long-horizon threats in detection accuracy' and 'achieves early-stage detection for the majority of attacks' are asserted without any metrics, baselines, threat models, attack taxonomies, or evaluation protocol. Because these quantitative results are load-bearing for the paper's contribution, their absence prevents verification that the shadow-memory mechanism actually delivers the stated gains rather than relying on an untested assumption that distilled context will reliably surface all safety-critical signals.
Authors: We agree that the abstract, as currently written, presents the performance claims at a high level without quantitative anchors or explicit references to the evaluation setup. The full manuscript contains the supporting details: threat models and attack taxonomy are defined in the problem formulation and threat model section, the evaluation protocol (including datasets, metrics, and early-detection criteria) is specified in the experimental setup, baselines are compared in the results section, and concrete detection accuracy figures plus early-stage detection rates are reported with statistical detail. Nevertheless, the referee is correct that these elements are not summarized in the abstract itself, which can make the claims harder to assess at first reading. We will revise the abstract to incorporate key quantitative results (specific accuracy improvements and early-detection percentages), a concise statement of the evaluation scope, and reference to the main baselines and threat categories. This change will be made within the abstract's length constraints and without introducing new unsubstantiated assertions.
revision: yes
Circularity Check
No significant circularity in the derivation chain
The paper introduces MAGE as an original defensive framework for LLM agents, drawing high-level inspiration from the shadow-stack abstraction in systems security but without any equations, fitted parameters, predictions, or first-principles derivations that reduce to the inputs by construction. No self-citations are invoked as load-bearing support for core claims, and the central design (distilled safety-focused agentic memory for risk assessment) is presented as a new construction rather than a renaming or definitional equivalence of prior results. Claims of outperformance and early detection rest on external evaluation rather than internal reduction, leaving the derivation chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents face long-horizon threats that exploit extended interactions and are improbable in single-turn settings
invented entities (1)
- Shadow memory: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Cost (Jcost = ½(x+x⁻¹)−1), washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: we introduce a turn-wise GRPO variant ... J(θ) = E [Σ_t (1/G) Σ_i (L_mem + L_judge)] ... R_mem(m_t) = Σ_{k=t}^T γ^{k-t} · 1_{d_k = d*_k}
- Foundation.AlphaCoordinateFixation (parameter-free α=1 fixation), alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We use λ=1.0 as the default ... γ ∈ [0.25, 0.5] yields the best performance ... 2,500 character threshold
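The memory reward quoted in the passage above, R_mem(m_t) = Σ_{k=t}^T γ^{k-t} · 1_{d_k = d*_k}, credits the memory at turn t with a discounted count of the correct decisions made from that turn onward. The following sketch computes it for every turn with a single backward pass. It assumes decisions and reference decisions are simple labels; the function name and the "allow"/"block" labels are illustrative, and the code uses 0-based turn indices.

```python
from typing import List, Sequence


def memory_rewards(
    decisions: Sequence[str],
    targets: Sequence[str],
    gamma: float,
) -> List[float]:
    """Return R_mem for each turn t, given decisions d_k and references d*_k."""
    if len(decisions) != len(targets):
        raise ValueError("decisions and targets must have the same length")
    rewards = [0.0] * len(decisions)
    running = 0.0
    # Backward recursion: R_t = 1[d_t == d*_t] + gamma * R_{t+1}.
    for k in range(len(decisions) - 1, -1, -1):
        correct = 1.0 if decisions[k] == targets[k] else 0.0
        running = correct + gamma * running
        rewards[k] = running
    return rewards


# Three turns; the decision at the middle turn is wrong.
rewards = memory_rewards(
    decisions=["allow", "allow", "block"],
    targets=["allow", "block", "block"],
    gamma=0.5,  # within the quoted range [0.25, 0.5]
)
# rewards == [1.25, 0.5, 1.0]
```

With γ = 0.5, the first turn earns 1 for its own correct decision plus 0.25 for the correct decision two turns later, and nothing for the incorrect middle turn.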
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.