arxiv: 2605.05704 · v2 · pith:OBCY4VANnew · submitted 2026-05-07 · 💻 cs.CR · cs.AI

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Pith reviewed 2026-05-25 06:16 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agent safety · guardrail · hierarchical memory · adversarial generation · self-evolution · safety-utility tradeoff · dynamic rule injection

The pith

SafeHarbor uses hierarchical memory to inject context-aware defense rules into LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that LLM agents can maintain high utility on ambiguous benign tasks while refusing most harmful requests by replacing static guidelines with dynamically injected, context-aware rules. These rules are generated through adversarial methods and stored in a local hierarchical memory that evolves via entropy-driven splitting and merging. A sympathetic reader would care because existing defenses often force a sharp safety-utility trade-off that blocks useful agent behavior. If the approach holds, agents could operate in real environments with less over-refusal and without model retraining.

Core claim

SafeHarbor extracts context-aware defense rules through enhanced adversarial generation and maintains them in a local hierarchical memory system for dynamic rule injection, with an information entropy-based self-evolution mechanism that optimizes the memory structure through dynamic node splitting and merging.

What carries the argument

The local hierarchical memory system that stores and dynamically injects context-aware rules while self-evolving through entropy-based node splitting and merging.
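
The page does not state the paper's exact splitting and merging criteria, so the following is only a minimal sketch of how an entropy-driven decision could look. It assumes each memory node holds rules tagged with an intent label; the function names, the label scheme, and both thresholds (`max_entropy`, `max_increase`) are hypothetical and are not taken from SafeHarbor.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_split(labels, max_entropy=1.0):
    """Assumed criterion: split a node whose rules mix too many intents."""
    return shannon_entropy(labels) > max_entropy


def should_merge(labels_a, labels_b, max_increase=0.1):
    """Assumed criterion: merge two nodes when the merged node is barely
    more mixed than the size-weighted average of the two originals."""
    n_a, n_b = len(labels_a), len(labels_b)
    if n_a + n_b == 0:
        return True
    weighted = (
        n_a * shannon_entropy(labels_a) + n_b * shannon_entropy(labels_b)
    ) / (n_a + n_b)
    merged = shannon_entropy(labels_a + labels_b)
    return merged - weighted <= max_increase
```

Under these assumptions, a node holding four rules with four distinct intents (entropy of 2 bits) would be split, while two nodes that both hold only the same intent would be merged because combining them adds no entropy.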

If this is right

  • Agents reach a peak benign utility of 63.6 percent on GPT-4o while refusing over 93 percent of harmful requests.
  • The system supplies a training-free and plug-and-play defense that avoids retraining the base model.
  • Dynamic rule injection reduces over-refusal on ambiguous benign tasks compared with static guidelines.
  • Entropy-based splitting and merging continuously refines the memory structure during operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory structure could support safety rules that adapt across multiple interacting agents rather than a single agent.
  • Self-evolution through entropy might lower the frequency of manual safety policy updates as new attack patterns appear.
  • Memory augmentation of this form offers an alternative path to safety alignment that avoids the cost of repeated fine-tuning.

Load-bearing premise

The assumption that adversarially generated context-aware rules remain effective and do not create new failure modes once injected and managed through the hierarchical memory and entropy mechanisms.

What would settle it

A test set of harmful requests generated independently of the adversarial generation process that causes the refusal rate to fall below 93 percent or the benign utility to fall below 60 percent on GPT-4o.
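
As a rough illustration of that check, the sketch below compares measured rates against the two thresholds named above. The function names and the boolean per-request outcome format are assumptions made for this example; the paper's own scoring protocol is not described on this page.

```python
def rate(outcomes):
    """Fraction of True values in a non-empty list of booleans."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def claim_survives(refused_harmful, completed_benign,
                   min_refusal=0.93, min_utility=0.60):
    """Return True if neither rate falls below its threshold.

    refused_harmful: one bool per harmful request (True = refused).
    completed_benign: one bool per benign task (True = completed).
    """
    return (rate(refused_harmful) >= min_refusal
            and rate(completed_benign) >= min_utility)
```

The test only bears on the claim if the harmful requests are produced independently of the adversarial generation process used to build the rules.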

Figures

Figures reproduced from arXiv: 2605.05704 by Deyue Zhang, Dongdong Yang, Hao Peng, Quanchen Zou, Wenxin Zhang, Xiangzheng Zhang, Zhe Liu, Zonghao Ying.

Figure 1: Comparison between (a) traditional coarse-grained guardrails and (b) our precise, rule-based SafeHarbor framework.
Figure 2: The proposed SafeHarbor framework. The workflow operates in three coordinated stages: (I) adversarial rule generation, which constructs dynamic clusters of safety rules; (II) dual knowledge storage, which organizes rules and synthesized exemptions into a memory tree while training a safety projector; and (III) scoring & retrieval, which employs a gating mechanism to route queries between a fast path and ri…
Figure 3: Hyperparameter sensitivity analysis of the safety projector, evaluating the impact of the contrastive loss weight λ and the safety margin ∆ on classification accuracy and F1-score.
Figure 4: Hyperparameter sensitivity analysis on dynamic memory evolution. We evaluate the impact of (a) the similarity threshold (τsim) and (b) the gain threshold (τgain) on rule clustering and evolution performance. The metrics include Intent Match (IM), Noise Ratio (NR), and system overhead (Cluster Count/Merge Calls). The shaded regions indicate the optimal configurations selected for our final implementation.
Figure 5: Safety projector bypass analysis. This evaluation systematically explores the impact of varying the Harmful Threshold and Benign Threshold on two critical performance metrics: (a) Harmful Leak Rate, which quantifies the safety risk by measuring the percentage of malicious queries that bypass the filter; and (b) the Benign Fast Path Rate, which reflects system efficiency by indicating the proportion of safe…
Figure 6: The system prompt template for Goal Decomposition. It instructs the model to atomize the high-level harmful objective into a sequence of seemingly benign sub-steps, integrated within a natural narrative to bypass intent detection.
Figure 7: The system prompt template for Privilege Escalation. The model wraps the user query in a natural, authoritative directive to test access controls without using detectable format tags.
Figure 8: The system prompt template for Contextual Reframing. It embeds the {{TOPIC}} into safe educational or fictional contexts to evaluate intent recognition.
Figure 9: A qualitative case study demonstrating SafeHarbor's defense pipeline against Social Engineering. The system retrieves nuanced Exemption and Prohibition clauses, allowing the LLM judge to distinguish between "investigative summaries" (allowed) and "causing distress" (forbidden), ultimately blocking the harassment attempt.
Figure 10: Case study of False Positive Mitigation. Despite high-risk keywords like "SSH" and "backup", SafeHarbor retrieves the specific Exemption Clause for "making a backup copy". The LLM Verifier, aided by a low Projector Harm Score (0.0853), correctly identifies the administrative context and permits the operation.
Figure 11: The system prompt utilized for the Grule.Generate function. The model is instructed to perform contrastive analysis between harmful and benign query clusters to derive nuanced exemption clauses without over-generalizing.
Figure 12: The system prompt for the Grule.Refine function. When a new attack trajectory falls within the semantic basin of an existing cluster (High Similarity, Low Information Gain), this module merges the specific nuances of the new attack into the existing rule to prevent redundancy while expanding benign exemptions.
Figure 13: The system prompt for the LLM Judgment in the retrieving phase. It integrates dynamic safety signals and retrieval-augmented exemptions to distinguish between legitimate administrative actions and actual threats, enforcing a "Presumption of Utility" for authorized users.
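
The captions for Figures 2 and 5 mention a gating mechanism driven by a Harmful Threshold and a Benign Threshold on the safety projector's score. A minimal sketch of such a two-threshold gate follows. The three route names, the default threshold values, and the choice to block outright above the harmful threshold are assumptions made for illustration; the captions do not specify them.

```python
def route(harm_score, benign_threshold=0.2, harmful_threshold=0.8):
    """Route a query by its projector harm score in [0, 1].

    Assumed behavior (not confirmed by the source):
      score <= benign_threshold   -> "fast_path" (skip the LLM judge)
      score >= harmful_threshold  -> "block"
      otherwise                   -> "judge" (retrieve rules, ask the LLM judge)
    """
    if not 0.0 <= harm_score <= 1.0:
        raise ValueError("harm_score must lie in [0, 1]")
    if benign_threshold > harmful_threshold:
        raise ValueError("benign_threshold must not exceed harmful_threshold")
    if harm_score <= benign_threshold:
        return "fast_path"
    if harm_score >= harmful_threshold:
        return "block"
    return "judge"
```

In this reading, raising the benign threshold sends more queries down the fast path, which is the trade-off between Benign Fast Path Rate and Harmful Leak Rate that Figure 5 explores.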

Abstract

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
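
Read together with the Figure 4 and Figure 12 captions, the self-evolving memory appears to decide whether a new attack trajectory is folded into an existing rule cluster (high similarity, low information gain) or starts a new one. The sketch below shows one way that decision could be written. The embedding vectors, the caller-supplied `info_gain` value, the function names, and the default values for `tau_sim` and `tau_gain` are all hypothetical; only the two threshold concepts come from the captions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def place_rule(new_vec, centroids, info_gain, tau_sim=0.7, tau_gain=0.3):
    """Decide where a new attack trajectory goes.

    Returns ("refine", index) when the nearest cluster is similar enough
    and the new trajectory adds little information; otherwise returns
    ("generate", None) to signal that a new rule cluster is needed.
    """
    if not centroids:
        return ("generate", None)
    sims = [cosine(new_vec, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] >= tau_sim and info_gain < tau_gain:
        return ("refine", best)
    return ("generate", None)
```

For example, under these assumed defaults a trajectory that nearly coincides with the second centroid and carries an information gain of 0.05 would be refined into that cluster, while the same trajectory with a gain of 0.9 would start a new one.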

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes SafeHarbor, a training-free guardrail framework for LLM agents that extracts context-aware defense rules via enhanced adversarial generation, stores and injects them through a local hierarchical memory system, and applies an information entropy-based self-evolution mechanism with dynamic node splitting and merging. It claims this yields state-of-the-art results on ambiguous benign tasks and explicit malicious attacks, specifically a peak benign utility of 63.6% on GPT-4o while maintaining >93% refusal rate on harmful requests, as a plug-and-play solution.

Significance. If the performance claims and mechanism attributions hold after proper validation, the work could provide a practical, reproducible approach to balancing safety and utility in autonomous LLM agents without retraining. The public release of source code supports reproducibility and is a strength.

major comments (3)
  1. [Abstract] Abstract: The central performance claims (63.6% benign utility on GPT-4o; >93% refusal rate) are asserted without any description of experimental setup, datasets, baselines, measurement protocols, or number of trials. This absence is load-bearing because the paper attributes these numbers specifically to the hierarchical memory and entropy-driven split/merge mechanisms.
  2. [Abstract] Abstract (self-evolution mechanism description): No ablation, rule-retention analysis, or failure-case examination is referenced for the entropy-based splitting/merging logic. Without these, it is impossible to confirm that the mechanism improves coverage without dropping critical rules or introducing inconsistent boundaries, undermining attribution of the reported gains.
  3. [Abstract] Abstract (adversarial generation and hierarchical injection): The assumption that context-aware rules generated adversarially remain effective when managed via hierarchical memory lacks any cited validation, retention-rate measurement, or comparison to static rules. This is load-bearing for the claim that the framework establishes precise decision boundaries.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' but supplies no concrete benchmark names or task counts; adding these would improve clarity without altering the technical content.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments focused on the abstract. The full manuscript contains detailed experimental sections, ablations, and validations supporting the claims; however, we acknowledge that the abstract could be strengthened for self-containment and will revise it accordingly.

  1. Referee: [Abstract] Abstract: The central performance claims (63.6% benign utility on GPT-4o; >93% refusal rate) are asserted without any description of experimental setup, datasets, baselines, measurement protocols, or number of trials. This absence is load-bearing because the paper attributes these numbers specifically to the hierarchical memory and entropy-driven split/merge mechanisms.

    Authors: The abstract summarizes key outcomes while the experimental setup (datasets for ambiguous benign and malicious tasks, baselines including static guardrails, measurement via utility and refusal rates, and multi-trial averaging) is described in Section 4. Attribution to the mechanisms rests on the comparisons and ablations in Section 5. To address the concern about self-containment, we will revise the abstract to include a concise clause noting the evaluation was performed across standard benchmarks with multiple trials on models including GPT-4o. revision: yes

  2. Referee: [Abstract] Abstract (self-evolution mechanism description): No ablation, rule-retention analysis, or failure-case examination is referenced for the entropy-based splitting/merging logic. Without these, it is impossible to confirm that the mechanism improves coverage without dropping critical rules or introducing inconsistent boundaries, undermining attribution of the reported gains.

    Authors: The abstract omits these details due to space limits, but the manuscript provides ablation studies, rule-retention rates, and failure-case analysis for the entropy-driven split/merge logic in Section 5.2, showing improved coverage without loss of critical rules. We will revise the abstract to briefly reference that the self-evolution mechanism was validated via such analyses. revision: partial

  3. Referee: [Abstract] Abstract (adversarial generation and hierarchical injection): The assumption that context-aware rules generated adversarially remain effective when managed via hierarchical memory lacks any cited validation, retention-rate measurement, or comparison to static rules. This is load-bearing for the claim that the framework establishes precise decision boundaries.

    Authors: The manuscript validates adversarial rule effectiveness under hierarchical management through direct comparisons to static rules and retention-rate measurements in the experimental results (Section 5). These support the precise boundaries claim. We will revise the abstract to note that the hierarchical injection was evaluated against static baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system proposal with no derivations or self-referential predictions

The paper presents a framework (SafeHarbor) with hierarchical memory, adversarial rule generation, and entropy-based evolution, evaluated via experiments claiming SOTA metrics (63.6% benign utility, >93% refusal). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on experimental results rather than any chain that reduces to its own inputs by construction. This matches the default expectation of no circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities with independent evidence are stated beyond the high-level framework description.

invented entities (1)
  • Hierarchical memory system with dynamic node splitting/merging (no independent evidence)
    purpose: Dynamic injection of context-aware defense rules
    Introduced as core component of SafeHarbor; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5766 in / 1086 out tokens · 34468 ms · 2026-05-25T06:16:18.217357+00:00 · methodology
