SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Deyue Zhang; Dongdong Yang; Hao Peng; Quanchen Zou; Wenxin Zhang; Xiangzheng Zhang; Zhe Liu; Zonghao Ying

arxiv: 2605.05704 · v2 · pith:OBCY4VANnew · submitted 2026-05-07 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-25 06:16 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM agent safety · guardrail · hierarchical memory · adversarial generation · self-evolution · safety-utility tradeoff · dynamic rule injection

The pith

SafeHarbor uses hierarchical memory to inject context-aware defense rules into LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that LLM agents can maintain high utility on ambiguous benign tasks while refusing most harmful requests by replacing static guidelines with dynamically injected, context-aware rules. These rules are generated through adversarial methods and stored in a local hierarchical memory that evolves via entropy-driven splitting and merging. A sympathetic reader would care because existing defenses often force a sharp safety-utility trade-off that blocks useful agent behavior. If the approach holds, agents could operate in real environments with less over-refusal and without model retraining.

Core claim

SafeHarbor extracts context-aware defense rules through enhanced adversarial generation and maintains them in a local hierarchical memory system for dynamic rule injection, with an information entropy-based self-evolution mechanism that optimizes the memory structure through dynamic node splitting and merging.

What carries the argument

The local hierarchical memory system that stores and dynamically injects context-aware rules while self-evolving through entropy-based node splitting and merging.
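
A minimal sketch of what one node of such a memory might hold, assuming the fields quoted elsewhere on this page from the paper (a centroid, a covering radius, and a dual-policy unit of harm rules and benign exemptions). The class name `MemoryNode`, the Euclidean distance, and the depth-first `lookup` are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    # Hypothetical layout; every field name here is an assumption.
    centroid: list                     # representative embedding of the node
    radius: float                      # covering radius around the centroid
    harm_rules: list = field(default_factory=list)         # prohibition clauses
    benign_exemptions: list = field(default_factory=list)  # exemption clauses
    children: list = field(default_factory=list)           # finer-grained nodes


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lookup(node, query):
    """Return the deepest node whose covering ball contains `query`, or None."""
    if distance(node.centroid, query) > node.radius:
        return None
    for child in node.children:
        hit = lookup(child, query)
        if hit is not None:
            return hit
    return node
```

In this sketch, the rules and exemptions of the returned node are what would be injected into the agent's context; a query outside every covering ball returns `None` and gets no injected rules.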

If this is right

Agents reach a peak benign utility of 63.6 percent on GPT-4o while refusing over 93 percent of harmful requests.
The system supplies a training-free and plug-and-play defense that avoids retraining the base model.
Dynamic rule injection reduces over-refusal on ambiguous benign tasks compared with static guidelines.
Entropy-based splitting and merging continuously refines the memory structure during operation.
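
This page does not state the exact entropy criterion, so the following is only a sketch of the general idea. It assumes a node becomes a split candidate when the Shannon entropy of the intent labels of its stored rules exceeds a threshold. The labels, the function names, and the `max_entropy_bits` value are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the label distribution inside one node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_split(labels, max_entropy_bits=0.9):
    """Assumed criterion: split when the node mixes too many distinct intents."""
    return shannon_entropy(labels) > max_entropy_bits
```

A node holding only one kind of intent has zero entropy and is left alone; a node split evenly between two intents has one bit of entropy and would be split under the placeholder threshold.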

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same memory structure could support safety rules that adapt across multiple interacting agents rather than a single agent.
Self-evolution through entropy might lower the frequency of manual safety policy updates as new attack patterns appear.
Memory augmentation of this form offers an alternative path to safety alignment that avoids the cost of repeated fine-tuning.

Load-bearing premise

The assumption that adversarially generated context-aware rules remain effective and do not create new failure modes once injected and managed through the hierarchical memory and entropy mechanisms.

What would settle it

A test set of harmful requests generated independently of the adversarial generation process that causes the refusal rate to fall below 93 percent or the benign utility to fall below 60 percent on GPT-4o.
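
The check described above reduces to two ratios compared against two floors. A small sketch, assuming benign utility is scored as the fraction of benign tasks completed (the page does not say how the paper scores it); the function and parameter names are hypothetical.

```python
def settles_against(refused_harmful, total_harmful,
                    completed_benign, total_benign,
                    min_refusal=0.93, min_utility=0.60):
    """True if an independent test set falls below either floor named above."""
    refusal_rate = refused_harmful / total_harmful
    benign_utility = completed_benign / total_benign
    return refusal_rate < min_refusal or benign_utility < min_utility
```

For example, 90 refusals out of 100 independent harmful requests would count against the claim even if benign utility held up.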

Figures

Figures reproduced from arXiv: 2605.05704 by Deyue Zhang, Dongdong Yang, Hao Peng, Quanchen Zou, Wenxin Zhang, Xiangzheng Zhang, Zhe Liu, Zonghao Ying.

**Figure 1.** Comparison between (a) Traditional coarse-grained guardrails and (b) Our precise, rule-based SAFEHARBOR framework.

**Figure 2.** The proposed SAFEHARBOR framework. The workflow operates in three coordinated stages: (I) adversarial rule generation, which constructs dynamic clusters of safety rules; (II) dual knowledge storage, which organizes rules and synthesized exemptions into a memory tree while training a safety projector; and (III) scoring & retrieval, which employs a gating mechanism to route queries between a fast path and ri…

**Figure 3.** Hyperparameter sensitivity analysis of the safety projector evaluating the impact of the contrastive loss weight λ and the safety margin ∆ on classification accuracy and F1-score.

**Figure 4.** Hyperparameter Sensitivity Analysis on Dynamic Memory Evolution. We evaluate the impact of (a) the Similarity Threshold (τsim) and (b) the Gain Threshold (τgain) on rule clustering and evolution performance. The metrics include Intent Match (IM), Noise Ratio (NR), and system overhead (Cluster Count/Merge Calls). The shaded regions indicate the optimal configurations selected for our final implementation.

**Figure 5.** Safety Projector Bypass Analysis. This evaluation systematically explores the impact of varying the Harmful Threshold and Benign Threshold on two critical performance metrics: (a) Harmful Leak Rate, which quantifies the safety risk by measuring the percentage of malicious queries that bypass the filter; and (b) the Benign Fast Path Rate, which reflects system efficiency by indicating the proportion of safe…

**Figure 6.** The system prompt template for Goal Decomposition. It instructs the model to atomize the high-level harmful objective into a sequence of seemingly benign sub-steps, integrated within a natural narrative to bypass intent detection.

**Figure 7.** The system prompt template for Privilege Escalation. The model wraps the user query in a natural, authoritative directive to test access controls without using detectable format tags.

**Figure 8.** The system prompt template for Contextual Reframing. It embeds the {{TOPIC}} into safe educational or fictional contexts to evaluate intent recognition.

**Figure 9.** A qualitative case study demonstrating SAFEHARBOR's defense pipeline against Social Engineering. The system retrieves nuanced Exemption and Prohibition clauses, allowing the LLM judge to distinguish between "investigative summaries" (allowed) and "causing distress" (forbidden), ultimately blocking the harassment attempt.

**Figure 10.** Case study of False Positive Mitigation. Despite high-risk keywords like "SSH" and "backup", SAFEHARBOR retrieves the specific Exemption Clause for "making a backup copy". The LLM Verifier, aided by a low Projector Harm Score (0.0853), correctly identifies the administrative context and permits the operation.

**Figure 11.** The system prompt utilized for the Grule.Generate function. The model is instructed to perform contrastive analysis between harmful and benign query clusters to derive nuanced exemption clauses without over-generalizing.

**Figure 12.** The system prompt for the Grule.Refine function. When a new attack trajectory falls within the semantic basin of an existing cluster (High Similarity, Low Information Gain), this module merges the specific nuances of the new attack into the existing rule to prevent redundancy while expanding benign exemptions.

**Figure 13.** The system prompt for the LLM Judgment in the retrieving phase. It integrates dynamic safety signals and retrieval-augmented exemptions to distinguish between legitimate administrative actions and actual threats, enforcing a "Presumption of Utility" for authorized users.
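
The captions for Figures 2, 5 and 10 together describe a projector harm score gated by a Benign Threshold and a Harmful Threshold, with a fast path for clearly benign queries. The sketch below is one plausible reading of that gate, not the authors' code. The threshold values are placeholders, the `"full_check"` label stands in for the second path whose name is cut off in the Figure 2 caption, and blocking outright above the Harmful Threshold is an assumption.

```python
def route(harm_score, benign_threshold=0.2, harmful_threshold=0.8):
    """Route one query by its projector harm score (all labels hypothetical)."""
    if harm_score <= benign_threshold:
        return "fast_path"    # skip retrieval and LLM judgment
    if harm_score >= harmful_threshold:
        return "block"        # assumed behaviour, not confirmed by the captions
    return "full_check"       # retrieve rules, then ask the LLM judge


def gate_metrics(scored, benign_threshold=0.2, harmful_threshold=0.8):
    """Compute (harmful leak rate, benign fast path rate).

    `scored` is a sequence of (harm_score, is_harmful) pairs.
    """
    scored = list(scored)
    harmful = [s for s, is_harmful in scored if is_harmful]
    benign = [s for s, is_harmful in scored if not is_harmful]

    def fast_fraction(scores):
        if not scores:
            return 0.0
        hits = sum(
            route(s, benign_threshold, harmful_threshold) == "fast_path"
            for s in scores
        )
        return hits / len(scores)

    return fast_fraction(harmful), fast_fraction(benign)
```

Raising the benign threshold sends more benign traffic down the fast path but also lets more harmful queries leak through it, which is the trade-off the Figure 5 caption describes.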

Abstract

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SafeHarbor adds entropy-driven splitting and merging to a hierarchical memory for storing adversarial rules in LLM agent guardrails, but the reported gains lack ablations or mechanism validation.

SafeHarbor extracts context-aware rules via adversarial generation, stores them in a local hierarchical memory for dynamic injection, and uses an information entropy measure to split and merge nodes over time. This is pitched as a training-free, plug-and-play guardrail that improves the safety-utility balance for tool-using agents. The entropy self-evolution step is the clearest new piece; the hierarchical structure itself is a specific design choice that could be straightforward to implement. They also release the code, which helps anyone who wants to test it directly.

The abstract frames the over-refusal problem clearly and gives concrete numbers: 63.6% benign utility on GPT-4o with over 93% refusal on harmful requests. Those figures are the main thing a reader would take away at first glance.

The soft spot is exactly what the stress-test note flags. The performance is attributed to the combination of adversarial rules, hierarchical injection, and entropy-driven adaptation, yet nothing in the description shows an ablation that removes the entropy component, measures rule retention after merges, or checks whether split/merge operations introduce inconsistent boundaries. Without those checks it is impossible to credit the new mechanisms rather than the base adversarial generation or benchmark details. The assumption that the dynamic memory preserves rule effectiveness without new failure modes stays untested on the evidence given.

This is for people working on practical defenses for LLM agents rather than foundational theory. A reader already building guardrails might pick up the memory architecture as one option to try, but would have to run their own controls to see what actually moves the needle. It deserves peer review because the deployment problem is real and the approach is concrete enough to evaluate, though any review would need to press for the missing validation steps.
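
One of the checks the letter asks for, rule retention after a merge, could be measured along these lines. The sketch assumes a fixed set of harmful probe queries is replayed before and after a merge and that the IDs of the blocked probes are recorded each time. The function name and probe IDs are hypothetical, and this is not the authors' retention measurement.

```python
def retention_after_merge(blocked_before, blocked_after):
    """Return (retention_rate, dropped_probes) for one merge operation.

    retention_rate is the share of probes blocked before the merge that are
    still blocked afterwards; dropped_probes lists the ones that now pass.
    """
    before = set(blocked_before)
    after = set(blocked_after)
    if not before:
        return 1.0, []
    dropped = sorted(before - after)
    return len(before & after) / len(before), dropped
```

A rate below 1.0 would point at specific probes that the merged rule no longer covers, which is the kind of evidence the letter says is missing.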

Referee Report

3 major / 1 minor

Summary. The manuscript proposes SafeHarbor, a training-free guardrail framework for LLM agents that extracts context-aware defense rules via enhanced adversarial generation, stores and injects them through a local hierarchical memory system, and applies an information entropy-based self-evolution mechanism with dynamic node splitting and merging. It claims this yields state-of-the-art results on ambiguous benign tasks and explicit malicious attacks, specifically a peak benign utility of 63.6% on GPT-4o while maintaining >93% refusal rate on harmful requests, as a plug-and-play solution.

Significance. If the performance claims and mechanism attributions hold after proper validation, the work could provide a practical, reproducible approach to balancing safety and utility in autonomous LLM agents without retraining. The public release of source code supports reproducibility and is a strength.

major comments (3)

[Abstract] Abstract: The central performance claims (63.6% benign utility on GPT-4o; >93% refusal rate) are asserted without any description of experimental setup, datasets, baselines, measurement protocols, or number of trials. This absence is load-bearing because the paper attributes these numbers specifically to the hierarchical memory and entropy-driven split/merge mechanisms.
[Abstract] Abstract (self-evolution mechanism description): No ablation, rule-retention analysis, or failure-case examination is referenced for the entropy-based splitting/merging logic. Without these, it is impossible to confirm that the mechanism improves coverage without dropping critical rules or introducing inconsistent boundaries, undermining attribution of the reported gains.
[Abstract] Abstract (adversarial generation and hierarchical injection): The assumption that context-aware rules generated adversarially remain effective when managed via hierarchical memory lacks any cited validation, retention-rate measurement, or comparison to static rules. This is load-bearing for the claim that the framework establishes precise decision boundaries.

minor comments (1)

[Abstract] The abstract states 'extensive experiments' but supplies no concrete benchmark names or task counts; adding these would improve clarity without altering the technical content.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments focused on the abstract. The full manuscript contains detailed experimental sections, ablations, and validations supporting the claims; however, we acknowledge that the abstract could be strengthened for self-containment and will revise it accordingly.

Referee: [Abstract] Abstract: The central performance claims (63.6% benign utility on GPT-4o; >93% refusal rate) are asserted without any description of experimental setup, datasets, baselines, measurement protocols, or number of trials. This absence is load-bearing because the paper attributes these numbers specifically to the hierarchical memory and entropy-driven split/merge mechanisms.

Authors: The abstract summarizes key outcomes while the experimental setup (datasets for ambiguous benign and malicious tasks, baselines including static guardrails, measurement via utility and refusal rates, and multi-trial averaging) is described in Section 4. Attribution to the mechanisms rests on the comparisons and ablations in Section 5. To address the concern about self-containment, we will revise the abstract to include a concise clause noting the evaluation was performed across standard benchmarks with multiple trials on models including GPT-4o.

revision: yes

Referee: [Abstract] Abstract (self-evolution mechanism description): No ablation, rule-retention analysis, or failure-case examination is referenced for the entropy-based splitting/merging logic. Without these, it is impossible to confirm that the mechanism improves coverage without dropping critical rules or introducing inconsistent boundaries, undermining attribution of the reported gains.

Authors: The abstract omits these details due to space limits, but the manuscript provides ablation studies, rule-retention rates, and failure-case analysis for the entropy-driven split/merge logic in Section 5.2, showing improved coverage without loss of critical rules. We will revise the abstract to briefly reference that the self-evolution mechanism was validated via such analyses.

revision: partial

Referee: [Abstract] Abstract (adversarial generation and hierarchical injection): The assumption that context-aware rules generated adversarially remain effective when managed via hierarchical memory lacks any cited validation, retention-rate measurement, or comparison to static rules. This is load-bearing for the claim that the framework establishes precise decision boundaries.

Authors: The manuscript validates adversarial rule effectiveness under hierarchical management through direct comparisons to static rules and retention-rate measurements in the experimental results (Section 5). These support the precise boundaries claim. We will revise the abstract to note that the hierarchical injection was evaluated against static baselines.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system proposal with no derivations or self-referential predictions

The paper presents a framework (SafeHarbor) with hierarchical memory, adversarial rule generation, and entropy-based evolution, evaluated via experiments claiming SOTA metrics (63.6% benign utility, >93% refusal). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on experimental results rather than any chain that reduces to its own inputs by construction. This matches the default expectation of no circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities with independent evidence are stated beyond the high-level framework description.

invented entities (1)

Hierarchical memory system with dynamic node splitting/merging · no independent evidence
purpose: Dynamic injection of context-aware defense rules
Introduced as core component of SafeHarbor; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5766 in / 1086 out tokens · 34468 ms · 2026-05-25T06:16:18.217357+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: hierarchical memory tree M... centroid ci, covering radius ri, dual-policy unit {Rharm, Ebenign}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 5 internal anchors

[1] ReAct: Synergizing Reasoning and Acting in Language Models. NeurIPS 2022 Foundation Models for Decision Making Workshop.
[2] Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems.
[3] Webarena: A realistic web environment for building autonomous agents. International Conference on Learning Representations.
[4] Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems.
[5] PaLM-E: an embodied multimodal language model. Proceedings of the 40th International Conference on Machine Learning.
[6] Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research.
[7] SWE-bench: Can Language Models Resolve Real-world Github Issues? The Twelfth International Conference on Learning Representations.
[8] Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems.
[9] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674.
[10] Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails. Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations.
[11] RigorLLM: resilient guardrails for large language models against undesired content. Proceedings of the 41st International Conference on Machine Learning.
[12] Dynamic guided and domain applicable safeguards for enhanced security in large language models. Findings of the Association for Computational Linguistics: NAACL 2025.
[13] Contextual Agent Security: A Policy for Every Purpose. Proceedings of the 2025 Workshop on Hot Topics in Operating Systems.
[14] Testing Language Model Agents Safely in the Wild. Socially Responsible Language Modelling Research.
[15] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning. arXiv preprint arXiv:2406.09187.
[16] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. The Thirteenth International Conference on Learning Representations.
[17] Identifying the Risks of LM Agents with an LM-Emulated Sandbox. The Twelfth International Conference on Learning Representations.
[18] Llama Guard 3 8B. 2024.
[19] Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. Proceedings of the 16th ACM workshop on artificial intelligence and security.
[20] AdvAgent: Controllable Blackbox Red-teaming on Web Agents. Forty-second International Conference on Machine Learning.
[21] ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning. Forty-second International Conference on Machine Learning.
[22] Agrail: A lifelong agent guardrail with effective and adaptive safety detection. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
[23] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models. arXiv preprint arXiv:2505.23020.
[24] MemGPT: Towards LLMs as Operating Systems. 2023.
[25] A Survey on RAG with LLMs. Procedia computer science, 2024.
[26] Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.
[27] Memory sandbox: Transparent and interactive memory management for conversational agents. Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.
[28] A-mem: Agentic memory for llm agents. Advances in Neural Information Processing Systems.
[29] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025.
[30] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. The Thirteenth International Conference on Learning Representations.
[31] Agent-SafetyBench: Evaluating the Safety of LLM Agents. arXiv preprint arXiv:2412.14470.
[32] Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems.
[33] A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 2025.
[34] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140.
[35] Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv preprint arXiv:2311.08719.
[36] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers. Findings of the Association for Computational Linguistics: EMNLP 2024.
[37] Multi-step Jailbreaking Privacy Attacks on Large Language Models. The Twelfth International Conference on Learning Representations.
[38] HouYi: A Black-Box Jailbreaking Algorithm for Large Language Models Agents. arXiv preprint arXiv:2312.04353.
[39] Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security.
[40] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. Socially Responsible Language Modelling Research.
[41] Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems.
[42] Induction of decision trees. Machine learning, 1986.
[43] Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.
[44] Optimization-based prompt injection attack to llm-as-a-judge. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security.

[1] [1]

NeurIPS 2022 Foundation Models for Decision Making Workshop , year=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. NeurIPS 2022 Foundation Models for Decision Making Workshop , year=

work page 2022

[2] [2]

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , volume=

work page

[3] WebArena: A realistic web environment for building autonomous agents. International Conference on Learning Representations.

[4] Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems.

[5] PaLM-E: An embodied multimodal language model. Proceedings of the 40th International Conference on Machine Learning.

[6] Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research.

[7] SWE-bench: Can Language Models Resolve Real-world Github Issues? The Twelfth International Conference on Learning Representations.

[8] SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems.

[9] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674.

[10] NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023.

[11] RigorLLM: Resilient guardrails for large language models against undesired content. Proceedings of the 41st International Conference on Machine Learning.

[12] Dynamic guided and domain applicable safeguards for enhanced security in large language models. Findings of the Association for Computational Linguistics: NAACL 2025, 2025.

[13] Contextual Agent Security: A Policy for Every Purpose. Proceedings of the 2025 Workshop on Hot Topics in Operating Systems, 2025.

[14] Testing Language Model Agents Safely in the Wild. Socially Responsible Language Modelling Research.

[15] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning. arXiv preprint arXiv:2406.09187.

[16] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. The Thirteenth International Conference on Learning Representations.

[17] Identifying the Risks of LM Agents with an LM-Emulated Sandbox. The Twelfth International Conference on Learning Representations.

[18] Llama Guard 3 8B. 2024.

[19] Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security.

[20] AdvAgent: Controllable Blackbox Red-teaming on Web Agents. Forty-second International Conference on Machine Learning.

[21] ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning. Forty-second International Conference on Machine Learning.

[22] AGrail: A lifelong agent guardrail with effective and adaptive safety detection. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

[23] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models. arXiv preprint arXiv:2505.23020.

[24] MemGPT: Towards LLMs as Operating Systems. 2023.

[25] A Survey on RAG with LLMs. Procedia Computer Science, 2024.

[26] MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[27] Memory Sandbox: Transparent and interactive memory management for conversational agents. Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

[28] A-MEM: Agentic memory for LLM agents. Advances in Neural Information Processing Systems.

[29] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025, 2025.

[30] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. The Thirteenth International Conference on Learning Representations.

[31] Agent-SafetyBench: Evaluating the Safety of LLM Agents. arXiv preprint arXiv:2412.14470.

[32] Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

[33] A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 2025.

[34] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140.

[35] Think-in-memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719.

[36] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers. Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.

[37] Multi-step Jailbreaking Privacy Attacks on Large Language Models. The Twelfth International Conference on Learning Representations.

[38] HouYi: A Black-Box Jailbreaking Algorithm for Large Language Models Agents. arXiv preprint arXiv:2312.04353.

[39] Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security.

[40] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. Socially Responsible Language Modelling Research.

[41] Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems.

[42] Induction of decision trees. Machine Learning, 1986.

[43] Mining high-speed data streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[44] Optimization-based prompt injection attack to LLM-as-a-Judge. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024.