pith. machine review for the scientific record.

arxiv: 2605.05704 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.AI

Recognition: unknown

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 09:27 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agent safety · hierarchical memory · guardrail · adversarial generation · entropy-based optimization · over-refusal mitigation · tool-use security · context-aware rules

The pith

SafeHarbor uses hierarchical memory to extract and evolve context-aware rules that let LLM agents refuse harmful tool use while handling ambiguous benign tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SafeHarbor to resolve the common trade-off where stronger safety rules for LLM agents cause excessive refusal of legitimate requests. It generates context-aware defense rules through enhanced adversarial examples and stores them in a local hierarchical memory that supports dynamic injection at runtime. An entropy-based self-evolution process continuously splits and merges memory nodes to refine decision boundaries without any model training or fine-tuning. The resulting plug-and-play guardrail is shown to reach 63.6 percent benign utility on GPT-4o while refusing more than 93 percent of explicit harmful requests.
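The review does not give the exact split/merge formulas, so the following is a minimal sketch, assuming Shannon entropy over the intent labels stored in a memory node; the function names and thresholds (`tau_split`, `tau_gain`) are our own, illustrative choices:

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (bits) of the intent-label distribution in a memory node."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def should_split(node_labels, tau_split=0.8):
    # A high-entropy node mixes heterogeneous intents: split it into
    # child nodes so each covers a tighter decision boundary.
    return shannon_entropy(node_labels) > tau_split

def should_merge(labels_a, labels_b, tau_gain=0.1):
    # Merge two sibling nodes when combining them adds almost no entropy,
    # i.e. their rules already cover near-identical intents.
    merged = list(labels_a) + list(labels_b)
    gain = shannon_entropy(merged) - max(shannon_entropy(labels_a),
                                         shannon_entropy(labels_b))
    return gain < tau_gain

mixed = ["data_exfiltration"] * 5 + ["harassment"] * 5
print(should_split(mixed))                       # True: 1.0 bit of entropy
print(should_split(["harassment"] * 10))         # False: homogeneous node
print(should_merge(["spam"] * 4, ["spam"] * 6))  # True: no entropy gain
```

SafeHarbor's actual thresholds and label granularity are unknown; the sketch only captures the direction of the criteria: split on high entropy, merge on low information gain.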

Core claim

SafeHarbor extracts context-aware defense rules via enhanced adversarial generation, then organizes and injects them through a local hierarchical memory system. The memory structure self-evolves by using information entropy to decide when nodes should split or merge, producing precise, adaptive decision boundaries for tool-use decisions. This training-free design yields state-of-the-art balance between safety and utility across both ambiguous benign tasks and direct malicious attacks.

What carries the argument

The local hierarchical memory system that stores and dynamically injects context-aware defense rules, evolved through entropy-based node splitting and merging.
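The review does not expose the memory API; as a hedged illustration, a hierarchical store might collect rules along the path from general to specific, so that broad prohibitions and narrow exemptions are injected into the agent's context together. The names (`MemoryNode`, `insert`, `retrieve`) are our own, not the paper's:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    topic: str
    rules: list = field(default_factory=list)     # defense rules at this granularity
    children: dict = field(default_factory=dict)  # topic -> MemoryNode

    def insert(self, path, rule):
        """Store a rule at the node addressed by a topic path, creating nodes as needed."""
        node = self
        for topic in path:
            node = node.children.setdefault(topic, MemoryNode(topic))
        node.rules.append(rule)

    def retrieve(self, path):
        """Collect rules from root to leaf so general and specific clauses are injected together."""
        node, collected = self, list(self.rules)
        for topic in path:
            node = node.children.get(topic)
            if node is None:
                break
            collected.extend(node.rules)
        return collected

root = MemoryNode("root", rules=["Refuse requests that target a named individual with harm."])
root.insert(["tool_use", "social_media"],
            "Allow profile lookups for investigative summaries; forbid harassment.")
# Both the general prohibition and the specific exemption come back together:
print(root.retrieve(["tool_use", "social_media"]))
```

Runtime injection would then paste the retrieved clauses into the agent's system prompt before tool-use decisions; the entropy-driven evolution the paper describes would reorganize this tree over time.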

If this is right

  • Achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks.
  • Reaches 63.6 percent peak benign utility on GPT-4o while sustaining over 93 percent refusal on harmful requests.
  • Operates as a training-free, efficient, plug-and-play module that requires no per-model retraining.
  • Continuously refines its own rule structure through entropy-based splitting and merging without external supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-augmented approach could be applied to multi-agent systems where safety rules must adapt across shared tool inventories.
  • Replacing static guidelines with evolving local memory may reduce the latency and compute cost of safety checks in production deployments.
  • Evaluating the framework on open-source models would test whether the reported balance holds outside proprietary frontier models.

Load-bearing premise

That rules derived from adversarial generation plus entropy-driven memory reorganization will keep stable, precise boundaries across new tasks, models, and deployments without extra tuning or fresh failure modes.

What would settle it

A new LLM agent or previously unseen task category on which SafeHarbor's refusal rate on harmful requests falls below 80 percent or its benign utility drops below 50 percent.

Figures

Figures reproduced from arXiv: 2605.05704 by Deyue Zhang, Dongdong Yang, Hao Peng, Quanchen Zou, Wenxin Zhang, Xiangzheng Zhang, Zhe Liu, Zonghao Ying.

Figure 1
Figure 1. Comparison between (a) traditional coarse-grained guardrails and (b) the precise, rule-based SafeHarbor framework.
Figure 2
Figure 2. The proposed SafeHarbor framework. The workflow operates in three coordinated stages: (I) adversarial rule generation, which constructs dynamic clusters of safety rules; (II) dual knowledge storage, which organizes rules and synthesized exemptions into a memory tree while training a safety projector; and (III) scoring and retrieval, which employs a gating mechanism to route queries between a fast path and ri…
Figure 3
Figure 3. Hyperparameter sensitivity analysis of the safety projector, evaluating the impact of the contrastive loss weight λ and the safety margin ∆ on classification accuracy and F1-score.
Figure 4
Figure 4. Hyperparameter sensitivity analysis on dynamic memory evolution. We evaluate the impact of (a) the similarity threshold (τsim) and (b) the gain threshold (τgain) on rule clustering and evolution performance. The metrics include Intent Match (IM), Noise Ratio (NR), and system overhead (Cluster Count / Merge Calls). The shaded regions indicate the optimal configurations selected for the final implementation.
Figure 5
Figure 5. Safety projector bypass analysis. This evaluation systematically explores the impact of varying the harmful threshold and benign threshold on two critical performance metrics: (a) Harmful Leak Rate, which quantifies the safety risk by measuring the percentage of malicious queries that bypass the filter; and (b) Benign Fast Path Rate, which reflects system efficiency by indicating the proportion of safe…
Figure 6
Figure 6. The system prompt template for Goal Decomposition. It instructs the model to atomize a high-level harmful objective into a sequence of seemingly benign sub-steps, integrated within a natural narrative to bypass intent detection.
Figure 7
Figure 7. The system prompt template for Privilege Escalation. The model wraps the user query in a natural, authoritative directive to test access controls without using detectable format tags.
Figure 8
Figure 8. The system prompt template for Contextual Reframing. It embeds the {{TOPIC}} into safe educational or fictional contexts to evaluate intent recognition.
Figure 9
Figure 9. A qualitative case study demonstrating SafeHarbor's defense pipeline against social engineering. The system retrieves nuanced Exemption and Prohibition clauses, allowing the LLM judge to distinguish between "investigative summaries" (allowed) and "causing distress" (forbidden), ultimately blocking the harassment attempt.
Figure 10
Figure 10. Case study of false-positive mitigation. Despite high-risk keywords like "SSH" and "backup", SafeHarbor retrieves the specific exemption clause for "making a backup copy". The LLM verifier, aided by a low projector harm score (0.0853), correctly identifies the administrative context and permits the operation.
Figure 11
Figure 11. The system prompt utilized for the Grule.Generate function. The model is instructed to perform contrastive analysis between harmful and benign query clusters to derive nuanced exemption clauses without over-generalizing.
Figure 12
Figure 12. The system prompt for the Grule.Refine function. When a new attack trajectory falls within the semantic basin of an existing cluster (high similarity, low information gain), this module merges the specific nuances of the new attack into the existing rule to prevent redundancy while expanding benign exemptions.
Figure 13
Figure 13. The system prompt for the LLM judgment in the retrieval phase. It integrates dynamic safety signals and retrieval-augmented exemptions to distinguish between legitimate administrative actions and actual threats, enforcing a "Presumption of Utility" for authorized users.
read the original abstract

With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SafeHarbor, a training-free, plug-and-play framework for LLM agent safety. It extracts context-aware defense rules via enhanced adversarial generation, maintains them in a local hierarchical memory system for dynamic injection, and uses an information entropy-based self-evolution mechanism to optimize the structure through node splitting and merging. The central claim is that this yields state-of-the-art performance, including a peak benign utility of 63.6% on GPT-4o while achieving >93% refusal on harmful requests, thereby mitigating over-refusal without compromising utility.

Significance. If the results are substantiated, the work offers a practical advance in balancing safety and utility for tool-using LLM agents. The hierarchical memory and entropy-driven adaptation provide a concrete, externally grounded alternative to static guidelines or fine-tuned models, with potential for deployment across models. Public code release supports reproducibility.

major comments (2)
  1. [§4 (Experiments) and Table 2] The reported 63.6% benign utility and >93% refusal rates are presented as SOTA, but the section provides no details on baseline implementations, dataset construction for the ambiguous benign tasks, the number of trials, or statistical significance (e.g., variance across runs). This makes it impossible to confirm that the performance edge is not due to post-hoc choices or unstated assumptions.
  2. [§3.3 (Entropy-based self-evolution) and §4.3 (Ablations)] The claim that entropy-driven splitting/merging produces stable decision boundaries without per-deployment tuning or new failure modes is load-bearing, yet no ablation isolates the entropy component from static rules or enhanced adversarial generation. Experiments are confined to fixed benchmarks; no cross-model transfer, long-horizon agent trajectories, or OOD inputs are reported to validate generalization.
minor comments (2)
  1. [Abstract] The phrase "extensive experiments" is used without quantifying the number of models, tasks, or attack types evaluated beyond the GPT-4o peak.
  2. [§3.2 (Hierarchical memory)] The node splitting/merging criteria would be clearer with an explicit equation for the entropy threshold rather than a prose description.
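For concreteness, one candidate formalization of the criteria the paper describes in prose; the notation is ours, not the authors' (H is node entropy over the fractions p_k of rules with intent label k, and the τ values are exactly the thresholds the minor comment asks the paper to state):

```latex
% Hypothetical formalization of the splitting/merging criteria (our notation).
H(v) = -\sum_{k} p_k \log p_k,
\qquad \text{split } v \ \text{if}\ H(v) > \tau_{\mathrm{split}},
\qquad \text{merge } u, v \ \text{if}\ H(u \cup v) - \max\!\bigl(H(u), H(v)\bigr) < \tau_{\mathrm{gain}}.
```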

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on experimental rigor and the need for stronger validation of the entropy-driven mechanism. We provide point-by-point responses below and will revise the manuscript accordingly to improve reproducibility and clarify limitations.

read point-by-point responses
  1. Referee: [§4 (Experiments) and Table 2] The reported 63.6% benign utility and >93% refusal rates are presented as SOTA, but the section provides no details on baseline implementations, dataset construction for the ambiguous benign tasks, the number of trials, or statistical significance (e.g., variance across runs). This makes it impossible to confirm that the performance edge is not due to post-hoc choices or unstated assumptions.

    Authors: We agree that the experimental section lacks sufficient implementation details. In the revised manuscript we will expand §4 and the Table 2 caption (plus a new appendix) to specify: (1) exact baseline implementations and reproduction steps, including any prompt templates or hyperparameters used; (2) the full construction process for the ambiguous benign tasks dataset, including source prompts, ambiguity criteria, and filtering; (3) the number of independent trials (we ran 3 seeds) together with mean and standard deviation to establish statistical significance. These additions will allow independent verification of the reported performance. revision: yes

  2. Referee: [§3.3 (Entropy-based self-evolution) and §4.3 (Ablations)] The claim that entropy-driven splitting/merging produces stable decision boundaries without per-deployment tuning or new failure modes is load-bearing, yet no ablation isolates the entropy component from static rules or enhanced adversarial generation. Experiments are confined to fixed benchmarks; no cross-model transfer, long-horizon agent trajectories, or OOD inputs are reported to validate generalization.

    Authors: We will add a targeted ablation in §4.3 that isolates the entropy component by comparing the full SafeHarbor system against a static-rules variant (enhanced adversarial generation only, no splitting/merging). This will directly quantify the contribution of the entropy-driven evolution to decision-boundary stability. Regarding generalization, our current evaluation is limited to the fixed benchmarks described in §4. We will add an explicit limitations paragraph acknowledging that cross-model transfer, long-horizon trajectories, and OOD inputs were not evaluated and that further work is required to substantiate broader claims; the hierarchical design is intended to support such adaptation but we do not present evidence for it beyond the reported settings. revision: partial

standing simulated objections not resolved
  • Full empirical validation of generalization claims via cross-model transfer, long-horizon agent trajectories, and OOD inputs, as these require new experimental setups and benchmarks not performed in the original work.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained and externally grounded

full rationale

The paper presents SafeHarbor as a training-free, plug-and-play framework that extracts context-aware rules via enhanced adversarial generation and optimizes a hierarchical memory via an information-entropy self-evolution mechanism (node splitting/merging). No equations, fitted parameters, or performance claims reduce the reported metrics (e.g., 63.6% benign utility, >93% refusal) to quantities defined by the method's own outputs. The central claims rest on standard external concepts (adversarial testing, entropy) and benchmark experiments rather than self-referential definitions or self-citation chains; the abstract contains no equations that collapse the claimed results into the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the proposed memory structure and self-evolution mechanism; no explicit free parameters, axioms, or invented physical entities are stated in the abstract.

invented entities (1)
  • Hierarchical memory system with entropy-driven splitting and merging (no independent evidence)
    purpose: Dynamic injection of context-aware defense rules
    New component introduced by the paper to enable training-free adaptation.

pith-pipeline@v0.9.0 · 5546 in / 1231 out tokens · 48360 ms · 2026-05-08T09:27:19.494741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1] Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. The Thirteenth International Conference on Learning Representations, 2024.

  2. [2] Li, X., Wang, R., Cheng, M., Zhou, T., and Hsieh, C.-J. DrAttack: Prompt decomposition and reconstruction makes powerful LLM jailbreakers. Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13891–13913.

  3. [3] Liu, L., Yang, X., Shen, Y., Hu, B., Zhang, Z., Gu, J., and Zhang, G. Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719.

  4. [4] Luo, W., Dai, S., Liu, X., Banerjee, S., Sun, H., Chen, M., and Xiao, C. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. arXiv preprint arXiv:2502.11448.

  5. [5] Ouyang, S., Yan, J., Hsu, I., Chen, Y., Jiang, K., Wang, Z., Han, R., Le, L. T., Daruki, S., Tang, X., et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140.

  6. [6] Shao, S., Ren, Q., Qian, C., Wei, B., Guo, D., JingYi, Y., Song, X., Zhang, L., Zhang, W., Liu, D., et al. Your agent may misevolve: Emergent risks in self-evolving LLM agents. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025.

  7. [7] Shi, J., Yuan, Z., Liu, Y., Huang, Y., Zhou, P., Sun, L., and Gong, N. Z. Optimization-based prompt injection attack to LLM-as-a-judge. Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pp. 660–674.

  8. [8] Xiang, Z., Zheng, L., Li, Y., Hong, J., Li, Q., Xie, H., Zhang, J., Xiong, Z., Xie, C., Yang, C., et al. GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning. arXiv preprint arXiv:2406.09187.

  9. [9] Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., and Zhang, Y. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.

  10. [10] Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. NeurIPS 2022 Foundation Models for Decision Making Workshop.

  11. [11] Zhang, J., Yin, L., Zhou, Y., and Hu, S. AgentAlign: Navigating safety alignment in the shift from informative to agentic large language models. arXiv preprint arXiv:2505.23020, 2025. Zhang, Z., Cui, S., Lu, Y., Zhou, J., Yang, J., Wang, H., and Huang, M. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470.

  12. [12] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.