SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
Pith reviewed 2026-05-08 09:27 UTC · model grok-4.3
The pith
SafeHarbor uses hierarchical memory to extract and evolve context-aware rules that let LLM agents refuse harmful tool use while handling ambiguous benign tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SafeHarbor extracts context-aware defense rules via enhanced adversarial generation, then organizes and injects them through a local hierarchical memory system. The memory structure self-evolves by using information entropy to decide when nodes should split or merge, producing precise, adaptive decision boundaries for tool-use decisions. This training-free design yields state-of-the-art balance between safety and utility across both ambiguous benign tasks and direct malicious attacks.
What carries the argument
The local hierarchical memory system that stores and dynamically injects context-aware defense rules, evolved through entropy-based node splitting and merging.
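The entropy-driven splitting and merging can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the node structure, the category labels, and the `split_threshold`/`merge_threshold` values are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the rule-category labels stored in a node."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

class MemoryNode:
    """A node in a hierarchical rule memory; rules are (category, text) pairs."""
    def __init__(self, rules):
        self.rules = rules
        self.children = []

    def maybe_split(self, split_threshold=1.0):
        """Split when the node mixes rule categories too uniformly (high entropy)."""
        labels = [cat for cat, _ in self.rules]
        if entropy(labels) > split_threshold and len(set(labels)) > 1:
            for cat in set(labels):
                self.children.append(
                    MemoryNode([r for r in self.rules if r[0] == cat]))
            self.rules = []  # rules now live in the category-specific children
            return True
        return False

def maybe_merge(a, b, merge_threshold=0.3):
    """Merge two sibling nodes when their pooled rules stay nearly homogeneous."""
    combined = a.rules + b.rules
    if entropy([cat for cat, _ in combined]) < merge_threshold:
        return MemoryNode(combined)
    return None
```

Under this sketch, a node splits when its rules span categories too evenly and two siblings merge when their pooled rules remain nearly pure; the actual criteria and thresholds in SafeHarbor may differ from these assumed forms.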
If this is right
- Achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks.
- Reaches 63.6 percent peak benign utility on GPT-4o while sustaining over 93 percent refusal on harmful requests.
- Operates as a training-free, efficient, plug-and-play module that requires no per-model retraining.
- Continuously refines its own rule structure through entropy-based splitting and merging without external supervision.
Where Pith is reading between the lines
- The same memory-augmented approach could be applied to multi-agent systems where safety rules must adapt across shared tool inventories.
- Replacing static guidelines with evolving local memory may reduce the latency and compute cost of safety checks in production deployments.
- Evaluating the framework on open-source models would test whether the reported balance holds outside proprietary frontier models.
Load-bearing premise
That rules derived from adversarial generation plus entropy-driven memory reorganization will keep stable, precise boundaries across new tasks, models, and deployments without extra tuning or fresh failure modes.
What would settle it
A new LLM agent or previously unseen task category on which SafeHarbor's refusal rate on harmful requests falls below 80 percent or its benign utility drops below 50 percent.
Original abstract
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SafeHarbor, a training-free, plug-and-play framework for LLM agent safety. It extracts context-aware defense rules via enhanced adversarial generation, maintains them in a local hierarchical memory system for dynamic injection, and uses an information entropy-based self-evolution mechanism to optimize the structure through node splitting and merging. The central claim is that this yields state-of-the-art performance, including a peak benign utility of 63.6% on GPT-4o while achieving >93% refusal on harmful requests, thereby mitigating over-refusal without compromising utility.
Significance. If the results are substantiated, the work offers a practical advance in balancing safety and utility for tool-using LLM agents. The hierarchical memory and entropy-driven adaptation provide a concrete, externally grounded alternative to static guidelines or fine-tuned models, with potential for deployment across models. Public code release supports reproducibility.
major comments (2)
- §4 (Experiments) and Table 2: The reported 63.6% benign utility and >93% refusal rates are presented as SOTA, but the section provides no details on baseline implementations, dataset construction for ambiguous benign tasks, number of trials, or statistical significance (e.g., variance across runs). This makes it impossible to confirm the performance edge is not due to post-hoc choices or unstated assumptions.
- §3.3 (Entropy-based self-evolution) and §4.3 (Ablations): The claim that entropy-driven splitting/merging produces stable decision boundaries without per-deployment tuning or new failure modes is load-bearing, yet no ablation isolates the entropy component from static rules or enhanced adversarial generation. Experiments are confined to fixed benchmarks; no cross-model transfer, long-horizon agent trajectories, or OOD inputs are reported to validate generalization.
minor comments (2)
- Abstract: The phrase 'extensive experiments' is used without quantifying the number of models, tasks, or attack types evaluated beyond the GPT-4o peak.
- §3.2 (Hierarchical memory): The node splitting/merging criteria would be clearer with an explicit equation for the entropy threshold rather than prose description.
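One plausible shape for such a criterion, as an illustrative sketch only: the paper states the mechanism in prose, and the category distribution $p_c(v)$ and thresholds $\tau_{\mathrm{split}}$, $\tau_{\mathrm{merge}}$ below are assumed free parameters, not values from the paper.

```latex
% Illustrative form of an entropy-threshold criterion (not taken from the paper):
% p_c(v) is the fraction of rules in node v belonging to category c.
H(v) = -\sum_{c \in C(v)} p_c(v) \log p_c(v),
\qquad
\text{split } v \ \text{if}\ H(v) > \tau_{\mathrm{split}},
\qquad
\text{merge siblings } u, v \ \text{if}\ H(u \cup v) < \tau_{\mathrm{merge}}
```

Any concrete criterion in the paper would also need to specify how $C(v)$ (the category set of a node) is derived from the stored rules.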
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on experimental rigor and the need for stronger validation of the entropy-driven mechanism. We provide point-by-point responses below and will revise the manuscript accordingly to improve reproducibility and clarify limitations.
Point-by-point responses
-
Referee: §4 (Experiments) and Table 2: The reported 63.6% benign utility and >93% refusal rates are presented as SOTA, but the section provides no details on baseline implementations, dataset construction for ambiguous benign tasks, number of trials, or statistical significance (e.g., variance across runs). This makes it impossible to confirm the performance edge is not due to post-hoc choices or unstated assumptions.
Authors: We agree that the experimental section lacks sufficient implementation details. In the revised manuscript we will expand §4 and the Table 2 caption (plus a new appendix) to specify: (1) exact baseline implementations and reproduction steps, including any prompt templates or hyperparameters used; (2) the full construction process for the ambiguous benign tasks dataset, including source prompts, ambiguity criteria, and filtering; (3) the number of independent trials (we ran 3 seeds), reporting mean and standard deviation to characterize variance across runs. These additions will allow independent verification of the reported performance. revision: yes
-
Referee: §3.3 (Entropy-based self-evolution) and §4.3 (Ablations): The claim that entropy-driven splitting/merging produces stable decision boundaries without per-deployment tuning or new failure modes is load-bearing, yet no ablation isolates the entropy component from static rules or enhanced adversarial generation. Experiments are confined to fixed benchmarks; no cross-model transfer, long-horizon agent trajectories, or OOD inputs are reported to validate generalization.
Authors: We will add a targeted ablation in §4.3 that isolates the entropy component by comparing the full SafeHarbor system against a static-rules variant (enhanced adversarial generation only, no splitting/merging). This will directly quantify the contribution of the entropy-driven evolution to decision-boundary stability. Regarding generalization, our current evaluation is limited to the fixed benchmarks described in §4. We will add an explicit limitations paragraph acknowledging that cross-model transfer, long-horizon trajectories, and OOD inputs were not evaluated and that further work is required to substantiate broader claims; the hierarchical design is intended to support such adaptation but we do not present evidence for it beyond the reported settings. revision: partial
- Not addressed in this revision: full empirical validation of the generalization claims via cross-model transfer, long-horizon agent trajectories, and OOD inputs, as these require new experimental setups and benchmarks not performed in the original work.
Circularity Check
No significant circularity; derivation is self-contained and externally grounded
full rationale
The paper presents SafeHarbor as a training-free, plug-and-play framework that extracts context-aware rules via enhanced adversarial generation and optimizes a hierarchical memory via an information-entropy self-evolution mechanism (node splitting/merging). No equations, fitted parameters, or performance claims reduce the reported metrics (e.g., 63.6% benign utility, >93% refusal) to quantities defined by the method's own outputs. The central claims rest on standard external concepts (adversarial testing, entropy) and benchmark experiments rather than self-referential definitions or self-citation chains. This matches the reader's assessment that no abstract equations collapse claimed results to the method itself.
Axiom & Free-Parameter Ledger
invented entities (1)
- Hierarchical memory system with entropy-driven splitting and merging (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Jailbreaking leading safety-aligned LLMs with simple adaptive attacks
Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations, 2024. Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, J. Z., Fredrikson, M., et al. …
-
[2]
DrAttack: Prompt decomposition and reconstruction makes powerful LLMs jailbreakers
Li, X., Wang, R., Cheng, M., Zhou, T., and Hsieh, C.-J. DrAttack: Prompt decomposition and reconstruction makes powerful LLMs jailbreakers. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13891–13913, 2024.
-
[3]
Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory
Liu, L., Yang, X., Shen, Y., Hu, B., Zhang, Z., Gu, J., and Zhang, G. Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719, 2023.
-
[4]
AGrail: A lifelong agent guardrail with effective and adaptive safety detection
Luo, W., Dai, S., Liu, X., Banerjee, S., Sun, H., Chen, M., and Xiao, C. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. arXiv preprint arXiv:2502.11448, 2025.
-
[5]
ReasoningBank: Scaling agent self-evolving with reasoning memory
Ouyang, S., Yan, J., Hsu, I., Chen, Y., Jiang, K., Wang, Z., Han, R., Le, L. T., Daruki, S., Tang, X., et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140, 2025.
-
[6]
Your agent may misevolve: Emergent risks in self-evolving LLM agents
Shao, S., Ren, Q., Qian, C., Wei, B., Guo, D., JingYi, Y., Song, X., Zhang, L., Zhang, W., Liu, D., et al. Your agent may misevolve: Emergent risks in self-evolving LLM agents. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025, 2025.
-
[7]
Optimization-based prompt injection attack to LLM-as-a-judge
Shi, J., Yuan, Z., Liu, Y., Huang, Y., Zhou, P., Sun, L., and Gong, N. Z. Optimization-based prompt injection attack to LLM-as-a-judge. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pp. 660–674, 2024.
-
[8]
GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning
Xiang, Z., Zheng, L., Li, Y., Hong, J., Li, Q., Xie, H., Zhang, J., Xiong, Z., Xie, C., Yang, C., et al. GuardAgent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning. arXiv preprint arXiv:2406.09187, 2024.
-
[9]
A-MEM: Agentic memory for LLM agents
Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., and Zhang, Y. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.
-
[10]
ReAct: Synergizing reasoning and acting in language models
Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
-
[11]
AgentAlign: Navigating safety alignment in the shift from informative to agentic large language models
Zhang, J., Yin, L., Zhou, Y., and Hu, S. AgentAlign: Navigating safety alignment in the shift from informative to agentic large language models. arXiv preprint arXiv:2505.23020, 2025. Zhang, Z., Cui, S., Lu, Y., Zhou, J., Yang, J., Wang, H., and Huang, M. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470, 2024.
-
[12]
WebArena: A realistic web environment for building autonomous agents
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.