arxiv: 2607.01277 · v1 · pith:CDHHOKVQnew · submitted 2026-07-01 · 💻 cs.CR

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

Pith reviewed 2026-07-03 20:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM safety · jailbreak defense · zero-trust AI · multi-turn attacks · runtime oversight · conversation consistency · proactive safety

The pith

An independent oversight model with four gates blocks multi-turn LLM jailbreaks by escalating any danger signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Cognitive Firewall as a way to keep LLMs from producing harmful outputs even when an attack unfolds over multiple conversation turns and no single message looks dangerous. It places a separate model between the user and the main LLM to check four aspects of the interaction: what the user is really trying to do, whether claimed permissions can be trusted, whether the conversation is escalating toward harm, and whether the planned response carries risk. Decisions escalate rather than average, so one clear problem stops the output and leaves a record of why. If this works, it offers a practical way to add conversation-level safety that current single-message checks miss, while keeping refusals on safe queries low.

Core claim

The Cognitive Firewall decomposes safety assessment into four gates—an intent gate, a zero-trust context gate, a consistency gate, and an output risk gate—whose decisions are combined through escalation rather than averaging. This allows any confident danger signal to block an interaction while preserving an auditable rationale. Experiments demonstrate that the approach lowers attack success rates to 2 percent or below on three jailbreak benchmarks and 14 percent on human-crafted attacks, with an 8 percent over-refusal rate on benign queries.
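The difference between escalation and averaging can be shown with a small sketch. This is an illustration only, not the authors' implementation: the three verdict levels, the `GateDecision` record, and the averaging threshold are all assumptions introduced here, and the gate rationales are made-up placeholder strings.

```python
from dataclasses import dataclass
from enum import IntEnum


class Verdict(IntEnum):
    """Categorical gate outcome. The three levels are an assumption."""
    ALLOW = 0
    REVIEW = 1
    BLOCK = 2


@dataclass(frozen=True)
class GateDecision:
    gate: str         # "intent", "context", "consistency", or "output"
    verdict: Verdict
    rationale: str    # short explanation kept for the audit trail


def combine_by_escalation(decisions):
    """Return the most severe verdict and the rationales that produced it."""
    worst = max(d.verdict for d in decisions)
    reasons = [f"{d.gate}: {d.rationale}" for d in decisions if d.verdict == worst]
    return worst, reasons


def combine_by_averaging(decisions, threshold=1.0):
    """Contrast case: block only when the mean severity reaches the threshold."""
    mean = sum(d.verdict for d in decisions) / len(decisions)
    return Verdict.BLOCK if mean >= threshold else Verdict.ALLOW


# Placeholder decisions for one multi-turn interaction.
decisions = [
    GateDecision("intent", Verdict.ALLOW, "request reads as general chemistry"),
    GateDecision("context", Verdict.ALLOW, "no role or permission claimed"),
    GateDecision("consistency", Verdict.BLOCK, "turns assemble a harmful procedure"),
    GateDecision("output", Verdict.ALLOW, "no risky content in candidate"),
]

verdict, reasons = combine_by_escalation(decisions)
print(verdict.name, reasons)
print(combine_by_averaging(decisions).name)
```

In this toy case the single confident BLOCK from the consistency gate decides the outcome under escalation, while the averaged score is diluted by three ALLOW verdicts and lets the interaction through. The returned rationale list stands in for the auditable record the paper describes.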

What carries the argument

The four categorical gates combined through escalation in the Cognitive Firewall framework.

If this is right

  • Attack success drops to 2 percent or below on single-turn, multi-turn, and authority-based jailbreaks.
  • Human-crafted attacks are reduced to 14 percent success.
  • Over-refusal on benign queries stays at 8 percent.
  • The system provides an auditable rationale for each blocked interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gate-based oversight could apply to other generative AI systems beyond language models.
  • Placing safety logic in a separate model may allow easier updates without retraining the main LLM.
  • Zero-trust verification of user claims could extend to detecting other forms of manipulation in AI conversations.

Load-bearing premise

The separate oversight model can accurately implement the four gates without itself being tricked into allowing harm or refusing too many safe requests.

What would settle it

A test showing the oversight model fails to block a significant portion of the jailbreaks or refuses more than 15 percent of benign queries would indicate the framework does not deliver the claimed protection.
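Such a test reduces to two rates and two thresholds. The sketch below is a hedged illustration: the 15 percent over-refusal limit comes from the sentence above, while `attack_limit` is a hypothetical stand-in for "a significant portion" and the boolean outcome lists are placeholder data, not results from the paper.

```python
def rate(flags):
    """Fraction of True values in a non-empty list of booleans."""
    if not flags:
        raise ValueError("need at least one trial")
    return sum(flags) / len(flags)


def evaluate(attack_succeeded, benign_refused, attack_limit, refusal_limit=0.15):
    """Compare measured rates with the limits that would challenge the claim.

    attack_succeeded: one bool per attack attempt (True = harmful output released)
    benign_refused:   one bool per benign query (True = wrongly refused)
    attack_limit:     assumed threshold; the source does not fix a number
    """
    asr = rate(attack_succeeded)
    over_refusal = rate(benign_refused)
    return {
        "attack_success_rate": asr,
        "over_refusal_rate": over_refusal,
        "claim_challenged": asr > attack_limit or over_refusal > refusal_limit,
    }


# Placeholder outcomes: 3 of 50 attacks succeed, 10 of 50 benign queries refused.
attack_succeeded = [True] * 3 + [False] * 47
benign_refused = [True] * 10 + [False] * 40

result = evaluate(attack_succeeded, benign_refused, attack_limit=0.10)
print(result)
```

With these placeholder inputs the attack-success rate stays under the assumed limit, but the over-refusal rate exceeds 15 percent, so the check flags the claim as challenged.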

Figures

Figures reproduced from arXiv: 2607.01277 by Jiacheng Li, Michele Guida, Noorbakhsh Amiri Golilarz, Ruslan Shikhhamzayev, Shahram Rahimi, Sindhuja Penchala, Stefano Iannucci.

Figure 1: The Cognitive Firewall pipeline. A separate oversight model runs three pre-generation gates, intent G1, zero-trust context G2, and consistency G3, ...
Original abstract

Large language models (LLMs) can be induced to produce harmful content through multi-turn strategies in which no single user message appears clearly unsafe. Existing runtime safeguards commonly evaluate prompts or responses as isolated messages, which limits their ability to recover accumulated intent, verify asserted authority, or detect harmful objectives decomposed across a dialogue. This paper presents the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight model between a user and a protected target model. The framework decomposes safety assessment into four categorical gates: an intent gate that identifies the operational objective of a request, a zero-trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions are combined through escalation rather than score averaging, allowing any confident danger signal to block an interaction while preserving an auditable rationale. Experiments on four jailbreak benchmarks and a benign safety test set show that the Cognitive Firewall substantially reduces attack success across single-turn, multi-turn, authority-based, and human-crafted attacks. It lowers attack success to 2 percent or below on three attack sets and to 14 percent on the most difficult human-crafted set, while maintaining an 8 percent over-refusal rate. These results indicate that decomposed, conversation-level oversight can improve proactive containment and auditability for LLM safety.
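The control flow the abstract describes (three gates before generation, one gate on the candidate response) can be sketched as follows. This is a minimal sketch under stated assumptions: the gate signature, the keyword rules inside each gate, and the `guarded_turn` helper are hypothetical placeholders, since the abstract does not say how the oversight model implements each gate. Running all three pre-generation gates before deciding, rather than stopping at the first flag, is also an assumption made here to keep a complete audit record.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical gate signature: return None to pass, or a rationale string to flag.
Gate = Callable[[List[str], str], Optional[str]]


def intent_gate(history: List[str], message: str) -> Optional[str]:
    # Placeholder rule; a real gate would query the oversight model.
    if "disable the alarm" in message.lower():
        return "operational objective looks harmful"
    return None


def context_gate(history: List[str], message: str) -> Optional[str]:
    # Zero-trust: a claimed role is unverified evidence, never a permission.
    text = message.lower()
    if "i am the administrator" in text and "ignore your rules" in text:
        return "unverified authority claim used to request an exception"
    return None


def consistency_gate(history: List[str], message: str) -> Optional[str]:
    # Placeholder escalation check over the whole conversation, not one turn.
    hits = sum("bypass" in turn.lower() for turn in history + [message])
    return "request escalates across turns" if hits >= 2 else None


def output_gate(history: List[str], candidate: str) -> Optional[str]:
    # Inspects the candidate response before it is released.
    if "step-by-step exploit" in candidate.lower():
        return "candidate response contains risky detail"
    return None


PRE_GATES: List[Tuple[str, Gate]] = [
    ("intent", intent_gate),
    ("context", context_gate),
    ("consistency", consistency_gate),
]


def guarded_turn(history, message, target_model):
    """Return (released_text_or_None, audit) for one conversation turn."""
    audit = [(name, gate(history, message)) for name, gate in PRE_GATES]
    if any(reason for _, reason in audit):
        return None, audit  # blocked before the target model is called
    candidate = target_model(history + [message])
    reason = output_gate(history + [message], candidate)
    audit.append(("output", reason))
    return (None if reason else candidate), audit


calls = []


def echo_model(turns):
    # Stand-in for the protected target model; records each invocation.
    calls.append(len(turns))
    return f"Summary of: {turns[-1]}"


history = ["How do door sensors work?", "Which ones are easiest to bypass?"]
blocked, audit_blocked = guarded_turn(
    history, "And how would I bypass the backup too?", echo_model
)
released, audit_ok = guarded_turn([], "Explain zero-trust networking.", echo_model)
print(blocked, audit_blocked)
print(released, audit_ok)
```

In the first call no single message trips the intent or context rules, but the conversation as a whole does, so the turn is blocked and the target model is never invoked. The second, benign call passes all four gates and the response is released.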

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight LLM between user and target model. Safety assessment is decomposed into four gates (intent identification, zero-trust context verification, cross-turn consistency detection, and output risk assessment) whose decisions are combined by escalation rather than averaging. Experiments on four jailbreak benchmarks plus a benign test set are reported to reduce attack success to ≤2% on three sets and 14% on the hardest human-crafted set, with an 8% over-refusal rate on benign queries.

Significance. If the reported attack-success reductions prove reproducible and the oversight model itself remains uncompromised, the decomposed, auditable gate structure would represent a meaningful advance over single-message classifiers for multi-turn and authority-based jailbreaks. The escalation mechanism and explicit rationale generation are positive features for explainability. The work supplies no machine-checked proofs, open code, or parameter-free derivations.

major comments (2)
  1. [Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.
  2. [Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.
minor comments (2)
  1. [Abstract] Abstract contains typographical errors ("mod l", "ac-cumulated").
  2. [Experiments] The benign safety test set and over-refusal metric are mentioned but not defined or sized in the provided text.
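
Major comment 1 asks for error bars on the attack-success figures. One common way to attach an interval to a reported rate is the Wilson score interval, sketched below. The benchmark sizes are not given in the text reviewed here, so the trial counts in the example are hypothetical placeholders chosen only to show how the interval width depends on sample size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: the same 2 percent rate measured on 100 and on 1000 attacks.
small_lo, small_hi = wilson_interval(2, 100)
large_lo, large_hi = wilson_interval(20, 1000)
print(f"n=100:  [{small_lo:.3f}, {small_hi:.3f}]")
print(f"n=1000: [{large_lo:.3f}, {large_hi:.3f}]")
```

The same headline rate carries a much wider interval on the smaller hypothetical set, which is why the reported percentages are hard to interpret without the underlying counts.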

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and verifiability of the manuscript. We address each major comment below and commit to revisions that strengthen the presentation without altering the core claims.

  1. Referee: [Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.

    Authors: We agree that the current manuscript omits key implementation details, model specifications, baselines, and statistical analysis, which limits independent verification. In the revised version we will expand the Experiments section to specify the oversight model family and configuration, include comparisons against representative guardrail baselines, and report error bars together with appropriate statistical significance tests on the attack-success figures. revision: yes

  2. Referee: [Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.

    Authors: The framework treats the oversight model as an independent, trusted component whose outputs are combined via escalation; however, the manuscript does not describe hardening techniques or provide a separate robustness evaluation of that layer. We will revise the Framework and Limitations sections to state this assumption explicitly, outline possible isolation strategies (e.g., distinct model family or separate deployment), and note the absence of dedicated oversight-layer benchmarks as a limitation for future work. revision: partial

Circularity Check

0 steps flagged

No circularity; descriptive framework plus benchmark results

The paper introduces a multi-gate oversight framework for LLM safety and reports empirical attack-success rates on jailbreak benchmarks. No equations, parameter fits, derivations, or self-citations appear in the provided text. The central claims rest on observed benchmark outcomes rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. The reader's assessment of zero circularity is therefore confirmed; the work contains no load-bearing self-referential steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested premise that the four gates can be realized by current models without new vulnerabilities; no free parameters, mathematical axioms, or invented physical entities are introduced.
