Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

Jiacheng Li; Michele Guida; Noorbakhsh Amiri Golilarz; Ruslan Shikhhamzayev; Shahram Rahimi; Sindhuja Penchala; Stefano Iannucci

arXiv: 2607.01277 · v1 · pith:CDHHOKVQ · submitted 2026-07-01 · 💻 cs.CR

Pith reviewed 2026-07-03 20:36 UTC · model grok-4.3

classification 💻 cs.CR

keywords: LLM safety · jailbreak defense · zero-trust AI · multi-turn attacks · runtime oversight · conversation consistency · proactive safety

The pith

An independent oversight model with four gates blocks multi-turn LLM jailbreaks by escalating any danger signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Cognitive Firewall as a way to protect LLMs from harmful outputs even when attacks unfold over multiple conversation turns without any single message looking dangerous. It places a separate model in between the user and the main LLM to check four aspects of the interaction: what the user is really trying to do, whether claimed permissions can be trusted, whether the conversation is escalating toward harm, and whether the planned response carries risk. Decisions escalate rather than average, so one clear problem stops the output and leaves a record of why. If this works, it offers a practical way to add conversation-level safety that current single-message checks miss, while keeping refusals on safe queries low.

Core claim

The Cognitive Firewall decomposes safety assessment into four gates—an intent gate, a zero-trust context gate, a consistency gate, and an output risk gate—whose decisions are combined through escalation rather than averaging. This allows any confident danger signal to block an interaction while preserving an auditable rationale. Experiments demonstrate that the approach lowers attack success rates to 2 percent or below on three jailbreak benchmarks and 14 percent on human-crafted attacks, with an 8 percent over-refusal rate on benign queries.
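The difference between escalation and averaging can be illustrated with a minimal sketch. The `Verdict` levels, the gate names used as dictionary keys, and the `escalate` and `average` helpers are hypothetical names chosen for illustration; the paper's actual categories and combination rule may differ.

```python
from enum import IntEnum
from statistics import mean


class Verdict(IntEnum):
    """Hypothetical categorical gate outcomes, ordered by severity."""
    ALLOW = 0
    REVIEW = 1
    BLOCK = 2


def escalate(verdicts: dict[str, Verdict]) -> tuple[Verdict, list[str]]:
    """Return the most severe verdict and the gates that produced it."""
    worst = max(verdicts.values())
    rationale = [name for name, v in verdicts.items() if v == worst]
    return worst, rationale


def average(verdicts: dict[str, Verdict]) -> float:
    """Score averaging, shown only for contrast with escalation."""
    return mean(int(v) for v in verdicts.values())


# Assumed example: three gates see nothing wrong, one is confident of danger.
gates = {
    "intent": Verdict.ALLOW,
    "context": Verdict.ALLOW,
    "consistency": Verdict.BLOCK,
    "output_risk": Verdict.ALLOW,
}
decision, why = escalate(gates)
print(decision.name, why)  # BLOCK ['consistency']
print(average(gates))      # 0.5
```

Under averaging, the single danger signal is diluted to a score of 0.5 that a threshold could let through. Under escalation it decides the outcome, and the list of responsible gates is one simple form an auditable rationale could take.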

What carries the argument

The four categorical gates combined through escalation in the Cognitive Firewall framework.

If this is right

Attack success drops to 2 percent or below on single-turn, multi-turn, and authority-based jailbreaks.
Human-crafted attacks are reduced to 14 percent success.
Over-refusal on benign queries stays at 8 percent.
The system provides an auditable rationale for each blocked interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gate-based oversight could apply to other generative AI systems beyond language models.
Placing safety logic in a separate model may allow easier updates without retraining the main LLM.
Zero-trust verification of user claims could extend to detecting other forms of manipulation in AI conversations.

Load-bearing premise

The separate oversight model can accurately implement the four gates without itself being tricked into allowing harm or refusing too many safe requests.

What would settle it

A test showing the oversight model fails to block a significant portion of the jailbreaks or refuses more than 15 percent of benign queries would indicate the framework does not deliver the claimed protection.
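As a rough sketch, the settling test above can be phrased as two rate checks. The function names and the toy outcome lists are assumptions for illustration. The 15 percent over-refusal limit comes from the text, while the attack-success limit is left as a required parameter because the text does not say what counts as a significant portion.

```python
def rate(flags: list[bool]) -> float:
    """Fraction of True entries; raises on an empty list."""
    if not flags:
        raise ValueError("need at least one trial")
    return sum(flags) / len(flags)


def claim_survives(attack_succeeded: list[bool],
                   benign_refused: list[bool],
                   max_attack_success: float,
                   max_over_refusal: float = 0.15) -> bool:
    """True if both observed rates stay within the stated limits.

    max_attack_success has no default: the source gives no number for it.
    """
    return (rate(attack_succeeded) <= max_attack_success
            and rate(benign_refused) <= max_over_refusal)


# Toy data only: the sample sizes are invented, not taken from the paper.
attacks = [True] * 2 + [False] * 98   # 2 of 100 attacks succeed
benign = [True] * 8 + [False] * 92    # 8 of 100 benign queries refused
print(claim_survives(attacks, benign, max_attack_success=0.05))  # True

too_strict = [True] * 20 + [False] * 80  # 20 of 100 benign queries refused
print(claim_survives(attacks, too_strict, max_attack_success=0.05))  # False
```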

Figures

Figures reproduced from arXiv: 2607.01277 by Jiacheng Li, Michele Guida, Noorbakhsh Amiri Golilarz, Ruslan Shikhhamzayev, Shahram Rahimi, Sindhuja Penchala, Stefano Iannucci.

**Figure 1.** The Cognitive Firewall pipeline. A separate oversight model runs three pre-generation gates, intent G1, zero-trust context G2, and consistency G3, ...

Original abstract

Large language models (LLMs) can be induced to produce harmful content through multi-turn strategies in which no single user message appears clearly unsafe. Existing runtime safeguards commonly evaluate prompts or responses as isolated messages, which limits their ability to recover accumulated intent, verify asserted authority, or detect harmful objectives decomposed across a dialogue. This paper presents the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight model between a user and a protected target model. The framework decomposes safety assessment into four categorical gates: an intent gate that identifies the operational objective of a request, a zero-trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions are combined through escalation rather than score averaging, allowing any confident danger signal to block an interaction while preserving an auditable rationale. Experiments on four jailbreak benchmarks and a benign safety test set show that the Cognitive Firewall substantially reduces attack success across single-turn, multi-turn, authority-based, and human-crafted attacks. It lowers attack success to 2 percent or below on three attack sets and to 14 percent on the most difficult human-crafted set, while maintaining an 8 percent over-refusal rate. These results indicate that decomposed, conversation-level oversight can improve proactive containment and auditability for LLM safety.
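The placement of the gates around the target model can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' implementation: the `Firewall` class, the gate signatures, the early exit on the first blocking gate, the `"[blocked]"` reply, and the keyword rule in `toy_consistency_gate` are all hypothetical. In the paper the gates are carried out by an oversight model, not by string matching.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A gate returns a reason to block, or None to let the interaction pass.
PreGate = Callable[[list[str]], Optional[str]]
PostGate = Callable[[list[str], str], Optional[str]]


@dataclass
class Firewall:
    target: Callable[[list[str]], str]   # the protected model (stubbed here)
    pre_gates: dict[str, PreGate]        # intent, context, consistency
    output_gate: PostGate                # inspects the candidate response
    history: list[str] = field(default_factory=list)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def handle(self, user_message: str) -> str:
        # Gates see the whole dialogue, not just the latest message.
        self.history.append(user_message)
        for name, gate in self.pre_gates.items():
            reason = gate(self.history)
            if reason is not None:
                self.audit_log.append((name, reason))
                return "[blocked]"
        candidate = self.target(self.history)
        reason = self.output_gate(self.history, candidate)
        if reason is not None:
            self.audit_log.append(("output_risk", reason))
            return "[blocked]"
        return candidate


def toy_consistency_gate(history: list[str]) -> Optional[str]:
    # Placeholder rule standing in for a model-based check across turns.
    joined = " ".join(history).lower()
    if "step one" in joined and "step two" in joined:
        return "request decomposed across turns"
    return None


fw = Firewall(
    target=lambda history: f"echo: {history[-1]}",
    pre_gates={
        "intent": lambda history: None,
        "context": lambda history: None,
        "consistency": toy_consistency_gate,
    },
    output_gate=lambda history, candidate: None,
)
print(fw.handle("Tell me step one"))  # echo: Tell me step one
print(fw.handle("Now step two"))      # [blocked]
print(fw.audit_log)  # [('consistency', 'request decomposed across turns')]
```

Neither message is flagged on its own. The block comes from the accumulated history, which is the conversation-level behavior the abstract describes.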

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Cognitive Firewall describes a four-gate oversight framework that targets multi-turn jailbreaks but supplies no evidence the oversight model itself resists the same attacks.

The main takeaway is that this paper outlines a runtime framework using an independent oversight LLM with four escalating gates—intent identification, zero-trust context checks, cross-turn consistency, and output risk assessment—to handle decomposed attacks that single-message filters miss. The reported results show attack success falling to 2% or below on three benchmarks and 14% on human-crafted ones, with 8% over-refusal on benign tests.

What the work does reasonably is spell out a concrete decomposition of safety checks and an escalation rule that avoids simple averaging. This matches known weaknesses in current safeguards and gives a clear structure for auditability.

The soft spot is the complete absence of any test or hardening for the oversight model. Because it is itself an LLM, the same multi-turn or authority-based attacks could target the gates, yet the description offers no isolation, different model family, or separate evaluation. The abstract states performance numbers without implementation details, baselines, statistical tests, or error analysis, so the gains cannot be verified from the given text.

This is for practitioners building runtime defenses in deployed LLM systems. A reader looking for ideas on conversation-level oversight will find the gate breakdown useful to discuss, but the work does not yet provide enough evidence for direct adoption or citation.

It deserves peer review so referees can ask for the missing methods, oversight tests, and comparisons.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight LLM between user and target model. Safety assessment is decomposed into four gates (intent identification, zero-trust context verification, cross-turn consistency detection, and output risk assessment) whose decisions are combined by escalation rather than averaging. Experiments on four jailbreak benchmarks plus a benign test set are reported to reduce attack success to ≤2% on three sets and 14% on the hardest human-crafted set, with an 8% over-refusal rate on benign queries.

Significance. If the reported attack-success reductions prove reproducible and the oversight model itself remains uncompromised, the decomposed, auditable gate structure would represent a meaningful advance over single-message classifiers for multi-turn and authority-based jailbreaks. The escalation mechanism and explicit rationale generation are positive features for explainability. The work supplies no machine-checked proofs, open code, or parameter-free derivations.

major comments (2)

[Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.
[Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.
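To make the request for error bars concrete, one common choice for a proportion such as attack success rate is the Wilson score interval. The sketch below is a generic illustration, not something the paper reports. The function name and the sample size of 100 prompts are assumptions, since the text does not give benchmark sizes.

```python
import math


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical sample: 2 successful attacks out of 100 prompts.
low, high = wilson_interval(2, 100)
print(f"ASR 2% with 95% interval [{low:.3f}, {high:.3f}]")
```

With a sample this small the upper bound is several times the point estimate, which is why a headline rate without its sample size is hard to assess.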

minor comments (2)

[Abstract] Abstract contains typographical errors ("mod l", "ac-cumulated").
[Experiments] The benign safety test set and over-refusal metric are mentioned but not defined or sized in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and verifiability of the manuscript. We address each major comment below and commit to revisions that strengthen the presentation without altering the core claims.

Referee: [Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.

Authors: We agree that the current manuscript omits key implementation details, model specifications, baselines, and statistical analysis, which limits independent verification. In the revised version we will expand the Experiments section to specify the oversight model family and configuration, include comparisons against representative guardrail baselines, and report error bars together with appropriate statistical significance tests on the attack-success figures.
Revision: yes

Referee: [Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.

Authors: The framework treats the oversight model as an independent, trusted component whose outputs are combined via escalation; however, the manuscript does not describe hardening techniques or provide a separate robustness evaluation of that layer. We will revise the Framework and Limitations sections to state this assumption explicitly, outline possible isolation strategies (e.g., distinct model family or separate deployment), and note the absence of dedicated oversight-layer benchmarks as a limitation for future work.
Revision: partial

Circularity Check

0 steps flagged

No circularity; descriptive framework plus benchmark results

The paper introduces a multi-gate oversight framework for LLM safety and reports empirical attack-success rates on jailbreak benchmarks. No equations, parameter fits, derivations, or self-citations appear in the provided text. The central claims rest on observed benchmark outcomes rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. The reader's assessment of zero circularity is therefore confirmed; the work contains no load-bearing self-referential steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested premise that the four gates can be realized by current models without new vulnerabilities; no free parameters, mathematical axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5818 in / 1122 out tokens · 27158 ms · 2026-07-03T20:36:25.113017+00:00 · methodology

[1] [1]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inanet al., “Llama guard: LLM-based input-output safeguard for human-AI conversations,” arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

ShieldGemma: Generative AI Content Moderation Based on Gemma

W. Zenget al., “ShieldGemma: Generative AI content moderation based on Gemma,” arXiv preprint arXiv:2407.21772, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs,

S. Hanet al., “WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs,” inAdv. Neural Inf. Process. Syst., 2024

2024

[4] [4]

Granite guardian,

I. Padhiet al., “Granite guardian,” arXiv preprint arXiv:2412.07724, 2024

work page arXiv 2024

[5] [5]

Training language models to follow instructions with human feedback,

L. Ouyanget al., “Training language models to follow instructions with human feedback,” inAdv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 27 730–27 744

2022

[6] [6]

Constitutional AI: Harmlessness from AI Feedback

Y . Baiet al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

M. Russinovich, A. Salem, and R. Eldan, “Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack,” inProc. 34th USENIX Security Symp., 2025, arXiv:2404.01833

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

LLMs know their vulnerabilities: Uncover safety gaps through natural distribution shifts,

Q. Renet al., “LLMs know their vulnerabilities: Uncover safety gaps through natural distribution shifts,” inProc. Annu. Meeting Assoc. Comput. Linguist. (ACL), 2025, pp. 24 763–24 785

2025

[10] [10]

LLM defenses are not robust to multi-turn human jailbreaks yet,

N. Liet al., “LLM defenses are not robust to multi-turn human jailbreaks yet,” inNeurIPS Workshop on Red Teaming GenAI, 2024, arXiv:2408.15221

work page arXiv 2024

[11] [11]

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Z. Ma, Z. Xu, D. Yu, C. Kang, C. Li, and P. Liu, “THRD: A training- free multi-turn defense framework for jailbreak attacks on large language models,” arXiv preprint arXiv:2606.01738, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Zero trust archi- tecture,

S. Rose, O. Borchert, S. Mitchell, and S. Connelly, “Zero trust archi- tecture,” National Institute of Standards and Technology, Gaithersburg, MD, Tech. Rep. NIST Special Publication 800-207, Aug. 2020

2020

[13] [13]

Reforming artificial intelligence: A call for cognitive containment,

N. Amiri Golilarz, H. S. Al Khatib, and S. Rahimi, “Reforming artificial intelligence: A call for cognitive containment,” Preprints.org preprint 2025110867, 2025

2025

[14] [14]

Bridging the gap: Toward cognitive autonomy in artificial intelligence,

N. Amiri Golilarz, S. Penchala, and S. Rahimi, “Bridging the gap: Toward cognitive autonomy in artificial intelligence,” arXiv preprint arXiv:2512.02280, 2025

work page arXiv 2025

[15] [15]

Deliberative alignment: Reasoning enables safer language models,

M. Y . Guanet al., “Deliberative alignment: Reasoning enables safer language models,” arXiv preprint arXiv:2412.16339, 2024

work page arXiv 2024

[16] [16]

Jailbroken: How does LLM safety training fail?

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does LLM safety training fail?” inAdv. Neural Inf. Process. Syst., 2023

2023

[17] [17]

A holistic approach to undesired content detection in the real world,

T. Markovet al., “A holistic approach to undesired content detection in the real world,” inProc. AAAI Conf. Artif. Intell., vol. 37, no. 12, 2023, pp. 15 009–15 018

2023

[18] [18]

The Llama 3 Herd of Models

A. Grattafioriet al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails,

T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen, “NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails,” inProc. EMNLP (System Demonstrations), 2023, pp. 431–445

2023

[20] [20]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

M. Sharmaet al., “Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,” arXiv preprint arXiv:2501.18837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Attacks, defenses and evaluations for LLM conversation safety: A survey,

Z. Dong, Z. Zhou, C. Yang, J. Shao, and Y . Qiao, “Attacks, defenses and evaluations for LLM conversation safety: A survey,” inProc. NAACL- HLT, 2024, pp. 6734–6747

2024

[22] [22]

Intention analysis makes LLMs a good jailbreak defender,

Y . Zhang, L. Ding, L. Zhang, and D. Tao, “Intention analysis makes LLMs a good jailbreak defender,” inProc. COLING, 2025, pp. 2947– 2968

2025

[23] [23]

SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner,

X. Wanget al., “SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner,” inProc. 34th USENIX Security Symp., 2025, arXiv:2406.05498

work page arXiv 2025

[24] [24]

Bergeron: Combating adversarial attacks through a conscience-based alignment framework,

M. T. Pisanoet al., “Bergeron: Combating adversarial attacks through a conscience-based alignment framework,” arXiv preprint arXiv:2312.00029, 2023

work page arXiv 2023

[25] Y. Zeng, Y. Wu, X. Zhang, H. Wang, and Q. Wu, “AutoDefense: Multi-agent LLM defense against jailbreak attacks,” arXiv preprint arXiv:2403.04783, 2024

[26] S. Chennabasappa et al., “LlamaFirewall: An open source guardrail system for building secure AI agents,” arXiv preprint arXiv:2505.03574, 2025

[27] X. Shen et al., “One turn too late: Response-aware defense against hidden malicious intent in multi-turn dialogue,” arXiv preprint arXiv:2605.05630, 2026

[28] K. Patil, “CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots against multi-turn adversarial attacks,” arXiv preprint arXiv:2603.29062, 2026

[29] P. Kulkarni and A. Namer, “Temporal context awareness: A defense framework against multi-turn manipulation attacks on large language models,” in Proc. IEEE Conf. Artif. Intell. (CAI), 2025, pp. 930–935

[30] C. Anil et al., “Many-shot jailbreaking,” in Adv. Neural Inf. Process. Syst., 2024

[31] M. Mazeika et al., “HarmBench: A standardized evaluation framework for automated red teaming and robust refusal,” in Proc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 35 181–35 224

[32] J. A. Corll, “Peak + accumulation: A proxy-level scoring formula for multi-turn LLM attack detection,” arXiv preprint arXiv:2602.11247, 2026

[33] Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioritization,” in Proc. Annu. Meeting Assoc. Comput. Linguist. (ACL), 2024, pp. 8865–8887

[34] M. Phute et al., “LLM self defense: By self examination, LLMs know they are being tricked,” in Proc. ICLR Tiny Papers Track, 2024

[35] P. Verga et al., “Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,” arXiv preprint arXiv:2404.18796, 2024

[36] S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. Weston, and X. Li, “Branch-solve-merge improves large language model evaluation and generation,” in Proc. NAACL-HLT, 2024, pp. 8352–8370

[37] Z. Xiang et al., “GuardAgent: Safeguard LLM agents via knowledge-enabled reasoning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 68 316–68 342

[38] R. Shah, Q. Feuillade-Montixi, S. Pour, A. Tagade, S. Casper, and J. Rando, “Scalable and transferable black-box jailbreaks for language models via persona modulation,” arXiv preprint arXiv:2311.03348, 2023

[39] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “‘Do anything now’: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2024, pp. 1671–1685

[40] Q. Lan and A. Kaul, “The cognitive firewall: Securing browser-based AI agents against indirect prompt injection via hybrid edge-cloud defense,” arXiv preprint arXiv:2603.23791, 2026

[41] E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged instructions,” arXiv preprint arXiv:2404.13208, 2024

[42] Y. Geng et al., “Control illusion: The failure of instruction hierarchies in large language models,” in Proc. AAAI Conf. Artif. Intell., vol. 40, no. 36, 2026, pp. 30 816–30 824

[43] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016

[44] G. Irving, P. Christiano, and D. Amodei, “AI safety via debate,” arXiv preprint arXiv:1805.00899, 2018

[45] C. Burns et al., “Weak-to-strong generalization: Eliciting strong capabilities with weak supervision,” in Proc. Int. Conf. Mach. Learn. (ICML), 2024

[46] P. Chao et al., “JailbreakBench: An open robustness benchmark for jailbreaking large language models,” in Adv. Neural Inf. Process. Syst., 2024

[47] P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “XSTest: A test suite for identifying exaggerated safety behaviours in large language models,” in Proc. NAACL-HLT, 2024, pp. 5377–5400