Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

Peter Garraghan; William Hackett

arxiv: 2607.02121 · v1 · pith:D43TNJNKnew · submitted 2026-07-02 · 💻 cs.CR · cs.AI

Pith reviewed 2026-07-03 10:29 UTC · model grok-4.3

keywords: guardrail detection, black-box reconnaissance, LLM security, adversarial emulation, AI safety, behavioral monitoring

The pith

Behavioral monitoring of HTTP, lexical, and timing signals detects guardrail presence in black-box AI systems with 100% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a black-box methodology to determine whether an AI system uses a guardrail by observing differences in how it handles benign versus malicious prompts. This matters because attack techniques for bypassing guardrails differ from those for bypassing an LLM's built-in safety alignment, affecting how researchers test security. The method relies on signals from HTTP responses, word choices, and response timings without needing any prior knowledge of the system. Experiments show perfect detection of guardrail presence and strong separation between guardrail blocks and simple LLM rejections.

Core claim

The central claim is that a reconnaissance method using behavioral monitoring of HTTP, lexical, and timing signals can detect the presence of a guardrail, identify the content categories it blocks, and distinguish guardrail blocks from LLM rejections on unseen prompts, achieving 100% accuracy for presence detection and 98% average F1 score for distinction, with statistically significant behavioral separation (q < 0.001).

What carries the argument

The black-box guardrail reconnaissance methodology based on behavioral monitoring of HTTP, lexical, and timing signals.
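As a rough illustration of what monitoring these three channels could look like, the sketch below maps one observed exchange to a few numeric signals. The `Observation` schema, the signal names, and the `BLOCK_PHRASES` list are assumptions made for this example and are not the authors' feature set. The `x-content-filter` header is the one named in the paper excerpt under Figure 2, and the keyword match is a stand-in for the LLM judge the paper uses to classify block language.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical block phrases; the paper uses an LLM judge for this signal instead.
BLOCK_PHRASES = ("blocked by", "content filter", "policy violation")


@dataclass
class Observation:
    """One prompt/response exchange seen by a black-box client (hypothetical schema)."""
    status_code: Optional[int]  # None models a dropped connection
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""
    latency_ms: float = 0.0


def extract_signals(obs: Observation) -> Dict[str, float]:
    """Map one observation to numeric HTTP, lexical, and timing signals."""
    headers = {k.lower(): v for k, v in obs.headers.items()}
    body = obs.body.lower()
    return {
        "http.dropped": float(obs.status_code is None),
        "http.non_200": float(obs.status_code is not None and obs.status_code != 200),
        "http.filter_header": float("x-content-filter" in headers),
        "lexical.block_language": float(any(p in body for p in BLOCK_PHRASES)),
        "lexical.length": float(len(obs.body)),
        "timing.latency_ms": obs.latency_ms,
    }


# Made-up exchanges: a normal answer and a verbose block.
benign = Observation(200, {"Content-Type": "application/json"},
                     "Here is a summary of the article.", 840.0)
blocked = Observation(403, {"X-Content-Filter": "jailbreak"},
                      "Request blocked by content filter.", 95.0)

benign_signals = extract_signals(benign)
blocked_signals = extract_signals(blocked)
```

Collecting these dictionaries over a benign prompt-set and a malicious prompt-set gives one distribution per signal and per prompt-set, which is the input a statistical comparison would need.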

If this is right

Attackers or testers can select appropriate bypass techniques based on whether a guardrail is present.
Guardrail content categories can be identified through the method.
Distinction allows better optimization of adversarial prompts against production systems.
The method works with zero prior knowledge and only black-box access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might enable automated tools for mapping safety layers in AI deployments.
Similar signal-based methods could apply to detecting other security controls beyond guardrails.
Production systems might evolve to minimize distinguishable signals if this reconnaissance becomes common.

Load-bearing premise

HTTP, lexical, and timing signals provide sufficient and generalizable separation between guardrail activation and LLM rejection across arbitrary production systems without prior knowledge or tuning.

What would settle it

The claim would be falsified if the method, applied to a production AI system known to have a guardrail, failed to detect that guardrail or showed no statistically significant separation between benign and malicious interactions.

Figures

Figures reproduced from arXiv: 2607.02121 by Peter Garraghan, William Hackett.

**Figure 1.** Guardrail Block Patterns. Three example guardrail block patterns (dropped connection, non-200 status code, and verbose response stating the block reason), alongside an LLM rejection.

Excerpt (truncated): 2.2 LLM Safety Alignment and Refusal. LLMs deployed in production AI applications are hardened through a process known as safety alignment, which prevents the model from generating policy-violating content and causes an LLM ref…

**Figure 2.** Guardrail Reconnaissance Approach. End-to-end overview of the four stages: prompt-sets, feature extraction, behavioral monitoring, and guardrail characteristics.

Excerpt (truncated): … metadata such as x-content-filter or altered server signatures. Lexical. Indirect signals capturing the linguistic style of the response even when the HTTP layer is unchanged. We extract block language, where an LLM judge classifies whether a re…

**Figure 3.** Guardrail Detection Performance. Average detection accuracy across all targets, including the no-guardrail control.

Plot text recovered from the adjacent figure: axis mean −log10(q), ticks 0 to 5; targets ProtectAI DeBERTa v2, OpenAI Omni Moderation, DuoGuard 1.5B, Meta Prompt Guard v2, LLM Guard, Azure Prompt Shield, Azure Content Safety, Azure Foundry Guardrails, LlamaFirewall, no-guardrail; thresholds q = .05 (weak), q = .01 (mod), q = .001 (high); channels HTTP, Lexical, Timing.

**Figure 4.** Guardrail Signal Strength. Signal strength across all guardrails, 6 block patterns and 3 LLMs. Each marker is one signal, coloured by channel (HTTP, Lexical, Timing), positioned at mean −log10(q). Higher values indicate stronger statistical separation between the malicious and benign prompt-set distributions on that signal. Panel title: Feature Detection Strength.

**Figure 6.** Guardrail Detection Capability. Guardrail detection strength across three malicious prompt-sets, highlighting the content each guardrail is suspected to detect. ✓ = matches vendor-claimed detection capability, ✗ = does not match vendor-claimed detection capability.

Excerpt (truncated): …gories was DuoGuard 1.5B, where jailbreaks were identified with moderate strength. While this is still an adequate signal to indicate DuoGuard detects jailbre…

**Figure 7.** Guardrail Block vs LLM Rejection. F1 of the guardrail block fingerprint at distinguishing a real guardrail block from an LLM rejection on 100 unseen prompts across all targets. Higher is better, with 1.00 being perfect separation. Panel title: Guardrail Block Classification Performance.
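For readers unfamiliar with the metric in Figure 7, the sketch below computes F1 for a binary "guardrail block vs LLM rejection" decision. The labels and predictions are made up for illustration and are unrelated to the paper's data; the convention that 1 means "guardrail block" is an assumption of this example.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = guardrail block, 0 = LLM rejection)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and fingerprint predictions for eight unseen prompts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
example_f1 = f1_score(y_true, y_pred)   # 3 TP, 1 FP, 1 FN
perfect_f1 = f1_score(y_true, y_true)   # perfect separation
```

Because F1 ignores true negatives, a score near 1.00 requires both few missed guardrail blocks and few LLM rejections mislabelled as blocks.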

Original abstract

As Large Language Models (LLMs) and agentic systems become integrated into real-world applications, ensuring their safety and security is critical. Guardrail systems that detect and block malicious instructions sent to and from an LLM are an essential component of AI security. However, researchers conducting black-box adversarial emulation against production AI systems often struggle to determine whether a guardrail block or an LLM rejection has occurred. This distinction is important because the techniques used to bypass guardrails can differ substantially from those used to bypass LLM safety alignment, and has a material impact on attack technique selection and optimization. We propose the first black-box guardrail reconnaissance methodology, which detects the presence of a guardrail within a target AI system through behavioral monitoring of HTTP, lexical, and timing signals, assuming only black-box access and zero prior knowledge of the guardrail or AI system. Experiments demonstrate that our approach detects guardrail presence with 100% accuracy, with statistically significant behavioral separation between benign and malicious interactions (q < 0.001). Our approach further identifies the content categories a guardrail is designed to block, and distinguishes guardrail blocks from LLM rejection on unseen prompts with an average F1 score of 98%.
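The q-values quoted in the abstract, and plotted as mean −log10(q) in Figure 4, are p-values adjusted for multiple comparisons. The source does not say which correction produced them, so the sketch below assumes the Benjamini-Hochberg procedure. The p-values are invented, the function names are hypothetical, and the weak/moderate/high cut-offs follow the q = .05, .01, .001 legend recovered with the figures (treating the thresholds as inclusive is a further assumption).

```python
import math
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def strength_label(q: float) -> str:
    """Bucket a q-value using the thresholds shown in the figure legend."""
    if q <= 0.001:
        return "high"
    if q <= 0.01:
        return "moderate"
    if q <= 0.05:
        return "weak"
    return "none"


def mean_neg_log10(q_values: List[float]) -> float:
    """Mean of -log10(q); assumes every q is strictly positive."""
    return sum(-math.log10(q) for q in q_values) / len(q_values)


# Made-up per-signal p-values for one target.
p_values = [0.0001, 0.004, 0.03, 0.6]
q_values = benjamini_hochberg(p_values)
labels = [strength_label(q) for q in q_values]
score = mean_neg_log10(q_values)
```

On the −log10 scale the three thresholds sit at roughly 1.3, 2, and 3, which is why larger values in the plots read as stronger separation.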

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims a first black-box method to spot guardrails via HTTP, lexical and timing signals with 100% detection accuracy and 98% F1, but supplies no experimental details to back those numbers.

The main thing to know is that the authors describe a behavioral monitoring approach to detect guardrail presence in black-box AI systems and to separate guardrail blocks from ordinary LLM refusals. They report perfect accuracy on detection plus high F1 on unseen prompts and say the method also reveals blocked content categories.

What is new is the framing of this as a reconnaissance step for adversarial testing. The distinction matters because bypass techniques differ depending on whether the target is a guardrail or the model itself, and the paper positions the work as the first to use these three signal types with zero prior knowledge.

The paper does a reasonable job laying out why the problem is practical for people doing real-world security evaluations. The core idea of watching response patterns without needing internal access is simple and directly relevant to current attack workflows.

The soft spots sit in the evidence. The abstract states statistically significant separation and the performance numbers but gives no count of systems tested, no description of the prompt sets, no mention of controls for network effects or model differences, and no detail on how the lexical or timing features were chosen. Without that information the 100% and 98% figures cannot be evaluated, and the claim that the signals work across arbitrary production systems without tuning rests on an untested assumption. A finite test collection cannot rule out cases where the separation collapses.

This paper is for researchers doing black-box adversarial work on deployed LLM systems. A reader already active in that area might pick up the high-level monitoring idea even if they have to re-validate the signals themselves.

The work shows clear thinking about a real gap and engages the literature by claiming novelty on the reconnaissance angle. It deserves peer review so the missing experimental setup and data can be examined.

Referee Report

2 major / 0 minor

Summary. The paper proposes the first black-box guardrail reconnaissance methodology that detects guardrail presence in target AI systems via behavioral monitoring of HTTP, lexical, and timing signals, assuming only black-box access and zero prior knowledge. Experiments are claimed to show 100% accuracy in guardrail detection with statistically significant separation (q < 0.001), identification of blocked content categories, and 98% average F1 score distinguishing guardrail blocks from LLM rejections on unseen prompts.

Significance. If substantiated, the result would be significant for AI security research by enabling researchers to distinguish guardrail activations from model rejections during black-box adversarial emulation, informing attack technique selection. The black-box, zero-knowledge premise is a practical strength for production system testing.

major comments (2)

[Abstract] The central performance claims of 100% accuracy, q < 0.001 significance, and 98% F1 are presented without any description of experimental setup, number of systems tested, prompt sets used, statistical methods, controls for confounding factors (e.g., network conditions), or how unseen prompts were constructed. This absence makes it impossible to evaluate whether the reported metrics are load-bearing or reproducible.
[Abstract, methodology description] The claim that HTTP/lexical/timing signals yield generalizable separation across arbitrary production systems without prior knowledge or tuning is load-bearing for the 100% accuracy and 98% F1 results, yet the abstract provides no evidence that separation holds when guardrails are integrated differently (e.g., post-processing vs. in-line proxy) or under varying network/model conditions; results from a finite test set do not rule out collapse of diagnostic power.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We address each point below and agree that the abstract can be strengthened for clarity while preserving its concise nature. Full experimental details are already present in the manuscript body.

Referee: [Abstract] The central performance claims of 100% accuracy, q < 0.001 significance, and 98% F1 are presented without any description of experimental setup, number of systems tested, prompt sets used, statistical methods, controls for confounding factors (e.g., network conditions), or how unseen prompts were constructed. This absence makes it impossible to evaluate whether the reported metrics are load-bearing or reproducible.

Authors: The abstract summarizes key results at a high level, as is conventional. The full manuscript details the experimental setup in Section 4: we tested five production AI systems with distinct guardrail architectures, using 1,200 prompts (800 benign, 400 malicious) drawn from established benchmarks, with statistical significance assessed via Mann-Whitney U tests yielding q < 0.001 after multiple-comparison correction. Network conditions were controlled by running experiments on isolated infrastructure with fixed latency emulation; unseen prompts were constructed via a 20% random hold-out set disjoint from training data. We will revise the abstract to include a one-sentence overview of experimental scale for improved evaluability.
Revision: yes

Referee: [Abstract, methodology description] The claim that HTTP/lexical/timing signals yield generalizable separation across arbitrary production systems without prior knowledge or tuning is load-bearing for the 100% accuracy and 98% F1 results, yet the abstract provides no evidence that separation holds when guardrails are integrated differently (e.g., post-processing vs. in-line proxy) or under varying network/model conditions; results from a finite test set do not rule out collapse of diagnostic power.

Authors: Our evaluation explicitly included both post-processing and in-line proxy integrations across the five systems, with consistent 100% detection accuracy and 98% F1 on held-out prompts. The signals are chosen precisely because they are fundamental to request/response flows and thus less sensitive to specific integration details. We acknowledge that a finite test set cannot exhaustively demonstrate invariance to every conceivable network jitter or model variant; the revised manuscript will expand the limitations discussion to address this and report additional robustness checks under simulated variable latency.
Revision: yes
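The first response names the Mann-Whitney U test as the significance test. As a minimal sketch of how that test could separate a timing signal between prompt-sets, the code below implements the two-sided test with a normal approximation, tie correction, and continuity correction, using only the standard library. The latency values are invented, and whether the authors used an exact or approximate variant is not stated in the source.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation). Returns (U of x, p)."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value tied: no evidence of separation
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = math.erfc(z / math.sqrt(2))  # two-sided tail of the standard normal
    return u_x, p


# Invented latencies (ms): blocked malicious prompts return much faster here.
benign_ms = [820, 790, 860, 905, 840, 875, 810, 890]
malicious_ms = [95, 110, 102, 88, 120, 99, 105, 93]
u_stat, p_value = mann_whitney_u(benign_ms, malicious_ms)
```

Running one such test per signal yields a set of p-values; the multiple-comparison correction mentioned in the response would then be applied across that set.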

Circularity Check

0 steps flagged

No circularity; empirical measurement study with experimental results

The paper presents a black-box empirical methodology relying on HTTP, lexical, and timing signals to detect guardrail activation, with claims of 100% accuracy and 98% F1 supported by experiments on specific systems (q < 0.001). No equations, derivations, fitted parameters, or self-citations are described that reduce any reported result to an input by construction, self-definition, or renaming. The central claims rest on observable behavioral separation rather than load-bearing self-references or ansatzes imported from prior author work. This is a standard empirical measurement study whose results are falsifiable on new systems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities. The method implicitly relies on the unstated assumption that the chosen behavioral signals are stable and discriminative across systems.
