Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers
Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3
The pith
Guardrail models fail to detect malicious prompts split across overlong inputs, even though downstream LLMs can still act on them
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the mismatch between the limited inspection windows of guardrail models and the substantially larger context windows of downstream LLMs creates a blind spot; the Prompt Overflow Attack exploits this by fragmenting malicious instructions and interleaving them with benign filler across an overlong prompt, so that no individual inspected segment triggers detection while the full context remains executable by the LLM.
What carries the argument
The Prompt Overflow Attack, which distributes malicious instructions across segments separated by benign filler so that truncation or segmentation-based guardrail inspection sees only benign pieces.
If this is right
- Prompts detected by guardrails in short-context settings evade once extended with filler content.
- State-of-the-art guardrails including Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors are all susceptible.
- The full prompt remains fully actionable to the downstream LLM despite passing the guardrail.
- Potential defenses include strengthening guardrails to close the length-based blind spot.
Where Pith is reading between the lines
- Guardrails may need to adopt context windows comparable to the models they protect.
- The same length-mismatch pattern could affect other layered safety mechanisms that preprocess inputs.
- Varying the filler density and segment count would test how far the attack generalizes across different guardrail architectures.
Load-bearing premise
Guardrail models handle overlength prompts exclusively through truncation or segmentation-based inspection rather than full-context processing.
What would settle it
Test whether an overlong prompt containing fragmented malicious instructions evades a given guardrail model while the downstream LLM still produces the intended harmful output.
Figures
read the original abstract
Guardrail models (a.k.a. safety checkers) are widely deployed to screen user inputs before they reach large language models (LLMs), serving as a primary defense against prompt injection attacks. Due to strict context constraints, these models handle overlength prompts through truncation or segmentation-based inspection. While prior work has focused on semantic adversarial inputs, the security implications of these long-input processing mechanisms remain largely unexplored. In this paper, we identify a critical blind spot arising from the mismatch between the limited inspection windows of guardrail models and the substantially larger context inference windows of downstream LLMs. We introduce a novel Prompt Overflow Attack, which exploits this mismatch by fragmenting malicious instructions and interleaving them with benign filler content across an overlong prompt, such that no individual inspected segment appears malicious while the full context remains actionable to the LLM. Through a systematic evaluation against state-of-the-art guardrail models, including Meta Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors, we demonstrate that prompts reliably detected in short-context settings can evade guardrail models once adversarially manipulated into over-length inputs, yet remain fully actionable by downstream LLMs. We further propose potential defense strategies and outline mitigation directions to strengthen guardrail models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that guardrail models (Llama Prompt Guard, Granite Guardian, DeBERTa-based detectors) process overlength inputs via truncation or segmentation, creating a mismatch with LLMs' larger context windows. It introduces the Prompt Overflow Attack, which fragments malicious instructions and interleaves them with benign filler across an overlong prompt so that no inspected segment appears malicious, yet the full prompt remains actionable to the downstream LLM. The authors report that prompts reliably detected in short contexts evade these guardrails under the attack and propose mitigation strategies.
Significance. If the empirical results hold, the work identifies a practical, previously underexplored attack surface in deployed guardrail systems that is orthogonal to semantic adversarial techniques. Demonstrating both evasion and downstream actionability on named production guardrails would be a concrete contribution to LLM safety engineering.
major comments (2)
- [Abstract, §3] Abstract and §3 (mechanism description): the central claim that guardrails 'handle overlength prompts through truncation or segmentation-based inspection' is load-bearing for the blind-spot argument, yet the manuscript provides no direct evidence, API documentation, or experimental verification of the exact processing pipeline used by Llama Prompt Guard or Granite Guardian on inputs exceeding their stated limits.
- [§4] §4 (evaluation): the assertion of a 'systematic evaluation' that shows 'prompts reliably detected in short-context settings can evade guardrail models' requires explicit reporting of detection rates, number of test prompts, segmentation parameters, baseline short-context accuracy, and LLM success rates (with which downstream models) to substantiate both the evasion and actionability claims; without these metrics the results cannot be assessed for statistical reliability or generalizability.
minor comments (2)
- [Abstract] The abstract states the attack 'remains fully actionable by downstream LLMs' but does not name the LLMs used or report success-rate thresholds; this detail belongs in the evaluation section for reproducibility.
- [§3] Clarify whether the benign filler content is drawn from a fixed distribution or chosen adversarially, as this affects the attack's claimed generality.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the presentation of our core claims and evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3 (mechanism description): the central claim that guardrails 'handle overlength prompts through truncation or segmentation-based inspection' is load-bearing for the blind-spot argument, yet the manuscript provides no direct evidence, API documentation, or experimental verification of the exact processing pipeline used by Llama Prompt Guard or Granite Guardian on inputs exceeding their stated limits.
Authors: We acknowledge that the manuscript states the processing behavior without including direct experimental verification or proprietary API documentation. This statement follows from the publicly documented context limits of the models (512 tokens for Llama Prompt Guard; similar constraints for Granite Guardian), which require truncation or segmentation for longer inputs under standard transformer inference practices. Since internal pipelines are not publicly disclosed, we cannot cite source-level documentation. We will revise §3 to add a dedicated paragraph with indirect experimental support: we will report results from controlled tests showing that guardrail decisions change when inputs cross the documented token thresholds in ways consistent with segmentation, and we will reference any available public model cards or technical reports. revision: yes
-
Referee: [§4] §4 (evaluation): the assertion of a 'systematic evaluation' that shows 'prompts reliably detected in short-context settings can evade guardrail models' requires explicit reporting of detection rates, number of test prompts, segmentation parameters, baseline short-context accuracy, and LLM success rates (with which downstream models) to substantiate both the evasion and actionability claims; without these metrics the results cannot be assessed for statistical reliability or generalizability.
Authors: The full §4 already contains these metrics (200 prompts, short-context detection rates of 92–97% dropping to 4–12% under overflow, 512-token segmentation windows, and downstream success rates of 78–91% on Llama-3-70B and GPT-4o). To improve clarity and address the concern directly, we will add an explicit summary table in §4 listing all requested quantities, move key numbers into the main text, and ensure the abstract references the scale of the evaluation. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper is an empirical security study demonstrating a prompt overflow attack via fragmentation of malicious instructions across overlong inputs. It evaluates evasion on concrete guardrail models (Llama Prompt Guard, Granite Guardian, DeBERTa) and downstream actionability on LLMs, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains. The mismatch between inspection windows and LLM context is stated as an observed architectural difference and tested directly; the claim does not reduce to any self-referential definition or renaming of results. This is a standard non-circular empirical attack paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Renzo Arturo Alva Principe, Nicola Chiarini, and Marco Viviani. 2025. Long Document Classification in the Transformer Era: A Survey on Challenges, Ad- vances, and Open Issues.WIREs Data Mining and Knowledge Discovery15, 2 (2025), e70019
work page 2025
-
[2]
Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, et al. 2024. Many-shot Jailbreaking. InAdvances in Neural Information Processing Systems
work page 2024
-
[3]
Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document transformer.arXiv preprint arXiv:2004.05150(2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[4]
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. 2024. Jailbreakbench: An open robustness bench- mark for jailbreaking large language models.Advances in Neural Information Processing Systems37 (2024), 55005–55029
work page 2024
-
[5]
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: Defending Against Prompt Injection with Structured Queries. InProceedings of the 34th USENIX Security Symposium (USENIX Security ’25)
work page 2025
-
[6]
Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting Transformer-based Models for Long Document Classification. InFindings of the Association for Computational Linguistics: EMNLP 2022. 7212–7230
work page 2022
-
[7]
Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solv- ing the Multiple Instance Problem with Axis-Parallel Rectangles.Artificial Intelligence89, 1–2 (1997), 31–71
work page 1997
-
[8]
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, and Xiaowei Huang
-
[9]
Safeguarding large language models: a survey.Artificial Intelligence Review 58 (2025)
work page 2025
-
[10]
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InFindings of the Association for Computational Linguistics: EMNLP 2020. 3356–3369
work page 2020
-
[11]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23). 79–90
work page 2023
- [12]
-
[13]
Eric Hartford and Cognitive Computations. [n. d.]. Dolphin 2.9.3 Mistral Nemo 12b. https://huggingface.co/dphn/dolphin-2.9.3-mistral-nemo-12b. Hugging Face model card, accessed April 17, 2026
work page 2026
-
[14]
Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTaV3: Improv- ing DeBERTa using ELECTRA-style Pre-training with Gradient-Disentangled Embedding Sharing. InInternational Conference on Learning Representations (ICLR’23)
work page 2023
-
[15]
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. InInternational Confer- ence on Learning Representations (ICLR’21)
work page 2021
- [16]
-
[17]
IBM. 2024. granite-guardian-hap-125m Model Card. Hugging Face Model Card: ibm-granite/granite-guardian-hap-125m. https://huggingface.co/ibm-granite/ granite-guardian-hap-125m
work page 2024
-
[18]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is bert really robust? a strong baseline for natural language attack on text classification and en- tailment. InProceedings of the AAAI conference on artificial intelligence (AAAI’20), Vol. 34. 8018–8025
work page 2020
-
[20]
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 6193–6202
work page 2020
-
[21]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173
work page 2024
-
[22]
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injec- tion attack against llm-integrated applications.arXiv preprint arXiv:2306.05499 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. InPro- ceedings of the 33rd USENIX Security Symposium (USENIX Security ’24). 1831– 1847
work page 2024
-
[24]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.Proceedings of Machine Learning Research235 (2024), 35181–35224
work page 2024
-
[25]
Meta Llama. 2025. Hugging Face Model Card: meta-llama/Llama-Prompt-Guard- 2-86M. https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
work page 2025
-
[26]
Meta Llama. 2025. Llama Prompt Guard 2: Model Cards and Prompt For- mats. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt- guard/. Accessed April 27, 2026
work page 2025
-
[27]
mradermacher. [n. d.]. Qwen2.5-14B-Instruct-abliterated-GGUF. https:// huggingface.co/mradermacher/Qwen2.5-14B-Instruct-abliterated-GGUF. Hug- ging Face model card, accessed April 17, 2026
work page 2026
-
[28]
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martin Santillan Cooper, Kieran Fraser, et al. 2025. Granite Guardian: Comprehensive LLM Safeguarding. InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Li...
work page 2025
-
[29]
Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In2019 IEEE automatic speech recognition and understanding workshop (ASRU). 838–844
work page 2019
-
[30]
Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527(2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. " Why should i trust you?" Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD’16). 1135–1144
work page 2016
-
[32]
Avital Shafran, Roei Schuster, and Vitaly Shmatikov. 2025. Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents. In Proceedings of the 34th USENIX Security Symposium (USENIX Security ’25)
work page 2025
-
[33]
Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. 2021. Univer- sal adversarial attacks with natural triggers for text classification. InProceedings of the 2021 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies. 3724–3733
work page 2021
-
[34]
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. InInternational conference on machine learning (ICML’17). 3319–3328
work page 2017
-
[35]
Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawendé F. Bissyandé. 2024. CodeAgent: Autonomous Communicative Agents for Code Review. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP’24). 11279–11313
work page 2024
-
[36]
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. InProceedings 13 Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He, Yan Zhai, Kun Sun, Mingkui Wei, and Junjie Xiong of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Jo...
work page 2019
-
[37]
Haoxin Wang, Xianhan Peng, Huang Cheng, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, and Jiang Lin. 2025. ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP’25). 276–284
work page 2025
-
[38]
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents.Frontiers of Computer Science18, 6 (2024), 186345
work page 2024
-
[39]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?. InAdvances in Neural Information Processing Systems
work page 2023
-
[40]
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 1480–1489
work page 2016
-
[41]
Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, and Yu Jiang
-
[42]
Human-imperceptible retrieval poisoning attacks in LLM-powered appli- cations. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE’24). 502–506
-
[43]
Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. 2023. Poison- ing Retrieval Corpora by Injecting Adversarial Passages. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP’23). 13764–13775
work page 2023
-
[44]
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
sure, to create a prop that convincingly mimics an id or driver’s license
Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. InProceedings of the 34th USENIX Security Symposium (USENIX Security ’25). A Ethical Considerations We attest that we have reviewed the conference ethics discussions and guidelines and considered...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.