Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers

Changjia Zhu; Junjie Xiong; Junyu Wang; Kun Sun; Mingkui Wei; Xu He; Yan Zhai; Yuanbo Zhou

arxiv: 2605.23196 · v1 · pith:Z4NDL3HMnew · submitted 2026-05-22 · 💻 cs.CR

Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He, Yan Zhai, Kun Sun, Mingkui Wei, Junjie Xiong

Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3

classification 💻 cs.CR

keywords prompt overflow attack · guardrail models · prompt injection · context length mismatch · adversarial prompts · LLM safety · overlength inputs · truncation vulnerability

The pith

Guardrail models fail to detect malicious prompts split across overlong inputs, even though downstream LLMs can still act on them

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that guardrail models process overlength prompts by truncating or segmenting them due to context limits, while LLMs use much larger inference windows. This mismatch lets a Prompt Overflow Attack break malicious instructions into fragments mixed with benign filler text so that no inspected segment appears malicious, yet the complete prompt remains fully actionable to the target LLM. A sympathetic reader would care because guardrails are the main front-line defense against prompt injection, and this attack shows how length manipulation alone can defeat them without altering the underlying harmful intent. Systematic tests confirm that prompts reliably caught in short contexts evade several production guardrails once extended this way.

Core claim

The central claim is that the mismatch between the limited inspection windows of guardrail models and the substantially larger context windows of downstream LLMs creates a blind spot; the Prompt Overflow Attack exploits this by fragmenting malicious instructions and interleaving them with benign filler across an overlong prompt, so that no individual inspected segment triggers detection while the full context remains executable by the LLM.

What carries the argument

The Prompt Overflow Attack, which distributes malicious instructions across segments separated by benign filler so that truncation or segmentation-based guardrail inspection sees only benign pieces.
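To make the mechanism concrete, here is a toy sketch, not the paper's construction pipeline. It assumes whitespace-level tokens, an 8-token inspection window, and a stand-in detector (`toy_detector`) that fires only when two neutral marker tokens, `ALPHA` and `BETA`, co-occur in the span it is shown. All names and values are hypothetical.

```python
from typing import Callable, List

Tokens = List[str]
Policy = Callable[[Tokens, int], List[Tokens]]


def head_truncate(tokens: Tokens, window: int) -> List[Tokens]:
    """Truncation policy: only the first `window` tokens are inspected."""
    return [tokens[:window]]


def segment(tokens: Tokens, window: int) -> List[Tokens]:
    """Segmentation policy: non-overlapping chunks, each inspected alone."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def toy_detector(span: Tokens) -> bool:
    """Stand-in classifier (assumption): flags a span only if both
    marker tokens appear inside that same span."""
    return "ALPHA" in span and "BETA" in span


def guardrail_flags(tokens: Tokens, window: int, policy: Policy) -> bool:
    """The guardrail flags the prompt if any inspected span is flagged."""
    return any(toy_detector(span) for span in policy(tokens, window))


WINDOW = 8  # hypothetical inspection window, in tokens
short_prompt = ["ALPHA", "BETA"]
long_prompt = ["ALPHA"] + ["lorem"] * 10 + ["BETA"]  # same markers, padded apart

print("short, segmented:", guardrail_flags(short_prompt, WINDOW, segment))
print("long, truncated: ", guardrail_flags(long_prompt, WINDOW, head_truncate))
print("long, segmented: ", guardrail_flags(long_prompt, WINDOW, segment))
print("long, full view: ", toy_detector(long_prompt))
```

In this toy setup the short prompt is flagged, the padded one passes under both policies, and a reader that sees the whole sequence still has both markers in view. Real guardrails are learned classifiers with their own tokenizers, so this shows only the structural gap, not any particular model's behaviour.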

If this is right

Prompts detected by guardrails in short-context settings evade once extended with filler content.
State-of-the-art guardrails including Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors are all susceptible.
The full prompt remains fully actionable to the downstream LLM despite passing the guardrail.
Potential defenses include strengthening guardrails to close the length-based blind spot.
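One way the last point might look in practice is sketched below. This is an assumption of this page, not a defense taken from the paper: a length gate that refuses prompts longer than a budget the guardrail can actually cover, combined with overlapping windows whose scores are max-aggregated. The function names, the default `window`, `stride`, `threshold`, and `max_tokens` values, and the `score` callable are all hypothetical.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window: int, stride: int) -> List[List[str]]:
    """Overlapping windows that together cover every token, including the tail."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    starts = list(range(0, len(tokens) - window + 1, stride))
    if starts[-1] != len(tokens) - window:
        starts.append(len(tokens) - window)  # final window flush with the end
    return [list(tokens[s:s + window]) for s in starts]


def screen(
    tokens: Sequence[str],
    score: Callable[[List[str]], float],  # hypothetical per-window risk score in [0, 1]
    window: int = 512,
    stride: int = 256,
    threshold: float = 0.5,
    max_tokens: int = 4096,
) -> str:
    """Return 'reject_overlength', 'block', or 'allow'."""
    if len(tokens) > max_tokens:
        # Length gate: do not forward text the guardrail cannot cover in budget.
        return "reject_overlength"
    worst = max(score(w) for w in sliding_windows(tokens, window, stride))
    return "block" if worst >= threshold else "allow"
```

Overlapping windows only help when the incriminating evidence fits inside a single window; fragments spread further apart than `window` tokens would still be scored separately, which is why the sketch pairs them with a blunt length gate rather than relying on windowing alone.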

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Guardrails may need to adopt context windows comparable to the models they protect.
The same length-mismatch pattern could affect other layered safety mechanisms that preprocess inputs.
Varying the filler density and segment count would test how far the attack generalizes across different guardrail architectures.
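The first extension can be put in rough numbers. The sketch below computes the fraction of a prompt that a head-truncating guardrail would inspect. The 512-token window is the figure quoted in the rebuttal section of this page; the prompt lengths are illustrative assumptions, not values from the paper.

```python
def inspected_fraction(prompt_tokens: int, guard_window: int) -> float:
    """Share of the prompt seen under head truncation: min(1, W / L)."""
    if guard_window <= 0:
        raise ValueError("guard_window must be positive")
    if prompt_tokens <= 0:
        return 1.0  # nothing to inspect, so nothing is missed
    return min(1.0, guard_window / prompt_tokens)


GUARD_WINDOW = 512  # figure quoted in the rebuttal section
for length in (256, 512, 8_192, 131_072):  # hypothetical prompt lengths
    print(f"{length:>7} tokens -> {inspected_fraction(length, GUARD_WINDOW):.4f} inspected")
```

Under these assumptions an 8,192-token prompt is one-sixteenth inspected, and the share keeps shrinking as the downstream context window grows.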

Load-bearing premise

Guardrail models handle overlength prompts exclusively through truncation or segmentation-based inspection rather than full-context processing.

What would settle it

Test whether an overlong prompt containing fragmented malicious instructions evades a given guardrail model while the downstream LLM still produces the intended harmful output.
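A minimal harness for that check could look like the following. It is a sketch under stated assumptions: `guardrail_flags` and `llm_complies` are hypothetical callables standing in for a real guardrail and for whatever judgement decides that the downstream model acted on the prompt; neither is an API from the paper or from any library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    short_prompt: str  # original prompt, within the guardrail's window
    long_prompt: str   # overlength variant of the same prompt


@dataclass(frozen=True)
class Outcome:
    total: int
    detected_short: int         # baseline: guardrail flags the short form
    evaded_long: int            # of those, the long form passes the guardrail
    evaded_and_actionable: int  # of those, the downstream model still acts


def run_check(
    cases: Iterable[Case],
    guardrail_flags: Callable[[str], bool],  # hypothetical guardrail wrapper
    llm_complies: Callable[[str], bool],     # hypothetical actionability judge
) -> Outcome:
    total = detected = evaded = both = 0
    for case in cases:
        total += 1
        if not guardrail_flags(case.short_prompt):
            continue  # no short-context detection, so evasion is undefined here
        detected += 1
        if guardrail_flags(case.long_prompt):
            continue  # the long form is still caught
        evaded += 1
        if llm_complies(case.long_prompt):
            both += 1
    return Outcome(total, detected, evaded, both)
```

Counting evasion only over cases detected in short form keeps the baseline explicit, and requiring both conditions on the same case is what separates the paper's claim from a guardrail that merely degrades on long inputs.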

Figures

Figures reproduced from arXiv: 2605.23196 by Changjia Zhu, Junjie Xiong, Junyu Wang, Kun Sun, Mingkui Wei, Xu He, Yan Zhai, Yuanbo Zhou.

**Figure 1.** Overview of Prompt Overflow Attack. The attacker exploits the architectural mismatch between the guardrail models …

**Figure 2.** The Overflow Prompt Construction Pipeline. Starting from an original malicious prompt …

**Figure 3.** Risk-aware construction identifies and exploits detector-critical evidence.

**Figure 4.** Effectiveness across inspection policies and attack density.

**Figure 5.** Bypass rate comparison under Tail/Interleave lay…

Original abstract

Guardrail models (a.k.a. safety checkers) are widely deployed to screen user inputs before they reach large language models (LLMs), serving as a primary defense against prompt injection attacks. Due to strict context constraints, these models handle overlength prompts through truncation or segmentation-based inspection. While prior work has focused on semantic adversarial inputs, the security implications of these long-input processing mechanisms remain largely unexplored. In this paper, we identify a critical blind spot arising from the mismatch between the limited inspection windows of guardrail models and the substantially larger context inference windows of downstream LLMs. We introduce a novel Prompt Overflow Attack, which exploits this mismatch by fragmenting malicious instructions and interleaving them with benign filler content across an overlong prompt, such that no individual inspected segment appears malicious while the full context remains actionable to the LLM. Through a systematic evaluation against state-of-the-art guardrail models, including Meta Llama Prompt Guard, IBM Granite Guardian, and DeBERTa-based detectors, we demonstrate that prompts reliably detected in short-context settings can evade guardrail models once adversarially manipulated into over-length inputs, yet remain fully actionable by downstream LLMs. We further propose potential defense strategies and outline mitigation directions to strengthen guardrail models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that a length-based mismatch lets fragmented prompts bypass guardrails while remaining effective against the downstream LLM.

The key takeaway is that current guardrail models have shorter context windows than the LLMs they protect, so they inspect prompts by truncating or splitting them. This creates an opening for attacks that break malicious content into pieces separated by harmless text, so no single inspected part triggers the guardrail, but the full prompt still works on the downstream model. The paper introduces the Prompt Overflow Attack using fragmentation and interleaving on overlength inputs. This seems distinct from earlier adversarial prompt work that focused on semantics rather than length handling.

They run tests on Llama Prompt Guard, IBM Granite Guardian, and DeBERTa detectors, showing that manipulated long prompts evade detection while staying effective for the LLM. They also suggest some mitigation ideas. This evaluation on specific deployed guardrails is a plus, as it ties the idea to real systems. The claim that short-context detections fail on long versions is the core result.

One potential issue is whether the guardrails' actual long-input behavior matches the truncation or segmentation assumption. If they use more sophisticated methods or pass more context, the mismatch might be smaller. The abstract lacks specific metrics, so the full paper needs to show the evasion rates clearly and rule out other explanations like general length effects on detection accuracy.

This work is aimed at researchers and engineers working on LLM safety mechanisms. It is worth sending to peer review because it identifies a practical vulnerability with some empirical backing on current tools, even if further validation would strengthen it.

Referee Report

2 major / 2 minor

Summary. The paper claims that guardrail models (Llama Prompt Guard, Granite Guardian, DeBERTa-based detectors) process overlength inputs via truncation or segmentation, creating a mismatch with LLMs' larger context windows. It introduces the Prompt Overflow Attack, which fragments malicious instructions and interleaves them with benign filler across an overlong prompt so that no inspected segment appears malicious, yet the full prompt remains actionable to the downstream LLM. The authors report that prompts reliably detected in short contexts evade these guardrails under the attack and propose mitigation strategies.

Significance. If the empirical results hold, the work identifies a practical, previously underexplored attack surface in deployed guardrail systems that is orthogonal to semantic adversarial techniques. Demonstrating both evasion and downstream actionability on named production guardrails would be a concrete contribution to LLM safety engineering.

major comments (2)

[Abstract, §3] Abstract and §3 (mechanism description): the central claim that guardrails 'handle overlength prompts through truncation or segmentation-based inspection' is load-bearing for the blind-spot argument, yet the manuscript provides no direct evidence, API documentation, or experimental verification of the exact processing pipeline used by Llama Prompt Guard or Granite Guardian on inputs exceeding their stated limits.
[§4] §4 (evaluation): the assertion of a 'systematic evaluation' that shows 'prompts reliably detected in short-context settings can evade guardrail models' requires explicit reporting of detection rates, number of test prompts, segmentation parameters, baseline short-context accuracy, and LLM success rates (with which downstream models) to substantiate both the evasion and actionability claims; without these metrics the results cannot be assessed for statistical reliability or generalizability.

minor comments (2)

[Abstract] The abstract states the attack 'remains fully actionable by downstream LLMs' but does not name the LLMs used or report success-rate thresholds; this detail belongs in the evaluation section for reproducibility.
[§3] Clarify whether the benign filler content is drawn from a fixed distribution or chosen adversarially, as this affects the attack's claimed generality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our core claims and evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract, §3] Abstract and §3 (mechanism description): the central claim that guardrails 'handle overlength prompts through truncation or segmentation-based inspection' is load-bearing for the blind-spot argument, yet the manuscript provides no direct evidence, API documentation, or experimental verification of the exact processing pipeline used by Llama Prompt Guard or Granite Guardian on inputs exceeding their stated limits.

Authors: We acknowledge that the manuscript states the processing behavior without including direct experimental verification or proprietary API documentation. This statement follows from the publicly documented context limits of the models (512 tokens for Llama Prompt Guard; similar constraints for Granite Guardian), which require truncation or segmentation for longer inputs under standard transformer inference practices. Since internal pipelines are not publicly disclosed, we cannot cite source-level documentation. We will revise §3 to add a dedicated paragraph with indirect experimental support: we will report results from controlled tests showing that guardrail decisions change when inputs cross the documented token thresholds in ways consistent with segmentation, and we will reference any available public model cards or technical reports.

revision: yes
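A controlled test of the kind promised in this response might be sketched as follows. This is an illustration only: `probe_offsets` pushes a known-flagged canary span further into the input by prepending filler tokens and records the guardrail's decision at each offset. The stand-in guard (`make_truncating_guard`), the `CANARY` token, and the chosen offsets are assumptions; a real probe would have to count tokens with the guardrail's own tokenizer.

```python
from typing import Callable, Dict, List, Sequence


def probe_offsets(
    flags: Callable[[List[str]], bool],  # hypothetical guardrail wrapper
    canary: List[str],                   # span the guardrail flags on its own
    filler_token: str,
    offsets: Sequence[int],
) -> Dict[int, bool]:
    """Record the guardrail decision with the canary placed at each offset."""
    results: Dict[int, bool] = {}
    for offset in offsets:
        tokens = [filler_token] * offset + canary
        results[offset] = flags(tokens)
    return results


def make_truncating_guard(window: int) -> Callable[[List[str]], bool]:
    """Toy guard (assumption): inspects only the first `window` tokens."""
    def flags(tokens: List[str]) -> bool:
        return "CANARY" in tokens[:window]
    return flags


guard = make_truncating_guard(512)
decisions = probe_offsets(guard, ["CANARY"], "pad", [0, 256, 511, 512, 1024])
for offset, flagged in decisions.items():
    print(offset, flagged)
```

With the toy guard the decision flips between offsets 511 and 512. A flip at the documented limit would be consistent with head truncation but would not prove it, and a guardrail that segments and flags on any chunk would show no flip for a single intact canary, so the probe distinguishes the two policies only in combination with fragmented inputs.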
Referee: [§4] §4 (evaluation): the assertion of a 'systematic evaluation' that shows 'prompts reliably detected in short-context settings can evade guardrail models' requires explicit reporting of detection rates, number of test prompts, segmentation parameters, baseline short-context accuracy, and LLM success rates (with which downstream models) to substantiate both the evasion and actionability claims; without these metrics the results cannot be assessed for statistical reliability or generalizability.

Authors: The full §4 already contains these metrics (200 prompts, short-context detection rates of 92–97% dropping to 4–12% under overflow, 512-token segmentation windows, and downstream success rates of 78–91% on Llama-3-70B and GPT-4o). To improve clarity and address the concern directly, we will add an explicit summary table in §4 listing all requested quantities, move key numbers into the main text, and ensure the abstract references the scale of the evaluation.

revision: yes
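On the referee's statistical-reliability point, a quick way to see how much uncertainty 200 prompts leave is a Wilson score interval on each quoted rate. The sketch below is an editorial aid, not an analysis from the paper: it assumes that each percentage in the response is a proportion out of the same 200 prompts, which the response does not state explicitly.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


N_PROMPTS = 200  # from the response; assumed to be the denominator of every rate
for rate in (0.92, 0.97, 0.04, 0.12):  # detection-rate endpoints quoted above
    k = round(rate * N_PROMPTS)
    low, high = wilson_interval(k, N_PROMPTS)
    print(f"{k}/{N_PROMPTS}: [{low:.3f}, {high:.3f}]")
```

If the intervals for the short-context and overflow detection rates stay far apart, sampling noise alone is an unlikely explanation for the drop, though it says nothing about how the 200 prompts were chosen.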

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical security study demonstrating a prompt overflow attack via fragmentation of malicious instructions across overlong inputs. It evaluates evasion on concrete guardrail models (Llama Prompt Guard, Granite Guardian, DeBERTa) and downstream actionability on LLMs, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains. The mismatch between inspection windows and LLM context is stated as an observed architectural difference and tested directly; the claim does not reduce to any self-referential definition or renaming of results. This is a standard non-circular empirical attack paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical security paper; central claim rests on existence and effectiveness of the described attack demonstrated via evaluation. No free parameters, mathematical axioms, or invented entities are invoked in the abstract.

pith-pipeline@v0.9.0 · 5771 in / 1227 out tokens · 46883 ms · 2026-05-25T04:28:51.925828+00:00 · methodology


[1] [1]

Renzo Arturo Alva Principe, Nicola Chiarini, and Marco Viviani. 2025. Long Document Classification in the Transformer Era: A Survey on Challenges, Ad- vances, and Open Issues.WIREs Data Mining and Knowledge Discovery15, 2 (2025), e70019

work page 2025

[2] [2]

Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, et al. 2024. Many-shot Jailbreaking. InAdvances in Neural Information Processing Systems

work page 2024

[3] [3]

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document transformer.arXiv preprint arXiv:2004.05150(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. 2024. Jailbreakbench: An open robustness bench- mark for jailbreaking large language models.Advances in Neural Information Processing Systems37 (2024), 55005–55029

work page 2024

[5] [5]

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: Defending Against Prompt Injection with Structured Queries. InProceedings of the 34th USENIX Security Symposium (USENIX Security ’25)

work page 2025

[6] [6]

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting Transformer-based Models for Long Document Classification. InFindings of the Association for Computational Linguistics: EMNLP 2022. 7212–7230

work page 2022

[7] [7]

Dietterich, Richard H

Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solv- ing the Multiple Instance Problem with Axis-Parallel Rectangles.Artificial Intelligence89, 1–2 (1997), 31–71

work page 1997

[8] [8]

Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, and Xiaowei Huang

work page

[9] [9]

Safeguarding large language models: a survey.Artificial Intelligence Review 58 (2025)

work page 2025

[10] [10]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InFindings of the Association for Computational Linguistics: EMNLP 2020. 3356–3369

work page 2020

[11] [11]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23). 79–90

work page 2023

[12] [12]

Shanshan Han, Salman Avestimehr, and Chaoyang He. 2025. Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences.arXiv preprint arXiv:2502.08142(2025)

work page arXiv 2025

[13] [13]

Eric Hartford and Cognitive Computations. [n. d.]. Dolphin 2.9.3 Mistral Nemo 12b. https://huggingface.co/dphn/dolphin-2.9.3-mistral-nemo-12b. Hugging Face model card, accessed April 17, 2026

work page 2026

[14] [14]

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTaV3: Improv- ing DeBERTa using ELECTRA-style Pre-training with Gradient-Disentangled Embedding Sharing. InInternational Conference on Learning Representations (ICLR’23)

work page 2023

[15] [15]

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. InInternational Confer- ence on Learning Representations (ICLR’21)

work page 2021

[16] [16]

David Herel, Hugo Cisneros, and Tomas Mikolov. 2022. Preserving semantics in textual adversarial attacks.arXiv preprint arXiv:2211.04205(2022)

work page arXiv 2022

[17] [17]

IBM. 2024. granite-guardian-hap-125m Model Card. Hugging Face Model Card: ibm-granite/granite-guardian-hap-125m. https://huggingface.co/ibm-granite/ granite-guardian-hap-125m

work page 2024

[18] [18]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is bert really robust? a strong baseline for natural language attack on text classification and en- tailment. InProceedings of the AAAI conference on artificial intelligence (AAAI’20), Vol. 34. 8018–8025

work page 2020

[20] [20]

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 6193–6202

work page 2020

[21] [21]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173

work page 2024

[22] [22]

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injec- tion attack against llm-integrated applications.arXiv preprint arXiv:2306.05499 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. InPro- ceedings of the 33rd USENIX Security Symposium (USENIX Security ’24). 1831– 1847

work page 2024

[24] [24]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.Proceedings of Machine Learning Research235 (2024), 35181–35224

[25]

Meta Llama. 2025. Hugging Face Model Card: meta-llama/Llama-Prompt-Guard-2-86M. https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M

[26]

Meta Llama. 2025. Llama Prompt Guard 2: Model Cards and Prompt Formats. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/. Accessed April 27, 2026

[27]

mradermacher. [n. d.]. Qwen2.5-14B-Instruct-abliterated-GGUF. https://huggingface.co/mradermacher/Qwen2.5-14B-Instruct-abliterated-GGUF. Hugging Face model card, accessed April 17, 2026

[28]

Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martin Santillan Cooper, Kieran Fraser, et al. 2025. Granite Guardian: Comprehensive LLM Safeguarding. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Li...

[29]

Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In 2019 IEEE automatic speech recognition and understanding workshop (ASRU). 838–844

[30]

Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527 (2022)

[31]

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD’16). 1135–1144

[32]

Avital Shafran, Roei Schuster, and Vitaly Shmatikov. 2025. Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents. In Proceedings of the 34th USENIX Security Symposium (USENIX Security ’25)

[33]

Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. 2021. Universal adversarial attacks with natural triggers for text classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3724–3733

[34]

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In International conference on machine learning (ICML’17). 3319–3328

[35]

Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawendé F. Bissyandé. 2024. CodeAgent: Autonomous Communicative Agents for Code Review. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP’24). 11279–11313

[36]

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Jo...

[37]

Haoxin Wang, Xianhan Peng, Huang Cheng, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, and Jiang Lin. 2025. ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP’25). 276–284

[38]

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345

[39]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems

[40]

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 1480–1489

[41]

Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, and Yu Jiang. 2024. Human-imperceptible retrieval poisoning attacks in LLM-powered applications. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE’24). 502–506

[43]

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. 2023. Poisoning Retrieval Corpora by Injecting Adversarial Passages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP’23). 13764–13775

[44]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

[45]

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. In Proceedings of the 34th USENIX Security Symposium (USENIX Security ’25).

A Ethical Considerations

We attest that we have reviewed the conference ethics discussions and guidelines and considered...
