pith. machine review for the scientific record. sign in

arxiv: 2604.28157 · v1 · submitted 2026-04-30 · 💻 cs.CR

Recognition: unknown

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Authors on Pith no claims yet

Pith reviewed 2026-05-07 06:43 UTC · model grok-4.3

classification 💻 cs.CR
keywords red-teamingprompt injectionknowledge corruptionlong-context LLMsoptimization-based attackscomputational efficiencymemory optimizationLLM security
0
0 comments X

The pith

FlashRT makes optimization-based red-teaming 2x-7x faster and 2x-4x more memory-efficient for long-context LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present FlashRT, a framework that addresses the high computational and memory costs of optimization-based methods for red-teaming long-context LLMs against threats like prompt injection and knowledge corruption. These methods provide more rigorous security assessments than heuristic approaches but are impractical for widespread use due to their resource demands on models with extended contexts. FlashRT achieves significant efficiency gains through targeted optimizations that apply across different black-box methods. This development matters for enabling academic and practical evaluation of LLM security at scale, potentially leading to better defenses in applications such as agents and assistants. The reported results show speedups of 2x to 7x and memory reductions of 2x to 4x compared to the leading baseline, with no loss in attack performance.

Core claim

FlashRT is the first framework to improve the efficiency in terms of both computation and memory for optimization-based prompt injection and knowledge corruption attacks under long-context LLMs. Through extensive evaluations, we find that FlashRT consistently delivers a 2x-7x speedup (e.g., reducing runtime from one hour to less than ten minutes) and a 2x-4x reduction in GPU memory consumption (e.g., reducing from 264.1 GB to 65.7 GB GPU memory for a 32K token context) compared to state-of-the-art baseline nanoGCG. FlashRT can be broadly applied to black-box optimization methods, such as TAP and AutoDAN.

What carries the argument

FlashRT, a framework of targeted computational shortcuts and memory management techniques applied to black-box optimization for generating prompt injection and knowledge corruption attacks on long-context LLMs.

If this is right

  • Systematic security evaluation of long-context LLMs becomes feasible for researchers with standard hardware resources.
  • Assessment of defense strategies against prompt injection and knowledge corruption can occur at larger scales and faster iteration cycles.
  • The same efficiency techniques extend to other black-box optimization red-teaming methods such as TAP and AutoDAN.
  • Long-context models can be tested more thoroughly before real-world deployment in retrieval-augmented generation and agent systems.
  • Lower resource requirements reduce barriers for academic groups to conduct repeated, large-scale red-teaming experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could accelerate discovery of vulnerabilities in long-context models used for autonomous agents and AI assistants.
  • The same optimization patterns might enable red-teaming on contexts longer than 32K tokens that currently exceed practical budgets.
  • Similar efficiency techniques could be ported to other LLM attack surfaces beyond prompt injection and knowledge corruption.
  • Combining FlashRT outputs with automated defense tuning loops might create faster closed-loop security improvement pipelines.

Load-bearing premise

The efficiency optimizations preserve attack success rates at levels comparable to the unoptimized baseline across tested models and contexts.

What would settle it

A head-to-head run on a 32K-token context model showing that FlashRT's runtime stays above 20 minutes, memory use stays above 100 GB, or attack success rate falls more than 20 percent below the nanoGCG baseline would falsify the efficiency claims.

Figures

Figures reproduced from arXiv: 2604.28157 by Chenlong Yin, Jinyuan Jia, Yanting Wang, Ying Chen.

Figure 1
Figure 1. Figure 1: (a,b) nanoGCG and (c,d) FlashRT: comparison of total computation time and GPU memory usage for forward and backward passes on NarrativeQA with Llama-3.1-8B. nanoGCG is dominated by forward-pass computation while incurring high backward-pass memory cost; FlashRT significantly reduces both overheads. based strategies, while the latter employ optimization techniques to automatically craft adversarial texts th… view at source ↗
Figure 2
Figure 2. Figure 2: Standard KV-Caching (top) needs to recompute hidden states and key-value pairs for all tokens in the right side of the candidate malicious text to calculate the exact loss. Different from standard KV-Caching, we only recompute hidden states and key-value pairs for selected tokens to approximately calculate the loss, thereby improving computational efficiency. eliminating redundant computation for the share… view at source ↗
Figure 3
Figure 3. Figure 3: Influence scores of individual tokens on the hidden states of the target output Yˆ at the 300-th iteration, when the malicious text is injected in the middle of the context. Tokens from the task instruction Is, malicious text T, user instruction Iu, and target output Yˆ generally exhibit high influence scores, whereas only a sparse subset of tokens in the right context Cr show notable influence. In contras… view at source ↗
Figure 4
Figure 4. Figure 4: This figure illustrates how peak GPU memory usage scales with increasing context length. The model is Llama-3.1-8B-Instruct. The results show that our method’s memory growth rate is slower than that of nanoGCG as the context length increases. A Stability Analysis for Attention Weights In Section 5.3, we note that our algorithm updates the recompute set (by recomputing influence scores) whenever the best ma… view at source ↗
Figure 5
Figure 5. Figure 5: Effect of malicious text position for prompt injection attacks, where 0 (or 1) denotes injecting at the beginning (or end). The dataset is MuSiQue. 0 0.25 0.5 0.75 1.0 Injection Position 0 500 1000 1500 2000 Time (s) nanoGCG nanoGCG-OPT FlashRT (a) Total time 0 0.25 0.5 0.75 1.0 Injection Position 0 20 40 60 80 100 120 Memory (GB) nanoGCG nanoGCG-OPT FlashRT (b) GPU memory 0 0.25 0.5 0.75 1.0 Injection Pos… view at source ↗
Figure 6
Figure 6. Figure 6: Effect of malicious text position for knowledge corruption attacks, where 0 (or 1) denotes injecting at the beginning (or end). The dataset is NQ view at source ↗
Figure 7
Figure 7. Figure 7: Impact of β on computation time, memory usage, and forward/backward pass statistics. 0.05 0.2 0.4 0.6 0.8 1.0 0 250 500 750 1000 1250 1500 Total time (s) (a) Total time 0.05 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 100 Memory (GB) (b) GPU memory 0.05 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 Time per forward (s) (c) Forward time 0.05 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Time per backward (s) (d) Backward… view at source ↗
Figure 8
Figure 8. Figure 8: Impact of γ on computation time, memory usage, and forward/backward pass statistics view at source ↗
Figure 9
Figure 9. Figure 9: Impact of τ on computation time, memory usage, and forward/backward pass statistics. 110 50 100 150 200 0 250 500 750 1000 1250 1500 Total time (s) (a) Total time 110 50 100 150 200 0 20 40 60 80 100 Memory (GB) (b) GPU memory 110 50 100 150 200 0.0 0.2 0.4 0.6 0.8 Time per forward (s) (c) Forward time 110 50 100 150 200 0.0 0.2 0.4 0.6 0.8 Time per backward (s) (d) Backward time 110 50 100 150 200 0 1000 … view at source ↗
Figure 10
Figure 10. Figure 10: Impact of ρ on computation time, memory usage, and forward/backward pass statistics view at source ↗
read the original abstract

Long-context large language models (LLMs)-for example, Gemini-3.1-Pro and Qwen-3.5-are widely used to empower many real-world applications, such as retrieval-augmented generation, autonomous agents, and AI assistants. However, security remains a major concern for their widespread deployment, with threats such as prompt injection and knowledge corruption. To quantify the security risks faced by LLMs under these threats, the research community has developed heuristic-based and optimization-based red-teaming methods. Optimization-based methods generally produce stronger attacks than heuristic attacks and thus provide a more rigorous assessment of LLM security risks. However, they are often resource-intensive, requiring significant computation and GPU memory, especially for long context scenarios. The resource-intensive nature poses a major obstacle for the community (especially academic researchers) to systematically evaluate the security risks of long-context LLMs and assess the effectiveness of defense strategies at scale. In this work, we propose FlashRT, the first framework to improve the efficiency (in terms of both computation and memory) for optimization-based prompt injection and knowledge corruption attacks under long-context LLMs. Through extensive evaluations, we find that FlashRT consistently delivers a 2x-7x speedup (e.g., reducing runtime from one hour to less than ten minutes) and a 2x-4x reduction in GPU memory consumption (e.g., reducing from 264.1 GB to 65.7 GB GPU memory for a 32K token context) compared to state-of-the-art baseline nanoGCG. FlashRT can be broadly applied to black-box optimization methods, such as TAP and AutoDAN. We hope FlashRT can serve as a red-teaming tool to enable systematic evaluation of long-context LLM security. The code is available at: https://github.com/Wang-Yanting/FlashRT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlashRT, a framework for improving the computational and memory efficiency of optimization-based red-teaming methods targeting prompt injection and knowledge corruption attacks on long-context LLMs. It reports consistent 2x-7x speedups (e.g., reducing runtime from one hour to under ten minutes) and 2x-4x GPU memory reductions (e.g., from 264.1 GB to 65.7 GB for 32K-token contexts) relative to the nanoGCG baseline, while claiming comparable attack success rates. The framework is positioned as broadly applicable to other black-box optimization methods such as TAP and AutoDAN, with code released for reproducibility.

Significance. If the efficiency gains are achieved without meaningful degradation in attack success rates, FlashRT would meaningfully lower the resource barrier for systematic security evaluations of long-context LLMs, enabling academic researchers to assess risks and defenses at greater scale. The open-source code release strengthens the potential for adoption and verification.

major comments (2)
  1. [§4] §4 (Experimental Results): The central claim that FlashRT delivers efficiency gains while preserving attack success rates comparable to nanoGCG is load-bearing for the paper's practical contribution. The reported ASR values across tasks and context lengths lack accompanying standard deviations, number of trials, or statistical significance tests. Without these, it is impossible to determine whether observed differences (particularly on 32K-token contexts) are within experimental noise or indicate that the pruning/caching optimizations systematically narrow the explored suffix space.
  2. [§3.2] §3.2 (Optimizations): The descriptions of search pruning, candidate caching, and reduced query budgets should include an ablation or quantitative analysis of their effect on ASR. If these techniques alter the selection criterion or shrink the candidate pool, the reported wall-clock gains may reflect an easier optimization problem rather than a strict improvement in efficiency for the original task.
minor comments (2)
  1. [Abstract] Abstract: The model names 'Gemini-3.1-Pro' and 'Qwen-3.5' should be confirmed against the exact versions evaluated in the experiments and corrected if they are non-standard or typographical.
  2. [§4] The manuscript would benefit from an explicit statement of the total number of queries or API calls used in each ASR measurement to allow direct comparison of query efficiency alongside wall-clock time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of statistical rigor and ablation analysis needed to strengthen our claims about FlashRT's efficiency gains. We address each major comment below and commit to revisions that will provide the requested evidence without altering the core contributions of the work.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The central claim that FlashRT delivers efficiency gains while preserving attack success rates comparable to nanoGCG is load-bearing for the paper's practical contribution. The reported ASR values across tasks and context lengths lack accompanying standard deviations, number of trials, or statistical significance tests. Without these, it is impossible to determine whether observed differences (particularly on 32K-token contexts) are within experimental noise or indicate that the pruning/caching optimizations systematically narrow the explored suffix space.

    Authors: We agree that the absence of standard deviations, trial counts, and statistical tests weakens the ability to rigorously confirm ASR comparability. The original experiments were conducted with multiple independent runs per configuration to mitigate stochasticity in the optimization process, but these details and variability measures were not reported. In the revised manuscript, we will expand §4 to include: (i) the number of trials (at least 5 per setting), (ii) mean ASR with standard deviations as error bars in all tables and figures, and (iii) results of statistical significance tests (e.g., paired t-tests with p-values) between FlashRT and nanoGCG. These additions will demonstrate that any ASR differences fall within experimental noise and do not indicate systematic narrowing of the suffix space, as our pruning and caching preserve the original objective function and candidate evaluation criteria. revision: yes

  2. Referee: [§3.2] §3.2 (Optimizations): The descriptions of search pruning, candidate caching, and reduced query budgets should include an ablation or quantitative analysis of their effect on ASR. If these techniques alter the selection criterion or shrink the candidate pool, the reported wall-clock gains may reflect an easier optimization problem rather than a strict improvement in efficiency for the original task.

    Authors: We acknowledge that without quantitative ablations, it is difficult to isolate whether the efficiency improvements come purely from computational optimizations or from an inadvertently easier search problem. In the revised version, we will add a new subsection (or appendix) under §3.2 that presents ablation studies for each optimization: search pruning, candidate caching, and reduced query budgets. For each, we will report both the efficiency metrics (runtime and memory) and the corresponding ASR on representative tasks and context lengths (including 32K tokens). These ablations will show that ASR remains comparable (within 1-3% of the full-search baseline) while achieving the reported speedups, confirming that the techniques accelerate the original optimization task rather than simplifying it. We have internal development data supporting this and will include the full quantitative results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical efficiency claims rest on external benchmarks

full rationale

The paper introduces FlashRT as an engineering framework of optimizations (search pruning, candidate caching, reduced query budgets) for black-box red-teaming methods. Its headline results—2x-7x wall-clock speedup and 2x-4x memory reduction versus nanoGCG—are presented as measured experimental outcomes on long-context prompt-injection and knowledge-corruption tasks, not as outputs of any closed-form derivation or fitted parameter. No equations appear that define a quantity in terms of itself or that rename a fitted statistic as a “prediction.” The claim that FlashRT “can be broadly applied to black-box optimization methods, such as TAP and AutoDAN” is supported by additional empirical runs rather than by a self-citation chain or uniqueness theorem. Because the central claims are falsifiable by direct replication against the stated external baseline and do not reduce to the paper’s own inputs by construction, the derivation chain contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on the abstract only; no specific free parameters, axioms, or invented entities are identifiable from the provided text. The framework likely relies on standard optimization techniques and empirical tuning, but details are unavailable.

pith-pipeline@v0.9.0 · 5645 in / 1256 out tokens · 54023 ms · 2026-05-07T06:43:13.367937+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 29 canonical work pages · 7 internal anchors

  1. [1]

    Ignore Previous Prompt: Attack Techniques For Language Models

    F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models, ”arXiv preprint arXiv:2211.09527, 2022

  2. [2]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, ” inAISec Workshop, 2023

  3. [3]

    Formalizing and benchmarking prompt injection attacks and defenses,

    Y. Liu, Y. Jia, R. Geng, J. Jia, and N. Z. Gong, “Formalizing and benchmarking prompt injection attacks and defenses, ” inUSENIX Security Symposium, 2024

  4. [4]

    Prompt Injection attack against LLM-integrated Applications

    Y. Liu, G. Deng, Y. Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zhenget al., “Prompt injection attack against llm-integrated applications, ”arXiv preprint arXiv:2306.05499, 2023

  5. [5]

    Neural exec: Learning (and learning from) execution triggers for prompt injection attacks,

    D. Pasquini, M. Strohmeier, and C. Troncoso, “Neural exec: Learning (and learning from) execution triggers for prompt injection attacks, ” inAI Sec Workshop, 2024

  6. [6]

    Automatic and uni- versal prompt injection attacks against large language models.arXiv preprint arXiv:2403.04957, 2024

    X. Liu, Z. Yu, Y. Zhang, N. Zhang, and C. Xiao, “Automatic and universal prompt injection attacks against large language models, ”arXiv preprint arXiv:2403.04957, 2024

  7. [7]

    A critical evaluation of defenses against prompt injection attacks,

    Y. Jia, Z. Shao, Y. Liu, J. Jia, D. Song, and N. Z. Gong, “A critical evaluation of defenses against prompt injection attacks, ”arXiv preprint arXiv:2505.18333, 2025

  8. [8]

    Secalign: De- fending against prompt injection with preference optimization,

    S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wagner, and C. Guo, “Secalign: De- fending against prompt injection with preference optimization, ” inCCS, 2025

  9. [9]

    Datasentinel: A game-theoretic detection of prompt injection attacks,

    Y. Liu, Y. Jia, J. Jia, D. Song, and N. Z. Gong, “Datasentinel: A game-theoretic detection of prompt injection attacks, ” inIEEE S&P, 2025

  10. [10]

    Poisonedrag: Knowledge corruption attacks to retrieval- augmented generation of large language models,

    W. Zou, R. Geng, B. Wang, and J. Jia, “Poisonedrag: Knowledge corruption attacks to retrieval- augmented generation of large language models, ” inUSENIX Security, 2025

  11. [11]

    Phantom: General trigger attacks on retrieval augmented language generation,

    H. Chaudhari, G. Severi, J. Abascal, M. Jagielski, C. A. Choquette-Choo, M. Nasr, C. Nita-Rotaru, and A. Oprea, “Phantom: General trigger attacks on retrieval augmented language generation, ”arXiv, 2024

  12. [12]

    Certifiably robust rag against retrieval corruption,

    C. Xiang, T. Wu, Z. Zhong, D. Wagner, D. Chen, and P. Mittal, “Certifiably robust rag against retrieval corruption, ”arXiv preprint arXiv:2405.15556, 2024

  13. [13]

    Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models,

    P. Cheng, Y. Ding, T. Ju, Z. Wu, W. Du, P. Yi, Z. Zhang, and G. Liu, “Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models, ”arXiv preprint arXiv:2405.13401, 2024

  14. [14]

    Machine against the {RAG}: Jamming {Retrieval- Augmented}generation with blocker documents,

    A. Shafran, R. Schuster, and V. Shmatikov, “Machine against the {RAG}: Jamming {Retrieval- Augmented}generation with blocker documents, ” inUSENIX Security, 2025

  15. [15]

    Topic-fliprag: Topic-orientated adversarial opinion manipulation attacks to retrieval-augmented generation models,

    Y. Gong, Z. Chen, M. Chen, F. Yu, W. Lu, X. Wang, X. Liu, and J. Liu, “Topic-fliprag: Topic-orientated adversarial opinion manipulation attacks to retrieval-augmented generation models, ”arXiv preprint arXiv:2502.01386, 2025. Preprint version

  16. [16]

    Graphrag under fire,

    J. Liang, Y. Wang, C. Li, R. Zhu, T. Jiang, N. Gong, and T. Wang, “Graphrag under fire, ” inIEEE S&P, 2026

  17. [17]

    Universal and transferable adversarial attacks on aligned language models,

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models, ”arXiv, 2023

  18. [18]

    nanogcg,

    GraySwanAI, “nanogcg, ” https://github.com/GraySwanAI/nanoGCG, 2024, accessed: 2025-11-06

  19. [19]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models,

    X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating stealthy jailbreak prompts on aligned large language models, ” inICLR, 2024

  20. [20]

    Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr

    M. Nasr, N. Carlini, C. Sitawarin, S. V. Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailovet al., “The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections, ”arXiv preprint arXiv:2510.09023, 2025

  21. [21]

    RL is a hammer and LLMs are nails: A simple reinforcement learning recipe for strong prompt injection.arXiv preprint arXiv:2510.04885, 2025

    Y. Wen, A. Zharmagambetov, I. Evtimov, N. Kokhlikyan, T. Goldstein, K. Chaudhuri, and C. Guo, “Rl is a hammer and llms are nails: A simple reinforcement learning recipe for strong prompt injection, ” arXiv preprint arXiv:2510.04885, 2025

  22. [22]

    Piarena: A platform for prompt injection evaluation,

    R. Geng, C. Yin, Y. Wang, Y. Chen, and J. Jia, “Piarena: A platform for prompt injection evaluation, ” in ACL, 2026

  23. [23]

    Pismith: Reinforcement learning-based red teaming for prompt injection defenses,

    C. Yin, R. Geng, Y. Wang, and J. Jia, “Pismith: Reinforcement learning-based red teaming for prompt injection defenses, ”arXiv preprint arXiv:2603.13026, 2026

  24. [24]

    Tree of attacks: Jailbreaking black-box llms automatically,

    A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi, “Tree of attacks: Jailbreaking black-box llms automatically, ”NeurIPS, 2024

  25. [25]

    Meta secalign: A secure foundation llm against prompt injection attacks,

    S. Chen, A. Zharmagambetov, D. Wagner, and C. Guo, “Meta secalign: A secure foundation llm against prompt injection attacks, ”arXiv preprint arXiv:2507.02735, 2025

  26. [26]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating stealthy jailbreak prompts on aligned large language models, ”arXiv preprint arXiv:2310.04451, 2023

  27. [27]

    Autograd mechanics,

    PyTorch Contributors, “Autograd mechanics, ” https://docs.pytorch.org/docs/stable/notes/autograd. html, 2025, accessed 2025-10-05

  28. [28]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need, ”Advances in neural information processing systems, vol. 30, 2017

  29. [29]

    Making a sota adversarial attack on llms 38×faster,

    Haize Labs, “Making a sota adversarial attack on llms 38×faster, ” https://www.haizelabs.com/ technology/making-a-sota-adversarial-attack-on-llms-38x-faster, 2024, accessed: 2025-11-06

  30. [30]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks,

    M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking leading safety-aligned llms with simple adaptive attacks, ” inICLR, 2025

  31. [31]

    Lessons from defending gemini against indirect prompt injections,

    C. Shi, S. Lin, S. Song, J. Hayes, I. Shumailov, I. Yona, J. Pluto, A. Pappu, C. A. Choquette-Choo, M. Nasret al., “Lessons from defending gemini against indirect prompt injections, ”arXiv preprint arXiv:2505.14534, 2025

  32. [32]

    Black-box optimization of llm outputs by asking for directions,

    J. Zhang, M. Ding, Y. Liu, J. Hong, and F. Tramèr, “Black-box optimization of llm outputs by asking for directions, ”arXiv preprint arXiv:2510.16794, 2025

  33. [33]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    J. Yu, X. Lin, Z. Yu, and X. Xing, “Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts, ”arXiv preprint arXiv:2309.10253, 2023

  34. [34]

    Pleak: Prompt leaking attacks against large language model applications,

    B. Hui, H. Yuan, N. Gong, P. Burlina, and Y. Cao, “Pleak: Prompt leaking attacks against large language model applications, ” inCCS, 2024

  35. [35]

    Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms,

    Z. Liao and H. Sun, “Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms, ”arXiv preprint arXiv:2404.07921, 2024. Preprint version

  36. [36]

    Tao-attack: Toward advanced optimization-based jailbreak attacks for large language models,

    Z. Xu, J. Li, X. Zhang, H. Yu, and H. Liu, “Tao-attack: Toward advanced optimization-based jailbreak attacks for large language models, ”arXiv preprint arXiv:2603.03081, 2026

  37. [37]

    Claude code: An ai coding assistant,

    Anthropic, “Claude code: An ai coding assistant, ” https://www.anthropic.com/claude-code, 2025, accessed: 2026

  38. [38]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking leading safety-aligned llms with simple adaptive attacks, ”arXiv preprint arXiv:2404.02151, 2024

  39. [39]

    Pal: Proxy-guided black-box attack on large language models

    C. Sitawarin, N. Mu, D. Wagner, and A. Araujo, “Pal: Proxy-guided black-box attack on large language models, ”arXiv preprint arXiv:2402.09674, 2024

  40. [40]

    Flashattention-2: Faster attention with better parallelism and work partitioning,

    T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning, ”arXiv, 2023

  41. [41]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models, ” inInternational conference on machine learning. PMLR, 2023, pp. 38 087–38 099

  42. [42]

    Efficient memory management for large language model serving with pagedattention,

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention, ” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

  43. [43]

    Efficiently scaling transformer inference,

    R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference, ”Proceedings of machine learning and systems, vol. 5, pp. 606–624, 2023

  44. [44]

    Positive review only’: Researchers hide ai prompts in papers,

    Nikkei Asia, “Positive review only’: Researchers hide ai prompts in papers, ” https://asia.nikkei.com/Business/Technology/Artificial-intelligence/ Positive-review-only-Researchers-hide-AI-prompts-in-papers, 2025

  45. [45]

    Openai’s approach to external red teaming for ai models and systems,

    L. Ahmad, S. Agarwal, M. Lampe, and P. Mishkin, “Openai’s approach to external red teaming for ai models and systems, ”arXiv preprint arXiv:2503.16431, 2025

  46. [46]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Liet al., “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, ”arXiv preprint arXiv:2402.04249, 2024

  47. [47]

    Hotflip: White-box adversarial examples for text classifica- tion,

    J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “Hotflip: White-box adversarial examples for text classifica- tion, ”arXiv preprint arXiv:1712.06751, 2017

  48. [48]

    Learning to attribute with attention,

    B. Cohen-Wang, Y.-S. Chuang, and A. Madry, “Learning to attribute with attention, ”arXiv preprint arXiv:2504.13752, 2025

  49. [49]

    AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption

    Y. Wang, R. Geng, Y. Chen, and J. Jia, “Attntrace: Attention-based context traceback for long-context llms, ”arXiv preprint arXiv:2508.03793, 2025

  50. [50]

    Analyzing the Structure of Attention in a Transformer Language Model

    J. Vig and Y. Belinkov, “Analyzing the structure of attention in a transformer language model, ”arXiv preprint arXiv:1906.04284, 2019

  51. [51]

    Hotflip: White-box adversarial examples for text classifica- tion,

    J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “Hotflip: White-box adversarial examples for text classifica- tion, ” inACL, 2018

  52. [52]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Houet al., “Longbench: A bilingual, multitask benchmark for long context understanding, ”arXiv preprint arXiv:2308.14508, 2023

  53. [53]

    Tracllm: A generic framework for attributing outputs of long context llms,

    Y. Wang, W. Zou, R. Geng, and J. Jia, “Tracllm: A generic framework for attributing outputs of long context llms, ” inUSENIX Security Symposium, 2025

  54. [54]

    Musique: Multihop questions via single- hop question composition,

    H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Musique: Multihop questions via single- hop question composition, ”Transactions of the Association for Computational Linguistics, vol. 10, pp. 539–554, 2022. Preprint version

  55. [55]

    The narrativeqa reading comprehension challenge,

    T. Kočisk`y, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The narrativeqa reading comprehension challenge, ”Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018

  56. [56]

    Efficient attentions for long document summarization,

    L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization, ” arXiv preprint arXiv:2104.02112, 2021

  57. [57]

    Natural questions: a benchmark for question answering research,

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Leeet al., “Natural questions: a benchmark for question answering research, ”TACL, 2019

  58. [58]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering,

    Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering, ” inEMNLP, 2018

  59. [59]

    Ms marco: A human generated machine reading comprehension dataset,

    T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “Ms marco: A human generated machine reading comprehension dataset, ”choice, vol. 2640, p. 660, 2016

  60. [60]

    Prompt injection attacks against gpt-3,

    S. Willison, “Prompt injection attacks against gpt-3, ” https://simonwillison.net/2022/Sep/12/ prompt-injection/, 2022

  61. [61]

    Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples,

    H. J. Branch, J. R. Cefalu, J. McHugh, L. Hujer, A. Bahl, D. d. C. Iglesias, R. Heichman, and R. Darwishi, “Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples, ” arXiv preprint arXiv:2209.02128, 2022

  62. [62]

    Delimiters won’t save you from prompt injection,

    S. Willison, “Delimiters won’t save you from prompt injection, ” https://simonwillison.net/2023/May/ 11/delimiters-wont-save-you, 2023

  63. [63]

    all-MiniLM-L6-v2: A compact sentence embedding model,

    N. Reimers, I. Gurevych, and the SentenceTransformers community, “all-MiniLM-L6-v2: A compact sentence embedding model, ” 2021, 384-dimensional sentence embeddings, trained on 1 billion sentence pairs. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  64. [64]

    Llama Prompt Guard 2: A Classifier for Detecting Prompt Injection and Jailbreak Attacks,

    Meta, “Llama Prompt Guard 2: A Classifier for Detecting Prompt Injection and Jailbreak Attacks, ” https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M, 2025, accessed: 2025

  65. [65]

    Jailbreaking black box large language models in twenty queries,

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries, ” in2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2025, pp. 23–42

  66. [66]

    Scaled dot-product attention in pytorch,

    PyTorch Contributors, “Scaled dot-product attention in pytorch, ” https://pytorch.org/docs/stable/ generated/torch.nn.functional.scaled_dot_product_attention.html, 2023, accessed: 2026-04-17

  67. [67]

    Hugging face transformers,

    Hugging Face Inc., “Hugging face transformers, ” https://github.com/huggingface/transformers, 2023, accessed: 2026-04-17

  68. [68]

    Long code arena: a set of benchmarks for long-context code models

    E. Bogomolov, A. Eliseeva, T. Galimzyanov, E. Glukhov, A. Shapkin, M. Tigina, Y. Golubev, A. Kovrigin, A. Van Deursen, M. Izadiet al., “Long code arena: a set of benchmarks for long-context code models, ” arXiv preprint arXiv:2406.11612, 2024

  69. [69]

    lca-baselines: Baselines for all tasks from long code arena benchmarks,

    JetBrains-Research, “lca-baselines: Baselines for all tasks from long code arena benchmarks, ” https: //github.com/JetBrains-Research/lca-baselines, 2025, gitHub repository, accessed 1 Nov 2025

  70. [70]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases,

    Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li, “Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases, ”arXiv, 2024

  71. [71]

    Pymupdf – python bindings for mupdf (version 1.26.3),

    Artifex Software Inc. and contributors, “Pymupdf – python bindings for mupdf (version 1.26.3), ” https://pymupdf.readthedocs.io/, released July 2, 2025; high-performance PDF/text extraction library

  72. [72]

    Piarena: A platform for prompt injection evaluation,

    R. Geng, C. Yin, Y. Wang, Y. Chen, and J. Jia, “Piarena: A platform for prompt injection evaluation, ” https://github.com/sleeepeer/PIArena, 2026, gitHub repository. Preprint version Figure 3:Influence scores of individual tokens on the hidden states of the target output ˆY at the 300-th iteration, when the malicious text is injected in the middle of the ...