Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Dongdong She; Zesen Liu; Zihan Zhang

arxiv: 2606.21732 · v1 · pith:KIINVIF7new · submitted 2026-06-19 · 💻 cs.CR · cs.AI

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Zesen Liu , Zihan Zhang , Dongdong She This is my paper

Pith reviewed 2026-06-26 13:22 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords relinkingprompt compressionLLM agentsadversarial attackscompression boundaryconfused deputyagent benchmarks

0 comments

The pith

LLM agents that compress long contexts create a security boundary where the summarizer can reassemble split benign fragments into complete malicious instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that summarization-based prompt compression shifts the security boundary in LLM agents because filters inspect the original distributed prompt while the backend executes on the newly generated compressed context. Relinking occurs when the compressor connects locally benign fragments into a full malicious payload, without any explicit attack string present in the source. This arises because attention mechanisms make the fragments jointly available, pre-training makes compatible connections plausible, and post-training favors compact actionable summaries. The Relink tool splits malicious payloads into benign fragments and achieves an 86.9 percent relink rate and backend action rate on four long-context agent benchmarks, compared with 17 percent for clean-split controls. Existing defenses do not reliably block it, but the paper's KBRA defense reduces the residual backend action rate to zero.

Core claim

Relinking is a compression-boundary vulnerability in which the summarizer behaves as a confused deputy and produces a complete malicious instruction by connecting distributed, locally benign fragments; the vulnerability is inherent to summarization because attention makes fragments jointly available, pre-training makes connections plausible, and post-training favors compact backend-actionable outputs, and it is realized in practice by an automated tool that induces adversarial relinking at 86.9 percent success across benchmarks.

What carries the argument

Relinking, the process in which attention, pre-training, and post-training allow a summarizer to connect separated benign fragments into a single malicious payload that was never present in the source context.

If this is right

Filters that inspect only the pre-compression prompt miss attacks that appear only after summarization.
Adversarial relinking achieves 86.9 percent relink and backend action rates on long-context agent benchmarks.
Existing defenses do not reliably capture adversarial relinking.
The KBRA defense reduces residual backend action rate to 0.0 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Compression steps should be treated as execution boundaries that require joint inspection of possible fragment combinations.
Similar relinking risks could appear in other aggregation or summarization stages inside multi-step AI pipelines.
Testing with deliberately varied fragment spacing might show how sensitive the connection formation is to context length.

Load-bearing premise

The assumption that attention mechanisms and pre/post-training will cause the summarizer to plausibly connect separated benign fragments into the attacker's intended malicious payload rather than some other summary.

What would settle it

Running the compressor on the split benign fragments and checking whether it produces the attacker's specific malicious instruction or instead produces unrelated or safe content.

Figures

Figures reproduced from arXiv: 2606.21732 by Dongdong She, Zesen Liu, Zihan Zhang.

**Figure 2.** Figure 2: Mechanistic hypothesis of relinking. prompt injection [22, 44, 46], which embeds a complete instruction directly, and payload splitting [29, 32], which relies on attacker-supplied and explicit reconstruction instruction (e.g., instructions like “combine part A and part B”) to reconstruct the payload, relinking eliminates attacker-side reconstruction. The source contains only disconnected, locally benign fr… view at source ↗

**Figure 3.** Figure 3: Corpus signals motivating Relink’s construction representation. Panels (a)–(c) show target-value types, carrier-span styles, and local action/value co-occurrence, motivating exact-value preservation, style-aware carriers, and explicit locality control. 6.1 Corpus-Grounded Representation To enforce the source-side indicators on real contexts, we analyzed 11 diverse agentic corpora ( [PITH_FULL_IMAGE:figure… view at source ↗

**Figure 4.** Figure 4: Relink constructs an adversarial untrusted context by removing the payload binding, adding non-binding role cues, and distributing the separated action/value fragments into selected anchors so that τ ⋆ is grounded but unsupported before compression. Instantiating the instruction graph. When structured fields are available, the procedure reads p ⋆ , r ⋆ , v ⋆ , and κ ⋆ directly. Otherwise it first identifie… view at source ↗

**Figure 5.** Figure 5: RQ1 mechanism probes on benchmark-derived contexts. Bars report the positive-gap rate for each probe. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Source-side admissibility and post-compression backend action rate for [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Effect of compression rate on Relink success. Relink remains high across rates 0.3–0.8, while the clean-split control stays low, showing that relinking is not tied to a particular compression budget. RQ3 diagnoses which pipeline factors affect Relink after the default RQ2 setting. We vary the compression rate, compression prompt, compression model, and backend model one at a time. For compression-model cha… view at source ↗

**Figure 8.** Figure 8: OpenClaw case study. The agent fetches a webpage containing two isolated parts. The top red box acts as [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Claude Code case study. The compressor processes a low-authority support bundle containing separated [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

read the original abstract

Summarization-based prompt compression is increasingly used by LLM agents to shorten long, distributed contexts, but it shifts the security boundary: filters inspect the pre-compression prompt while the backend acts on a newly generated compressed context. We identify relinking, a compression-boundary vulnerability where the compressor behaves as a confused deputy, summarizing distributed, locally benign fragments into a complete malicious instruction. Unlike prompt injection, relinking need not place an explicitly malicious payload in the source context. We show that relinking arises from summarization itself: attention makes separated fragments jointly available, pre-training makes compatible fragments plausible to connect, and post-training favors compact backend-actionable summaries. We formalize the attacker-induced form as adversarial relinking and present Relink, an automated DSL-based tool that splits malicious payloads into benign fragments while keeping the complete payload absent before compression. Across four long-context agent benchmarks, Relink achieves 86.9% Relink Rate and Backend Action Rate versus 17.0% for clean-split controls. Existing defenses fail to reliably capture adversarial relinking; our KBRA defense reduces residual Backend Action Rate to 0.0%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Relinking flags a real boundary shift in compressed LLM agents with working attack code and numbers, but the causal mechanism for exact payload reconstruction needs tighter controls.

read the letter

The core takeaway is that this paper shows how summarization in LLM agents can turn distributed benign fragments into a malicious instruction at the compression step, and they back it with an automated tool that works at scale.

What stands out as new is the separation of relinking from prompt injection: the attack does not require an explicit bad payload in the source context, only fragments that the compressor will connect. They built Relink, a DSL-based splitter, and ran it across four long-context agent benchmarks, reporting 86.9% Relink Rate and Backend Action Rate against 17% for clean-split controls. The KBRA defense they introduce brings the residual rate to zero. That empirical package is the paper's main strength; it demonstrates the attack is practical rather than purely theoretical.

The soft spot is in the mechanism claim. The abstract ties the effect to attention making fragments jointly available, pre-training making connections plausible, and post-training favoring compact summaries. Yet the results do not appear to include ablations that test whether the summarizer reliably picks the attacker's intended payload over other plausible summaries of the same fragments. Without those, the high success rate shows the attack can be made to work, but it leaves open whether the three listed factors are the actual drivers or whether the tool simply finds splits that happen to trigger bad behavior for other reasons.

This is aimed at researchers working on LLM agent security and prompt compression defenses. It is worth a serious referee because the boundary shift it describes is structural and the attack numbers are concrete enough to force discussion, even if the explanatory part needs more work.

Referee Report

2 major / 2 minor

Summary. The paper claims that summarization-based prompt compression in LLM agents creates a new attack surface called relinking, in which the compressor (acting as a confused deputy) reconstructs complete malicious instructions from distributed, locally benign fragments. Unlike traditional prompt injection, the full payload need not appear pre-compression. The authors attribute this to attention making fragments jointly available, pre-training making connections plausible, and post-training favoring compact summaries. They introduce Relink, an automated DSL-based splitting tool, and report 86.9% Relink Rate and Backend Action Rate across four long-context agent benchmarks (versus 17.0% for clean-split controls). They also propose the KBRA defense that reduces residual Backend Action Rate to 0.0%.

Significance. If the empirical results and mechanistic account hold, the work identifies a previously unexamined security boundary shift in agent architectures that rely on compression. The provision of an automated attack-generation tool, controlled benchmark evaluation on four tasks, and a concrete defense proposal constitute concrete contributions. The empirical demonstration with explicit controls is a strength; the work is framed as an empirical demonstration rather than a parameter-free derivation.

major comments (2)

[Abstract and Evaluation section] Abstract and Evaluation section: The central claim that relinking 'arises from summarization itself' via the three listed mechanisms requires that the compressor preferentially reconstruct the attacker's exact intended malicious payload rather than other plausible summaries of the same fragments. The reported 86.9% Backend Action Rate (versus 17.0% clean-split control) does not include ablations or controls demonstrating that alternative (benign or differently malicious) summaries are not equally likely under the same attention and training dynamics. This is load-bearing for distinguishing relinking as a distinct compression-boundary phenomenon.
[Evaluation section] Evaluation section: The Backend Action Rate metric is central to the success claims, yet the manuscript does not appear to provide the precise definition, exclusion criteria, or measurement protocol (e.g., how backend actions are observed and attributed to the relinked payload). Without these details the 86.9% figure cannot be independently verified or compared across the four benchmarks.

minor comments (2)

[Abstract] The four long-context agent benchmarks are referenced but not named or characterized in the abstract; adding explicit benchmark identifiers and task descriptions would improve reproducibility.
[Abstract] Notation for 'Relink Rate' versus 'Backend Action Rate' should be defined at first use and kept consistent throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify important areas for strengthening the manuscript's claims and clarity. We respond to each major comment below and indicate the revisions that will be incorporated.

read point-by-point responses

Referee: [Abstract and Evaluation section] Abstract and Evaluation section: The central claim that relinking 'arises from summarization itself' via the three listed mechanisms requires that the compressor preferentially reconstruct the attacker's exact intended malicious payload rather than other plausible summaries of the same fragments. The reported 86.9% Backend Action Rate (versus 17.0% clean-split control) does not include ablations or controls demonstrating that alternative (benign or differently malicious) summaries are not equally likely under the same attention and training dynamics. This is load-bearing for distinguishing relinking as a distinct compression-boundary phenomenon.

Authors: We appreciate this point, which correctly identifies that the clean-split control alone does not directly demonstrate preferential reconstruction of the attacker's exact payload. The control shows that non-adversarially split fragments yield low rates of malicious backend actions (17.0%), indicating the effect is not generic to any benign fragments. However, we agree this leaves open whether other plausible summaries occur at comparable rates. In the revised manuscript we will add an ablation that samples multiple summaries from the compressor on the same fragment sets and reports the frequency of the exact intended malicious reconstruction versus benign or alternative malicious summaries. This will be placed in the Evaluation section to directly support the mechanistic claim. revision: yes
Referee: [Evaluation section] Evaluation section: The Backend Action Rate metric is central to the success claims, yet the manuscript does not appear to provide the precise definition, exclusion criteria, or measurement protocol (e.g., how backend actions are observed and attributed to the relinked payload). Without these details the 86.9% figure cannot be independently verified or compared across the four benchmarks.

Authors: We agree that the manuscript omitted a sufficiently precise definition and protocol for Backend Action Rate. In the revised Evaluation section we will add: a formal definition (proportion of trials in which the backend executes an action whose intent matches the relinked payload); exclusion criteria (e.g., discarding actions that cannot be unambiguously attributed due to pre-existing context or ambiguity); and the measurement protocol (backend logging of function calls or outputs, followed by keyword plus semantic similarity matching to the payload). These additions will enable independent verification and cross-benchmark comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration is self-contained

full rationale

The paper frames relinking as an observed vulnerability in summarization-based compression for LLM agents, demonstrated via an automated DSL tool (Relink) that splits payloads and evaluated on four benchmarks with reported rates (86.9% vs 17.0% controls). No load-bearing derivations, equations, or predictions reduce to fitted parameters, self-definitions, or self-citation chains. The mechanisms (attention, pre/post-training) are invoked as explanatory context for the empirical phenomenon rather than as a formal derivation that collapses to the inputs. The work is an attack presentation with defense evaluation, not a self-referential prediction loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; the contribution is presented as empirical identification of an attack surface.

pith-pipeline@v0.9.1-grok · 5732 in / 1161 out tokens · 19682 ms · 2026-06-26T13:22:15.499514+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 14 linked inside Pith

[1]

Compaction

Anthropic. Compaction. https://platform.claude.com/docs/en/build-with-claude/ compaction, 2026. Claude API Docs. Accessed: May 23, 2026

2026
[2]

A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021
[3]

Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

Pith/arXiv arXiv 2022
[4]

Emergent tool use from multi-agent autocurricula

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. InInternational conference on learning representations, 2019

2019
[5]

The berkeley framenet project

Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. InCOLING 1998 Volume 1: The 17th International Conference on Computational Linguistics, 1998

1998
[6]

Faithbench: A diverse hallucination benchmark for summarization by modern llms

Forrest Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, et al. Faithbench: A diverse hallucination benchmark for summarization by modern llms. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language T...

2025
[7]

τ 2-bench: Evaluating conversa- tional agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-bench: Evaluating conversa- tional agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

Pith/arXiv arXiv 2025
[8]

From single to multi: How llms hallucinate in multi-document summarization

Catarina G Belem, Pouya Pezeshkpour, Hayate Iso, Seiji Maekawa, Nikita Bhutani, and Estevam Hruschka. From single to multi: How llms hallucinate in multi-document summarization. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5276–5309, 2025. 19

2025
[9]

Checking for race conditions in file accesses.Computing systems, 2(2):131–152, 1996

Matt Bishop, Michael Dilger, et al. Checking for race conditions in file accesses.Computing systems, 2(2):131–152, 1996

1996
[10]

Composer 2 technical report.arXiv e-prints, pages arXiv–2603, 2026

Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, et al. Composer 2 technical report.arXiv e-prints, pages arXiv–2603, 2026

2026
[11]

{StruQ}: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. {StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Security 25), pages 2383–2400, 2025

2025
[12]

Meta secalign: A secure foundation llm against prompt injection attacks.arXiv preprint arXiv:2507.02735, 2025

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks.arXiv preprint arXiv:2507.02735, 2025

arXiv 2025
[13]

Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

arXiv 2025
[14]

Defeating prompt injections by design.arXiv preprint arXiv:2503.18813, 2025

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tram`er. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813, 2025

Pith/arXiv arXiv 2025
[15]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tram `er. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

2024
[16]

Thematic proto-roles and argument selection.language, 67(3):547–619, 1991

David Dowty. Thematic proto-roles and argument selection.language, 67(3):547–619, 1991

1991
[17]

A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

2021
[18]

From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

Pith/arXiv arXiv 2025
[19]

The case for case reopened

Charles J Fillmore. The case for case reopened. InGrammatical relations, pages 59–81. Brill, 1977

1977
[20]

Zorik Gekhman, Nadav Oved, Orgad Keller, Idan Szpektor, and Roi Reichart. On the robustness of dialogue history representation in conversational question answering: a comprehensive study and a new prompt-based method.Transactions of the Association for Computational Linguistics, 11:351–366, 2023

2023
[21]

Automatic labeling of semantic roles.Computational linguistics, 28(3):245–288, 2002

Daniel Gildea and Dan Jurafsky. Automatic labeling of semantic roles.Computational linguistics, 28(3):245–288, 2002

2002
[22]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023
[23]

The confused deputy: (or why capabilities might have been invented).ACM SIGOPS Operating Systems Review, 22(4):36–38, 1988

Norm Hardy. The confused deputy: (or why capabilities might have been invented).ACM SIGOPS Operating Systems Review, 22(4):36–38, 1988

1988
[24]

Attention tracker: Detecting prompt injection attacks in llms

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H Hsu, and Pin-Yu Chen. Attention tracker: Detecting prompt injection attacks in llms. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2309–2322, 2025

2025
[25]

Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Pith/arXiv arXiv 2023
[26]

Promptlocate: Localizing prompt injection attacks

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection attacks. arXiv preprint arXiv:2510.12252, 2025. 20

arXiv 2025
[27]

Llmlingua: Compressing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 13358–13376, 2023

2023
[28]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

2024
[29]

Exploiting programmatic behavior of llms: Dual-use through standard security attacks

Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. In2024 IEEE security and privacy workshops (SPW), pages 132–143. IEEE, 2024

2024
[30]

Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

Pith/arXiv arXiv 2025
[31]

ConversationSummaryMemory — langchain

LangChain. ConversationSummaryMemory — langchain. https://reference.langchain.com/py thon/langchain-classic/memory/summary/ConversationSummaryMemory. Accessed: 2026-06-12

2026
[32]

Drattack: Prompt decomposition and reconstruction makes powerful llms jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt decomposition and reconstruction makes powerful llms jailbreakers. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 13891–13913, 2024

2024
[33]

Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1

Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, et al. Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1. 0).arXiv preprint arXiv:2604.17091, 2026

Pith/arXiv arXiv 2026
[34]

Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

Pith/arXiv arXiv 2023
[35]

Compressionattack: Exploiting prompt compres- sion as a new attack surface in llm-powered agents.arXiv preprint arXiv:2510.22963, 2025

Zesen Liu, Zhixiang Zhang, Yuchong Xie, and Dongdong She. Compressionattack: Exploiting prompt compres- sion as a new attack surface in llm-powered agents.arXiv preprint arXiv:2510.22963, 2025

Pith/arXiv arXiv 2025
[36]

Cwe-367: Time-of-check time-of-use (toctou) race condition.CWE Version 1.8, page 443, 2010

Taxonomy Mappings. Cwe-367: Time-of-check time-of-use (toctou) race condition.CWE Version 1.8, page 443, 2010

2010
[37]

CWE-441: Unintended Proxy or Intermediary (‘Confused Deputy’)

MITRE Corporation. CWE-441: Unintended Proxy or Intermediary (‘Confused Deputy’). https://cwe.mi tre.org/data/definitions/441.html. Common Weakness Enumeration, Version 4.20. Accessed: 2026-06-12

2026
[38]

In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022

Pith/arXiv arXiv 2022
[39]

Compaction

OpenAI. Compaction. https://developers.openai.com/api/docs/guides/compaction,
[40]

Accessed: May 23, 2026

OpenAI API Docs. Accessed: May 23, 2026

2026
[41]

Compaction

OpenClaw. Compaction. https://docs.openclaw.ai/concepts/compaction , 2026. Open- Claw Docs. Accessed: May 24, 2026

2026
[42]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[43]

The proposition bank: An annotated corpus of semantic roles.Computational linguistics, 31(1):71–106, 2005

Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An annotated corpus of semantic roles.Computational linguistics, 31(1):71–106, 2005. 21

2005
[44]

Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor R¨uhle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. InFindings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024

2024
[45]

Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

F´abio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

Pith/arXiv arXiv 2022
[46]

Fine-tuned deberta-v3-base for prompt injection detection, 2024

ProtectAI.com. Fine-tuned deberta-v3-base for prompt injection detection, 2024

2024
[47]

Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition

Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-Fran c ¸ois Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Kost, Christopher Carnahan, and Jordan Lee Boyd-Graber. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition. InProceedings of the 2023 Conference on Empirical Metho...

2023
[48]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, et al. Towards understanding sycophancy in language models. InInternational Conference on Learning Representations, volume 2024, pages 110–144, 2024

2024
[49]

A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

arXiv 2023
[50]

Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

2020
[51]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[52]

Attntrace: Attention-based context traceback for long-context llms.arXiv preprint arXiv:2508.03793, 2025

Yanting Wang, Runpeng Geng, Ying Chen, and Jinyuan Jia. Attntrace: Attention-based context traceback for long-context llms.arXiv preprint arXiv:2508.03793, 2025

Pith/arXiv arXiv 2025
[53]

Llm agents making agent tools

Georg W¨olflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovic, and Jakob Nikolas Kather. Llm agents making agent tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26092–26130, 2025

2025
[54]

Benchmarking and defending against indirect prompt injection attacks on large language models

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1809–1820, 2025

2025
[55]

Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

Pith/arXiv arXiv 2024
[56]

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

2024
[57]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. InInternational Conference on Learning Representations, volume 2025, pages 35331–35366, 2025

2025
[58]

Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval

Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4375–4391, 2024. 22 A Benchmark Construction Goal and filtering....

2024

[1] [1]

Compaction

Anthropic. Compaction. https://platform.claude.com/docs/en/build-with-claude/ compaction, 2026. Claude API Docs. Accessed: May 23, 2026

2026

[2] [2]

A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021

[3] [3]

Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

Pith/arXiv arXiv 2022

[4] [4]

Emergent tool use from multi-agent autocurricula

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. InInternational conference on learning representations, 2019

2019

[5] [5]

The berkeley framenet project

Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. InCOLING 1998 Volume 1: The 17th International Conference on Computational Linguistics, 1998

1998

[6] [6]

Faithbench: A diverse hallucination benchmark for summarization by modern llms

Forrest Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, et al. Faithbench: A diverse hallucination benchmark for summarization by modern llms. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language T...

2025

[7] [7]

τ 2-bench: Evaluating conversa- tional agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-bench: Evaluating conversa- tional agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

Pith/arXiv arXiv 2025

[8] [8]

From single to multi: How llms hallucinate in multi-document summarization

Catarina G Belem, Pouya Pezeshkpour, Hayate Iso, Seiji Maekawa, Nikita Bhutani, and Estevam Hruschka. From single to multi: How llms hallucinate in multi-document summarization. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5276–5309, 2025. 19

2025

[9] [9]

Checking for race conditions in file accesses.Computing systems, 2(2):131–152, 1996

Matt Bishop, Michael Dilger, et al. Checking for race conditions in file accesses.Computing systems, 2(2):131–152, 1996

1996

[10] [10]

Composer 2 technical report.arXiv e-prints, pages arXiv–2603, 2026

Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, et al. Composer 2 technical report.arXiv e-prints, pages arXiv–2603, 2026

2026

[11] [11]

{StruQ}: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. {StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Security 25), pages 2383–2400, 2025

2025

[12] [12]

Meta secalign: A secure foundation llm against prompt injection attacks.arXiv preprint arXiv:2507.02735, 2025

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks.arXiv preprint arXiv:2507.02735, 2025

arXiv 2025

[13] [13]

Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

arXiv 2025

[14] [14]

Defeating prompt injections by design.arXiv preprint arXiv:2503.18813, 2025

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tram`er. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813, 2025

Pith/arXiv arXiv 2025

[15] [15]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tram `er. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

2024

[16] [16]

Thematic proto-roles and argument selection.language, 67(3):547–619, 1991

David Dowty. Thematic proto-roles and argument selection.language, 67(3):547–619, 1991

1991

[17] [17]

A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

2021

[18] [18]

From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

Pith/arXiv arXiv 2025

[19] [19]

The case for case reopened

Charles J Fillmore. The case for case reopened. InGrammatical relations, pages 59–81. Brill, 1977

1977

[20] [20]

Zorik Gekhman, Nadav Oved, Orgad Keller, Idan Szpektor, and Roi Reichart. On the robustness of dialogue history representation in conversational question answering: a comprehensive study and a new prompt-based method.Transactions of the Association for Computational Linguistics, 11:351–366, 2023

2023

[21] [21]

Automatic labeling of semantic roles.Computational linguistics, 28(3):245–288, 2002

Daniel Gildea and Dan Jurafsky. Automatic labeling of semantic roles.Computational linguistics, 28(3):245–288, 2002

2002

[22] [22]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023

[23] [23]

The confused deputy: (or why capabilities might have been invented).ACM SIGOPS Operating Systems Review, 22(4):36–38, 1988

Norm Hardy. The confused deputy: (or why capabilities might have been invented).ACM SIGOPS Operating Systems Review, 22(4):36–38, 1988

1988

[24] [24]

Attention tracker: Detecting prompt injection attacks in llms

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H Hsu, and Pin-Yu Chen. Attention tracker: Detecting prompt injection attacks in llms. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2309–2322, 2025

2025

[25] [25]

Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Pith/arXiv arXiv 2023

[26] [26]

Promptlocate: Localizing prompt injection attacks

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection attacks. arXiv preprint arXiv:2510.12252, 2025. 20

arXiv 2025

[27] [27]

Llmlingua: Compressing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 13358–13376, 2023

2023

[28] [28]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

2024

[29] [29]

Exploiting programmatic behavior of llms: Dual-use through standard security attacks

Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. In2024 IEEE security and privacy workshops (SPW), pages 132–143. IEEE, 2024

2024

[30] [30]

Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

Pith/arXiv arXiv 2025

[31] [31]

ConversationSummaryMemory — langchain

LangChain. ConversationSummaryMemory — langchain. https://reference.langchain.com/py thon/langchain-classic/memory/summary/ConversationSummaryMemory. Accessed: 2026-06-12

2026

[32] [32]

Drattack: Prompt decomposition and reconstruction makes powerful llms jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt decomposition and reconstruction makes powerful llms jailbreakers. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 13891–13913, 2024

2024

[33] [33]

Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1

Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, et al. Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1. 0).arXiv preprint arXiv:2604.17091, 2026

Pith/arXiv arXiv 2026

[34] [34]

Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications.arXiv preprint arXiv:2306.05499, 2023

Pith/arXiv arXiv 2023

[35] [35]

Compressionattack: Exploiting prompt compres- sion as a new attack surface in llm-powered agents.arXiv preprint arXiv:2510.22963, 2025

Zesen Liu, Zhixiang Zhang, Yuchong Xie, and Dongdong She. Compressionattack: Exploiting prompt compres- sion as a new attack surface in llm-powered agents.arXiv preprint arXiv:2510.22963, 2025

Pith/arXiv arXiv 2025

[36] [36]

Cwe-367: Time-of-check time-of-use (toctou) race condition.CWE Version 1.8, page 443, 2010

Taxonomy Mappings. Cwe-367: Time-of-check time-of-use (toctou) race condition.CWE Version 1.8, page 443, 2010

2010

[37] [37]

CWE-441: Unintended Proxy or Intermediary (‘Confused Deputy’)

MITRE Corporation. CWE-441: Unintended Proxy or Intermediary (‘Confused Deputy’). https://cwe.mi tre.org/data/definitions/441.html. Common Weakness Enumeration, Version 4.20. Accessed: 2026-06-12

2026

[38] [38]

In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022

Pith/arXiv arXiv 2022

[39] [39]

Compaction

OpenAI. Compaction. https://developers.openai.com/api/docs/guides/compaction,

[40] [40]

Accessed: May 23, 2026

OpenAI API Docs. Accessed: May 23, 2026

2026

[41] [41]

Compaction

OpenClaw. Compaction. https://docs.openclaw.ai/concepts/compaction , 2026. Open- Claw Docs. Accessed: May 24, 2026

2026

[42] [42]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[43] [43]

The proposition bank: An annotated corpus of semantic roles.Computational linguistics, 31(1):71–106, 2005

Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An annotated corpus of semantic roles.Computational linguistics, 31(1):71–106, 2005. 21

2005

[44] [44]

Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor R¨uhle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. InFindings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024

2024

[45] [45]

Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

F´abio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

Pith/arXiv arXiv 2022

[46] [46]

Fine-tuned deberta-v3-base for prompt injection detection, 2024

ProtectAI.com. Fine-tuned deberta-v3-base for prompt injection detection, 2024

2024

[47] [47]

Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition

Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-Fran c ¸ois Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Kost, Christopher Carnahan, and Jordan Lee Boyd-Graber. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition. InProceedings of the 2023 Conference on Empirical Metho...

2023

[48] [48]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, et al. Towards understanding sycophancy in language models. InInternational Conference on Learning Representations, volume 2024, pages 110–144, 2024

2024

[49] [49]

A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

arXiv 2023

[50] [50]

Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

2020

[51] [51]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[52] [52]

Attntrace: Attention-based context traceback for long-context llms.arXiv preprint arXiv:2508.03793, 2025

Yanting Wang, Runpeng Geng, Ying Chen, and Jinyuan Jia. Attntrace: Attention-based context traceback for long-context llms.arXiv preprint arXiv:2508.03793, 2025

Pith/arXiv arXiv 2025

[53] [53]

Llm agents making agent tools

Georg W¨olflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovic, and Jakob Nikolas Kather. Llm agents making agent tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26092–26130, 2025

2025

[54] [54]

Benchmarking and defending against indirect prompt injection attacks on large language models

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1809–1820, 2025

2025

[55] [55]

Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

Pith/arXiv arXiv 2024

[56] [56]

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

2024

[57] [57]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. InInternational Conference on Learning Representations, volume 2025, pages 35331–35366, 2025

2025

[58] [58]

Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval

Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Promptreps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4375–4391, 2024. 22 A Benchmark Construction Goal and filtering....

2024