pith. machine review for the scientific record.

arxiv: 2504.21228 · v3 · submitted 2025-04-29 · 💻 cs.CR · cs.AI


CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing

keywords: prompt, instructions, attribution, cacheprune, context, follow, llms, attack

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model inadvertently responds to instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context. The pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow. To identify these neurons, we introduce a neural attribution mechanism guided by a preferential attribution loss, and theoretically connect this loss to an upper bound of the Direct Preference Optimization (DPO) objective. Further, we improve the fidelity of neural attribution by leveraging an observed triggering effect in instruction-following. Our approach does not interfere with prompt formatting or incur test-time overhead during response generation. Experiments show that CachePrune significantly reduces the attack success rate while preserving the LLM's ability to follow user instructions.
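The attribute-then-prune idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the attribution loss, which neurons are scored, or the selection rule. The activation-times-gradient score, the top-fraction rule, and the names `attribution_scores` and `prune_mask` below are all assumptions chosen for illustration, and random arrays stand in for cached context features and their gradients.

```python
import numpy as np

def attribution_scores(features, grad):
    """Generic first-order attribution (an assumption, not the paper's loss):
    activation times gradient, summed over the token axis."""
    return (features * grad).sum(axis=0)

def prune_mask(scores, fraction):
    """Return a 0/1 mask that zeroes up to the top `fraction` of neurons,
    restricted to those with a positive attribution score."""
    k = int(round(fraction * scores.size))
    mask = np.ones_like(scores, dtype=float)
    if k > 0:
        top = np.argsort(scores)[-k:]   # indices of the k largest scores
        top = top[scores[top] > 0]      # keep only positive contributions
        mask[top] = 0.0
    return mask

# Toy stand-ins (hypothetical data): 5 context tokens, 8 neurons per token.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))      # stand-in for cached context features
grad = rng.normal(size=(5, 8))          # stand-in for d(loss)/d(features)

scores = attribution_scores(features, grad)
mask = prune_mask(scores, fraction=0.25)
pruned = features * mask                # masking is applied once, before generation
```

Because the mask is applied to the encoded context a single time, a scheme of this shape adds no per-token work during decoding, which is consistent with the abstract's claim of no test-time overhead during response generation.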



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA · 2026-04 · unverdicted · novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

  2. MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.