pith. machine review for the scientific record. sign in

arxiv: 2403.14720 · v1 · submitted 2024-03-20 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 3 theorem links

· Lean Theorem

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:24 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords indirect prompt injectionLLM securityprompt engineeringprovenance signaladversarial defenseinput transformationmodel robustness
0
0 comments X

The pith

Spotlighting uses input transformations to mark data origins, letting LLMs ignore embedded adversarial instructions and cutting indirect prompt injection success from over 50% to under 2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces spotlighting as a family of prompt engineering techniques that transform inputs to create a consistent signal of their source. Large language models normally process concatenated text without knowing which parts come from trusted user commands versus untrusted external data. Indirect prompt injection attacks exploit this by hiding malicious instructions inside the untrusted data, causing the model to follow those instructions instead. Spotlighting counters this by applying transformations that give the model a reliable way to track provenance and follow only the intended commands. Experiments show the method lowers attack success rates sharply while leaving standard task performance nearly unchanged.

Core claim

Spotlighting is a family of prompt engineering techniques that utilize transformations of an input to provide a reliable and continuous signal of its provenance, enabling LLMs to distinguish among multiple sources of input and thereby defend against indirect prompt injection attacks, reducing attack success rates from greater than 50% to below 2% with minimal impact on task efficacy.

What carries the argument

Spotlighting, a family of prompt engineering techniques that apply transformations to inputs in order to create a continuous provenance signal that LLMs can follow when processing combined text streams.

If this is right

  • LLMs can be made to ignore instructions embedded in untrusted data when those inputs carry a spotlighted provenance signal.
  • Standard NLP task performance stays largely intact under spotlighting transformations.
  • The defense works across the GPT models tested without requiring any model retraining or architectural changes.
  • Prompt-based provenance signals offer a practical layer of protection for applications that combine user commands with external data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attackers may develop variants that replicate or mimic the specific transformation patterns to evade the provenance signal.
  • The same transformation approach could help LLMs separate other mixed inputs, such as user queries from retrieved documents in retrieval-augmented systems.
  • Evaluating spotlighting on non-GPT model families would test whether the effect depends on particular training characteristics.

Load-bearing premise

The selected transformations will produce a provenance signal that LLMs interpret and obey consistently, without being bypassed by new attack variants.

What would settle it

An experiment in which a new indirect prompt injection attack achieves more than 2% success rate against the same spotlighted inputs and GPT-family models used in the paper.

read the original abstract

Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces spotlighting, a family of prompt-engineering techniques that apply input transformations (such as delimiters or highlighting) to create a continuous provenance signal distinguishing trusted user commands from untrusted data. The central claim, evaluated on GPT-family models, is that these techniques reduce the success rate of indirect prompt injection attacks from greater than 50% to below 2% while preserving task efficacy.

Significance. If the empirical results hold under broader testing, spotlighting would offer a lightweight, training-free defense against a practical attack vector in LLM applications that ingest untrusted content. The approach is notable for its simplicity and reported minimal overhead on downstream NLP tasks.

major comments (2)
  1. [Evaluation] Evaluation section: the headline ASR reduction (>50% to <2%) is demonstrated only against fixed, non-adaptive attack templates. No experiments test adaptive adversaries who know the spotlighting rules and can embed counter-instructions to ignore markers, mimic their syntax, or re-frame the input as user-controlled, leaving the robustness claim unverified.
  2. [Method and Experiments] Method and Experiments: the manuscript supplies no concrete attack constructions, model versions (e.g., GPT-3.5 vs. GPT-4), prompt templates, dataset sizes, or statistical tests, so the central quantitative claim cannot be reproduced or assessed for variance from the provided text.
minor comments (2)
  1. [Abstract] Abstract: the notation {50} and {2} appears to be a LaTeX artifact; replace with explicit percentages.
  2. [Method] Clarify the exact set of spotlighting transformations evaluated and whether they are applied uniformly or chosen per task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve evaluation robustness and reproducibility.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the headline ASR reduction (>50% to <2%) is demonstrated only against fixed, non-adaptive attack templates. No experiments test adaptive adversaries who know the spotlighting rules and can embed counter-instructions to ignore markers, mimic their syntax, or re-frame the input as user-controlled, leaving the robustness claim unverified.

    Authors: We agree that the current evaluation uses fixed, non-adaptive attack templates and does not include adaptive adversaries aware of spotlighting. This limits the strength of the robustness claim. In the revised manuscript we will add a dedicated subsection with new experiments testing adaptive strategies (e.g., instructions to ignore delimiters, mimic syntax, or re-frame provenance). We will report the resulting ASR values and discuss any remaining vulnerabilities. These additions will be included in the next version. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments: the manuscript supplies no concrete attack constructions, model versions (e.g., GPT-3.5 vs. GPT-4), prompt templates, dataset sizes, or statistical tests, so the central quantitative claim cannot be reproduced or assessed for variance from the provided text.

    Authors: We acknowledge that the submitted text did not present these details with sufficient explicitness. The full manuscript uses GPT-3.5-turbo and GPT-4, specific attack templates (provided in the appendix), datasets of several hundred examples per task, and reports results with standard error across runs. To ensure reproducibility we will expand the Method and Experiments sections with explicit listings of model versions, full prompt templates, exact dataset sizes and sources, and statistical details including variance measures. We will also add a link to evaluation code and prompts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical defense technique evaluated directly on attack success rates

full rationale

The paper introduces spotlighting as a prompt engineering family of input transformations and reports experimental results showing ASR reduction from >50% to <2% on GPT models. No mathematical derivations, equations, fitted parameters, or self-citations are used to derive the central claim; the result is obtained by direct testing of the proposed transformations against the evaluated attack strings. The evaluation is self-contained against the reported benchmarks with no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that LLMs can be guided by formatting signals in concatenated text and that the introduced transformations will not be ignored or adversarially stripped.

axioms (1)
  • domain assumption LLMs process concatenated inputs without distinguishing sections from different sources
    This is the stated vulnerability that spotlighting is designed to mitigate.
invented entities (1)
  • spotlighting techniques no independent evidence
    purpose: Provide reliable provenance signal through input transformations
    Newly proposed family of prompt engineering methods with no independent evidence supplied beyond the reported experiments.

pith-pipeline@v0.9.0 · 5514 in / 1146 out tokens · 52924 ms · 2026-05-14T22:24:16.816058+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

    cs.CR 2026-05 unverdicted novelty 8.0

    JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  4. Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows

    cs.CR 2026-05 unverdicted novelty 8.0

    Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving ove...

  5. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  6. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  7. No More, No Less: Task Alignment in Terminal Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

  8. IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

  9. Toward a Principled Framework for Agent Safety Measurement

    cs.CR 2026-05 unverdicted novelty 7.0

    BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

  10. AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.

  11. Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

    cs.CR 2026-04 unverdicted novelty 7.0

    Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.

  12. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  13. Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

    cs.CR 2026-03 conditional novelty 7.0

    Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

  14. AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...

  15. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

    cs.CR 2026-05 unverdicted novelty 6.0

    ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

  16. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  17. Evaluation of Prompt Injection Defenses in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.

  18. Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

    cs.CR 2026-04 unverdicted novelty 6.0

    Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.

  19. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  20. How Adversarial Environments Mislead Agentic AI?

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

  21. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  22. Evaluation of Prompt Injection Defenses in Large Language Models

    cs.CR 2026-04 unverdicted novelty 5.0

    Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.

  23. Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

    cs.CR 2026-03 unverdicted novelty 5.0

    A domain-specific multi-layer safeguard for educational LLM tutors achieves 0% false positives and 46.34% attack bypass at 2.5 ms latency on a 480-query holdout, outperforming NeMo Guardrails in usability but not full...

  24. Security Considerations for Artificial Intelligence Agents

    cs.LG 2026-03 unverdicted novelty 3.0

    Frontier AI agents introduce new confidentiality, integrity, and availability risks through changed assumptions on code-data separation and authority boundaries, requiring layered defenses like sandboxing and policy e...

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 22 Pith papers · 12 internal anchors

  1. [1]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin,et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

  2. [2]

    Benchmarking and defending against indirect prompt injection attacks on large language models.arXiv preprint arXiv:2312.14197, 2025

    J. Yi, Y . Xie, B. Zhu, K. Hines, E. Kiciman, G. Sun, X. Xie, F. Wu, “Benchmarking and Defending Against Indirect Prompt Injection At- tacks on Large Language Models”, arXiv preprint arXiv:2312.14197 , 2023

  3. [3]

    SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” arXiv preprint arXiv:1905.00537, 2020

  4. [4]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” arXiv preprint arXiv:1606.05250, 2016

  5. [5]

    Learning Word Vectors for Sentiment Analysis,

    A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y . Ng, C. Potts, “Learning Word Vectors for Sentiment Analysis,” inProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , Portland, Oregon, USA, June 2011, pp. 142–150

  6. [7]

    Q Series: Switching and Signalling No. 5,

    International Telecommunication Union, “Q Series: Switching and Signalling No. 5,” 1988. [Online]. Available: https://www.itu.int/rec/T- REC-Q.140-Q.180-198811-I/en. [Accessed: Feb. 2, 2024]

  7. [8]

    Q Series: Switching and Signalling No. 6,

    International Telecommunication Union, “Q Series: Switching and Signalling No. 6,” 1988. [Online]. Available: https://www.itu.int/rec/T- REC-Q.251-Q.300-198811-I/en. [Accessed: Feb. 2, 2024]

  8. [9]

    Language Models are Few-Shot Learners

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. , “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

  9. [10]

    GPT-4 Technical Report

    OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023

  10. [11]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  11. [12]

    Constitutional AI: Harmlessness from AI Feedback

    Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. , “Con- stitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

  12. [13]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

  13. [14]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, “More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models,” arXiv preprint arXiv:2302.12173 , 2023

  14. [15]

    InstructGPT: Neurally-Guided Procedural Generation of 3D Shapes from Natural Language Instructions,

    L. Ouyang, S. Toyer, C. Donahue, J. Rahim, Y . Bao, J. Wu, H. He, Z. Tung, A. Chaganty, P. Liang, C. D. Manning, J. Pennington, A. Rad- ford, D. Amodei, et al. , “InstructGPT: Neurally-Guided Procedural Generation of 3D Shapes from Natural Language Instructions,” arXiv preprint arXiv:2202.02796, 2022

  15. [16]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903 , 2023

  16. [17]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y . Cao, K. Narasimhan, et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,”arXiv preprint arXiv:2305.10601, 2023

  17. [18]

    How We Broke LLMs: Indirect Prompt Injection,

    K. Greshake, “How We Broke LLMs: Indirect Prompt Injection,” Kai Greshake, 2022. [Online]. Available: https://kai-greshake.de/posts/llm- malware/. [Accessed: Feb. 21, 2024]

  18. [19]

    Hacking Google Bard - From Prompt In- jection to Data Exfiltration,

    Wunderwuzzi, “Hacking Google Bard - From Prompt In- jection to Data Exfiltration,” Embrace The Red , 2023. [On- line]. Available: https://embracethered.com/blog/posts/2023/google- bard-data-exfiltration/. [Accessed: Feb. 21, 2024]

  19. [20]

    Core Views on AI Safety: When, Why, What, and How,

    Anthropic Team, “Core Views on AI Safety: When, Why, What, and How,” 2023. [Online]. Available: https://www.anthropic.com/news/core-views-on-ai-safety. [Accessed: Feb. 21, 2024]

  20. [21]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, M. Fredrikson, et al. , “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv preprint arXiv:2307.15043 , 2023

  21. [22]

    [Accessed: Feb

    Jailbreak Chat, Available: https://jailbreakchat.com/. [Accessed: Feb. 2, 2024]

  22. [23]

    notices” the attack text but does not “fall for

    Appendix 8.1. Measuring Attack Success Rate The simplicity of the keyword payload allows us to clearly de- termine whether (i) the original metaprompt instructions are over- ridden or (ii) the LLM is mostly unaffected by the attack. Take, for example, a document summarization use case. In the attack documents, the keyword ‘canary’ is the desired outcome o...