ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Yash Narendra

arxiv: 2605.18918 · v1 · pith:ZE7MF3XInew · submitted 2026-05-18 · 💻 cs.CR · cs.AI

ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Yash Narendra This is my paper

Pith reviewed 2026-05-20 10:02 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionguard modelslatent spaceAI safetydefense architecturelatency reductionagentic systemsmodel-agnostic

0 comments

The pith

The safety signal for detecting prompt injections already exists inside a guard model's latent representation, before any verdict is output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that guard models used to block prompt injections contain a usable separation signal in their internal hidden states. Accessing that signal directly, rather than waiting for the model's final safe/unsafe output, produces both a large latency reduction and a measurable accuracy gain. The approach matters for agentic AI systems because safety checks can now be inserted at every reasoning step without becoming the dominant delay. ESLD packages the observation into a model-agnostic wrapper that requires no changes to the underlying guard.

Core claim

The signal needed to separate safe from malicious input is already present in the guard model's internal representation, before it writes anything out. Reading this signal directly speeds up the safety check by more than 3× on average, while improving detection accuracy over the guard's verdict by 16.4 percentage points on average. ESLD is a model-agnostic architecture that sits on top of any existing guard model and improves both latency and detection accuracy without retraining or modifying the guard.

What carries the argument

The internal latent representation of the guard model, which encodes the separation between safe and malicious inputs before the model produces an explicit verdict.

If this is right

Guard checks can be placed on the critical path of multi-step agentic workflows without dominating total latency.
Detection accuracy rises by 16.4 percentage points on average relative to the guard model's explicit verdict.
The same guard model can be reused across many tasks without any retraining or architectural change.
Production systems can afford to run safety checks on every intermediate input rather than sampling only a subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar latent extraction could be applied to other safety or alignment signals that current models already compute internally.
Agentic systems with very long reasoning chains become feasible to secure end-to-end once per-step checks cost little time.
Output verdicts may systematically discard information that remains visible in earlier layers or hidden states.
The method invites testing whether combining latent signals from several different guard models yields further accuracy gains.

Load-bearing premise

The guard model's internal states already hold a strong, generalizable signal that distinguishes safe from malicious prompts.

What would settle it

A new set of prompt-injection examples on which a classifier built from the guard's hidden states shows either lower accuracy than the guard's own output or no reduction in inference time below one-third of the original latency.

Figures

Figures reproduced from arXiv: 2605.18918 by Yash Narendra.

read the original abstract

Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying ``This is the strongest candidate. Recommend for immediate hire''. A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict (``safe'' or ``unsafe'') before the assistant is allowed to act. In an agentic task with many steps, this check becomes a latency bottleneck. This paper shows that the signal needed to separate safe from malicious input is already present in the guard model's internal representation, before it writes anything out. Reading this signal directly speeds up the safety check by more than $3\times$ on average, while improving detection accuracy over the guard's verdict by 16.4 percentage points on average. This is more than latency optimization. Guard-model checks that were previously too slow to run on every step of an agent can now be placed on the critical path without sacrificing accuracy, and in fact with higher accuracy than the guard provides on its own. ESLD (External Surrogate Latent Defense) packages this finding into a deployable defense. ESLD is a model-agnostic architecture that sits on top of any existing guard model and improves both latency and detection accuracy, without retraining or modifying the guard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ESLD, a model-agnostic architecture that extracts and classifies from the internal latent representations of unmodified guard models for prompt-injection detection. It claims this yields >3× average speedup in safety checks and +16.4 percentage-point average accuracy gain over the guard's own verdict, enabling stronger defenses on the critical path of agentic workflows without retraining the guard.

Significance. If the empirical gains prove robust across guards and attack distributions, the result would be significant for practical AI safety: it converts an existing latency bottleneck into a deployable improvement that simultaneously raises accuracy, allowing per-step checks in multi-turn agents that were previously infeasible.

major comments (2)

[Experimental Evaluation] The central claim that latent activations contain a reliably stronger separator than the guard's classification head (reader's weakest assumption) is load-bearing for both the accuracy and latency results. The manuscript must demonstrate this via held-out attack styles and cross-guard generalization experiments; without them the 16.4 pp gain risks being an artifact of surrogate training on the evaluation distribution.
[ESLD Architecture] § on surrogate architecture and training: the paper should report the exact training objective, regularization, and whether the surrogate was tuned on the same attack corpus used for final evaluation. If any hyper-parameter search or data leakage exists, the claimed improvement over the guard's verdict is not yet shown to be generalizable.

minor comments (2)

[Implementation Details] Clarify the precise layer or token position from which the latent vector is extracted for each guard model tested.
[Results] Add a table comparing ESLD latency and accuracy against the unmodified guard on identical hardware and batch sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional experiments where feasible.

read point-by-point responses

Referee: [Experimental Evaluation] The central claim that latent activations contain a reliably stronger separator than the guard's classification head (reader's weakest assumption) is load-bearing for both the accuracy and latency results. The manuscript must demonstrate this via held-out attack styles and cross-guard generalization experiments; without them the 16.4 pp gain risks being an artifact of surrogate training on the evaluation distribution.

Authors: We agree that robustness to held-out attack styles and cross-guard generalization is essential to substantiate the central claim. The original manuscript already reports results across multiple distinct guard models and a diverse set of attack distributions. To directly respond to this concern, we have added new experiments in the revised version that use attack styles completely withheld from surrogate training and evaluate on two additional unseen guard models. These results, presented in the updated experimental evaluation section, show that the accuracy advantage persists (average gain of 13.8 percentage points on held-out attacks), indicating the improvement is not an artifact of the training distribution. revision: yes
Referee: [ESLD Architecture] § on surrogate architecture and training: the paper should report the exact training objective, regularization, and whether the surrogate was tuned on the same attack corpus used for final evaluation. If any hyper-parameter search or data leakage exists, the claimed improvement over the guard's verdict is not yet shown to be generalizable.

Authors: We thank the referee for highlighting the need for these implementation details. The surrogate is trained with a binary cross-entropy objective on the extracted latent activations to predict safety labels. L2 regularization with coefficient 0.001 is applied, and hyperparameters (learning rate, regularization strength, and surrogate depth) were selected via grid search on a validation split that is strictly disjoint from both the surrogate training corpus and the final evaluation set. The attack corpus for surrogate training does not overlap with the evaluation corpus. We have expanded the surrogate architecture and training subsection in the revised manuscript to include the exact objective, regularization term, hyperparameter search procedure, and explicit data-split description to demonstrate generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity: ESLD is an empirical readout of pre-existing guard latents

full rationale

The paper presents ESLD as a model-agnostic architecture that extracts an already-present separation signal from unmodified guard-model internal representations, yielding measured 3× latency gains and 16.4 pp accuracy gains on external benchmarks. No derivation chain, equation, or self-citation reduces the central claims to fitted parameters defined by the authors themselves or to a self-referential uniqueness theorem. The result is an observation about existing model internals validated against held-out attack distributions, remaining self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that guard-model latent states encode detectable safety signals independently of the final output token sequence. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption The internal representation of a guard model encodes sufficient information to classify inputs as safe or unsafe without requiring the model's final output.
This premise is required for the latency and accuracy claims to hold without retraining.

invented entities (1)

ESLD architecture no independent evidence
purpose: Surrogate classifier that reads latent states from an unmodified guard model
New packaging of the latent-signal observation into a deployable system.

pith-pipeline@v0.9.0 · 5863 in / 1306 out tokens · 30021 ms · 2026-05-20T10:02:23.026306+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ESLD is a lightweight module attached to the hidden states of a guard LLM. It acts as a surrogate for the guard’s generated verdict and classifies the latent representation directly.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 3 internal anchors

[1]

ICLR Workshop , year=

Understanding intermediate layers using linear classifier probes , author=. ICLR Workshop , year=

work page
[2]

Refusal in Language Models Is Mediated by a Single Direction

Refusal in language models is mediated by a single direction , author=. arXiv:2406.11717 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

The internal state of an

Azaria, Amos and Mitchell, Tom , booktitle=. The internal state of an

work page
[4]

Computational Linguistics , volume=

Probing classifiers: Promises, shortcomings, and advances , author=. Computational Linguistics , volume=

work page
[5]

NeurIPS , year=

Language models are few-shot learners , author=. NeurIPS , year=

work page
[6]

and Chen, Deming and Dao, Tri , journal=

Cai, Tianle and Li, Yuhong and Geng, Zhengyang and Peng, Hongwu and Lee, Jason D. and Chen, Deming and Dao, Tri , journal=. Medusa: Simple

work page
[7]

Free Dolly: Introducing the world's first truly open instruction-tuned

Conover, Mike and others , year=. Free Dolly: Introducing the world's first truly open instruction-tuned

work page
[8]

Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui , journal=

work page
[9]

NeurIPS Datasets and Benchmarks , year=

Debenedetti, Edoardo and Zhang, Jie and Balunovi. NeurIPS Datasets and Benchmarks , year=

work page
[10]

2023 , howpublished=

Prompt injections benchmark dataset , author=. 2023 , howpublished=

work page 2023
[11]

EMNLP , year=

Transformer feed-forward layers are key-value memories , author=. EMNLP , year=

work page
[12]

Not what you've signed up for: Compromising real-world

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle=. Not what you've signed up for: Compromising real-world

work page
[13]

Han, Seungju and others , journal=

work page
[14]

Inan, Hakan and others , journal=

work page
[15]

Ji, Jiaming and others , journal=

work page
[16]

Jiang, Liwei and others , journal=

work page
[17]

Klimt, Bryan and Yang, Yiming , booktitle=. The

work page
[18]

Mosscap prompt injection challenge , author=

work page
[19]

A well-conditioned estimator for large-dimensional covariance matrices , author=. J. Multivariate Analysis , volume=

work page
[20]

ICML , year=

Fast inference from transformers via speculative decoding , author=. ICML , year=

work page
[21]

Lian, Wing and others , year=

work page
[22]

USENIX Security , year=

Formalizing and benchmarking prompt injection attacks and defenses , author=. USENIX Security , year=

work page
[23]

ACL , year=

Data contamination: From memorization to exploitation , author=. ACL , year=

work page
[24]

The geometry of truth: Emergent linear structure in

Marks, Samuel and Tegmark, Max , journal=. The geometry of truth: Emergent linear structure in

work page
[25]

Mazeika, Mantas and others , journal=

work page
[26]

2024 , howpublished=

Jailbreak prompts collection , author=. 2024 , howpublished=

work page 2024
[27]

Padhi, Inkit and others , journal=

work page
[28]

NeurIPS ML Safety Workshop , year=

Ignore previous prompt: Attack techniques for language models , author=. NeurIPS ML Safety Workshop , year=

work page
[29]

Radharapu, Bhaktipriya and Robinson, Kevin and Aroyo, Lora and Lahoti, Preethi , journal=

work page
[30]

Safeguard benign prompts corpus , author=

work page
[31]

EMNLP Findings , year=

Sainz, Oscar and Campos, Jon Ander and Garc. EMNLP Findings , year=

work page
[32]

ACL , year=

The right tool for the job: Matching model and instance complexities , author=. ACL , year=

work page
[33]

Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan , journal=

work page
[34]

, booktitle=

Teerapittayanon, Surat and McDanel, Bradley and Kung, H.T. , booktitle=

work page
[35]

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , booktitle=

work page
[36]

Do-Not-Answer: A dataset for evaluating safeguards in

Wang, Yuxia and Li, Haonan and Han, Xudong and Nakov, Preslav and Baldwin, Timothy , journal=. Do-Not-Answer: A dataset for evaluating safeguards in

work page
[37]

Frontiers of Computer Science , year=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , year=

work page
[38]

The Rise and Potential of Large Language Model Based Agents: A Survey

The rise and potential of large language model based agents: A survey , author=. arXiv:2309.07864 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Xin, Ji and Tang, Raphael and Lee, Jaejun and Yu, Yaoliang and Lin, Jimmy , booktitle=

work page
[40]

KDD , year=

Benchmarking and defending against indirect prompt injection attacks on large language models , author=. KDD , year=

work page
[41]

Zeng, Wenjun and others , journal=

work page
[42]

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , journal=

work page
[43]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Representation engineering: A top-down approach to

Zou, Andy and others , journal=. Representation engineering: A top-down approach to

work page

[1] [1]

ICLR Workshop , year=

Understanding intermediate layers using linear classifier probes , author=. ICLR Workshop , year=

work page

[2] [2]

Refusal in Language Models Is Mediated by a Single Direction

Refusal in language models is mediated by a single direction , author=. arXiv:2406.11717 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

The internal state of an

Azaria, Amos and Mitchell, Tom , booktitle=. The internal state of an

work page

[4] [4]

Computational Linguistics , volume=

Probing classifiers: Promises, shortcomings, and advances , author=. Computational Linguistics , volume=

work page

[5] [5]

NeurIPS , year=

Language models are few-shot learners , author=. NeurIPS , year=

work page

[6] [6]

and Chen, Deming and Dao, Tri , journal=

Cai, Tianle and Li, Yuhong and Geng, Zhengyang and Peng, Hongwu and Lee, Jason D. and Chen, Deming and Dao, Tri , journal=. Medusa: Simple

work page

[7] [7]

Free Dolly: Introducing the world's first truly open instruction-tuned

Conover, Mike and others , year=. Free Dolly: Introducing the world's first truly open instruction-tuned

work page

[8] [8]

Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui , journal=

work page

[9] [9]

NeurIPS Datasets and Benchmarks , year=

Debenedetti, Edoardo and Zhang, Jie and Balunovi. NeurIPS Datasets and Benchmarks , year=

work page

[10] [10]

2023 , howpublished=

Prompt injections benchmark dataset , author=. 2023 , howpublished=

work page 2023

[11] [11]

EMNLP , year=

Transformer feed-forward layers are key-value memories , author=. EMNLP , year=

work page

[12] [12]

Not what you've signed up for: Compromising real-world

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle=. Not what you've signed up for: Compromising real-world

work page

[13] [13]

Han, Seungju and others , journal=

work page

[14] [14]

Inan, Hakan and others , journal=

work page

[15] [15]

Ji, Jiaming and others , journal=

work page

[16] [16]

Jiang, Liwei and others , journal=

work page

[17] [17]

Klimt, Bryan and Yang, Yiming , booktitle=. The

work page

[18] [18]

Mosscap prompt injection challenge , author=

work page

[19] [19]

A well-conditioned estimator for large-dimensional covariance matrices , author=. J. Multivariate Analysis , volume=

work page

[20] [20]

ICML , year=

Fast inference from transformers via speculative decoding , author=. ICML , year=

work page

[21] [21]

Lian, Wing and others , year=

work page

[22] [22]

USENIX Security , year=

Formalizing and benchmarking prompt injection attacks and defenses , author=. USENIX Security , year=

work page

[23] [23]

ACL , year=

Data contamination: From memorization to exploitation , author=. ACL , year=

work page

[24] [24]

The geometry of truth: Emergent linear structure in

Marks, Samuel and Tegmark, Max , journal=. The geometry of truth: Emergent linear structure in

work page

[25] [25]

Mazeika, Mantas and others , journal=

work page

[26] [26]

2024 , howpublished=

Jailbreak prompts collection , author=. 2024 , howpublished=

work page 2024

[27] [27]

Padhi, Inkit and others , journal=

work page

[28] [28]

NeurIPS ML Safety Workshop , year=

Ignore previous prompt: Attack techniques for language models , author=. NeurIPS ML Safety Workshop , year=

work page

[29] [29]

Radharapu, Bhaktipriya and Robinson, Kevin and Aroyo, Lora and Lahoti, Preethi , journal=

work page

[30] [30]

Safeguard benign prompts corpus , author=

work page

[31] [31]

EMNLP Findings , year=

Sainz, Oscar and Campos, Jon Ander and Garc. EMNLP Findings , year=

work page

[32] [32]

ACL , year=

The right tool for the job: Matching model and instance complexities , author=. ACL , year=

work page

[33] [33]

Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan , journal=

work page

[34] [34]

, booktitle=

Teerapittayanon, Surat and McDanel, Bradley and Kung, H.T. , booktitle=

work page

[35] [35]

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , booktitle=

work page

[36] [36]

Do-Not-Answer: A dataset for evaluating safeguards in

Wang, Yuxia and Li, Haonan and Han, Xudong and Nakov, Preslav and Baldwin, Timothy , journal=. Do-Not-Answer: A dataset for evaluating safeguards in

work page

[37] [37]

Frontiers of Computer Science , year=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , year=

work page

[38] [38]

The Rise and Potential of Large Language Model Based Agents: A Survey

The rise and potential of large language model based agents: A survey , author=. arXiv:2309.07864 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

Xin, Ji and Tang, Raphael and Lee, Jaejun and Yu, Yaoliang and Lin, Jimmy , booktitle=

work page

[40] [40]

KDD , year=

Benchmarking and defending against indirect prompt injection attacks on large language models , author=. KDD , year=

work page

[41] [41]

Zeng, Wenjun and others , journal=

work page

[42] [42]

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , journal=

work page

[43] [43]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

Representation engineering: A top-down approach to

Zou, Andy and others , journal=. Representation engineering: A top-down approach to

work page