pith. machine review for the scientific record.

arxiv: 2604.06247 · v1 · submitted 2026-04-06 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

· Lean Theorem

SALLIE: Safeguarding Against Latent Language & Image Exploits

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak detection · prompt injection · vision-language models · residual stream activations · k-NN classifier · layer ensemble · mechanistic interpretability · multimodal defense

The pith

SALLIE detects jailbreaks and prompt injections in language and vision models by reading signals from internal residual stream activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SALLIE as a lightweight defense that works on both textual and visual threats in LLMs and VLMs. It extracts activations from the model's residual stream, runs a k-NN classifier on each layer to score maliciousness, and combines those scores through a layer ensemble. This approach runs at inference time, requires no model changes, and avoids the performance drops that come from input rewriting or separate text and image defenses. The authors show it beats prior methods across many datasets and on several compact open-source models. A sympathetic reader would care because it offers one unified, modal-agnostic guard that keeps the original model fast and intact.
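
A minimal sketch of the activation-extraction stage as described here, not the authors' released code: it pulls the last-token hidden state at every layer of a Hugging Face causal LM. The checkpoint name is a stand-in (the paper's models are vision-language checkpoints, which would also take image inputs), and reading `output_hidden_states` is an assumed, standard way to reach the residual stream.

```python
# Minimal sketch of residual-stream extraction (assumed setup, not the paper's code).
# The checkpoint is a text-only stand-in; the paper evaluates vision-language models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3.5-mini-instruct"  # hypothetical stand-in checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def residual_stream_activations(prompt: str) -> dict:
    """Return the last-token hidden state at every layer for one prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; entries 1..L follow each decoder layer.
    return {i: h[0, -1, :] for i, h in enumerate(out.hidden_states)}

acts = residual_stream_activations("example prompt to score")
```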

Core claim

SALLIE is a runtime detection framework that integrates into standard token-level fusion pipelines and defends against textual and visual jailbreaks plus prompt injections by extracting internal residual stream activations, calculating layer-wise maliciousness scores with a k-NN classifier, and aggregating the predictions through a layer ensemble module.

What carries the argument

The three-stage architecture that pulls residual stream activations, applies per-layer k-NN scoring for maliciousness, and aggregates results with a layer ensemble.
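
A minimal sketch of the per-layer scoring and ensembling stages under stated assumptions: one k-NN fit per layer on labeled activations, with the ensemble taken as a plain average of per-layer malicious probabilities. The value of k, the cosine metric, and uniform averaging are illustrative choices; the paper does not pin them down in the abstract.

```python
# Sketch of per-layer k-NN scoring plus a simple layer ensemble (assumed form).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_layer_knns(train_acts: np.ndarray, labels: np.ndarray, k: int = 5):
    """train_acts: (n_samples, n_layers, hidden_dim); labels: 0 = benign, 1 = malicious."""
    knns = []
    for layer in range(train_acts.shape[1]):
        clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        clf.fit(train_acts[:, layer, :], labels)
        knns.append(clf)
    return knns

def ensemble_score(knns, acts: np.ndarray) -> float:
    """acts: (n_layers, hidden_dim). Average the per-layer malicious probabilities."""
    per_layer = [clf.predict_proba(acts[layer][None, :])[0, 1]
                 for layer, clf in enumerate(knns)]
    return float(np.mean(per_layer))
```

A single threshold on the returned score would then give the accept/flag decision; the paper's actual aggregation module may be more elaborate.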

Load-bearing premise

Internal residual stream activations contain robust, detectable signals of maliciousness that a k-NN classifier can identify reliably, without false positives frequent enough to hurt overall performance, and the same signals transfer across different models without any architectural changes.

What would settle it

Running SALLIE on a new model or a held-out jailbreak dataset and finding either many missed attacks or a false-positive rate high enough to measurably degrade the underlying model's usefulness on benign inputs.
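
A sketch of how that test could be scored, assuming the detector emits real-valued maliciousness scores; the 1% false-positive budget is an illustrative operating point, not one taken from the paper.

```python
# Score a held-out set: missed-attack rate (FNR) at a fixed false-positive budget.
import numpy as np

def fnr_at_fpr(scores: np.ndarray, labels: np.ndarray, fpr_budget: float = 0.01) -> float:
    """scores: higher = more malicious; labels: 0 = benign, 1 = attack."""
    benign = np.sort(scores[labels == 0])
    # Threshold that lets through all but the top fpr_budget fraction of benign scores.
    thr = benign[int(np.ceil((1 - fpr_budget) * len(benign))) - 1]
    missed = np.sum((scores <= thr) & (labels == 1))
    return missed / max(np.sum(labels == 1), 1)
```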

Figures

Figures reproduced from arXiv: 2604.06247 by Guy Azov, Guy Shtar, Ofer Rivlin.

Figure 1. Example flow of the SALLIE framework.
Figure 2. PCA projection of hidden-state activations at an intermediate layer, colored by …
Figure 3. PCA projection of hidden-state activations at an intermediate layer, per modality.
Figure 4. Hyperparameter sensitivity on the validation set: FNR @ FPR …
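
To reproduce the kind of view in Figures 2 and 3, a minimal sketch that projects one layer's last-token activations to 2D with PCA and colors them by label; the layer index and the benign/malicious labeling are assumptions.

```python
# PCA view of one intermediate layer's activations, colored by label (illustrative).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_layer_pca(acts: np.ndarray, labels: np.ndarray, layer: int = 16):
    """acts: (n_samples, n_layers, hidden_dim); labels: 0 = benign, 1 = malicious."""
    proj = PCA(n_components=2).fit_transform(acts[:, layer, :])
    for lab, name in [(0, "benign"), (1, "malicious")]:
        pts = proj[labels == lab]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.title(f"PCA of layer {layer} activations")
    plt.show()
```
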
original abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043, Greshake et al., 2023, arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations or treat multimodal threats as isolated problems (arXiv:2309.00614, arXiv:2310.03684, Zhang et al., 2025). To address the critical gap for a unified, modal-agnostic defense that mitigates both textual and visual threats simultaneously without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025, Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), SALLIE extracts robust signals directly from the model's internal activations. At inference, SALLIE defends via a three-stage architecture: (1) extracting internal residual stream activations, (2) calculating layer-wise maliciousness scores using a K-Nearest Neighbors (k-NN) classifier, and (3) aggregating these predictions via a layer ensemble module. We evaluate SALLIE on compact, open-source architectures - Phi-3.5-vision-instruct (arXiv:2404.14219), SmolVLM2-2.2B-Instruct (arXiv:2504.05299), and gemma-3-4b-it (arXiv:2503.19786) - prioritized for practical inference times and real-world deployment costs. Our comprehensive evaluation pipeline spans over ten datasets and more than five strong baseline methods from the literature, and SALLIE consistently outperforms these baselines across a wide range of experimental settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SALLIE, a lightweight runtime detection framework for LLMs and VLMs that extracts residual stream activations and applies a three-stage process: (1) activation extraction, (2) layer-wise maliciousness scoring via k-NN classifiers, and (3) aggregation through a layer ensemble module. It claims to deliver a unified, modal-agnostic defense against textual jailbreaks, visual exploits, and prompt injections on models including Phi-3.5-vision-instruct, SmolVLM2-2.2B-Instruct, and gemma-3-4b-it, outperforming more than five baselines across more than ten datasets without performance degradation or architectural modifications.

Significance. If the outperformance claims hold with full evidence, SALLIE would offer a practical contribution to AI safety by providing an interpretable, low-overhead inference-time defense that unifies handling of multimodal threats. The emphasis on internal activations and evaluation on compact open-source models supports deployability, and the avoidance of complex input transformations addresses limitations in prior work.

major comments (3)
  1. [Abstract] The headline claim that SALLIE 'consistently outperforms these baselines across a wide range of experimental settings' is stated without any quantitative support: no accuracy or F1 scores, false-positive rates, error bars, dataset breakdowns, or statistical tests. This absence is load-bearing for the central empirical claim and prevents verification of the evaluation pipeline.
  2. [Method: three-stage architecture] The k-NN component for layer-wise maliciousness scores provides no details on training data construction (how activations are labeled malicious or benign), choice of k, distance metric, or preprocessing steps such as dimensionality reduction. These omissions directly affect reproducibility and the assessment of whether the extracted signals are robust or merely overfit to the training attacks.
  3. [Evaluation pipeline] No analysis addresses generalization under distribution shift or to novel multimodal jailbreaks, despite k-NN's known sensitivity to such shifts. This is load-bearing for the 'unified, modal-agnostic defense' claim, as the paper does not test whether the decision boundaries capture intrinsic mechanistic features rather than attack-specific patterns.
minor comments (2)
  1. [References] The arXiv citations would benefit from consistent formatting and inclusion of titles or DOIs for easier lookup.
  2. A diagram or pseudocode for the three-stage architecture and layer ensemble would improve clarity of the inference flow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of SALLIE as a practical contribution to AI safety. We address each major comment below and will revise the manuscript accordingly to improve clarity, reproducibility, and completeness of the evaluation.

point-by-point responses
  1. Referee: [Abstract] The headline claim that SALLIE 'consistently outperforms these baselines across a wide range of experimental settings' is stated without any quantitative support: no accuracy or F1 scores, false-positive rates, error bars, dataset breakdowns, or statistical tests. This absence is load-bearing for the central empirical claim and prevents verification of the evaluation pipeline.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the performance claims. In the revised version, we will incorporate key metrics including average accuracy and F1 improvements over baselines, false-positive rates, and reference to the range of datasets and models evaluated. revision: yes

  2. Referee: [Method: three-stage architecture] The k-NN component for layer-wise maliciousness scores provides no details on training data construction (how activations are labeled malicious or benign), choice of k, distance metric, or preprocessing steps such as dimensionality reduction. These omissions directly affect reproducibility and the assessment of whether the extracted signals are robust or merely overfit to the training attacks.

    Authors: We acknowledge the need for greater specificity on the k-NN implementation to ensure reproducibility. The revised method section will explicitly describe the construction of labeled training activations from our jailbreak and benign prompt datasets, the selected value of k, the distance metric employed, and any normalization or dimensionality reduction steps applied to the residual stream activations. revision: yes

  3. Referee: [Evaluation pipeline] No analysis addresses generalization under distribution shift or to novel multimodal jailbreaks, despite k-NN's known sensitivity to such shifts. This is load-bearing for the 'unified, modal-agnostic defense' claim, as the paper does not test whether the decision boundaries capture intrinsic mechanistic features rather than attack-specific patterns.

    Authors: We recognize that explicit evaluation of generalization under distribution shift is important for substantiating the modal-agnostic claims. Our existing results span multiple models and diverse datasets, but we will add a dedicated discussion of generalization limitations in the revision and include analysis on held-out attack variants where feasible to better demonstrate that the approach captures broader mechanistic signals. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical k-NN detection on activations is self-contained

full rationale

The paper presents SALLIE as a three-stage runtime detector that extracts residual-stream activations and applies a standard k-NN classifier plus layer ensemble. No equations or derivations are offered that reduce by construction to fitted parameters, self-definitions, or self-citations. The method is described as a lightweight, plug-in framework evaluated empirically on multiple models and datasets; any training details (labeling, k, metric) are external to the claimed result rather than tautological inputs. This is the normal case of an applied ML defense paper whose central claim rests on experimental performance rather than a closed logical loop.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; framework assumes activations encode reliable maliciousness signals but introduces no new entities or fitted constants beyond standard classifier choices.

free parameters (2)
  • k in k-NN
    Number of neighbors for classification; choice affects detection but not specified in abstract.
  • layer ensemble weights
    How predictions from different layers are aggregated; likely tuned but unspecified (a minimal tuning sketch follows this ledger).
axioms (1)
  • domain assumption: Internal residual stream activations contain robust signals for detecting malicious inputs
    Central to the extraction step and k-NN scoring described in the abstract.
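
A minimal sketch of how the two free parameters in this ledger could be fixed on a validation split, assuming the ensemble is a learned weighting of per-layer k-NN scores; the k grid, the cosine metric, and the logistic-regression weighting are illustrative choices, not taken from the paper.

```python
# Illustrative joint tuning of k and layer weights on a validation split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

def tune(train_acts, y_train, val_acts, y_val, k_grid=(3, 5, 10)):
    """acts arrays: (n_samples, n_layers, hidden_dim); labels: 0 = benign, 1 = malicious."""
    best = None
    n_layers = train_acts.shape[1]
    for k in k_grid:
        # Per-layer malicious probabilities on the validation split.
        val_scores = np.zeros((len(y_val), n_layers))
        for layer in range(n_layers):
            clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
            clf.fit(train_acts[:, layer, :], y_train)
            val_scores[:, layer] = clf.predict_proba(val_acts[:, layer, :])[:, 1]
        # Learn layer weights as a logistic regression over per-layer scores.
        # Fitting and scoring on the same split is for illustration only.
        ens = LogisticRegression().fit(val_scores, y_val)
        auc = roc_auc_score(y_val, ens.predict_proba(val_scores)[:, 1])
        if best is None or auc > best[0]:
            best = (auc, k, ens.coef_[0])
    return best  # (validation AUC, chosen k, learned layer weights)
```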

pith-pipeline@v0.9.0 · 5659 in / 1363 out tokens · 57203 ms · 2026-05-10T18:59:55.068741+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 31 canonical work pages · 11 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219,

  2. [2]

    Detecting Language Model Attacks with Perplexity

    Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132,

  3. [3]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151,

  4. [4]

    Refusal in Language Models Is Mediated by a Single Direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717,

  5. [5]

    VPI-Bench: Visual prompt injection attacks for computer-use agents,

    Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, and Bryan Hooi. Vpi-bench: Visual prompt injection attacks for computer-use agents. arXiv preprint arXiv:2506.02456,

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  7. [7]

    Astra: Agentic Steerability and Risk Assessment Framework

    Itay Hazan, Yael Mathov, Guy Shtar, Ron Bitton, and Itsik Mantin. Astra: Agentic steerability and risk assessment framework. arXiv preprint arXiv:2511.18114.

  8. [8]

    JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

    Z. He, Z. Wang, Z. Chu, H. Xu, R. Zheng, K. Ren, and C. Chen. JailbreakLens: Interpreting jailbreak mechanism in the lens of representation and circuit. arXiv preprint arXiv:2411.11114.

  9. [9]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.

  10. [10]

    VisualWebInstruct: Scaling Up Multimodal Instruction Data Through Web Search

    Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, and Wenhu Chen. VisualWebInstruct: Scaling up multimodal instruction data through web search. arXiv preprint arXiv:2503.10582.

  11. [11]

    HiddenDetect: Detecting Jailbreak Attacks Against Large Vision-Language Models via Monitoring Hidden States

    Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, and Xiangyu Yue. HiddenDetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states. arXiv preprint arXiv:2502.14744.

  12. [12]

    Safety Layers in Aligned Large Language Models: The Key to LLM Security

    Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to LLM security. arXiv preprint arXiv:2408.17003, 2024a. Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.

  13. [13]

    Images Are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. arXiv preprint arXiv:2403.09792, 2024b. Y. Lin, P. He, H. Xu, Y. Xing, M. Yamada, H. Liu, and J. Tang. Towards understanding jailbreak attacks in LLMs: A representation space ...

  14. [14]

    VLM-Guard: Safeguarding vision-language models via fulfilling safety alignment gap. arXiv preprint arXiv:2502.10486.

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023a. Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. MMC: Advancing multimod...

  15. [15]

    CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Lan- guage Models

    Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. Codechameleon: Personalized encryption framework for jailbreaking large language models.arXiv preprint arXiv:2402.16717,

  16. [16]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299.

  17. [17]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.

  18. [18]

    LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

    Mansi Phute, Alec Helbling, Matthew Hull, Sheng-Yun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. LLM self defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308.

  19. [19]

    Visual Adversarial Examples Jailbreak Aligned Large Language Models

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. arXiv preprint arXiv:2306.13213,

  20. [20]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks.arXiv preprint arXiv:2310.03684,

  21. [21]

    XSTest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: ...


  23. [23]

    "Do Anything Now": Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models

    X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security.

  24. [25]

    VisualPuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025.

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks.

  25. [26]

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor Trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011.

  26. [27]

    OmniGuard: An Efficient Approach for AI Safety Moderation Across Modalities

    S. Verma, K. Hines, J. Bilmes, C. Siska, L. Zettlemoyer, H. Gonen, and C. Singh. OmniGuard: An efficient approach for AI safety moderation across modalities. arXiv preprint arXiv:2505.23856.

  27. [28]

    Align is Not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

    Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, and Richang Hong. Align is not enough: Multimodal universal jailbreak attack against multimodal large language models. arXiv preprint arXiv:2506.01307.

  28. [29]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  29. [30]

    A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549,

  30. [31]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  31. [32]

    EEG-Defender: Defending Against Jailbreak Through Early Exit Generation of Large Language Models

    C. Zhao, Z. Dou, and K. Huang. EEG-Defender: Defending against jailbreak through early exit generation of large language models. arXiv preprint arXiv:2408.11308.

  32. [33]

    How alignment and jailbreak work: Explain llm safety through intermediate hidden states

    Zhenglong Zhou, Haiyang Yu, Xinghua Zhang, Rong Xu, Furong Huang, and Yixuan Li. How alignment and jailbreak work: Explain llm safety through intermediate hidden states. InFindings of the Association for Computational Linguistics: EMNLP 2024,

  33. [34]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043,

  34. [35]

    Pishield: Detecting prompt injection attacks via intrinsic llm features.arXiv preprint arXiv:2510.14005,

    W Zou, Y Liu, Y Wang, Y Chen, NZ Gong, and J Jia. Pishield: Detecting prompt injection attacks via intrinsic llm features.arXiv preprint arXiv:2510.14005,
