arXiv: 2607.02072 · v1 · pith:OWKSAWTKnew · submitted 2026-07-02 · 💻 cs.LG · cs.AI · cs.CR

kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail

Pith reviewed 2026-07-03 17:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR
keywords kNNGuard · LLM guardrails · hidden activations · training-free classification · k-nearest neighbors · prompt safety · domain adaptation · inference speed

The pith

Hidden activations from any off-the-shelf LLM support a training-free guardrail that matches fine-tuned performance at much higher speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents kNNGuard as a method that pulls hidden activations from a fixed LLM and runs multi-layer k-nearest-neighbor classification against a bank of only 50 labeled prompts. It fuses activation-space and embedding-space scores to label new prompts as safe or unsafe. This setup requires no model training or gradient updates yet reaches competitive or higher F1 scores than fine-tuned guardrails across six topical and security domains. The approach runs 2.7 times faster than the strongest comparable system and 10 times faster than a fine-tuned classifier. Domain changes need only a new prompt bank, which can be built in seconds.

Core claim

kNNGuard extracts hidden activations from an off-the-shelf LLM for a bank of 50 safe and unsafe prompts, then applies multi-layer kNN that combines activation-space and embedding-space scores to classify incoming prompts. Across six domains the method produces competitive or superior F1 scores relative to fine-tuned state-of-the-art guardrails, runs 2.7 times faster than the best comparable guardrail and 10 times faster than a fine-tuned safety classifier, and supports domain adaptation through bank replacement in under 10 seconds.

What carries the argument

Multi-layer kNN that fuses hidden-activation and embedding scores from a fixed bank of 50 prompts.
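
The mechanism named above can be sketched in a few lines. This is a minimal illustration on synthetic vectors, not the authors' implementation: the function names, the cosine-similarity vote, the averaging over layers, and the fixed fusion weight `alpha` are all assumptions. The values k = 13, τ = 0.5, and 50 bank items per class are taken from the figure captions further down this page.

```python
import numpy as np

def knn_unsafe_score(query, bank, labels, k=13):
    """Fraction of the k most cosine-similar bank items labeled unsafe (1)."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    nearest = np.argsort(-(b @ q))[:k]          # indices of the k nearest items
    return float(labels[nearest].mean())

def classify(query_layers, query_emb, bank_layers, bank_emb, labels,
             k=13, alpha=0.5, tau=0.5):
    """Hypothetical fusion: average per-layer kNN scores, then mix with the
    embedding-space score using a fixed weight alpha (an assumption)."""
    layer_scores = [knn_unsafe_score(q, b, labels, k)
                    for q, b in zip(query_layers, bank_layers)]
    act_score = float(np.mean(layer_scores))
    emb_score = knn_unsafe_score(query_emb, bank_emb, labels, k)
    fused = alpha * act_score + (1 - alpha) * emb_score
    return ("unsafe" if fused >= tau else "safe"), fused

# Synthetic stand-ins for real activations and embeddings (assumption).
rng = np.random.default_rng(0)
n_per_class, n_layers, d_act, d_emb = 50, 3, 32, 16
labels = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = safe, 1 = unsafe

def fake_features(dim):
    safe = rng.normal(+2.0, 1.0, size=(n_per_class, dim))
    unsafe = rng.normal(-2.0, 1.0, size=(n_per_class, dim))
    return np.vstack([safe, unsafe])

bank_layers = [fake_features(d_act) for _ in range(n_layers)]
bank_emb = fake_features(d_emb)

# A query drawn from the "unsafe" cluster in every feature space.
query_layers = [rng.normal(-2.0, 1.0, size=d_act) for _ in range(n_layers)]
query_emb = rng.normal(-2.0, 1.0, size=d_emb)

label, fused = classify(query_layers, query_emb, bank_layers, bank_emb, labels)
print(label, round(fused, 3))
```

In this sketch, adapting to a new domain amounts to passing different `bank_layers`, `bank_emb`, and `labels` arrays; nothing is trained or updated.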

If this is right

  • Any production LLM can receive a guardrail without retraining or gradient steps.
  • Switching safety rules for a new domain requires only replacing the 50-prompt bank.
  • Inference latency stays low enough for real-time pipeline integration.
  • The same mechanism applies to both topical relevance and security threats.
  • No fine-tuning step means the original model weights and capabilities remain unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Non-technical users could create custom guardrails by supplying their own small prompt sets.
  • The method might scale to very large prompt banks or additional layers without retraining overhead.
  • Layer selection could be automated per domain to further improve accuracy.
  • Similar activation-based kNN could be tested for other classification tasks inside LLMs.

Load-bearing premise

The hidden activations of an unmodified LLM already contain enough class-separable structure that kNN on a bank of 50 prompts can separate safe from unsafe inputs across domains.
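
One way to probe this premise before trusting a bank is to measure how well its two classes separate in feature space. The sketch below computes a mean silhouette score with NumPy on synthetic vectors; the function name and the toy data are assumptions, and real use would substitute the activations extracted for the bank prompts. It assumes every class has at least two members.

```python
import numpy as np

def silhouette(features, labels):
    """Mean silhouette score using Euclidean distance.
    Assumes each class contains at least two points."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = (y == y[i])
        same[i] = False                      # exclude the point itself
        a = dists[i, same].mean()            # mean distance within own class
        b = min(dists[i, y == c].mean()      # mean distance to nearest other class
                for c in np.unique(y) if c != y[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Well-separated toy bank versus a heavily overlapping one.
separated = np.vstack([rng.normal(+3.0, 1.0, size=(50, 8)),
                       rng.normal(-3.0, 1.0, size=(50, 8))])
overlapping = np.vstack([rng.normal(+0.1, 1.0, size=(50, 8)),
                         rng.normal(-0.1, 1.0, size=(50, 8))])

score_separated = silhouette(separated, labels)
score_overlapping = silhouette(overlapping, labels)
print(round(score_separated, 3), round(score_overlapping, 3))
```

A score near 1 indicates tight, well-separated classes; a score near 0 indicates that nearest-neighbor votes would be close to arbitrary.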

What would settle it

A controlled test on a held-out domain or with a different base LLM in which kNNGuard F1 drops below the fine-tuned baselines while latency stays low would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2607.02072 by Hamid Nasiri, Mahmoud Abdelfattah, Peter Garraghan.

Figure 1: Architecture of kNNGuard. During the bank-building phase, labelled prompts are processed by the frozen …
Figure 2: Adaptive fusion in kNNGuard. Activation-space and embedding-space scores are compared via a …
Figure 3: F1, recall, error rate and latency scores across …
Figure 4: Total error decomposition across topical domains. Each bar represents the combined FPR and FNR per …
Figure 5: Total error decomposition across security and safety domains. Each bar represents the combined FPR and …
Figure 6: F1, Recall, FPR, and FNR scores compared …
Figure 7: t-SNE projection of prompts in Medical and Coding domains comparing Embedding-kNN and kNNGuard FE. […] malicious code-execution request may share significant lexical overlap, causing the embedding encoder to map them to adjacent regions in the vector space. This ambiguity is reflected quantitatively by low Silhouette scores of 0.527 for the Coding domain and 0.342 for the Medical domain. This lack of strict …
Figure 8: Time comparison between LoRA fine-tuning …
Figure 9: System prompts injected per domain for Nemotron Topic Guard and kNNGuard classification. All prompts share a unified output protocol to fit requirements of Llama Nemotron Topic Guard V1. [Plot from Appendix A.2, "kNNGuard: Tests on various models": panels for F1 score, Recall, and FPR; per-model values are not recoverable from the extracted text.]
Figure 10: Average performance of kNNGuard (fused ensemble, k = 13, n = 50 per class) across all six evaluation domains. F1 score, Recall, false positive rate (FPR), and false negative rate (FNR) are averaged over Coding Instructions, Coding Outputs, Medical, Safety, Jailbreak, and Prompt Injection. Llama-3.1-8B-Instruct denotes the original kNNGuard backbone; all other models are alternative backbones evaluated un…
Figure 11: Radar plot of per-domain F1 scores for kNNGuard FE (fused ensemble, k = 13, bank size n = 50 per class, τ = 0.5). Each axis represents one evaluation domain; larger distance from the center indicates higher F1 (better discrimination). Line styles and markers distinguish the six backbone models for readability in grayscale. Llama-3.1-8B-Instruct denotes the original kNNGuard baseline; Mistral-7B-Instruct…

Original abstract

Large language models (LLMs) are increasingly deployed in domains requiring guardrails to detect unsafe, off-topic, or adversarial prompts. Existing guardrails predominately rely on fine-tuning to build classifiers, which often suffer from low generalization and high inference latency. We present kNNGuard, a training-free guardrail that utilizes the activation space of an off-the-shelf LLM. Given a small bank of 50 safe and unsafe prompts, kNNGuard extracts hidden activations and performs multi-layer kNN fusing activation-space and embedding-space scores for classification. Across six domains spanning topical and security prompts, kNNGuard achieves competitive or superior F1 compared to fine-tuned state-of-the-art guardrails while running 2.7x faster than the best comparable guardrail, and 10x faster than a fine-tuned safety classifier without gradient updates or fine-tuning. Domain adaptation requires only updating the labeled bank, which can be constructed in under 10 seconds and several orders of magnitude faster than established guardrails. We also analyze the impact of system prompts, layer selection, and integration into production LLM pipelines as a configurable, low-latency guardrail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents kNNGuard, a training-free guardrail that extracts hidden activations from an off-the-shelf LLM, applies multi-layer kNN on a fixed bank of 50 labeled safe/unsafe prompts, and fuses activation-space and embedding-space scores for classification. It claims competitive or superior F1 scores across six domains (topical and security prompts) relative to fine-tuned SOTA guardrails, with 2.7x faster inference than the best comparable method and 10x faster than a fine-tuned safety classifier, plus rapid domain adaptation by swapping the bank.

Significance. If the separability results hold, the work provides a practical, low-latency alternative to fine-tuning for LLM guardrails, with the key advantage of zero gradient updates and sub-10-second bank construction for new domains. This could meaningfully reduce deployment costs in production pipelines where retraining is prohibitive.

major comments (2)
  1. [Experiments / Evaluation] The central empirical claim (competitive F1 with a 50-prompt bank) rests on the assumption that off-the-shelf LLM activations contain sufficient class-separable structure at this scale; however, the manuscript provides no ablation on bank size, composition sensitivity, or cross-domain prompt variability, leaving open whether nearest-neighbor decisions are driven by intrinsic geometry or by the specific choice of the 50 examples.
  2. [Experiments] The reported speedups (2.7x and 10x) and F1 comparisons require explicit details on hardware, batching, exact baseline implementations, metric computation (e.g., how F1 is aggregated across domains), and statistical significance; these are load-bearing for the "superior" claim but appear underspecified relative to the strength of the conclusion.
minor comments (2)
  1. [Method] Notation for the fusion of activation and embedding scores (e.g., how the multi-layer kNN distances are combined) could be clarified with an explicit equation or pseudocode.
  2. [Abstract] The abstract states results across "six domains" but does not name them or indicate prompt selection criteria; adding this would improve reproducibility.
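
For concreteness, the kind of explicit notation requested in minor comment 1 might look like the following. This is a hypothetical form, not the paper's equation: the symbols, the uniform average over layers, and the fixed weight α are assumptions, and the paper's Figure 2 describes an adaptive fusion that may differ. Here 𝓛 is the set of selected layers, N_k denotes the k nearest bank items to prompt x in the given space, y_i ∈ {0, 1} is the bank label (1 = unsafe), and τ is the decision threshold.

```latex
\begin{align}
s_{\mathrm{act}}(x) &= \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}}
    \frac{1}{k} \sum_{i \in \mathcal{N}_k^{(\ell)}(x)} y_i, \\
s_{\mathrm{emb}}(x) &= \frac{1}{k} \sum_{i \in \mathcal{N}_k^{\mathrm{emb}}(x)} y_i, \\
s(x) &= \alpha\, s_{\mathrm{act}}(x) + (1 - \alpha)\, s_{\mathrm{emb}}(x), \\
\hat{y}(x) &= \mathbb{1}\left[\, s(x) \ge \tau \,\right].
\end{align}
```

The figure captions on this page report k = 13 and τ = 0.5; the value of α, if the paper uses such a weight at all, is not stated here.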

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental claims and details. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments / Evaluation] The central empirical claim (competitive F1 with a 50-prompt bank) rests on the assumption that off-the-shelf LLM activations contain sufficient class-separable structure at this scale; however, the manuscript provides no ablation on bank size, composition sensitivity, or cross-domain prompt variability, leaving open whether nearest-neighbor decisions are driven by intrinsic geometry or by the specific choice of the 50 examples.

    Authors: We agree that explicit ablations would better substantiate the separability assumption. In the revised manuscript we will add: (i) an ablation varying bank size from 10 to 100 examples while holding other factors fixed, (ii) results across multiple randomly sampled banks of size 50 to quantify composition sensitivity, and (iii) per-domain variance analysis to address cross-domain prompt variability. These additions will clarify the contribution of activation-space geometry versus example selection. revision: yes

  2. Referee: [Experiments] The reported speedups (2.7x and 10x) and F1 comparisons require explicit details on hardware, batching, exact baseline implementations, metric computation (e.g., how F1 is aggregated across domains), and statistical significance; these are load-bearing for the "superior" claim but appear underspecified relative to the strength of the conclusion.

    Authors: We concur that greater experimental transparency is required. The revision will specify: hardware platform and memory, batch sizes used during inference, exact baseline code or references, macro-averaged F1 across the six domains, and statistical significance (standard deviation over repeated runs or bootstrap intervals). These details will be inserted into the experimental setup and results sections. revision: yes
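
The two reporting items promised in the second response, macro-averaged F1 across domains and bootstrap intervals, can be sketched as follows. The helper names, the toy labels, the 1,000 resamples, and the 95% level are assumptions chosen for illustration; they are not the authors' evaluation code.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 with 1 = unsafe as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def macro_f1(per_domain):
    """Unweighted mean of per-domain F1; per_domain maps name -> (y_true, y_pred)."""
    return float(np.mean([f1_score(t, p) for t, p in per_domain.values()]))

def bootstrap_ci(y_true, y_pred, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for F1, resampling test prompts with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Toy predictions for two hypothetical domains.
results = {
    "domain_a": ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]),  # F1 = 2/3
    "domain_b": ([1, 1, 0, 0], [1, 1, 0, 0]),              # F1 = 1
}
macro = macro_f1(results)
ci_low, ci_high = bootstrap_ci(*results["domain_a"])
print(round(macro, 4), (round(ci_low, 3), round(ci_high, 3)))
```

Macro averaging weights every domain equally regardless of test-set size, which is why the choice between macro and micro aggregation needs to be stated alongside the reported numbers.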

Circularity Check

0 steps flagged

No significant circularity; empirical method validated externally

Full rationale

The paper describes an empirical, training-free kNN-based classifier operating on off-the-shelf LLM activations with a static 50-prompt bank. Classification performance is measured via F1 scores on held-out test sets across six domains and compared directly to independently implemented fine-tuned baselines. No equations, uniqueness theorems, or predictions are offered; the central claim reduces to experimental results rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLM hidden activations encode prompt safety information in a kNN-separable manner; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: hidden activations from an off-the-shelf LLM contain class-separable structure for safe versus unsafe prompts that kNN can exploit with small example banks.
    This premise is required for the training-free classification to succeed across domains.

pith-pipeline@v0.9.1-grok · 5738 in / 1217 out tokens · 36735 ms · 2026-07-03T17:00:29.827112+00:00 · methodology

