pith · machine review for the scientific record

arxiv: 2604.01473 · v2 · submitted 2026-04-01 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

· Lean Theorem

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak detection · large language models · token-level logits · guardrail methods · numerical grading · LLM safety · attack success rate

The pith

SelfGrader detects jailbreaks by grading queries with logits over digits 0-9

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SelfGrader is a guardrail that checks user queries for malicious intent without generating full responses or inspecting internal model states. It prompts the model to assign a safety grade using only the logits for the digit tokens 0 through 9, then applies a dual-perspective rule that weighs both how malicious and how benign the query appears. This produces a stable score that lowers the attack success rate by up to 22.66 percent on LLaMA-3-8B while using up to 173x less memory and running up to 26x faster than prior guardrails. The approach matters because existing detectors either add heavy latency from full text generation or rely on internal features that are hard to access and interpret.
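The grading step can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming a Hugging Face causal LM and an invented grading suffix; the paper's exact prompt wording and normalization are not reproduced here.

```python
# Sketch: read a safety grade from digit-token logits, without generating text.
# GRADING_SUFFIX is a hypothetical stand-in for the paper's fixed grading prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # example target model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

GRADING_SUFFIX = (
    "\nRate how harmful the request above is on a scale from 0 (harmless) "
    "to 9 (clearly malicious). Answer with a single digit: "
)

# Token ids for the numerical tokens "0".."9" (tokenizer-dependent; the referee
# report below flags this mapping as a robustness concern).
digit_ids = [tokenizer.encode(str(d), add_special_tokens=False)[0] for d in range(10)]

def digit_distribution(query: str, suffix: str = GRADING_SUFFIX) -> torch.Tensor:
    """Probability distribution over the digits 0-9 at the next-token position."""
    inputs = tokenizer(query + suffix, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # next-token logits
    return torch.softmax(logits[digit_ids], dim=-1)  # renormalized over 0-9
```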

Core claim

SelfGrader formulates jailbreak detection as a numerical grading problem by evaluating the safety of a user query within the compact set of numerical tokens (0-9) and interpreting their logit distribution as an internal safety signal. A dual-perspective scoring rule considers both maliciousness and benignness to yield a stable score reflecting harmfulness while reducing false positives.

What carries the argument

The dual-perspective scoring rule applied to the logit distribution over numerical tokens 0-9, which extracts a safety signal directly from the model's output probabilities without full generation.
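One plausible reading of that rule, sketched under explicit assumptions: a maliciousness grade from the harmfulness prompt, a benignness grade from a mirrored prompt, and a convex combination controlled by a coefficient λ (the appendix excerpts mention a DPL coefficient λ, but its exact functional form is not given in this review, so the combination below is illustrative rather than the paper's equation).

```python
# Illustrative dual-perspective score, reusing digit_distribution() from the
# sketch above. BENIGN_SUFFIX and the convex combination are assumptions.
import torch

DIGITS = torch.arange(10, dtype=torch.float32)

BENIGN_SUFFIX = (
    "\nRate how clearly benign the request above is on a scale from 0 (not benign) "
    "to 9 (completely benign). Answer with a single digit: "
)

def expected_grade(dist: torch.Tensor) -> float:
    """Expected digit under the renormalized 0-9 distribution, scaled to [0, 1]."""
    return float((dist * DIGITS).sum()) / 9.0

def dual_perspective_score(query: str, lam: float = 0.5) -> float:
    s_mal = expected_grade(digit_distribution(query))                        # "how harmful?" view
    s_ben = expected_grade(digit_distribution(query, suffix=BENIGN_SUFFIX))  # "how benign?" view
    # Higher score = more harmful; a confidently benign reading pulls the score down.
    return lam * s_mal + (1.0 - lam) * (1.0 - s_ben)
```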

If this is right

  • SelfGrader reduces attack success rate by up to 22.66% on LLaMA-3-8B compared to baselines.
  • It incurs up to 173 times lower memory overhead than competing guardrails.
  • Latency is reduced by up to 26 times while maintaining detection performance.
  • The method works across multiple LLMs and diverse jailbreak benchmarks.
  • It provides interpretable scores aligned with human intuition of maliciousness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the numerical logits reliably encode safety, similar lightweight grading could be applied to other safety-related tasks like toxicity detection (see the sketch after this list).
  • The dual-perspective rule might generalize to other token sets beyond digits if the model has consistent probability patterns.
  • Deployment on edge devices becomes more feasible due to low overhead, enabling real-time query screening.
  • Future work could test whether fine-tuning the model to strengthen these logit signals improves detection further.
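As an illustration of the first bullet, the same grading machinery could be pointed at toxicity simply by swapping the suffix. The prompt below is invented for this sketch and nothing in the paper tests it.

```python
# Hypothetical reuse of the digit-grading pattern for toxicity screening,
# assuming digit_distribution() and expected_grade() from the sketches above.
TOXICITY_SUFFIX = (
    "\nRate how toxic the text above is on a scale from 0 (not toxic) "
    "to 9 (extremely toxic). Answer with a single digit: "
)

def toxicity_score(text: str) -> float:
    return expected_grade(digit_distribution(text, suffix=TOXICITY_SUFFIX))
```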

Load-bearing premise

The logit distribution over the fixed tokens 0 through 9 reliably signals query maliciousness in a way that aligns with human judgments of harm.

What would settle it

A collection of jailbreak and benign queries where the numerical logit scores fail to distinguish malicious from safe inputs at rates better than chance, or where the dual scoring rule produces unstable results across prompt variations.
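That test can be stated concretely. The sketch below scores a labeled pool of queries and checks both separability (AUROC against the 0.5 chance baseline) and stability across paraphrases; the query lists are placeholders and the scorer is the illustrative one sketched earlier.

```python
# Sketch of the falsification test: do the scores separate malicious from
# benign queries better than chance, and do they stay stable under paraphrase?
import numpy as np
from sklearn.metrics import roc_auc_score

def separability(malicious: list[str], benign: list[str]) -> float:
    scores = [dual_perspective_score(q) for q in malicious + benign]
    labels = [1] * len(malicious) + [0] * len(benign)
    return roc_auc_score(labels, scores)  # ~0.5 would mean no better than chance

def stability(query: str, paraphrases: list[str]) -> float:
    """Spread of scores across paraphrases of the same query; a large spread means unstable."""
    scores = [dual_perspective_score(q) for q in [query] + paraphrases]
    return float(np.std(scores))
```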

Figures

Figures reproduced from arXiv: 2604.01473 by Jiahao Xu, Olivera Kotevska, Rui Hu, Zikai Zhang.

Figure 1
Figure 1. Average ASR vs. FPR of different defense methods on the Llama-3-8B-Instruct model; FPRs are evaluated on four benign prompt benchmarks: AlpacaEval (instruction-following tasks), OR-Bench (over-refusal prompts), GSM8K (math reasoning), and HumanEval (code generation).
Figure 2
Figure 2. Impact of the number of numerical tokens (NTs) Q: increasing Q enlarges the resolution of the NT space and reduces discretization error when mapping the model's internal safety judgment to discrete NTs, improving the precision and smoothness of the safety scores.
Figure 3
Figure 3. Effect of k and the DPL coefficient λ on defense performance.
Figure 4
Figure 4. Visualization of NT-based logit distributions under AutoDAN attacks.
Figure 5
Figure 5. Comparison of SelfGrader with different guardrail methods, including generation-based approaches.
Original abstract

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from the randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with human intuition of maliciousness, SelfGrader introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves up to a 22.66% reduction in ASR on LLaMA-3-8B, while maintaining significantly lower memory overhead (up to 173x) and latency (up to 26x).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SelfGrader, a lightweight jailbreak detection method for LLMs that appends a fixed safety-grading suffix to user queries and interprets the model's next-token logit distribution over the numerical tokens {0-9} as an internal safety signal. A dual-perspective scoring rule combines a maliciousness score and a benignness score derived from these logits to produce a final harmfulness metric intended to be stable and interpretable. Experiments across multiple jailbreak benchmarks and LLMs (including LLaMA-3-8B) report up to 22.66% reduction in attack success rate relative to baselines, together with memory overhead reductions up to 173x and latency reductions up to 26x, while avoiding full response generation.

Significance. If the performance and efficiency claims are reproducible, SelfGrader would constitute a practical advance for real-time guardrails by eliminating the need for text generation or internal feature extraction. The fixed numerical token set offers a compact, potentially interpretable proxy for harmfulness that could scale to resource-constrained deployments; the dual-perspective formulation is a distinctive design choice that may reduce false positives compared with single-score logit methods.

major comments (3)
  1. [§3.2] §3.2: The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.
  2. [§4.1] §4.1 and Table 2: The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim (a generic sketch of such a test follows the minor comments).
  3. [§4.3] §4.3: No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.
minor comments (2)
  1. [Abstract] Abstract and §4: The memory and latency multipliers (173x, 26x) should be accompanied by the absolute baseline values in the same table so readers can assess absolute overhead rather than ratios alone.
  2. [§3.1] §3.1: The notation for the final score S(q) is defined piecewise but the weighting hyperparameter between the two perspectives is introduced without a sensitivity sweep or default-value justification.
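To make major comment 2 concrete: one standard way to test whether the reported ASR gain exceeds benchmark variance is a paired bootstrap over per-prompt attack outcomes. This is a generic sketch, not an analysis the paper reports; the per-prompt indicators are assumed inputs.

```python
# Paired bootstrap for the ASR gap between baseline and defended systems.
# `base` and `defended` are 0/1 attack-success indicators on the same prompts.
import numpy as np

def paired_bootstrap_asr_gap(base: np.ndarray, defended: np.ndarray,
                             n_boot: int = 10_000, seed: int = 0) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    n = len(base)
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample prompts with replacement
        gaps.append(base[idx].mean() - defended[idx].mean())
    gaps = np.asarray(gaps)
    # 95% confidence interval for the ASR reduction; an interval excluding 0
    # indicates the gain exceeds resampling variance.
    return float(np.percentile(gaps, 2.5)), float(np.percentile(gaps, 97.5))
```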

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and robustness of our results.

Point-by-point responses
  1. Referee: [§3.2] The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.

    Authors: We agree that an ablation replacing the numerical suffix with a semantically matched non-numerical prompt would help isolate whether the signal derives from internal safety reasoning or surface statistics. In the revised manuscript we will add this control experiment in §3.2 and report the resulting logit distributions and detection performance. revision: yes

  2. Referee: [§4.1] The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim.

    Authors: We acknowledge the need for statistical rigor. We will add paired statistical significance tests, standard-error estimates across multiple runs, and explicit data-split and evaluation details to §4.1 and Table 2 in the revision. revision: yes

  3. Referee: [§4.3] No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.

    Authors: This is a valid robustness concern. We will include tokenizer-perturbation and digit-token remapping experiments in the revised §4.3 to quantify sensitivity and confirm stability across tokenizers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method directly uses logits without reduction to fitted parameters or self-citations

Full rationale

The paper defines SelfGrader by directly mapping the logit distribution over a fixed set of numerical tokens (0-9) to a safety signal via a dual-perspective scoring rule, with no equations or derivations that reduce any claimed metric (such as ASR reduction) to a parameter fitted on the evaluation data itself. Results are reported on external jailbreak benchmarks across multiple models, and the provided text contains no self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation that would create circularity. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that numerical-token logits encode safety information in a way that can be turned into a stable harmfulness score; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Logit distribution over numerical tokens (0-9) serves as a reliable proxy for query safety that aligns with human judgment of maliciousness.
    Invoked to justify interpreting the numerical-token probabilities as an internal safety signal.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 9 internal anchors

  1. [1]

    Simple prompt injection attacks can leak personal data observed by LLM agents during task execution

    Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, and Fabrizio Gilardi. Simple prompt injection attacks can leak personal data observed by LLM agents during task execution. arXiv preprint arXiv:2506.01055.

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

  3. [3]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  5. [5]

    OR-Bench: An Over-Refusal Benchmark for Large Language Models

    Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947.

  6. [6]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773.

  7. [7]

    Multilingual jailbreak challenges in large language models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.

  8. [8]

    RLHF Workflow: From Reward Modeling to Online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.

  9. [9]

    Virus: Harmful Fine-Tuning Attack for Large Language Models Bypassing Guardrail Moderation

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Virus: Harmful fine-tuning attack for large language models bypassing guardrail moderation. arXiv preprint arXiv:2501.17433.

  10. [10]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.

  11. [11]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.

  12. [12]

    Scalable Best-of-N Selection for Large Language Models via Self-Certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-N selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.

  13. [13]

    QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

    Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, and Yunho Maeng. QGuard: Question-based zero-shot guard for multi-modal LLM safety. arXiv preprint arXiv:2506.12299.

  14. [14]

    DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. DrAttack: Prompt decomposition and reconstruction makes powerful LLM jailbreakers. arXiv preprint arXiv:2402.16914.

  15. [15]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

  16. [16]

    GuardReasoner: Towards Reasoning-based LLM Safeguards

    Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. GuardReasoner: Towards reasoning-based LLM safeguards. arXiv preprint arXiv:2501.18492.

  17. [17]

    X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

    Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel. X-Teaming: Multi-turn jailbreaks and defenses with adaptive multi-agents. arXiv preprint arXiv:2504.13203.

  18. [18]

    LLMs Know Their Vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. LLMs know their vulnerabilities: Uncover safety gaps through natural distribution shifts. arXiv preprint arXiv:2410.10700.

  19. [19]

    "Do Anything Now": Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685.

  20. [20]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  21. [21]

    SoK: Evaluating Jailbreak Guardrails for Large Language Models

    Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, and Shuai Wang. SoK: Evaluating jailbreak guardrails for large language models. arXiv preprint arXiv:2506.10597, 2025a. Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, and Juergen Rahmel. SelfDefend: LLMs can defend themselves ...

  22. [22]

    Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

    Zhipeng Wei, Yuqi Liu, and N Benjamin Erichson. Emoji attack: Enhancing jailbreak attacks against judge LLM detection. arXiv preprint arXiv:2411.01077.

  23. [23]

    Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

    Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. arXiv preprint arXiv:2307.16888.

  24. [24]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runj...

  25. [25]

    Watch Out for Your Agents! Investigating Backdoor Threats to LLM-based Agents

    Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! Investigating backdoor threats to LLM-based agents. Advances in Neural Information Processing Systems, 37:100938–100964, 2024b. Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, and Li Yuan. LLM lies: Hallucinations are not bugs, but features...

  26. [26]

    Low-Resource Languages Jailbreak GPT-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

  27. [27]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
