SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

Jiahao Xu; Olivera Kotevska; Rui Hu; Zikai Zhang

arxiv: 2604.01473 · v3 · pith:ZFOZZZP7new · submitted 2026-04-01 · 💻 cs.CR · cs.AI

Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords: jailbreak detection · large language models · token-level logits · guardrail methods · numerical grading · LLM safety · attack success rate

The pith

SelfGrader detects jailbreaks by grading queries with logits over digits 0-9

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SelfGrader is a guardrail that checks user queries for malicious intent without generating full responses or inspecting internal model states. It prompts the model to assign a safety grade using only the logits for the digits zero to nine, then applies a dual-perspective rule that weighs both how malicious and how benign the query appears. This produces a stable score that lowers the attack success rate by as much as 22.66 percent on models like LLaMA-3-8B while using far less memory and running much faster than prior methods. The approach matters because existing detectors either add heavy latency from full text generation or rely on features that are hard to access and interpret.

Core claim

SelfGrader formulates jailbreak detection as a numerical grading problem by evaluating the safety of a user query within the compact set of numerical tokens (0-9) and interpreting their logit distribution as an internal safety signal. A dual-perspective scoring rule considers both maliciousness and benignness to yield a stable score reflecting harmfulness while reducing false positives.
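The readout described above can be illustrated with a minimal sketch. It assumes the ten digit-token logits have already been extracted from the model's next-token output, and it collapses them into one number by taking the expected digit under a softmax restricted to those tokens. The expected-value readout, the "higher digit means more harmful" convention, and the example logits are all assumptions made for illustration; this page does not state how the paper reduces the distribution to a grade.

```python
import math
from typing import Sequence


def nt_distribution(nt_logits: Sequence[float]) -> list[float]:
    """Softmax restricted to the numerical-token logits (index i = digit i)."""
    peak = max(nt_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in nt_logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(nt_logits: Sequence[float]) -> float:
    """Expected digit under the restricted distribution (assumed readout)."""
    probs = nt_distribution(nt_logits)
    return sum(digit * p for digit, p in enumerate(probs))


# Hypothetical logits for digits 0..9, not taken from the paper.
benign_like = [4.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
harmful_like = list(reversed(benign_like))

print(expected_grade(benign_like), expected_grade(harmful_like))
```

Only one forward pass and ten logits are needed for this readout, which is consistent with the latency and memory argument the paper makes, although the real system also conditions on anchor examples that this sketch omits.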

What carries the argument

The dual-perspective scoring rule applied to the logit distribution over numerical tokens 0-9, which extracts a safety signal directly from the model's output probabilities without full generation.
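The rule itself is quoted further down this page as sDPL = λ s(+) + (1−λ)(Q−s(−)−1). A minimal sketch of that formula follows, reading s(+) as the maliciousness grade, s(−) as the benignness grade, and Q as the number of numerical tokens, so that Q − s(−) − 1 flips the benignness grade onto the maliciousness scale. That reading, the default `lam=0.5`, and the function name are assumptions, not details confirmed by the source.

```python
def dual_perspective_score(s_plus: float, s_minus: float,
                           lam: float = 0.5, Q: int = 10) -> float:
    """Combine a maliciousness grade and a benignness grade into one score.

    s_plus  : grade from the maliciousness view, assumed to lie in [0, Q - 1]
    s_minus : grade from the benignness view, assumed to lie in [0, Q - 1]
    lam     : weighting coefficient (default is an illustrative assumption)
    Q       : number of numerical tokens, e.g. 10 for digits 0-9
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    flipped_benign = Q - s_minus - 1  # high benignness -> low harmfulness
    return lam * s_plus + (1.0 - lam) * flipped_benign


# A query graded as clearly malicious from both views scores at the top of
# the scale, and a clearly benign one scores at the bottom.
print(dual_perspective_score(9, 0))
print(dual_perspective_score(0, 9))
```

When the two views disagree, the combined score is pulled toward the middle of the scale, which is one plausible reading of how the rule reduces false positives.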

If this is right

SelfGrader reduces attack success rate by up to 22.66% on LLaMA-3-8B compared to baselines.
It incurs up to 173 times lower memory overhead than competing guardrails.
Latency is reduced by up to 26 times while maintaining detection performance.
The method works across multiple LLMs and diverse jailbreak benchmarks.
It provides interpretable scores aligned with human intuition of maliciousness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the numerical logits reliably encode safety, similar lightweight grading could be applied to other safety-related tasks like toxicity detection.
The dual-perspective rule might generalize to other token sets beyond digits if the model has consistent probability patterns.
Deployment on edge devices becomes more feasible due to low overhead, enabling real-time query screening.
Future work could test whether fine-tuning the model to strengthen these logit signals improves detection further.

Load-bearing premise

The logit distribution over the fixed tokens 0 through 9 reliably signals query maliciousness in a way that aligns with human judgments of harm.

What would settle it

A collection of jailbreak and benign queries where the numerical logit scores fail to distinguish malicious from safe inputs at rates better than chance, or where the dual scoring rule produces unstable results across prompt variations.
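Both halves of this test can be made concrete with a short sketch. The first function estimates how often a malicious query outscores a benign one (the AUROC, where 0.5 is chance), and the second measures how much one query's score moves across prompt templates. The function names and the choice of AUROC and max-minus-min spread as the two metrics are assumptions for illustration; the paper's own evaluation uses attack success rate and false positive rate.

```python
from typing import Sequence


def auroc(malicious_scores: Sequence[float],
          benign_scores: Sequence[float]) -> float:
    """Probability that a random malicious score exceeds a random benign one.

    Ties count as half a win, so a detector that cannot separate the two
    groups lands at 0.5.
    """
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malicious_scores) * len(benign_scores))


def score_spread(scores_by_template: Sequence[float]) -> float:
    """Range of one query's score across prompt-template variations."""
    return max(scores_by_template) - min(scores_by_template)
```

An AUROC close to 0.5 on a held-out set, or a spread comparable to the gap between typical malicious and benign scores, would be evidence against the load-bearing premise.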

Figures

Figures reproduced from arXiv: 2604.01473 by Jiahao Xu, Olivera Kotevska, Rui Hu, Zikai Zhang.

**Figure 1.** Average ASR vs. FPR of different defense methods on the Llama-3-8B-Instruct model. FPR on Benign Prompts. We evaluate the FPRs of different guardrail methods on Llama3-8B-Instruct using four benign prompt benchmarks: AlpacaEval (instruction-following tasks), OR-Bench (over-refusal prompts), GSM8K (math reasoning), and HumanEval (code generation). In particular, GSM8K and HumanEval test whether numerical tas…

**Figure 2.** Impact of the number of NTs Q. Increasing Q enlarges the resolution of the NT space and reduces discretization error when mapping the model’s internal safety judgment to discrete NTs. This improves the precision and smoothness of the safety scores. In…

**Figure 3.** Effect of k and λ on defense performance. Effect of DPL Coefficient λ

**Figure 4.** Visualization of NT-based logit distributions under AutoDAN attacks with differ…

**Figure 5.** Comparison of different guardrail methods, including generation-based,…

Original abstract

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using anchored token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with the target safety rubric, SelfGrader constructs Probably Approximately Correct-guided ICL anchor examples and introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, adaptive attacks, benign prompt benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves strong robustness with low false positive rates, memory overhead, and latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SelfGrader gives a practical efficiency win for jailbreak detection by scoring logits over digits instead of generating text, but the numerical signal looks fragile to prompt wording.

The main takeaway is that this method turns jailbreak detection into a quick numerical grading task using only the logits on tokens 0-9 after a fixed safety suffix. It avoids both full response sampling and heavy internal feature extraction, which cuts memory and latency by large factors while still lowering attack success rates on the tested models and benchmarks. That efficiency angle is the clearest practical contribution here.

The dual-perspective score that mixes maliciousness and benignness views is a reasonable way to stabilize the output and cut false positives, and the abstract shows consistent gains across several LLMs and baselines. The experiments appear to run on standard jailbreak datasets, which is straightforward and reproducible in principle.

The soft spot is the core assumption that the logit distribution over those ten tokens reliably tracks harmfulness rather than just the surface statistics of the suffix or the tokenizer's digit handling. The stress-test note flags the lack of ablations that swap in non-numerical prompts or perturb tokenization, and nothing in the reported results directly rules that out. If the signal shifts with small template changes, the claimed stability would not hold in real deployments. This is a minor-to-moderate concern depending on how thoroughly the full paper checks it, but it is the load-bearing piece.

The work is aimed at people building production guardrails who need low-overhead detectors that still beat simple baselines. It is not a theoretical advance, but the empirical efficiency numbers are concrete enough that a serious referee should see it. I would send it to review with a request for the missing ablations on prompt sensitivity.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SelfGrader, a lightweight jailbreak detection method for LLMs that appends a fixed safety-grading suffix to user queries and interprets the model's next-token logit distribution over the numerical tokens {0-9} as an internal safety signal. A dual-perspective scoring rule combines a maliciousness score and a benignness score derived from these logits to produce a final harmfulness metric intended to be stable and interpretable. Experiments across multiple jailbreak benchmarks and LLMs (including LLaMA-3-8B) report up to 22.66% reduction in attack success rate relative to baselines, together with memory overhead reductions up to 173x and latency reductions up to 26x, while avoiding full response generation.

Significance. If the performance and efficiency claims are reproducible, SelfGrader would constitute a practical advance for real-time guardrails by eliminating the need for text generation or internal feature extraction. The fixed numerical token set offers a compact, potentially interpretable proxy for harmfulness that could scale to resource-constrained deployments; the dual-perspective formulation is a distinctive design choice that may reduce false positives compared with single-score logit methods.

major comments (3)

[§3.2] The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.
[§4.1, Table 2] The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim.
[§4.3] No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.
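The uncertainty estimate requested for §4.1 could be produced in several ways. One option is a paired bootstrap over prompts, sketched below under the assumption that each attack prompt has a binary outcome (1 = attack succeeded) recorded for both the baseline and the defended model. The function names, the resample count, and the fixed seed are illustrative choices, not the authors' protocol.

```python
import random
from typing import Sequence


def asr(outcomes: Sequence[int]) -> float:
    """Attack success rate: fraction of prompts whose attack succeeded."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_ci(baseline: Sequence[int], defended: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Percentile interval for ASR(baseline) - ASR(defended).

    Prompts are resampled with replacement, keeping each prompt's two
    outcomes paired, so prompt-level difficulty cancels out.
    """
    if len(baseline) != len(defended):
        raise ValueError("outcome lists must be paired, prompt by prompt")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(baseline[i] - defended[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that excludes zero would indicate that the reported reduction exceeds resampling noise on that benchmark; it would not address variance across random seeds or data splits, which the comment above also raises.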

minor comments (2)

[Abstract, §4] The memory and latency multipliers (173x, 26x) should be accompanied by the absolute baseline values in the same table so readers can assess absolute overhead rather than ratios alone.
[§3.1] The notation for the final score S(q) is defined piecewise but the weighting hyperparameter between the two perspectives is introduced without a sensitivity sweep or default-value justification.
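The sensitivity sweep requested in the last comment might look like the sketch below. It assumes each prompt comes with a pair of grades (s(+), s(−)), applies the combination rule quoted elsewhere on this page, and flags a prompt when the combined score reaches a threshold. The threshold, the toy grades, and the use of miss rate as a stand-in for attack success rate are assumptions for illustration only.

```python
from typing import Sequence, Tuple

Grades = Tuple[float, float]  # (s_plus, s_minus) for one prompt


def dpl(s_plus: float, s_minus: float, lam: float, Q: int = 10) -> float:
    """Dual-perspective score: lam * s(+) + (1 - lam) * (Q - s(-) - 1)."""
    return lam * s_plus + (1.0 - lam) * (Q - s_minus - 1)


def sweep_lambda(malicious: Sequence[Grades], benign: Sequence[Grades],
                 lambdas: Sequence[float], threshold: float,
                 Q: int = 10) -> list[tuple[float, float, float]]:
    """Return (lam, miss_rate, false_positive_rate) for each lam."""
    rows = []
    for lam in lambdas:
        missed = sum(dpl(p, m, lam, Q) < threshold for p, m in malicious)
        flagged = sum(dpl(p, m, lam, Q) >= threshold for p, m in benign)
        rows.append((lam, missed / len(malicious), flagged / len(benign)))
    return rows


# Toy grades, not taken from the paper.
malicious = [(8, 1), (3, 2)]
benign = [(1, 8), (6, 7)]
rows = sweep_lambda(malicious, benign, [0.0, 0.5, 1.0], threshold=5)
```

Reporting such a table over a grid of λ values, alongside the chosen default, would let readers see whether the headline numbers depend on a narrow setting.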

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and robustness of our results.

Referee: [§3.2] The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.

Authors: We agree that an ablation replacing the numerical suffix with a semantically matched non-numerical prompt would help isolate whether the signal derives from internal safety reasoning or surface statistics. In the revised manuscript we will add this control experiment in §3.2 and report the resulting logit distributions and detection performance.
Revision: yes

Referee: [§4.1] The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim.

Authors: We acknowledge the need for statistical rigor. We will add paired statistical significance tests, standard-error estimates across multiple runs, and explicit data-split and evaluation details to §4.1 and Table 2 in the revision.
Revision: yes

Referee: [§4.3] No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.

Authors: This is a valid robustness concern. We will include tokenizer-perturbation and digit-token remapping experiments in the revised §4.3 to quantify sensitivity and confirm stability across tokenizers.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method directly uses logits without reduction to fitted parameters or self-citations

full rationale

The paper defines SelfGrader by directly mapping the logit distribution over a fixed set of numerical tokens (0-9) to a safety signal via a dual-perspective scoring rule, with no equations or derivations that reduce any claimed metric (such as ASR reduction) to a parameter fitted on the evaluation data itself. Results are reported on external jailbreak benchmarks across multiple models, and the provided text contains no self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation that would create circularity. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that numerical-token logits encode safety information in a way that can be turned into a stable harmfulness score; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: The logit distribution over numerical tokens (0-9) serves as a reliable proxy for query safety that aligns with human judgment of maliciousness.
Invoked to justify interpreting the numerical-token probabilities as an internal safety signal.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0–9) and interprets their logit distribution as an internal safety signal... dual-perspective scoring rule... $s_{\mathrm{DPL}} = \lambda\, s^{(+)} + (1-\lambda)\,(Q - s^{(-)} - 1)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear

Paper passage: the logits over digit tokens serve as a direct, high-signal-to-noise readout of safety judgment... closed, invariant, and task-aligned yet flexible numerical space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.