SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits
Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3
The pith
SelfGrader detects jailbreaks by grading queries with logits over digits 0-9
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SelfGrader formulates jailbreak detection as a numerical grading problem by evaluating the safety of a user query within the compact set of numerical tokens (0-9) and interpreting their logit distribution as an internal safety signal. A dual-perspective scoring rule considers both maliciousness and benignness to yield a stable score reflecting harmfulness while reducing false positives.
What carries the argument
The dual-perspective scoring rule applied to the logit distribution over numerical tokens 0-9, which extracts a safety signal directly from the model's output probabilities without full generation.
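The logit readout described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the next-token logits are already available as a plain list and that the ten digit token IDs are known. The function names `digit_distribution` and `expected_grade` are hypothetical, and collapsing the distribution to an expected grade is an assumption made here for illustration.

```python
import math


def digit_distribution(logits, digit_token_ids):
    """Softmax restricted to the digit tokens (order: 0, 1, ..., 9).

    `logits` is the full next-token logit vector; `digit_token_ids`
    lists the vocabulary index of each digit. Both are assumed inputs.
    """
    selected = [logits[i] for i in digit_token_ids]
    peak = max(selected)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in selected]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(probs):
    """Collapse a distribution over grades 0..9 into one number.

    Using the expectation is an assumption of this sketch.
    """
    return sum(grade * p for grade, p in enumerate(probs))
```

Because only one forward pass and ten logits are needed, no response text has to be generated, which is the source of the efficiency argument summarized above.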
If this is right
- SelfGrader reduces attack success rate by up to 22.66% on LLaMA-3-8B compared to baselines.
- It incurs up to 173 times lower memory overhead than competing guardrails.
- Latency is reduced by up to 26 times while maintaining detection performance.
- The method works across multiple LLMs and diverse jailbreak benchmarks.
- It provides interpretable scores aligned with human intuition of maliciousness.
Where Pith is reading between the lines
- If the numerical logits reliably encode safety, similar lightweight grading could be applied to other safety-related tasks like toxicity detection.
- The dual-perspective rule might generalize to other token sets beyond digits if the model has consistent probability patterns.
- Deployment on edge devices becomes more feasible due to low overhead, enabling real-time query screening.
- Future work could test whether fine-tuning the model to strengthen these logit signals improves detection further.
Load-bearing premise
The logit distribution over the fixed tokens 0 through 9 reliably signals query maliciousness in a way that aligns with human judgments of harm.
What would settle it
A collection of jailbreak and benign queries where the numerical logit scores fail to distinguish malicious from safe inputs at rates better than chance, or where the dual scoring rule produces unstable results across prompt variations.
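One way to make "better than chance" concrete is to compute the AUROC of the scores on labeled queries, where 0.5 corresponds to chance. The sketch below is an assumption about how such a check could be run; the function name `auroc` and the pairwise formulation are illustrative and not taken from the paper.

```python
def auroc(malicious_scores, benign_scores):
    """Probability that a random malicious query outscores a random
    benign one, counting ties as half a win. 0.5 means chance level."""
    if not malicious_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malicious_scores) * len(benign_scores))
```

Repeating the computation over paraphrased or reordered grading prompts would expose the instability across prompt variations that the test above also asks about.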
Original abstract
Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using anchored token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with the target safety rubric, SelfGrader constructs Probably Approximately Correct-guided ICL anchor examples and introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, adaptive attacks, benign prompt benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves strong robustness with low false positive rates, memory overhead, and latency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SelfGrader, a lightweight jailbreak detection method for LLMs that appends a fixed safety-grading suffix to user queries and interprets the model's next-token logit distribution over the numerical tokens {0-9} as an internal safety signal. A dual-perspective scoring rule combines a maliciousness score and a benignness score derived from these logits to produce a final harmfulness metric intended to be stable and interpretable. Experiments across multiple jailbreak benchmarks and LLMs (including LLaMA-3-8B) report up to 22.66% reduction in attack success rate relative to baselines, together with memory overhead reductions up to 173x and latency reductions up to 26x, while avoiding full response generation.
Significance. If the performance and efficiency claims are reproducible, SelfGrader would constitute a practical advance for real-time guardrails by eliminating the need for text generation or internal feature extraction. The fixed numerical token set offers a compact, potentially interpretable proxy for harmfulness that could scale to resource-constrained deployments; the dual-perspective formulation is a distinctive design choice that may reduce false positives compared with single-score logit methods.
major comments (3)
- [§3.2] The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.
- [§4.1, Table 2] The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim.
- [§4.3] No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.
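The tokenizer concern in the last comment can be probed before any model is run. The sketch below assumes the vocabulary is available as a plain mapping from token string to ID; it checks whether each digit has exactly one candidate token once common word-boundary prefixes are considered ("▁" in SentencePiece vocabularies, "Ġ" in byte-level BPE vocabularies). The helper names are hypothetical and the prefix list is not exhaustive.

```python
def digit_token_variants(vocab):
    """Map each digit to the sorted IDs of its candidate tokens.

    `vocab` is an assumed dict of token string -> token ID.
    """
    prefixes = ("", "▁", "Ġ")  # bare, SentencePiece, byte-level BPE
    variants = {}
    for digit in "0123456789":
        variants[digit] = sorted(
            vocab[p + digit] for p in prefixes if p + digit in vocab
        )
    return variants


def ambiguous_digits(vocab):
    """Digits with zero or several candidate tokens; these are the
    cases where a fixed digit-to-token mapping could silently break."""
    return [d for d, ids in digit_token_variants(vocab).items() if len(ids) != 1]
```

A non-empty result would not by itself show that detection degrades, only that the mapping deserves the remapping experiment the referee requests.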
minor comments (2)
- [Abstract, §4] The memory and latency multipliers (173x, 26x) should be accompanied by the absolute baseline values in the same table so readers can assess absolute overhead rather than ratios alone.
- [§3.1] The notation for the final score S(q) is defined piecewise but the weighting hyperparameter between the two perspectives is introduced without a sensitivity sweep or default-value justification.
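The sensitivity sweep requested for the weighting hyperparameter can be illustrated with the rule as it is quoted in the theorem-link section of this page: sDPL = λ s(+) + (1−λ)(Q − s(−) − 1). The sketch below is hedged accordingly. Reading Q as the number of grades (10 for digits 0-9), treating s(+) and s(−) as maliciousness and benignness grades on the 0..Q−1 scale, and the default λ = 0.5 are all assumptions, and the function names are hypothetical.

```python
def dual_perspective_score(s_pos, s_neg, lam=0.5, q=10):
    """lam * s_pos + (1 - lam) * (q - s_neg - 1).

    s_pos: maliciousness grade; s_neg: benignness grade (assumed
    to lie in 0..q-1). The second term flips the benignness grade
    onto the maliciousness scale before mixing.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * s_pos + (1.0 - lam) * (q - s_neg - 1)


def sweep(s_pos, s_neg, lams=(0.0, 0.25, 0.5, 0.75, 1.0), q=10):
    """Score the same pair of grades under several weightings."""
    return [(lam, dual_perspective_score(s_pos, s_neg, lam, q)) for lam in lams]
```

Under these assumptions, a query graded 8 for maliciousness and 3 for benignness scores 0.5 · 8 + 0.5 · (10 − 3 − 1) = 7 at λ = 0.5; when the two perspectives agree (for example 9 and 0) the score does not depend on λ at all, so a sweep is informative only on queries where they disagree.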
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and robustness of our results.
- Referee: [§3.2] The dual-perspective scoring rule (maliciousness + benignness logits) is introduced without an ablation that replaces the numerical grading suffix with a semantically matched but non-numerical prompt; without this control it remains possible that the reported signal arises from surface statistics of the suffix rather than model-internal safety reasoning.
  Authors: We agree that an ablation replacing the numerical suffix with a semantically matched non-numerical prompt would help isolate whether the signal derives from internal safety reasoning or surface statistics. In the revised manuscript we will add this control experiment in §3.2 and report the resulting logit distributions and detection performance.
  Revision: yes
- Referee: [§4.1] The 22.66% ASR reduction on LLaMA-3-8B is reported without statistical significance tests, standard-error estimates, or explicit data-split details; this omission prevents verification that the gain exceeds benchmark variance and is load-bearing for the central performance claim.
  Authors: We acknowledge the need for statistical rigor. We will add paired statistical significance tests, standard-error estimates across multiple runs, and explicit data-split and evaluation details to §4.1 and Table 2 in the revision.
  Revision: yes
- Referee: [§4.3] No tokenizer-perturbation or digit-token remapping experiment is provided; because the method relies exclusively on the logit distribution over the fixed set {0-9}, sensitivity to tokenizer-specific mappings of these tokens constitutes a potential failure mode that must be quantified.
  Authors: This is a valid robustness concern. We will include tokenizer-perturbation and digit-token remapping experiments in the revised §4.3 to quantify sensitivity and confirm stability across tokenizers.
  Revision: yes
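The paired analysis promised for §4.1 could take several forms. One common option is a paired bootstrap over attack prompts, sketched below under stated assumptions: each prompt has a 0/1 attack-success outcome for the baseline and for the guarded model, and the two lists are aligned by prompt. The function name, the resample count, and the choice of bootstrap over other paired tests are illustrative and not the authors' stated procedure.

```python
import random
import statistics


def paired_bootstrap_asr(baseline_success, guarded_success,
                         n_resamples=2000, seed=0):
    """Return (ASR reduction, bootstrap standard error).

    Prompts are resampled jointly so the pairing between the two
    systems is preserved in every resample.
    """
    if len(baseline_success) != len(guarded_success):
        raise ValueError("outcome lists must be aligned by prompt")
    n = len(baseline_success)
    rng = random.Random(seed)  # fixed seed for reproducibility
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(baseline_success[i] for i in idx) / n
        guard = sum(guarded_success[i] for i in idx) / n
        diffs.append(base - guard)
    point = (sum(baseline_success) - sum(guarded_success)) / n
    return point, statistics.stdev(diffs)
```

A reported reduction that is small relative to this standard error would be hard to separate from benchmark variance, which is the gap the referee identifies.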
Circularity Check
No significant circularity; method directly uses logits without reduction to fitted parameters or self-citations
The paper defines SelfGrader by directly mapping the logit distribution over a fixed set of numerical tokens (0-9) to a safety signal via a dual-perspective scoring rule, with no equations or derivations that reduce any claimed metric (such as ASR reduction) to a parameter fitted on the evaluation data itself. Results are reported on external jailbreak benchmarks across multiple models, and the provided text contains no self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation that would create circularity. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Logit distribution over numerical tokens (0-9) serves as a reliable proxy for query safety that aligns with human judgment of maliciousness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0–9) and interprets their logit distribution as an internal safety signal... dual-perspective scoring rule... $s_{\mathrm{DPL}} = \lambda\, s^{(+)} + (1-\lambda)\,(Q - s^{(-)} - 1)$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: the logits over digit tokens serve as a direct, high-signal-to-noise readout of safety judgment... closed, invariant, and task-aligned yet flexible numerical space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.