
arXiv: 2606.25750 · v1 · pith:COFYDDMXnew · submitted 2026-06-24 · 💻 cs.CR · cs.CL · cs.LG

RAS: Measuring LLM Safety Through Refusal Alignment

Pith reviewed 2026-06-25 20:01 UTC · model grok-4.3

classification 💻 cs.CR · cs.CL · cs.LG
keywords LLM safety · refusal alignment · white-box evaluation · SafeVec · RAS score · internal representations · jailbreak prompts · attack success rate

The pith

Refusal directions in a reference model's internal states can score the safety of other LLMs without judging their outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that LLM safety can be measured from how closely hidden states align with refusal directions extracted from one safety-aligned model, rather than by running unsafe prompts and judging answers. This would matter if true because current output-level checks are slow, sensitive to the choice of judge, and limited to specific prompt sets. The method first pulls layer-wise refusal directions from a reference model, picks stable layer windows where safe and unsafe prompts produce separable representations, then scores any target model by the degree of alignment under unsafe and jailbreak prompts. The resulting Refusal Alignment Score is scaled to 0-100 and tested on Llama, Gemma, and Qwen families, where it separates aligned models from uncensored and abliterated versions while tracking attack success rates. The approach therefore offers a white-box alternative that avoids full output generation.
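
The first and last steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the refusal direction at each layer is the difference of mean hidden states between unsafe and safe prompts, and that alignment is measured by cosine similarity. The function names and the synthetic activations are hypothetical stand-ins for real model states.

```python
import numpy as np

def refusal_directions(h_unsafe, h_safe):
    """Per-layer unit direction: mean(unsafe) - mean(safe).

    h_unsafe, h_safe: arrays of shape (n_prompts, n_layers, d_model),
    one hidden state per prompt and layer (assumed: last-token state).
    """
    diff = h_unsafe.mean(axis=0) - h_safe.mean(axis=0)  # (n_layers, d_model)
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.clip(norms, 1e-12, None)

def layer_alignment(h, directions):
    """Mean cosine similarity between hidden states and each layer's direction.

    h: (n_prompts, n_layers, d_model); directions: (n_layers, d_model).
    Returns an array of shape (n_layers,).
    """
    h_norm = np.linalg.norm(h, axis=-1, keepdims=True)
    h_unit = h / np.clip(h_norm, 1e-12, None)
    cos = np.einsum("pld,ld->pl", h_unit, directions)
    return cos.mean(axis=0)

# Synthetic activations (assumption: no real model is loaded here).
rng = np.random.default_rng(0)
n, n_layers, d = 64, 6, 32
shift = np.zeros(d)
shift[0] = 3.0  # unsafe prompts are displaced along one axis
h_safe = rng.normal(size=(n, n_layers, d))
h_unsafe = rng.normal(size=(n, n_layers, d)) + shift

dirs = refusal_directions(h_unsafe, h_safe)
align_unsafe = layer_alignment(h_unsafe, dirs)
align_safe = layer_alignment(h_safe, dirs)
```

In this toy setup `align_unsafe` exceeds `align_safe` at every layer, which is the kind of gap the score relies on. With a real target model, `h` would come from that model's hidden states under unsafe or jailbreak prompts.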

Core claim

SafeVec extracts layer-wise refusal directions from a safety-aligned reference model, selects stable layer windows where safe and unsafe behaviors separate, and computes a Refusal Alignment Score (RAS) that maps representation-level alignment to a calibrated 0-100 safety score for any target model under unsafe prompts.
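
The summary does not state how alignment becomes a 0-100 score. One plausible form, written out only to make the claim concrete, averages cosine alignment over a layer window and a prompt set, then rescales between two anchor values. All symbols here ($r_l$, $h_l(x)$, $W$, $P$, $s_{\mathrm{low}}$, $s_{\mathrm{high}}$) are assumptions introduced for illustration and may differ from the paper's definition.

```latex
% r_l      : unit refusal direction at layer l, taken from the reference model
% h_l(x)   : target-model hidden state at layer l for prompt x
% W, P     : selected layer window and set of unsafe/jailbreak prompts
% s_low, s_high : raw scores of two anchor models (assumed calibration points)
s = \frac{1}{|W|\,|P|} \sum_{l \in W} \sum_{x \in P}
    \frac{\langle h_l(x),\, r_l \rangle}{\lVert h_l(x) \rVert},
\qquad
\mathrm{RAS} = 100 \cdot \operatorname{clip}\!\left(
    \frac{s - s_{\mathrm{low}}}{s_{\mathrm{high}} - s_{\mathrm{low}}},\; 0,\; 1
\right)
```

Under this form a target whose raw score matches the upper anchor receives 100 and one at or below the lower anchor receives 0.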

What carries the argument

The SafeVec procedure: it extracts refusal directions from a reference model's internal representations and applies them to a target model's hidden states to quantify alignment via the RAS metric.

If this is right

  • RAS separates safety-aligned models from uncensored and abliterated variants across Llama, Gemma, and Qwen families.
  • RAS values track output-level attack success rates on the tested models.
  • RAS evaluation runs substantially faster than judge-based output assessment.
  • White-box measurement of representation alignment supplies a compact safety signal without needing generated text.
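
The figure captions compare RAS with 100 × (1 − ASR), the output-level counterpart on the same 0-100 scale. A small sketch of that quantity and of a strict separation check follows. The helper names are hypothetical, and the model names and scores are placeholders for illustration, not results from the paper.

```python
def safety_from_asr(judge_flags):
    """Return 100 * (1 - ASR).

    judge_flags: iterable of bools, True where the judge marked the
    attack as successful.
    """
    flags = list(judge_flags)
    if not flags:
        raise ValueError("need at least one judged response")
    asr = sum(flags) / len(flags)
    return 100.0 * (1.0 - asr)

def separates(aligned_scores, unaligned_scores):
    """True if every aligned model scores above every unaligned variant."""
    return min(aligned_scores) > max(unaligned_scores)

# Placeholder scores (assumption: invented for illustration only).
ras = {"aligned_a": 82.0, "aligned_b": 74.5, "uncensored": 31.0, "abliterated": 12.5}
aligned = [ras["aligned_a"], ras["aligned_b"]]
unaligned = [ras["uncensored"], ras["abliterated"]]
clean_split = separates(aligned, unaligned)
```

A strict split is a demanding criterion; a ranking-based statistic would tolerate the occasional overlap between groups.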

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same refusal-direction extraction could be applied during fine-tuning to monitor safety drift in real time without separate evaluation runs.
  • If the directions prove stable enough, a small set of reference models might serve as a shared safety baseline for many downstream families.
  • Representation-level scoring opens the possibility of checking safety properties that are hard to elicit through text prompts alone.

Load-bearing premise

Refusal directions taken from one safety-aligned model stay stable across chosen layer windows and transfer as a reliable safety signal to other models and families.
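
To make the notion of a stable layer window concrete, the sketch below picks the longest contiguous run of layers whose separability score clears a threshold. This is one possible reading of "stable", not the paper's stated criterion. The per-layer scores (for example an AUC between safe and unsafe projections), the threshold, and the function name are all assumptions.

```python
def longest_stable_window(scores, threshold):
    """Return (start, end), inclusive, of the longest contiguous run of
    layers with score >= threshold, or None if no layer qualifies.
    Ties are resolved in favour of the earliest run.
    """
    best = None
    start = None
    padded = list(scores) + [float("-inf")]  # sentinel closes a trailing run
    for i, s in enumerate(padded):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best

# Hypothetical per-layer separability scores for an 8-layer model.
layer_scores = [0.55, 0.62, 0.95, 0.97, 0.96, 0.70, 0.92, 0.93]
window = longest_stable_window(layer_scores, threshold=0.9)
```

Here `window` is `(2, 4)`: layers 2 to 4 form a run of three, which beats the run of two at layers 6 to 7.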

What would settle it

A fresh model family or prompt set on which aligned models receive low RAS values and uncensored models receive high ones, or on which RAS no longer tracks measured attack success rates.

Figures

Figures reproduced from arXiv: 2606.25750 by Chang-Chieh Huang, Chia-Mu Yu, Wei-Bin Lee, Yan-Lun Chen.

Figure 1: Layer-wise analysis for Llama.
Figure 2: Cosine similarity between Llama models and the refusal direction on unsafe prompts.
Figure 3: Cosine similarity between Llama models and the refusal direction on jailbreak prompts.
    RQ2: Correlation with output-level safety. Tables 1, 3, and 4 report raw unsafe scores, jailbreak scores, and ASR for the reference and calibration models. Across the calibration sets, safety-aligned models generally have lower ASR, while uncensored or abliterated variants show higher ASR. Their raw SafeVec scores foll…
Figure 4: RAS and 100 × (1 − ASR) for Llama.
    … official google/gemma-3-4b-it reference model, but also achieves a lower ASR. This suggests that some models may improve output-level safety through mechanisms not fully captured by a single reference refusal direction. We still use google/gemma-3-4b-it as the reference model because it is the official safety-aligned instruction model and provides a clean family-specific …
Figure 5: Layer-wise analysis for Gemma.
Figure 7: Cosine similarity between Gemma models and the refusal direction on unsafe prompts.
Figure 9: Cosine similarity between Qwen models and the refusal direction on unsafe prompts.
Figure 11: RAS and 100 × (1 − ASR) for Gemma target models.
Original abstract

Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose **SafeVec**, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. **SafeVec** first extracts layer-wise refusal directions from a safety-aligned reference model, then selects stable layer windows where safe and unsafe behaviors are separable, and finally scores a target model by measuring whether its hidden states align with these refusal directions under unsafe and jailbreak prompts. The resulting metric, **RAS** (**R**efusal **A**lignment **S**core), maps representation-level refusal alignment to a calibrated 0-100 safety score. Across `Llama`, `Gemma`, and `Qwen` model families, RAS separates aligned models from uncensored and abliterated variants, tracks output-level attack success rate, and is substantially faster than judge-based evaluation. These results suggest that refusal alignment provides a compact and efficient signal for white-box LLM safety evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SafeVec, a white-box procedure for evaluating LLM safety. It extracts layer-wise refusal directions from a safety-aligned reference model, selects stable layer windows where safe and unsafe behaviors are separable, and computes the Refusal Alignment Score (RAS) for target models from the alignment of their hidden states with these directions under unsafe and jailbreak prompts. RAS is calibrated to a 0-100 safety score. The authors report that, across the Llama, Gemma, and Qwen families, RAS separates aligned models from uncensored and abliterated variants, tracks output-level attack success rate, and is substantially faster than judge-based evaluation.

Significance. If the empirical claims hold, this offers a potentially scalable white-box alternative to output-based safety evaluations that avoids sensitivity to judge choice and fixed prompt banks. The approach extends representation engineering techniques to safety and could provide efficiency gains if the internal signal proves robust and transferable.

major comments (2)
  1. [Method and experimental results] The transferability of refusal directions extracted from a single reference model to architecturally distinct families (Llama to Gemma and Qwen) is load-bearing for the cross-family separation claim in the abstract. The manuscript selects stable windows on the reference but supplies no quantitative check (e.g., separability metrics or direction cosine similarities computed on target-model residual streams) that the same windows remain informative once the target geometry differs, leaving open whether the observed separation reflects general safety alignment or reference-specific features.
  2. [Results] The claim that RAS tracks output-level attack success rate requires supporting statistics. No correlation coefficients, confidence intervals, or comparison against simple baselines (e.g., prompt-length or token-probability heuristics) are referenced, making it impossible to assess whether the tracking is substantive or incidental.
minor comments (2)
  1. [Method] The exact formula for RAS (including any aggregation across the selected layer window, normalization, and calibration to the 0-100 range) should be stated explicitly with an equation.
  2. Clarify the precise set of unsafe and jailbreak prompts used for both reference extraction and target scoring, and whether they overlap with any evaluation sets.
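
The quantitative check requested in the first major comment could take roughly the following shape. This is a sketch under stated assumptions: the target model's hidden size matches the reference direction (as within one family), hidden states are taken at a single layer, and the target-side direction is a difference of means rather than a principal component. Function names are hypothetical.

```python
import numpy as np

def projection_auc(proj_unsafe, proj_safe):
    """AUC = P(unsafe projection > safe projection); ties count as 0.5."""
    u = np.asarray(proj_unsafe, dtype=float)[:, None]
    s = np.asarray(proj_safe, dtype=float)[None, :]
    return float((u > s).mean() + 0.5 * (u == s).mean())

def direction_cosine(a, b):
    """Cosine similarity between two direction vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def transfer_check(h_unsafe, h_safe, ref_dir):
    """h_unsafe, h_safe: (n_prompts, d_model) target-model states at one layer.

    Returns (auc, cosine): how well the reference direction separates the
    target's safe and unsafe prompts, and how closely it matches the
    target's own difference-of-means direction.
    """
    auc = projection_auc(h_unsafe @ ref_dir, h_safe @ ref_dir)
    target_dir = h_unsafe.mean(axis=0) - h_safe.mean(axis=0)
    return auc, direction_cosine(ref_dir, target_dir)

# Synthetic target activations (assumption: stand-in for real states).
rng = np.random.default_rng(1)
ref_dir = np.zeros(8)
ref_dir[0] = 1.0
h_safe = rng.normal(size=(50, 8))
h_unsafe = rng.normal(size=(50, 8)) + 4.0 * ref_dir
auc, cos = transfer_check(h_unsafe, h_safe, ref_dir)
```

An AUC near 0.5 or a cosine near 0 on a real target would indicate that the reference direction carries little signal in that model at that layer.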

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on transferability and statistical support. We address each point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Method and experimental results] The transferability of refusal directions extracted from a single reference model to architecturally distinct families (Llama to Gemma and Qwen) is load-bearing for the cross-family separation claim in the abstract. The manuscript selects stable windows on the reference but supplies no quantitative check (e.g., separability metrics or direction cosine similarities computed on target-model residual streams) that the same windows remain informative once the target geometry differs, leaving open whether the observed separation reflects general safety alignment or reference-specific features.

    Authors: We agree that the manuscript would benefit from explicit quantitative validation of window transferability. In the revision we will compute, for each target family, (i) separability metrics (mean projection difference and AUC between safe/unsafe prompts) using the reference-derived directions on target residual streams, and (ii) cosine similarity between the reference refusal directions and the top principal component of the target-model unsafe-minus-safe contrast within the same layer windows. These numbers will be added to Section 4 and the appendix. revision: yes

  2. Referee: [Results] The claim that RAS tracks output-level attack success rate requires supporting statistics. No correlation coefficients, confidence intervals, or comparison against simple baselines (e.g., prompt-length or token-probability heuristics) are referenced, making it impossible to assess whether the tracking is substantive or incidental.

    Authors: We accept that the current text lacks the requested statistics. The revised manuscript will report Pearson and Spearman correlations (with 95% bootstrap CIs) between RAS and attack success rate across all evaluated models and prompt sets. We will also add two simple baselines—average prompt token length and mean log-probability of refusal-related tokens—and show that RAS yields higher correlation than either baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper defines RAS explicitly as a white-box metric that extracts refusal directions from one fixed safety-aligned reference model, selects stable windows based on separability in that reference, and then computes alignment scores on target models. This construction is a deliberate design choice for efficient evaluation rather than a self-referential loop. No equations or steps reduce the final score to a fitted parameter or reference-specific feature by construction; the separation and correlation claims are presented as empirical observations across Llama/Gemma/Qwen families. No self-citation load-bearing steps or ansatz smuggling are present in the provided text. The score is checked against an external measure, output-level ASR, rather than against a quantity derived from the score itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified with certainty. The central premise that stable refusal directions exist and can be extracted is treated as a domain assumption.

axioms (1)
  • Domain assumption: refusal directions exist in the layer-wise hidden states of safety-aligned models and can be extracted to form a transferable safety signal.
    Core premise of the SafeVec procedure described in the abstract.

pith-pipeline@v0.9.1-grok · 5743 in / 1258 out tokens · 45675 ms · 2026-06-25T20:01:30.394527+00:00 · methodology

