Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Hongliang Liu; Tung-Ling Li

arXiv: 2506.24056 · v2 · submitted 2025-06-30 · 💻 cs.CR · cs.CL · cs.LG


Pith reviewed 2026-05-19 07:10 UTC · model grok-4.3

classification 💻 cs.CR · cs.CL · cs.LG

keywords logit gap · alignment robustness · refusal behavior · suffix attacks · forward-pass method · attack success rate · safety margin · in-distribution suffixes


The pith

Alignment widens a measurable refusal-affirmation logit gap on nearly all toxic prompts, and a forward-pass method closes it with short in-distribution suffixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the refusal-affirmation logit gap as the difference between the top refusal token logit and the top affirmative token logit at the first decoding step. This scalar serves as a direct measure of the safety margin that RLHF-style alignment creates on unsafe requests. Alignment is shown to widen the gap on 97.5 to 99.8 percent of toxic prompts across multiple model families. Logit-gap steering then finds short suffixes that close the gap through repeated forward passes alone, producing high true attack success rates on standard benchmarks while using far less computation than prior search methods and remaining effective against perplexity filters.

Core claim

The refusal-affirmation logit gap quantifies the per-prompt safety margin supplied by alignment. Alignment consistently widens this gap on the great majority of toxic prompts. Logit-gap steering discovers short in-distribution suffixes whose cumulative effect closes the gap, yielding 38 to 96 percent true attack success rate on AdvBench and HarmBench across thirteen models while requiring roughly 125 times less computation than GCG and preserving effectiveness against perplexity-based defenses.

What carries the argument

The refusal-affirmation logit gap: the scalar difference between the highest refusal-token logit and the highest affirmative-token logit at the first decoding step, used as a proxy for overall refusal strength that logit-gap steering optimizes by appending short suffixes.
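
As a minimal sketch of the definition above, the gap can be computed from a single vector of first-step logits. The function name, the toy logits, and the refusal and affirmative token-id sets are all hypothetical; the paper's actual token sets are not given in this text.

```python
import numpy as np

def refusal_affirmation_gap(first_step_logits, refusal_ids, affirm_ids):
    """Top refusal-token logit minus top affirmative-token logit."""
    logits = np.asarray(first_step_logits, dtype=float)
    top_refusal = logits[list(refusal_ids)].max()
    top_affirm = logits[list(affirm_ids)].max()
    return float(top_refusal - top_affirm)

# Toy 6-token vocabulary; both id sets are illustrative assumptions.
logits = [2.0, 7.5, 1.0, 3.5, 6.0, -1.0]
REFUSAL_IDS = [1, 4]   # e.g. ids of refusal-style first tokens
AFFIRM_IDS = [0, 3]    # e.g. ids of affirmative-style first tokens

gap = refusal_affirmation_gap(logits, REFUSAL_IDS, AFFIRM_IDS)
print(gap)  # 7.5 - 3.5 = 4.0
```

A positive value means the strongest refusal token outranks the strongest affirmative token at the first decoding step; a value at or below zero means the gap is closed.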

If this is right

Alignment creates a consistent but narrow margin that can be closed by short, low-perplexity suffixes.
Suffixes found on small models transfer directly to much larger models in the same family.
The method requires only forward passes and runs in minutes on a single GPU.
Perplexity filters that defeat other attacks leave most of these suffixes intact.
Median gap closure tracks the ranking of suffix strategies by true attack success rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The early logit gap could serve as a lightweight audit tool for comparing different alignment procedures.
Training objectives might be adjusted to enlarge the initial gap rather than only the final refusal behavior.
Similar gap measurements could be applied to other safety-related token distinctions beyond refusal.
The low computational cost opens the possibility of routine per-prompt margin checks during deployment.

Load-bearing premise

Closing the logit gap at the first decoding step is enough to produce full harmful responses without later-stage safeguards overriding the effect.

What would settle it

A set of prompts where the gap is closed yet the model still generates a full refusal response on more than half the trials.
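
The check described above can be sketched as a small tally over per-prompt records. This is an illustration only: the record layout `(final_gap, refused)`, the function names, and the toy data are assumptions, and "gap closed" is taken to mean a final gap at or below zero.

```python
def refusal_rate_when_gap_closed(records):
    """records: iterable of (final_gap, refused) pairs.

    Returns the fraction of gap-closed prompts (final_gap <= 0) whose full
    response was still a refusal, or None if no prompt had its gap closed.
    """
    closed = [refused for final_gap, refused in records if final_gap <= 0.0]
    if not closed:
        return None
    return sum(1 for refused in closed if refused) / len(closed)

def premise_is_challenged(records, threshold=0.5):
    """True when more than `threshold` of gap-closed prompts still refuse."""
    rate = refusal_rate_when_gap_closed(records)
    return rate is not None and rate > threshold

# Hypothetical measurements: three prompts with a closed gap, one without.
toy_records = [(-0.4, False), (-1.2, True), (0.8, True), (-0.1, False)]
rate = refusal_rate_when_gap_closed(toy_records)
print(rate, premise_is_challenged(toy_records))
```

In practice the `refused` flag would come from judging the full decoded response, not the first token.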

Figures

Figures reproduced from arXiv: 2506.24056 by Hongliang Liu, Tung-Ling Li.

**Figure 2.** Gap-Closure Dynamics on Qwen2.5-0.5B-Instruct: cumulative KL

**Figure 3.** Gap-Closure Dynamics on Llama-3.2-1B-Instruct.

**Figure 4.** Gap-Closure Dynamics on gemma-2b-it.
[Plot: "Logit-Gap Comparison of Jailbreak Methods", Qwen_Qwen2.5_0.5B_Instruct; methods: No Suffix, "Sure,", GCG, Random, Ours; axis: Logit Gap (Smaller is Better for Jailbreak)]

**Figure 5.** Qwen-0.5B: final gap for each suffix family.

**Figure 6.** Gemma-2B-it: final gap distributions.
[Plot: "Logit-Gap Comparison of Jailbreak Methods", meta_llama_Llama_3.2_1B_Instruct; methods: Original, Original w/ "Sure,", GCG, Random, Ours; axis: Logit Gap (Smaller is Better for Jailbreak)]

**Figure 7.** Llama-3-1B: final gap distributions.
…forward/backward passes (T iterations, k candidate swaps per step). Because the search is unconstrained, it frequently selects low-probability or out-of-distribution tokens (e.g., control chars, rare Unicode), yielding suffixes that work on the seed prompt but transfer poorly. Our single-pass greedy cover. We first restrict the candidate pool to the in-distribution set S =…

**Figure 8.** Distributions of next-token logits for refusal (Blue), neural reference

**Figure 9.** Measured refusal–affirmation logit gap Δ_0 versus model layer size, across different LLM families.
Implications for suffix search. Since the required cumulative gap-closing score C(S) must reach Δ_0, larger gaps in bigger models imply potentially longer suffixes. However, heavier-tailed distributions of single-token scores F(t) in these models often compensate, allowing our greedy covering search to remain …

**Figure 10.** Token-level rewards of a jailbreak suffix after a toxic prompt, Llama-3.2-

**Figure 11.** Token-level rewards of a jailbreak suffix after a toxic prompt, Qwen2.5-

**Figure 12.** Token-level rewards of a jailbreak suffix after a toxic prompt, Gemma

**Figure 13.** Scatter of ΔF_logit versus λ_KL ΔKL − λ_r Δr for (top) Llama-3.2-1B-Instruct, (middle) Qwen-2.5-0.5B-Instruct, and (bottom) gemma-2b-it. We fit ΔF_logit = α + β_KL ΔKL + β_r Δr by ordinary least squares on per-token measurements. Instead of a single cloud, each model displays several nearly parallel stripes. Visual inspection, together with token metadata, reveals that the bands correspond to coarse token types (pu…

Original abstract

RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference between the top refusal-token logit and the top affirmative-token logit at the first decoding step. This single scalar quantifies the per-prompt safety margin that alignment provides. Empirically, alignment widens the gap on 97.5-99.8% of toxic prompts across three model families, and median gap closure co-varies with True-ASR ranking across suffix strategies (an internal consistency check, since our method optimises gap closure). To validate the metric's practical significance, we present logit-gap steering, a gradient-free, forward-pass-only method that discovers short in-distribution suffixes (<10 tokens per component) whose cumulative effect closes the gap. The method requires ≈26,000 forward-pass equivalents per family (≈2 min on one A100), ≈125× less than a single GCG search. Suffixes discovered on 0.5B-2B models transfer without modification to 72B within family. An 8-suffix ensemble reaches 38-96% True ASR across 13 models on AdvBench and HarmBench, with most suffixes having 10^3-10^4× lower perplexity than GCG, meaning published perplexity-filter defenses that collapse GCG (64.7% → 1.0%) leave our suffixes nearly intact (76.9% → 76.0%). These results demonstrate that current alignment margins, while consistently present, can be thin and efficiently measurable, and that defense strategies must account for in-distribution suffixes.
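
To make the perplexity-filter comparison in the abstract concrete, here is a minimal sketch of how such a filter is commonly built: perplexity is the exponential of the mean negative log-likelihood, and inputs above a threshold are rejected. The log-probabilities, the threshold, and the function names are illustrative assumptions, not the defenses or values evaluated in the paper.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_perplexity_filter(token_logprobs, max_ppl):
    """A filter of this kind accepts text whose perplexity is at most max_ppl."""
    return perplexity(token_logprobs) <= max_ppl

# Hypothetical per-token log-probs from some reference language model.
fluent_suffix = [-1.0, -2.0, -1.5]      # mean NLL 1.5
garbled_suffix = [-9.0, -11.0, -10.0]   # mean NLL 10.0
MAX_PPL = 100.0                         # hypothetical threshold

print(passes_perplexity_filter(fluent_suffix, MAX_PPL))
print(passes_perplexity_filter(garbled_suffix, MAX_PPL))
```

Under these toy numbers the low-perplexity suffix passes and the high-perplexity one is rejected, which is the mechanism behind the reported difference between in-distribution suffixes and GCG suffixes.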

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a simple first-token logit gap as a safety-margin diagnostic and a cheap forward-pass method to find low-perplexity suffixes that close it, with reported efficiency and transfer gains, but the abstract leaves the link to full harmful outputs and method details unshown.


The main takeaway is that alignment creates a measurable refusal-affirmation logit gap at the first decoding step on most toxic prompts, and the authors have a gradient-free way to discover short suffixes that shrink this gap at far lower cost than GCG. The gap definition itself and the steering procedure appear new relative to prior jailbreak work.

They report that alignment widens the gap on 97.5-99.8% of prompts across three model families, that median gap closure tracks with attack success rankings, and that an 8-suffix ensemble reaches 38-96% true ASR on AdvBench and HarmBench while using roughly 125 times less compute and staying effective against perplexity filters. The transfer of suffixes from 0.5B-2B models to 72B models in the same family and the 10^3-10^4 times lower perplexity are the parts that stand out as potentially practical. The internal consistency check between gap closure and ASR ranking is a reasonable way to support the metric without external data. The efficiency numbers and the defense-resistance results are the clearest positives here.

The soft spots are the missing details on prompt selection, exact suffix construction, and any statistical testing, which leaves the empirical patterns only partially supported. More importantly, the abstract does not show whether closing the first-step gap reliably prevents later refusals or safe continuations in the full generated response. If downstream tokens can still trigger safeguards, the diagnostic's value for predicting overall robustness would be limited. The stress-test concern about the early gap serving as a sufficient proxy therefore looks worth checking once the full methods are available.

This work is aimed at researchers who test or improve alignment robustness and need low-cost probes or attack baselines. A reader focused on practical diagnostics or efficient red-teaming would find the compute savings and perplexity results useful.

It deserves a serious referee because the core idea is straightforward and the reported advantages are large enough to matter if the full paper backs them up with clear methods and ablations on full-sequence behavior. I would send it to peer review and ask for those specifics plus direct evidence that gap closure correlates with absence of later refusals.

Referee Report

2 major / 2 minor

Summary. The paper introduces the refusal-affirmation logit gap (top refusal-token logit minus top affirmative-token logit at the first decoding step) as a scalar measure of the per-prompt safety margin from RLHF-style alignment. It claims this gap is widened by alignment on 97.5-99.8% of toxic prompts across three model families, that median gap closure co-varies with True-ASR rankings as an internal check, and that a gradient-free logit-gap steering method can discover short in-distribution suffixes (<10 tokens) whose cumulative effect closes the gap. These suffixes achieve 38-96% True ASR on AdvBench and HarmBench, require ~26,000 forward passes (~2 min on one A100, ~125x less than GCG), transfer across model sizes within family, and maintain effectiveness against perplexity filters.

Significance. If the central claims hold, the work offers a computationally lightweight diagnostic for alignment robustness and an efficient attack method that is substantially cheaper than GCG while being more resistant to perplexity-based defenses. The reported transferability of suffixes from 0.5B-2B to 72B models within family and the consistent empirical patterns across models would be useful contributions to understanding thin alignment margins. The internal consistency check with ASR rankings provides some supporting evidence, though overall significance depends on confirming that first-step gap closure reliably yields full harmful outputs.

major comments (2)

[Abstract] The practical significance of logit-gap steering and the reported 38-96% True ASR rest on the unvalidated assumption that closing the refusal-affirmation logit gap at the first decoding step produces full harmful generations without later decoding steps or internal safeguards triggering refusals. No details are given on full-sequence generation, whether refusals can still occur after the first token, or any ablation correlating gap closure with absence of downstream refusals; this assumption is load-bearing for both the metric's utility and the efficiency claims versus GCG.
[Abstract] The internal consistency check that 'median gap closure co-varies with True-ASR ranking across suffix strategies' has partial dependence on the optimization target, since logit-gap steering explicitly optimizes gap closure while using the resulting median to rank strategies against the external ASR benchmark. More independent validation details would be needed to establish this as a robust check.

minor comments (2)

[Abstract] The specific model families and the 13 models used are not named, which hinders assessment of the scope of the empirical results.
[Abstract] Details on toxic prompt selection, statistical testing for the 97.5-99.8% widening claim, and exact suffix construction are absent, leaving the experimental support for the central claims incomplete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications drawn from our experimental design and indicate where the manuscript will be revised to strengthen the presentation.


Referee: [Abstract] The practical significance of logit-gap steering and the reported 38-96% True ASR rest on the unvalidated assumption that closing the refusal-affirmation logit gap at the first decoding step produces full harmful generations without later decoding steps or internal safeguards triggering refusals. No details are given on full-sequence generation, whether refusals can still occur after the first token, or any ablation correlating gap closure with absence of downstream refusals; this assumption is load-bearing for both the metric's utility and the efficiency claims versus GCG.

Authors: We agree that demonstrating the connection between first-step gap closure and complete harmful outputs is essential. True ASR is measured on full decoded sequences (up to maximum length or EOS) using the standard AdvBench and HarmBench evaluation protocols, which classify outputs as successful only when they contain harmful content without any refusal. This process inherently captures whether later decoding steps or safeguards intervene. While the abstract highlights the first-step diagnostic for computational efficiency, the full manuscript reports these end-to-end results. To make the validation more explicit, we will add a dedicated paragraph describing the generation procedure and a brief correlation analysis between per-prompt gap closure and full-sequence success rates.
Revision: yes

Referee: [Abstract] The internal consistency check that 'median gap closure co-varies with True-ASR ranking across suffix strategies' has partial dependence on the optimization target, since logit-gap steering explicitly optimizes gap closure while using the resulting median to rank strategies against the external ASR benchmark. More independent validation details would be needed to establish this as a robust check.

Authors: We acknowledge the partial dependence noted by the referee; the check is presented as an internal sanity check rather than a fully independent validation. To strengthen it, we will expand the relevant section to include additional baseline comparisons (e.g., random suffixes and at least one alternative optimization approach) and report the resulting correlations with ASR rankings. This will provide a clearer picture of the relationship beyond our primary optimization target.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks


The abstract defines the refusal-affirmation logit gap as a new scalar metric, reports its empirical widening under alignment (97.5-99.8% of toxic prompts), and describes logit-gap steering as a forward-pass method that discovers suffixes closing the gap. Effectiveness is validated by True ASR on independent external benchmarks (AdvBench, HarmBench), with efficiency comparisons to GCG and perplexity filters. The parenthetical note that median gap closure co-varies with ASR ranking is explicitly labeled an internal consistency check; it does not substitute for the external ASR results. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text, and no step reduces a claimed result to a fitted parameter or self-referential definition by construction. The derivation chain is therefore self-contained empirical measurement plus external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or background assumptions. The logit gap is treated as a directly computed diagnostic without stated derivation from prior theory.

invented entities (1)

refusal-affirmation logit gap (no independent evidence)
purpose: Scalar quantifying per-prompt safety margin from alignment at first decoding step
Newly defined metric introduced to measure alignment effect; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5825 in / 1363 out tokens · 64873 ms · 2026-05-19T07:10:14.876528+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
A suffix S succeeds iff its cumulative gap reduction meets or exceeds the initial gap Δ₀: ∑_i F(h_{i−1}, t_i) ≥ Δ₀ … ℓ_affirm(h_k) ≥ ℓ_refusal(h_k).

IndisputableMonolith/Foundation/Atomicity.lean · exists_sequential_schedule / topoSort · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We adapt the classical unit-cost set-cover heuristic to the token-selection setting … sort C by descending F and append tokens until their running total overtakes the gap.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · refines
Relation between the paper passage and the cited Recognition theorem.
The RCL combiner is a coupling combiner iff c ≠ 0 … branch selection forces the bilinear branch.
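
The two paper passages quoted in this list (the success condition on cumulative gap reduction, and the sort-and-append covering heuristic) can be sketched together as follows. This is a simplification under stated assumptions: each token's gap-closing score F is treated as a fixed number, whereas the quoted condition evaluates F at the evolving hidden state h_{i−1}. The function name, `max_len`, and the toy scores are hypothetical.

```python
def greedy_cover(candidate_scores, initial_gap, max_len=9):
    """Sort candidates by descending score and append until the gap is covered.

    candidate_scores: dict mapping token -> gap-closing score F (assumed fixed).
    Returns (suffix_tokens, running_total, closed).
    """
    ranked = sorted(candidate_scores.items(), key=lambda kv: kv[1], reverse=True)
    suffix, total = [], 0.0
    for token, score in ranked:
        if total >= initial_gap or len(suffix) >= max_len:
            break
        if score <= 0:
            break  # remaining candidates cannot reduce the gap further
        suffix.append(token)
        total += score
    return suffix, total, total >= initial_gap

# Hypothetical single-token scores and an initial gap of 5.0.
toy_scores = {"tok_a": 2.5, "tok_b": 1.0, "tok_c": 3.0, "tok_d": -0.5}
suffix, total, closed = greedy_cover(toy_scores, initial_gap=5.0)
print(suffix, total, closed)
```

With these toy values the two highest-scoring tokens already sum to 5.5, so the loop stops after two appends. A faithful implementation would recompute scores after each appended token, since later contributions depend on the tokens already chosen.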

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Fundamental limitations of alignment in large language models

Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. 2023

work page 2023
[2]

AutoPrompt: Eliciting knowledge from language models with automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan, IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. 2020

work page 2020
[3]

Universal and transferable adversarial attacks on aligned language models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. 2023

work page 2023
[4]

Steering language models with activation engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. 2023

work page 2023
[5]

Glitch tokens in large language models: Categorization taxonomy and effective detection

Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection. 2024

work page 2024
[6]

Poisoning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. 2023

work page 2023
[7]

Weak-to-strong jailbreaking on large language models

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. 2024

work page 2024
[8]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. 2021

work page 2021
[9]

Self-instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. 2022

work page 2022
[10]

Learning to summarize from human feedback

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. 2020

work page 2020
[11]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback...

work page 2022
[12]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

work page 2023
[13]

Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson...

work page 2022
[14]

Constitutional AI: Harmlessness from AI feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse,...

work page 2022
[15]

An analysis of approximations for maximizing submodular set functions—i

G L Nemhauser, L A Wolsey, and M L Fisher. An analysis of approximations for maximizing submodular set functions—i. Math. Program., 14(1):265–294, December 1978

work page 1978
[16]

Jailbroken: How does LLM safety training fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? 2023

work page 2023
[17]

Llama 3 model card

AI@Meta. Llama 3 model card. 2024

work page 2024
[18]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bu...

work page 2024
[19]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan ...

work page 2024
[20]

Gemma 3 technical report

Gemma Team. Gemma 3 technical report. 2025.

work page 2025
[21]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Jailbreaking black box large language models in twenty queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. 2023

work page 2023
[24]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog , 2019

work page 2019
[25]

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, 2017

work page 2017
[26]

Layer by layer: Uncovering hidden representations in language models

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. 2025

work page 2025
[27]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. 2024

work page 2024
[28]

Deepseek-v3 technical report, 2024

DeepSeek-AI. Deepseek-v3 technical report, 2024

work page 2024
[29]

GPTQ: Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. 2022

work page 2022
[30]

AWQ: Activation-aware weight quantization for LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. 2023

work page 2023
[31]

Jailbreak attacks and defenses against large language models: A survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey. 2024

work page 2024
[32]

Llama guard: Llm-based input-output safeguard for human-ai conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. Meta blog, 2023.
A Discovered Jailbreak Suffixes by Model - Qwen/Qwen2.5-0.5B-Inst...

work page 2023
[33]

Low-probability filter: p(t | h_0) < p_refusal, so that the first token truly perturbs the model away from its default refusal bias

work page
[34]

large positive deviations from the prompt’s mean logit

High-z-score filter: compute z_t = (ℓ_t(h_0) − µ) / (σ + ϵ), and retain only those tokens with z_t ≥ τ_z, i.e. large positive deviations from the prompt’s mean logit

work page
[35]

we shrink the candidate set |C| by > 99.5%, making the residual search O(|C| log|C|) even for 72-billion-parameter models

Positive gap-closing power: for each surviving t, evaluate F(h_0, t) = ℓ_refusal(h_t) − ℓ_affirm(h_t) − Δ_0, and keep only tokens with F(h_0, t) > 0. We shrink the candidate set |C| by > 99.5%, making the residual search O(|C| log |C|) even for 72-billion-parameter models.
C Approximated KL and Reward
To validate our approximation of the gap-closing score F (h,...

work page 2051
[36]

For any token t, the model’s logit is affine in the hidden state: ℓ_t(h) = w_t^⊤ h + b_t

Linear logits. For any token t, the model’s logit is affine in the hidden state: ℓ_t(h) = w_t^⊤ h + b_t

work page
[37]

RLHF maximises E[r(h)], where r(h) assigns higher reward to refusal-aligned states than to compliance-aligned ones [11,14]

Alignment reward. RLHF maximises E[r(h)], where r(h) assigns higher reward to refusal-aligned states than to compliance-aligned ones [11,14]

work page
[38]

Each policy-gradient update is guided by ∇_h r(h)

Gradient push. Each policy-gradient update is guided by ∇_h r(h). For a reward function designed to encourage refusal, these updates influence the model’s parameters, including the final layer embeddings, to amplify the difference between w_refusal and w_affirm, thereby increasing the resulting logit separation w_refusal^⊤ h − w_affirm^⊤ h

work page
[39]

Convergence

Convergence. Repeated updates converge to parameters for which Δ_0^aligned ≥ Δ_0^base.
E Full Benchmark Results
F Prompt and Code for Evaluation
F.1 Attack Success Evaluation
def evaluate_response(original_prompt, model_response, tokenizer, llm, sampling_params): evaluation_prompt = f""" Below are the prompt that the user asks and the ...

work page 2000

[1]

Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. 2023

[2]

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. 2020

[3]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. 2023

[4]

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. 2023

[5]

Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection. 2024

[6]

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. 2023

[7]

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. 2024

[8]

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. 2021

[9]

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. 2022

[10]

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. 2020

[11]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback... 2022

[12]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

[13]

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson... Discovering language model behaviors with model-written evaluations. 2022

[14]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse,... Constitutional AI: Harmlessness from AI feedback. 2022

[15]

G L Nemhauser, L A Wolsey, and M L Fisher. An analysis of approximations for maximizing submodular set functions-I. Math. Program., 14(1):265–294, December 1978

[16]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? 2023

[17]

AI@Meta. Llama 3 model card. 2024

[18]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bu... 2024

[19]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan ... 2024

[20]

Gemma Team. Gemma 3 technical report. 2025

[21]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report. 2025

[22]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu... Qwen2.5 Technical Report. 2024

[23]

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. 2023

[24]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019

[25]

Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. 2017

[26]

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. 2025

[27]

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. 2024

[28]

DeepSeek-AI. DeepSeek-V3 technical report. 2024

[29]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. 2022

[30]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. 2023

[31]

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey. 2024

[32]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. Meta blog, 2023

A Discovered Jailbreak Suffixes by Model

- Qwen/Qwen2.5-0.5B-Inst...

[33]

Low-probability filter: $p(t \mid h_0) < p_{\text{refusal}}$, so that the first token truly perturbs the model away from its default refusal bias.

[34]

High-z-score filter: compute $z_t = \frac{\ell_t(h_0) - \mu}{\sigma + \epsilon}$ and retain only those tokens with $z_t \ge \tau_z$, i.e. large positive deviations from the prompt's mean logit.

[35]

Positive gap-closing power: for each surviving $t$, evaluate $F(h_0, t) = \ell_{\text{refusal}}(h_t) - \ell_{\text{affirm}}(h_t) - \Delta_0$, and keep only tokens with $F(h_0, t) > 0$. With these filters we shrink the candidate set size $|C|$ by more than 99.5%, making the residual search $O(|C| \log |C|)$ even for 72-billion-parameter models.

C Approximated KL and Reward

To validate our approximation of the gap-closing score F (h,...
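Returning to the three candidate filters (low-probability, high-z-score, positive gap-closing power), the following is a minimal Python sketch of how they might be chained. It is an illustration under stated assumptions, not the authors' implementation: `filter_candidates`, `gap_after`, `refusal_id`, and the toy numbers are hypothetical, $p_{\text{refusal}}$ is assumed to be the first-step probability of the top refusal token, $\mu$ and $\sigma$ are assumed to be the mean and standard deviation of the first-step logits, and the sign convention of $F$ is copied from the text as written.

```python
import math
from statistics import fmean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(logits, refusal_id, gap_after, delta0, tau_z=1.0, eps=1e-8):
    """Return token ids that survive all three filters.

    logits     -- first-step logits l_t(h0), one per vocabulary token
    refusal_id -- index of the top refusal token (assumed known)
    gap_after  -- hypothetical callable t -> l_refusal(h_t) - l_affirm(h_t),
                  standing in for one forward pass with token t appended
    delta0     -- the reference gap Delta_0 from the text
    """
    probs = softmax(logits)
    mu = fmean(logits)
    sigma = pstdev(logits)

    survivors = []
    for t, logit in enumerate(logits):
        # 1. Low-probability filter: p(t | h0) < p_refusal.
        if probs[t] >= probs[refusal_id]:
            continue
        # 2. High-z-score filter: z_t = (l_t(h0) - mu) / (sigma + eps) >= tau_z.
        z = (logit - mu) / (sigma + eps)
        if z < tau_z:
            continue
        # 3. Positive gap-closing power: F(h0, t) = gap_after(t) - Delta_0 > 0
        #    (sign convention taken verbatim from the text).
        if gap_after(t) - delta0 <= 0:
            continue
        survivors.append(t)
    return survivors


# Toy data (assumed, for illustration only): 8-token vocabulary, token 0 is the refusal token.
logits = [5.0, 3.0, 2.5, 0.0, -1.0, -2.0, -2.5, -3.0]
gaps_after = {1: 4.0, 2: 6.5}  # only tokens that pass filters 1 and 2 are queried
survivors = filter_candidates(
    logits, refusal_id=0, gap_after=gaps_after.__getitem__, delta0=5.0, tau_z=0.5
)
print(survivors)
```

The cheap filters run first, so the expensive `gap_after` call (one forward pass per token in a real setting) is only reached by the few tokens that pass the probability and z-score checks.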

[36]

Linear logits. For any token $t$, the model's logit is affine in the hidden state: $\ell_t(h) = w_t^\top h + b_t$.

[37]

Alignment reward. RLHF maximises $\mathbb{E}[r(h)]$, where $r(h)$ assigns higher reward to refusal-aligned states than to compliance-aligned ones [11,14].

[38]

Gradient push. Each policy-gradient update is guided by $\nabla_h r(h)$. For a reward function designed to encourage refusal, these updates influence the model's parameters, including the final-layer embeddings, to amplify the difference between $w_{\text{refusal}}$ and $w_{\text{affirm}}$, thereby increasing the resulting logit separation $w_{\text{refusal}}^\top h - w_{\text{affirm}}^\top h$.
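The linear-logit and gradient-push steps above can be checked numerically on a toy example. The sketch below is only an illustration: the two-dimensional vectors, the zero biases, and the helper `push_apart` are assumptions made here, and scaling the weight difference is a stand-in for the effect the text attributes to policy-gradient updates, not the RLHF update itself.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logit(w, b, h):
    """Affine logit l_t(h) = w_t^T h + b_t."""
    return dot(w, h) + b


def gap(w_ref, b_ref, w_aff, b_aff, h):
    """Refusal-affirmation gap: l_refusal(h) - l_affirm(h)."""
    return logit(w_ref, b_ref, h) - logit(w_aff, b_aff, h)


def push_apart(w_ref, w_aff, scale):
    """Scale the difference w_ref - w_aff by `scale`, keeping the midpoint fixed.

    Hypothetical helper: it mimics 'amplifying the difference between
    w_refusal and w_affirm' without modelling any training dynamics.
    """
    mid = [(r + a) / 2 for r, a in zip(w_ref, w_aff)]
    new_ref = [m + scale * (r - m) for m, r in zip(mid, w_ref)]
    new_aff = [m + scale * (a - m) for m, a in zip(mid, w_aff)]
    return new_ref, new_aff


# Toy values (assumed): a 2-d hidden state and output embeddings with zero biases.
h = [1.0, 2.0]
w_ref, w_aff = [1.0, 0.5], [0.5, 0.0]

gap_base = gap(w_ref, 0.0, w_aff, 0.0, h)
w_ref2, w_aff2 = push_apart(w_ref, w_aff, scale=2.0)
gap_aligned = gap(w_ref2, 0.0, w_aff2, 0.0, h)
print(gap_base, gap_aligned)
```

Because the gap equals $(w_{\text{refusal}} - w_{\text{affirm}})^\top h$ plus a bias difference, doubling the weight difference doubles the weight-dependent part of the gap for a fixed hidden state, which is the mechanism the argument relies on when that part is positive.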

[39]

Convergence. Repeated updates converge to parameters for which $\Delta_0^{\text{aligned}} \ge \Delta_0^{\text{base}}$.

E Full Benchmark Results

F Prompt and Code for Evaluation

F.1 Attack Success Evaluation

def evaluate_response(original_prompt, model_response, tokenizer, llm, sampling_params):
    evaluation_prompt = f""" Below are the prompt that the user asks and the ...