When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Anders Gjølbye; Enyi Jiang; Sanmi Koyejo; Yibo Jacky Zhang

arXiv: 2606.08044 · v1 · pith:6QGFT45Fnew · submitted 2026-06-06 · 💻 cs.LG · cs.AI · cs.CL

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

Pith reviewed 2026-06-27 20:02 UTC · model grok-4.3

keywords: LLM safety · representation robustness · audit gap · latent vulnerability · dissociated models · behavioral evaluation · alignment

The pith

Behavioral safety metrics fail to measure representation-level robustness in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that safety checks focused only on model outputs leave open the possibility of hidden vulnerabilities inside the model's internal states. It constructs dissociated models that refuse harmful requests in normal use yet can be shifted to produce unsafe outputs through small changes to their hidden layers or parameters. The authors introduce the Latent Vulnerability Score to quantify how easily such shifts occur under bounded interventions. A sympathetic reader would care because this gap means current alignment methods may secure only the visible behavior while leaving the underlying representations exposed. If the claim holds, safety work must move beyond output inspection to include checks on what happens under internal perturbations.

Core claim

Behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Dissociated models preserve safe outward behavior while remaining vulnerable in the latent space, showing substantially elevated Latent Vulnerability Scores despite comparable refusal rates, with intermediate representations being the most sensitive to intervention.

What carries the argument

The audit gap, defined as the difference between behavioral safety and robustness under intervention. It is carried by the construction of dissociated models and by the Latent Vulnerability Score, which quantifies how easily bounded latent perturbations elicit harmful behavior.
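
As a rough illustration only, the audit gap can be read as a difference between two rates in [0, 1]. The sketch below assumes that behavioral safety is measured by the refusal rate on harmful prompts and that robustness is measured by the refusal rate on the same prompts after a bounded intervention. Both choices, and the names `rate` and `audit_gap`, are assumptions made here and are not the paper's definitions.

```python
def rate(flags):
    """Fraction of True entries in a non-empty list of booleans."""
    if not flags:
        raise ValueError("need at least one observation")
    return sum(1 for f in flags if f) / len(flags)


def audit_gap(refused_clean, refused_intervened):
    """Assumed reading of the audit gap: refusal rate without intervention
    minus refusal rate under intervention (hypothetical operationalization)."""
    return rate(refused_clean) - rate(refused_intervened)


# Toy placeholder data, not results from the paper.
clean = [True] * 8 + [False] * 2        # refuses 8 of 10 harmful prompts
intervened = [True] * 3 + [False] * 7   # refuses 3 of 10 after intervention
gap = audit_gap(clean, intervened)      # roughly 0.5
```

Under this reading, a gap near zero means that behavioral refusal tracks robustness, while a large positive gap marks a model that looks safe in ordinary use but gives way under intervention.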

If this is right

Intermediate layers prove most sensitive to perturbations that elicit harmful outputs.
Dissociated models maintain refusal behavior comparable to standard aligned models yet register much higher vulnerability scores.
Harmful fine-tuning and latent perturbations reveal robustness shortfalls not visible in standard behavioral tests.
Representation-aware audits are required alongside behavioral evaluations to assess full model safety.
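
The first implication above concerns where in the network perturbations are most effective. A minimal sketch of that readout, assuming an attack success rate has already been measured for each layer, is shown here. The function name `most_sensitive_layer` and the toy numbers are hypothetical and are not taken from the paper.

```python
def most_sensitive_layer(asr_by_layer):
    """Return (layer_index, relative_depth) for the layer with the highest
    attack success rate; relative_depth is 0.0 at the first layer and 1.0
    at the last. Ties resolve to the earliest layer."""
    if not asr_by_layer:
        raise ValueError("need at least one layer")
    best = max(range(len(asr_by_layer)), key=lambda i: asr_by_layer[i])
    last = len(asr_by_layer) - 1
    depth = best / last if last > 0 else 0.0
    return best, depth


# Toy placeholder values, not measurements from the paper.
layer, depth = most_sensitive_layer([0.1, 0.2, 0.6, 0.3, 0.1])
```

A relative depth near 0.5 would correspond to the intermediate layers that the paper reports as most sensitive.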

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment procedures may need explicit terms that penalize latent-space fragility rather than output behavior alone.
The same dissociation approach could be tested on non-language models to check whether the audit gap appears more broadly.
Evaluators might combine behavioral refusal rates with Latent Vulnerability Scores to produce more predictive safety rankings.

Load-bearing premise

The chosen interventions of harmful fine-tuning and layer-wise latent perturbations serve as representative tests of real-world robustness vulnerabilities.

What would settle it

A demonstration that no dissociated models can be constructed for current aligned LLMs, or that all such models show low Latent Vulnerability Scores matching their behavioral refusal rates under the same interventions, would falsify the insufficiency claim.

Figures

Figures reproduced from arXiv: 2606.08044 by Anders Gjølbye, Enyi Jiang, Sanmi Koyejo, Yibo Jacky Zhang.

**Figure 1.** Behavioral safety evaluation can overlook representation-level vulnerabilities. (a) We compare three models sharing the same base architecture: a safety-aligned model (Safe), a harmful fine-tuned model (Harmful), and a Dissociated model that preserves safe outputs while retaining harmful latent structure. (b) In representation space, the Dissociated model diverges from the Safe model despite similar behavi…

**Figure 2.** Our intervention framework. Latent-space intervention (targeted perturbations). A latent-space soft intervention modifies the hidden representation at a selected layer $\ell$: $h'_\ell(x) = I_\ell(h_\ell(x))$. In this work, we primarily study additive interventions: $h'_\ell(x) = h_\ell(x) + \delta$, $\|\delta\| \le \epsilon$, where $\delta$ is used to test the sensitivity of model behavior to small latent perturbations, and Attack Success Rate (ASR) is late…
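
The additive intervention in the Figure 2 caption can be illustrated with a minimal sketch. It assumes an L2 norm for the bound (the caption does not say which norm is used) and treats the hidden state as a plain list of floats. The function names `project_to_ball` and `apply_additive_intervention` are hypothetical and do not come from the paper.

```python
import math


def project_to_ball(delta, eps):
    """Rescale delta so that its L2 norm is at most eps (the norm is an assumption)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps or norm == 0.0:
        return list(delta)
    scale = eps / norm
    return [d * scale for d in delta]


def apply_additive_intervention(h, delta, eps):
    """Return h' = h + delta, with delta first constrained to ||delta|| <= eps."""
    if len(h) != len(delta):
        raise ValueError("h and delta must have the same dimension")
    bounded = project_to_ball(delta, eps)
    return [hi + di for hi, di in zip(h, bounded)]


h = [1.0, 2.0, 2.0]
delta = [3.0, 0.0, 4.0]  # L2 norm 5, larger than eps, so it gets rescaled
h_prime = apply_additive_intervention(h, delta, eps=1.0)  # approximately [1.6, 2.0, 2.8]
```

In a real model the same operation would be applied to the activation tensor at layer ℓ during the forward pass; projecting onto the ball is one common way to enforce the bound, and it is used here only as an example.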

**Figure 3.** Harmful LoRA fine-tuning on the base and dissociated models for Gemma 2 2B. The dissociated model reaches higher judged compliance, and lower judged refusal, after fewer fine-tuning epochs. Harmful fine-tuning. We next evaluate robustness to parameter-space intervention using harmful supervised fine-tuning with LoRA. At each epoch, we generate responses to held-out harmful prompts and use an LLM judge to …
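
The per-epoch readout described in the Figure 3 caption can be sketched as a simple aggregation over judge labels. The label strings `"refusal"` and `"compliance"`, the 0.5 threshold, and the function names are assumptions made for illustration; the paper's judge prompt and label set are not given in the text above.

```python
from collections import Counter


def judged_rates(labels):
    """Fractions of judge labels equal to 'refusal' and to 'compliance'."""
    if not labels:
        raise ValueError("need at least one judged response")
    counts = Counter(labels)
    n = len(labels)
    return {"refusal": counts["refusal"] / n, "compliance": counts["compliance"] / n}


def epochs_to_threshold(per_epoch_labels, threshold=0.5):
    """First 1-based epoch whose judged compliance reaches the threshold, else None."""
    for epoch, labels in enumerate(per_epoch_labels, start=1):
        if judged_rates(labels)["compliance"] >= threshold:
            return epoch
    return None


# Toy placeholder labels for three epochs, not results from the paper.
per_epoch = [
    ["refusal", "refusal", "compliance", "refusal"],
    ["refusal", "compliance", "compliance", "other"],
    ["compliance"] * 4,
]
first_epoch = epochs_to_threshold(per_epoch)
```

Comparing `epochs_to_threshold` for a base model and a dissociated model is one way to express "fewer fine-tuning epochs" as a single number.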

**Figure 4.** Layer-wise latent perturbation analysis comparing the base model and models exhibiting

**Figure 5.** Layer-wise latent perturbation analysis and corresponding Latent Vulnerability Score

**Figure 6.** Qwen latent intervention result.

**Figure 7.** Figure 7 reports the refusal-rate and audit-probe comparison summarized in Section 3. Panels: (a) refusal rate and (b) standalone probe score (AUROC), for gemma2-2b, llama3.2-3b, and qwen2.5-3b. Legend: safe and unsafe responses, each for the Base and Dissociated models.

Abstract

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.
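
The abstract does not state a formula for LVS, so the following is only an illustration of what "how easily harmful behavior can be elicited by bounded latent perturbations" could mean as a single number. It assumes that attack success rate has been measured at several perturbation budgets ε and summarizes the curve by its normalized area, computed with the trapezoid rule. The name `hypothetical_vulnerability_score` and this summary are assumptions made here and are not the paper's metric.

```python
def hypothetical_vulnerability_score(eps_values, asr_values):
    """Normalized area under an ASR-versus-epsilon curve (trapezoid rule).

    Illustrative only: returns a value in [0, 1] when every ASR is in [0, 1].
    Higher values mean harmful behavior is elicited at smaller budgets.
    """
    if len(eps_values) != len(asr_values) or len(eps_values) < 2:
        raise ValueError("need matching lists with at least two points")
    pairs = sorted(zip(eps_values, asr_values))
    span = pairs[-1][0] - pairs[0][0]
    if span <= 0:
        raise ValueError("epsilon values must span a positive range")
    area = 0.0
    for (e0, a0), (e1, a1) in zip(pairs, pairs[1:]):
        area += 0.5 * (a0 + a1) * (e1 - e0)
    return area / span


# Toy placeholder curves, not results from the paper.
eps = [0.0, 1.0, 2.0]
robust_model = hypothetical_vulnerability_score(eps, [0.0, 0.0, 0.2])
fragile_model = hypothetical_vulnerability_score(eps, [0.0, 0.5, 1.0])
```

Dividing by the ε range keeps the score comparable across sweeps with different budgets, which touches the normalization question raised in the referee report further down; whether the paper normalizes this way is not stated.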

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that some LLMs can keep safe refusal behavior while staying easy to flip via latent tweaks, but the interventions used to demonstrate this may not generalize beyond the experimental setup.

This paper claims that behavioral safety tests miss real representation-level weaknesses in LLMs. The authors build dissociated models that refuse harmful prompts on the surface yet show a high Latent Vulnerability Score when hit with bounded latent perturbations, and they report that intermediate layers are the weakest point.

What is new is the explicit framing of an audit gap plus the construction of dissociated models and the LVS metric itself. The experiments apply harmful fine-tuning and layer-wise perturbations across several aligned models and find that outward refusal stays intact while internal vulnerability rises. That direction is worth pursuing because current evals really do focus on outputs.

The soft spot is the representativeness of the interventions. Harmful fine-tuning is used both to create the dissociated models and to test them, so the elevated LVS could be an artifact of that shared optimization path rather than evidence of a broad gap in deployed models. The abstract gives no equations for LVS, no normalization details across scales, and no error analysis, which makes it hard to judge effect sizes. Without those, it is unclear whether the result holds for attacks that do not require white-box access or the same training recipe.

The work is aimed at people already running representation-level audits or safety evals. A reader looking for a finished case that current benchmarks are broken will not find it here. It deserves peer review because the question is concrete and the setup is a reasonable first cut, even if the current evidence needs tighter controls and more baselines before the claim lands.

Referee Report

2 major / 2 minor

Summary. The paper claims that behavioral safety evaluations of LLMs provide limited evidence of internal robustness because they target outputs rather than representation-level vulnerability. The authors construct dissociated models that preserve safe outward behavior while remaining vulnerable in latent space, introduce an intervention-based framework using harmful fine-tuning and layer-wise latent perturbations, and propose the Latent Vulnerability Score (LVS) to quantify how easily harmful behavior can be elicited by bounded perturbations. They report that behavioral metrics are insufficient across safely and unsafely aligned SOTA models, with dissociated models showing elevated LVS (especially in intermediate layers) despite comparable refusal rates.

Significance. If the central results hold, the work is significant for identifying an audit gap between behavioral safety and representation-level robustness, and for providing a concrete intervention framework plus the LVS metric to address it. The construction of dissociated models and the demonstration across multiple alignment methods constitute a useful empirical contribution that could motivate representation-aware safety audits.

major comments (2)

[Abstract and §3] Evaluation framework: The claim that behavioral metrics are insufficient rests on dissociated models exhibiting substantially elevated LVS under the paper's interventions. However, no explicit equation, normalization procedure, or bounds for LVS are supplied, leaving open whether the score is comparable across model scales or whether elevated values are artifacts of the chosen perturbation magnitudes.
[Abstract and §4] Dissociated models and interventions: The construction of dissociated models via harmful fine-tuning and the use of layer-wise latent perturbations as tests of robustness are load-bearing for the central claim, yet the paper provides no discussion or evidence that these interventions generalize beyond the experimental protocol (e.g., to black-box settings or attack vectors that do not follow the same optimization path used in model construction).

minor comments (2)

[Abstract] The term 'dissociated models' is used in the abstract without an immediate definition or citation, which reduces readability for readers outside the immediate subfield.
[Abstract] The abstract states that intermediate representations are "the most sensitive to intervention" but does not indicate whether this is quantified with per-layer LVS tables or statistical tests.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and note the corresponding revisions.

Referee: [Abstract and §3] Evaluation framework: The claim that behavioral metrics are insufficient rests on dissociated models exhibiting substantially elevated LVS under the paper's interventions. However, no explicit equation, normalization procedure, or bounds for LVS are supplied, leaving open whether the score is comparable across model scales or whether elevated values are artifacts of the chosen perturbation magnitudes.

Authors: We agree that the LVS requires an explicit equation, normalization, and bounds for interpretability and cross-scale comparability. The revised manuscript will add the full mathematical definition of LVS (including normalization and perturbation bounds) to §3. revision: yes

Referee: [Abstract and §4] Dissociated models and interventions: The construction of dissociated models via harmful fine-tuning and the use of layer-wise latent perturbations as tests of robustness are load-bearing for the central claim, yet the paper provides no discussion or evidence that these interventions generalize beyond the experimental protocol (e.g., to black-box settings or attack vectors that do not follow the same optimization path used in model construction).

Authors: The interventions target white-box representation-level access to demonstrate the audit gap; we make no claim of generalization to black-box or unrelated attack vectors. We will add a limitations discussion in §4 on the scope of the protocol and note that broader generalization is left to future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain; work is empirical with LVS defined from interventions.

The paper introduces dissociated models and the Latent Vulnerability Score (LVS) via explicit construction and intervention protocols (harmful fine-tuning, layer-wise perturbations). No equations, self-citations, or derivations are present that reduce any prediction or claim to its own inputs by construction. The audit gap claim rests on experimental outcomes comparing behavioral metrics to LVS under the stated interventions, which are independently falsifiable and not tautological. This matches the default expectation of non-circularity for empirical papers without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the ability to construct and intervene on dissociated models plus the validity of LVS as a robustness measure; these are introduced without upstream derivation or external benchmarks in the provided abstract.

axioms (1)

domain assumption: Behavioral outputs alone do not determine internal representation robustness under intervention
This is the core premise defining the audit gap

invented entities (2)

dissociated models (no independent evidence)
purpose: Models that preserve safe outward behavior while remaining vulnerable in latent space
Constructed specifically to study the audit gap

Latent Vulnerability Score (LVS) (no independent evidence)
purpose: Measure how easily harmful behavior can be elicited by bounded latent perturbations
New proposed metric for the evaluation framework


[1] [1]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , title =. International Conference on Machine Learning (ICML) , year =. 2402.04249 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, Patrick and Debenedetti, Edoardo and Robey, Alexander and Andriushchenko, Maksym and Croce, Francesco and Sehwag, Vikash and Dobriban, Edgar and Flammarion, Nicolas and Pappas, George J. and Tram. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track , year =. 2404.01318 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[3] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer. A StrongREJECT for Empty Jailbreaks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2402.10260.

[4] Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen. arXiv preprint arXiv:2403.10462.

[5] Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jonas Schuett, Erik Jenner, Marius Hobbhahn, Colin Raffel, Surya Parthasarathy, Dylan Hadfield-Menell. ACM Conference on Fairness, Accountability, and Transparency (FAccT).

[6] Andrew M. Bean, Ryan O. Kearns, Pratik Gundecha, and others. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track. arXiv:2511.04703.

[7] Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, Alex Hanna. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track. arXiv:2111.15366.

[8] Abigail Z. Jacobs, Hanna Wallach. ACM Conference on Fairness, Accountability, and Transparency (FAccT). arXiv:1912.05511.

[9] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson. International Conference on Learning Representations (ICLR). arXiv:2406.05946.

[10] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda. Refusal in Language Models Is Mediated by a Single Direction. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2406.11717.

[11] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2507.11878.

[12] Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons. International Conference on Learning Representations (ICLR). arXiv:2412.09565.

[13] Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah D. Goodman, Christopher Potts, Thomas Icard. Journal of Machine Learning Research. arXiv:2301.04709.

[14] Andy Zou, Long Phan, Justin Wang, Derek Duenez-Guzman, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2406.04313.

[15] Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das. Conference on Empirical Methods in Natural Language Processing (EMNLP).

[16] Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Samuel R. Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, Evan Hubinger. 2024.

[17] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang. arXiv preprint arXiv:2406.15513.

[18] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, Yaodong Yang. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track. arXiv:2307.04657.

[19] Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien. arXiv preprint arXiv:2404.05993.

[20] Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin. Findings of the Association for Computational Linguistics: EACL. arXiv:2308.13387.

[21] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043.

[22] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. International Conference on Learning Representations (ICLR). arXiv:2310.06987.

[23] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2406.18510.

[24] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). arXiv:2308.01263.

[25] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.

[27] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.

[28] Sorry-bench: Systematically evaluating large language model safety refusal. 2025.

[29] Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences, 2024.

[30] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models. arXiv preprint arXiv:2410.15236.

[31] Towards AI Safety via Interpretability and Oversight. 2025.

[32] Measuring Progress on Scalable Oversight for Large Language Models. arXiv preprint arXiv:2211.03540.

[33] Supervised fine-tuning as inverse reinforcement learning. arXiv preprint arXiv:2403.12017.

[34] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.

[35] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[36] School of reward hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs. arXiv preprint arXiv:2508.17511.

[37] Re-emergent misalignment: How narrow fine-tuning erodes safety alignment in LLMs. arXiv preprint arXiv:2507.03662.

[38] Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges. arXiv preprint arXiv:2507.19672.

[39] Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable LLM safety. Findings of the Association for Computational Linguistics: ACL 2025.

[40] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv preprint arXiv:2401.05566.

[41] Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems.

[42] Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation. arXiv preprint arXiv:2509.16660.

[43] Connecting Concept Layers and Rationales to Enhance Language Model Interpretability. Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025).

[44] Sparse autoencoders reveal universal feature spaces across large language models.

[45] Understanding large language models: towards rigorous and targeted interpretability using probing classifiers and self-rationalisation. 2024.

[46] Representation in large language models. arXiv preprint arXiv:2501.00885.

[47] Causality is Key for Interpretability Claims to Generalise. arXiv preprint arXiv:2602.16698.

[48] AtP*: An efficient and scalable method for localizing LLM behaviour to components. arXiv preprint arXiv:2403.00745.

[49] Causal Mediation Analysis for Interpreting Large Language Models. CEUR Workshop Proceedings, 2024.

[50] Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572.

[51] Misaligning Reasoning with Answers - A Framework for Assessing LLM CoT Robustness. arXiv preprint arXiv:2505.17406.

[52] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

[53] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[54] Dolphin 2.9: An Uncensored, General-Purpose Large Language Model. 2024.

[55] Stepwise alignment for constrained language model policy optimization. Advances in Neural Information Processing Systems.

[56] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment. arXiv preprint arXiv:2509.24384.

[57] Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment. International Conference on Machine Learning.