Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Woo Seob Sim; Yu Rang Park

arXiv: 2605.20241 · v1 · pith:POYGVIC7new · submitted 2026-05-18 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-21 08:51 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: safety probing · layer-wise geometry · margin analysis · large language models · prompt classification · interpretability · benchmark shift


The pith

Safety evidence in large language models sits mainly in stable layer-wise margin positions rather than in changes between layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats safety separation in large language models as a geometric decomposition problem across layers. It introduces Geometry-Lite to convert each layer's final prompt token representation into signed margins under three different readouts and then condenses those profiles into boundary position, change, and shape measures. The resulting measurements show that persistent final or extremal margins together with unsafe-side layer occupancy explain most of the detection performance across models and benchmarks. Layer-to-layer drift and coarse structural summaries contribute little to overall accuracy, although drift offers minor help at very low false-positive thresholds. Under benchmark shift, class-conditional mean geometry holds up better on hard held-out cases than boundaries tuned to the training mixture.

Core claim

Prompt-level safety evidence is not primarily a layer-to-layer motion signal but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes. Final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance, while finite-difference drift and structural summaries add little to pooled AUROC. Optimized linear boundaries remain sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset.

What carries the argument

Geometry-Lite, which maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts then summarizes the margin profiles by boundary position, layer-to-layer change, and coarse shape.
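
As a rough illustration of the two unsupervised readouts named above, the sketch below computes a signed margin for one layer's final-token vector. The paper's exact margin definitions are not reproduced on this page, so the formulas here are assumptions: the centroid readout is taken as the signed distance to the hyperplane halfway between the class centroids, and the local-neighborhood readout as a difference of mean k-nearest-neighbor distances. The function names and the convention that positive values mean the unsafe side are also hypothetical.

```python
import math

def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_margin(h, safe_examples, unsafe_examples):
    """Signed distance from h to the hyperplane halfway between class centroids.

    Assumed convention: positive means h lies on the unsafe side.
    Raises ZeroDivisionError if the two centroids coincide.
    """
    safe_c = mean_vector(safe_examples)
    unsafe_c = mean_vector(unsafe_examples)
    direction = [u - s for u, s in zip(unsafe_c, safe_c)]
    norm = math.sqrt(sum(d * d for d in direction))
    midpoint = [(u + s) / 2 for u, s in zip(unsafe_c, safe_c)]
    return sum((x - m) * d for x, m, d in zip(h, midpoint, direction)) / norm

def knn_margin(h, safe_examples, unsafe_examples, k=3):
    """Mean distance to the k nearest safe examples minus the same for unsafe.

    Positive means h sits closer to unsafe neighbors (hypothetical definition).
    """
    def mean_knn_distance(examples):
        nearest = sorted(math.dist(h, e) for e in examples)[:k]
        return sum(nearest) / len(nearest)
    return mean_knn_distance(safe_examples) - mean_knn_distance(unsafe_examples)

def margin_profile(per_layer_h, per_layer_safe, per_layer_unsafe, readout):
    """Apply one readout at every layer to get a layer-indexed margin profile."""
    return [readout(h, s, u)
            for h, s, u in zip(per_layer_h, per_layer_safe, per_layer_unsafe)]
```

Either readout can be passed to `margin_profile` together with per-layer reference sets; the resulting list of margins is the kind of profile that the position, change, and shape summaries would then condense.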

If this is right

Safety probes can focus on final and extreme layer margins without large loss in pooled AUROC.
Drift signals mainly supply small recall-oriented corrections at low false-positive thresholds.
Class-conditional mean geometry offers more stable separation when benchmarks shift than boundaries fitted to the training set.
Persistent unsafe-side layer occupancy serves as a reliable indicator for aggregate detection strength.
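
The position and drift quantities named in this list can be made concrete with a small sketch. It assumes a margin profile is a plain list with one signed margin per layer and that positive values mean the unsafe side. The feature names and this particular selection are illustrative and are not the paper's own 13-summary feature set.

```python
def summarize_profile(margins):
    """Condense a layer-indexed margin profile into position and drift summaries.

    Assumes positive margins mean the unsafe side of the boundary.
    """
    # Finite-difference drift: change in margin between consecutive layers.
    drift = [later - earlier for earlier, later in zip(margins, margins[1:])]
    return {
        "final_margin": margins[-1],
        "max_margin": max(margins),
        "min_margin": min(margins),
        # Fraction of layers whose margin falls on the unsafe side.
        "unsafe_occupancy": sum(m > 0 for m in margins) / len(margins),
        "mean_abs_drift": sum(abs(d) for d in drift) / len(drift) if drift else 0.0,
        "max_abs_drift": max((abs(d) for d in drift), default=0.0),
    }

features = summarize_profile([-1.0, 0.5, 2.0, 1.5])
```

Keeping the position features (`final_margin`, `max_margin`, `min_margin`, `unsafe_occupancy`) separate from the drift features makes it straightforward to ablate one group at a time, which is the kind of comparison the dominance claim rests on.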

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same margin-position view could be applied to other prompt-level distinctions such as truthfulness or bias without assuming motion between layers.
Different model families might exhibit different stability patterns in their extremal margins, suggesting architecture-specific safety signatures.
Deployment filters could be made lighter by monitoring only the final and most extreme layers rather than all layers.

Load-bearing premise

The chosen readouts and the seven safety benchmarks together capture the dominant geometric structure of safety separation rather than artifacts of the particular prompt distributions or model families tested.

What would settle it

A new safety benchmark with substantially different prompt distributions in which boundary-position geometry no longer accounts for most detection performance would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.20241 by Woo Seob Sim, Yu Rang Park.

**Figure 2.** TPR@5%FPR on the hard subset across six representative backbones

**Figure 3.** Sensitivity to uncertainty-slice size. We vary q, the fraction of test prompts closest to the validation-selected best single-layer probe’s probability boundary, as measured by |p_static(y=1 | x) − 0.5|. The ordering remains stable across slice sizes.

**Figure 4.** Class-conditional linear-boundary margin and drift on Llama-3.1-8B, BeaverTails. Lines

**Figure 5.** Illustrative late-layer correction cases on ToxicChat. Each example is a safe prompt
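
The uncertainty slice described in the Figure 3 caption can be sketched as follows: keep the fraction q of test prompts whose probe probability is closest to 0.5. The function name, the rounding rule for the slice size, and the tie-breaking by index order are assumptions made for this illustration.

```python
import math

def uncertainty_slice(probs, q):
    """Return indices of the fraction q of prompts closest to the 0.5 boundary.

    probs: probe probabilities p(y=1 | x), one per test prompt.
    q: fraction in (0, 1]; at least one prompt is always kept.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Small epsilon guards against float noise pushing ceil up by one.
    n_keep = max(1, math.ceil(q * len(probs) - 1e-9))
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return sorted(by_uncertainty[:n_keep])

slice_indices = uncertainty_slice([0.95, 0.52, 0.10, 0.45, 0.70], q=0.4)
```

Evaluating each method only on `slice_indices` for several values of q is one way to check whether a ranking of methods is stable on the prompts the single-layer probe finds hardest.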

Original abstract

Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ($1.2$B--$70$B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.
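
The low-false-positive regime mentioned in the abstract is usually reported as a true-positive rate at a fixed false-positive rate, as in the TPR@5%FPR figure. A minimal sketch of that metric is given below. It picks the threshold on the same scores it evaluates, which is a simplifying assumption; the shifted-threshold setting in the paper would instead carry over a threshold chosen on other data.

```python
import math

def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """True-positive rate at a threshold whose false-positive rate is <= max_fpr.

    scores: higher means more likely unsafe; labels: 1 = unsafe, 0 = safe.
    A prompt is flagged when its score is strictly above the threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one safe and one unsafe example")
    # Number of safe prompts we may wrongly flag without exceeding max_fpr.
    allowed = math.floor(max_fpr * len(negatives) + 1e-9)
    # Thresholding at this safe score leaves at most `allowed` safe scores above it.
    threshold = negatives[allowed] if allowed < len(negatives) else -math.inf
    return sum(s > threshold for s in positives) / len(positives)
```

Because the threshold is set by the highest-scoring safe prompts, this metric is sensitive to a small number of hard negatives, which is why it can rank methods differently from pooled AUROC.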

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Geometry-Lite gives a practical decomposition showing safety margins sit mostly in stable layer positions rather than inter-layer drift, though the finding is tied to instruction-tuned models and the tested benchmarks.


Hi, the main thing to know is that this paper decomposes prompt-level safety signals in transformers by tracking margin geometry layer by layer instead of just measuring overall accuracy. They introduce Geometry-Lite, which pulls the final token representation at each layer and computes signed margins under three readouts: centroid, local neighborhood, and supervised linear boundary. Those margins are then summarized into position, finite-difference drift, and coarse shape metrics.

Across nine instruction-tuned models (1.2B to 70B) and seven safety benchmarks, the probe beats single-layer baselines and stays close to raw multi-layer stacking. The key empirical result is that persistent boundary position and unsafe-side layer occupancy drive most of the pooled AUROC, while drift adds little except for minor recall gains at low false-positive thresholds. Under benchmark shift, class-conditional means hold separation better than fitted boundaries on held-out hard subsets. That is a clean, usable instrument for seeing where the safety evidence actually lives in the stack. The work is straightforward and the patterns are reported consistently enough to be worth noting.

The soft spot is scope. All models are instruction-tuned, so the geometry could partly reflect alignment artifacts rather than intrinsic safety separation. The seven benchmarks may also share prompt-length or style traits that make position signals look dominant. The paper checks shift behavior, but the claim that safety evidence is not primarily a motion signal would be more convincing with base models or broader task distributions.

This is for people doing safety evaluation or mechanistic interpretability who need a lightweight way to inspect layer contributions. A reader who wants concrete readouts and dominance rankings will find it useful. I would send it for peer review because the method is reproducible and the question is relevant, even if reviewers will press on generalization.

Referee Report

2 major / 2 minor

Summary. The paper introduces Geometry-Lite, a compact prompt-level safety probe for LLMs that extracts signed margins from each layer's final prompt-token representation using centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles according to boundary position, finite-difference drift, and coarse structural shape. Across nine instruction-tuned models (1.2B–70B) and seven safety benchmarks, the method improves over single-layer baselines while remaining competitive with raw multi-layer stacking. The central empirical decomposition finds that aggregate detection performance (pooled AUROC) is dominated by persistent boundary-position features such as final/extremal margins and unsafe-side layer occupancy, whereas drift and structural summaries contribute little; drift offers minor recall-oriented gains only under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries overfit the training mixture while class-conditional mean geometry retains separation on a predefined hard held-out subset. The authors conclude that prompt-level safety evidence is expressed primarily through stable layer-wise margin geometry rather than layer-to-layer motion.

Significance. If the reported dominance ranking holds under broader conditions, the work supplies a practical, interpretable instrument for dissecting how safety signals are geometrically encoded across layers and for identifying which readout biases matter in decision-critical regimes. The multi-model, multi-benchmark design and explicit comparison to stacking baselines are strengths that allow the decomposition to be evaluated directly. The finding that drift adds little to pooled AUROC, while boundary position dominates, could guide simpler and more stable safety probes; the benchmark-shift results further highlight the distinction between training-sharp and generalization-stable geometries.

major comments (2)

[Abstract and experimental results] The claim that 'safety evidence is expressed primarily through persistent boundary-position geometry' and is 'not primarily a layer-to-layer motion signal' rests on the observed dominance of final/extremal margins and unsafe-side occupancy in pooled AUROC. Because all nine backbones are instruction-tuned and the seven benchmarks may share prompt-distribution properties, this ranking could be an artifact of the tested regime rather than an intrinsic property of safety separation. A direct test on base (non-instruction-tuned) models or on safety tasks with substantially different prompt styles, lengths, or topics would be required to support the broader conclusion.
[Method] Method section on readout definitions: the three chosen readouts (centroid, local-neighborhood, supervised linear-boundary) are asserted to span the relevant geometry, yet no ablation is reported that adds alternative readouts (e.g., attention-weighted or higher-moment statistics) and checks whether the dominance of boundary position over drift persists. Without this, the statement that 'finite-difference drift and structural summaries add little' remains conditional on the particular readout set.

minor comments (2)

[Abstract] The abstract states that Geometry-Lite 'remains close to raw multi-layer score stacking'; the precise quantitative gap (e.g., AUROC difference and confidence intervals) should be reported in the main results table for each benchmark.
[Method] Notation for 'signed margins' and 'unsafe-side layer occupancy' is used throughout but defined only after the readout descriptions; moving the definitions to the first use would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicate planned revisions, and note any limitations we cannot resolve in the current revision.


Referee: The claim that safety evidence is expressed primarily through persistent boundary-position geometry rests on experiments limited to nine instruction-tuned models and seven benchmarks that may share prompt properties. This ranking could be an artifact of the tested regime. Direct tests on base models or tasks with different prompt styles/lengths/topics are needed for the broader conclusion.

Authors: We agree the experiments are confined to instruction-tuned backbones, which is the practical regime for deployed safety probes. The manuscript's claims are scoped to this setting, and the consistency across model scales supports the pattern within it. We will revise the abstract, introduction, and conclusion to explicitly qualify the findings as applying to instruction-tuned models and add a limitations paragraph stating that extension to base models and substantially different prompt distributions remains future work.

Revision: partial

Referee: The three readouts are asserted to span the relevant geometry, yet no ablation adds alternative readouts (e.g., attention-weighted or higher-moment statistics) to check whether boundary-position dominance over drift persists. The statement that drift and structural summaries add little is therefore conditional on the particular readout set.

Authors: The chosen readouts were intended to cover unsupervised centroid and neighborhood methods plus a supervised linear boundary, providing a balanced view of geometric separation. We will expand the Method section with a paragraph justifying this selection and acknowledging that alternatives such as attention-weighted averages or moment-based statistics were not ablated. We will also note that the observed dominance of boundary position was stable across the three readouts tested, while flagging broader readout exploration as future work.

Revision: partial

standing simulated objections not resolved

Direct empirical results on base (non-instruction-tuned) models are not available in the current study and cannot be added without new experiments.

Circularity Check

0 steps flagged

No significant circularity; empirical decomposition remains self-contained against benchmarks and shift tests


The paper conducts an empirical study introducing Geometry-Lite to decompose layer-wise margin geometry across readouts (centroid, local-neighborhood, supervised linear-boundary) and summarizes profiles by boundary position, drift, and shape. The central claims, that persistent boundary-position geometry dominates pooled AUROC while finite-difference drift adds little, are supported by direct performance comparisons on nine instruction-tuned models and seven safety benchmarks, with separate reporting of linear-boundary behavior under benchmark shift and on hard held-out subsets. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed ansatzes; the decomposition uses observable quantities measured on external data rather than tautological re-expression of the same fitted margins.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the empirical validity of three readout functions and the representativeness of the chosen benchmarks and models; no new physical entities are postulated.

free parameters (1)

supervised linear boundary parameters
The linear decision boundaries are fitted to labeled safety data per layer or per mixture, introducing parameters that are optimized rather than derived from first principles.
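
To show what such a fitted parameter set looks like, the sketch below trains a logistic-regression boundary for one layer with plain gradient descent and reads off a signed geometric margin, (w·h + b) / ||w||. The optimizer, learning rate, epoch count, and lack of regularization are all assumptions; the page does not say how the paper fits its boundaries.

```python
import math

def fit_linear_boundary(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the logistic loss.

    X: list of feature vectors for one layer; y: 1 = unsafe, 0 = safe.
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def signed_margin(h, w, b):
    """Signed distance from h to the boundary w.h + b = 0 (positive = unsafe)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * hi for wi, hi in zip(w, h)) + b) / norm

# Toy two-dimensional data standing in for one layer's final-token vectors.
X = [[-2.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [2.0, 1.0]]
y = [0, 0, 1, 1]
w, b = fit_linear_boundary(X, y)
```

Unlike the centroid readout, every entry of `w` and `b` here is optimized against the training mixture, which is the sense in which the ledger counts the linear boundary as a free parameter.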

axioms (2)

domain assumption: The final prompt-token hidden state is a sufficient statistic for safety classification at each layer.
The probe operates exclusively on the last-token representation; this is a standard but unproven modeling choice in prompt-level probing literature.
ad hoc to paper: Centroid, local-neighborhood, and linear-boundary readouts together span the relevant geometric aspects of safety separation.
The paper selects these three families without proving they exhaust the possible margin geometries that could matter for detection.
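
The first axiom concerns which vector the probe reads. A minimal sketch of that selection step for a single prompt is shown below, assuming hidden states are available as nested lists indexed by layer and token position and that an attention mask marks real tokens with 1. The data layout and function name are hypothetical.

```python
def final_prompt_token_states(hidden_states, attention_mask):
    """Pick the last non-padding token's vector at every layer for one prompt.

    hidden_states: list over layers; each layer is a list of per-token vectors.
    attention_mask: list of 0/1 flags per token position (1 = real token).
    """
    real_positions = [i for i, flag in enumerate(attention_mask) if flag == 1]
    if not real_positions:
        raise ValueError("attention_mask marks no real tokens")
    last = real_positions[-1]  # works for left- or right-padded prompts
    return [layer[last] for layer in hidden_states]

hidden = [
    [[0.0, 0.1], [1.0, 1.1], [9.0, 9.9]],  # layer 0: two real tokens, one pad
    [[2.0, 2.1], [3.0, 3.1], [9.0, 9.9]],  # layer 1
]
per_layer_vectors = final_prompt_token_states(hidden, [1, 1, 0])
```

Everything downstream depends only on `per_layer_vectors`, so any safety-relevant information held at earlier token positions is invisible to the probe unless the model has already moved it into the final position.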

pith-pipeline@v0.9.0 · 5824 in / 1711 out tokens · 34307 ms · 2026-05-21T08:51:47.894117+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Geometry-Lite summarizes each margin profile along three named axes: margin level, layer-to-layer change, and structural shape.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 7 internal anchors

[1] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023

[2] Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

[3] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

[4] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[5] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

[6] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4454–4470, 2023

[7] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

[8] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

[9] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5637–5664, 2021

[10] Max Fomin. When benchmarks lie: Evaluating malicious prompt classifiers under true distribution shift. arXiv preprint arXiv:2602.14161, 2026

[11] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

[12] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the Tuned Lens. arXiv preprint arXiv:2303.08112, 2023

[13] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

[14] Hamed Damirchi, De la Jara, Ignacio Meza, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, and Javen Shi. Truth as a trajectory: What internal representations reveal about large language model reasoning. arXiv preprint arXiv:2603.01326, 2026

[15] Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to LLM security. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kUH1yPMAn7

[16] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[17] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. In Advances in Neural Information Processing Systems (NeurIPS) Da...

[18] Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911, 2024

[19] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

[20] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31983–32016, 2025

[21] Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

[22] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[23] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

[24] An Yang and Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

A Feature set: precise definitions

This appendix defines the 13 scalar summaries used for each margin geometry G ∈ {cent, knn, lin}. Concatenating the three geometry-specific blocks gives the 39-dimensional Geometry-Lite representation. Notation. For a fixed geometry G...

[1] [1]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

A holistic approach to undesired content detection in the real world

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009–15018, 2023

work page 2023

[3] [3]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[5] [5]

XSTest: A test suite for identifying exaggerated safety behaviours in large lan- guage models

Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large lan- guage models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

work page 2024

[6] [6]

On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning

Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4454–4470, 2023

work page 2023

[7] [7]

Wichmann

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020

[8] [8]

Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

work page 2019

[9] [9]

WILDS: A benchmark of in-the-wild distribution shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. InProceedings of the 38th International Conference on Machine Learning (ICML), pages 5637–5664, 2021

work page 2021

[10] [10]

When benchmarks lie: Evaluating malicious prompt classifiers under true distribu- tion shift

Max Fomin. When benchmarks lie: Evaluating malicious prompt classifiers under true distribu- tion shift. arXiv preprint arXiv:2602.14161, 2026

work page arXiv 2026

[11] [11]

Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space

Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

[12]

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the Tuned Lens. arXiv preprint arXiv:2303.08112, 2023

[13]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

[14]

Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, and Javen Shi. Truth as a trajectory: What internal representations reveal about large language model reasoning. arXiv preprint arXiv:2603.01326, 2026

[15]

Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to LLM security. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kUH1yPMAn7

[16]

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[17]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. In Advances in Neural Information Processing Systems (NeurIPS) Da...

[18]

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911, 2024

[19]

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

[20]

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31983–32016, 2025

[21]

Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

[22]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[23]

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

[24]

An Yang and Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

A Feature set: precise definitions

This appendix defines the 13 scalar summaries used for each margin geometry G ∈ {cent, knn, lin}. Concatenating the three geometry-specific blocks gives the 39-dimensional Geometry-Lite representation.

Notation. For a fixed geometry G...
