pith. machine review for the scientific record.

arxiv: 2605.11448 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Deep Minds and Shallow Probes


Pith reviewed 2026-05-13 02:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords neural representations · probing · symmetry · affine transformations · coordinate stability · cross-model transfer · probe quotient · shallow probes

The pith

Affine symmetries from equivalent realizations select a unique hierarchy of shallow probes, with linear probes as the base case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural representations realizing the same computation can differ by reparameterization of their hidden coordinates. A probe family meant to detect structure already present must therefore remain unchanged under the symmetries induced by those reparameterizations rather than depend on a particular basis. At the final readout layer these symmetries are affine transformations of the hidden coordinates. Requiring stability under this group action isolates a specific hierarchy of shallow coordinate-stable probes whose first member is the ordinary linear probe. The same principle identifies the probe-visible quotient of a representation, rather than the full hidden state, as the transferable object for cross-model monitoring.
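
To make the base case concrete, here is a minimal numpy sketch (our illustration, not code from the paper): an invertible affine change of hidden coordinates sends any linear probe to another linear probe with identical scores on every input, which is the degree-1 instance of coordinate stability.

```python
# Minimal sketch (assumes nothing beyond numpy): the family of linear probes
# s(h) = w @ h + c is closed under invertible affine reparameterizations
# h' = A h + b of the hidden coordinates.
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hidden states of "model 1" and an equivalent realization "model 2"
# whose coordinates differ by an invertible affine map.
H1 = rng.normal(size=(100, d))
A = rng.normal(size=(d, d)) + 3 * np.eye(d)   # invertible with high probability
b = rng.normal(size=d)
H2 = H1 @ A.T + b                             # h' = A h + b, applied row-wise

# A linear probe in model-1 coordinates...
w1, c1 = rng.normal(size=d), 0.7
scores1 = H1 @ w1 + c1

# ...pulls back exactly through the inverse affine map:
# w2 = A^{-T} w1 and c2 = c1 - w1 @ (A^{-1} b).
A_inv = np.linalg.inv(A)
w2 = A_inv.T @ w1
c2 = c1 - w1 @ (A_inv @ b)
scores2 = H2 @ w2 + c2

# Identical scores on every input: the probe reads structure in the
# representation, not an artifact of the coordinate basis.
assert np.allclose(scores1, scores2)
print("max discrepancy:", np.abs(scores1 - scores2).max())
```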

Core claim

Equivalent realizations induce affine changes of hidden coordinates. Requiring a probe family to be stable under this group action singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. A natural object for cross-model probe transfer is then the shared probe-visible quotient—the representation modulo directions invisible to the probe family—rather than the full hidden state.

What carries the argument

The group action of affine reparameterizations on hidden coordinates at the readout layer, which enforces coordinate-stability and selects the probe hierarchy.

If this is right

  • Linear probes form the lowest level of a larger family of stable shallow probes.
  • Degree-2 members of the hierarchy capture additional structure beyond what linear probes detect (see the sketch after this list).
  • Probe transfer should operate on the quotient modulo invisible directions to achieve coverage-aware portability.
  • The same stability requirement yields monitors that transfer across different model families.
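
To ground the degree-2 bullet, a hedged toy example (ours, not the paper's experiment): a label carried by a product of hidden coordinates is invisible to any linear probe but separable by a degree-2 polynomial threshold, the next member of the hierarchy.

```python
# Toy sketch: degree-1 vs degree-2 probes on an XOR-like task where the
# label is the sign of a product of two hidden coordinates.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 4))            # stand-in "hidden states"
y = (H[:, 0] * H[:, 1] > 0).astype(int)   # linearly inseparable by symmetry

linear = LogisticRegression(max_iter=1000).fit(H, y)
print("degree-1 probe accuracy:", linear.score(H, y))    # near chance (~0.5)

quad = PolynomialFeatures(degree=2, include_bias=False)
Hq = quad.fit_transform(H)                # adds all monomials h_i * h_j
poly2 = LogisticRegression(max_iter=1000).fit(Hq, y)
print("degree-2 probe accuracy:", poly2.score(Hq, y))    # near 1.0
```

The degree-2 feature map (all monomials of total degree at most 2) is itself closed under affine reparameterization, so the comparison stays inside the coordinate-stable hierarchy.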

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The symmetry analysis could be extended to intermediate layers if analogous group actions can be identified there.
  • Quotient-based transfer may improve robustness when applying monitors trained on one architecture to another.
  • The framework suggests that many existing probing techniques can be re-derived as special cases of symmetry-stable families.

Load-bearing premise

That affine coordinate changes from equivalent realizations are the only relevant symmetries and that probes intended to reveal existing structure must be invariant to them.

What would settle it

An experiment in which a probe family extracts reliable structure yet fails to be stable under affine reparameterizations, or in which full hidden-state transfer outperforms quotient-based transfer across models.
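
The first half of that test is mechanical to set up. A hedged sketch of the harness (our construction, not the paper's): take a candidate family that is not closed under the affine action, say fixed-ReLU probes s(h) = w @ relu(h), and measure how well any member in the reparameterized coordinates reproduces the original scores.

```python
# Stability harness: can any member of the candidate family, evaluated in
# affinely reparameterized coordinates, match the original probe's scores?
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 500
H1 = rng.normal(size=(n, d))
A = rng.normal(size=(d, d)) + 3 * np.eye(d)   # invertible with high probability
b = rng.normal(size=d)
H2 = H1 @ A.T + b                             # reparameterized coordinates

relu = lambda x: np.maximum(x, 0.0)
w = rng.normal(size=d)
target = relu(H1) @ w                         # scores of the original ReLU probe

# Best fixed-ReLU probe in the new coordinates (least squares over w', c').
F2 = np.hstack([relu(H2), np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(F2, target, rcond=None)
resid = np.linalg.norm(F2 @ coef - target) / np.linalg.norm(target)
print("relative residual of best in-family match:", resid)  # typically well above 0
```

By contrast, a linear target fit with linear features of H2 plus an intercept matches exactly, as in the sketch under the pith; a persistently large residual is what 'fails to be stable under affine reparameterizations' means operationally.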

Figures

Figures reproduced from arXiv: 2605.11448 by Risi Kondor, Su Hyeong Lee.

Figure 1. Two predictions of the paper's framework. (a) The polynomial degree hierarchy is tight: on circular …
Figure 2. Recovered minimum probe degree d⋆(T, L) for five Boolean tasks at six layers of Pythia-160m (left) and Pythia-410m (right). Cells are shaded by d⋆. "Degree" here refers to the minimum polynomial-threshold decision degree over the primitive scores (whether the label is separable by a polynomial threshold of degree d applied to the (Â, B̂, Ĉ) score vector), not to the degree of the Boolean function a…
read the original abstract

Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient--the representation modulo directions invisible to the probe family--rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that neural probes should be invariant under affine reparameterizations of hidden states that arise from equivalent realizations of the final readout layer. It claims this symmetry principle uniquely determines a hierarchy of shallow coordinate-stable probes (linear probes as the degree-1 member) and that a probe-visible quotient (representation modulo directions invisible to the probe family) is the appropriate object for cross-model probe transfer. Experiments on synthetic and real-world tasks are said to illustrate when degree-2 probes add value and how quotient-based transfer improves monitor portability.

Significance. If the uniqueness derivation holds without hidden restrictions on probe functional form, the work supplies a geometric rationale for the prevalence of linear probes and a concrete mechanism for coverage-aware transfer across model families. This could shift probing from empirical heuristics toward symmetry-based design, with the quotient construction offering a practical advance for interpretability and monitoring. The experiments provide initial support for both the hierarchy and the transfer claim.

major comments (2)
  1. [Abstract / §3 (Symmetry Principle)] Abstract and theoretical core: the claim that the symmetry principle 'singles out a unique hierarchy' requires an explicit statement of the probe function class (e.g., polynomials of bounded degree). Without a proof that no other families (non-polynomial or unbounded) satisfy the stability condition under the affine group action, uniqueness does not follow from the group action alone; the skeptical concern about functional-form restriction is load-bearing for the central claim.
  2. [§4 (Quotient and Transfer)] Probe-visible quotient construction: because the quotient is defined relative to the chosen probe family, the transfer claim inherits the same dependence on the hierarchy derivation. If the hierarchy is not uniquely fixed by symmetry, the quotient is likewise not canonical; this affects the cross-model portability result.
minor comments (2)
  1. [§3] Notation for the group action and stability condition should be introduced with a single running example (e.g., a two-layer readout) before the general case to improve readability.
  2. [§5] Experimental section should report the precise synthetic data-generating process and any controls for probe capacity or regularization that could confound the degree-1 vs. degree-2 comparison.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight an important point about the scope of the uniqueness claim, which we address by clarifying the probe function class in the revision. We respond point by point below.

read point-by-point responses
  1. Referee: [Abstract / §3 (Symmetry Principle)] Abstract and theoretical core: the claim that the symmetry principle 'singles out a unique hierarchy' requires an explicit statement of the probe function class (e.g., polynomials of bounded degree). Without a proof that no other families (non-polynomial or unbounded) satisfy the stability condition under the affine group action, uniqueness does not follow from the group action alone; the skeptic concern on functional-form restriction is load-bearing for the central claim.

    Authors: We agree that an explicit statement of the function class is needed for the uniqueness claim to be precise. In the manuscript, shallow probes are implicitly the class of polynomial functions of bounded degree, as this is the natural class closed under affine reparameterizations that admits a grading by total degree (with linear probes as the degree-1 member). We will revise the abstract and §3 to state explicitly that the symmetry principle is applied to the vector space of polynomial probes of degree at most d, and briefly justify why this class is appropriate: affine transformations preserve polynomial degree, yielding a finite-dimensional representation in which the hierarchy of invariant subspaces is uniquely determined by the representation theory of the affine group. Within this class the hierarchy is canonical; we do not claim uniqueness over all possible function families, as non-polynomial probes fall outside the shallow-probe setting studied here. revision: yes
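
The degree-preservation step the response leans on admits a short argument; the following is our sketch under the stated polynomial assumption, not text from the paper.

```latex
% Why the bounded-degree polynomial classes are closed under the affine
% action, making the degree filtration coordinate-stable. Let T(h) = Ah + b
% with A invertible, and let p be a polynomial probe. Substituting affine
% forms into a monomial cannot raise its total degree, so
\[
  \deg(p \circ T) \le \deg p .
\]
% Applying the same bound to T^{-1}, which is again affine because A is
% invertible, gives the reverse inequality:
\[
  \deg p = \deg\bigl((p \circ T) \circ T^{-1}\bigr) \le \deg(p \circ T),
\]
% hence deg(p o T) = deg p. Each space P_{<= d} of polynomials of degree at
% most d is invariant under p -> p o T, and the filtration
% P_{<= 1} \subset P_{<= 2} \subset \cdots is canonical, with the affine
% (linear-plus-bias) probes as its degree-1 level.
```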

  2. Referee: [§4 (Quotient and Transfer)] Probe-visible quotient construction: because the quotient is defined relative to the chosen probe family, the transfer claim inherits the same dependence on the hierarchy derivation. If the hierarchy is not uniquely fixed by symmetry, the quotient is likewise not canonical; this affects the cross-model portability result.

    Authors: We concur that the quotient construction is relative to the probe family. With the clarification in §3 that the family is the symmetry-selected hierarchy of polynomial probes of bounded degree, the quotient becomes the canonical object for that family. We will revise §4 to make this dependence explicit, stating that cross-model transfer is performed with respect to the same polynomial probe class on both models, and that the resulting quotient captures precisely the directions visible to the chosen probes. The experimental results on synthetic and real-world portability continue to demonstrate the practical benefit of this coverage-aware transfer within the stated class. revision: yes
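
As a gloss on what 'transfer with respect to the same polynomial probe class' could look like in code, here is a hedged numpy sketch; the names (W1, W2, the synthetic map M) are ours, not the paper's notation or API.

```python
# Quotient-based transfer sketch: align the low-dimensional probe-visible
# quotients of two models on paired activations, instead of matching full
# hidden states. Degree-1 (linear) probe family shown for brevity.
import numpy as np

rng = np.random.default_rng(2)
d1, d2, k, n = 12, 16, 3, 400

# Paired activations from two models on the same inputs (synthetic here:
# model 2 embeds an affine image of model 1's states, plus noise).
H1 = rng.normal(size=(n, d1))
M = rng.normal(size=(d2, d1))
H2 = H1 @ M.T + 0.01 * rng.normal(size=(n, d2))

# Probe directions for a shared concept on each model (rows of W).
W1 = rng.normal(size=(k, d1))
W2 = W1 @ np.linalg.pinv(M)          # corresponding directions in model 2

# Quotient coordinates: the hidden state modulo probe-invisible directions.
Z1, Z2 = H1 @ W1.T, H2 @ W2.T        # shape (n, k), with k << d

# Align the k-dimensional quotients by least squares.
T, *_ = np.linalg.lstsq(Z2, Z1, rcond=None)

# A monitor trained on model-1 quotient coordinates now reads model 2.
monitor = rng.normal(size=k)         # stand-in for a trained probe head
err = np.abs((Z2 @ T) @ monitor - Z1 @ monitor).mean()
print("mean transfer error:", err)   # small: the quotient carries the monitor
```

The alignment is estimated on k quotient coordinates rather than the full hidden states, which is the coverage-aware point: only directions visible to the probe family need to agree across models.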

Circularity Check

0 steps flagged

The symmetry-principle derivation is self-contained; by construction it does not reduce to its own inputs.

full rationale

The paper starts from the group action of affine reparameterizations induced by equivalent readout realizations and derives a stability condition for probe families. This is used to identify a hierarchy whose degree-1 case is the linear probe and to motivate the probe-visible quotient. No equation or claim in the abstract or described chain defines the hierarchy in terms of itself, renames a fitted quantity as a prediction, or relies on a self-citation whose content is unverified. The uniqueness statement is presented as following from the symmetry principle applied to shallow probes; experiments are described as supporting rather than constituting the derivation. The central claims therefore remain independent of the paper's own fitted values or prior self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumptions that equivalent realizations induce only affine changes and that probes should be invariant under those changes; no free parameters or invented entities are visible in the abstract.

axioms (2)
  • domain assumption: Neural representations are not unique objects; equivalent downstream computations may differ by reparameterization of hidden coordinates.
    Opening sentence of abstract; used to motivate the symmetry requirement.
  • domain assumption: A probe family intended to reveal structure already present should be stable under the relevant representation symmetries.
    Stated as the design principle that selects the hierarchy.

pith-pipeline@v0.9.0 · 5467 in / 1261 out tokens · 65103 ms · 2026-05-13T02:15:29.038919+00:00 · methodology

discussion (0)

