arxiv: 2509.13316 · v4 · pith:ZGLP37XFnew · submitted 2025-09-16 · 💻 cs.CL · cs.LG

Do Activation Verbalization Methods Convey Privileged Information?

Pith reviewed 2026-05-18 15:36 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: activation verbalization · LLM interpretability · privileged information · verbalizer LLM · target model internals · benchmarks · controlled experiments

The pith

Activation verbalization methods typically convey the verbalizer LLM's parametric knowledge rather than privileged information about the target model's internals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether translating LLM activations into natural language using a second verbalizer model provides unique insights into the target model's internal representations. It evaluates popular methods and datasets from prior work and shows that strong benchmark performance is achievable without any access to target model activations. Controlled experiments demonstrate that the resulting verbalizations more often reflect the verbalizer LLM's own encoded knowledge. These findings point to the need for improved benchmarks and controls when assessing whether such methods reveal how target models actually operate.

Core claim

Activation verbalization approaches do not necessarily convey privileged knowledge about the internal workings of the target LLM. Instead, verbalizations frequently reflect the parametric knowledge of the verbalizer LLM that generates them, and existing benchmarks can be solved without using the target model's activations at all.

What carries the argument

Controlled experiments that isolate verbalizer parametric knowledge and input features from target model activations to measure their separate contributions to verbalization outputs.
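
The comparison described above can be sketched as a per-condition accuracy table plus a single "activation gain" number. This is a minimal illustration, not the paper's evaluation code: the condition names (`with_activations`, `input_only`, `verbalizer_only`), both helper functions, and the toy records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} from (condition, prediction, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, prediction, gold in records:
        total[condition] += 1
        correct[condition] += int(prediction == gold)
    return {condition: correct[condition] / total[condition] for condition in total}


def activation_gain(accuracies, full="with_activations"):
    """Accuracy of the full method minus the best activation-free baseline."""
    baselines = [acc for cond, acc in accuracies.items() if cond != full]
    if not baselines:
        raise ValueError("need at least one activation-free baseline")
    return accuracies[full] - max(baselines)


# Hypothetical toy records: (condition, verbalizer output, gold label).
records = [
    ("with_activations", "France", "France"),
    ("with_activations", "Spain", "Italy"),
    ("input_only", "France", "France"),
    ("input_only", "Italy", "Italy"),
    ("verbalizer_only", "France", "France"),
    ("verbalizer_only", "Spain", "Italy"),
]

accuracies = accuracy_by_condition(records)
gain = activation_gain(accuracies)
print(accuracies)
print("activation gain:", gain)
```

A gain near zero or below it would be consistent with the reading that a benchmark can be solved without the target model's activations; a clearly positive gain would not, on its own, show where the extra information comes from.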

If this is right

  • Existing benchmarks from prior work do not require target model internals to achieve high performance.
  • Verbalizations often capture the verbalizer LLM's knowledge or input properties instead of target model internals.
  • New benchmarks and experimental controls are needed to test whether verbalization methods yield meaningful insights into target LLM operations.
  • Current performance metrics on verbalization tasks cannot be taken as evidence of access to target model representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future verbalization work could include explicit controls that subtract or match the verbalizer's parametric knowledge against the target.
  • The same concern about secondary-model leakage may apply to other interpretability techniques that decode representations with an auxiliary model.
  • Experiments using target and verbalizer models with deliberately mismatched training data could provide a stronger test of the separation.
  • Better isolation of activation information could improve the reliability of these methods for model debugging and analysis.

Load-bearing premise

The datasets and benchmarks from prior work can distinguish information that comes from target model internals from information already available in the inputs or the verbalizer model.

What would settle it

A result in which verbalizations produced with target activations differ in specific, measurable respects from those produced by verbalizer-only or input-only baselines, in ways that track unique target model behaviors.
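
One way such a result could be measured is sketched below, under stated assumptions: for each item we already know what the target model and the verbalizer each "believe" (for example, from separately querying them), and we check which belief the verbalization matches on the items where the two disagree. The record fields and the `attribution_rates` helper are hypothetical names introduced for illustration, and the toy records are made up.

```python
def attribution_rates(records):
    """On items where target and verbalizer answers differ, report how often
    the verbalization matches each model's answer."""
    disagreements = [
        r for r in records if r["target_answer"] != r["verbalizer_answer"]
    ]
    if not disagreements:
        raise ValueError("no items where the two models disagree")
    n = len(disagreements)
    follows_target = sum(
        r["verbalization"] == r["target_answer"] for r in disagreements
    )
    follows_verbalizer = sum(
        r["verbalization"] == r["verbalizer_answer"] for r in disagreements
    )
    return {
        "n": n,
        "follows_target": follows_target / n,
        "follows_verbalizer": follows_verbalizer / n,
    }


# Hypothetical toy records; the first one is skipped because the models agree.
records = [
    {"target_answer": "Peru", "verbalizer_answer": "Peru", "verbalization": "Peru"},
    {"target_answer": "Chile", "verbalizer_answer": "Peru", "verbalization": "Peru"},
    {"target_answer": "Kenya", "verbalizer_answer": "Ghana", "verbalization": "Kenya"},
    {"target_answer": "Japan", "verbalizer_answer": "Korea", "verbalization": "Korea"},
]

rates = attribution_rates(records)
print(rates)
```

Restricting the count to disagreement items matters: when both models hold the same answer, a matching verbalization cannot tell us which model it came from.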

Figures

Figures reproduced from arXiv: 2509.13316 by Alberto Mario Ceballos Arroyo, Byron C. Wallace, Giordano Rogers, Millicent Li, Naomi Saphra.

Figure 1: Two ways that a verbalizer (M2) might describe an activation. In our preferred scenario (a), the description employs privileged information beyond what is accessible in the input (xinput), so the country of origin for Alice can be determined from the target (M1) model’s activations. Alternatively, (b) verbalization may give no privileged insights into the operations of M1 since M2 may only be accessing inp…
Figure 2: Two ways of verbalizing descriptions of model activations. In (a), …
Figure 3: We use the following setup to assess whether verbalization techniques communicate priv…
Figure 4: We show the effect of using an xprompt that is semantically similar or adversarial. We average across all tasks and tested prompts for space; see Appendix Subsection H.4 for the full prompt and task breakdown. Our key finding is shown above. In Appendix …
Figure 5: We show the effects of small prompt manipulations. For both LIT and …
Figure 6: We show the significant effect of adding prompt distractors, with incorrect labels, to …

Original abstract

Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about the inputs provided to it? We critically evaluate popular verbalization methods and datasets used in prior work and find that one can perform well on such benchmarks without access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that activation verbalization methods, which employ a second LLM to translate target LLM activations into natural language descriptions, do not reliably convey privileged information about the target model's internal representations. Instead, strong performance on existing benchmarks is possible without target activations, and controlled experiments indicate that the verbalizations primarily reflect the verbalizer LLM's parametric knowledge or input information rather than target internals.

Significance. If substantiated, the result would be moderately significant for LLM interpretability research by exposing limitations in current verbalization techniques and motivating the creation of more rigorous benchmarks with better controls for verbalizer knowledge. The manuscript's empirical comparisons to prior benchmarks and use of controlled experiments isolating verbalizer effects are strengths that ground the critique in falsifiable tests.

major comments (2)
  1. [§4] Baselines without target activations: the claim that prior datasets are unsuitable rests on these results showing competitive performance; the section should report exact metrics (e.g., accuracy or F1), number of runs, and statistical significance tests to confirm the baselines truly undermine the benchmarks.
  2. [§5.1] Controlled experiments: the finding that verbalizations reflect verbalizer parametric knowledge rather than target knowledge is central; clarify the exact procedure for attributing knowledge sources (e.g., via ablation of activation input or comparison to verbalizer-only prompts) and report effect sizes.

minor comments (2)
  1. [Introduction] Expand the discussion of specific prior verbalization papers with one additional sentence on their claimed advantages to better frame the critique.
  2. [Table 1] Ensure column headers explicitly label whether rows include target activations or not for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and rigor.

point-by-point responses
  1. Referee: [§4] Baselines without target activations: the claim that prior datasets are unsuitable rests on these results showing competitive performance; the section should report exact metrics (e.g., accuracy or F1), number of runs, and statistical significance tests to confirm the baselines truly undermine the benchmarks.

    Authors: We agree that additional reporting details would strengthen the section. While the manuscript already demonstrates competitive performance of baselines without target activations, we will revise §4 to explicitly include the exact accuracy and F1 scores, the number of runs performed, and results from statistical significance tests (such as paired t-tests) to more rigorously support the conclusion that these datasets are unsuitable for evaluating verbalization methods.

    Revision: yes

  2. Referee: [§5.1] Controlled experiments: the finding that verbalizations reflect verbalizer parametric knowledge rather than target knowledge is central; clarify the exact procedure for attributing knowledge sources (e.g., via ablation of activation input or comparison to verbalizer-only prompts) and report effect sizes.

    Authors: We appreciate the suggestion to enhance clarity here. The experiments in §5.1 attribute knowledge sources through systematic ablations: we compare verbalizations and task performance when the verbalizer receives target activations versus when activations are replaced by random noise or omitted entirely, and versus verbalizer-only prompts that receive only the raw input text. We will expand the description of this procedure in the revised manuscript and add effect sizes (e.g., Cohen's d for performance differences) to quantify the dominance of verbalizer parametric knowledge.

    Revision: yes
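
The reporting promised in the two responses (a paired significance test over runs, plus an effect size) can be sketched with the Python standard library alone. Two caveats, both assumptions of this sketch rather than anything the rebuttal specifies: an exact sign-flip permutation test stands in for the paired t-test, because it needs no external statistics package, and Cohen's d is computed with the pooled-standard-deviation formula. The per-run scores are hypothetical counts of correct answers out of 100.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Returns (mean difference, p-value). Cost grows as 2**n, so this is only
    practical for a small number of paired runs.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Under the null, each paired difference is equally likely to be +d or -d.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return mean(diffs), extreme / total


def cohens_d(scores_a, scores_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (
        (n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_var)


# Hypothetical per-run scores (correct answers out of 100), one entry per seed.
with_activations = [81, 79, 80, 82, 78]
input_only = [80, 80, 79, 81, 79]

mean_diff, p_value = paired_sign_flip_test(with_activations, input_only)
effect = cohens_d(with_activations, input_only)
print(f"mean difference: {mean_diff}, p-value: {p_value}, Cohen's d: {effect:.3f}")
```

With only a handful of runs the permutation test has a coarse p-value grid (multiples of 1/2**n), which is one reason to report the number of runs alongside the test result.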

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper advances its claims through direct empirical tests, including baselines that achieve strong performance without target model activations and controlled experiments isolating verbalizer parametric knowledge. These steps rely on comparisons to external prior benchmarks and new experimental controls rather than any derivation, equation, or self-citation that reduces the result to its own inputs by construction. No self-definitional steps, fitted predictions, or load-bearing self-citations appear in the abstract or described methodology; the work is therefore independent of the circularity patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing verbalization benchmarks are meant to measure privileged internal information; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • Domain assumption: Existing verbalization benchmarks and datasets are designed to evaluate whether methods convey privileged information from target model activations.
    The paper's critique of these benchmarks as inadequate depends on this premise about their intended purpose.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs exhibit domain-specific privileged knowledge in hidden states for factual correctness but not math reasoning, visible only on model disagreement subsets.

  2. Shared Lexical Task Representations Explain Behavioral Variability In LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.
