Do Activation Verbalization Methods Convey Privileged Information?

Alberto Mario Ceballos Arroyo; Byron C. Wallace; Giordano Rogers; Millicent Li; Naomi Saphra

arxiv: 2509.13316 · v4 · pith:ZGLP37XFnew · submitted 2025-09-16 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-18 15:36 UTC · model grok-4.3

keywords activation verbalization · LLM interpretability · privileged information · verbalizer LLM · target model internals · benchmarks · controlled experiments

The pith

Activation verbalization methods typically convey the verbalizer LLM's parametric knowledge rather than privileged information about the target model's internals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether translating LLM activations into natural language using a second verbalizer model provides unique insights into the target model's internal representations. It evaluates popular methods and datasets from prior work and shows that strong benchmark performance is achievable without any access to target model activations. Controlled experiments demonstrate that the resulting verbalizations more often reflect the verbalizer LLM's own encoded knowledge. These findings point to the need for improved benchmarks and controls when assessing whether such methods reveal how target models actually operate.

Core claim

Activation verbalization approaches do not necessarily convey privileged knowledge about the internal workings of the target LLM. Instead, verbalizations frequently reflect the parametric knowledge of the verbalizer LLM that generates them, and existing benchmarks can be solved without using the target model's activations at all.

What carries the argument

Controlled experiments that isolate verbalizer parametric knowledge and input features from target model activations to measure their separate contributions to verbalization outputs.
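
One way to picture this kind of control is to compare accuracy with target activations against the best activation-free baseline. The sketch below is illustrative only: the function names, the three condition labels, and the per-example correctness flags are invented for this example and are not taken from the paper.

```python
from statistics import mean


def accuracy(correct_flags):
    """Fraction of examples answered correctly."""
    return mean([1.0 if flag else 0.0 for flag in correct_flags])


def activation_gain(with_activations, input_only, verbalizer_only):
    """Accuracy of the full pipeline minus the best activation-free baseline.

    A gain near zero suggests the benchmark can be solved without
    reading anything from the target model's activations.
    """
    best_baseline = max(accuracy(input_only), accuracy(verbalizer_only))
    return accuracy(with_activations) - best_baseline


# Hypothetical per-example correctness under three conditions.
with_activations = [True, True, False, True, True, False, True, True]
input_only = [True, True, False, True, False, False, True, True]
verbalizer_only = [True, False, False, True, False, False, True, False]

gain = activation_gain(with_activations, input_only, verbalizer_only)
print(f"gain over best activation-free baseline: {gain:.3f}")
```

Taking the maximum over baselines is a deliberately conservative choice: activations get credit only for what no activation-free condition already achieves.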

If this is right

Existing benchmarks from prior work do not require target model internals to achieve high performance.
Verbalizations often capture the verbalizer LLM's knowledge or input properties instead of target model internals.
New benchmarks and experimental controls are needed to test whether verbalization methods yield meaningful insights into target LLM operations.
Current performance metrics on verbalization tasks cannot be taken as evidence of access to target model representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future verbalization work could include explicit controls that subtract or match the verbalizer's parametric knowledge against the target.
The same concern about secondary-model leakage may apply to other interpretability techniques that decode representations with an auxiliary model.
Experiments using target and verbalizer models with deliberately mismatched training data could provide a stronger test of the separation.
Better isolation of activation information could improve the reliability of these methods for model debugging and analysis.

Load-bearing premise

The datasets and benchmarks from prior work can distinguish information that comes from target model internals from information already available in the inputs or the verbalizer model.

What would settle it

A result in which verbalizations produced with target activations differ in specific, measurable respects from those produced by verbalizer-only or input-only baselines, in ways that track unique target model behaviors.
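
A minimal sketch of one way such a difference could be checked, assuming per-example correctness flags are available for the activation-based condition and for a baseline. It uses an exact McNemar test on the discordant pairs, which is a choice made here for illustration and not a procedure described in the paper; the data are invented.

```python
from math import comb


def exact_mcnemar(full, baseline):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (only_full, only_base, p_value), where only_full counts
    examples that only the activation-based condition gets right and
    only_base counts examples that only the baseline gets right.
    """
    only_full = sum(1 for f, b in zip(full, baseline) if f and not b)
    only_base = sum(1 for f, b in zip(full, baseline) if b and not f)
    n = only_full + only_base
    if n == 0:
        # No discordant pairs: the two conditions are indistinguishable.
        return only_full, only_base, 1.0
    k = min(only_full, only_base)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_full, only_base, min(1.0, 2 * tail)


# Hypothetical outcomes: 10 examples only the full pipeline solves,
# 2 only the baseline solves, 28 where both agree.
full = [True] * 10 + [False] * 2 + [True] * 20 + [False] * 8
base = [False] * 10 + [True] * 2 + [True] * 20 + [False] * 8

only_full, only_base, p_value = exact_mcnemar(full, base)
print(only_full, only_base, round(p_value, 4))
```

A small p-value shows only that the two conditions disagree systematically. Whether the disagreement tracks behavior unique to the target model still has to be argued from how the examples were constructed.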

Figures

Figures reproduced from arXiv: 2509.13316 by Alberto Mario Ceballos Arroyo, Byron C. Wallace, Giordano Rogers, Millicent Li, Naomi Saphra.

**Figure 1.** Two ways that a verbalizer (M2) might describe an activation. In our preferred scenario (a), the description employs privileged information beyond what is accessible in the input (x_input), so the country of origin for Alice can be determined from the target (M1) model’s activations. Alternatively, (b) verbalization may give no privileged insights into the operations of M1 since M2 may only be accessing inp…

**Figure 2.** Two ways of verbalizing descriptions of model activations. In (a), …

**Figure 3.** We use the following setup to assess whether verbalization techniques communicate priv…

**Figure 4.** We show the effect of using an x_prompt that is semantically similar or adversarial. We average across all tasks and tested prompts for space; see Appendix Subsection H.4 for the full prompt and task breakdown. Our key finding is shown above. In Appendix …

**Figure 5.** We show the effects of small prompt manipulations. For both LIT and …

**Figure 6.** We show the significant effect of adding prompt distractors, with incorrect labels, to …

Original abstract

Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about the inputs provided to it? We critically evaluate popular verbalization methods and datasets used in prior work and find that one can perform well on such benchmarks without access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Verbalization methods mostly track the verbalizer LLM's own knowledge rather than revealing privileged target internals, and prior benchmarks allow decent performance without any target activations.

The core finding is that activation verbalization does not seem to deliver privileged access to the target model's internals. The outputs instead largely reflect what the verbalizer LLM already knows from its parameters, and the paper shows you can get reasonable results on existing benchmarks without ever using the target activations at all. This undercuts how much those benchmarks actually test for internal decoding.

The controlled experiments separate the verbalizer's contribution from the target's and make a reasonable case that the verbalizer dominates. That is the main new angle here, building on earlier verbalization papers by adding this specific check for privileged information. It is a straightforward empirical point and the logic of the controls follows from the abstract description. The work earns credit for running those comparisons and for flagging that the datasets from prior work are not ideal for this evaluation goal.

A softer spot is the reliance on those same benchmarks even while critiquing them; without the full methods, data splits, and exact metrics it is hard to judge how tightly the experiments rule out input leakage or other confounds. The abstract presents the sequence clearly, but the strength of the claim depends on details not visible here.

This is useful for researchers who use or evaluate verbalization techniques in LLM interpretability. Anyone who wants to know whether their method is actually reading model internals rather than just rephrasing inputs or the verbalizer's priors would get something out of it. It is not a complete dismissal of the approach, just a call for better tests. I would send it to peer review so the experimental controls can be checked in detail and the field can decide whether new benchmarks are required.

Referee Report

2 major / 2 minor

Summary. The paper claims that activation verbalization methods, which employ a second LLM to translate target LLM activations into natural language descriptions, do not reliably convey privileged information about the target model's internal representations. Instead, strong performance on existing benchmarks is possible without target activations, and controlled experiments indicate that the verbalizations primarily reflect the verbalizer LLM's parametric knowledge or input information rather than target internals.

Significance. If substantiated, the result would be moderately significant for LLM interpretability research by exposing limitations in current verbalization techniques and motivating the creation of more rigorous benchmarks with better controls for verbalizer knowledge. The manuscript's empirical comparisons to prior benchmarks and use of controlled experiments isolating verbalizer effects are strengths that ground the critique in falsifiable tests.

major comments (2)

[§4] Baselines without target activations: the claim that prior datasets are unsuitable rests on these results showing competitive performance; the section should report exact metrics (e.g., accuracy or F1), number of runs, and statistical significance tests to confirm the baselines truly undermine the benchmarks.
[§5.1] Controlled experiments: the finding that verbalizations reflect verbalizer parametric knowledge rather than target knowledge is central; clarify the exact procedure for attributing knowledge sources (e.g., via ablation of activation input or comparison to verbalizer-only prompts) and report effect sizes.

minor comments (2)

[Introduction] Expand the discussion of specific prior verbalization papers with one additional sentence on their claimed advantages to better frame the critique.
[Table 1] Ensure column headers explicitly label whether rows include target activations or not for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and rigor.

Referee: [§4] Baselines without target activations: the claim that prior datasets are unsuitable rests on these results showing competitive performance; the section should report exact metrics (e.g., accuracy or F1), number of runs, and statistical significance tests to confirm the baselines truly undermine the benchmarks.

Authors: We agree that additional reporting details would strengthen the section. While the manuscript already demonstrates competitive performance of baselines without target activations, we will revise §4 to explicitly include the exact accuracy and F1 scores, the number of runs performed, and results from statistical significance tests (such as paired t-tests) to more rigorously support the conclusion that these datasets are unsuitable for evaluating verbalization methods.
Revision: yes

Referee: [§5.1] Controlled experiments: the finding that verbalizations reflect verbalizer parametric knowledge rather than target knowledge is central; clarify the exact procedure for attributing knowledge sources (e.g., via ablation of activation input or comparison to verbalizer-only prompts) and report effect sizes.

Authors: We appreciate the suggestion to enhance clarity here. The experiments in §5.1 attribute knowledge sources through systematic ablations: we compare verbalizations and task performance when the verbalizer receives target activations versus when activations are replaced by random noise or omitted entirely, and versus verbalizer-only prompts that receive only the raw input text. We will expand the description of this procedure in the revised manuscript and add effect sizes (e.g., Cohen's d for performance differences) to quantify the dominance of verbalizer parametric knowledge.
Revision: yes
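
For readers unfamiliar with the effect size named in the response above, here is a minimal sketch of Cohen's d with a pooled standard deviation. The per-run accuracies are invented, and the pooled (independent-groups) form is an assumption; the response does not say which variant would be used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical accuracies over five runs per condition.
with_act_runs = [0.80, 0.82, 0.78, 0.81, 0.79]  # verbalizer sees target activations
noise_runs = [0.78, 0.80, 0.77, 0.79, 0.76]     # activations replaced by random noise

d = cohens_d(with_act_runs, noise_runs)
print(f"Cohen's d: {d:.2f}")
```

If the same seeds or examples are shared across conditions, a paired variant (mean difference divided by the SD of the differences) may be the more appropriate choice.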

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper advances its claims through direct empirical tests, including baselines that achieve strong performance without target model activations and controlled experiments isolating verbalizer parametric knowledge. These steps rely on comparisons to external prior benchmarks and new experimental controls rather than any derivation, equation, or self-citation that reduces the result to its own inputs by construction. No self-definitional steps, fitted predictions, or load-bearing self-citations appear in the abstract or described methodology; the work is therefore independent of the circularity patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing verbalization benchmarks are meant to measure privileged internal information; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

domain assumption: Existing verbalization benchmarks and datasets are designed to evaluate whether methods convey privileged information from target model activations.
The paper's critique of these benchmarks as inadequate depends on this premise about their intended purpose.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: "We create a new evaluation task in Section 5 to study whether verbalizers express knowledge added by the target model during processing"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
cs.CL · 2026-04 · unverdicted · novelty 7.0
LLMs exhibit domain-specific privileged knowledge in hidden states for factual correctness but not math reasoning, visible only on model disagreement subsets.

Shared Lexical Task Representations Explain Behavioral Variability In LLMs
cs.CL · 2026-04 · unverdicted · novelty 5.0
LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.


Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Hsu, Richard J

Chandan Singh, Aliyah R. Hsu, Richard J. Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, and Jianfeng Gao. Explaining black box text modules in natural language with language models. CoRR, abs/2305.09863, 2023. doi:10.48550/ARXIV.2305.09863. URL https://doi.org/10.48550/arXiv.2305.09863

work page doi:10.48550/arxiv.2305.09863 2023
[49]

Language models fail to introspect about their knowledge of language, 2025 a

Siyuan Song, Jennifer Hu, and Kyle Mahowald. Language models fail to introspect about their knowledge of language, 2025 a . URL https://arxiv.org/abs/2503.07513

work page arXiv 2025
[50]

arXiv preprint arXiv:2508.14802 , year =

Siyuan Song, Harvey Lederman, Jennifer Hu, and Kyle Mahowald. Privileged self-access matters for introspection in ai, 2025 b . URL https://arxiv.org/abs/2508.14802

work page arXiv 2025
[51]

Evaluating the zero-shot robustness of instruction-tuned language models

Jiuding Sun, Chantal Shaib, and Byron C Wallace. Evaluating the zero-shot robustness of instruction-tuned language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=g9diuvxN6D

work page 2024
[52]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

work page 2023
[53]

Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp.\ 2300--2344, 2022

work page 2022
[54]

Jump to conclusions: Short-cutting transformers with linear transformations

Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. Jump to conclusions: Short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp.\ 9615--9625, Torino, Italy, May 2024. ELRA and ICCL. URL https://aclan...

work page 2024
[55]

Improving the robustness of large language models via consistency alignment

Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. Improving the robustness of large language models via consistency alignment. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint Internationa...

work page 2024
[56]

Regularization and variable selection via the elastic net

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67 0 (2): 0 301--320, 2005. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/3647580

work page arXiv 2005
[57]

How do language models learn facts? dynamics, curricula and hallucinations.arXiv preprint arXiv:2503.21676, 2025

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? dynamics, curricula and hallucinations, 2025. URL https://arxiv.org/abs/2503.21676

work page arXiv 2025