Fast & Faithful Function Vectors

Anton Segeler; Minh An Pham; Patrick Kahardipraja; Reduan Achtibat; Sebastian Lapuschkin; Thomas Wiegand; Wojciech Samek

arxiv: 2606.05079 · v1 · pith:2HPNQRMPnew · submitted 2026-06-03 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-28 06:44 UTC · model grok-4.3

keywords: function vectors · in-context learning · attention head selection · Layer-wise Relevance Propagation · LLM steering · distributed steering

The pith

Layer-wise Relevance Propagation for head selection makes function vectors more efficient and accurate, and distributed steering outperforms aggregation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to formulate function vectors, which are task representations drawn from in-context learning and used to steer large language models. It varies two choices: which attention heads to use when building the vector, and how to apply the steering signal. Gradient-based attributions via Layer-wise Relevance Propagation for head selection cut computation while raising accuracy. Steering the vectors in a distributed way across heads also raises accuracy over simply adding them together.
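The head-selection step can be illustrated with a small sketch. It assumes that per-head attribution scores are already available (from LRP or any other attribution method; the sketch does not compute them) and that each head has a mean output vector recorded over in-context prompts. The names `top_k_heads` and `aggregated_fv`, the toy scores, and `k=2` are hypothetical and chosen for illustration; this is not the authors' released code.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def top_k_heads(scores: Dict[Head, float], k: int) -> List[Head]:
    """Return the k heads with the largest attribution score."""
    ranked = sorted(scores, key=lambda head: scores[head], reverse=True)
    return ranked[:k]


def aggregated_fv(mean_out: Dict[Head, List[float]], heads: List[Head]) -> List[float]:
    """Sum the mean outputs of the selected heads into a single vector."""
    dim = len(next(iter(mean_out.values())))
    fv = [0.0] * dim
    for head in heads:
        for i, value in enumerate(mean_out[head]):
            fv[i] += value
    return fv


# Toy inputs (assumed values, not taken from the paper).
scores = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): -0.2}
mean_out = {
    (0, 0): [1.0, 0.0],
    (0, 1): [0.0, 2.0],
    (1, 0): [1.0, 1.0],
    (1, 1): [5.0, 5.0],
}

heads = top_k_heads(scores, k=2)
fv = aggregated_fv(mean_out, heads)
print(heads, fv)
```

On this toy input the two highest-scoring heads are kept and their mean outputs are summed; a real run would use the model's hidden size and a larger K.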

Core claim

For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation.

What carries the argument

Function vectors as in-context task representations, modified by LRP-based attention head selection and distributed application of the steering signal.

If this is right

Function vectors can be extracted with fewer heads while retaining or improving steering performance.
Distributed steering produces stronger task control than vector addition across the same heads.
The same attribution method can be reused to rank heads for other in-context tasks.
Efficiency gains allow function-vector methods to run on longer contexts or larger models without proportional cost increase.
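To make the contrast between aggregation and distributed steering concrete, the toy sketch below compares two injection schemes. It rests on an assumed reading of the terms, since this page does not spell out the paper's exact procedure: "aggregated" sums all selected head vectors and adds the sum to the hidden state after one chosen layer, while "distributed" adds each head's vector after that head's own layer. The layer function, `aggregated_plan`, `distributed_plan`, and `run` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)
Vec = List[float]


def add(a: Vec, b: Vec) -> Vec:
    return [x + y for x, y in zip(a, b)]


def aggregated_plan(head_vecs: Dict[Head, Vec], inject_layer: int) -> Dict[int, Vec]:
    """One summed vector, injected after a single layer."""
    dim = len(next(iter(head_vecs.values())))
    total = [0.0] * dim
    for vec in head_vecs.values():
        total = add(total, vec)
    return {inject_layer: total}


def distributed_plan(head_vecs: Dict[Head, Vec]) -> Dict[int, Vec]:
    """Each head's vector is injected after the layer the head lives in."""
    plan: Dict[int, Vec] = {}
    for (layer, _), vec in head_vecs.items():
        plan[layer] = add(plan.get(layer, [0.0] * len(vec)), vec)
    return plan


def run(hidden: Vec, layers: List[Callable[[Vec], Vec]], plan: Dict[int, Vec]) -> Vec:
    """Apply the layers in order, adding the planned vector after each layer."""
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx in plan:
            hidden = add(hidden, plan[idx])
    return hidden


def toy_layer(h: Vec) -> Vec:
    # Stand-in nonlinearity so that the injection point matters.
    return [max(0.0, x - 1.0) for x in h]


layers = [toy_layer, toy_layer, toy_layer]
head_vecs = {(0, 0): [2.0, 0.0], (2, 1): [0.0, 3.0]}

out_aggregated = run([0.0, 0.0], layers, aggregated_plan(head_vecs, inject_layer=1))
out_distributed = run([0.0, 0.0], layers, distributed_plan(head_vecs))
print(out_aggregated, out_distributed)
```

Because the toy layers are nonlinear, the two schemes end in different hidden states even though they inject the same total signal. That is the only point the sketch makes; it says nothing about which scheme is more accurate.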

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same LRP ranking might identify useful heads even when the underlying model changes architecture or training data.
Distributed steering could be combined with other attribution techniques to further reduce the number of active heads needed.
If the efficiency improvement scales, function vectors might become practical for real-time steering in deployed chat systems.

Load-bearing premise

The measured gains in speed and accuracy arise from the LRP head selection and distributed steering rather than from differences in model size, task choice, or other unstated experimental details.

What would settle it

A controlled rerun of the head-selection and steering experiments that keeps every other variable fixed and finds no efficiency or accuracy lift from LRP or distributed application would falsify the central claims.

Figures

Figures reproduced from arXiv: 2606.05079 by Anton Segeler, Minh An Pham, Patrick Kahardipraja, Reduan Achtibat, Sebastian Lapuschkin, Thomas Wiegand, Wojciech Samek.

**Figure 1.** Overview of how definitions of FVs affect efficiency and accuracy on Llama-3.2-3B. … be understood as task representations for in-context learning (ICL; Brown et al., 2020). The so-called function vectors (FVs) can be extracted from attention heads and then trigger the execution of a task. However, despite their usefulness (Yang et al., 2026; Liu et al., 2026), there is little consensus on how to define…

**Figure 2.** Average accuracies over all tasks. Finding 1: Distributed FVs improve performance compared to averaged FVs.

**Figure 3.** Position of extracted heads within Llama-3.2-3B. … of the target token among the top softmax probabilities is 4.3. Looking at the predictions themselves, we find that 26 out of 41 failures are predictions of synonyms of the target token (e.g. bad for evil, disorder for chaos) or subword prefixes that plausibly continue into a valid antonym under multi-token decoding (e.g. un-, non-, anti-). Only the remaini…

**Figure 4.** Accuracies per task with global heads on Llama-3.2-3B.

**Figure 5.** Accuracies per task with global heads on Llama-3.1-8B. Task labels along the axis (truncated in the source): adjective_v_verb_3, adjective_v_verb_5, alphabetically_last_3, animal_v_object_3, animal_v_object_5, antonym, capitalize, capitalize_first_letter, capitalize_last_letter, choose_first_of_3, choose_first_of_5, choose_last_of_3, choose_last_of_5, choose_middle_of_3, choose_middle_of_5, color_v_animal_3, color_v_animal_5, concept_v_object_3, concept_v_object_5, conll2003_lo…

**Figure 6.** Accuracies per task with global heads on Qwen3-4B.

**Figure 7.** Average accuracies over all tasks with per-task heads.

**Figure 8.** Accuracies per task with per-task heads on Llama-3.2-3B.

**Figure 9.** Accuracies per task with per-task heads on Llama-3.1-8B.

**Figure 10.** Accuracies per task with per-task heads on Qwen3-4B.

**Figure 11.** Position of extracted heads within Llama-3.1-8B. Axes: head index, layer index. Legend: AIE, LRP, Overlap.

**Figure 12.** Position of extracted heads within Qwen3-4B.

**Figure 13.** Accuracies for injecting FVs at different layers for Llama-3.2-3B. Axes: intervention layer, accuracy. Legend: AIE + FV, LRP + FV.

**Figure 14.** Accuracies for injecting FVs at different layers for Llama-3.1-8B.

**Figure 15.** Accuracies for injecting FVs at different layers for Qwen3-4B. F. Exploring larger values for K: Davidson et al. (2025) choose K = 20 for patching FVs, so we adopt this value. Additionally, we evaluated all models on K = 40 and show the result in…

**Figure 16.** Comparison of FV patching with K = 20 and K = 40. G. List of Tasks: We include the list of tasks that were either used or omitted…

Original abstract

Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

The abstract claims LRP head selection and distributed steering improve function vectors but gives no numbers, baselines, or controls, so the gains cannot be assessed.

The main takeaway is that this paper looks at two tweaks to function vectors—LRP for picking attention heads and applying the vectors in a distributed way rather than aggregating them—and says both improve accuracy and efficiency. That is the extent of what is new: a pair of empirical comparisons on design choices already present in the cited FV papers.

It does a couple of things right. It narrows the focus to those two degrees of freedom and releases the code. That is useful for anyone already running these experiments and wanting to try the variants.

The soft spots are more serious. The abstract states the improvements without any accuracy numbers, efficiency metrics, dataset details, error bars, or ablation tables. The stress-test concern lands: we have no way to know whether the reported gains come from LRP and distributed application or from unmentioned differences in model scale, task choice, prompt format, or hyperparameters. Without those controls the attribution does not follow.

This is narrow work for people already inside the steering subfield. A reader working on in-context control might want to see the full results if they exist, but there is nothing here that would change how most people think about function vectors or justify citing the paper. It does not rise to the level that deserves referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an empirical study on function vectors (FVs) for steering large language models during in-context learning. It varies FV definitions along two axes—attention head selection (using gradient-based attributions via Layer-wise Relevance Propagation) and steering application (distributed vs. simple aggregation)—and claims that LRP head selection improves both efficiency and accuracy while distributed steering improves accuracy over aggregation. Public code is provided.

Significance. If the reported gains are shown to be robustly attributable to the LRP and distributed-steering choices (rather than unisolated experimental variables), the work would provide practical guidance for more efficient and accurate FV steering and strengthen the empirical toolkit for analyzing in-context learning. Public code is a clear strength for reproducibility.

major comments (2)

[Abstract] The claims that LRP head selection 'substantially improves efficiency as well as accuracy' and that distributed steering 'yields a higher accuracy' are asserted without any numerical results, baselines, datasets, error bars, or statistical tests, so the magnitude and reliability of the improvements cannot be evaluated from the provided text.
[Experimental sections] Head-selection and steering results: the central attribution of gains to LRP attributions and distributed application requires explicit controls or ablations for model scale, task distribution, prompt formatting, and aggregation hyperparameters. No such matched controls or tables isolating these factors are described, so the causal link to the design choices does not follow.

minor comments (1)

[Abstract] Consider including at least summary metrics or improvement ranges to make the high-level claims more informative to readers.
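The request for error bars and statistical tests can be met in several ways. One common option is a paired bootstrap over evaluation examples, sketched below under the assumption that both methods were scored on the same examples with 0/1 correctness. The function name `paired_bootstrap`, the toy correctness lists, and the 95% percentile interval are illustrative choices, not the procedure used in the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy difference A - B, lower bound, upper bound).

    Both lists hold 0/1 correctness for the same examples in the same order.
    The bounds form a 95% percentile bootstrap interval.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same examples")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    samples = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        resampled = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resampled) / n)
    samples.sort()

    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Toy correctness vectors (assumed, not results from the paper).
correct_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
correct_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

observed, lower, upper = paired_bootstrap(correct_a, correct_b)
print(observed, lower, upper)
```

Pairing matters here: resampling the per-example differences keeps task difficulty matched between the two methods, which is what the second major comment asks the experiments to control for.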

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and outline planned revisions.

Referee: [Abstract] The claims that LRP head selection 'substantially improves efficiency as well as accuracy' and that distributed steering 'yields a higher accuracy' are asserted without any numerical results, baselines, datasets, error bars, or statistical tests, so the magnitude and reliability of the improvements cannot be evaluated from the provided text.

Authors: Abstracts are space-constrained and conventionally summarize findings at a high level. The manuscript body reports concrete numerical gains, baselines from prior FV work, multiple datasets, error bars over multiple runs, and statistical comparisons. We will revise the abstract to incorporate key quantitative results (e.g., accuracy deltas and efficiency metrics) while remaining within length limits.

Revision: partial

Referee: [Experimental sections] Head-selection and steering results: the central attribution of gains to LRP attributions and distributed application requires explicit controls or ablations for model scale, task distribution, prompt formatting, and aggregation hyperparameters. No such matched controls or tables isolating these factors are described, so the causal link to the design choices does not follow.

Authors: All reported comparisons hold model scale, task distribution, and prompt formatting fixed while varying only the head-selection method or the steering application (distributed vs. aggregation). Tables directly contrast LRP against random and gradient baselines under identical conditions. We nevertheless agree that dedicated ablations on aggregation hyperparameters would further isolate their contribution and will add a supplementary table or section with these controls.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of existing FV techniques with no derivations or self-referential reductions

The paper is framed as an empirical study of design choices in function vectors (head selection via LRP attributions, distributed vs. aggregated steering). No equations, derivations, fitted parameters, or predictions are described that could reduce to inputs by construction. Central claims rest on experimental comparisons rather than self-definition, self-citation chains, or renamed known results. This matches the default non-circular outcome for empirical work without load-bearing mathematical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of the in-context learning and attribution literature (e.g., that relevance scores from LRP faithfully reflect causal importance) without introducing new ones.
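For readers unfamiliar with the attribution method, the sketch below applies the epsilon rule from the layer-wise relevance propagation literature to a single bias-free linear layer: relevance arriving at each output is redistributed to the inputs in proportion to their contribution `a[i] * w[i][j]`. Whether the paper uses this exact rule or a transformer-specific variant is not stated on this page, so the function `lrp_epsilon` and the toy numbers are illustrative assumptions.

```python
from typing import List


def lrp_epsilon(
    a: List[float],
    w: List[List[float]],
    relevance_out: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """Propagate relevance through z_j = sum_i a[i] * w[i][j] (no bias).

    R_i = sum_j (a[i] * w[i][j]) / (z_j + eps * sign(z_j)) * R_j
    """
    n, m = len(a), len(relevance_out)
    z = [sum(a[i] * w[i][j] for i in range(n)) for j in range(m)]
    relevance_in = [0.0] * n
    for j in range(m):
        # The stabiliser keeps the denominator away from zero.
        sign = 1.0 if z[j] >= 0 else -1.0
        denom = z[j] + eps * sign
        for i in range(n):
            relevance_in[i] += a[i] * w[i][j] / denom * relevance_out[j]
    return relevance_in


# Toy layer with two inputs and two outputs (assumed values).
a = [1.0, 2.0]
w = [[1.0, -1.0], [0.5, 1.0]]
relevance_out = [1.0, 0.5]

relevance_in = lrp_epsilon(a, w, relevance_out)
print(relevance_in, sum(relevance_in), sum(relevance_out))
```

With a small `eps` and no bias term, the total relevance is approximately conserved from outputs to inputs. Conservation is a bookkeeping property of the rule; it does not by itself establish that the scores reflect causal importance, which is the assumption the ledger points to.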

Reference graph

Works this paper leans on

52 extracted references · 13 canonical work pages

[1] Deep inside convolutional networks: visualising image classification models and saliency maps. 2014.
[2] Ali, Ameen; Schnake, Thomas; Eberle, Oliver; Montavon, Gr. Proceedings of the 39th International Conference on Machine Learning. 2022.
[3] Understanding and improving layer normalization. Advances in Neural Information Processing Systems.
[4] Achtibat, Reduan; Hatefi, Sayed Mohammad Vakilzadeh; Dreyer, Maximilian; Jain, Aakriti; Wiegand, Thomas; Lapuschkin, Sebastian; Samek, Wojciech. 2024.
[5] On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE. 2015.
[6] A close look at decomposition-based XAI-methods for transformer language models. arXiv preprint arXiv:2502.15886.
[7] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[8] Brown, Tom; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared D; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel; Wu, Jeffrey; Winte... Language Models are Few-Shot Learners.
[9] What learning algorithm is in-context learning? Investigations with linear models. The Eleventh International Conference on Learning Representations.
[10] Transformers Learn In-Context by Gradient Descent. Proceedings of the 40th International Conference on Machine Learning. 2023.
[11] Davidson, Guy; Gureckis, Todd; Lake, Brenden; Williams, Adina. Do different prompting methods yield a common task representation in language models? Advances in Neural Information Processing Systems.
[12] The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[13] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[14] Function vectors in large language models. International Conference on Learning Representations.
[15] In-context Learning and Induction Heads. 2022.
[16] Which Attention Heads Matter for In-Context Learning? Forty-second International Conference on Machine Learning.
[17] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul F; Leike, Jan; Lowe,... Training language models to follow instructions with human feedback.
[18] Reynolds, Laria; McDonell, Kyle. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 2021. doi:10.1145/3411763.3451760
[19] The broader spectrum of in-context learning. 2025.
[20] Blevins, Terra; Levy, Omer; Zettlemoyer, Luke. Deep RNNs Encode Soft Hierarchical Syntax. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018. doi:10.18653/v1/P18-2003
[21] Belinkov, Yonatan; Durrani, Nadir; Dalvi, Fahim; Sajjad, Hassan; Glass, James. What do Neural Machine Translation Models Learn about Morphology? Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1080
[22] Lakretz, Yair; Kruszewski, German; Desbordes, Theo; Hupkes, Dieuwke; Dehaene, Stanislas; Baroni, Marco. The emergence of number and syntax units in LSTM language models. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pa... 2019. doi:10.18653/v1/n19-1002
[23] Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349
[24] Liu, Nelson F.; Gardner, Matt; Belinkov, Yonatan; Peters, Matthew E.; Smith, Noah A. Linguistic Knowledge and Transferability of Contextual Representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/n19-1112
[25] What do you learn from context? Probing for sentence structure in contextualized word representations. International Conference on Learning Representations.
[26] Petroni, Fabio; Rockt. Language Models as Knowledge Bases? Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1250
[27] Roberts, Adam; Raffel, Colin; Shazeer, Noam. How Much Knowledge Can You Pack Into the Parameters of a Language Model? Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.437
[28] Mikolov, Tomas; Yih, Wen-tau; Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
[29] Csordás, Róbert; Potts, Christopher; Manning, Christopher D; Geiger, Atticus. Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.17
[30] Merullo, Jack; Eickhoff, Carsten; Pavlick, Ellie. Language Models Implement Simple Word2Vec-style Vector Arithmetic. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.281
[31] Not All Language Model Features Are One-Dimensionally Linear. The Thirteenth International Conference on Learning Representations.
[32] The Linear Representation Hypothesis and the Geometry of Large Language Models. Causal Representation Learning Workshop at NeurIPS 2023. 2023.
[33] The Library of Babel. 1941.
[34] LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation. 2024.
[35] François Chollet.
[36] Hendel, Roee; Geva, Mor; Globerson, Amir. In-Context Learning Creates Task Vectors. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.624
[37] RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching. arXiv preprint arXiv:2508.21258.
[38] Syed, Aaquib; Rager, Can; Conmy, Arthur. Attribution Patching Outperforms Automated Circuit Discovery. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.25
[39] Language Model Circuits Are Sparse in the Neuron Basis. arXiv preprint arXiv:2601.22594.
[40] Pearl, Judea. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. 2001.
[41] Vig, Jesse; Gehrmann, Sebastian; Belinkov, Yonatan; Qian, Sharon; Nevo, Daniel; Singer, Yaron; Shieber, Stuart. Investigating Gender Bias in Language Models Using Causal Mediation Analysis.
[42] Voita, Elena; Talbot, David; Moiseev, Fedor; Sennrich, Rico; Titov, Ivan. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1580
[43] Michel, Paul; Levy, Omer; Neubig, Graham. Are Sixteen Heads Really Better than One?
[44] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insights. The Fourteenth International Conference on Learning Representations.
[45] What do Language Models Learn and When? The Implicit Curriculum Hypothesis. 2026.
[46] Geiger, Atticus; Lu, Hanson; Icard, Thomas; Potts, Christopher. Causal Abstractions of Neural Networks.
[47] From Weights to Activations: Is Steering the Next Frontier of Adaptation? 2026.
[48] Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 2003.
[49] Word translation without parallel data. International Conference on Learning Representations.
[50] Nguyen, Kim Anh; Schulte im Walde, Sabine; Vu, Ngoc Thang. Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 2017.
[51] Linearity of relation decoding in transformer language models. International Conference on Learning Representations.
[52] A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017.

[1] [1]

2014 , organization=

Deep inside convolutional networks: visualising image classification models and saliency maps , author=. 2014 , organization=

2014

[2] [2]

Proceedings of the 39th International Conference on Machine Learning , pages =

Ali, Ameen and Schnake, Thomas and Eberle, Oliver and Montavon, Gr. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

2022

[3] [3]

Advances in neural information processing systems , volume=

Understanding and improving layer normalization , author=. Advances in neural information processing systems , volume=

[4] [4]

2024 , editor =

Achtibat, Reduan and Hatefi, Sayed Mohammad Vakilzadeh and Dreyer, Maximilian and Jain, Aakriti and Wiegand, Thomas and Lapuschkin, Sebastian and Samek, Wojciech , booktitle =. 2024 , editor =

2024

[5] [5]

PloS one , volume=

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation , author=. PloS one , volume=. 2015 , publisher=

2015

[6] [6]

arXiv preprint arXiv:2502.15886 , year=

A close look at decomposition-based XAI-methods for transformer language models , author=. arXiv preprint arXiv:2502.15886 , year=

arXiv

[7] [7]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[8] [8]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

[9] [9]

The Eleventh International Conference on Learning Representations , year=

What learning algorithm is in-context learning? Investigations with linear models , author=. The Eleventh International Conference on Learning Representations , year=

[10] [10]

Proceedings of the 40th International Conference on Machine Learning , pages =

Transformers Learn In-Context by Gradient Descent , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

2023

[11] [11]

Do different prompting methods yield a common task representation in language models? , volume =

Davidson, Guy and Gureckis, Todd and Lake, Brenden and Williams, Adina , editor =. Do different prompting methods yield a common task representation in language models? , volume =. Advances in neural information processing systems , publisher =

[12] [12]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[13] [13]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv

[14] Function vectors in large language models. International Conference on Learning Representations.

[15] In-context Learning and Induction Heads. 2022.

[16] Which Attention Heads Matter for In-Context Learning? Forty-second International Conference on Machine Learning.

[17] Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,... Training language models to follow instructions with human feedback.

[18] Reynolds, Laria and McDonell, Kyle. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 2021. doi:10.1145/3411763.3451760

[19] The broader spectrum of in-context learning. 2025.

[20] Blevins, Terra and Levy, Omer and Zettlemoyer, Luke. Deep RNNs Encode Soft Hierarchical Syntax. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018. doi:10.18653/v1/P18-2003

[21] Belinkov, Yonatan and Durrani, Nadir and Dalvi, Fahim and Sajjad, Hassan and Glass, James. What do Neural Machine Translation Models Learn about Morphology? Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1080

[22] Lakretz, Yair and Kruszewski, German and Desbordes, Theo and Hupkes, Dieuwke and Dehaene, Stanislas and Baroni, Marco. The emergence of number and syntax units in LSTM language models. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pa... 2019. doi:10.18653/v1/n19-1002

[23] Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349

[24] Liu, Nelson F. and Gardner, Matt and Belinkov, Yonatan and Peters, Matthew E. and Smith, Noah A. Linguistic Knowledge and Transferability of Contextual Representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/n19-1112

[25] What do you learn from context? Probing for sentence structure in contextualized word representations. International Conference on Learning Representations.

[26] Petroni, Fabio and Rockt. Language Models as Knowledge Bases? Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1250

[27] Roberts, Adam and Raffel, Colin and Shazeer, Noam. How Much Knowledge Can You Pack Into the Parameters of a Language Model? Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.437

[28] Mikolov, Tomas and Yih, Wen-tau and Zweig, Geoffrey. Linguistic Regularities in Continuous Space Word Representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.

[29] Csordás, Róbert and Potts, Christopher and Manning, Christopher D and Geiger, Atticus. Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.17

[30] Merullo, Jack and Eickhoff, Carsten and Pavlick, Ellie. Language Models Implement Simple Word2Vec-style Vector Arithmetic. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.281

[31] Not All Language Model Features Are One-Dimensionally Linear. The Thirteenth International Conference on Learning Representations.

[32] The Linear Representation Hypothesis and the Geometry of Large Language Models. Causal Representation Learning Workshop at NeurIPS 2023.

[33] The Library of Babel. 1941.

[34] LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation. 2024.

[35] François Chollet.

[36] Hendel, Roee and Geva, Mor and Globerson, Amir. In-Context Learning Creates Task Vectors. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.624

[37] RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching. arXiv preprint arXiv:2508.21258.

[38] Syed, Aaquib and Rager, Can and Conmy, Arthur. Attribution Patching Outperforms Automated Circuit Discovery. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.25

[39] Language Model Circuits Are Sparse in the Neuron Basis. arXiv preprint arXiv:2601.22594.

[40] Pearl, Judea. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. 2001.

[41] Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Singer, Yaron and Shieber, Stuart. Investigating Gender Bias in Language Models Using Causal Mediation Analysis.

[42] Voita, Elena and Talbot, David and Moiseev, Fedor and Sennrich, Rico and Titov, Ivan. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1580

[43] Michel, Paul and Levy, Omer and Neubig, Graham. Are Sixteen Heads Really Better than One?

[44] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insights. The Fourteenth International Conference on Learning Representations.

[45] What do Language Models Learn and When? The Implicit Curriculum Hypothesis. 2026.

[46] Geiger, Atticus and Lu, Hanson and Icard, Thomas and Potts, Christopher. Causal Abstractions of Neural Networks.

[47] From Weights to Activations: Is Steering the Next Frontier of Adaptation? 2026.

[48] Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003.

[49] Word translation without parallel data. International Conference on Learning Representations.

[50] Nguyen, Kim Anh and Schulte im Walde, Sabine and Vu, Ngoc Thang. Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 2017.

[51] Linearity of relation decoding in transformer language models. International Conference on Learning Representations.

[52] A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017.