How Do Language Models Compose Functions?
Pith reviewed 2026-05-18 11:00 UTC · model grok-4.3
The pith
Language models solve two-hop tasks either by computing the intermediate result or by direct mapping, with embedding geometry deciding which path is taken.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern LLMs solve two-hop factual recall tasks expressed as g(f(x)) using either a compositional mechanism that computes the intermediate f(x) along the way or a direct mechanism with no detectable signature of f(x), and embedding space geometry determines which mechanism is employed.
What carries the argument
Linear decoding of residual stream activations to detect or rule out computation of the intermediate variable f(x) during two-hop factual recall.
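As a rough illustration of what linearly decoding an intermediate variable involves, the sketch below fits a difference-of-means linear probe on synthetic activation vectors. Everything in it is an assumption made for illustration: the synthetic data, the binary stand-in for f(x), and the helper names (`make_activations`, `fit_linear_probe`, `probe_accuracy`) are hypothetical and are not taken from the paper's released code.

```python
import random

def make_activations(n, dim, signal, rng):
    """Synthetic stand-ins for residual-stream vectors with a binary f(x) label."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # When signal > 0, the label is written along coordinate 0.
        vec[0] += signal if label == 1 else -signal
        data.append((vec, label))
    return data

def fit_linear_probe(train):
    """Difference-of-means linear probe; returns (weights, bias)."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias

def probe_accuracy(probe, test):
    """Fraction of held-out vectors whose label the probe recovers."""
    w, bias = probe
    hits = 0
    for vec, label in test:
        score = sum(wi * xi for wi, xi in zip(w, vec)) + bias
        hits += int((score > 0) == (label == 1))
    return hits / len(test)

rng = random.Random(0)
# Case 1: the intermediate variable is linearly encoded.
probe = fit_linear_probe(make_activations(400, 8, 2.0, rng))
acc_with_signal = probe_accuracy(probe, make_activations(400, 8, 2.0, rng))
# Case 2: no linear trace of the intermediate variable.
probe = fit_linear_probe(make_activations(400, 8, 0.0, rng))
acc_without_signal = probe_accuracy(probe, make_activations(400, 8, 0.0, rng))
print(acc_with_signal, acc_without_signal)
```

Held-out accuracy well above chance is the kind of evidence that would count as a "detectable signature"; accuracy near chance only shows that this particular linear readout found nothing.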
If this is right
- Tasks whose embedding-space representation is a direct translation from x to g(f(x)) tend to be solved by the direct mechanism.
- The compositionality gap arises when models can reach the answer without computing the intermediate step.
- Models can employ different mechanisms for different tasks depending on how those tasks sit in embedding space.
- Idiomatic factual recall favors direct solving when geometry permits it.
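One way to make "a direct translation from x to g(f(x)) in embedding space" concrete is to fit a single offset vector and ask how much of the variation in the target embeddings it explains. The sketch below does this on synthetic vectors. The R²-style score, the data, and the function name `translation_fit` are assumptions for illustration; the paper's own geometry measure may differ.

```python
import random

def translation_fit(xs, ys):
    """Fit y ~ x + t with t = mean(y - x); return an R^2-style score."""
    n, dim = len(xs), len(xs[0])
    t = [sum(y[i] - x[i] for x, y in zip(xs, ys)) / n for i in range(dim)]
    y_mean = [sum(y[i] for y in ys) / n for i in range(dim)]
    # Squared error left after applying the shared offset t.
    resid = sum((y[i] - x[i] - t[i]) ** 2
                for x, y in zip(xs, ys) for i in range(dim))
    # Squared spread of the targets around their mean.
    total = sum((y[i] - y_mean[i]) ** 2 for y in ys for i in range(dim))
    return 1.0 - resid / total

rng = random.Random(0)
dim, n = 16, 200
xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
offset = [rng.gauss(0, 1) for _ in range(dim)]

# Task A: targets are inputs shifted by one shared offset plus small noise.
ys_translation = [[xi + oi + rng.gauss(0, 0.1) for xi, oi in zip(x, offset)]
                  for x in xs]
# Task B: targets are unrelated to the inputs.
ys_unrelated = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

score_translation = translation_fit(xs, ys_translation)
score_unrelated = translation_fit(xs, ys_unrelated)
print(score_translation, score_unrelated)
```

A score near 1 means one offset vector carries x to the target; the score can go negative when applying the offset does worse than predicting the mean target.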
Where Pith is reading between the lines
- Changing embedding geometry through training data or fine-tuning might shift models toward more compositional processing on novel inputs.
- Direct mechanisms could be more brittle when the model encounters slight variations in how facts are phrased.
- The same split between compositional and direct routes may appear in other multi-step reasoning problems such as multi-hop question answering or arithmetic.
- More powerful detection methods beyond linear probes could reveal hidden compositional structure even in cases currently classified as direct.
Load-bearing premise
Linear probes on residual-stream activations are sufficient to detect whether the intermediate f(x) was computed, even when it is held in a non-linear or distributed form.
What would settle it
A two-hop task where linear probes show no trace of f(x) yet a more complete analysis reveals a non-linear representation of f(x) that the model actually uses to reach the correct output.
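A toy case shows how such a counterexample could look. In the sketch below, a binary variable is stored as the XOR of two sign bits, so a linear probe sits near chance while a probe given one non-linear feature (the product of the two coordinates) decodes it. The data and helper names are hypothetical, and the product-feature probe is a deliberately simple stand-in for the small MLP probes mentioned in the rebuttal, not the authors' method.

```python
import random

def make_xor_activations(n, dim, rng):
    """Label is 1 when the signs of coordinates 0 and 1 agree (XOR-style code)."""
    data = []
    for _ in range(n):
        s0, s1 = rng.choice([-1, 1]), rng.choice([-1, 1])
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] = 2.0 * s0 + rng.gauss(0.0, 0.3)
        vec[1] = 2.0 * s1 + rng.gauss(0.0, 0.3)
        data.append((vec, 1 if s0 == s1 else 0))
    return data

def with_product_feature(data):
    """Append the non-linear feature vec[0] * vec[1] to every vector."""
    return [(vec + [vec[0] * vec[1]], label) for vec, label in data]

def fit_probe(train):
    """Difference-of-means linear readout; returns (weights, bias)."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    bias = -sum(wi * (a + b) / 2 for wi, a, b in zip(w, mu1, mu0))
    return w, bias

def accuracy(probe, test):
    w, bias = probe
    hits = 0
    for vec, label in test:
        score = sum(wi * xi for wi, xi in zip(w, vec)) + bias
        hits += int((score > 0) == (label == 1))
    return hits / len(test)

rng = random.Random(0)
train = make_xor_activations(400, 6, rng)
test = make_xor_activations(400, 6, rng)

# Linear readout on the raw vectors.
linear_acc = accuracy(fit_probe(train), test)
# Same readout after adding one non-linear feature.
product_acc = accuracy(fit_probe(with_product_feature(train)),
                       with_product_feature(test))
print(linear_acc, product_acc)
```

If a model's "direct" tasks behaved like this toy case, a null result from linear probes would understate how much intermediate structure is present.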
Original abstract
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap", i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. We then decode residual stream representations and identify two processing mechanisms: one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that embedding space geometry is strongly related to which mechanism is employed, where the idiomatic mechanism is dominant when tasks are represented by translations from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions.
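One common way to quantify the compositionality gap is the share of examples where the model answers both hops correctly yet fails the composed query. The sketch below computes that ratio from per-example correctness flags. The record format, the toy values, and the function name `compositionality_gap` are assumptions for illustration and are not drawn from the released code.

```python
def compositionality_gap(records):
    """records: iterable of (hop1_correct, hop2_correct, composed_correct).

    Returns the fraction of examples with both hops correct for which
    the composed query g(f(x)) is answered incorrectly.
    """
    both_hops = [r for r in records if r[0] and r[1]]
    if not both_hops:
        raise ValueError("no example has both hops correct")
    failures = sum(1 for r in both_hops if not r[2])
    return failures / len(both_hops)

# Toy correctness flags (made up): z = f(x), y = g(z), y = g(f(x)).
records = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (True, True, True),
    (False, True, False),
    (True, True, False),
]
gap = compositionality_gap(records)
print(gap)
```

Conditioning on both hops being correct separates failures of composition from plain gaps in factual knowledge.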
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether LLMs solve two-hop factual recall tasks of the form g(f(x)) using compositional mechanisms. It first verifies the compositionality gap in modern models. It then uses linear probes on residual stream activations to identify two mechanisms: compositional (where f(x) is computed and linearly decodable en route to g(f(x))) versus direct (no detectable signature of the intermediate f(x)). It reports that embedding-space geometry strongly predicts which mechanism is used, with idiomatic translations from x to g(f(x)) favoring the direct route. Data and code are released.
Significance. If the reported mechanistic distinction and geometry correlation hold under more robust tests, the work would advance mechanistic interpretability by providing an observational taxonomy of how LLMs handle composition and a potential geometric predictor of internal strategy. The public release of code and data is a clear strength that enables follow-up work.
major comments (2)
- [§4.2] §4.2 (Probe Analysis): The partition into compositional versus direct mechanisms rests on linear probes failing to recover f(x) in the direct class. This does not rule out non-linear or distributed representations of the intermediate variable, which could still reflect internal composition without triggering the chosen probes and would collapse the reported dichotomy.
- [§5] §5 (Geometry Correlation): The claimed strong relation between embedding geometry and mechanism choice lacks controls for confounds such as task frequency or lexical overlap; without these, it is unclear whether geometry is causal or merely correlated with the probe outcomes.
minor comments (2)
- [§3] Notation for the residual stream positions and layer indices is introduced without a consolidated table, making it hard to track which activations are probed at each step.
- [Figure 2] Figure 2 caption does not specify the exact statistical test or multiple-comparison correction used for the reported significance levels.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, making revisions where we agree that additional analysis or clarification is warranted.
Point-by-point responses
-
Referee: [§4.2] §4.2 (Probe Analysis): The partition into compositional versus direct mechanisms rests on linear probes failing to recover f(x) in the direct class. This does not rule out non-linear or distributed representations of the intermediate variable, which could still reflect internal composition without triggering the chosen probes and would collapse the reported dichotomy.
Authors: We thank the referee for pointing this out. Our classification into compositional and direct mechanisms is based on the detectability of the intermediate f(x) using linear probes on the residual stream activations, following standard practices in mechanistic interpretability. While this does not preclude the existence of non-linear representations, the absence of a linear signature is a meaningful distinction for our taxonomy. In the revised manuscript, we will expand §4.2 to explicitly discuss this limitation and note that our 'direct' category means no linearly decodable intermediate. We will also add results from non-linear probes (e.g., small MLPs) in the appendix to test for more complex representations, though these may not alter the main conclusions. revision: partial
-
Referee: [§5] §5 (Geometry Correlation): The claimed strong relation between embedding geometry and mechanism choice lacks controls for confounds such as task frequency or lexical overlap; without these, it is unclear whether geometry is causal or merely correlated with the probe outcomes.
Authors: We agree that establishing the relationship between embedding geometry and mechanism choice would benefit from controls for potential confounds. In the original analysis, we focused on the geometric properties as an observational correlate. For the revision, we will introduce controls by subsampling tasks to balance for frequency and lexical overlap, and re-evaluate the correlation strength. This will be added to §5, along with a discussion on whether geometry appears to be a robust predictor independent of these factors. We believe this will strengthen the claim without overclaiming causality. revision: yes
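The proposed control can be sketched as stratified subsampling: within each frequency bin, keep equally many tasks from each mechanism class, then recompute the geometry correlation on the balanced subset. The task records, the bin labels, the simulated confound, and the helper names below are all hypothetical; the authors' actual procedure is not specified beyond the description above.

```python
import random
from collections import defaultdict

def balanced_subsample(tasks, rng):
    """Within each frequency bin, keep equally many tasks per mechanism."""
    by_bin = defaultdict(lambda: defaultdict(list))
    for t in tasks:
        by_bin[t["freq_bin"]][t["mechanism"]].append(t)
    kept = []
    for groups in by_bin.values():
        if len(groups) < 2:
            continue  # only one mechanism in this bin, so it cannot be balanced
        k = min(len(g) for g in groups.values())
        for g in groups.values():
            kept.extend(rng.sample(g, k))
    return kept

def pearson(xs, ys):
    """Pearson correlation (point-biserial when ys is binary)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
tasks = []
for _ in range(600):
    freq_bin = rng.choice(["low", "mid", "high"])
    geometry = rng.random()  # made-up translation-fit score in [0, 1)
    # Simulated confound: high-frequency tasks are more often "direct".
    p_direct = 0.15 + 0.5 * geometry + (0.25 if freq_bin == "high" else 0.0)
    mechanism = "direct" if rng.random() < p_direct else "compositional"
    tasks.append({"freq_bin": freq_bin, "geometry": geometry,
                  "mechanism": mechanism})

kept = balanced_subsample(tasks, rng)
r_balanced = pearson(
    [t["geometry"] for t in kept],
    [1.0 if t["mechanism"] == "direct" else 0.0 for t in kept],
)
print(len(kept), r_balanced)
```

Balancing removes the dependence between frequency bin and mechanism in the subsample, so any remaining correlation cannot be attributed to that one factor. It does not establish causality and does not address unmeasured confounds.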
Circularity Check
No significant circularity in observational activation analysis
The paper's central findings rest on empirical experiments: confirming the compositionality gap via task performance, then using linear probes on residual stream activations to classify tasks as compositional (detectable f(x)) or direct (no detectable signature), followed by correlation with embedding geometry. These steps are observational classifications and measurements rather than derivations that reduce to inputs by construction. No equations or self-referential definitions appear; no fitted parameters are relabeled as predictions; no load-bearing self-citations or uniqueness theorems are invoked. The analysis is self-contained against external benchmarks such as probe-based decoding experiments and does not rely on prior author work to force its taxonomy or conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Residual stream activations can be linearly decoded to detect the presence or absence of intermediate variables.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We identify two processing mechanisms: one which solves tasks compositionally, computing f(x) along the way to g(f(x)), and one which solves them directly, without any detectable signature of the intermediate variable f(x).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson
URL https://dl.acm.org/doi/10.1145/3132847.3132921. Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. Empirical Methods in Natural Language Processing (EMNLP),
-
[2]
URLhttps://arxiv.org/pdf/2406.12775. Susan Carey. Précis of The Origin of Concepts.Behavioral and Brain Sciences (BBS),
-
[3]
URL https://doi.org/10.1017/S0140525X10000919. Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and Fate: Limits of Transformers on Compositionality.Neural Information Processing Systems (NeurIPS),
-
[4]
Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum
URL https://arxiv.org/pdf/2305.18654. Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Library Learning for Neurally-Guided Bayesian Program Induction.Neural Information Processing Systems (NeurIPS),
-
[5]
URLhttps://dl.acm.org/doi/10.5555/3327757.3327878. Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning.Philosophical Transactions of the Royal Society A,
-
[6]
URLhttps://arxiv.org/pdf/2006.08381. Katrin Erk. Vector Space Models of Word Meaning and Phrase Meaning: A Survey.Language and Linguistics Compass,
-
[7]
URLhttps://doi.org/10.1002/lnco.362. Jonathan St BT Evans. Logic and Human Reasoning: An Assessment of the Deduction Paradigm. Psychological Bulletin,
-
[8]
Jerry A Fodor.The Language of Thought
URLhttps://doi.org/10.1037/0033-2909.128.6.978. Jerry A Fodor.The Language of Thought. Harvard University Press,
-
[9]
Daniel Furrer, Marc van Zee, Nathan Scales, and Nathanael Schärli
URLhttps://doi.org/10.1016/0010-0277(88)90031-5. Daniel Furrer, Marc van Zee, Nathan Scales, and Nathanael Schärli. Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures,
-
[10]
Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg
URL https://arxiv.org/pdf/2007.08970. Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Empirical Methods in Natural Language Processing (EMNLP),
-
[11]
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva
URLhttps://arxiv.org/pdf/2203.14680. Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models.International Conference on Machine Learning (ICML),
-
[12]
Patchscopes: A unifying framework for inspecting hidden representations of language models
URLhttps://arxiv.org/pdf/2401.06102. Thomas L. Griffiths, Brenden M. Lake, R. Thomas McCoy, Ellie Pavlick, and Taylor W. Webb. Whither symbols in the era of advanced neural networks?,
-
[13]
URL https://arxiv.org/pdf/2508.05776. Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of Relation Decoding in Transformer Language Models. International Conference on Learning Representations (ICLR),
-
[14]
Linearity of relation decoding in transformer language models
URL https://arxiv.org/pdf/2308.09124. John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. North American Chapter of the Association for Computational Linguistics (NAACL),
- [15]
-
[16]
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
URL https://arxiv.org/pdf/1612.06890. Daniel Kahneman and Amos Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology,
-
[17]
URLhttps://doi.org/10.1016/0010-0285(72)90016-3. Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data...
-
[18]
URLhttps://arxiv.org/pdf/1912.09713. Najoung Kim and Tal Linzen. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation.Empirical Methods in Natural Language Processing (EMNLP),
-
[19]
URL https://arxiv.org/pdf/2010.05465. Brenden Lake and Marco Baroni. Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks.International Conference on Machine Learning (ICML),
-
[20]
URLhttps://arxiv.org/pdf/1711.00350. Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building Machines That Learn and Think Like People.Behavioral and Brain Sciences (BBS),
-
[21]
Building Machines That Learn and Think Like People
URL https://arxiv.org/pdf/1604.00289. Andrew K Lampinen, Ishita Dasgupta, Stephanie C Y Chan, Hannah R Sheahan, Antonia Creswell, Dharshan Kumaran, James L McClelland, and Felix Hill. Language models, like humans, show content effects on reasoning tasks.PNAS Nexus,
- [22]
-
[23]
Zhaoyi Li, Gangwei Jiang, Hong Xie, Linqi Song, Defu Lian, and Ying Wei
URLhttps://arxiv.org/pdf/2410.02102. Zhaoyi Li, Gangwei Jiang, Hong Xie, Linqi Song, Defu Lian, and Ying Wei. Understanding and Patching Compositional Reasoning in LLMs.Findings of the Association for Computational Linguistics (ACL),
-
[24]
Meng Lu, Ruochen Zhang, Carsten Eickhoff, and Ellie Pavlick
URLhttps://arxiv.org/pdf/2402.14328. Meng Lu, Ruochen Zhang, Carsten Eickhoff, and Ellie Pavlick. Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline,
-
[25]
URLhttps://arxiv.org/pdf/2212.07796. R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. Embers of autoregression show how large language models are shaped by the problem they are trained to solve.Proceedings of the National Academy of Sciences (PNAS),
-
[26]
Kate McCurdy, Paul Soulos, Paul Smolensky, Roland Fernandez, and Jianfeng Gao
URL https://arxiv.org/pdf/2309.13638. Kate McCurdy, Paul Soulos, Paul Smolensky, Roland Fernandez, and Jianfeng Gao. Toward Compositional Behavior in Neural Models: A Survey of Current Views. Empirical Methods in Natural Language Processing (EMNLP),
-
[27]
URL https://aclanthology.org/2024.emnlp-main.524.pdf. Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language Models Implement Simple Word2Vec-style Vector Arithmetic. North American Chapter of the Association for Computational Linguistics (NAACL),
-
[28]
Language models implement simple word2vec-style vector arithmetic, 2024
URLhttps://arxiv.org/pdf/2305.16130. Jack Merullo, Noah A. Smith, Sarah Wiegreffe, and Yanai Elazar. On Linear Representations and Pretraining Data Frequency in Language Models.International Conference on Learning Representations (ICLR),
-
[29]
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig
URLhttps://arxiv.org/pdf/2504.12459. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations.North American Chapter of the Association for Computational Linguistics (NAACL),
-
[30]
Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network
URLhttps://arxiv.org/pdf/1701.02962. nostalgebraist. interpreting GPT: the logit lens,
-
[31]
Measuring and Narrowing the Compositionality Gap in Language Models
URLhttps://arxiv.org/pdf/2210.03350. Jake Quilty-Dunn, Nicolas Porot, and Eric Mandelbaum. The best game in town: The reemergence of the language-of-thought hypothesis across the cognitive sciences.Behavioral and Brain Sciences (BBS),
-
[32]
Jacob Russin, Sam Whitman McGrath, Danielle J
URLhttps://doi.org/10.1017/S0140525X22002849. Jacob Russin, Sam Whitman McGrath, Danielle J. Williams, and Lotem Elber-Dorozko. From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks,
-
[33]
Jacob Russin, Ellie Pavlick, and Michael J Frank
URL https://arxiv.org/pdf/2405.15164. Jacob Russin, Ellie Pavlick, and Michael J Frank. Parallel trade-offs in human cognition and neural networks: The dynamic interplay between in-context and in-weight learning.Proceedings of the National Academy of Sciences (PNAS),
-
[34]
Yuval Shalev, Amir Feder, and Ariel Goldstein
URL https://www.pnas.org/doi/10.1073/pnas.2510270122. Yuval Shalev, Amir Feder, and Ariel Goldstein. Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning,
-
[35]
URLhttps://arxiv.org/pdf/2406.13858. Zoltán Gendler Szabó. Compositionality. InThe Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Fall
-
[36]
Bert rediscovers the classical nlp pipeline.arXiv preprint arXiv:1905.05950,
URL https://arxiv.org/pdf/1905.05950. Jörg Tiedemann and Santhosh Thottingal. OPUS-MT – Building open translation services for the World. European Association for Machine Translation (EAMT),
-
[37]
URL https://aclanthology.org/2020.eamt-1.61.pdf. Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function Vectors in Large Language Models. International Conference on Learning Representations (ICLR),
-
[38]
Function vectors in large language models
URL https://arxiv.org/pdf/2310.15213. Ivan Vegner, Sydelle de Souza, Valentin Forch, Martha Lewis, and Leonidas A. A. Doumas. Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey. Association for Computational Linguistics (ACL),
-
[39]
URL https://arxiv.org/pdf/2506.04461. Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. Neural Information Processing Systems (NeurIPS),
-
[40]
Denny Vrandečić and Markus Krötzsch
URL https://arxiv.org/pdf/2004.12265. Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM (CACM),
-
[41]
Boshi Wang, Xiang Yue, Yu Su, and Huan Sun
URLhttps://dl.acm.org/doi/10.1145/2629489. Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization.Neural Information Processing Systems (NeurIPS),
-
[42]
Martin Wattenberg and Fernanda B
URLhttps://arxiv.org/pdf/2405.15071. Martin Wattenberg and Fernanda B. Viégas. Relational Composition in Neural Networks: A Survey and Call to Action.Mechanistic Interpretability Workshop at ICML,
-
[43]
URL https://arxiv.org/pdf/2407.14662. Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do Large Language Models Latently Perform Multi-Hop Reasoning? Association for Computational Linguistics (ACL), 2024a. URL https://arxiv.org/pdf/2402.16837. Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, and Mor Geva. Do Large ...
-
[44]
Hangyeol Yu, Myeongho Jeong, Jamin Shin, Hyeongdon Moon, Juneyoung Park, and Seungtaek Choi
URLhttps://arxiv.org/pdf/2505.14530. Hangyeol Yu, Myeongho Jeong, Jamin Shin, Hyeongdon Moon, Juneyoung Park, and Seungtaek Choi. Towards Zero-Shot Functional Compositionality of Language Models,
-
[45]
Zeping Yu, Yonatan Belinkov, and Sophia Ananiadou
URL https://arxiv.org/pdf/2303.03103. Zeping Yu, Yonatan Belinkov, and Sophia Ananiadou. Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models. Empirical Methods in Natural Language Processing (EMNLP),
-
[46]
URLhttps://arxiv.org/pdf/2502.10835. Yanli Zhou, Brenden M. Lake, and Adina Williams. Compositional learning of functions in humans and machines.Annual Meeting of the Cognitive Science Society (CogSci),
-
[47]
URL https://arxiv.org/pdf/2403.12201. A Data Creation. Table 2: List of our tasks, showing x, g(x), and f(g(x)) for the random example in Table
-
[48]
Tasks with neither g(x) nor f(g(x)) are omitted. f(g(x)) only shown if distinct from g(f(x)).
f | g | x | g(x) | f(g(x))
Word→Antonym | English→Spanish | bogus | false | —
Word→Antonym | English→German | philosophical | philosophisch | —
Word→Antonym | English→French | excessive | excessive | —
x + 10 | 2x | 699 | 1398 | 1408
x + 100 | 2x | 922 | 1844 | 1944
x mod 20 | 2x | 891 | 1782 | 2
Word→Numeric | 2x | one hundred ...
-
[49]
and [ modern] in this example. Representational analysis. In Sec. 4, we analyze the model’s computation from x → g(f(x)). Consider the query for “Heartbreak Hotel” → “1935”: i.e. “... Q: Heartbreak Hotel \n A: ”. Here, multiple tokens ([ Heart][break][ Hotel][ \][n][ A:][ ]) are central to the computation. We therefore analyze all residual streams for th...
-
[50]
instead of logit lens. This method is proposed as one that is more closely aligned with a language model’s computation than other methods (such as logit lens). We would specifically like to use this method to decode a representation into vocabulary-space logits. To do so, we prompt a model with the “token identity prompt”, in which random tokens are repea...