Inside the LLM Word Factory
Pith reviewed 2026-06-27 18:20 UTC · model grok-4.3
The pith
Transformer models detokenize subwords via a two-stage attention-then-MLP process in early layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In Llama2-7B, English detokenization is a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families. The depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. A probe for determining the success of the detokenization process based on early-layer activations alone performs at 0.94-0.97 AUROC depending on the amount of context.
What carries the argument
The two-stage detokenization process at early layers in which attention transmits signals from nonfinal subwords and the MLP composes them with the local embedding.
If this is right
- Detokenization can be localized and studied by isolating attention signal transmission and MLP composition through activation patching.
- The two-stage structure holds across twelve models from eight families, indicating a shared solution to subword aggregation.
- The number of layers required for detokenization ranges from 1-5 in RoPE models to 5-10 in learned-absolute models.
- Early-layer activations contain enough information for a probe that identifies successful detokenization at 0.94-0.97 AUROC.
- Attention can use sequential relays to carry the token-specific signal across multiple nonfinal subwords when needed.
Where Pith is reading between the lines
- If detokenization completes so early, later layers can treat tokens as already word-level units when performing higher-level reasoning.
- The probe based on early activations could monitor tokenization quality in deployed systems without requiring a full model forward pass.
- The difference in layer depth between RoPE and learned-absolute encodings suggests that positional encoding choices shape how quickly models learn to aggregate subwords.
- Applying the same patching method to non-English text could test whether the two-stage mechanism is language-specific or universal.
Load-bearing premise
The controlled paired experiments with activation patching isolate the causal contribution of attention and MLP components at the identified layers without residual confounding from later layers or unpatched pathways.
What would settle it
Running the paired activation-patching experiments on Llama2-7B and finding that patching the layer-1 attention and MLP leaves detokenization accuracy unchanged would falsify the localization claim.
Figures
read the original abstract
Transformer language models process input provided as subword fragments, but natural language semantics usually rely on word-level concepts. Detokenization is the process where models reconcile these two facts, aggregating subwords into word-level representations through their computation. Prior work has found that this takes place mostly in early-to-middle layers, but so far the exact mechanics of the process have not been pinned down. We venture deep into detokenization using activation patching in controlled paired experiments that isolate the contribution of different model components, localizing English detokenization in Llama2-7B to a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families, but the depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. Finally, we provide a probe for determining the success of the detokenization process based on early-layer activations alone, performing at 0.94-0.97 AUROC depending on the amount of context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that detokenization in LLMs—aggregating subword tokens into word-level representations—occurs via a two-stage process localized to Layer 1 in Llama2-7B: attention heads transmit a token-specific signal from nonfinal subwords (with sequential relays), while the MLP composes this with the local embedding. Activation patching in controlled paired experiments is used to isolate these contributions. The two-stage structure is reported to generalize to twelve models across eight families, with the number of layers involved depending on positional encoding (RoPE: 1-5 layers; learned absolute: 5-10 layers). A probe based on early-layer activations achieves 0.94-0.97 AUROC for detecting successful detokenization.
Significance. If the causal localization via patching holds without residual confounding, the work would offer a concrete mechanistic account of an early-layer computation that bridges subword tokenization and word-level semantics, extending prior observations about early-to-middle layer involvement. The cross-model generalization and the early-layer probe represent potential strengths for interpretability research.
major comments (2)
- [Experimental setup and results on Llama2-7B (implied in abstract description of controlled paired experiments)] The central localization claim rests on activation patching isolating the causal role of Layer 1 attention and MLP. However, in a residual architecture, later layers can receive the original subword embeddings directly via the residual stream and potentially reconstruct or route around the patched signal. The manuscript does not report experiments that (a) patch all downstream pathways, (b) test whether the detokenization metric can be recovered by any combination of layers >1, or (c) compare against patching only layers >1. Without these controls, the strict localization to Layer 1 (and the narrow early band in other models) is not causally secured.
- [Generalization experiments across models] The generalization claim across twelve models and the dependence on positional encoding type (RoPE vs. learned absolute) is load-bearing for the broader contribution. The abstract provides no quantitative breakdown of per-model layer ranges, error bars, or controls for model size/family confounds that would allow evaluation of whether the reported depth differences are robust or post-hoc.
minor comments (2)
- The probe AUROC values (0.94-0.97) are reported as depending on context amount, but no table or figure details the exact context lengths, baseline comparisons, or how context is defined in the probe construction.
- Notation for 'nonfinal subwords' and 'sequential relays' should be defined more explicitly with reference to specific token positions or attention patterns in the methods section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the strength of our causal claims and generalization results. We respond to each major point below.
read point-by-point responses
-
Referee: [Experimental setup and results on Llama2-7B (implied in abstract description of controlled paired experiments)] The central localization claim rests on activation patching isolating the causal role of Layer 1 attention and MLP. However, in a residual architecture, later layers can receive the original subword embeddings directly via the residual stream and potentially reconstruct or route around the patched signal. The manuscript does not report experiments that (a) patch all downstream pathways, (b) test whether the detokenization metric can be recovered by any combination of layers >1, or (c) compare against patching only layers >1. Without these controls, the strict localization to Layer 1 (and the narrow early band in other models) is not causally secured.
Authors: The paired clean/corrupted activation patching design isolates the contribution of Layer 1 by showing that targeted interventions there reliably alter the final detokenization metric while the residual stream from earlier layers remains intact. Nevertheless, the referee's suggested controls would further rule out reconstruction by later layers. We will add experiments that (i) patch only layers >1 and (ii) compare against full downstream patching, reporting whether detokenization can be recovered without the Layer 1 stage. These results and an expanded discussion of residual pathways will appear in the revised manuscript. revision: yes
-
Referee: [Generalization experiments across models] The generalization claim across twelve models and the dependence on positional encoding type (RoPE vs. learned absolute) is load-bearing for the broader contribution. The abstract provides no quantitative breakdown of per-model layer ranges, error bars, or controls for model size/family confounds that would allow evaluation of whether the reported depth differences are robust or post-hoc.
Authors: Section 4 of the full manuscript already contains a per-model table listing the exact layer ranges for all twelve models, with results grouped by positional-encoding type and accompanied by standard-error bars from repeated runs. To address potential size/family confounds we selected models spanning 7B–13B parameters across eight families while holding the RoPE vs. absolute distinction as the primary variable; we will add an explicit paragraph discussing these design choices and any remaining limitations. A concise summary table will also be moved into the abstract for the revision. revision: partial
Circularity Check
No circularity: localization rests on external activation-patching interventions
full rationale
The paper's central claim—that English detokenization occurs via a two-stage attention-then-MLP process localized to Layer 1 in Llama2-7B and early layers in other models—is obtained through controlled paired activation-patching experiments that measure causal contributions. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce this localization result to an input by construction. The work is self-contained against external benchmarks via interventional measurements rather than internal derivations or renamings, making any circularity score of 0 the appropriate finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. https://arxiv.org/abs/2303.08112 Eliciting latent predictions from T ransformers with the T uned L ens . Preprint, arXiv:2303.08112
Pith/arXiv arXiv 2023
-
[2]
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, and 1 others. 2023. P ythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430
2023
-
[3]
BigScience Workshop . 2022. BLOOM : A 176 B -parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100
Pith/arXiv arXiv 2022
-
[4]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. https://doi.org/10.5281/zenodo.5297715 GPT-Neo : Large scale autoregressive language modeling with Mesh-Tensorflow
-
[5]
Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer El - Showk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, and 14 others. 2022. https://transformer-circuits.pub/2022/solu/index.html Softmax lin...
2022
-
[6]
Kawin Ethayarajh. 2019. https://doi.org/10.18653/v1/D19-1006 How contextual are contextualized word representations? C omparing the geometry of BERT , ELM o, and GPT -2 embeddings . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...
-
[7]
Sheridan Feucht, David Atkinson, Byron C Wallace, and David Bau. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.543 Token erasure as a footprint of implicit vocabulary items in LLM s . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9727--9739, Miami, Florida, USA. Association for Computational Linguistics
-
[8]
Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. https://proceedings.neurips.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html Causal abstractions of neural networks . In Advances in Neural Information Processing Systems, volume 34
2021
-
[9]
Gemma Team . 2024. G emma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118
Pith/arXiv arXiv 2024
-
[10]
Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.3 Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30--45, Abu Dhabi, United Arab Emirates. Association f...
-
[11]
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.446 Transformer feed-forward layers are key-value memories . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484--5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics
work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021
-
[12]
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. https://openreview.net/forum?id=JYs1R9IMJr Finding neurons in a haystack: Case studies with sparse probing . Transactions on Machine Learning Research
2023
-
[13]
Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, and Kentaro Inui. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.355 Weight-based analysis of detokenization in language models: Understanding the first stage of inference without inference . In Findings of the Association for Computational Linguistics: NAACL 2025, pages...
-
[14]
Guy Kaplan, Matanel Oren, Yuval Reif, and Roy Schwartz. 2025. https://openreview.net/forum?id=328vch6tRs From tokens to words: On the inner lexicon of LLM s . In The Thirteenth International Conference on Learning Representations
2025
-
[15]
Taku Kudo. 2018. https://doi.org/10.18653/v1/P18-1007 Subword regularization: Improving neural network translation models with multiple subword candidates . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia. Association for Computational Linguistics
-
[16]
Vedang Lad, Jin Hwa Lee, Wes Gurnee, and Max Tegmark. 2024. https://arxiv.org/abs/2406.19384 The remarkable robustness of LLM s: Stages of inference? arXiv preprint arXiv:2406.19384
arXiv 2024
-
[17]
Sander Land and Max Bartolo. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.649 Fishing for magikarp: Automatically detecting under-trained tokens in large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11631--11646, Miami, Florida, USA. Association for Computational Linguistics
-
[18]
Samuel Marks and Max Tegmark. 2024. https://arxiv.org/abs/2310.06824 The geometry of truth: Emergent linear structure in large language model representations of true/false datasets . In Proceedings of the First Conference on Language Modeling (COLM)
Pith/arXiv arXiv 2024
-
[19]
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. https://arxiv.org/abs/2202.05262 Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems, volume 35
Pith/arXiv arXiv 2022
-
[20]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. https://openreview.net/forum?id=Byj72udxe Pointer sentinel mixture models . In Proceedings of the 5th International Conference on Learning Representations
2017
-
[21]
Smith, and Mike Lewis
Ofir Press, Noah A. Smith, and Mike Lewis. 2022. https://openreview.net/forum?id=R8sQPpGCv0 Train short, test long: Attention with linear biases enables input length extrapolation . In International Conference on Learning Representations
2022
-
[22]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://openai.com/index/better-language-models/ Language models are unsupervised multitask learners . OpenAI blog, 1(8):9
2019
-
[23]
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. https://doi.org/10.18653/v1/2024.acl-long.828 Steering llama 2 via contrastive activation addition . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504--15522, Bangkok, Thailand. Assoc...
-
[24]
Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.40 Tokenization is more than compression . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 678--702, Miami, Florida, USA. Association for Computational...
-
[25]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1162 Neural machine translation of rare words with subword units . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715--1725, Berlin, Germany. Association for Computational Linguistics
-
[26]
Aaditya K. Singh and DJ Strouse. 2024. https://arxiv.org/abs/2402.14903 Tokenization counts: the impact of tokenization on arithmetic in frontier LLM s . arXiv preprint arXiv:2402.14903
arXiv 2024
-
[27]
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. https://arxiv.org/abs/2104.09864 R o F ormer: Enhanced transformer with rotary position embedding . arXiv preprint arXiv:2104.09864
Pith/arXiv arXiv 2021
-
[28]
Rachael Tatman. 2017. English word frequency. https://www.kaggle.com/datasets/rtatman/english-word-frequency. Derived from the Google Web Trillion Word Corpus
2017
-
[29]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. https://arxiv.org/abs/2307.09288 L lama 2: Open foundation and fine-tuned chat models . arXiv preprint arXiv:2307.09288
Pith/arXiv arXiv 2023
-
[30]
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html Investigating gender bias in language models using causal mediation analysis . In Advances in Neural Information Processing Systems, volume 33
2020
-
[31]
Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B : A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax
2021
-
[32]
Slamet Widodo, Herlambang Brawijaya, and Samudi Samudi. 2022. Stratified k-fold cross validation optimization on machine learning for prediction. Sinkron: jurnal dan penelitian teknik informatika, 6(4):2407--2414
2022
-
[33]
Susan Zhang, Stephen Roller, Naman Goyal, and 1 others. 2022. OPT : Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Pith/arXiv arXiv 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.