Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
Pith reviewed 2026-05-22 08:30 UTC · model grok-4.3
The pith
Singular value decomposition of an LLM's output weight matrix uncovers semantic clusters in its vocabulary directly from the parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Singular value decomposition of the lm_head weight matrix of a transformer-based large language model reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth
What carries the argument
The left singular vectors of the lm_head weight matrix, each identifying a cluster of vocabulary tokens most strongly associated with a particular direction in the model's output space.
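A minimal sketch of the decomposition described above, not the authors' code: the abstract mentions five lines of PyTorch, but the snippet itself does not appear on this page. NumPy's `numpy.linalg.svd` is swapped in here, a small random matrix stands in for real lm_head weights, and the helper name `top_tokens_per_direction` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def top_tokens_per_direction(W, k=5, n_directions=3):
    """For each leading singular direction, return the k token ids with the
    largest absolute entries in the corresponding left singular vector."""
    # W has shape (vocab_size, hidden_dim): one row per vocabulary token.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    clusters = []
    for j in range(n_directions):
        # Sort token ids by descending |U[:, j]| and keep the first k.
        ids = np.argsort(-np.abs(U[:, j]))[:k]
        clusters.append(ids.tolist())
    return S, clusters

# Toy stand-in for lm_head weights (assumption: 50 tokens, 8 hidden dims).
rng = np.random.default_rng(0)
W = rng.standard_normal((50, 8))
S, clusters = top_tokens_per_direction(W)
```

With a real model, the token ids in `clusters` would be decoded with the model's tokenizer before inspection. The sign of a singular vector is arbitrary, so this sketch ranks by absolute value; ranking the positive and negative ends separately is an equally plausible reading.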
If this is right
- Models display systematic differences in singular value spectra and vocabulary cluster structures that reflect their distinct training compositions.
- Ethically concerning subspaces originate in pretraining and remain after post-training alignment.
- The Vocabulary Cluster Score quantifies subspace coherence while the Weighted Projection Score detects glitch tokens without inference.
- This analysis can serve as a standard pre-release safety auditing step for large language models.
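The page quotes the Vocabulary Cluster Score only as a "mean pairwise cosine similarity among the lm_head row vectors", without the full definition. The sketch below is one possible reading of that fragment, assuming the score is taken over the top-k tokens at one end of a left singular vector. The function names, the `sign` parameter, and the choice of k are hypothetical, and the paper's actual VCS may differ.

```python
import numpy as np

def mean_pairwise_cosine(rows):
    """Mean cosine similarity over all distinct pairs of row vectors."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = rows.shape[0]
    # Average the off-diagonal entries only (exclude self-similarity).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

def cluster_coherence(W, j, k=10, sign=1.0):
    """Coherence of the k tokens at one end of left singular vector j.

    Hypothetical helper: sign=1.0 scores the positive end,
    sign=-1.0 scores the negative end.
    """
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    ids = np.argsort(-sign * U[:, j])[:k]
    return mean_pairwise_cosine(W[ids])
```

A score near 1 means the selected rows point in nearly the same direction; a score near 0 means they are close to mutually orthogonal.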
Where Pith is reading between the lines
- Extending SVD analysis to other weight matrices could map additional internal representations of knowledge.
- Using these subspaces to guide tokenizer adjustments might improve output controllability and reduce unwanted content.
- Routine application across model families could help trace how specific data curation decisions shape final behavior.
- The method opens a path to static weight-based diagnostics for data leakage that complement dynamic testing.
Load-bearing premise
The left singular vectors of the lm_head matrix correspond to semantically meaningful directions whose associated vocabulary clusters directly expose the model's training data composition and curation philosophy without requiring model inference or external validation.
What would settle it
A check revealing that the vocabulary tokens with highest alignment to each left singular vector form incoherent or random groups with no relation to training data themes would disprove the interpretability of the subspaces.
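One way such a check could be set up is to compare the coherence of a candidate cluster against randomly drawn token groups of the same size. The sketch below assumes mean pairwise cosine similarity as the coherence measure and uses a simple empirical p-value; the function names and trial count are illustrative and not taken from the paper.

```python
import numpy as np

def mean_pairwise_cosine(rows):
    """Mean cosine similarity over all distinct pairs of row vectors."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = rows.shape[0]
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

def random_group_baseline(W, k=10, n_trials=200, seed=0):
    """Coherence scores of n_trials random token groups of size k."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        ids = rng.choice(W.shape[0], size=k, replace=False)
        scores.append(mean_pairwise_cosine(W[ids]))
    return np.array(scores)

def empirical_p_value(observed, baseline):
    """Share of baseline scores at least as large as the observed score,
    with the usual +1 correction so the estimate is never exactly zero."""
    return (np.sum(baseline >= observed) + 1) / (len(baseline) + 1)
```

If the clusters selected by singular vectors scored no higher than such random groups, that would count against the interpretability claim. A frequency-matched baseline, as the referee report suggests, would be a stricter variant that this sketch does not implement.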
Original abstract
We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that singular value decomposition applied directly to the lm_head weight matrix of transformer LLMs (W ∈ ℝ^{V×D}) yields left singular vectors whose large-magnitude entries define interpretable vocabulary clusters. These clusters purportedly expose training-data composition, curation choices, and ethically problematic content using only five lines of PyTorch and no model inference. The authors analyze GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, report systematic differences in singular-value spectra and cluster structure (graduated hierarchy in GPT, pre-19th-century orthography dominance in Gemma, broad multilingual plus ethically flagged subspaces in Qwen), introduce Vocabulary Cluster Score (VCS) and Weighted Projection Score (WPS), recover a known glitch token via WPS, and advocate SVD-based auditing as a pre-release safety step.
Significance. If the semantic interpretability of the left singular vectors and their link to training data can be rigorously established, the method would supply an unusually lightweight, inference-free diagnostic for model transparency and safety auditing. The computational simplicity and the recovery of a documented glitch token are genuine strengths that could encourage wider adoption for tokenizer optimization and controllable LLM design. At present, however, the absence of quantitative controls limits the strength of these implications.
major comments (3)
- [Abstract] Abstract: the central assertion that left singular vectors 'identify the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction' is presented without any empirical check, such as measuring logit shifts after adding a scaled right singular vector v_j to a hidden state or comparing against a null model obtained by randomizing or frequency-matching the lm_head weights.
- [Model analysis sections] Model analysis (Gemma and Qwen sections): the reported cluster structures (stepwise pre-19th-century orthography in Gemma; 'ethically inappropriate' subspaces in Qwen) are interpreted as exposing curation philosophy, yet no quantitative comparison to randomized or permuted baselines is supplied to rule out tokenizer artifacts or output-layer frequency biases, and no inter-rater or external-criterion validation is given for labeling subspaces as ethically inappropriate.
- [WPS definition and experiments] WPS and glitch-token recovery: while WPS recovers the known token shokubutsu-hyakka-tsu on GPT-OSS-120B, the manuscript provides neither a systematic evaluation on a larger set of documented glitch tokens nor a comparison against alternative static detectors, leaving the claim that WPS constitutes a reliable 'static glitch token detector' unsupported.
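The logit-shift check suggested in the first major comment rests on a linear-algebra identity: if logits are computed as W h, then adding a scaled right singular vector α v_j to h shifts the logits by exactly α σ_j u_j. The sketch below verifies this numerically on a random stand-in matrix. It deliberately omits the final normalization layer and the softmax that a real model applies around the lm_head, so it illustrates the identity rather than the empirical experiment the referee asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 40, 6          # toy sizes (assumption)
W = rng.standard_normal((vocab_size, hidden_dim))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

h = rng.standard_normal(hidden_dim)     # stand-in hidden state
j, alpha = 2, 3.0                       # direction index and scale
v_j = Vt[j]                             # j-th right singular vector

# Logit change caused by nudging the hidden state along v_j.
shift = W @ (h + alpha * v_j) - W @ h

# Predicted change: alpha * sigma_j * u_j.
expected = alpha * S[j] * U[:, j]
max_err = np.max(np.abs(shift - expected))
```

Here `max_err` should sit at floating-point noise level. The tokens whose logits rise most are those with the largest positive entries in u_j, which is the sense in which a left singular vector "selects" tokens.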
minor comments (2)
- [Abstract] The abstract advertises 'five lines of PyTorch' but the manuscript does not display the explicit code snippet, which would improve immediate reproducibility.
- Notation for the decomposition W = U Σ V^T and the definitions of VCS and WPS would benefit from an early equation block to avoid ambiguity when readers implement the method.
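An equation block along the lines the referee requests might look like the following. The decomposition and the logit expansion are standard. The VCS line is only one reading of the fragment quoted on this page ("mean pairwise cosine similarity among the lm_head row vectors"); the token set C it ranges over is an assumption. N denotes the vocabulary size (written V in the referee summary) to avoid a clash with the matrix V. WPS is left out because its definition does not appear here.

```latex
% Decomposition of the lm_head matrix, with rows w_t indexed by token t
W = U \Sigma V^{\top} = \sum_{j} \sigma_j \, u_j v_j^{\top},
\qquad W \in \mathbb{R}^{N \times D}

% Logits for a hidden state h, expanded along singular directions
z = W h = \sum_{j} \sigma_j \, (v_j^{\top} h) \, u_j

% Assumed form of the Vocabulary Cluster Score for a token set C
\mathrm{VCS}(C) = \frac{2}{|C|\,(|C|-1)} \sum_{\substack{a, b \in C \\ a < b}}
\frac{w_a^{\top} w_b}{\lVert w_a \rVert \, \lVert w_b \rVert}
```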
Simulated Author's Rebuttal
Thank you for the referee's insightful comments. We have addressed the concerns regarding empirical validation, baselines, and systematic evaluation by incorporating additional analyses and controls in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the central assertion that left singular vectors 'identify the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction' is presented without any empirical check, such as measuring logit shifts after adding a scaled right singular vector v_j to a hidden state or comparing against a null model obtained by randomizing or frequency-matching the lm_head weights.
  Authors: The claim follows from the SVD decomposition of the lm_head matrix, where the left singular vectors represent the vocabulary directions most responsive to alignments in the hidden state space via the right singular vectors. To empirically validate this, we have added experiments in the revision that inject scaled right singular vectors into model hidden states and observe the resulting changes in output logits, confirming preferential activation of the corresponding cluster tokens. Comparisons to randomized lm_head matrices are also included to rule out null effects.
  Revision: yes
- Referee: [Model analysis sections] Model analysis (Gemma and Qwen sections): the reported cluster structures (stepwise pre-19th-century orthography in Gemma; 'ethically inappropriate' subspaces in Qwen) are interpreted as exposing curation philosophy, yet no quantitative comparison to randomized or permuted baselines is supplied to rule out tokenizer artifacts or output-layer frequency biases, and no inter-rater or external-criterion validation is given for labeling subspaces as ethically inappropriate.
  Authors: We have added quantitative comparisons to randomized, permuted, and frequency-matched baselines in the revised model analysis sections, showing that the reported structures are not explained by these artifacts. For the ethical subspaces, we have included the full token lists and clarified the labeling process as author-driven inspection; while inter-rater validation was not performed, we discuss the potential subjectivity as a limitation of the current study.
  Revision: partial
- Referee: [WPS definition and experiments] WPS and glitch-token recovery: while WPS recovers the known token shokubutsu-hyakka-tsu on GPT-OSS-120B, the manuscript provides neither a systematic evaluation on a larger set of documented glitch tokens nor a comparison against alternative static detectors, leaving the claim that WPS constitutes a reliable 'static glitch token detector' unsupported.
  Authors: We have expanded the WPS section to include systematic evaluation on a collection of additional documented glitch tokens and benchmarked it against alternative static methods like frequency-based detection. The results support WPS as an effective static detector, and we have updated the claims to reflect this enhanced evaluation.
  Revision: yes
Circularity Check
No significant circularity; analysis is a direct standard SVD on published weights
Full rationale
The paper performs a standard singular value decomposition W = U Σ V^T on the lm_head weight matrix using five lines of PyTorch with no model inference or parameter fitting. The introduced Vocabulary Cluster Score (VCS) and Weighted Projection Score (WPS) are defined directly from the resulting left singular vectors and their token magnitudes rather than being optimized or fitted to reproduce the same data in a self-referential manner. No self-citations, uniqueness theorems, or ansatzes from prior author work are used to justify the core decomposition or its interpretation. The derivation chain consists of a mathematically fixed linear algebra operation applied to external model weights, making the result independent of the paper's own outputs or assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Left singular vectors of the lm_head weight matrix align with directions that preferentially activate coherent groups of vocabulary tokens.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: singular value decomposition of the lm_head weight matrix ... reveals interpretable semantic subspaces directly from the model weights
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, J_uniquely_calibrated_via_higher_derivative
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Vocabulary Cluster Score (VCS) ... mean pairwise cosine similarity among the lm_head row vectors
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.