arxiv: 2605.22005 · v1 · pith:AAL3TGYGnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI · cs.CL

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

Pith reviewed 2026-05-22 08:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords singular value decomposition · lm_head weights · vocabulary clusters · training data audit · glitch token detection · semantic subspaces · model safety · pretraining analysis

The pith

Singular value decomposition of an LLM's output weight matrix uncovers semantic clusters in its vocabulary directly from the parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that singular value decomposition applied to the lm_head weight matrix of transformer-based large language models can expose interpretable semantic subspaces using only a few lines of code and no model inference. Each left singular vector highlights groups of vocabulary tokens that tend to activate together when the hidden state aligns with that direction, revealing details about the training data composition and curation choices. The authors examine several models and observe distinct patterns, such as hierarchical subspaces in one case, historical language dominance in another, and problematic content in a third that persists from pretraining. They introduce quantitative scores to assess cluster coherence and detect glitch tokens. A sympathetic reader would care because this provides a lightweight method to audit what the model has internalized and to identify issues before deployment.

Core claim

Singular value decomposition of the lm_head weight matrix of a transformer-based large language model reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication.

What carries the argument

The left singular vectors of the lm_head weight matrix, each identifying a cluster of vocabulary tokens most strongly associated with a particular direction in the model's output space.
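As a rough illustration of the mechanism described above, the following sketch decomposes a toy stand-in for an lm_head matrix and lists the tokens with the largest-magnitude entries in each left singular vector. It is not the paper's code: the paper reports using PyTorch on real model weights, whereas this sketch uses NumPy on a small random matrix with one planted cluster, and the function name `top_tokens_per_direction`, the token names, and the sizes are all hypothetical.

```python
import numpy as np

def top_tokens_per_direction(W, vocab, n_dirs=3, k=5):
    """For each of the first n_dirs left singular vectors of W (V x D),
    return the k tokens whose entries have the largest magnitude."""
    # Thin SVD: W = U @ diag(S) @ Vt, with U of shape (V, min(V, D))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    clusters = []
    for j in range(n_dirs):
        idx = np.argsort(-np.abs(U[:, j]))[:k]
        clusters.append([vocab[i] for i in idx])
    return S[:n_dirs], clusters

# Toy stand-in for lm_head weights (assumption: real weights would be loaded from a model)
rng = np.random.default_rng(0)
V, D = 200, 16
W = 0.01 * rng.standard_normal((V, D))

# Plant a cluster: five token rows share one strong direction in hidden space
shared = rng.standard_normal(D)
shared /= np.linalg.norm(shared)
planted = [3, 17, 42, 99, 150]
W[planted] += 5.0 * shared

vocab = [f"tok{i}" for i in range(V)]  # hypothetical token strings
sigmas, clusters = top_tokens_per_direction(W, vocab)
print("leading singular values:", sigmas)
print("top tokens for direction 0:", clusters[0])
```

In this toy setting the first direction recovers the planted rows because they dominate the matrix by construction. Whether clusters found in real weights are semantically meaningful is the open question the rest of this page examines.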

If this is right

  • Models display systematic differences in singular value spectra and vocabulary cluster structures that reflect their distinct training compositions.
  • Ethically concerning subspaces originate in pretraining and remain after post-training alignment.
  • The Vocabulary Cluster Score quantifies subspace coherence while the Weighted Projection Score detects glitch tokens without inference.
  • This analysis can serve as a standard pre-release safety auditing step for large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending SVD analysis to other weight matrices could map additional internal representations of knowledge.
  • Using these subspaces to guide tokenizer adjustments might improve output controllability and reduce unwanted content.
  • Routine application across model families could help trace how specific data curation decisions shape final behavior.
  • The method opens a path to static weight-based diagnostics for data leakage that complement dynamic testing.

Load-bearing premise

The left singular vectors of the lm_head matrix correspond to semantically meaningful directions whose associated vocabulary clusters directly expose the model's training data composition and curation philosophy without requiring model inference or external validation.

What would settle it

A check revealing that the vocabulary tokens with highest alignment to each left singular vector form incoherent or random groups with no relation to training data themes would disprove the interpretability of the subspaces.
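One partial way to approach such a check is to compare the structure of the real matrix against a shuffled null. The sketch below is an assumption-laden toy, not a procedure from the paper: it measures the share of spectral energy in the leading singular direction for a matrix with a planted cluster and for the same entries randomly shuffled. The helper name `leading_energy` and all sizes are hypothetical, and the statistic only tests for non-random structure; judging semantic coherence would still need human or external labels.

```python
import numpy as np

def leading_energy(W):
    """Share of squared singular values carried by the leading direction."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(1)
V, D = 200, 16

# Toy matrix with one planted cluster in the first five rows
W = 0.01 * rng.standard_normal((V, D))
direction = rng.standard_normal(D)
direction /= np.linalg.norm(direction)
W[:5] += 5.0 * direction

# Null model: same entries, positions shuffled, so row structure is destroyed
W_null = rng.permutation(W.ravel()).reshape(V, D)

structured = leading_energy(W)
null = leading_energy(W_null)
print(f"structured: {structured:.3f}  shuffled null: {null:.3f}")
```

If a real lm_head matrix scored no better than its shuffled counterpart on statistics of this kind, that would count against the interpretability claim. A clear gap would be necessary but not sufficient evidence for it.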

Figures

Figures reproduced from arXiv: 2605.22005 by Hisashi Miyashita.

Figure 1: Singular value decay curves for the top 20 singular vectors of each model (linear scale, …
Figure 2: GPT-OSS-120B (instruct): top-15 tokens per singular vector, …
Figure 3: Gemma-2-2B (instruct): top-15 tokens per singular vector, …
Original abstract

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model, requiring only five lines of PyTorch and no model inference, reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that singular value decomposition applied directly to the lm_head weight matrix of transformer LLMs (W ∈ ℝ^{V×D}) yields left singular vectors whose large-magnitude entries define interpretable vocabulary clusters. These clusters purportedly expose training-data composition, curation choices, and ethically problematic content using only five lines of PyTorch and no model inference. The authors analyze GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, report systematic differences in singular-value spectra and cluster structure (graduated hierarchy in GPT, pre-19th-century orthography dominance in Gemma, broad multilingual plus ethically flagged subspaces in Qwen), introduce Vocabulary Cluster Score (VCS) and Weighted Projection Score (WPS), recover a known glitch token via WPS, and advocate SVD-based auditing as a pre-release safety step.

Significance. If the semantic interpretability of the left singular vectors and their link to training data can be rigorously established, the method would supply an unusually lightweight, inference-free diagnostic for model transparency and safety auditing. The computational simplicity and the recovery of a documented glitch token are genuine strengths that could encourage wider adoption for tokenizer optimization and controllable LLM design. At present, however, the absence of quantitative controls limits the strength of these implications.

major comments (3)
  1. [Abstract] Abstract: the central assertion that left singular vectors 'identify the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction' is presented without any empirical check, such as measuring logit shifts after adding a scaled right singular vector v_j to a hidden state or comparing against a null model obtained by randomizing or frequency-matching the lm_head weights.
  2. [Model analysis sections] Model analysis (Gemma and Qwen sections): the reported cluster structures (stepwise pre-19th-century orthography in Gemma; 'ethically inappropriate' subspaces in Qwen) are interpreted as exposing curation philosophy, yet no quantitative comparison to randomized or permuted baselines is supplied to rule out tokenizer artifacts or output-layer frequency biases, and no inter-rater or external-criterion validation is given for labeling subspaces as ethically inappropriate.
  3. [WPS definition and experiments] WPS and glitch-token recovery: while WPS recovers the known token shokubutsu-hyakka-tsu on GPT-OSS-120B, the manuscript provides neither a systematic evaluation on a larger set of documented glitch tokens nor a comparison against alternative static detectors, leaving the claim that WPS constitutes a reliable 'static glitch token detector' unsupported.
minor comments (2)
  1. [Abstract] The abstract advertises 'five lines of PyTorch' but the manuscript does not display the explicit code snippet, which would improve immediate reproducibility.
  2. Notation for the decomposition W = U Σ V^T and the definitions of VCS and WPS would benefit from an early equation block to avoid ambiguity when readers implement the method.
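For orientation, the linear-algebra fact behind the abstract's central assertion can be written out in the notation used on this page. This is a sketch of the standard identity, not an equation taken from the manuscript: it assumes the logits are computed as z = W h with no bias term and ignores the final normalization layer and the softmax. Note that V in the dimension V × D is the vocabulary size, while V in W = U Σ V^T is the matrix of right singular vectors v_k. The definitions of VCS and WPS are not reproduced here because this page does not state them.

```latex
\begin{align*}
W &= U \Sigma V^{\top} = \sum_{k} \sigma_k \, u_k v_k^{\top},
  \qquad u_k^{\top} u_j = v_k^{\top} v_j = \delta_{kj}, \\
z &= W h, \\
h = \alpha v_j \;&\Longrightarrow\;
z = \alpha \sum_{k} \sigma_k \, u_k \, (v_k^{\top} v_j) = \alpha \, \sigma_j \, u_j .
\end{align*}
```

Under these assumptions, a hidden state aligned with v_j shifts the logits in proportion to u_j, so tokens with large positive entries in u_j are boosted and tokens with large negative entries are suppressed. The pair (u_j, v_j) is only defined up to a joint sign flip, which is one reason magnitudes rather than signed values are often inspected.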

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's insightful comments. We have addressed the concerns regarding empirical validation, baselines, and systematic evaluation by incorporating additional analyses and controls in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central assertion that left singular vectors 'identify the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction' is presented without any empirical check, such as measuring logit shifts after adding a scaled right singular vector v_j to a hidden state or comparing against a null model obtained by randomizing or frequency-matching the lm_head weights.

    Authors: The claim follows from the SVD of the lm_head matrix, where the left singular vectors represent the vocabulary directions most responsive to alignments in the hidden state space via the right singular vectors. To validate this empirically, we have added experiments in the revision that inject scaled right singular vectors into model hidden states and observe the resulting changes in output logits, confirming preferential activation of the corresponding cluster tokens. Comparisons against randomized lm_head matrices are also included as a null baseline.
    revision: yes

  2. Referee: [Model analysis sections] Model analysis (Gemma and Qwen sections): the reported cluster structures (stepwise pre-19th-century orthography in Gemma; 'ethically inappropriate' subspaces in Qwen) are interpreted as exposing curation philosophy, yet no quantitative comparison to randomized or permuted baselines is supplied to rule out tokenizer artifacts or output-layer frequency biases, and no inter-rater or external-criterion validation is given for labeling subspaces as ethically inappropriate.

    Authors: We have added quantitative comparisons to randomized, permuted, and frequency-matched baselines in the revised model analysis sections, showing that the reported structures are not explained by these artifacts. For the ethical subspaces, we have included the full token lists and clarified the labeling process as author-driven inspection; while inter-rater validation was not performed, we discuss the potential subjectivity as a limitation of the current study.
    revision: partial

  3. Referee: [WPS definition and experiments] WPS and glitch-token recovery: while WPS recovers the known token shokubutsu-hyakka-tsu on GPT-OSS-120B, the manuscript provides neither a systematic evaluation on a larger set of documented glitch tokens nor a comparison against alternative static detectors, leaving the claim that WPS constitutes a reliable 'static glitch token detector' unsupported.

    Authors: We have expanded the WPS section to include a systematic evaluation on a collection of additional documented glitch tokens and benchmarked it against alternative static methods such as frequency-based detection. The results support WPS as an effective static detector, and we have updated the claims to reflect this enhanced evaluation.
    revision: yes
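The injection experiment mentioned in the first response can be sketched at the level of the output projection alone. The following is a minimal NumPy illustration under stated assumptions: a random matrix stands in for lm_head, logits are taken as W @ h with no bias, and the final normalization layer of a real model is ignored. The function name `logit_shift` and all sizes are hypothetical. It only confirms the linear identity; it says nothing about how a full model behaves.

```python
import numpy as np

def logit_shift(W, h, j, alpha):
    """Change in logits W @ h after adding alpha * v_j to h, where v_j is
    the j-th right singular vector of W, plus the predicted change."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    v_j = Vt[j]                          # right singular vector, length D
    shift = W @ (h + alpha * v_j) - W @ h
    predicted = alpha * S[j] * U[:, j]   # alpha * sigma_j * u_j
    return shift, predicted

# Toy stand-ins (assumption: real weights and hidden states come from a model)
rng = np.random.default_rng(2)
V, D = 50, 8
W = rng.standard_normal((V, D))
h = rng.standard_normal(D)

shift, predicted = logit_shift(W, h, j=0, alpha=3.0)
max_err = float(np.max(np.abs(shift - predicted)))
boosted = np.argsort(-shift)[:5]         # token ids with the largest logit increase
print("max deviation from alpha * sigma_j * u_j:", max_err)
print("most boosted token ids:", boosted)
```

Because the identity holds for any matrix, including a random one, agreement here is expected by construction. The informative comparison in a real study would be how the boosted tokens relate to each other and how this differs from a null model.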

Circularity Check

0 steps flagged

No significant circularity; analysis is a direct standard SVD on published weights

full rationale

The paper performs a standard singular value decomposition W = U Σ V^T on the lm_head weight matrix using five lines of PyTorch with no model inference or parameter fitting. The introduced Vocabulary Cluster Score (VCS) and Weighted Projection Score (WPS) are defined directly from the resulting left singular vectors and their token magnitudes rather than being optimized or fitted to reproduce the same data in a self-referential manner. No self-citations, uniqueness theorems, or ansatzes from prior author work are used to justify the core decomposition or its interpretation. The derivation chain consists of a mathematically fixed linear algebra operation applied to external model weights, making the result independent of the paper's own outputs or assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central interpretation rests on the unproven mapping from singular vectors to semantic vocabulary clusters; no free parameters are explicitly fitted in the abstract description, though choice of retained singular vectors is implicit.

axioms (1)
  • domain assumption: Left singular vectors of the lm_head weight matrix align with directions that preferentially activate coherent groups of vocabulary tokens.
    This mapping is invoked to claim that clusters expose training data composition.

pith-pipeline@v0.9.0 · 5835 in / 1288 out tokens · 37582 ms · 2026-05-22T08:30:59.928976+00:00 · methodology
