arxiv: 2605.23821 · v1 · pith:HUBOGQ3Gnew · submitted 2026-05-22 · 💻 cs.CL · cs.LG

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Pith reviewed 2026-05-25 04:11 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords hierarchical geometry · word embeddings · co-occurrence kernel · hypernymy · spectral analysis · language models · WordNet

The pith

The hierarchical geometry of concepts in language model embeddings arises from the spectral properties of word co-occurrence statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hypernym relations encoded in WordNet-style taxonomies appear geometrically in word embeddings because words that share hypernyms co-occur more frequently. Under mild positivity and decay conditions on the co-occurrence kernel, the leading eigenvectors of the resulting Gram matrix separate broad taxonomic branches first and then finer sub-branches, producing a coarse-to-fine splitting that mirrors the underlying tree. The same pattern is observed both in word2vec embeddings trained on sampled WordNet subtrees and in the unembeddings of Gemma 2B, indicating that the geometry follows directly from pairwise statistics rather than from any dedicated hierarchical mechanism inside the model.

Core claim

Assume that words closer on the WordNet hypernym graph co-occur more often. Then, under mild positivity and decay conditions on the co-occurrence kernel, the leading eigenvectors of the embedding Gram matrix first separate broad taxonomic branches and then progressively finer sub-branches. The result is a hierarchical splitting geometry whose coarse-to-fine spectral organization mirrors the tree.

What carries the argument

The spectrum of the embedding Gram matrix induced by the co-occurrence kernel: its leading eigenvectors successively isolate broader, then narrower, branches of the hypernym tree.
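
The splitting behaviour can be illustrated numerically. The sketch below is not the paper's code: it assumes a complete binary tree of depth 3 as a stand-in for a WordNet subtree, a heap-style node indexing, and helper names (`tree_distance`, `kernel_gram`, `subtree`) that are purely illustrative. The kernel parameters are the fitted values quoted in the Figure 3 caption. It builds the Gram matrix from tree distances and inspects the second leading eigenvector, which in this idealized setting should take opposite signs on the two branches below the root.

```python
import numpy as np

def tree_distance(i, j):
    """Path length between nodes i and j of a heap-indexed binary tree."""
    d = 0
    while i != j:
        # The larger index is never shallower, so move it up to its parent.
        if i > j:
            i = (i - 1) // 2
        else:
            j = (j - 1) // 2
        d += 1
    return d

def kernel_gram(depth, alpha=1.967, beta=1.235):
    """Gram matrix K[i, j] = alpha * exp(-beta * d(i, j)) on a complete binary tree."""
    n = 2 ** (depth + 1) - 1
    dist = np.array([[tree_distance(i, j) for j in range(n)] for i in range(n)])
    return alpha * np.exp(-beta * dist)

def subtree(root, n):
    """Indices of `root` and all of its descendants (hypothetical helper)."""
    nodes, frontier = [], [root]
    while frontier:
        i = frontier.pop()
        if i < n:
            nodes.append(i)
            frontier += [2 * i + 1, 2 * i + 2]
    return sorted(nodes)

K = kernel_gram(depth=3)
n = K.shape[0]
eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
v2 = eigvecs[:, -2]                     # second leading eigenvector
left, right = subtree(1, n), subtree(2, n)

print("smallest eigenvalue:", eigvals[0])
print("signs on left branch: ", np.sign(v2[left]))
print("signs on right branch:", np.sign(v2[right]))
```

In this toy setting the top eigenvector has a single sign, the second splits the two root branches, and the next two (a degenerate pair) split each branch again. Real co-occurrence data is noisier, so this only illustrates the idealized mechanism.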

If this is right

  • The same coarse-to-fine splitting signature appears in both static word2vec embeddings and Gemma 2B unembeddings.
  • Hierarchical concept geometry in LLMs can emerge without any hierarchy-specific functional mechanism.
  • The organization emerges from the spectral structure of pairwise word statistics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that analogous hierarchical structure could appear in any embedding space whose Gram matrix satisfies similar positivity and decay conditions on its kernel.
  • It may explain why taxonomic relations surface in embeddings trained only on next-token prediction without explicit tree supervision.
  • The same mechanism could be tested on correlation matrices from other modalities or languages to check whether the coarse-to-fine pattern is generic.

Load-bearing premise

Words closer on the WordNet hypernym graph co-occur more often.

What would settle it

A WordNet subtree in which the leading eigenvectors of the co-occurrence Gram matrix fail to isolate broad branches before finer ones would falsify the claimed spectral organization.
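
One way such a test could be set up is to compare the top-k eigenspaces of a theoretical and an empirical Gram matrix against a shuffled-label baseline. The sketch below is a hedged illustration, not the paper's procedure: the paper's own alignment score is not reproduced on this page, so the normalized subspace overlap used here is an assumed stand-in, and the "empirical" matrix is a synthetic perturbation of the theoretical kernel on a depth-3 binary tree. The names `top_k_basis` and `subspace_overlap` are hypothetical.

```python
import numpy as np

def tree_distance(i, j):
    """Path length between nodes i and j of a heap-indexed binary tree."""
    d = 0
    while i != j:
        if i > j:
            i = (i - 1) // 2
        else:
            j = (j - 1) // 2
        d += 1
    return d

def top_k_basis(G, k):
    """Orthonormal basis of the k leading eigenvectors of a symmetric matrix."""
    _, vecs = np.linalg.eigh(G)          # ascending eigenvalues
    return vecs[:, ::-1][:, :k]          # reorder to descending, keep k columns

def subspace_overlap(G_a, G_b, k):
    """Assumed alignment score: ||U_k^T V_k||_F^2 / k, between 0 and 1."""
    U, V = top_k_basis(G_a, k), top_k_basis(G_b, k)
    return np.linalg.norm(U.T @ V, "fro") ** 2 / k

n = 15                                    # complete binary tree of depth 3
dist = np.array([[tree_distance(i, j) for j in range(n)] for i in range(n)])
K_theory = 1.967 * np.exp(-1.235 * dist)  # kernel values from the Figure 3 caption

# Synthetic stand-in for an empirical Gram matrix (assumption, not real data).
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.05, size=(n, n))
K_empirical = K_theory + (noise + noise.T) / 2

# Shuffled-label baseline: permute node identities of the empirical matrix.
perm = rng.permutation(n)
K_shuffled = K_empirical[np.ix_(perm, perm)]

for k in (1, 2, 4):
    aligned = subspace_overlap(K_theory, K_empirical, k)
    baseline = subspace_overlap(K_theory, K_shuffled, k)
    print(f"k={k}: aligned={aligned:.3f}  shuffled={baseline:.3f}")
```

The values k = 1, 2, 4 are chosen because, on this symmetric tree, the third and fourth eigenvalues coincide, so a top-3 eigenspace is not uniquely defined. A subtree on which the aligned score does not exceed the shuffled baseline would count against the claimed spectral organization.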

Figures

Figures reproduced from arXiv: 2605.23821 by Andres Nava, Matthieu Wyart.

Figure 1: (a) WordNet hierarchy for a taxonomy of organisms. (b) Mean co-occurrence statistic …
Figure 2: Mean co-occurrence statistic $M^\star_{ij}$ decays with semantic distance under two taxonomy constructions. Both the original WordNet-distance measure (left) and the contracted arborescence used in our experiments (right) show monotone, approximately exponential decay. Dashed curves show fitted exponential kernels $f(d) = \alpha e^{-\beta d}$; shaded bands denote one standard error.
Figure 3: Hierarchical splitting geometry in word2vec and LLM representation vectors; example: taxonomy of organisms. (Left) Gram matrices from theory: a fitted exponential kernel $f(d) = 1.967 \cdot e^{-1.235 d}$ (top), word2vec (middle), and Gemma (bottom). Inspection of the eigenstructure reveals qualitative agreement with theory. (Right) Top eigenvectors of each Gram matrix, visualized by projecting each node’s represe…
Figure 4: Hierarchical splitting geometry in binary trees sampled from WordNet. Top-k eigenspace alignment between theoretical and empirical Gram matrices for (a) organism-rooted trees, and (b) cognition-rooted trees. Panel (c) summarizes all eligible roots using the alignment area from Equation (11). Both word2vec and Gemma unembeddings align substantially above shuffled-label baselines. Shaded bands indicate one s…
Figure 5: Concept-level orthogonal innovations from co-occurrence statistics. Concept vectors are estimated using 70% of descendant tokens independently selected for each synset, following Park et al. [13]. Both Gemma unembeddings and theoretical co-occurrence embeddings yield parent–child innovations concentrated near zero, while a shuffled-parent baseline is substantially displaced.
Figure 6: Distance decay in low-rank PSD truncations of $M^\star$. We restrict $M^\star$ to the eligible lemma set, retain the leading $r$ positive eigenmodes, and form rank-$r$ PSD Gram matrices $M^+_r = U_r \Lambda_r U_r^\top$. Curves show the mean entry of $M^+_r$ as a function of semantic distance in the original WordNet graph (left) and in the contracted arborescence used in our experiments (right). The decay persists across ranks and distance c…
Figure 7: Distance decay persists after conditioning on lowest-common-ancestor depth. For each sampled complete binary tree, we evaluate $M^\star_{ij}$ for every unordered node pair $(i, j)$, including diagonal pairs, and assign each pair its induced tree distance $D_{ij}$ and lowest-common-ancestor depth $\mathrm{depthLCA}_{ij}$. Pairs are then grouped by $(D_{ij}, \mathrm{depthLCA}_{ij})$, and bin means are computed after aggregating over sampled trees for …
Figure 8: Recreation of the main eigenspace-alignment experiment with an additional within-tree shuffle baseline. We repeat the top-k eigenspace-alignment analysis from …
Figure 9: Robustness to kernel parameterization. We compare the alignment obtained using the exponential distance kernel from the main text with the alignment obtained using a shifted power-law kernel, $f(d) = 1.967 \cdot (1 + d)^{-2.153}$. The empirical alignment remains above both the global shuffle and within-tree shuffle baselines, indicating that the result is driven by distance-dependent hierarchical decay rather than…
Figure 10: Root-sweep robustness for $L = 2$. We repeat the root-sweep analysis using complete binary trees of depth $L = 2$, corresponding to $2^{L+1} - 1 = 7$ nodes per sampled tree. This relaxation substantially increases the number of eligible roots relative to $L = 3$, from 21 to 144. The alignment area remains above the within-tree shuffle baseline for most roots.
Figure 11: Centered unwhitened and internal-activation representation controls. We evaluate alignment using globally centered unwhitened Gemma embeddings, globally centered Gemma middle-layer residual-stream activations, and globally centered Llama middle-layer residual-stream activations. For each representation, the global mean vector is subtracted before computing the eigenspace alignment. Centering removes the c…
Figure 12: Additional concept-vector diagnostics for Gemma and co-occurrence embeddings. We apply the concept-vector estimator of Park et al. [13] to Gemma unembeddings and to theoretical co-occurrence embeddings constructed from the fitted distance kernel. Concept vectors are estimated from independently selected 70% training subsets of descendant tokens for each synset. Top row: projection-based concept separation…
Original abstract

We propose a distributional theory of how hypernymy -- the “is-a” relation between general and specific concepts -- is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a distributional theory of hypernymy in language representations. It starts from the assumption that words closer on the WordNet hypernym graph co-occur more frequently, then proves that under positivity and decay conditions on the resulting co-occurrence kernel the leading eigenvectors of the word2vec Gram matrix exhibit hierarchical splitting: broad taxonomic branches separate first, followed by progressively finer sub-branches, yielding a coarse-to-fine spectral organization that mirrors the tree. The same signature is reported to hold in Gemma 2B unembeddings. The central claim is that this geometry emerges from pairwise statistics rather than hierarchy-specific mechanisms.

Significance. If the derivation is correct and the kernel assumption holds with the stated conditions, the result supplies a parameter-free spectral explanation for hierarchical geometry observed across embedding methods. It links standard co-occurrence statistics directly to tree-like organization in both static and contextual models, offering a unified account that could be tested on additional corpora and architectures.

major comments (2)
  1. [Abstract and theoretical derivation] The proof (theoretical section following the assumption statement) shows that a kernel with positivity and decay yields the claimed eigenvector splitting, but the mapping from WordNet distance to the kernel is justified solely by the modeling choice that co-occurrence decreases with hypernym distance. No quantitative verification (e.g., measured decay rates or correlation values between pairwise co-occurrence counts and shortest-path distances on the sampled subtrees) is supplied to confirm that the positivity/decay conditions are met in the data used for the word2vec experiments; this assumption is load-bearing for transferring the theorem to real embeddings.
  2. [Empirical validation] The empirical confirmation across WordNet subtrees (experimental section) reports that the predicted splitting signature appears in word2vec and extends to Gemma, yet the manuscript does not include controls that would isolate the contribution of the WordNet-derived kernel from other factors (topical or frequency-based associations) that also shape co-occurrence; without such controls the experiments cannot rule out that the observed geometry arises for reasons orthogonal to the tree structure assumed in the proof.
minor comments (2)
  1. [Theoretical setup] Notation for the co-occurrence kernel and the Gram matrix should be introduced with explicit definitions before the spectral analysis begins to improve readability.
  2. [Introduction] The abstract states the assumption is 'empirically verified' but the main text would benefit from a short dedicated paragraph or table summarizing the verification statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions that will strengthen the connection between the theoretical assumptions and the empirical results.

Point-by-point responses
  1. Referee: [Abstract and theoretical derivation] The proof (theoretical section following the assumption statement) shows that a kernel with positivity and decay yields the claimed eigenvector splitting, but the mapping from WordNet distance to the kernel is justified solely by the modeling choice that co-occurrence decreases with hypernym distance. No quantitative verification (e.g., measured decay rates or correlation values between pairwise co-occurrence counts and shortest-path distances on the sampled subtrees) is supplied to confirm that the positivity/decay conditions are met in the data used for the word2vec experiments; this assumption is load-bearing for transferring the theorem to real embeddings.

    Authors: We agree that explicit quantitative checks on the kernel conditions are necessary to make the transfer from theorem to data fully rigorous. In the revised manuscript we will add a dedicated subsection that reports (i) Pearson and Spearman correlations between empirical co-occurrence counts and WordNet shortest-path distances across the sampled subtrees, (ii) verification that the resulting kernel is positive, and (iii) confirmation of the required decay behavior. These statistics will be computed on the same corpora used for the word2vec training runs. revision: yes

  2. Referee: [Empirical validation] The empirical confirmation across WordNet subtrees (experimental section) reports that the predicted splitting signature appears in word2vec and extends to Gemma, yet the manuscript does not include controls that would isolate the contribution of the WordNet-derived kernel from other factors (topical or frequency-based associations) that also shape co-occurrence; without such controls the experiments cannot rule out that the observed geometry arises for reasons orthogonal to the tree structure assumed in the proof.

    Authors: We accept that additional controls are required to demonstrate specificity to the WordNet-induced co-occurrence structure. In revision we will include two control experiments: (1) word2vec embeddings trained on a version of the corpus in which co-occurrence statistics have been randomized while preserving marginal frequencies, and (2) embeddings derived from synthetic hierarchies whose distance kernels do not match WordNet. We will show that the hierarchical splitting signature is markedly weaker or absent under these controls, thereby isolating the contribution of the tree-structured kernel. revision: yes
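
The kernel checks listed in the first response (positivity, decay, and correlation with distance) can be sketched on synthetic data. The example below is an assumption-laden illustration rather than the authors' analysis: it generates co-occurrence values from the exponential kernel $f(d) = \alpha e^{-\beta d}$ with the parameters quoted in the Figure 3 caption, adds multiplicative noise of an arbitrary size, and then recovers the parameters by a log-linear least-squares fit. Only a Pearson correlation on the log scale is computed here; a rank correlation would need an additional tie-aware ranking step.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 1.967, 1.235       # values quoted in the Figure 3 caption

# Synthetic stand-in for measured co-occurrence: 30 word pairs at each distance 0..6.
d = np.repeat(np.arange(7), 30)
m = alpha_true * np.exp(-beta_true * d) * np.exp(rng.normal(scale=0.05, size=d.size))

# Log-linear fit: log m = log(alpha) - beta * d.
slope, intercept = np.polyfit(d, np.log(m), deg=1)
alpha_hat, beta_hat = np.exp(intercept), -slope

# Positivity and monotone decay of the per-distance means.
bin_means = np.array([m[d == k].mean() for k in range(7)])
is_positive = bool(np.all(m > 0))
is_decaying = bool(np.all(np.diff(bin_means) < 0))

# Pearson correlation between distance and log co-occurrence.
pearson_r = np.corrcoef(d, np.log(m))[0, 1]

print(f"alpha_hat={alpha_hat:.3f}  beta_hat={beta_hat:.3f}")
print(f"positive={is_positive}  decaying={is_decaying}  pearson_r={pearson_r:.4f}")
```

On real corpora the distances would come from WordNet shortest paths and the values from measured co-occurrence statistics. The synthetic generator only shows the shape of the computation.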

Circularity Check

0 steps flagged

No circularity: derivation proceeds from external empirical assumption via spectral analysis

full rationale

The paper begins with the stated modeling assumption that co-occurrence probability decreases with WordNet hypernym distance, then imposes positivity and decay conditions on the resulting kernel and applies spectral analysis to prove the coarse-to-fine eigenvector splitting. This chain is a direct mathematical consequence of the kernel properties and does not reduce any claimed prediction or theorem to a fitted parameter, self-citation, or quantity defined in terms of the output. Verification on word2vec and Gemma embeddings is presented as external confirmation rather than part of the derivation itself. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on one domain assumption about co-occurrence frequencies and two mathematical conditions on the kernel; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Words closer on the WordNet hypernym graph co-occur more often
    Stated as the empirically verified starting assumption in the abstract.
  • standard math Mild positivity and decay conditions on the co-occurrence kernel
    Invoked to characterize the spectrum of the Gram matrix.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Not all language model features are one-dimensionally linear

    Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations, 2025

  2. [2]

    Language models implement simple Word2Vec-style vector arithmetic

    Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple Word2Vec-style vector arithmetic. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5030–504...

  3. [3]

    The origins of representation manifolds in large language models

    Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235, 2025

  4. [4]

    Language models represent space and time

    Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations, 2024

  5. [5]

    When models manipulate manifolds: The geometry of a counting task

    Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2025

  6. [6]

    Hanlin Zhu, Melissa Franch, Elizabeth A. Mickiewicz, James L. Belanger, Rhiannon L. Cowan, Kalman A. Katlowitz, Ana G. Chavez, Assia Chericoni, Danika Paulo, Xinyuan Yan, Shervin Rahimpour, Ben Shofty, Eleonora Bartoli, Jay A. Hennig, Nicole R. Provenza, Elliot H. Smith, Steven T. Piantadosi, Benjamin Y. Hayden, and Sameer A. Sheth. A geometric foundatio...

  7. [7]

    George A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39–41, November 1995

  8. [8]

    Poincaré embeddings for learning hierarchical representations

    Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  9. [9]

    Hierarchical concept embedding & pursuit for interpretable image classification

    Nghia Nguyen, Tianjiao Ding, and René Vidal. Hierarchical concept embedding & pursuit for interpretable image classification. arXiv preprint arXiv:2602.11448, 2026

  10. [10]

    Learning semantic hierarchies via word embeddings

    Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. Learning semantic hierarchies via word embeddings. In Kristina Toutanova and Hua Wu, editors, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209, Baltimore, Maryland, June 2014. Association for Computational...

  11. [11]

    Linear Representations of Hierarchical Concepts in Language Models

    Masaki Sakata, Benjamin Heinzerling, Takumi Ito, Sho Yokoi, and Kentaro Inui. Linear representations of hierarchical concepts in language models.arXiv preprint arXiv:2604.07886, 2026

  12. [12]

    Emergence of phonemic, syntactic, and semantic representations in artificial neural networks

    Pierre Orhan, Pablo Diego-Simón, Emmanuel Chemla, Yair Lakretz, Yves Boubenec, and Jean-Rémi King. Emergence of phonemic, syntactic, and semantic representations in artificial neural networks. arXiv preprint arXiv:2601.18617, 2026

  13. [13]

    The geometry of categorical and hierarchical concepts in large language models

    Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. In The Thirteenth International Conference on Learning Representations, 2025

  14. [14]

    A mathematical theory of semantic development in deep neural networks

    Andrew M. Saxe, James L. McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019

  15. [15]

    Learning hierarchical categories in deep neural networks

    Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Learning hierarchical categories in deep neural networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 35, 2013

  16. [16]

    Neural word embedding as implicit matrix factorization

    Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems, 27, 2014

  17. [17]

    On the emergence of linear analogies in word embeddings

    Daniel J Korchinski, Dhruva Karkada, Yasaman Bahri, and Matthieu Wyart. On the emergence of linear analogies in word embeddings. In Advances in Neural Information Processing Systems, 2025

  18. [18]

    Symmetry in language statistics shapes the geometry of model representations

    Dhruva Karkada, Daniel J Korchinski, Andres Nava, Matthieu Wyart, and Yasaman Bahri. Symmetry in language statistics shapes the geometry of model representations. arXiv preprint arXiv:2602.15029, 2026

  19. [19]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013

  20. [20]

    Glove: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014

  21. [21]

    Closed-form training dynamics reveal learned features and linear structure in word2vec-like models

    Dhruva Karkada, James B Simon, Yasaman Bahri, and Michael R DeWeese. Closed-form training dynamics reveal learned features and linear structure in word2vec-like models. In Advances in Neural Information Processing Systems, 2025

  22. [22]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Anto...

  23. [23]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  24. [24]

    Leslie Kish. Survey sampling. Wiley, 1965

  25. [25]

    TransformerLens

    Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022

  26. [26]

    The Llama 3 herd of models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, et al. The llama 3 herd of models, 07 2024

  27. [27]

    A well-conditioned estimator for large-dimensional covariance matrices

    Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, February 2004