pith. sign in

arxiv: 2605.22471 · v1 · pith:F33FWQSSnew · submitted 2026-05-21 · 💻 cs.LG

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

Pith reviewed 2026-05-22 07:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph transformerstokenizationexpressivityrandom walksspectral methodsdepth separationgraph representation
5
0 comments X

The pith

The choice of how to tokenize a graph for a transformer sets hard limits on the depth needed to recover its structure or solve a task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that tokenization is not a neutral preprocessing step but a core determinant of what a transformer can compute at a given depth. Three standard families—spectral, random-walk, and adjacency—are shown to create incompatible depth regimes: the same graph property can be realized shallowly under one map and only deeply under another. Random-walk tokenization is proved lossy for any fixed walk length, so the original graph cannot be recovered in general, while spectral tokenization preserves information yet remains poorly conditioned for local decisions. The work further proves that a bounded-depth transformer cannot convert between these families in general, establishing concrete impossibility results and lower bounds on depth.

Core claim

Different tokenizations induce distinct depth regimes for the same graph computation: random-walk tokenization is lossy for any walk length and therefore cannot recover the input graph, spectral tokenization is lossless but ill-conditioned for local tasks, and no limited-depth transformer can map one family into another in general.

What carries the argument

The graph-to-token map, realized in three families (spectral, random-walk, adjacency), that determines which structural information reaches the transformer input and thereby fixes the minimal depth required for expressivity.

If this is right

  • Random-walk tokenization of any fixed length prevents recovery of the original graph structure.
  • Spectral tokenization preserves all information yet forces greater depth for tasks that depend on local neighborhoods.
  • A transformer of bounded depth cannot in general translate between spectral and random-walk representations.
  • Tasks with different locality requirements will systematically favor different tokenization families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid inputs that feed multiple tokenizations in parallel could let a single model access complementary structural signals without increasing depth.
  • The separation results suggest that tokenization choice should be treated as a hyper-parameter that is tuned jointly with depth rather than fixed in advance.

Load-bearing premise

The analysis assumes ordinary transformer layers with only standard attention and feed-forward blocks and no extra graph-specific operations that might circumvent tokenization limits.

What would settle it

Construct a small graph family where, under random-walk tokenization of fixed length, no transformer of depth d can distinguish two non-isomorphic graphs that a depth-d transformer with adjacency tokenization separates perfectly.

Figures

Figures reproduced from arXiv: 2605.22471 by Amir Globerson, Clayton Sanford, Gilad Yehudai, Gil Harari, Joan Bruna, Maya Bechler-Speicher.

Figure 1
Figure 1. Figure 1: Summary of conversion lim￾its between graph tokenizations. As summarized in [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Synthetic experiments validating theoretical depth separations. (a) [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A comparison of the original graph and its switched counterpart, illustrating the [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
read the original abstract

Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views, and combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that the choice of graph tokenization fundamentally affects Transformer expressivity on graphs. It analyzes spectral, random-walk, and adjacency tokenizations as building blocks, proving that they induce distinct depth regimes for the same graph computations. Key results include that random-walk tokenization is lossy for any fixed walk length (preventing graph recovery in general), spectral tokenization is lossless but ill-conditioned for local tasks, and limited-depth Transformers cannot convert between these families in general. These theoretical separations are validated by controlled experiments on synthetic and real-world tasks, which also show that tasks favor different tokenizations and that combining them leverages complementary signals.

Significance. If the results hold, the work is significant because it establishes tokenization as a core, non-trivial component of graph Transformer design rather than a preprocessing detail. The explicit proofs for lossiness and depth-regime separations, together with the controlled experiments that validate the predicted distinctions, provide a rigorous foundation that could guide architecture choices and motivate hybrid tokenization approaches. This advances understanding of expressivity limits in a practical setting.

major comments (1)
  1. Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.
minor comments (2)
  1. Abstract: The final sentence on combining tokenizations could briefly indicate which classes of tasks benefit from which representations to improve immediate clarity for readers.
  2. Experimental section: Ensure that figure captions explicitly label the tokenization variants and depth settings used in each panel so that the validation of the theoretical separations is immediately verifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comment on our theoretical results. We address the point below.

read point-by-point responses
  1. Referee: Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.

    Authors: We agree that an explicit, uniform confirmation of the architectural assumptions will strengthen the presentation. In the revised manuscript we will insert a dedicated clarifying paragraph at the beginning of the theoretical lower-bound section (and reference it in the statements of the relevant theorems) stating that all impossibility and separation results—including the limited-depth conversion between spectral and random-walk families—are established for standard Transformer layers consisting of fixed multi-head attention and feed-forward blocks, without any additional graph-specific mechanisms or inductive biases. This statement will be applied uniformly to every claimed separation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core claims consist of mathematical proofs that different tokenizations induce distinct depth regimes for transformers, that random-walk tokenization is lossy for any fixed length, and that limited-depth transformers cannot convert between spectral and random-walk families. These results are derived from standard transformer expressivity arguments, explicit lower-bound constructions, and direct analysis of the tokenization maps themselves. The experiments serve only as empirical validation of the predicted separations and do not supply the source quantities for the theoretical statements. No load-bearing step reduces by construction to a fitted parameter, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The derivation is therefore self-contained against external benchmarks of transformer theory.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard mathematical properties of graphs and linear algebra for the tokenization maps; no free parameters are introduced in the theoretical claims, and no new entities are postulated.

axioms (2)
  • standard math Standard properties of finite undirected graphs and their adjacency matrices hold.
    Invoked when defining spectral and adjacency tokenizations.
  • domain assumption Transformer layers consist of standard multi-head attention and position-wise feed-forward blocks without custom graph operators.
    Used to establish the depth lower bounds.

pith-pipeline@v0.9.0 · 5796 in / 1310 out tokens · 33519 ms · 2026-05-22T07:00:19.395012+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

    A Generalization of Transformer Networks to Graphs , author =. AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

  2. [2]

    IEEE Signal Processing Magazine , volume =

    Geometric Deep Learning: Going beyond Euclidean Data , author =. IEEE Signal Processing Magazine , volume =

  3. [3]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Depth-Width Tradeoffs for Transformers on Graph Tasks , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  4. [4]

    arXiv preprint arXiv:2503.01805 , year=

    Depth vs Width Tradeoffs in Graph Transformers , author=. arXiv preprint arXiv:2503.01805 , year=

  5. [5]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Do Transformers Really Perform Badly for Graph Representation? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  6. [6]

    Journal of Machine Learning Research (JMLR) , volume=

    Benchmarking Graph Neural Networks , author=. Journal of Machine Learning Research (JMLR) , volume=

  7. [7]

    1999 , publisher=

    Introduction to Circuit Complexity: A Uniform Approach , author=. 1999 , publisher=

  8. [8]

    Deep Sets , year =

    Zaheer, Manzil and Kottur, Satwik and Ravanbhakhsh, Siamak and P\'. Deep Sets , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

  9. [9]

    International Conference on Learning Representations , year=

    How Powerful are Graph Neural Networks? , author=. International Conference on Learning Representations , year=

  10. [10]

    2026 , eprint=

    GraphBench: Next-generation graph learning benchmarking , author=. 2026 , eprint=

  11. [11]

    Transactions of the Association for Computational Linguistics , volume=

    Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=

  12. [12]

    Aequationes Mathematicae , volume=

    Constructing cospectral graphs , author=. Aequationes Mathematicae , volume=. 1982 , publisher=

  13. [13]

    2012 , publisher=

    Spectra of Graphs , author=. 2012 , publisher=

  14. [14]

    Linear Algebra and its Applications , volume=

    Laplacian matrices of graphs: a survey , author=. Linear Algebra and its Applications , volume=. 1994 , publisher=

  15. [15]

    Advances in Neural Information Processing Systems , year =

    Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , author =. Advances in Neural Information Processing Systems , year =

  16. [16]

    ICML , year=

    Neural Message Passing for Quantum Chemistry , author=. ICML , year=

  17. [17]

    AAAI , year=

    Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks , author=. AAAI , year=

  18. [18]

    arXiv preprint arXiv:1907.03199 , year=

    What Graph Neural Networks Cannot Learn: Depth vs Width , author=. arXiv preprint arXiv:1907.03199 , year=

  19. [19]

    ICLR , year=

    Graph Attention Networks , author=. ICLR , year=

  20. [20]

    ICLR , year=

    How Attentive are Graph Attention Networks? , author=. ICLR , year=

  21. [21]

    Graph-bert: Only attention is needed for learning graph representations

    Graph-BERT: Only Attention is Needed for Learning Graph Representations , author=. arXiv preprint arXiv:2001.05140 , year=

  22. [22]

    NeurIPS , year=

    Recipe for a General, Powerful, Scalable Graph Transformer , author=. NeurIPS , year=

  23. [23]

    Talk like a graph: Encoding graphs for large language models

    Talk Like a Graph: Encoding Graphs for Large Language Models , author=. arXiv preprint arXiv:2310.04560 , year=

  24. [24]

    NeurIPS , year=

    Rethinking Graph Transformers with Spectral Attention , author=. NeurIPS , year=

  25. [25]

    ICLR , year=

    Sign and Basis Invariant Networks for Spectral Graph Representation Learning , author=. ICLR , year=

  26. [26]

    ICLR , year=

    Are Transformers Universal Approximators of Sequence-to-Sequence Functions? , author=. ICLR , year=

  27. [27]

    arXiv preprint arXiv:2107.13168 , year=

    Statistically Meaningful Approximation: A Case Study on Approximating Turing Machines with Transformers , author=. arXiv preprint arXiv:2107.13168 , year=

  28. [28]

    arXiv preprint arXiv:2309.06979 , year=

    Auto-Regressive Next-Token Predictors are Universal Learners , author=. arXiv preprint arXiv:2309.06979 , year=

  29. [29]

    TACL , year=

    The Parallelism Tradeoff: Limitations of Log-Precision Transformers , author=. TACL , year=

  30. [30]

    arXiv preprint arXiv:2210.10749 , year=

    Transformers Learn Shortcuts to Automata , author=. arXiv preprint arXiv:2210.10749 , year=

  31. [31]

    TACL , year=

    Formal Language Recognition by Hard Attention Transformers , author=. TACL , year=

  32. [32]

    SODA , year=

    A Model of Computation for MapReduce , author=. SODA , year=

  33. [33]

    arXiv preprint arXiv:2405.18512 , year=

    Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. arXiv preprint arXiv:2405.18512 , year=

  34. [34]

    arXiv preprint arXiv:2402.09268 , year=

    Transformers, Parallel Computation, and Logarithmic Depth , author=. arXiv preprint arXiv:2402.09268 , year=

  35. [35]

    Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

    Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

  36. [36]

    and Sterling, Teague and Mysinger, Michael M

    Irwin, John J. and Sterling, Teague and Mysinger, Michael M. and Bolstad, Erin S. and Coleman, Ryan G. , title =. Journal of Chemical Information and Modeling , volume =. 2012 , doi =

  37. [37]

    Advances in Neural Information Processing Systems , volume=

    Can Graph Neural Networks Count Substructures? , author=. Advances in Neural Information Processing Systems , volume=

  38. [38]

    Journal of the ACM , volume =

    Undirected Connectivity in Log-Space , author =. Journal of the ACM , volume =

  39. [39]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Representational Strengths and Limitations of Transformers , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  40. [40]

    arXiv preprint arXiv:2407.15160 , year =

    When Can Transformers Count to n? , author =. arXiv preprint arXiv:2407.15160 , year =

  41. [41]

    and Coates, Mark and Torr, Philip H.S

    Ma, Liheng and Lin, Chen and Lim, Derek and Romero-Soriano, Adriana and Dokania, Puneet K. and Coates, Mark and Torr, Philip H.S. and Lim, Ser-Nam , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

  42. [42]

    International Conference on Learning Representations , year=

    Graph Neural Networks with Learnable Structural and Positional Representations , author=. International Conference on Learning Representations , year=