Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
Pith reviewed 2026-05-22 07:00 UTC · model grok-4.3
The pith
The choice of how to tokenize a graph for a transformer sets hard limits on the depth needed to recover its structure or solve a task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Different tokenizations induce distinct depth regimes for the same graph computation: random-walk tokenization is lossy for any walk length and therefore cannot recover the input graph, spectral tokenization is lossless but ill-conditioned for local tasks, and no limited-depth transformer can map one family into another in general.
What carries the argument
The graph-to-token map, realized in three families (spectral, random-walk, adjacency), that determines which structural information reaches the transformer input and thereby fixes the minimal depth required for expressivity.
If this is right
- Random-walk tokenization of any fixed length prevents recovery of the original graph structure.
- Spectral tokenization preserves all information yet forces greater depth for tasks that depend on local neighborhoods.
- A transformer of bounded depth cannot in general translate between spectral and random-walk representations.
- Tasks with different locality requirements will systematically favor different tokenization families.
Where Pith is reading between the lines
- Hybrid inputs that feed multiple tokenizations in parallel could let a single model access complementary structural signals without increasing depth.
- The separation results suggest that tokenization choice should be treated as a hyper-parameter that is tuned jointly with depth rather than fixed in advance.
Load-bearing premise
The analysis assumes ordinary transformer layers with only standard attention and feed-forward blocks and no extra graph-specific operations that might circumvent tokenization limits.
What would settle it
Construct a small graph family where, under random-walk tokenization of fixed length, no transformer of depth d can distinguish two non-isomorphic graphs that a depth-d transformer with adjacency tokenization separates perfectly.
Figures
read the original abstract
Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views, and combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the choice of graph tokenization fundamentally affects Transformer expressivity on graphs. It analyzes spectral, random-walk, and adjacency tokenizations as building blocks, proving that they induce distinct depth regimes for the same graph computations. Key results include that random-walk tokenization is lossy for any fixed walk length (preventing graph recovery in general), spectral tokenization is lossless but ill-conditioned for local tasks, and limited-depth Transformers cannot convert between these families in general. These theoretical separations are validated by controlled experiments on synthetic and real-world tasks, which also show that tasks favor different tokenizations and that combining them leverages complementary signals.
Significance. If the results hold, the work is significant because it establishes tokenization as a core, non-trivial component of graph Transformer design rather than a preprocessing detail. The explicit proofs for lossiness and depth-regime separations, together with the controlled experiments that validate the predicted distinctions, provide a rigorous foundation that could guide architecture choices and motivate hybrid tokenization approaches. This advances understanding of expressivity limits in a practical setting.
major comments (1)
- Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.
minor comments (2)
- Abstract: The final sentence on combining tokenizations could briefly indicate which classes of tasks benefit from which representations to improve immediate clarity for readers.
- Experimental section: Ensure that figure captions explicitly label the tokenization variants and depth settings used in each panel so that the validation of the theoretical separations is immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comment on our theoretical results. We address the point below.
read point-by-point responses
-
Referee: Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.
Authors: We agree that an explicit, uniform confirmation of the architectural assumptions will strengthen the presentation. In the revised manuscript we will insert a dedicated clarifying paragraph at the beginning of the theoretical lower-bound section (and reference it in the statements of the relevant theorems) stating that all impossibility and separation results—including the limited-depth conversion between spectral and random-walk families—are established for standard Transformer layers consisting of fixed multi-head attention and feed-forward blocks, without any additional graph-specific mechanisms or inductive biases. This statement will be applied uniformly to every claimed separation. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's core claims consist of mathematical proofs that different tokenizations induce distinct depth regimes for transformers, that random-walk tokenization is lossy for any fixed length, and that limited-depth transformers cannot convert between spectral and random-walk families. These results are derived from standard transformer expressivity arguments, explicit lower-bound constructions, and direct analysis of the tokenization maps themselves. The experiments serve only as empirical validation of the predicted separations and do not supply the source quantities for the theoretical statements. No load-bearing step reduces by construction to a fitted parameter, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The derivation is therefore self-contained against external benchmarks of transformer theory.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Standard properties of finite undirected graphs and their adjacency matrices hold.
- domain assumption Transformer layers consist of standard multi-head attention and position-wise feed-forward blocks without custom graph operators.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove that different tokenizations induce distinct depth regimes... random-walk tokenization is lossy for any walk length... spectral tokenization is lossless, it is ill-conditioned for local tasks.
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 1 (Adjacency Lower Bound for k-Walks)... Theorem 2 (Planarity Undecidability)... Theorem 5 (Laplacian Ill-Conditioning Lower Bound)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =
A Generalization of Transformer Networks to Graphs , author =. AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =
-
[2]
IEEE Signal Processing Magazine , volume =
Geometric Deep Learning: Going beyond Euclidean Data , author =. IEEE Signal Processing Magazine , volume =
-
[3]
The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
Depth-Width Tradeoffs for Transformers on Graph Tasks , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
-
[4]
arXiv preprint arXiv:2503.01805 , year=
Depth vs Width Tradeoffs in Graph Transformers , author=. arXiv preprint arXiv:2503.01805 , year=
-
[5]
Advances in Neural Information Processing Systems (NeurIPS) , year=
Do Transformers Really Perform Badly for Graph Representation? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
-
[6]
Journal of Machine Learning Research (JMLR) , volume=
Benchmarking Graph Neural Networks , author=. Journal of Machine Learning Research (JMLR) , volume=
-
[7]
Introduction to Circuit Complexity: A Uniform Approach , author=. 1999 , publisher=
work page 1999
-
[8]
Zaheer, Manzil and Kottur, Satwik and Ravanbhakhsh, Siamak and P\'. Deep Sets , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =
-
[9]
International Conference on Learning Representations , year=
How Powerful are Graph Neural Networks? , author=. International Conference on Learning Representations , year=
-
[10]
GraphBench: Next-generation graph learning benchmarking , author=. 2026 , eprint=
work page 2026
-
[11]
Transactions of the Association for Computational Linguistics , volume=
Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=
-
[12]
Aequationes Mathematicae , volume=
Constructing cospectral graphs , author=. Aequationes Mathematicae , volume=. 1982 , publisher=
work page 1982
- [13]
-
[14]
Linear Algebra and its Applications , volume=
Laplacian matrices of graphs: a survey , author=. Linear Algebra and its Applications , volume=. 1994 , publisher=
work page 1994
-
[15]
Advances in Neural Information Processing Systems , year =
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , author =. Advances in Neural Information Processing Systems , year =
- [16]
-
[17]
Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks , author=. AAAI , year=
-
[18]
arXiv preprint arXiv:1907.03199 , year=
What Graph Neural Networks Cannot Learn: Depth vs Width , author=. arXiv preprint arXiv:1907.03199 , year=
- [19]
- [20]
-
[21]
Graph-bert: Only attention is needed for learning graph representations
Graph-BERT: Only Attention is Needed for Learning Graph Representations , author=. arXiv preprint arXiv:2001.05140 , year=
-
[22]
Recipe for a General, Powerful, Scalable Graph Transformer , author=. NeurIPS , year=
-
[23]
Talk like a graph: Encoding graphs for large language models
Talk Like a Graph: Encoding Graphs for Large Language Models , author=. arXiv preprint arXiv:2310.04560 , year=
-
[24]
Rethinking Graph Transformers with Spectral Attention , author=. NeurIPS , year=
-
[25]
Sign and Basis Invariant Networks for Spectral Graph Representation Learning , author=. ICLR , year=
-
[26]
Are Transformers Universal Approximators of Sequence-to-Sequence Functions? , author=. ICLR , year=
-
[27]
arXiv preprint arXiv:2107.13168 , year=
Statistically Meaningful Approximation: A Case Study on Approximating Turing Machines with Transformers , author=. arXiv preprint arXiv:2107.13168 , year=
-
[28]
arXiv preprint arXiv:2309.06979 , year=
Auto-Regressive Next-Token Predictors are Universal Learners , author=. arXiv preprint arXiv:2309.06979 , year=
-
[29]
The Parallelism Tradeoff: Limitations of Log-Precision Transformers , author=. TACL , year=
-
[30]
arXiv preprint arXiv:2210.10749 , year=
Transformers Learn Shortcuts to Automata , author=. arXiv preprint arXiv:2210.10749 , year=
-
[31]
Formal Language Recognition by Hard Attention Transformers , author=. TACL , year=
- [32]
-
[33]
arXiv preprint arXiv:2405.18512 , year=
Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. arXiv preprint arXiv:2405.18512 , year=
-
[34]
arXiv preprint arXiv:2402.09268 , year=
Transformers, Parallel Computation, and Logarithmic Depth , author=. arXiv preprint arXiv:2402.09268 , year=
-
[35]
Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =
work page 2020
-
[36]
and Sterling, Teague and Mysinger, Michael M
Irwin, John J. and Sterling, Teague and Mysinger, Michael M. and Bolstad, Erin S. and Coleman, Ryan G. , title =. Journal of Chemical Information and Modeling , volume =. 2012 , doi =
work page 2012
-
[37]
Advances in Neural Information Processing Systems , volume=
Can Graph Neural Networks Count Substructures? , author=. Advances in Neural Information Processing Systems , volume=
-
[38]
Undirected Connectivity in Log-Space , author =. Journal of the ACM , volume =
-
[39]
Advances in Neural Information Processing Systems (NeurIPS) , year =
Representational Strengths and Limitations of Transformers , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
-
[40]
arXiv preprint arXiv:2407.15160 , year =
When Can Transformers Count to n? , author =. arXiv preprint arXiv:2407.15160 , year =
-
[41]
and Coates, Mark and Torr, Philip H.S
Ma, Liheng and Lin, Chen and Lim, Derek and Romero-Soriano, Adriana and Dokania, Puneet K. and Coates, Mark and Torr, Philip H.S. and Lim, Ser-Nam , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =
work page 2023
-
[42]
International Conference on Learning Representations , year=
Graph Neural Networks with Learnable Structural and Positional Representations , author=. International Conference on Learning Representations , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.