Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

Amir Globerson; Clayton Sanford; Gilad Yehudai; Gil Harari; Joan Bruna; Maya Bechler-Speicher

arxiv: 2605.22471 · v1 · pith:F33FWQSSnew · submitted 2026-05-21 · 💻 cs.LG

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

Maya Bechler-Speicher , Gilad Yehudai , Gil Harari , Clayton Sanford , Amir Globerson , Joan Bruna This is my paper

Pith reviewed 2026-05-22 07:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords graph transformerstokenizationexpressivityrandom walksspectral methodsdepth separationgraph representation

0 comments

The pith

The choice of how to tokenize a graph for a transformer sets hard limits on the depth needed to recover its structure or solve a task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that tokenization is not a neutral preprocessing step but a core determinant of what a transformer can compute at a given depth. Three standard families—spectral, random-walk, and adjacency—are shown to create incompatible depth regimes: the same graph property can be realized shallowly under one map and only deeply under another. Random-walk tokenization is proved lossy for any fixed walk length, so the original graph cannot be recovered in general, while spectral tokenization preserves information yet remains poorly conditioned for local decisions. The work further proves that a bounded-depth transformer cannot convert between these families in general, establishing concrete impossibility results and lower bounds on depth.

Core claim

Different tokenizations induce distinct depth regimes for the same graph computation: random-walk tokenization is lossy for any walk length and therefore cannot recover the input graph, spectral tokenization is lossless but ill-conditioned for local tasks, and no limited-depth transformer can map one family into another in general.

What carries the argument

The graph-to-token map, realized in three families (spectral, random-walk, adjacency), that determines which structural information reaches the transformer input and thereby fixes the minimal depth required for expressivity.

If this is right

Random-walk tokenization of any fixed length prevents recovery of the original graph structure.
Spectral tokenization preserves all information yet forces greater depth for tasks that depend on local neighborhoods.
A transformer of bounded depth cannot in general translate between spectral and random-walk representations.
Tasks with different locality requirements will systematically favor different tokenization families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid inputs that feed multiple tokenizations in parallel could let a single model access complementary structural signals without increasing depth.
The separation results suggest that tokenization choice should be treated as a hyper-parameter that is tuned jointly with depth rather than fixed in advance.

Load-bearing premise

The analysis assumes ordinary transformer layers with only standard attention and feed-forward blocks and no extra graph-specific operations that might circumvent tokenization limits.

What would settle it

Construct a small graph family where, under random-walk tokenization of fixed length, no transformer of depth d can distinguish two non-isomorphic graphs that a depth-d transformer with adjacency tokenization separates perfectly.

Figures

Figures reproduced from arXiv: 2605.22471 by Amir Globerson, Clayton Sanford, Gilad Yehudai, Gil Harari, Joan Bruna, Maya Bechler-Speicher.

**Figure 2.** Figure 2: Synthetic experiments validating theoretical depth separations. (a) [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: A comparison of the original graph and its switched counterpart, illustrating the [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

read the original abstract

Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views, and combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tokenization choices create distinct depth regimes for graph Transformers, with new proofs that random-walk is lossy for any length and conversions between families are hard at limited depth.

read the letter

The main thing here is that tokenization is not neutral. Different graph-to-token maps force different depths for the same computation, and some maps lose information outright. The paper proves that random-walk tokenization stays lossy no matter the walk length, that spectral tokenization is lossless but ill-conditioned for local tasks, and that a shallow Transformer cannot convert between these families in general. These results are new and do not collapse into the usual graph-Transformer expressivity bounds they cite. The controlled experiments on synthetic graphs and a few real tasks line up with the predicted separations and show that combining tokenizations can help on some problems. That part is useful and cleanly executed. The proofs rest on standard Transformer blocks without extra graph mechanisms, which is a reasonable scope but means the lower bounds apply most directly to vanilla setups. Real models often add message-passing or other layers that might soften the limits, so the practical impact could be smaller than the theory suggests. The experiments are small-scale and synthetic-heavy, which is fine for validation but leaves open how the gaps behave at larger sizes or with noisy data. This is for people working on expressivity and representation choices in graph models. Readers who care about scaling or about why one architecture works better than another on certain tasks will get something concrete from it. The formal grounding is strong enough that it should go to referees rather than get desk-rejected.

Referee Report

1 major / 2 minor

Summary. The paper claims that the choice of graph tokenization fundamentally affects Transformer expressivity on graphs. It analyzes spectral, random-walk, and adjacency tokenizations as building blocks, proving that they induce distinct depth regimes for the same graph computations. Key results include that random-walk tokenization is lossy for any fixed walk length (preventing graph recovery in general), spectral tokenization is lossless but ill-conditioned for local tasks, and limited-depth Transformers cannot convert between these families in general. These theoretical separations are validated by controlled experiments on synthetic and real-world tasks, which also show that tasks favor different tokenizations and that combining them leverages complementary signals.

Significance. If the results hold, the work is significant because it establishes tokenization as a core, non-trivial component of graph Transformer design rather than a preprocessing detail. The explicit proofs for lossiness and depth-regime separations, together with the controlled experiments that validate the predicted distinctions, provide a rigorous foundation that could guide architecture choices and motivate hybrid tokenization approaches. This advances understanding of expressivity limits in a practical setting.

major comments (1)

Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.

minor comments (2)

Abstract: The final sentence on combining tokenizations could briefly indicate which classes of tasks benefit from which representations to improve immediate clarity for readers.
Experimental section: Ensure that figure captions explicitly label the tokenization variants and depth settings used in each panel so that the validation of the theoretical separations is immediately verifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comment on our theoretical results. We address the point below.

read point-by-point responses

Referee: Theoretical sections on expressivity lower bounds: The central impossibility results (e.g., limited-depth conversion between spectral and random-walk families) rest on the assumption of standard Transformer layers with fixed attention and feed-forward blocks and no additional graph-specific mechanisms. While the paper scopes this to the lower-bound constructions, explicitly confirming that the assumption applies uniformly to all claimed separations would strengthen the load-bearing claims.

Authors: We agree that an explicit, uniform confirmation of the architectural assumptions will strengthen the presentation. In the revised manuscript we will insert a dedicated clarifying paragraph at the beginning of the theoretical lower-bound section (and reference it in the statements of the relevant theorems) stating that all impossibility and separation results—including the limited-depth conversion between spectral and random-walk families—are established for standard Transformer layers consisting of fixed multi-head attention and feed-forward blocks, without any additional graph-specific mechanisms or inductive biases. This statement will be applied uniformly to every claimed separation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core claims consist of mathematical proofs that different tokenizations induce distinct depth regimes for transformers, that random-walk tokenization is lossy for any fixed length, and that limited-depth transformers cannot convert between spectral and random-walk families. These results are derived from standard transformer expressivity arguments, explicit lower-bound constructions, and direct analysis of the tokenization maps themselves. The experiments serve only as empirical validation of the predicted separations and do not supply the source quantities for the theoretical statements. No load-bearing step reduces by construction to a fitted parameter, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The derivation is therefore self-contained against external benchmarks of transformer theory.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard mathematical properties of graphs and linear algebra for the tokenization maps; no free parameters are introduced in the theoretical claims, and no new entities are postulated.

axioms (2)

standard math Standard properties of finite undirected graphs and their adjacency matrices hold.
Invoked when defining spectral and adjacency tokenizations.
domain assumption Transformer layers consist of standard multi-head attention and position-wise feed-forward blocks without custom graph operators.
Used to establish the depth lower bounds.

pith-pipeline@v0.9.0 · 5796 in / 1310 out tokens · 33519 ms · 2026-05-22T07:00:19.395012+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that different tokenizations induce distinct depth regimes... random-walk tokenization is lossy for any walk length... spectral tokenization is lossless, it is ill-conditioned for local tasks.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Adjacency Lower Bound for k-Walks)... Theorem 2 (Planarity Undecidability)... Theorem 5 (Laplacian Ill-Conditioning Lower Bound)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

A Generalization of Transformer Networks to Graphs , author =. AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

work page
[2]

IEEE Signal Processing Magazine , volume =

Geometric Deep Learning: Going beyond Euclidean Data , author =. IEEE Signal Processing Magazine , volume =

work page
[3]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Depth-Width Tradeoffs for Transformers on Graph Tasks , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[4]

arXiv preprint arXiv:2503.01805 , year=

Depth vs Width Tradeoffs in Graph Transformers , author=. arXiv preprint arXiv:2503.01805 , year=

work page arXiv
[5]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Do Transformers Really Perform Badly for Graph Representation? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[6]

Journal of Machine Learning Research (JMLR) , volume=

Benchmarking Graph Neural Networks , author=. Journal of Machine Learning Research (JMLR) , volume=

work page
[7]

1999 , publisher=

Introduction to Circuit Complexity: A Uniform Approach , author=. 1999 , publisher=

work page 1999
[8]

Deep Sets , year =

Zaheer, Manzil and Kottur, Satwik and Ravanbhakhsh, Siamak and P\'. Deep Sets , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

work page
[9]

International Conference on Learning Representations , year=

How Powerful are Graph Neural Networks? , author=. International Conference on Learning Representations , year=

work page
[10]

2026 , eprint=

GraphBench: Next-generation graph learning benchmarking , author=. 2026 , eprint=

work page 2026
[11]

Transactions of the Association for Computational Linguistics , volume=

Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=

work page
[12]

Aequationes Mathematicae , volume=

Constructing cospectral graphs , author=. Aequationes Mathematicae , volume=. 1982 , publisher=

work page 1982
[13]

2012 , publisher=

Spectra of Graphs , author=. 2012 , publisher=

work page 2012
[14]

Linear Algebra and its Applications , volume=

Laplacian matrices of graphs: a survey , author=. Linear Algebra and its Applications , volume=. 1994 , publisher=

work page 1994
[15]

Advances in Neural Information Processing Systems , year =

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , author =. Advances in Neural Information Processing Systems , year =

work page
[16]

ICML , year=

Neural Message Passing for Quantum Chemistry , author=. ICML , year=

work page
[17]

AAAI , year=

Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks , author=. AAAI , year=

work page
[18]

arXiv preprint arXiv:1907.03199 , year=

What Graph Neural Networks Cannot Learn: Depth vs Width , author=. arXiv preprint arXiv:1907.03199 , year=

work page arXiv 1907
[19]

ICLR , year=

Graph Attention Networks , author=. ICLR , year=

work page
[20]

ICLR , year=

How Attentive are Graph Attention Networks? , author=. ICLR , year=

work page
[21]

Graph-bert: Only attention is needed for learning graph representations

Graph-BERT: Only Attention is Needed for Learning Graph Representations , author=. arXiv preprint arXiv:2001.05140 , year=

work page arXiv 2001
[22]

NeurIPS , year=

Recipe for a General, Powerful, Scalable Graph Transformer , author=. NeurIPS , year=

work page
[23]

Talk like a graph: Encoding graphs for large language models

Talk Like a Graph: Encoding Graphs for Large Language Models , author=. arXiv preprint arXiv:2310.04560 , year=

work page arXiv
[24]

NeurIPS , year=

Rethinking Graph Transformers with Spectral Attention , author=. NeurIPS , year=

work page
[25]

ICLR , year=

Sign and Basis Invariant Networks for Spectral Graph Representation Learning , author=. ICLR , year=

work page
[26]

ICLR , year=

Are Transformers Universal Approximators of Sequence-to-Sequence Functions? , author=. ICLR , year=

work page
[27]

arXiv preprint arXiv:2107.13168 , year=

Statistically Meaningful Approximation: A Case Study on Approximating Turing Machines with Transformers , author=. arXiv preprint arXiv:2107.13168 , year=

work page arXiv
[28]

arXiv preprint arXiv:2309.06979 , year=

Auto-Regressive Next-Token Predictors are Universal Learners , author=. arXiv preprint arXiv:2309.06979 , year=

work page arXiv
[29]

TACL , year=

The Parallelism Tradeoff: Limitations of Log-Precision Transformers , author=. TACL , year=

work page
[30]

arXiv preprint arXiv:2210.10749 , year=

Transformers Learn Shortcuts to Automata , author=. arXiv preprint arXiv:2210.10749 , year=

work page arXiv
[31]

TACL , year=

Formal Language Recognition by Hard Attention Transformers , author=. TACL , year=

work page
[32]

SODA , year=

A Model of Computation for MapReduce , author=. SODA , year=

work page
[33]

arXiv preprint arXiv:2405.18512 , year=

Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. arXiv preprint arXiv:2405.18512 , year=

work page arXiv
[34]

arXiv preprint arXiv:2402.09268 , year=

Transformers, Parallel Computation, and Logarithmic Depth , author=. arXiv preprint arXiv:2402.09268 , year=

work page arXiv
[35]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

work page 2020
[36]

and Sterling, Teague and Mysinger, Michael M

Irwin, John J. and Sterling, Teague and Mysinger, Michael M. and Bolstad, Erin S. and Coleman, Ryan G. , title =. Journal of Chemical Information and Modeling , volume =. 2012 , doi =

work page 2012
[37]

Advances in Neural Information Processing Systems , volume=

Can Graph Neural Networks Count Substructures? , author=. Advances in Neural Information Processing Systems , volume=

work page
[38]

Journal of the ACM , volume =

Undirected Connectivity in Log-Space , author =. Journal of the ACM , volume =

work page
[39]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Representational Strengths and Limitations of Transformers , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[40]

arXiv preprint arXiv:2407.15160 , year =

When Can Transformers Count to n? , author =. arXiv preprint arXiv:2407.15160 , year =

work page arXiv
[41]

and Coates, Mark and Torr, Philip H.S

Ma, Liheng and Lin, Chen and Lim, Derek and Romero-Soriano, Adriana and Dokania, Puneet K. and Coates, Mark and Torr, Philip H.S. and Lim, Ser-Nam , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

work page 2023
[42]

International Conference on Learning Representations , year=

Graph Neural Networks with Learnable Structural and Positional Representations , author=. International Conference on Learning Representations , year=

work page

[1] [1]

AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

A Generalization of Transformer Networks to Graphs , author =. AAAI Workshop on Deep Learning on Graphs: Methods and Applications , year =

work page

[2] [2]

IEEE Signal Processing Magazine , volume =

Geometric Deep Learning: Going beyond Euclidean Data , author =. IEEE Signal Processing Magazine , volume =

work page

[3] [3]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Depth-Width Tradeoffs for Transformers on Graph Tasks , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[4] [4]

arXiv preprint arXiv:2503.01805 , year=

Depth vs Width Tradeoffs in Graph Transformers , author=. arXiv preprint arXiv:2503.01805 , year=

work page arXiv

[5] [5]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Do Transformers Really Perform Badly for Graph Representation? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page

[6] [6]

Journal of Machine Learning Research (JMLR) , volume=

Benchmarking Graph Neural Networks , author=. Journal of Machine Learning Research (JMLR) , volume=

work page

[7] [7]

1999 , publisher=

Introduction to Circuit Complexity: A Uniform Approach , author=. 1999 , publisher=

work page 1999

[8] [8]

Deep Sets , year =

Zaheer, Manzil and Kottur, Satwik and Ravanbhakhsh, Siamak and P\'. Deep Sets , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

work page

[9] [9]

International Conference on Learning Representations , year=

How Powerful are Graph Neural Networks? , author=. International Conference on Learning Representations , year=

work page

[10] [10]

2026 , eprint=

GraphBench: Next-generation graph learning benchmarking , author=. 2026 , eprint=

work page 2026

[11] [11]

Transactions of the Association for Computational Linguistics , volume=

Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=

work page

[12] [12]

Aequationes Mathematicae , volume=

Constructing cospectral graphs , author=. Aequationes Mathematicae , volume=. 1982 , publisher=

work page 1982

[13] [13]

2012 , publisher=

Spectra of Graphs , author=. 2012 , publisher=

work page 2012

[14] [14]

Linear Algebra and its Applications , volume=

Laplacian matrices of graphs: a survey , author=. Linear Algebra and its Applications , volume=. 1994 , publisher=

work page 1994

[15] [15]

Advances in Neural Information Processing Systems , year =

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , author =. Advances in Neural Information Processing Systems , year =

work page

[16] [16]

ICML , year=

Neural Message Passing for Quantum Chemistry , author=. ICML , year=

work page

[17] [17]

AAAI , year=

Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks , author=. AAAI , year=

work page

[18] [18]

arXiv preprint arXiv:1907.03199 , year=

What Graph Neural Networks Cannot Learn: Depth vs Width , author=. arXiv preprint arXiv:1907.03199 , year=

work page arXiv 1907

[19] [19]

ICLR , year=

Graph Attention Networks , author=. ICLR , year=

work page

[20] [20]

ICLR , year=

How Attentive are Graph Attention Networks? , author=. ICLR , year=

work page

[21] [21]

Graph-bert: Only attention is needed for learning graph representations

Graph-BERT: Only Attention is Needed for Learning Graph Representations , author=. arXiv preprint arXiv:2001.05140 , year=

work page arXiv 2001

[22] [22]

NeurIPS , year=

Recipe for a General, Powerful, Scalable Graph Transformer , author=. NeurIPS , year=

work page

[23] [23]

Talk like a graph: Encoding graphs for large language models

Talk Like a Graph: Encoding Graphs for Large Language Models , author=. arXiv preprint arXiv:2310.04560 , year=

work page arXiv

[24] [24]

NeurIPS , year=

Rethinking Graph Transformers with Spectral Attention , author=. NeurIPS , year=

work page

[25] [25]

ICLR , year=

Sign and Basis Invariant Networks for Spectral Graph Representation Learning , author=. ICLR , year=

work page

[26] [26]

ICLR , year=

Are Transformers Universal Approximators of Sequence-to-Sequence Functions? , author=. ICLR , year=

work page

[27] [27]

arXiv preprint arXiv:2107.13168 , year=

Statistically Meaningful Approximation: A Case Study on Approximating Turing Machines with Transformers , author=. arXiv preprint arXiv:2107.13168 , year=

work page arXiv

[28] [28]

arXiv preprint arXiv:2309.06979 , year=

Auto-Regressive Next-Token Predictors are Universal Learners , author=. arXiv preprint arXiv:2309.06979 , year=

work page arXiv

[29] [29]

TACL , year=

The Parallelism Tradeoff: Limitations of Log-Precision Transformers , author=. TACL , year=

work page

[30] [30]

arXiv preprint arXiv:2210.10749 , year=

Transformers Learn Shortcuts to Automata , author=. arXiv preprint arXiv:2210.10749 , year=

work page arXiv

[31] [31]

TACL , year=

Formal Language Recognition by Hard Attention Transformers , author=. TACL , year=

work page

[32] [32]

SODA , year=

A Model of Computation for MapReduce , author=. SODA , year=

work page

[33] [33]

arXiv preprint arXiv:2405.18512 , year=

Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. arXiv preprint arXiv:2405.18512 , year=

work page arXiv

[34] [34]

arXiv preprint arXiv:2402.09268 , year=

Transformers, Parallel Computation, and Logarithmic Depth , author=. arXiv preprint arXiv:2402.09268 , year=

work page arXiv

[35] [35]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

work page 2020

[36] [36]

and Sterling, Teague and Mysinger, Michael M

Irwin, John J. and Sterling, Teague and Mysinger, Michael M. and Bolstad, Erin S. and Coleman, Ryan G. , title =. Journal of Chemical Information and Modeling , volume =. 2012 , doi =

work page 2012

[37] [37]

Advances in Neural Information Processing Systems , volume=

Can Graph Neural Networks Count Substructures? , author=. Advances in Neural Information Processing Systems , volume=

work page

[38] [38]

Journal of the ACM , volume =

Undirected Connectivity in Log-Space , author =. Journal of the ACM , volume =

work page

[39] [39]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Representational Strengths and Limitations of Transformers , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[40] [40]

arXiv preprint arXiv:2407.15160 , year =

When Can Transformers Count to n? , author =. arXiv preprint arXiv:2407.15160 , year =

work page arXiv

[41] [41]

and Coates, Mark and Torr, Philip H.S

Ma, Liheng and Lin, Chen and Lim, Derek and Romero-Soriano, Adriana and Dokania, Puneet K. and Coates, Mark and Torr, Philip H.S. and Lim, Ser-Nam , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

work page 2023

[42] [42]

International Conference on Learning Representations , year=

Graph Neural Networks with Learnable Structural and Positional Representations , author=. International Conference on Learning Representations , year=

work page