pith. machine review for the scientific record.

arxiv: 2604.03815 · v2 · submitted 2026-04-04 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords k-MIP attention · graph transformers · expressive power · GraphGPS · S-SEG-WL · linear complexity · large-scale graphs · sparse attention

The pith

k-MIP attention lets graph transformers approximate full-attention models to arbitrary precision with linear memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces k-Maximum Inner Product (k-MIP) attention, which selects only the top-k most relevant keys for each query node by ranking inner-product scores. Computing the resulting sparse attention pattern with symbolic matrices yields linear memory use and practical speedups of up to an order of magnitude. The central proof shows that any full-attention transformer can still be approximated arbitrarily closely by a k-MIP transformer. The same mechanism is embedded in the GraphGPS framework, where the authors derive an upper bound on its ability to distinguish non-isomorphic graphs, expressed via the S-SEG-WL test. Together these results make it feasible to run expressive graph transformers on graphs containing hundreds of thousands of nodes.

Core claim

k-MIP attention selects, for each query, the k keys with the largest inner-product scores and computes attention only over that sparse support; the resulting k-MIP transformer can approximate any full-attention transformer to arbitrary precision. When the same attention layer is placed inside GraphGPS, the overall model's graph-distinguishing power is bounded above by the S-SEG-WL test.
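
A minimal, single-head dense sketch of the selection rule in PyTorch (our illustration, not the authors' implementation: the paper obtains linear memory by evaluating the same top-k reduction with symbolic, lazily evaluated score matrices, which this toy version does not attempt, and the 1/√d scaling is the standard transformer convention, assumed here):

    import torch

    def kmip_attention(Q, K, V, k):
        # Score every (query, key) pair, keep the k largest scores per
        # query, and softmax only over that sparse support.
        scores = Q @ K.T / Q.shape[-1] ** 0.5           # (n, n) scaled inner products
        topk_scores, topk_idx = scores.topk(k, dim=-1)  # (n, k) each
        weights = torch.softmax(topk_scores, dim=-1)    # renormalize over kept keys
        V_sel = V[topk_idx]                             # (n, k, d) gathered values
        return (weights.unsqueeze(-1) * V_sel).sum(dim=1)

    out = kmip_attention(torch.randn(1000, 64), torch.randn(1000, 64),
                         torch.randn(1000, 64), k=16)   # (1000, 64)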

What carries the argument

k-Maximum Inner Product attention, which performs top-k selection on inner-product scores to produce a sparse yet flexible attention pattern.

If this is right

  • Graphs with more than 500,000 nodes become tractable on a single GPU (see the back-of-envelope sketch after this list).
  • Practical run-time speedups reach roughly an order of magnitude over all-to-all attention.
  • Any existing full-attention graph transformer can be replaced by a k-MIP version while retaining the same theoretical approximation power.
  • GraphGPS equipped with k-MIP attention inherits the S-SEG-WL expressivity ceiling.
  • The method ranks competitively on the Long Range Graph Benchmark and on large inductive point-cloud tasks.
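
A back-of-envelope memory check (ours, not the paper's) shows why the first point is plausible: storing all-to-all attention scores at this scale is out of reach for any single GPU, while top-k scores are trivial. The 500,000-node figure comes from the abstract; k = 16 is an assumed illustration.

    # Rough memory needed just for attention scores, at fp32 (4 bytes each).
    # n = 500k nodes comes from the abstract; k = 16 is an assumed value.
    n, k = 500_000, 16
    full_bytes = n * n * 4   # all-to-all score matrix
    kmip_bytes = n * k * 4   # k retained scores per query
    print(f"full attention : {full_bytes / 1e9:,.0f} GB")   # ~1,000 GB
    print(f"k-MIP attention: {kmip_bytes / 1e6:,.0f} MB")   # ~32 MB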

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the approximation result is tight, hybrid architectures could mix k-MIP layers with other sparse attentions without retraining from scratch.
  • The S-SEG-WL bound suggests that increasing k or relaxing the top-k rule could raise the expressivity ceiling in a controlled way.
  • The linear-memory property may extend the same attention pattern to non-graph sequence models that currently face quadratic bottlenecks.

Load-bearing premise

The top-k selection on inner-product scores preserves enough information to keep the approximation guarantee and the S-SEG-WL bound intact, regardless of how k scales with graph size.
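
One cheap, typical-case probe of this premise (our sketch on random inputs; the paper's guarantee concerns worst-case approximation): track how fast the k-MIP output converges to full attention as k grows, reusing the kmip_attention sketch from the core-claim section above.

    import torch

    torch.manual_seed(0)
    n, d = 512, 64
    Q, K, V = (torch.randn(n, d) for _ in range(3))
    full = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V   # exact all-to-all attention
    for k in (4, 16, 64, 256, 512):
        approx = kmip_attention(Q, K, V, k)              # sketch defined earlier
        err = ((approx - full).norm() / full.norm()).item()
        print(f"k={k:3d}  relative error {err:.3e}")     # machine precision at k = n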

What would settle it

A concrete graph together with a full-attention transformer whose output on that graph cannot be matched within any chosen epsilon by any k-MIP transformer, or a pair of graphs that GraphGPS distinguishes but S-SEG-WL cannot.

Figures

Figures reproduced from arXiv: 2604.03815 by Haitz Sáez de Ocáriz Borde, Jonas De Schouwer, and Xiaowen Dong.

Figure 1: One layer in the GraphGPS framework. The dashed lines indicate residual connections. Based on Rampášek et al. (2022).

Figure 2: Comparison of full attention and k-MIP attention (with/without symbolic matrices).

Figure 3: Runtime breakdown of full attention and k-MIP attention in the training setting at N = 10^4.5, measured on a single 40 GB A100 GPU. The left and middle bars compare full and k-MIP attention at the same scale; the right panel zooms into k-MIP attention. k-MIP attention achieves a speedup of an order of magnitude over full attention in our implementation; specifically, k-MIP is 12.43× faster during in…

Figure 4: Tradeoffs between training time and accuracy across different datasets in City-Networks.

Figure 5: Illustration of a single iteration of the Weisfeiler-Lehman test. The node labels (in…

Figure 6: Two graphs that are indistinguishable by the 1-WL test.

Figure 7: A comparison of dense, sparse, and symbolic matrices. Based on Feydy et al.

Figure 8: Analysis of k-MIP attention's relationship to full attention.

Figure 9: Scatter plots of the test performance against the average training epoch duration for varying…
Original abstract

Graph transformers have shown promise in overcoming limitations of traditional graph neural networks, such as oversquashing and difficulties in modeling long-range dependencies. However, their application to large-scale graphs is hindered by the quadratic memory and computational complexity of the all-to-all attention mechanism. Although alternatives such as linearized attention and restricted attention patterns have been proposed, these often degrade performance or limit expressive power. To better balance efficiency and effectiveness, we introduce k-Maximum Inner Product (k-MIP) attention for graph transformers. k-MIP attention selects the most relevant key nodes per query via a top-k operation, yielding a sparse yet flexible attention pattern. Combined with an attention score computation based on symbolic matrices, this results in linear memory complexity and practical speedups of up to an order of magnitude compared to all-to-all attention, enabling the processing of graphs with over 500k nodes on a single A100 GPU. We provide a theoretical analysis of expressive power, showing that k-MIP attention does not compromise the expressiveness of graph transformers: specifically, we prove that k-MIP transformers can approximate any full-attention transformer to arbitrary precision. In addition, we analyze the expressive power of the GraphGPS framework, in which we integrate our attention mechanism, and establish an upper bound on its graph distinguishing capability in terms of the S-SEG-WL test. Finally, we validate our approach on the Long Range Graph Benchmark, the City-Networks benchmark, and two custom large-scale inductive point cloud datasets, consistently ranking among the top-performing scalable graph transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces k-Maximum Inner Product (k-MIP) attention for graph transformers, selecting the top-k most relevant keys per query, with scores computed via symbolic matrices, to achieve linear memory complexity and speedups on graphs exceeding 500k nodes. It claims to prove that k-MIP transformers approximate any full-attention transformer to arbitrary precision and derives an upper bound on the expressive power of the integrated GraphGPS framework in terms of the S-SEG-WL test, with empirical validation on LRGB, City-Networks, and large inductive point-cloud datasets.

Significance. If the approximation guarantee holds with k independent of n for arbitrary precision, the work would meaningfully advance scalable graph transformers by preserving expressiveness while enabling processing of very large graphs. The S-SEG-WL bound on GraphGPS adds theoretical value, and the reported empirical rankings among top scalable models support practical utility.

major comments (2)
  1. [Abstract / theoretical analysis] The claim that k-MIP transformers approximate any full-attention transformer to arbitrary precision lacks an explicit error bound showing that k = o(n) suffices uniformly over attention score distributions and epsilon. If diffuse scores require k to scale with n, this undermines the coexistence of arbitrary-precision approximation and O(n) complexity; a sketch of the controlling quantity follows these comments.
  2. [Expressive power analysis] The upper bound on GraphGPS distinguishing power via S-SEG-WL is asserted, but the derivation steps (including how k-MIP interacts with the WL hierarchy) are not fully detailed, making it impossible to verify whether the bound is tight or load-bearing for the central expressiveness claim.
minor comments (1)
  1. [Experiments] The speed-up claims (up to 10×) and the results on 500k-node graphs would benefit from explicit tables comparing wall-clock time and memory against linearized attention baselines at matched k values.
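
As a sketch of the quantity the requested bound must control (our notation and derivation, not the paper's): for a single query q with scores s_i = ⟨q, k_i⟩/√d, the gap between full attention and renormalized top-k attention is governed by the softmax mass left outside the top-k support.

    \[
      \bigl\| \mathrm{Attn}(q) - \mathrm{Attn}_{k\text{-MIP}}(q) \bigr\|
        \;\le\; 2\,\varepsilon_k \,\max_i \|v_i\|,
      \qquad
      \varepsilon_k \;=\; \frac{\sum_{i \notin \mathrm{Top}_k(q)} e^{s_i}}
                               {\sum_{i=1}^{n} e^{s_i}}.
    \]

The factor 2 covers both the dropped tail and the renormalization of the retained weights; arbitrary-precision approximation at k = o(n) then amounts to showing that this tail mass vanishes at sub-linear k for the score distributions in scope.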

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below. Where the comments identify gaps in the presentation of our theoretical results, we will revise the manuscript to provide the requested details and clarifications.

Point-by-point responses
  1. Referee: [Abstract / theoretical analysis] The claim that k-MIP transformers approximate any full-attention transformer to arbitrary precision lacks an explicit error bound showing that k = o(n) suffices uniformly over attention score distributions and epsilon. If diffuse scores require k to scale with n, this undermines the coexistence of arbitrary-precision approximation and O(n) complexity.

    Authors: We agree that the current statement of the approximation result would benefit from an explicit, uniform error bound. The existing proof establishes that for any fixed full-attention transformer and any epsilon > 0 there exists a finite k making the k-MIP output arbitrarily close; however, it does not yet quantify how k must grow with n in the worst case over score distributions. In the revised manuscript we will add a new theorem (with proof) that supplies a concrete bound: under standard sub-Gaussian assumptions on the inner-product scores, k = O(log n / epsilon^2) suffices to achieve epsilon-approximation uniformly. This bound is o(n) for any fixed epsilon, thereby preserving both the arbitrary-precision claim and the linear-complexity regime. revision: yes

  2. Referee: [Expressive power analysis] The upper bound on GraphGPS distinguishing power via S-SEG-WL is asserted, but the derivation steps (including how k-MIP interacts with the WL hierarchy) are not fully detailed, making it impossible to verify whether the bound is tight or load-bearing for the central expressiveness claim.

    Authors: We accept that the derivation of the S-SEG-WL upper bound is currently too terse. In the revised version we will expand the expressive-power section with a complete, step-by-step argument. This will include: (i) the precise definition of the S-SEG-WL test, (ii) the two key lemmas showing that k-MIP attention preserves the color-refinement invariants required by S-SEG-WL, and (iii) the final reduction establishing that any graph pair indistinguishable by S-SEG-WL remains indistinguishable by the k-MIP-augmented GraphGPS model. The expanded proof will make the tightness of the bound verifiable. revision: yes
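
For reference, the extraction preserves fragments of the SEG-WL coloring rules that the response above promises to spell out; cleaned into consistent notation (equation numbers are the paper's), they read:

    \[
      c_0(v) \;=\; \Phi_0\bigl(X_v,\, f_A(v, G)\bigr), \tag{11}
    \]
    \[
      c_l(v) \;=\; \Phi\bigl(\{\!\{\, (c_{l-1}(r),\, f_R(v, r, G)) \mid r \in V \,\}\!\}\bigr), \tag{12}
    \]

where \(\Phi_0\) injectively maps \(\mathbb{R}^d \times C\) to \(C\) and \(\Phi\) injectively maps multisets over \(C \times C\) to \(C\).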

Circularity Check

0 steps flagged

No circularity: claims rest on standard attention approximation and WL theory

Full rationale

The paper derives its core results—a proof that k-MIP transformers approximate full-attention transformers to arbitrary precision, plus an upper bound on GraphGPS expressivity via the S-SEG-WL test—directly from established theoretical machinery for attention mechanisms and Weisfeiler-Lehman variants. No equation or step reduces the claimed approximation or bound to a fitted parameter, self-definition, or prior self-citation by construction. The top-k selection is analyzed as preserving sufficient mass under the stated symbolic-matrix scoring, without the result being tautological on the inputs. Self-citations (if any) are not load-bearing for the main theorems, which remain independently verifiable against external attention and WL literature. This is the expected non-finding for a paper whose derivations are self-contained against standard benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work rests on standard attention and Weisfeiler-Lehman theory; k is treated as a tunable hyperparameter rather than a fitted constant.

free parameters (1)
  • k
    Number of keys retained per query; chosen as a hyperparameter to trade off sparsity against approximation quality.
axioms (2)
  • domain assumption: Inner-product attention scores can be used to rank relevance for top-k selection while preserving universal approximation properties.
    Invoked to justify that the sparse pattern can still approximate full attention.
  • standard math: Standard properties of the S-SEG-WL test apply to the GraphGPS architecture with the new attention.
    Used to derive the upper bound on distinguishing power.

pith-pipeline@v0.9.0 · 5594 in / 1335 out tokens · 87398 ms · 2026-05-13T17:58:52.583737+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

  1. Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021. arXiv:2104.13478.
  2. Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 2021. URL: https://openreview.net/forum?id=6RR3wU4mSZ.
  3. Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, and Dominique Beaini. GPS++: An optimised hybrid MPNN/transformer for molecular property prediction. arXiv preprint arXiv:2212.02229, 2022.