pith. machine review for the scientific record.

arxiv: 2105.14491 · v3 · submitted 2021-05-30 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

How Attentive are Graph Attention Networks?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords: graph attention networks · GAT · GATv2 · static attention · dynamic attention · graph neural networks · expressiveness · benchmarks

The pith

Graph Attention Networks use a static attention mechanism that cannot express some simple graph problems; reordering its internal operations yields GATv2, a dynamic-attention variant that can.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Graph Attention Networks attend to neighbors using a mechanism in which the ranking of attention scores stays the same no matter which node is querying. This static property prevents the model from learning certain functions on graphs, as shown when GAT fails to fit the training data on a controlled synthetic task. By changing the order of the linear transformation and the attention score computation, the authors obtain GATv2, which conditions attention rankings on the query node. The change matches GAT's parametric cost yet produces strictly higher expressiveness. A reader would care because GAT remains a standard building block for graph tasks, and the fix improves results on real benchmarks without added cost.
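
The difference is easiest to see in the two scoring functions themselves. Below is a minimal single-head sketch in PyTorch; the dimensions, variable names, and random parameters are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    d, d_out = 8, 16
    W = torch.randn(d_out, d)          # shared linear transformation (GAT)
    a_gat = torch.randn(2 * d_out)     # GAT attention vector over [W h_i || W h_j]
    W2 = torch.randn(d_out, 2 * d)     # GATv2 weight applied to the raw concatenation
    a_gatv2 = torch.randn(d_out)       # GATv2 attention vector applied after the nonlinearity

    def gat_score(h_i, h_j):
        # GAT: LeakyReLU(a^T [W h_i || W h_j]); a acts before the nonlinearity,
        # so the score splits into a query-only term plus a key-only term.
        z = torch.cat([W @ h_i, W @ h_j])
        return F.leaky_relu(a_gat @ z, negative_slope=0.2)

    def gatv2_score(h_i, h_j):
        # GATv2: a^T LeakyReLU(W [h_i || h_j]); the nonlinearity sits between W and a,
        # so the score no longer separates and the neighbor ranking can depend on h_i.
        z = F.leaky_relu(W2 @ torch.cat([h_i, h_j]), negative_slope=0.2)
        return a_gatv2 @ z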

Core claim

Graph Attention Networks compute attention scores such that the relative ordering among a node's neighbors is independent of the node's own representation, which the authors term static attention. This restriction means GAT cannot represent functions that require the ranking of neighbors to change with the query. The paper exhibits the limitation on a simple synthetic graph problem where the model cannot fit the training data, then shows that moving the linear transformation inside the attention function produces GATv2, which implements dynamic attention and is provably more expressive than the original GAT.
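
Stated in the paper's notation (writing the attention vector as a = [a_1 || a_2]), the restriction follows from how the GAT score decomposes; a compact LaTeX restatement of the two scoring functions:

    \[
      e(h_i, h_j)
        = \mathrm{LeakyReLU}\!\left(a^{\top} [\, W h_i \,\Vert\, W h_j \,]\right)
        = \mathrm{LeakyReLU}\!\left(a_1^{\top} W h_i + a_2^{\top} W h_j\right),
      \qquad
      e_{\mathrm{v2}}(h_i, h_j)
        = a^{\top}\, \mathrm{LeakyReLU}\!\left(W [\, h_i \,\Vert\, h_j \,]\right).
    \]

In GAT the key-side term a_2^T W h_j alone determines the relative ordering of a node's neighbors, while in GATv2 the nonlinearity between W and a couples query and key, allowing the ordering to vary with the query representation.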

What carries the argument

Static attention in GAT, defined as neighbor ranking that does not depend on the query node representation.

If this is right

  • GAT cannot solve graph problems that require attention rankings to depend on the querying node.
  • GATv2 is strictly more expressive than GAT while matching its parametric cost.
  • GATv2 outperforms the original GAT on 11 graph benchmarks from OGB and other sources.
  • The modification is simple enough to integrate into existing GNN libraries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other attention-based graph models that fix neighbor rankings independently of the query may carry the same limitation.
  • Replacing GAT layers with GATv2 in existing pipelines offers a low-cost way to test whether dynamic attention helps on a given dataset; a drop-in sketch follows this list.
  • Expressiveness checks using small controlled graphs could become a routine step when designing new graph attention variants.
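
As an illustration of the drop-in replacement mentioned in the list above, here is a sketch for a generic two-layer node-classification model in PyTorch Geometric; GATConv and GATv2Conv are the library's layer names, while the surrounding module and hyperparameters are illustrative assumptions.

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GATConv, GATv2Conv

    class AttentionGNN(torch.nn.Module):
        # Swapping the layer class is the only change needed to move from
        # static (GAT) to dynamic (GATv2) attention in this sketch.
        def __init__(self, in_dim, hidden_dim, out_dim, heads=8, use_v2=True):
            super().__init__()
            Layer = GATv2Conv if use_v2 else GATConv
            self.conv1 = Layer(in_dim, hidden_dim, heads=heads)
            self.conv2 = Layer(hidden_dim * heads, out_dim, heads=1)

        def forward(self, x, edge_index):
            x = F.elu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)

Per the abstract, equivalent GATv2 layers also ship with the Deep Graph Library and TensorFlow GNN, so the same one-line swap applies in those frameworks.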

Load-bearing premise

That the controlled synthetic problem captures the expressiveness limits that matter for real graph tasks and that reordering the operations fully converts static attention to dynamic attention without side effects.

What would settle it

A graph task in which each node must rank its neighbors differently according to its own features; GAT should fail to fit the training data while GATv2 should succeed.
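
A data-generation sketch in the spirit of such a task; the bipartite "lookup" construction below is an illustrative assumption about what a controlled problem of this kind could look like, not the paper's exact setup.

    import torch

    def lookup_task(k, seed=0):
        # k "query" nodes are each connected to the same k "key" nodes; every query
        # must attend to the single key whose id matches its own, so the correct
        # neighbor ranking depends on the query node's own feature.
        g = torch.Generator().manual_seed(seed)
        values = torch.randperm(k, generator=g)                    # value stored at each key
        query_x = torch.stack([torch.arange(k), torch.full((k,), -1)], dim=1)
        key_x = torch.stack([torch.arange(k), values], dim=1)
        x = torch.cat([query_x, key_x]).float()                    # node features
        src = torch.arange(k, 2 * k).repeat_interleave(k)          # key -> query edges
        dst = torch.arange(k).repeat(k)
        edge_index = torch.stack([src, dst])
        y = values                                                 # each query predicts its key's value
        return x, edge_index, y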

read the original abstract

Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered as the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express: in a controlled problem, we show that static attention hinders GAT from even fitting the training data. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2: a dynamic graph attention variant that is strictly more expressive than GAT. We perform an extensive evaluation and show that GATv2 outperforms GAT across 11 OGB and other benchmarks while we match their parametric costs. Our code is available at https://github.com/tech-srl/how_attentive_are_gats . GATv2 is available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that GAT computes only static attention (attention score rankings independent of the query node) because of the additive form LeakyReLU(a^T [W h_i || W h_j]) = LeakyReLU(c_i + d_j). It introduces a controlled synthetic graph problem in which GAT fails to fit the training data, proposes GATv2 via reordering of operations to obtain dynamic attention, proves GATv2 is strictly more expressive, and reports that GATv2 outperforms GAT on 11 OGB and other benchmarks at matched parameter cost.
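
A quick numerical check of that additive-form claim (a sketch with random parameters, not the authors' code): once the score takes the form LeakyReLU(c_i + d_j), every query node produces the same neighbor ordering.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d, d_out = 6, 8, 16
    H = torch.randn(n, d)                        # node features
    W = torch.randn(d_out, d)
    a1, a2 = torch.randn(d_out), torch.randn(d_out)

    c = H @ W.T @ a1                             # query-side terms c_i
    dk = H @ W.T @ a2                            # key-side terms d_j
    scores = F.leaky_relu(c[:, None] + dk[None, :], negative_slope=0.2)

    # LeakyReLU is strictly increasing, so sorting over j ignores c_i entirely:
    rankings = scores.argsort(dim=1, descending=True)
    assert (rankings == rankings[0]).all()       # every query ranks the keys identically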

Significance. If the central claims hold, the work is significant: it isolates a concrete, previously under-appreciated restriction in a widely adopted GNN, supplies a minimal architectural change that restores dynamic attention, and demonstrates consistent empirical gains. The public code release and integration into PyTorch Geometric, DGL, and TensorFlow GNN are concrete strengths that aid reproducibility.

major comments (2)
  1. [Controlled synthetic problem / expressivity argument] Synthetic-task section: the claim that GAT cannot express the target function rests on observed non-convergence rather than a proof that no parameter setting exists. Because reordering to GATv2 also changes the computation graph and gradient flow, the training-set failure could be an optimization artifact rather than a pure expressivity limit; a parameter-existence argument or explicit construction showing the required attention ranking is impossible under the static form would be needed to make the inference load-bearing.
  2. [GATv2 proposal] GATv2 definition and expressivity claim: while the reordering is presented as converting static to dynamic attention, the manuscript should explicitly verify that the new ordering preserves the original parameter count and does not introduce new degrees of freedom that could explain the performance difference independently of the static/dynamic distinction.
minor comments (2)
  1. [Abstract] Abstract: the phrase '11 OGB and other benchmarks' should either list the datasets or point to the specific table/figure that enumerates them.
  2. [Evaluation] Experimental section: confirm that all reported improvements include standard deviations over multiple runs and appropriate statistical tests; the current description leaves the strength of the outperformance claim unclear.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes we will make in the revised manuscript.

read point-by-point responses
  1. Referee: [Controlled synthetic problem / expressivity argument] Synthetic-task section: the claim that GAT cannot express the target function rests on observed non-convergence rather than a proof that no parameter setting exists. Because reordering to GATv2 also changes the computation graph and gradient flow, the training-set failure could be an optimization artifact rather than a pure expressivity limit; a parameter-existence argument or explicit construction showing the required attention ranking is impossible under the static form would be needed to make the inference load-bearing.

    Authors: We agree that the original presentation relies on empirical non-convergence and that a formal argument would make the expressivity claim more robust. In the revised manuscript we will add an explicit construction: we exhibit a small graph and target attention ranking that cannot be realized by any choice of parameters under the static (additive) form LeakyReLU(c_i + d_j), because the ranking is forced to be independent of the query node. This construction is independent of optimization or gradient flow and directly shows that the limitation is architectural rather than an artifact of training dynamics. We retain the synthetic experiment as supporting evidence but now anchor the claim with the parameter-existence argument; a compact sketch of that argument appears after these point-by-point responses. revision: yes

  2. Referee: [GATv2 proposal] GATv2 definition and expressivity claim: while the reordering is presented as converting static to dynamic attention, the manuscript should explicitly verify that the new ordering preserves the original parameter count and does not introduce new degrees of freedom that could explain the performance difference independently of the static/dynamic distinction.

    Authors: We confirm that GATv2 uses exactly the same number and dimensionality of parameters as GAT. The only change is the order of the linear transformation, concatenation, and nonlinearity; no additional weight matrices or biases are introduced. In the revised manuscript we will insert a short paragraph (with an accompanying table) that explicitly counts the parameters for both models on the standard OGB setups, showing they are identical. We will also note that any performance difference must therefore stem from the change in attention expressivity rather than from increased model capacity. revision: yes
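
A compact version of the parameter-existence argument promised in response 1 (an editorial restatement of the monotonicity reasoning, not text from the revised manuscript):

    \[
      e(h_i, h_j) = \mathrm{LeakyReLU}\!\left(a_1^{\top} W h_i + a_2^{\top} W h_j\right),
      \qquad
      j^{*} \in \arg\max_{j}\, a_2^{\top} W h_j .
    \]

Because LeakyReLU is strictly increasing, e(h_i, h_{j*}) >= e(h_i, h_j) for every query i and every neighbor j, so after softmax normalization each query assigns its largest attention weight to the same key j*. Any target that requires two queries to attend most strongly to two different keys is therefore unrealizable under the static form for any parameter setting, independently of optimization.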

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

full rationale

The paper derives the static attention property directly from the additive decomposition of GAT's scoring function (LeakyReLU(a^T [W h_i || W h_j]) reducing to a query-independent ranking), introduces an independent formal definition of static vs. dynamic attention, and demonstrates the limitation via a new synthetic task where GAT fails to fit while the reordered GATv2 succeeds. GATv2 is obtained by a straightforward reordering of linear and nonlinearity operations that makes the score query-dependent by construction. No step reduces a claimed result to a prior fitted quantity, self-citation chain, or renamed input; the expressivity argument rests on the new definitions and the controlled experiment rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard graph-neural-network assumptions about node features and neighborhood aggregation plus the newly introduced GATv2 construction; no ad-hoc fitted constants or new physical entities are invoked.

axioms (1)
  • domain assumption Standard assumptions on graph structure, node features, and message passing in GNNs
    Implicit background for all GAT-style models.
invented entities (1)
  • GATv2 no independent evidence
    purpose: Dynamic-attention variant of GAT
    Proposed architecture whose expressiveness is claimed to exceed GAT.

pith-pipeline@v0.9.0 · 5518 in / 1348 out tokens · 54909 ms · 2026-05-17T02:26:02.690351+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Frequency-Space Mechanics: A Sequence and Coordinate-Free Representation for Protein Function Prediction

    q-bio.BM 2026-05 unverdicted novelty 7.0

    Vibrational mode graphs from molecular dynamics enable sequence-free protein function prediction via graph neural networks, with entrainment improving signals for collective dynamics.

  2. Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.

  3. Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

    cs.NI 2026-05 conditional novelty 7.0

    Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

  4. Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

    cs.LG 2026-05 unverdicted novelty 7.0

    EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

  5. Concept Graph Convolutions: Message Passing in the Concept Space

    cs.LG 2026-04 unverdicted novelty 7.0

    Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.

  6. Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces VietPET-RoI dataset with fine-grained RoI annotations for Vietnamese 3D PET/CT and HiRRA graph framework that improves report generation by modeling region dependencies, claiming large gains over prior models.

  7. Beyond Nodes vs. Edges: A Multi-View Fusion Framework for Provenance-Based Intrusion Detection

    cs.CR 2026-04 unverdicted novelty 7.0

    PROVFUSION fuses three complementary views of provenance data with lightweight schemes and voting to achieve higher detection accuracy and lower false positives than node- or edge-only baselines on nine benchmarks.

  8. CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

    cs.AR 2026-04 accept novelty 7.0

    CapBench is a new multi-PDK dataset of post-layout 3D windows with high-fidelity capacitance labels and multiple ML-ready representations, plus baseline results showing CNN accuracy versus GNN speed trade-offs.

  9. SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective

    cs.LG 2026-04 unverdicted novelty 7.0

    SCOT learns explicit soft region correspondences via entropic optimal transport and a shared prototype hub to improve multi-source cross-city transfer accuracy and robustness.

  10. SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective

    cs.LG 2026-04 unverdicted novelty 7.0

    SCOT uses Sinkhorn entropic optimal transport to learn explicit soft correspondences between unequal region sets for multi-source cross-city transfer, adding contrastive sharpening and cycle reconstruction for stabili...

  11. ID-PaS+ : Identity-Aware Predict-and-Search for General Mixed-Integer Linear Programs

    cs.AI 2025-12 unverdicted novelty 7.0

    ID-PaS+ introduces an identity-aware predict-and-search framework for general parametric MIPs that outperforms Gurobi and prior PAS methods on real-world large-scale instances.

  12. Random-Set Graph Neural Networks

    cs.AI 2026-05 unverdicted novelty 6.0

    RS-GNNs predict random sets over classes using belief functions to jointly produce class probabilities and epistemic uncertainty estimates for graph nodes.

  13. GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

    cs.CL 2026-05 unverdicted novelty 6.0

    GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...

  14. SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.

  15. Qubit-Scalable CVRP via Lagrangian Knapsack Decomposition and Noise-Aware Quantum Execution

    quant-ph 2026-04 unverdicted novelty 6.0

    A hybrid quantum framework decomposes CVRP into bounded-width knapsack subproblems, trains a reinforcement learning controller for Lagrangian multipliers, and uses a contextual bandit to adapt quantum hardware executi...

  16. A Structure-Preserving Graph Neural Solver for Parametric Hyperbolic Conservation Laws

    physics.comp-ph 2026-04 unverdicted novelty 6.0

    A structure-preserving GNN solver for parametric hyperbolic conservation laws achieves superior long-horizon stability and orders-of-magnitude speedups over high-resolution simulations on supersonic flow benchmarks.

  17. Learning Ad Hoc Network Dynamics via Graph-Structured World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    G-RSSM learns per-node dynamics in wireless ad hoc networks via graph attention and trains clustering policies through imagined rollouts, generalizing from N=50 training to larger networks.

  18. A Texture-Generalizable Deep Material Network via Orientation-Aware Interaction Learning for Polycrystal Modeling and Texture Evolution

    cs.CE 2025-12 unverdicted novelty 6.0

    TACS-GNN-ODMN infers micromechanical parameters from arbitrary polycrystal textures to build generalizable ODMN surrogates that predict nonlinear responses and texture evolution without retraining.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 17 Pith papers · 5 internal anchors

  1. [1]

    Learning to represent programs with graphs

    Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJOFETxR-

  2. [2]

    On the bottleneck of graph neural networks and its practical implications

    Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=i80OPhOCVH2

  3. [3]

    Diffusion-convolutional neural networks

    James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in neural information processing systems, pages 1993--2001, 2016

  4. [4]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473

  5. [5]

    Interaction networks for learning about objects, relations and physics

    Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4509--4517, 2016

  6. [6]

    Gnn-film: Graph neural networks with feature-wise linear modulation

    Marc Brockschmidt. Gnn-film: Graph neural networks with feature-wise linear modulation. Proceedings of the 36th International Conference on Machine Learning, ICML, 2020. URL https://github.com/microsoft/tf-gnn-samples

  7. [7]

    Geometric deep learning: going beyond euclidean data

    Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4): 18--42, 2017

  8. [8]

    Geometric deep learning: Grids, groups, graphs, geodesics, and gauges

    Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021

  9. [9]

    Relational Graph Attention Networks

    Dan Busbridge, Dane Sherburn, Pietro Cavallo, and Nils Y Hammerla. Relational graph attention networks. arXiv preprint arXiv:1904.05811, 2019

  10. [10]

    Approximation by superpositions of a sigmoidal function

    George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303--314, 1989

  11. [11]

    Programmable Agents

    Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable agents. arXiv preprint arXiv:1706.06383, 2017

  12. [12]

    One-shot imitation learning

    Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1087--1098, 2017

  13. [13]

    Convolutional networks on graphs for learning molecular fingerprints

    David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224--2232, 2015

  14. [14]

    A generalization of transformer networks to graphs

    Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020

  15. [15]

    Benchmarking graph neural networks

    Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020

  16. [16]

    Fast graph representation learning with PyTorch Geometric

    Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

  17. [17]

    On the approximate realization of continuous mappings by neural networks

    Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3): 183--192, 1989

  18. [18]

    Graph representation learning via hard and channel-wise attention networks

    Hongyang Gao and Shuiwang Ji. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 741--749, 2019

  19. [19]

    Neural message passing for quantum chemistry

    Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263--1272. JMLR. org, 2017

  20. [20]

    pytorch-gat

    Aleksa Gordić. pytorch-gat. https://github.com/gordicaleksa/pytorch-GAT, 2020

  21. [21]

    A new model for learning in graph domains

    Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729--734. IEEE, 2005

  22. [22]

    Inductive representation learning on large graphs

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024--1034, 2017

  23. [23]

    Approximation capabilities of multilayer feedforward networks

    Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2): 251--257, 1991

  24. [24]

    Multilayer feedforward networks are universal approximators

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359--366, 1989

  25. [25]

    Vain: attentional multi-agent predictive modeling

    Yedid Hoshen. Vain: attentional multi-agent predictive modeling. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2698--2708, 2017

  26. [26]

    Open graph benchmark: Datasets for machine learning on graphs

    Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020

  27. [27]

    Syntax-aware aspect level sentiment classification with graph attention networks

    Binxuan Huang and Kathleen M Carley. Syntax-aware aspect level sentiment classification with graph attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5472--5480, 2019

  28. [28]

    Combining label propagation and simple models out-performs graph neural networks

    Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. Combining label propagation and simple models out-performs graph neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=8E1-f3VhX1o

  29. [29]

    Transformers are graph neural networks

    Chaitanya Joshi. Transformers are graph neural networks. The Gradient, 2020

  30. [30]

    How to find your friendly neighborhood: Graph attention design with self-supervision

    Dongkwan Kim and Alice Oh. How to find your friendly neighborhood: Graph attention design with self-supervision. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Wi5KUNlqWty

  31. [31]

    Semi-supervised classification with graph convolutional networks

    Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017

  32. [32]

    Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks

    Vineet Kosaraju, Amir Sadeghian, Roberto Martín-Martín, Ian Reid, Hamid Rezatofighi, and Silvio Savarese. Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, v...

  33. [33]

    Attention Models in Graphs: A Survey

    John Boaz Lee, Ryan A Rossi, Sungchul Kim, Nesreen K Ahmed, and Eunyee Koh. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984, 2018

  34. [34]

    Multilayer feedforward networks with a nonpolynomial activation function can approximate any function

    Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6): 861--867, 1993

  35. [35]

    Deepgcns: Can gcns go as deep as cnns?

    Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9267--9276, 2019

  36. [36]

    Deeper insights into graph convolutional networks for semi-supervised learning

    Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  37. [37]

    Gated graph sequence neural networks

    Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016

  38. [38]

    Gated relational graph attention networks, 2021

    Denis Lukovnikov and Asja Fischer. Gated relational graph attention networks, 2021. URL https://openreview.net/forum?id=v-9E8egy_i

  39. [39]

    Effective approaches to attention-based neural machine translation

    Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412--1421, 2015. URL http://aclweb.org/anthology/D/D15/D15-1166.pdf

  40. [40]

    Entity-aware dependency-based deep graph attention network for comparative preference classification

    Nianzu Ma, Sahisnu Mazumder, Hao Wang, and Bing Liu. Entity-aware dependency-based deep graph attention network for comparative preference classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5782--5788, 2020

  41. [41]

    Geometric deep learning on graphs and manifolds using mixture model cnns

    Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5115--5124, 2017

  42. [42]

    Learning attention-based embeddings for relation prediction in knowledge graphs

    Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4710--4723, 2019

  43. [43]

    Minimum width for universal approximation

    Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=O-XJwyoIF-k

  44. [44]

    Approximation theory of the mlp model

    Allan Pinkus. Approximation theory of the mlp model. Acta Numerica 1999: Volume 8, 8: 143--195, 1999

  45. [45]

    Deepinf: Social influence prediction with deep learning

    Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. Deepinf: Social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18), 2018

  46. [46]

    Quantum chemistry structures and properties of 134 kilo molecules

    Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1: 140022, 2014

  47. [47]

    Self-supervised graph transformer on large-scale molecular data

    Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33, 2020a

  48. [48]

    Dropedge: Towards deep graph convolutional networks on node classification

    Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=Hkx1qkrKPr

  49. [49]

    A simple neural network module for relational reasoning

    Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4974--4983, 2017

  50. [50]

    The graph neural network model

    Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1): 61--80, 2008

  51. [51]

    Collective classification in network data

    Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3): 93--93, 2008

  52. [52]

    Masked label prediction: Unified message passing model for semi-supervised classification

    Yunsheng Shi, Zhengjie Huang, Shikun Feng, and Yu Sun. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020

  53. [53]

    Attention-based Graph Neural Network for Semi-supervised Learning

    Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018

  54. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000--6010, 2017

  55. [55]

    Graph attention networks

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018

  56. [56]

    Pointer graph networks

    Petar Veličković, Lars Buesing, Matthew Overlan, Razvan Pascanu, Oriol Vinyals, and Charles Blundell. Pointer graph networks. Advances in Neural Information Processing Systems, 33, 2020

  57. [57]

    Graph attention networks

    Petar Veličković et al. Graph attention networks. 2018

  58. [58]

    Improving graph attention networks with large margin-based constraints

    Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention networks with large margin-based constraints. arXiv preprint arXiv:1910.11945, 2019a

  59. [59]

    Deep graph library: A graph-centric, highly-performant package for graph neural networks

    Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019b

  60. [60]

    Heterogeneous graph attention network

    Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022--2032, 2019c

  61. [61]

    Bag of tricks of semi-supervised classification with graph neural networks

    Yangkun Wang. Bag of tricks of semi-supervised classification with graph neural networks. arXiv preprint arXiv:2103.13355, 2021

  62. [62]

    On the practical computational power of finite precision rnns for language recognition

    Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision rnns for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740--745, 2018

  63. [63]

    Simplifying graph convolutional networks

    Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861--6871. PMLR, 2019

  64. [64]

    A comprehensive survey on graph neural networks

    Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020

  65. [65]

    How powerful are graph neural networks?

    Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km

  66. [66]

    Distilling knowledge from graph convolutional networks

    Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang. Distilling knowledge from graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  67. [67]

    Graphsaint: Graph sampling based inductive learning method

    Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019

  68. [68]

    Gaan: Gated attention networks for learning on large and spatiotemporal graphs

    Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 339--349, 2018

  69. [69]

    Adaptive structural fingerprints for graph attention networks

    Kai Zhang, Yaokang Zhu, Jun Wang, and Jie Zhang. Adaptive structural fingerprints for graph attention networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxWx0NYPr

  70. [70]

    Pairnorm: Tackling oversmoothing in gnns

    Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB