Graph Representation Learning via Hard and Channel-Wise Attention Networks

Hongyang Gao; Shuiwang Ji

arxiv: 1907.04652 · v1 · pith:OW3G3H2Enew · submitted 2019-07-05 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 02:05 UTC · model grok-4.3

keywords: graph attention, hard attention, channel-wise attention, graph representation learning, node embedding, graph embedding, computational efficiency, attention operators

The pith

Hard graph attention on important nodes and channel-wise operations improve embedding performance while reducing compute demands compared to standard soft attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two new attention operators for graphs to overcome the high resource use and performance limits of existing soft graph attention methods. The hard version selects only key neighboring nodes for aggregation, which raises accuracy and lowers cost. The channel-wise version moves attention to operate on feature channels rather than the full adjacency structure, removing a major computational bottleneck. If these changes hold, graph models can scale to larger networks while maintaining or improving results on node classification and graph-level tasks. Readers would care because many real-world graphs, such as social or biological networks, exceed the size current attention methods can handle efficiently.

Core claim

We introduce the hard graph attention operator (hGAO), which applies hard attention to attend only to important nodes, and the channel-wise graph attention operator (cGAO), which performs attention along channels and avoids dependence on the adjacency matrix. Deep models built with these operators achieve consistently better performance than prior methods, with hGAO showing significant gains over standard graph attention on both node and graph embedding tasks, while cGAO delivers dramatic reductions in computational resources.
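The hard-attention idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection vector `p`, the dot-product scoring, and the function name `hard_attention` are assumptions made for the example, since the text above only states that hGAO attends to important nodes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hard_attention(features, neighbors, p, k):
    """Aggregate each node from at most k of its neighbors.

    features:  list of N feature vectors (lists of floats)
    neighbors: neighbors[i] lists the node indices adjacent to node i
    p:         hypothetical projection vector used to score importance
    k:         maximum number of neighbors a node attends to
    """
    norm_p = math.sqrt(dot(p, p))
    # Assumed importance score: size of each node's projection onto p.
    importance = [abs(dot(x, p)) / norm_p for x in features]
    out = []
    for i, x_i in enumerate(features):
        # Hard step: keep only the k highest-scoring neighbors.
        kept = sorted(neighbors[i], key=lambda j: importance[j], reverse=True)[:k]
        if not kept:
            out.append(list(x_i))  # isolated node: pass features through
            continue
        # Soft step: ordinary dot-product attention over the survivors.
        weights = softmax([dot(x_i, features[j]) for j in kept])
        out.append([sum(w * features[j][c] for w, j in zip(weights, kept))
                    for c in range(len(x_i))])
    return out

features = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.5]]
neighbors = [[1, 2, 3], [0], [0], [0]]
print(hard_attention(features, neighbors, p=[1.0, 1.0], k=1))
```

With `k=1`, node 0 attends only to node 2, its highest-scoring neighbor, so the per-node work depends on `k` rather than on the node's degree.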

What carries the argument

Hard graph attention operator (hGAO) and channel-wise graph attention operator (cGAO), which replace soft attention with selective node focus and channel-based aggregation to process neighbor information.
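Channel-wise attention can be illustrated the same way. The sketch below assumes that each channel (one feature column across all nodes) acts as query, key, and value, with dot-product scores; the function name `channel_attention` and these choices are illustrative assumptions rather than the paper's stated formulation. The point to notice is that no adjacency structure appears anywhere in the computation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def channel_attention(features):
    """Mix feature channels with attention; no adjacency matrix is used.

    features: list of N node vectors, each with d channels.
    Returns a list of N vectors with d channels.
    """
    n, d = len(features), len(features[0])
    # View the data channel-major: channels[c] is channel c over all nodes.
    channels = [[features[i][c] for i in range(n)] for c in range(d)]
    mixed_channels = []
    for query in channels:
        # d scores per query channel, so the score table is d x d.
        weights = softmax([dot(query, key) for key in channels])
        mixed_channels.append([sum(w * channels[c][i] for c, w in enumerate(weights))
                               for i in range(n)])
    # Transpose back to node-major layout.
    return [[mixed_channels[c][i] for c in range(d)] for i in range(n)]

print(channel_attention([[1.0, 1.0], [2.0, 2.0]]))
```

The score table has `d * d` entries regardless of how many edges the graph has, which is the property the efficiency claim relies on.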

If this is right

hGAO yields significantly higher performance than standard GAO on node embedding and graph embedding tasks.
cGAO reduces computational resource requirements enough to apply attention-based models to large graphs.
Deep models incorporating either operator show consistent performance gains across multiple graph tasks.
Efficiency improvements from cGAO remove the adjacency-matrix dependency that limits scale in prior methods.
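A rough way to see why these efficiency claims are plausible is to count attention coefficients under each scheme. The helper below is a back-of-the-envelope sketch, not a measurement from the paper: it assumes a dense worst case for soft attention, ignores the cost of the top-k ranking, and uses the hypothetical name `attention_cost`. The value k = 8 echoes the Figure 3 caption; the node and channel counts are illustrative only.

```python
def attention_cost(num_nodes, num_channels, k):
    """Count attention coefficients and the multiply-adds needed to score them.

    Assumptions (not from the paper): dot-product scores over d channels,
    dense adjacency for soft attention, and top-k sorting cost ignored.
    """
    n, d = num_nodes, num_channels
    return {
        # Every node scores every other node.
        "soft_dense": {"coefficients": n * n, "multiply_adds": n * n * d},
        # Each node scores at most k neighbors, plus one projection per node.
        "hard_top_k": {"coefficients": n * k, "multiply_adds": n * k * d + n * d},
        # Each channel scores every channel; each score is a dot product over n nodes.
        "channel_wise": {"coefficients": d * d, "multiply_adds": d * d * n},
    }

for name, cost in attention_cost(num_nodes=10_000, num_channels=64, k=8).items():
    print(name, cost)
```

Under these assumptions the coefficient table shrinks from `n * n` to `n * k` for hard attention and to `d * d` for channel-wise attention, while the channel-wise multiply-add count still grows linearly with the number of nodes.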

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If hard selection preserves key paths, the operators might transfer to link-prediction settings where missing edges matter.
Channel-wise attention could be tested on graphs with high-dimensional features to check whether it scales better than node-wise methods.
Combining both operators in one architecture might compound the efficiency and accuracy gains, though this remains untested.

Load-bearing premise

Selecting only important nodes and shifting attention to channels will not discard critical structural information or add biases that reduce accuracy on unseen graphs.
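One simple way to probe this premise is to ask how much soft-attention weight the k retained neighbors would have carried. The diagnostic below is an editorial sketch, not something the paper reports; the function name `retained_attention_mass` and the use of plain softmax scores are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retained_attention_mass(scores, k):
    """Fraction of soft-attention weight held by the k highest-scoring neighbors."""
    # Softmax is monotonic, so the top-k weights belong to the top-k scores.
    weights = sorted(softmax(scores), reverse=True)
    return sum(weights[:k])

# Four equally scored neighbors: keeping two of them retains half the weight.
print(retained_attention_mass([0.0, 0.0, 0.0, 0.0], k=2))
```

A value near 1 suggests the discarded neighbors contributed little under soft attention; a value well below 1 marks neighborhoods where hard selection drops a substantial share of the signal.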

What would settle it

A direct comparison experiment on a large held-out graph dataset where models using hGAO or cGAO produce lower accuracy or higher error than standard GAO.

Figures

Figures reproduced from arXiv: 1907.04652 by Hongyang Gao, Shuiwang Ji.

**Figure 2.** An illustration of our proposed GANet described in Section 3.3. In this example, the input graph contains 6 nodes, …

**Figure 3.** Results of employing different k values in hGAOs using hGANet on PROTEINS, COLLAB, and MUTAG datasets under inductive learning settings. We use the same experimental setups described in Section 4.2. We report the graph classification accuracies in this figure. We can see that the best performance is achieved when k = 8.

Original abstract

Attention operators have been widely applied in various fields, including computer vision, natural language processing, and network embedding learning. Attention operators on graph data enable learnable weights when aggregating information from neighboring nodes. However, graph attention operators (GAOs) consume excessive computational resources, preventing their application to large graphs. In addition, GAOs belong to the family of soft attention, instead of hard attention, which has been shown to yield better performance. In this work, we propose a novel hard graph attention operator (hGAO) and a channel-wise graph attention operator (cGAO). hGAO uses the hard attention mechanism by attending to only important nodes. Compared to GAO, hGAO improves performance and saves computational cost by only attending to important nodes. To further reduce the requirements on computational resources, we propose the cGAO that performs attention operations along channels. cGAO avoids the dependency on the adjacency matrix, leading to dramatic reductions in computational resource requirements. Experimental results demonstrate that our proposed deep models with the new operators achieve consistently better performance. Comparison results also indicate that hGAO achieves significantly better performance than GAO on both node and graph embedding tasks. Efficiency comparison shows that our cGAO leads to dramatic savings in computational resources, making it applicable to large graphs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines hGAO for hard attention on selected nodes and cGAO for channel-wise attention to cut adjacency costs, but the abstract supplies no numbers or ablations to check the performance and efficiency claims.

The one or two things to know are that the authors introduce hGAO, which applies hard attention by attending only to important nodes, and cGAO, which performs attention along channels to avoid adjacency matrix costs. These are meant to improve on standard graph attention operators both in performance and efficiency.

The new part is the concrete definitions of these operators tailored to graphs. Hard attention is borrowed from other areas but here it's used to select nodes, and channel-wise is a fresh way to decouple from the graph structure. The paper does well at identifying the compute problem with existing attention on graphs and offering specific mechanisms to tackle it.

Where it is soft is the support for the claims. The abstract says the models achieve better performance and cGAO leads to dramatic savings, but no actual results, tables, or details are given. This makes it tough to see if the gains are real or if hard attention introduces biases. The assumption that focusing on important nodes won't lose critical info needs evidence from experiments.

This paper is for graph ML researchers interested in scaling attention to bigger graphs. Readers working on practical applications might find the operators worth trying if the full paper shows solid comparisons. It deserves a serious referee to examine the methods and results in detail. I recommend sending it for peer review so the experimental claims can be properly assessed.

Referee Report

2 major / 1 minor

Summary. The paper proposes two new attention operators for graphs: the hard graph attention operator (hGAO), which applies hard attention to attend only to important nodes, and the channel-wise graph attention operator (cGAO), which performs attention along channels to avoid adjacency-matrix dependency. It claims these yield better performance than standard graph attention operators (GAO) on node and graph embedding tasks while dramatically reducing computational requirements, enabling use on large graphs.

Significance. If the experimental claims hold with proper controls, the operators could address a key scalability bottleneck in attention-based graph models, offering both accuracy gains and efficiency improvements that would be valuable for practical deployment on large-scale graph data.

major comments (2)

[Abstract] The central performance and efficiency claims ('consistently better performance', 'significantly better performance', 'dramatic savings') are asserted without any quantitative metrics, datasets, error bars, or ablation results, which is load-bearing for the paper's contribution and prevents verification of whether the operators actually deliver the stated benefits.
[Abstract] The weakest assumption, that hard attention to 'important nodes' and channel-wise operations will not discard critical structural information or introduce biases on unseen graphs, is not tested or bounded anywhere in the provided text, undermining the generalization claim.

minor comments (1)

[Abstract] Define all acronyms (GAO, hGAO, cGAO) on first use and ensure consistent capitalization throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the points below and will revise the manuscript to strengthen the presentation of results and claims.

Referee: [Abstract] The central performance and efficiency claims ('consistently better performance', 'significantly better performance', 'dramatic savings') are asserted without any quantitative metrics, datasets, error bars, or ablation results, which is load-bearing for the paper's contribution and prevents verification of whether the operators actually deliver the stated benefits.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. The full manuscript reports detailed experimental results on node and graph classification tasks across multiple datasets, with performance comparisons and efficiency measurements. In revision, we will update the abstract to reference key quantitative outcomes from the experiments section, such as accuracy gains and resource reductions, while maintaining brevity.

Revision: yes

Referee: [Abstract] The weakest assumption, that hard attention to 'important nodes' and channel-wise operations will not discard critical structural information or introduce biases on unseen graphs, is not tested or bounded anywhere in the provided text, undermining the generalization claim.

Authors: The manuscript evaluates the operators empirically on diverse graph datasets for node and graph embedding tasks, showing consistent improvements that provide indirect support for retaining useful information. We acknowledge that no theoretical bounds on information loss or explicit bias tests on out-of-distribution graphs are included. We will add a discussion of these limitations in the revised manuscript.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces novel hard graph attention operator (hGAO) and channel-wise graph attention operator (cGAO) as new architectural components, then reports experimental performance gains on node and graph embedding tasks. No derivation chain, first-principles predictions, or fitted parameters are described that reduce to the inputs by construction. The central claims rest on the explicit definitions of the proposed operators and on empirical comparisons, which remain independent of any self-citation load-bearing steps or renaming of known results. The provided abstract and context contain no equations or self-referential fitting procedures that would trigger the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on two new operators whose benefits are asserted via experiments; the ledger records the domain assumption that hard attention is superior and the two invented operators themselves.

axioms (1)

domain assumption: Hard attention yields better performance than soft attention on graph embedding tasks
Abstract states this as established motivation for proposing hGAO.

invented entities (2)

hGAO (no independent evidence)
purpose: Hard-attention graph operator that attends only to important nodes
New operator introduced by the paper to address computational cost and performance of standard GAOs.
cGAO (no independent evidence)
purpose: Channel-wise attention operator that avoids adjacency-matrix dependency
New operator introduced by the paper to achieve dramatic computational savings.

Reference graph

Works this paper leans on

39 extracted references · 2 canonical work pages · 1 internal anchor

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al
[2] Tensorflow: a system for large-scale machine learning. In OSDI, Vol. 16. 265–283
[3] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. 2005. Protein function prediction via graph kernels. Bioinformatics 21, suppl_1 (2005), i47–i56
[4] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[7] Paul D Dobson and Andrew J Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330, 4 (2003), 771–783
[8] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1416–1424
[9] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256
[10] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra
[11] DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning. 1462–1471
[12] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780
[14] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems. 2017–2025
[15] Felix Juefei-Xu, Eshan Verma, Parag Goel, Anisha Cherodian, and Marios Savvides. 2016. Deepgender: Occlusion and low resolution robust facial gender classification via progressively trained convolutional neural networks with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 68–77
[16] Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. The International Conference on Learning Representations (2015)
[17] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (2017)
[18] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Neural networks: Tricks of the trade. Springer, 9–48
[19] Guanbin Li, Xiang He, Wei Zhang, Huiyou Chang, Le Dong, and Liang Lin. 2018. Non-locally enhanced encoder-decoder network for single image de-raining. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 1056–1064
[20] Jeffrey Ling and Alexander Rush. 2017. Coarse-to-fine attention models for document summarization. In Proceedings of the Workshop on New Frontiers in Summarization. 33–42
[21] Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1412–1421
[22] Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. 2018. Learning visual question answering by bootstrapping hard attention. In European Conference on Computer Vision. Springer, 3–20
[23] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023
[24] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710
[25] Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Attention-aware deep reinforcement learning for video face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3931–3940
[26] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI Magazine 29, 3 (2008), 93
[27] Shiv Shankar, Siddhant Garg, and Sunita Sarawagi. 2018. Surprisingly easy hard-attention for sequence to sequence learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 640–645
[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000–6010
[30] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph attention networks. In International Conference on Learning Representations
[31] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. International Conference on Learning Representations (2016)
[32] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1. 4
[33] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 2048–2057
[34] Pinar Yanardag and SVN Vishwanathan. 2015. A structural smoothing framework for robust graph comparison. In Advances in Neural Information Processing Systems. 2134–2142
[35] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning. 40–48
[36] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems. 4800–4810
[37] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2019. ST-UNet: A spatio-temporal U-network for graph-structured time series modeling. arXiv preprint arXiv:1903.05631 (2019)
[38] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence
[39] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. 2018. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision. 267–283
