{"work":{"id":"7dd5bb04-b448-4f32-9719-5fc799641fd9","openalex_id":null,"doi":null,"arxiv_id":"1710.10903","raw_key":null,"title":"Graph Attention Networks","authors":null,"authors_text":"Petar Veli\\v{c}kovi\\'c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\\`o, Yoshua Bengio","year":2017,"venue":"stat.ML","abstract":"We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).","external_url":"https://arxiv.org/abs/1710.10903","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T18:56:08.946392+00:00","pith_arxiv_id":"1710.10903","created_at":"2026-05-08T23:14:26.477817+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":false,"display_title":"Graph Attention Networks","render_title":"Graph Attention Networks"},"hub":{"state":{"work_id":"7dd5bb04-b448-4f32-9719-5fc799641fd9","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":141,"external_cited_by_count":null,"distinct_field_count":24,"first_pith_cited_at":"2018-04-11T14:13:03+00:00","last_pith_cited_at":"2026-05-22T15:18:53+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-08T22:53:55.233613+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":15},{"context_role":"method","n":3},{"context_role":"baseline","n":2}],"polarity_counts":[{"context_polarity":"background","n":14},{"context_polarity":"use_method","n":3},{"context_polarity":"baseline","n":2},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Graph Attention Networks","claims":[{"claim_text":"We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"work (GIN-0) [18], which aggregates neighborhood informa- tion with multi-layer perceptrons; (2) Deep Graph Convolu- tional Neural (DGCNN) [30], which employs sort pooling for hierarchical graph representation; (3) PATCHY-SAN Con- volutional Neural Network (PSCN) [31], a local receptive- field method inspired by convolutional neural networks; (4) Graph Attention Network (GAT) [32], which uses attention mechanisms for adaptive neighborhood aggregation; (5) Graph Convolutional Network (GCN) [16], ","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"which we denote as GPS+PCL and GPS+GCCM. Please note that GCCM is not tied to any specific backbone and can be applied to other graph neural network architectures. Due to space limitations, we further adopt other network backbones to GCCM in Appendix E.1. For reference, we also include the results of deterministic prediction models (GCN [ 21], GIN [22], GAT [23], GatedGCN [24], PNA [25], DGN [26], GIN-AK+ [27]. SAN [1], EGT [28], GRIT [29], and GPS [3]). 5.2 Main Results We evaluate GCCM on the ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"TopoU-Net keeps the encoder-decoder and skip-connection principle, but replaces Euclidean scale by rank structure in a combinatorial complex. Graph neural networks and graph hierarchy.Graph neural networks learn representations by aggregating information over pairwise graph neighborhoods. Representative architectures in- clude GCN [22], GraphSAGE [20], GAT [29], and GIN [31]. Extensions such as MixHop [1] and H2GCN [36] incorporate multi-hop or heterophily-aware aggregation. Hierarchical graph m","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the Subject node has a variable length, determined by the number of distinct executables in the training data. During inference, an unknown executable could appear due to new normal behavior or malicious actions. In this case, we model the corresponding feature vector as a vector of zeros. The location information, including path and IP address details, is encoded using a transformer-based autoencoder [37], implemented in PyTorch. To do this, a location is split into character-based segments, an","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"What Are Adversaries Doing? Automating Tactics, Techniques, and Procedures Extraction: A Systematic Review 9 P45 Identifying Tactics of Advanced Persistent Threats with Limited Attack Traces [5] P46 Improving Automated Labeling for ATT&CK Tactics in Malware Threat Reports [32] P47 KnowCTI: Knowledge-based cyber threat intelligence entity and relation extraction [154] P48 Labeling NIDS Rules with MITRE ATT&CK Techniques Using ChatGPT [28] P49 Leveraging BERT's Power to Classify TTP from Unstructu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"target should be recoverable from its matched target counterpart, penalizing correspondences that are geomet- rically plausible but semantically incoherent. Cross-attention.GivenZ s∈Rns×dandZ t∈Rnt×d, defineQ s =Z sW⊤ q ,K s =Z sW⊤ k ,Q t =Z tW⊤ q ,K t = ZtW⊤ k , with learnableW q, Wk ∈Rd×d. The cross- attention maps are As→t= softmax (QsK⊤ t√ d ) ∈Rns×nt,(13) At→s= softmax (QtK⊤ s√ d ) ∈Rnt×ns.(14) One-sided cycle + entropy penalty.We enforce approximate recovery of source identities: Lcyc =∥As","claim_type":"method","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Graph Attention Networks because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"},{"n":3,"context_role":"method"},{"n":2,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T02:42:02.886672+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"23dc317d-a3c9-4d23-abdf-7eb69e712578","orcid":null,"display_name":"Petar Veli\\v{c}kovi\\'c"},{"id":"4d4195ca-df0f-4f41-aca3-8882fd3133f4","orcid":null,"display_name":"Guillem Cucurull"},{"id":"ac5725a5-a87f-420a-8c17-ccecf1cc52d7","orcid":null,"display_name":"Arantxa Casanova"},{"id":"1082d606-8a76-4b8d-8467-bc720709c855","orcid":null,"display_name":"Adriana Romero"},{"id":"edddeda6-b83f-427a-9cf6-c023f04a774f","orcid":null,"display_name":"Pietro Li\\`o"},{"id":"5a7a1034-5ab5-46b2-9f24-66204fbd9091","orcid":null,"display_name":"Yoshua Bengio"}]},"error":null,"updated_at":"2026-05-20T02:42:03.744458+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T09:48:28.099914+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Semi-Supervised Classification with Graph Convolutional Networks","work_id":"21fff118-807d-49cd-8229-f7087ba57b5d","shared_citers":34},{"title":"How Powerful are Graph Neural Networks?","work_id":"cb4e089d-c63e-4231-9e6b-b2ccb82c5329","shared_citers":13},{"title":"How attentive are graph attention networks?arXiv preprint arXiv:2105.14491","work_id":"b81a77dc-03fc-40f2-a635-ba6c4a250268","shared_citers":8},{"title":"Variational graph auto-encoders","work_id":"2e8716eb-69bd-4617-8256-8543ba2c215c","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"arXiv preprint arXiv:2009.08366 , year=","work_id":"05f7fd92-3de8-42af-9707-70010caa34c6","shared_citers":4},{"title":"Efficient Estimation of Word Representations in Vector Space","work_id":"59edaa01-a696-45b3-9a08-5eae777a799e","shared_citers":4},{"title":"Fast Graph Representation Learning with PyTorch Geometric","work_id":"7030f6d8-1c95-4c18-a988-bfcf31d4cb4c","shared_citers":4},{"title":"Relational inductive biases, deep learning, and graph networks","work_id":"858410c0-7a66-4b27-b4e5-49aee9725be0","shared_citers":4},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":3},{"title":"Do transformers really perform bad for graph representation?","work_id":"ae97a162-b9ae-43a5-af71-9269ebbf4f25","shared_citers":3},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":3},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":3},{"title":"Qwen2.5-Coder Technical Report","work_id":"09ba463d-6377-4017-9801-444ffb94b056","shared_citers":3},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":3},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":3},{"title":"Sdplib 1.2, a library of semidefinite program- ming test problems.Optimization Methods and Software, 11(1-4):683–690","work_id":"88e9cbb3-5e47-44fc-b1f0-f8bf82b9e07c","shared_citers":3},{"title":"A comprehensive survey on graph neural networks","work_id":"d6e48d96-8e44-4df5-bf45-c5c2b3f8a059","shared_citers":2},{"title":"A generalization of transformer networks to graphs","work_id":"0438ea98-886e-4e29-ac3c-fe11ffe8b0f3","shared_citers":2},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":2},{"title":"arXiv:2203.03850 [cs.CL]","work_id":"0135787d-f4bc-4966-8b6d-8c2e1f6bd31f","shared_citers":2},{"title":"ArXiv abs/2007.08663(2020)","work_id":"04426675-b2e3-4517-be09-08f2df683d14","shared_citers":2}],"time_series":[{"n":1,"year":2018},{"n":67,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T09:48:22.292217+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T09:48:35.047861+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Graph Attention Networks","claims":[{"claim_text":"We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"work (GIN-0) [18], which aggregates neighborhood informa- tion with multi-layer perceptrons; (2) Deep Graph Convolu- tional Neural (DGCNN) [30], which employs sort pooling for hierarchical graph representation; (3) PATCHY-SAN Con- volutional Neural Network (PSCN) [31], a local receptive- field method inspired by convolutional neural networks; (4) Graph Attention Network (GAT) [32], which uses attention mechanisms for adaptive neighborhood aggregation; (5) Graph Convolutional Network (GCN) [16], ","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"which we denote as GPS+PCL and GPS+GCCM. Please note that GCCM is not tied to any specific backbone and can be applied to other graph neural network architectures. Due to space limitations, we further adopt other network backbones to GCCM in Appendix E.1. For reference, we also include the results of deterministic prediction models (GCN [ 21], GIN [22], GAT [23], GatedGCN [24], PNA [25], DGN [26], GIN-AK+ [27]. SAN [1], EGT [28], GRIT [29], and GPS [3]). 5.2 Main Results We evaluate GCCM on the ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"TopoU-Net keeps the encoder-decoder and skip-connection principle, but replaces Euclidean scale by rank structure in a combinatorial complex. Graph neural networks and graph hierarchy.Graph neural networks learn representations by aggregating information over pairwise graph neighborhoods. Representative architectures in- clude GCN [22], GraphSAGE [20], GAT [29], and GIN [31]. Extensions such as MixHop [1] and H2GCN [36] incorporate multi-hop or heterophily-aware aggregation. Hierarchical graph m","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the Subject node has a variable length, determined by the number of distinct executables in the training data. During inference, an unknown executable could appear due to new normal behavior or malicious actions. In this case, we model the corresponding feature vector as a vector of zeros. The location information, including path and IP address details, is encoded using a transformer-based autoencoder [37], implemented in PyTorch. To do this, a location is split into character-based segments, an","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"What Are Adversaries Doing? Automating Tactics, Techniques, and Procedures Extraction: A Systematic Review 9 P45 Identifying Tactics of Advanced Persistent Threats with Limited Attack Traces [5] P46 Improving Automated Labeling for ATT&CK Tactics in Malware Threat Reports [32] P47 KnowCTI: Knowledge-based cyber threat intelligence entity and relation extraction [154] P48 Labeling NIDS Rules with MITRE ATT&CK Techniques Using ChatGPT [28] P49 Leveraging BERT's Power to Classify TTP from Unstructu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"target should be recoverable from its matched target counterpart, penalizing correspondences that are geomet- rically plausible but semantically incoherent. Cross-attention.GivenZ s∈Rns×dandZ t∈Rnt×d, defineQ s =Z sW⊤ q ,K s =Z sW⊤ k ,Q t =Z tW⊤ q ,K t = ZtW⊤ k , with learnableW q, Wk ∈Rd×d. The cross- attention maps are As→t= softmax (QsK⊤ t√ d ) ∈Rns×nt,(13) At→s= softmax (QtK⊤ s√ d ) ∈Rnt×ns.(14) One-sided cycle + entropy penalty.We enforce approximate recovery of source identities: Lcyc =∥As","claim_type":"method","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Graph Attention Networks because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"},{"n":3,"context_role":"method"},{"n":2,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T02:42:02.891745+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Graph Attention Networks","claims":[{"claim_text":"We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Graph Attention Networks because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T09:48:32.703595+00:00"}},"summary":{"title":"Graph Attention Networks","claims":[{"claim_text":"We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Graph Attention Networks because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Semi-Supervised Classification with Graph Convolutional Networks","work_id":"21fff118-807d-49cd-8229-f7087ba57b5d","shared_citers":34},{"title":"How Powerful are Graph Neural Networks?","work_id":"cb4e089d-c63e-4231-9e6b-b2ccb82c5329","shared_citers":13},{"title":"How attentive are graph attention networks?arXiv preprint arXiv:2105.14491","work_id":"b81a77dc-03fc-40f2-a635-ba6c4a250268","shared_citers":8},{"title":"Variational graph auto-encoders","work_id":"2e8716eb-69bd-4617-8256-8543ba2c215c","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"arXiv preprint arXiv:2009.08366 , year=","work_id":"05f7fd92-3de8-42af-9707-70010caa34c6","shared_citers":4},{"title":"Efficient Estimation of Word Representations in Vector Space","work_id":"59edaa01-a696-45b3-9a08-5eae777a799e","shared_citers":4},{"title":"Fast Graph Representation Learning with PyTorch Geometric","work_id":"7030f6d8-1c95-4c18-a988-bfcf31d4cb4c","shared_citers":4},{"title":"Relational inductive biases, deep learning, and graph networks","work_id":"858410c0-7a66-4b27-b4e5-49aee9725be0","shared_citers":4},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":3},{"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","shared_citers":3},{"title":"Do transformers really perform bad for graph representation?","work_id":"ae97a162-b9ae-43a5-af71-9269ebbf4f25","shared_citers":3},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":3},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":3},{"title":"Qwen2.5-Coder Technical Report","work_id":"09ba463d-6377-4017-9801-444ffb94b056","shared_citers":3},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":3},{"title":"Representation Learning with Contrastive Predictive Coding","work_id":"7b08a1d4-d565-424e-9c86-6ef244b7b90a","shared_citers":3},{"title":"Sdplib 1.2, a library of semidefinite program- ming test problems.Optimization Methods and Software, 11(1-4):683–690","work_id":"88e9cbb3-5e47-44fc-b1f0-f8bf82b9e07c","shared_citers":3},{"title":"A comprehensive survey on graph neural networks","work_id":"d6e48d96-8e44-4df5-bf45-c5c2b3f8a059","shared_citers":2},{"title":"A generalization of transformer networks to graphs","work_id":"0438ea98-886e-4e29-ac3c-fe11ffe8b0f3","shared_citers":2},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":2},{"title":"arXiv:2203.03850 [cs.CL]","work_id":"0135787d-f4bc-4966-8b6d-8c2e1f6bd31f","shared_citers":2},{"title":"ArXiv abs/2007.08663(2020)","work_id":"04426675-b2e3-4517-be09-08f2df683d14","shared_citers":2}],"time_series":[{"n":1,"year":2018},{"n":67,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"1082d606-8a76-4b8d-8467-bc720709c855","orcid":null,"display_name":"Adriana Romero","source":"manual","import_confidence":0.72},{"id":"ac5725a5-a87f-420a-8c17-ccecf1cc52d7","orcid":null,"display_name":"Arantxa Casanova","source":"manual","import_confidence":0.72},{"id":"4d4195ca-df0f-4f41-aca3-8882fd3133f4","orcid":null,"display_name":"Guillem Cucurull","source":"manual","import_confidence":0.72},{"id":"23dc317d-a3c9-4d23-abdf-7eb69e712578","orcid":null,"display_name":"Petar Veli\\v{c}kovi\\'c","source":"manual","import_confidence":0.72},{"id":"edddeda6-b83f-427a-9cf6-c023f04a774f","orcid":null,"display_name":"Pietro Li\\`o","source":"manual","import_confidence":0.72},{"id":"5a7a1034-5ab5-46b2-9f24-66204fbd9091","orcid":null,"display_name":"Yoshua Bengio","source":"manual","import_confidence":0.72}]}}