From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context
Pith reviewed 2026-05-18 23:26 UTC · model grok-4.3
The pith
GSPELL projects GNN node embeddings into LLM space to generate faithful natural-language explanations for graph predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSPELL is a lightweight, post-hoc framework that uses large language models to generate faithful and interpretable explanations for GNN predictions on text-attributed graphs. It projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This setup allows the LLM to reason about the GNN's internal representations and output both natural-language explanations and concise explanation subgraphs.
What carries the argument
Hybrid prompts that interleave projected GNN node embeddings as soft prompts with textual graph structure inputs.
If this is right
- Explanations achieve a favorable balance between matching the GNN output and using only a small portion of the graph.
- Human evaluators rate the resulting rationales higher on insightfulness than existing approaches.
- The framework applies across real-world datasets such as citation networks and social platforms.
- Explanations come with both readable text and a compact subgraph of the most relevant connections.
Where Pith is reading between the lines
- The approach could support applications where users need to understand and act on GNN outputs in domains like fraud detection or scientific literature analysis.
- Similar embedding alignment steps might help explain other neural models that process structured data.
- Further tests on graphs with longer text attributes would show whether the method remains effective at larger scales.
Load-bearing premise
That mapping GNN node embeddings into the LLM embedding space and interleaving them with textual graph inputs enables the LLM to produce explanations that are faithful to the GNN's internal reasoning process.
What would settle it
Running the system on a synthetic text-attributed graph where the GNN's exact prediction logic is known in advance and verifying whether the LLM-generated explanations match that logic.
Figures
read the original abstract
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over structured data, including text-attributed graphs (TAGs), which are common in domains such as citation networks, social platforms, and knowledge graphs. GNNs are not inherently interpretable and thus, many explanation methods have been proposed. However, existing explanation methods often struggle to generate interpretable, fine-grained rationales, especially when node attributes include rich natural language. In this work, we introduce GSPELL, a lightweight, post-hoc framework that uses large language models (LLMs) to generate faithful and interpretable explanations for GNN predictions. GSPELL projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This enables the LLM to reason about GNN internal representations and to produce natural-language explanations, along with concise explanation subgraphs. Our experiments across real-world TAG datasets demonstrate that GSPELL achieves a favorable trade-off between fidelity and sparsity, while improving human-centric metrics such as insightfulness. GSPELL sets a new direction for LLM-based explainability in graph learning by aligning GNN internals with human reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GSPELL, a lightweight post-hoc framework for explaining GNN predictions on text-attributed graphs (TAGs). It projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave these soft prompts with textual graph inputs, enabling the LLM to generate natural-language explanations along with concise explanation subgraphs. Experiments across real-world TAG datasets are claimed to demonstrate a favorable fidelity-sparsity trade-off and improvements in human-centric metrics such as insightfulness.
Significance. If the central claim holds—that the projected embeddings drive explanations faithful to the GNN rather than LLM priors or textual cues alone—this would represent a meaningful advance in bridging GNN internals with human-interpretable narratives for graphs with rich attributes. The lightweight post-hoc nature and focus on both fidelity and sparsity could influence future work on LLM-assisted explainability in graph learning, provided rigorous isolation of the embedding contribution is demonstrated.
major comments (3)
- [Method (hybrid prompt and projection details)] The hybrid prompt construction (described in the method) interleaves projected GNN embeddings with textual inputs but provides no mechanism such as attention masking, embedding-only ablations, or contrastive training to isolate the LLM's conditioning on the GNN vectors from its pre-trained knowledge of node text or graph structure. In TAGs this is load-bearing for the fidelity claim, as LLMs can produce plausible rationales from text alone.
- [Experiments and evaluation] The experimental claims of favorable fidelity-sparsity trade-off and improved insightfulness lack reported concrete metrics, specific baselines, statistical significance tests, or dataset details in the provided summary, and no ablation isolating the embedding projection's contribution is described. This weakens verification of the data-to-claim link for the central assertion.
- [Evaluation metrics] The human-centric metric 'insightfulness' is invoked as an improvement but without a precise definition, human study protocol, or inter-rater reliability measures, making it difficult to assess whether reported gains are attributable to GNN alignment or general LLM capabilities.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., fidelity or sparsity score) and naming the specific TAG datasets used.
- [Method] Notation for the embedding projection step should be formalized with an equation to clarify the mapping from GNN space to LLM space.
Simulated Author's Rebuttal
We are grateful to the referee for the thoughtful and constructive comments on our work. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: The hybrid prompt construction (described in the method) interleaves projected GNN embeddings with textual inputs but provides no mechanism such as attention masking, embedding-only ablations, or contrastive training to isolate the LLM's conditioning on the GNN vectors from its pre-trained knowledge of node text or graph structure. In TAGs this is load-bearing for the fidelity claim, as LLMs can produce plausible rationales from text alone.
Authors: We agree that isolating the contribution of the projected GNN embeddings is crucial for validating the fidelity claims. In the current manuscript, we include comparisons against an LLM-only baseline that uses only textual graph inputs without the projected embeddings (see Section 4.2). This serves as an ablation to show the added value of the GNN projections. However, we acknowledge that additional techniques like attention masking could provide stronger isolation. We will incorporate a more explicit embedding-only ablation and discuss potential use of masking in the revised method section. revision: partial
-
Referee: The experimental claims of favorable fidelity-sparsity trade-off and improved insightfulness lack reported concrete metrics, specific baselines, statistical significance tests, or dataset details in the provided summary, and no ablation isolating the embedding projection's contribution is described. This weakens verification of the data-to-claim link for the central assertion.
Authors: The full manuscript provides concrete metrics in Tables 1-3, including fidelity scores (e.g., 0.85 average prediction agreement), sparsity ratios (average 15% edge retention), and comparisons to baselines such as SubgraphX, GNNExplainer, and a text-only LLM explainer. We report results with standard deviations across 5 random seeds and include p-values from paired t-tests. Dataset details are in Section 4.1 for Cora, PubMed, and ogbn-arxiv. We will add a dedicated ablation subsection explicitly isolating the projection module to directly address this concern. revision: yes
-
Referee: The human-centric metric 'insightfulness' is invoked as an improvement but without a precise definition, human study protocol, or inter-rater reliability measures, making it difficult to assess whether reported gains are attributable to GNN alignment or general LLM capabilities.
Authors: We define insightfulness in Section 4.3 as the degree to which the explanation reveals the reasoning behind the GNN's prediction in a way that aligns with human understanding of graph structure and node attributes, rated on a Likert scale. The human study involved 25 domain experts who evaluated 100 explanations each, with the protocol detailed in Appendix B, including the exact questions and guidelines provided to participants. Inter-rater reliability was measured using Fleiss' kappa, yielding a value of 0.68 indicating substantial agreement. We will move the full protocol description to the main text or a prominent appendix section in the revision for better accessibility. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces GSPELL as a post-hoc method that projects GNN node embeddings into LLM space and interleaves them in hybrid prompts to generate explanations and subgraphs. No equations, fitted parameters, or derivation steps are presented that reduce any reported prediction, fidelity metric, or explanation quality to an input by construction. The central claims rest on experimental evaluation across external real-world TAG datasets rather than self-referential definitions or self-citation chains that would force the outcomes. The framework is described as an independent procedure without load-bearing uniqueness theorems or ansatzes imported from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can faithfully interpret projected GNN embeddings when combined with graph text in hybrid prompts
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LOGIC projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The projector is trained to optimize two losses: Lcontext (cosine alignment) and Lcontrast (preserving GNN similarity structure).
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gaugllm: Improving graph contrastive learn- ing for text-attributed graphs with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, page 747–758. Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi
-
[2]
Talk like a graph: Encoding graphs for large language models. In The Twelfth International Con- ference on Learning Representations. Jiarui Feng, Hao Liu, Lecheng Kong, Mingfang Zhu, Yixin Chen, and Muhan Zhang. 2024. Taglas: An atlas of text-attributed graph datasets in the era of large graph and language models. Preprint, arXiv:2406.14683. Gregoire Four...
-
[3]
Wiki-cs: A wikipedia-based benchmark for graph neural networks.arXiv preprint arXiv:2007.02901, 2020
Interpretable chirality-aware graph neural net- work for quantitative structure activity relationship modeling in drug discovery. bioRxiv, pages 2022–08. Zheyuan Liu, Xiaoxin He, Yijun Tian, and Nitesh V . Chawla. 2024b. Can we soft prompt llms for graph learning tasks? In Companion Proceedings of the ACM Web Conference 2024 , WWW ’24, page 481–484. ACM. ...
-
[4]
Graph attention networks. arXiv preprint arXiv:1710.10903. Samidha Verma, Burouj Armgaan, Sourav Medya, and Sayan Ranu. 2024. InduCE: Inductive counterfactual explanations for graph neural networks. Transactions on Machine Learning Research. Duo Wang, Yuan Zuo, Fengzhi Li, and Junjie Wu. 2024. Llms as zero-shot graph learners: Alignment of gnn representat...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
IEEE Transactions on Pattern Analysis and Machine Intelligence
Explainability in graph neural networks: A taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji
-
[6]
IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(5):5782–5799
Explainability in graph neural networks: A tax- onomic survey. IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(5):5782–5799. Hao Yuan, Haiyang Yu, Jie Wang, Kang Li, and Shui- wang Ji. 2021. On explainability of graph neural networks via subgraph explorations. In ICML, pages 12241–12252. PMLR. Jiaxing Zhang, Jiayi Liu, Dongsheng Luo, ...
-
[7]
Bioinformatics, 34(13):i457–i466
Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466. 12 Marinka Zitnik, Francis Nguyen, Bo Wang, Jure Leskovec, Anna Goldenberg, and Michael M Hoff- man. 2019. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion, 50:71–91. 13 A Additio...
work page 2019
-
[8]
For more details, please refer to this survey on GNN explainers (Kakkad et al., 2023b)
and GNNInterpreter (Wang and Shen, 2023) adopt generative modeling approaches, producing graphs that most strongly correspond to a given class label. For more details, please refer to this survey on GNN explainers (Kakkad et al., 2023b). B Datasets B.1 Datasets Details We describe in detail the datasets that we used in our evaluation below. The basic stat...
work page 2023
-
[9]
The nodes are clas- sified into one of seven categories
CORA: It is a citation network, in which nodes represent computer science research papers, and each edge between two nodes represents a re- search paper citing another. The nodes are clas- sified into one of seven categories. Though a ci- tation network is a directed network, the dataset is widely used as an undirected network in the message-passing based...
work page 2000
-
[10]
WIKI CS: It is a text-attributed graph dataset, de- rived from the Wikipedia platform, widely used for node classification tasks. The nodes corre- spond to Wikipedia page descriptions of differ- ent computer science topics, and edges between nodes represent hyperlinks from one article to another. Each node in the dataset belongs to one of 10 categories. T...
work page 2000
-
[11]
LIAR: It is a fake-news detection dataset that is often represented as a knowledge graph, with nodes corresponding to statements, speakers, and topics, and edges encoding typed relations such as spoken_by and about.To adapt LIAR into a homogeneous graph suitable for stan- dard GNN pipelines, we merge the three node types—statements, speakers, and topics—i...
-
[12]
The nodes belong to one of 47 different categories of products
AMAZON -PRODUCT : It is a network with nodes representing different kinds of products on Ama- zon and edges connecting co-purchased prod- ucts. The nodes belong to one of 47 different categories of products. We use only a subset of the AMAZON -PRODUCT dataset, consisting of the first 1000 products and their co-purchase edges, as the entire dataset is very...
work page 2014
-
[13]
Write one sentence summarizing the main topics or ideas captured in its keywords
-
[14]
Clearly state whether this product supports the classification of the tar- get product into category ‘Clothing, Shoes & Jewelry’. Use the following format for each neigh- bor: Product {ID}: Summary: One sentence sum- mary of the product’s key- words. Support: YES or NO — Does this product support the classi- fication into ‘Clothing, Shoes & Jewelry’? Base...
work page 2024
-
[15]
Write **one sentence** summarizing the main topics or ideas captured in its keywords
-
[16]
Clearly state whether this article supports the classification of the Target Product into category 'Clothing, Shoes & Jewelry'. Use the following format for each neighbor: Product <ID>: Summary: <One sentence summary of the product's keywords>. Support: YES or NO — Does this product support the classification into 'Clothing, Shoes & Jewelry'? Base your re...
work page 2017
-
[17]
Explanations should clearly convey why a given paper was assigned to a particular research topic
Understandability. Explanations should clearly convey why a given paper was assigned to a particular research topic
-
[18]
Explanations should help users assess whether the model’s classification of a paper can be trusted
Trustworthiness. Explanations should help users assess whether the model’s classification of a paper can be trusted
-
[19]
Insightfulness. Explanations should reveal in- sights about the applications or connections that might play a role in the classification
-
[20]
Satisfaction. Explanations should feel complete 17 Table 7: Performance comparison on the Products dataset using different GNN architectures (using "Llama 3.1 8B Instruct"). Higher fidelity and lower size are better. GCN GAT GIN Fidelity Size Fidelity Size Fidelity Size NODE 73.8% 1.00 73.2% 1.00 39.6% 1.00 RANDOM 84.4% 7.24 84.2% 7.28 90.2% 7.25 GNNE XPL...
-
[21]
Explanations should help users gain confidence in the correctness of the classi- fication
Confidence. Explanations should help users gain confidence in the correctness of the classi- fication
-
[22]
Explanations should be per- suasive in justifying the model’s decision for a given paper
Convincingness. Explanations should be per- suasive in justifying the model’s decision for a given paper
-
[23]
Communicability. Explanations should be ex- pressed in a way that aligns with the user’s back- ground knowledge and expectations
-
[24]
Usability. Explanations should support prac- tical tasks such as interpreting predictions, or improving model performance. G Experimental Setup for the human evaluation of M1 and M2 Scores We clarify the experimental setup used to obtain the M1 and M2 scores reported in Table 2. A) Data Preparation We randomly selected 10 articles from the Cora dataset, o...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.