LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

Pratyusha Vemuri; Zhongyuan Wang

arXiv: 2606.17579 · v1 · pith:7HTZUSQHnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI · cs.CL · cs.SI

Pith reviewed 2026-06-27 02:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.SI

keywords LLM features · GNN · concatenation · homophily · graph benchmarks · PubMed · Cora · feature interference

The pith

Concatenating LLM node features directly to graph models can degrade accuracy on homophilous benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that introducing LLM-generated features through simple concatenation to an MLP or GNN can systematically lower test accuracy on standard homophilous graph datasets such as PubMed and Cora. This occurs even though other forms of LLM integration like joint training improve results on the same benchmarks. A new measure called Delta_sig, which captures how well the LLM features discriminate classes on their own, predicts the concatenation effect more reliably than graph homophily. The finding matters because it cautions against assuming that richer features always help when added naively.

Core claim

On the Planetoid public split with bag-of-words features, concatenating SBERT-encoded GPT-4o-mini TAPE features to an MLP reduces PubMed test accuracy by 17.0 percentage points and Cora by 4.3 points. The degradation is smaller with GCN backbones or random splits and reverses on WikiCS and ogbn-arxiv. Delta_sig correlates with the concatenation cost across nine datasets.
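The concat cost quoted above is an accuracy difference averaged over seeds. A minimal sketch of that bookkeeping follows, assuming the baseline and concatenated runs are paired by seed and that the reported spread is a standard error; the function name `concat_cost` and every number in the example are illustrative placeholders, not values from the paper.

```python
import numpy as np

def concat_cost(acc_orig, acc_concat):
    """Return (mean, standard error) of the per-seed accuracy change, in pp."""
    acc_orig = np.asarray(acc_orig, dtype=float)
    acc_concat = np.asarray(acc_concat, dtype=float)
    if acc_orig.shape != acc_concat.shape:
        raise ValueError("need one accuracy per seed for both models")
    delta = acc_concat - acc_orig  # negative means concatenation hurt
    mean = delta.mean()
    se = delta.std(ddof=1) / np.sqrt(delta.size)  # assumption: spread is an SE
    return mean, se

# Made-up per-seed test accuracies (percent); NOT the paper's measurements.
acc_orig = [71.0, 72.0, 70.0, 71.5, 70.5]
acc_concat = [54.0, 55.5, 53.0, 54.5, 53.0]
mean, se = concat_cost(acc_orig, acc_concat)
print(f"concat cost = {mean:+.1f} +/- {se:.1f} pp")
```

Pairing by seed is a design choice here: it removes seed-to-seed variation shared by both models, which an unpaired comparison would count as noise.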

What carries the argument

Delta_sig, a measure of LLM-alone discriminability that is used to predict whether concatenation will produce non-positive accuracy change.

If this is right

Concatenation interference is strongest in the low-Delta_sig, small-n regime.
On PubMed, the drop magnitude follows a power law in the square root of LLM feature dimension over training-set size, with exponent 1.31 (r^2 = 0.97 across nine configurations).
A dimension-controlled ablation on PubMed places the drop between same-source PCA features (-2.3 pp) and same-dimension Gaussian noise (-37.3 pp).
Delta_sig classifies seven of nine datasets correctly for non-positive concat cost using a threshold around 13.8 pp.
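The threshold rule in the last item can be written down directly. The sketch below applies "Delta_sig <= tau predicts non-positive concat cost" to a list of datasets and counts agreements; the helper names and the five records are hypothetical placeholders (the nine per-dataset Delta_sig values are not listed here), and only tau = 13.8 pp comes from the text.

```python
def predicts_non_positive(delta_sig, tau=13.8):
    """Rule from the text: Delta_sig <= tau predicts concat cost <= 0."""
    return delta_sig <= tau

def rule_accuracy(records, tau=13.8):
    """Count datasets where the rule's prediction matches the observed sign."""
    hits = 0
    for _name, delta_sig, delta_concat in records:
        observed = delta_concat <= 0
        hits += predicts_non_positive(delta_sig, tau) == observed
    return hits, len(records)

# Hypothetical (dataset, Delta_sig, Delta_concat) triples in pp; placeholders only.
records = [
    ("A", 5.0, -17.0),
    ("B", 10.0, -4.0),
    ("C", 12.0, 1.5),   # below threshold but positive cost: the rule misses
    ("D", 25.0, 4.4),
    ("E", 40.0, 11.7),
]
hits, total = rule_accuracy(records)
print(f"rule agrees on {hits}/{total} datasets")
```

Because tau is fitted on the same datasets it is scored on, an agreement count like this describes the fit rather than out-of-sample predictive power.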

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Practitioners should check LLM feature discriminability before concatenating rather than assuming benefit.
Future work could test whether the same interference appears when concatenating to more advanced GNN architectures.
Delta_sig might generalize to other feature types beyond LLMs on graph tasks.

Load-bearing premise

The accuracy drops are caused by interference from the concatenation step itself rather than by differences in training dynamics or unmeasured feature properties.

What would settle it

A re-run of the PubMed MLP experiment that uses the same LLM features with identical optimization and seed settings and eliminates the 17 pp gap.

Figures

Figures reproduced from arXiv: 2606.17579 by Pratyusha Vemuri, Zhongyuan Wang.

**Figure 1.** Concat cost ∆_concat: MLP test accuracy change from adding F_LLM on top of F_orig. PubMed degrades by 17.0 ± 0.3 pp over 10 seeds; Cora by 4.3 ± 0.6 pp. The gain flips to +11.7 pp on ogbn-arxiv. h values reported below each dataset.

**Figure 2.** Concat cost ∆_concat vs. LLM-alone signal ∆_sig. Each point is one dataset (10 seeds, ±SE). Vertical dashed line at the bootstrap-best change-point τ = 13.8 pp (95% CI [0, 13.8] pp; Appendix I); the rule “∆_sig ≤ τ predicts non-positive concat cost” classifies 7/9 datasets correctly. Below-threshold datasets cluster at non-positive concat cost (including PubMed’s −17 pp), with two false positives (Amazon-Rati…

**Figure 3.** Mechanism ablation on PubMed (10 seeds, ±SE). Same-dim zeros produce no degradation; same-source PCA-of-F_orig produces only −2 pp; real F_LLM concat produces −17 pp (unchanged by halving weight decay); pure Gaussian noise produces −37 pp. The LLM-feature interference is specific to informational content, not dim or regularization. … is strongly rank-deficient (participation-ratio rank ∼ 30, entropy rank ∼ 92 o…

**Figure 4.** Concat cost ∆_concat on PubMed, four architectures. MLP: −17.0 ± 0.3; GCN: −7.25 ± 0.27; GCNII: −5.6 (from 10-seed Shapley coalition data); GAT: −3.25 ± 0.31. All four are negative and statistically clear. The magnitude decreases as the architecture gains message-passing sophistication: GCN smooths neighbors, GCNII adds identity and initial-residual pass-through, and GAT’s attention plausibly downweights the …

**Figure 5.** PubMed mechanism collapse: nine PubMed configurations fall onto |∆_concat| ∝ (sqrt(d_l/n))^1.31, r^2 = 0.97; Cora and CiteSeer public-split stars are overlaid for context but not part of the regression. Coral circles: PubMed train-fraction sweep at d_l = 384. Dark coral square: PubMed public split at d_l = 768 (MPNet). Stars: public-split headlines on Cora / CiteSeer / PubMed. Dashed line: log-log power-law fit …

**Figure 6.** Concat cost decays monotonically with training-set size. ∆_concat on Cora / CiteSeer / PubMed vs. number of training labels n (log scale). Stars mark each dataset’s Planetoid public-split label budget (Cora 140, CiteSeer 120, PubMed 60) with the public-split ∆_concat value. The random-split points extrapolate cleanly toward the public-split star, especially on PubMed where n = 59 random-split reproduces ∆con…

**Figure 7.** Structure absorbs some but not all concatenation interference. GCNII-2 reduces PubMed’s concat cost from −17 to −5.6 pp; Cora’s from −4.3 to −0.5 pp. Strong positive datasets (WikiCS, ogbn-arxiv) lose some of their MLP gain under GCNII because structure provides overlapping information.

**Figure 8.** 4-factor Shapley bars. F_orig (green) is the top contributor on 7 of 9 datasets (with F_LLM top on WikiCS and ogbn-arxiv); F_LLM Shapley values (coral) average out the direct concat cost shown in …
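Figure 8 refers to a 4-factor Shapley decomposition. As a rough illustration of how exact Shapley values follow from coalition scores, the sketch below enumerates every coalition of four factors. Only F_orig and F_LLM are named in the caption; `factor_3`, `factor_4`, the toy value function, and all numbers are assumptions made for illustration and are not the paper's coalition data.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Toy game: additive contributions plus a negative F_orig/F_LLM interaction.
BASE = {"F_orig": 10.0, "F_LLM": -3.0, "factor_3": 4.0, "factor_4": 1.0}

def toy_value(coalition):
    score = sum(BASE[p] for p in coalition)
    if {"F_orig", "F_LLM"} <= coalition:
        score -= 2.0  # hypothetical interference when both feature sets are present
    return score

phi = shapley_values(list(BASE), toy_value)
for name, val in phi.items():
    print(f"{name}: {val:+.2f}")
```

In this toy game the pairwise penalty is split evenly between the two interacting factors, which is the sense in which a Shapley value averages an interaction effect over coalitions.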

read the original abstract

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
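The abstract compares two r^2 point estimates at N = 9 and notes that the bootstrap confidence intervals overlap. The sketch below shows one common way such an interval could be obtained, a percentile bootstrap over datasets. The paper's exact resampling procedure is not given here, so the function names, the resample count, and the synthetic data are all assumptions.

```python
import numpy as np

def r_squared(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1] ** 2

def bootstrap_r2_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for r^2, resampling (x, y) pairs with replacement."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, x.size, size=x.size)
        xs, ys = x[idx], y[idx]
        if xs.max() == xs.min() or ys.max() == ys.min():
            continue  # correlation is undefined for a constant resample
        stats.append(r_squared(xs, ys))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Synthetic stand-in for nine datasets; NOT the paper's values.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 40.0, 9)
y = 0.5 * x + rng.normal(0.0, 8.0, size=9)
lo, hi = bootstrap_r2_ci(x, y)
print(f"r^2 = {r_squared(x, y):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only nine points the interval is typically wide, which is consistent with the abstract's choice to report the comparison at point estimate only.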

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Concatenating LLM features hurts accuracy on homophilous Planetoid splits, and the paper measures the drops cleanly while offering Delta_sig as a rough signal for when to expect trouble.

The core observation is that pure input concatenation of SBERT-encoded GPT-4o-mini features onto bag-of-words inputs drops MLP accuracy by 17 points on PubMed and 4 points on Cora under the public Planetoid split. The drop shrinks with GCN-family backbones or random splits and reverses on WikiCS and ogbn-arxiv. Delta_sig, computed from the LLM features alone, correlates more strongly with the observed concat cost than homophily does across the nine datasets they check.

The experiments are straightforward and address several obvious confounds. The backbone and split ablations show the effect is not limited to one architecture. The dimension-controlled ablation on PubMed places the LLM drop between PCA reduction and same-dimension Gaussian noise, which rules out raw dimensionality or weight-decay artifacts. The power-law fit on nine PubMed configurations is tight and lines up with the headline deficit in the low-Delta_sig, small-n regime.
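To make the dimension-controlled ablation concrete, the sketch below builds three control blocks that could stand in for the LLM features at a matched width: zeros, Gaussian noise, and a PCA projection of the original features. The scaling and preprocessing the paper applies to each control are not stated here, so `build_controls`, the unit-variance noise, and the random stand-in for bag-of-words features are assumptions.

```python
import numpy as np

def build_controls(f_orig, d_l, seed=0):
    """Return control blocks to concatenate in place of d_l-dimensional LLM features."""
    rng = np.random.default_rng(seed)
    n = f_orig.shape[0]
    zeros = np.zeros((n, d_l))
    noise = rng.standard_normal((n, d_l))  # assumption: unit-variance Gaussian
    # PCA of the original features: center, then project onto the top right
    # singular vectors. Yields fewer than d_l columns if the rank is smaller.
    centered = f_orig - f_orig.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(d_l, vt.shape[0])
    pca = centered @ vt[:k].T
    return {"zeros": zeros, "noise": noise, "pca_of_orig": pca}

rng = np.random.default_rng(1)
f_orig = rng.random((200, 50))  # random stand-in for bag-of-words features
controls = build_controls(f_orig, d_l=16)
augmented = {name: np.hstack([f_orig, block]) for name, block in controls.items()}
for name, feats in augmented.items():
    print(name, feats.shape)
```

Each augmented matrix would then be trained under the same protocol as the real concatenation, so that differences in accuracy can be attributed to the content of the added block and not to its width.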

Delta_sig itself is post-hoc and the change-point tau is fitted, so the classification rule is best treated as an interpretive tool rather than a pre-specified predictor. The training protocol keeps the same learning rate, epochs, and regularization for the baseline and concatenated models. The paper does not run a hyperparameter sweep to check whether the higher-dimensional LLM features simply need different optimization settings to reach the same optimum. That leaves a narrow opening for an under-optimization account, though the noise ablation already narrows the space of simple statistical explanations.

The work is aimed at anyone integrating LLM node features into GNN pipelines on citation-style graphs. It deserves a serious referee because the negative result is measured directly, the controls are reasonable, and the finding challenges a common integration choice even if the new metric needs further validation.

Referee Report

1 major / 2 minor

Summary. The paper claims that introducing LLM-generated node features (SBERT-encoded GPT-4o-mini TAPE) via pure input concatenation to bag-of-words features systematically degrades accuracy on homophilous Planetoid benchmarks when using an MLP backbone on the public split (PubMed: -17.0 +/- 0.3 pp; Cora: -4.3 +/- 0.6 pp), with the effect attenuating or reversing under relaxed conditions (GCN/GCNII/GAT backbones, random splits, other datasets). It introduces Delta_sig (LLM-alone discriminability) as an interpretive measure that correlates more strongly with concatenation cost than homophily (r^2=0.38 vs 0.06), supported by a bootstrap change-point at tau=13.8 pp, a dimension-controlled ablation on PubMed, and a power-law fit |Delta_concat| ~ (sqrt(d_l/n))^1.31 (r^2=0.97) across nine PubMed configurations.

Significance. If the central measurements hold, the work supplies concrete evidence that simple concatenation of LLM features can harm rather than help on standard homophilous benchmarks, contrasting with gains reported for joint training or distillation pipelines. Strengths include the direct accuracy measurements, the dimension-controlled PCA/noise ablation that rules out dimensionality and weight-decay artifacts, the high-r^2 power-law relation on multiple PubMed configurations, and the transparent bootstrap analysis of the Delta_sig change-point. These elements provide a falsifiable lens for when concatenation is likely to be neutral or detrimental.

major comments (1)

[Experimental protocol and ablation sections] The central claim attributes the observed accuracy drops directly to 'pure input concatenation' interference. The reported protocol uses identical training hyperparameters for the bag-of-words baseline and the concatenated model. While the dimension-controlled ablation (placing the LLM drop between PCA at -2.3 pp and Gaussian noise at -37.3 pp) rules out dimensionality and weight-decay artifacts, no evidence is provided that the higher-dimensional concatenated inputs reach an equivalent optimum (e.g., via separate hyperparameter search, learning-curve comparison, or adjusted LR/epochs). This leaves open the possibility that part of the -17 pp PubMed drop arises from under-optimization rather than feature interference per se, which is load-bearing for the attribution in the title and abstract.

minor comments (2)

[Section introducing Delta_sig] The definition and computation of Delta_sig (computed from LLM features alone before GNN training) should be stated with an explicit equation or pseudocode in the main text to make the correlation analysis fully reproducible without reference to the appendix.
[Power-law analysis paragraph] The power-law fit is reported with r^2=0.97 on nine PubMed configurations; adding the fitted exponent with its uncertainty and the exact list of configurations (e.g., as a table row) would strengthen the claim that the low-Delta_sig, small-n corner explains the headline deficit.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. The recognition of the paper's direct measurements, ablations, and analyses is appreciated. We address the single major comment below.

Referee: [Experimental protocol and ablation sections] The central claim attributes the observed accuracy drops directly to 'pure input concatenation' interference. The reported protocol uses identical training hyperparameters for the bag-of-words baseline and the concatenated model. While the dimension-controlled ablation (placing the LLM drop between PCA at -2.3 pp and Gaussian noise at -37.3 pp) rules out dimensionality and weight-decay artifacts, no evidence is provided that the higher-dimensional concatenated inputs reach an equivalent optimum (e.g., via separate hyperparameter search, learning-curve comparison, or adjusted LR/epochs). This leaves open the possibility that part of the -17 pp PubMed drop arises from under-optimization rather than feature interference per se, which is load-bearing for the attribution in the title and abstract.

Authors: We acknowledge the validity of this observation. The manuscript employs the same set of training hyperparameters for the bag-of-words baseline and the LLM-concatenated model to ensure a controlled comparison. While this protocol does not include a dedicated hyperparameter optimization for the concatenated inputs, the dimension-controlled ablation demonstrates that the observed drop cannot be attributed solely to increased input dimensionality, as PCA reduction yields only -2.3 pp while LLM concatenation yields -17.0 pp. Furthermore, the Gaussian noise control at matched dimension produces a much larger drop (-37.3 pp), indicating that the LLM features introduce a specific interference effect beyond optimization challenges from dimensionality. The power-law relationship fitted across nine PubMed configurations with r^2 = 0.97 provides additional evidence for a systematic phenomenon tied to feature discriminability (Delta_sig). Nevertheless, to strengthen the claim and rule out under-optimization, we will conduct a separate hyperparameter search for the concatenated model on PubMed and include learning curve analyses in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are direct empirical measurements

full rationale

The paper reports direct measurements of test accuracy drops on Planetoid splits when concatenating LLM features to MLP/GNN backbones. Delta_sig is computed solely from LLM features prior to any GNN training, and the reported r^2 correlations, bootstrap change-point tau, and power-law exponent are descriptive fits to the observed accuracy deltas across the 9 datasets and 9 PubMed configurations. These fits are not used to derive the headline drops or to claim first-principles predictions; the central attribution to concatenation interference rests on the measured deltas themselves (with dimension-controlled ablations). No self-citations, uniqueness theorems, or ansatzes appear in the provided text to support the claims. The derivation chain is therefore self-contained observational reporting rather than reduction to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard supervised learning assumptions plus the new Delta_sig metric whose threshold is chosen by bootstrap on the same nine datasets; one fitted exponent and one change-point value are introduced to summarize the observed pattern.

free parameters (2)

tau = 13.8 pp
Bootstrap-selected change-point separating positive and non-positive concatenation cost
power-law exponent = 1.31
Exponent in the fit |Delta_concat| proportional to (sqrt(d_l/n))^alpha across nine PubMed configurations
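The exponent above comes from a fit of |Delta_concat| against sqrt(d_l/n). One standard way to estimate such an exponent is ordinary least squares in log-log space, sketched below on synthetic data generated from an exact power law. The function name, the prefactor 2.0, and the configuration grid are assumptions for illustration; the paper's nine PubMed configurations are not reproduced here.

```python
import numpy as np

def fit_power_law(d_l, n, abs_delta):
    """Least-squares fit of log|delta| = alpha * log(sqrt(d_l / n)) + log(c)."""
    x = np.log(np.sqrt(np.asarray(d_l, dtype=float) / np.asarray(n, dtype=float)))
    y = np.log(np.asarray(abs_delta, dtype=float))
    alpha, log_c = np.polyfit(x, y, 1)  # slope is the exponent
    resid = y - (alpha * x + log_c)
    r2 = 1.0 - resid.var() / y.var()
    return alpha, np.exp(log_c), r2

# Synthetic configurations drawn from an exact power law; NOT the paper's data.
n = np.array([60, 120, 240, 480, 960, 1920])
d_l = np.full(n.shape, 384)
abs_delta = 2.0 * np.sqrt(d_l / n) ** 1.31
alpha, c, r2 = fit_power_law(d_l, n, abs_delta)
print(f"alpha = {alpha:.2f}, c = {c:.2f}, r^2 = {r2:.3f}")
```

On real measurements the residuals would not vanish, and a bootstrap over configurations would be one way to attach the uncertainty on the exponent that the referee asks for.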

axioms (1)

[standard math] Standard assumptions on random initialization, fixed public splits, and early-stopping behavior in GNN training
Invoked when reporting mean accuracy and standard deviation across seeds

invented entities (1)

Delta_sig (no independent evidence)
purpose: Scalar measure of LLM-feature class discriminability used to predict concatenation cost
Newly defined quantity whose correlation with observed accuracy change is reported
