
arxiv: 2606.17579 · v1 · pith:7HTZUSQHnew · submitted 2026-06-16 · cs.LG · cs.AI · cs.CL · cs.SI

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

Pith reviewed 2026-06-27 02:08 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL · cs.SI
keywords LLM features · GNN · concatenation · homophily · graph benchmarks · PubMed · Cora · feature interference

The pith

Concatenating LLM node features directly to graph models can degrade accuracy on homophilous benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that introducing LLM-generated features through simple concatenation to an MLP or GNN can systematically lower test accuracy on standard homophilous graph datasets such as PubMed and Cora. This occurs even though other forms of LLM integration like joint training improve results on the same benchmarks. A new measure called Delta_sig, which captures how well the LLM features discriminate classes on their own, predicts the concatenation effect more reliably than graph homophily. The finding matters because it cautions against assuming that richer features always help when added naively.

Core claim

On the Planetoid public split with bag-of-words features, concatenating SBERT-encoded GPT-4o-mini TAPE features to an MLP reduces PubMed test accuracy by 17.0 percentage points and Cora by 4.3 points. The degradation is smaller with GCN backbones or random splits and reverses on WikiCS and ogbn-arxiv. Delta_sig correlates with the concatenation cost across nine datasets.
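
The concat cost quoted above is an accuracy difference between two models trained on the same split. A minimal sketch of that bookkeeping follows, assuming one test accuracy per seed for each model and a paired-by-seed standard error; the function names and the sample numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def concat_features(f_orig, f_llm):
    """Column-wise concatenation [F_orig | F_LLM] of two node-feature matrices."""
    return np.concatenate([f_orig, f_llm], axis=1)

def concat_cost(acc_orig, acc_concat):
    """Mean accuracy change in pp and its standard error over paired seeds.

    Assumption: both lists hold one test accuracy (in percent) per seed,
    in the same seed order. A negative mean means concatenation hurt.
    """
    acc_orig = np.asarray(acc_orig, dtype=float)
    acc_concat = np.asarray(acc_concat, dtype=float)
    if acc_orig.shape != acc_concat.shape:
        raise ValueError("need one accuracy per seed for both models")
    diff = acc_concat - acc_orig
    mean = diff.mean()
    se = diff.std(ddof=1) / np.sqrt(diff.size)
    return mean, se

# Hypothetical per-seed accuracies, for illustration only.
orig = [71.0, 72.0, 70.0, 71.5]
concat = [54.0, 55.5, 53.0, 54.5]
delta, se = concat_cost(orig, concat)
print(f"concat cost = {delta:.2f} +/- {se:.2f} pp")
```

Pairing by seed is a design choice here; whether the paper's reported error bars are paired or unpaired is not stated in this summary.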

What carries the argument

Delta_sig, a measure of LLM-alone discriminability that is used to predict whether concatenation will produce non-positive accuracy change.

If this is right

  • Concatenation interference is strongest in the low-Delta_sig, small-n regime.
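  • The drop magnitude follows a power law in the square root of LLM feature dimension over training-set size: |Delta_concat| is proportional to (sqrt(d_l/n))^1.31 across nine PubMed configurations (r^2 = 0.97).
  • A dimension-controlled ablation on PubMed places the LLM-feature drop (-17.0 pp) between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp).
  • The rule "Delta_sig <= tau predicts non-positive concat cost" classifies seven of nine datasets correctly at the bootstrap-best threshold tau = 13.8 pp.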
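
The threshold rule in the last bullet can be sketched as a one-line classifier plus a hit count. This is a hedged illustration: only tau = 13.8 pp comes from the paper; the function names and the per-dataset values below are hypothetical, and how Delta_sig itself is computed is not specified in this summary.

```python
import numpy as np

TAU_PP = 13.8  # bootstrap-best change-point reported in the paper

def predict_nonpositive(delta_sig, tau=TAU_PP):
    """Rule: Delta_sig <= tau predicts a non-positive concat cost."""
    return np.asarray(delta_sig, dtype=float) <= tau

def rule_hits(delta_sig, delta_concat, tau=TAU_PP):
    """Count datasets where the rule's prediction matches the observed sign."""
    pred = predict_nonpositive(delta_sig, tau)
    actual = np.asarray(delta_concat, dtype=float) <= 0.0
    return int((pred == actual).sum()), int(pred.size)

# Hypothetical (Delta_sig, Delta_concat) pairs in pp, for illustration only.
delta_sig = [2.0, 8.0, 12.0, 20.0, 35.0]
delta_concat = [-10.0, -3.0, 1.5, 5.0, 9.0]
hits, total = rule_hits(delta_sig, delta_concat)
print(f"rule classifies {hits}/{total} datasets correctly")
```

In this toy data the third dataset sits below the threshold but has a positive concat cost, which is the kind of miss the paper reports as a false positive.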

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners should check LLM feature discriminability before concatenating rather than assuming benefit.
  • Future work could test whether the same interference appears when concatenating to more advanced GNN architectures.
  • Delta_sig might generalize to other feature types beyond LLMs on graph tasks.

Load-bearing premise

The accuracy drops are caused by interference from the concatenation step itself rather than by differences in training dynamics or unmeasured feature properties.

What would settle it

A re-run of the PubMed MLP experiment, using the same LLM features and matched optimization and seed settings, in which the 17 pp gap disappears. That outcome would undercut the attribution to concatenation interference.

Figures

Figures reproduced from arXiv: 2606.17579 by Pratyusha Vemuri, Zhongyuan Wang.

Figure 1. Concat cost ∆_concat: MLP test accuracy change from adding F_LLM on top of F_orig. PubMed degrades by 17.0±0.3 pp over 10 seeds; Cora by 4.3±0.6 pp. The change flips to +11.7 pp on ogbn-arxiv. h values reported below each dataset.
Figure 2. Concat cost ∆_concat vs. LLM-alone signal ∆_sig. Each point is one dataset (10 seeds, ±SE). Vertical dashed line at the bootstrap-best change-point τ = 13.8 pp (95% CI [0, 13.8] pp; Appendix I); the rule “∆_sig ≤ τ predicts non-positive concat cost” classifies 7/9 datasets correctly. Below-threshold datasets cluster at non-positive concat cost (including PubMed’s −17 pp), with two false positives (Amazon-Rati…
Figure 3. Mechanism ablation on PubMed (10 seeds ±SE). Same-dim zeros produce no degradation; same-source PCA-of-F_orig produces only −2 pp; real F_LLM concat produces −17 pp (unchanged by halving weight decay); pure Gaussian noise produces −37 pp. The LLM-feature interference is specific to informational content, not dim or regularization. … is strongly rank-deficient (participation-ratio rank ∼ 30, entropy rank ∼ 92 o…
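
The three controls named in the Figure 3 caption (same-dim zeros, same-source PCA of F_orig, same-dim Gaussian noise) can be built as matched-width blocks that replace F_LLM in the concatenation. The sketch below assumes unit-variance noise and unscaled PCA scores padded with zero columns when F_orig has fewer than d_l components; the paper's exact scaling is not given here, and `make_controls` and `with_extra` are hypothetical names.

```python
import numpy as np

def make_controls(f_orig, d_l, seed=0):
    """Build d_l-wide stand-ins for F_LLM: zeros, PCA of F_orig, Gaussian noise."""
    rng = np.random.default_rng(seed)
    n = f_orig.shape[0]
    zeros = np.zeros((n, d_l))
    noise = rng.standard_normal((n, d_l))  # assumption: unit variance
    # PCA scores via SVD of the centered matrix: scores = U * S.
    centered = f_orig - f_orig.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(d_l, s.size)
    pca = np.zeros((n, d_l))
    pca[:, :k] = u[:, :k] * s[:k]
    return {"zeros": zeros, "pca_of_orig": pca, "gaussian_noise": noise}

def with_extra(f_orig, extra):
    """Concatenate a control block onto the original features."""
    return np.concatenate([f_orig, extra], axis=1)

# Toy stand-in for bag-of-words features: 20 nodes, 50 dimensions.
f_orig = np.random.default_rng(1).random((20, 50))
controls = make_controls(f_orig, d_l=8)
inputs = {name: with_extra(f_orig, block) for name, block in controls.items()}
```

Each entry of `inputs` has the same width as the real concatenation would, so any accuracy difference between them cannot come from input dimension alone.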
Figure 4. Concat cost ∆_concat on PubMed, four architectures. MLP: −17.0 ± 0.3; GCN: −7.25 ± 0.27; GCNII: −5.6 (from 10-seed Shapley coalition data); GAT: −3.25 ± 0.31. All four are negative and statistically clear. The magnitude decreases as the architecture gains message-passing sophistication: GCN smooths neighbors, GCNII adds identity and initial-residual pass-through, and GAT’s attention plausibly downweights the …
Figure 5. PubMed mechanism collapse: nine PubMed configurations fall onto |∆_concat| ∝ (√(d_l/n))^1.31, r^2 = 0.97; Cora and CiteSeer public-split stars are overlaid for context but not part of the regression. Coral circles: PubMed train-fraction sweep at d_l = 384. Dark coral square: PubMed public split at d_l = 768 (MPNet). Stars: public-split headlines on Cora / CiteSeer / PubMed. Dashed line: log-log power-law fit …
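
A log-log power-law fit of the kind described in the Figure 5 caption reduces to ordinary least squares on log-transformed values. The sketch below is a generic illustration on synthetic data generated with exponent 1.31; it does not reproduce the paper's nine configurations, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

def fit_power_law(d_l, n, abs_delta):
    """Fit |delta| = c * (sqrt(d_l / n)) ** alpha by least squares in log-log space."""
    x = np.log(np.sqrt(np.asarray(d_l, dtype=float) / np.asarray(n, dtype=float)))
    y = np.log(np.asarray(abs_delta, dtype=float))
    alpha, log_c = np.polyfit(x, y, 1)       # slope is the exponent
    resid = y - (alpha * x + log_c)
    r2 = 1.0 - resid.var() / y.var()         # r^2 of the log-log fit
    return alpha, np.exp(log_c), r2

# Synthetic, noise-free data: a train-size sweep at fixed d_l = 384.
n = np.array([60, 120, 240, 480, 960, 1920])
d_l = np.full(n.shape, 384)
abs_delta = 2.0 * np.sqrt(d_l / n) ** 1.31
alpha, c, r2 = fit_power_law(d_l, n, abs_delta)
print(f"alpha = {alpha:.2f}, c = {c:.2f}, r^2 = {r2:.3f}")
```

Note that r^2 here is measured on the log scale, which is the usual convention for such fits but an assumption about how the paper's 0.97 was computed.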
Figure 6. Concat cost decays monotonically with training-set size. ∆_concat on Cora / CiteSeer / PubMed vs. number of training labels n (log scale). Stars mark each dataset’s Planetoid public-split label budget (Cora 140, CiteSeer 120, PubMed 60) with the public-split ∆_concat value. The random-split points extrapolate cleanly toward the public-split star, especially on PubMed where n = 59 random-split reproduces ∆con…
Figure 7. Structure absorbs some but not all concatenation interference. GCNII-2 reduces PubMed’s concat cost from −17 to −5.6 pp; Cora’s from −4.3 to −0.5 pp. Strong positive datasets (WikiCS, ogbn-arxiv) lose some of their MLP gain under GCNII because structure provides overlapping information.
Figure 8. 4-factor Shapley bars. F_orig (green) is the top contributor on 7 of 9 datasets (with F_LLM top on WikiCS and ogbn-arxiv); F_LLM Shapley values (coral) average out the direct concat cost shown in …
Original abstract

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that introducing LLM-generated node features (SBERT-encoded GPT-4o-mini TAPE) via pure input concatenation to bag-of-words features systematically degrades accuracy on homophilous Planetoid benchmarks when using an MLP backbone on the public split (PubMed: -17.0 +/- 0.3 pp; Cora: -4.3 +/- 0.6 pp), with the effect attenuating or reversing under relaxed conditions (GCN/GCNII/GAT backbones, random splits, other datasets). It introduces Delta_sig (LLM-alone discriminability) as an interpretive measure that correlates more strongly with concatenation cost than homophily (r^2=0.38 vs 0.06), supported by a bootstrap change-point at tau=13.8 pp, a dimension-controlled ablation on PubMed, and a power-law fit |Delta_concat| ~ (sqrt(d_l/n))^1.31 (r^2=0.97) across nine PubMed configurations.

Significance. If the central measurements hold, the work supplies concrete evidence that simple concatenation of LLM features can harm rather than help on standard homophilous benchmarks, contrasting with gains reported for joint training or distillation pipelines. Strengths include the direct accuracy measurements, the dimension-controlled PCA/noise ablation that rules out dimensionality and weight-decay artifacts, the high-r^2 power-law relation on multiple PubMed configurations, and the transparent bootstrap analysis of the Delta_sig change-point. These elements provide a falsifiable lens for when concatenation is likely to be neutral or detrimental.

major comments (1)
  1. [Experimental protocol and ablation sections] The central claim attributes the observed accuracy drops directly to 'pure input concatenation' interference. The reported protocol uses identical training hyperparameters for the bag-of-words baseline and the concatenated model. While the dimension-controlled ablation (placing the LLM drop between PCA at -2.3 pp and Gaussian noise at -37.3 pp) rules out dimensionality and weight-decay artifacts, no evidence is provided that the higher-dimensional concatenated inputs reach an equivalent optimum (e.g., via separate hyperparameter search, learning-curve comparison, or adjusted LR/epochs). This leaves open the possibility that part of the -17 pp PubMed drop arises from under-optimization rather than feature interference per se, which is load-bearing for the attribution in the title and abstract.
minor comments (2)
  1. [Section introducing Delta_sig] The definition and computation of Delta_sig (computed from LLM features alone before GNN training) should be stated with an explicit equation or pseudocode in the main text to make the correlation analysis fully reproducible without reference to the appendix.
  2. [Power-law analysis paragraph] The power-law fit is reported with r^2=0.97 on nine PubMed configurations; adding the fitted exponent with its uncertainty and the exact list of configurations (e.g., as a table row) would strengthen the claim that the low-Delta_sig, small-n corner explains the headline deficit.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. The recognition of the paper's direct measurements, ablations, and analyses is appreciated. We address the single major comment below.

point-by-point responses
  1. Referee: [Experimental protocol and ablation sections] The central claim attributes the observed accuracy drops directly to 'pure input concatenation' interference. The reported protocol uses identical training hyperparameters for the bag-of-words baseline and the concatenated model. While the dimension-controlled ablation (placing the LLM drop between PCA at -2.3 pp and Gaussian noise at -37.3 pp) rules out dimensionality and weight-decay artifacts, no evidence is provided that the higher-dimensional concatenated inputs reach an equivalent optimum (e.g., via separate hyperparameter search, learning-curve comparison, or adjusted LR/epochs). This leaves open the possibility that part of the -17 pp PubMed drop arises from under-optimization rather than feature interference per se, which is load-bearing for the attribution in the title and abstract.

    Authors: We acknowledge the validity of this observation. The manuscript employs the same set of training hyperparameters for the bag-of-words baseline and the LLM-concatenated model to ensure a controlled comparison. While this protocol does not include a dedicated hyperparameter optimization for the concatenated inputs, the dimension-controlled ablation demonstrates that the observed drop cannot be attributed solely to increased input dimensionality, as PCA reduction yields only -2.3 pp while LLM concatenation yields -17.0 pp. Furthermore, the Gaussian noise control at matched dimension produces a much larger drop (-37.3 pp), indicating that the LLM features introduce a specific interference effect beyond optimization challenges from dimensionality. The power-law relationship fitted across nine PubMed configurations with r^2 = 0.97 provides additional evidence for a systematic phenomenon tied to feature discriminability (Delta_sig). Nevertheless, to strengthen the claim and rule out under-optimization, we will conduct a separate hyperparameter search for the concatenated model on PubMed and include learning curve analyses in the revised manuscript.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are direct empirical measurements

full rationale

The paper reports direct measurements of test accuracy drops on Planetoid splits when concatenating LLM features to MLP/GNN backbones. Delta_sig is computed solely from LLM features prior to any GNN training, and the reported r^2 correlations, bootstrap change-point tau, and power-law exponent are descriptive fits to the observed accuracy deltas across the 9 datasets and 9 PubMed configurations. These fits are not used to derive the headline drops or to claim first-principles predictions; the central attribution to concatenation interference rests on the measured deltas themselves (with dimension-controlled ablations). No self-citations, uniqueness theorems, or ansatzes appear in the provided text to support the claims. The derivation chain is therefore self-contained observational reporting rather than reduction to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard supervised learning assumptions plus the new Delta_sig metric whose threshold is chosen by bootstrap on the same nine datasets; one fitted exponent and one change-point value are introduced to summarize the observed pattern.

free parameters (2)
  • tau = 13.8 pp
    Bootstrap-selected change-point separating positive and non-positive concatenation cost
  • power-law exponent = 1.31
    Exponent in the fit |Delta_concat| proportional to (sqrt(d_l/n))^alpha across nine PubMed configurations
axioms (1)
  • [standard math] Standard assumptions on random initialization, fixed public splits, and early-stopping behavior in GNN training
    Invoked when reporting mean accuracy and standard deviation across seeds
invented entities (1)
  • Delta_sig [no independent evidence]
    purpose: Scalar measure of LLM-feature class discriminability used to predict concatenation cost
    Newly defined quantity whose correlation with observed accuracy change is reported

pith-pipeline@v0.9.1-grok · 5947 in / 1436 out tokens · 45000 ms · 2026-06-27T02:08:55.015062+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. (2020). Simple and deep graph convolutional networks. In ICML

  2. [2]

    Chen, R. et al. (2024). LLaGA: Large Language and Graph Assistant. In ICML

  3. [3]

    Chen, Z., Mao, H., Liu, J., Song, Y., Li, B., Jin, W., Fatemi, B., Tsitsulin, A., Perozzi, B., Liu, H., and Tang, J. (2024). Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights. In NeurIPS Datasets and Benchmarks Track. arXiv:2406.10727

  4. [4]

    Li, Y., Wang, P., Zhu, X., Chen, A., Jiang, H., Cai, D., Chan, W. K., and Li, J. (2024). GLBench: A Comprehensive Benchmark for Graph with Large Language Models. In NeurIPS Datasets and Benchmarks Track. arXiv:2407.07457

  5. [5]

    Wu, X., Shen, Y., Ge, F., Shan, C., Jiao, Y., Sun, X., and Cheng, H. (2025). When Do LLMs Help With Node Classification? A Comprehensive Analysis. In ICML

  6. [6]

    Muschalik, M., Fumagalli, F., Frazzetto, P., Strotherm, J., Hermes, L., Sperduti, A., Hüllermeier, E., and Hammer, B. (2025). Exact Computation of Any-Order Shapley Interactions for Graph Neural Networks. arXiv:2501.16944

  7. [7]

    He, X. et al. (2023). Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. arXiv:2305.19523 (TAPE)

  8. [8]

    Luan, S. et al. (2024). The heterophily paradox: when homophily fails and when it succeeds

  9. [9]

    Ma, Y. et al. (2022). Is homophily a necessity for graph neural networks? In ICLR

  10. [10]

    Tang, J. et al. (2024). GraphGPT: Graph instruction tuning for large language models

  11. [11]

    Wang, R. et al. (2025). TANS: Topology-aware neighbor summarization for LLM-on-graph. NAACL

  12. [12]

    Ying, R. et al. (2019). GNNExplainer: Generating explanations for graph neural networks. NeurIPS

  13. [13]

    Yuan, H. et al. (2021). On explainability of graph neural networks via subgraph explorations. ICML

  14. [14]

    Zhao, J. et al. (2023). Learning on large-scale text-attributed graphs via variational inference. In ICLR (GLEM)

  15. [15]

    Zhu, J. et al. (2020). Beyond homophily in graph neural networks: Current limitations and effective designs. NeurIPS

  16. [16]

    Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. (2018). Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop, NeurIPS

  17. [17]

    Duval, A., and Malliaros, F. D. (2021). GraphSVX: Shapley value explanations for graph neural networks. In ECML-PKDD

  18. [18]

    Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. (2021). Extracting training data from large language models. In 30th USENIX Security Symposium, pp. 2633–2650

  19. [19]

    Sainz, O., Campos, J. A., García-Ferrero, I., Etxaniz, J., de Lacalle, O. L., and Agirre, E. (2023). NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of EMNLP

  20. [20]

    Chaudhuri, A., Dutta, A., Bui, T., and Georgescu, S. (2025). A closer look at multimodal representation collapse. arXiv preprint arXiv:2505.22483

  21. [21]

    Huang, Y., Lin, J., Zhou, C., Yang, H., and Huang, L. (2022). Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably). In Proc. ICML

  22. [22]

    Peng, X., Wei, Y., Deng, A., Wang, D., and Hu, D. (2022). Balanced multimodal learning via on-the-fly gradient modulation. In Proc. CVPR, pp. 8238–8247

  23. [23]

    Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010

  24. [24]

    Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer

  25. [25]

    Yang, H., Wang, X., Tao, Q., Hu, S., Lin, Z., and Zhang, M. (2024). GL-Fusion: Rethinking the Combination of Graph Neural Network and Large Language Model. arXiv:2412.06849

  26. [26]

    Hewitt, J., and Manning, C. D. (2019). A Structural Probe for Finding Syntax in Word Representations. In NAACL-HLT, pp. 4129–4138

  27. [27]

    Belinkov, Y., and Glass, J. (2019). Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 7:49–72

  28. [28]

    Voita, E., and Titov, I. (2020). Information-Theoretic Probing with Minimum Description Length. In EMNLP, pp. 183–196

  29. [29]

    Zheng, Y., Zhang, Z., Wang, Z., Li, X., Luan, S., Peng, X., and Chen, L. (2025). Disentangling and Re-evaluating The Effectiveness of Graph Structure Learning For GNNs. In NeurIPS Datasets and Benchmarks Track. OpenReview brvLHfbSQX

  30. [30]

    Xu, H., You, Y., and Ma, T. (2025). When Structure Doesn’t Help: LLMs Do Not Read Text-Attributed Graphs as Effectively as We Expected. arXiv:2511.16767

    A LLM Feature Generation Details. Text-attributed datasets. For Cora, CiteSeer, PubMed, WikiCS, and ogbn-arxiv we reuse the per-node TAPE explanations released by Wu et al. [5] (generated with GPT-4o-min...

  31. [31]

    1. TOPIC / 2. CATEGORY (or TYPE for Texas) / 3. CONTEXT / 4. KEYWORDS: 5 keywords. For Amazon-Ratings, the initial smoke-test agent wrote a deterministic template generator (scripts/gen_amazon.py) that inserts numerical feature values into one of sixteen product-domain topic templates indexed by (node_id, L 2). The full V3 validation in Appendix F shows this r...

  32. [32]

    1. TOPIC: An 88-word isolated page with no hyperlinks, resembling a moderately detailed but disconnected personal write-up. 2. TYPE: student. 3. CONTEXT: A standalone participant whose content is visible but decoupled from the topology. 4. KEYWORDS: moderate bio, no links, disconnected, personal write-up, self-contained actor (fresh Sonnet, node index 346 in...

  33. [33]

    Threshold behavior in M_l (equivalently ∆_sig). At fixed (n, d_o, d_l), the sign of ∆_concat flips with M_l around a threshold depending on n. This is F2 in the main paper (Fig. 2)

  34. [34]

    Monotone decay in 1/√n. At fixed (M_l, d_o, d_l), the penalty term scales as 1/√n, so |∆_concat| should decay monotonically with training-set size n. This is validated empirically by the train-fraction curves in Appendix J. Both predictions are qualitative: we do not estimate C_1, C_2 or ∥Σ∥_op from data. The role of this analysis is to place the empirical observations...