GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

Hanlin Gu; Jingzhou Jiang; Lixin Fan; Yi Yang; Zhuoyi Peng

arxiv: 2606.11562 · v1 · pith:WZTK36PDnew · submitted 2026-06-10 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-27 10:39 UTC · model grok-4.3

keywords: graph inference · LLM benchmarking · graph neural networks · neighborhood reasoning · graph QA · community detection · masked prediction

The pith

No LLM method family matches plain GNNs on neighborhood graph inference tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GraphInfer-Bench to measure whether language models can produce answers that depend on a node's full neighborhood rather than any single node or retrievable path. It defines five tasks split between description of a region and comparison between regions, built on six real graphs with 42,000 samples that passed a four-layer quality screen. Four method families are tested on identical tasks: graph-token alignment models, zero-shot frontier LLMs, Graph2Text supervised fine-tuning, and plain GNNs as reference. The results show that every LLM-based approach leaves a performance gap, while plain GNNs equal or exceed the best LLM row on all tasks and widen the margin most on community detection.

Core claim

GraphInfer-Bench shows that no evaluated LLM family closes the inference gap: graph-token alignment works on some description tasks but fails comparisons, frontier LLMs lead among LLM methods on outlier detection and community partition yet lag on masked-node prediction, Graph2Text SFT is strongest on description but trails frontier models on comparison, and plain GNNs match or beat every LLM-based result across the board.

What carries the argument

GraphInfer-Bench, a collection of five tasks (relational description, theme description, outlier detection, community partition, masked-node prediction) where ground truth lives only in collective neighborhood structure.

If this is right

Graph-token alignment partially solves relational and theme description but collapses on comparison tasks.
Frontier closed-source LLMs lead LLM-based methods on outlier detection and community partition yet remain weaker on masked-node prediction.
Graph2Text supervised fine-tuning is the strongest LLM approach on description tasks but falls behind frontier LLMs on comparison tasks.
The margin of plain GNNs over LLM-based methods is largest on community detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The persistent gap suggests that current LLM architectures may lack built-in mechanisms for neighborhood aggregation that GNN layers supply directly.
The benchmark tasks could serve as a testbed for hybrid models that route structural computation through GNN layers before language generation.
If the quality screen truly isolates neighborhood inference, similar construction methods could be applied to other graph domains such as molecular or citation networks.
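As a rough illustration of the neighborhood aggregation mentioned above, the following sketch computes one round of mean aggregation in plain Python. It is not taken from the paper: the function name `mean_aggregate`, the toy path graph, and the feature values are all assumptions made for illustration, and real GNN layers add learned weights and nonlinearities on top of this step.

```python
def mean_aggregate(features, edges):
    """One round of mean neighborhood aggregation on an undirected graph.

    features: list of per-node feature vectors (lists of floats).
    edges:    list of (u, v) node-index pairs.
    Each node's output is the average of its own vector and its
    neighbors' vectors (a self-loop is included), with no learned weights.
    """
    n = len(features)
    neighbors = [{i} for i in range(n)]      # start with self-loops
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    dim = len(features[0])
    out = []
    for i in range(n):
        group = neighbors[i]
        out.append([sum(features[j][d] for j in group) / len(group)
                    for d in range(dim)])
    return out


# Toy path graph 0 - 1 - 2 with one made-up scalar feature per node.
x = [[0.0], [3.0], [6.0]]
h = mean_aggregate(x, [(0, 1), (1, 2)])
print(h)
```

After one round, each node's value already mixes in its neighbors' values, which is the kind of structural signal the editorial note suggests LLMs do not get for free.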

Load-bearing premise

The five tasks are built so that correct answers cannot be found in any single node or along any path, and the quality-control protocol ensures this property holds.
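One way to picture the simplest part of this premise is a screen that rejects any sample whose gold answer already appears verbatim in a single node's text. The sketch below is only a crude stand-in under that assumption; it is not the paper's four-layer protocol, and the names `single_node_leaks` and `passes_single_node_screen` and the toy node texts are hypothetical.

```python
def single_node_leaks(gold_answer, node_texts):
    """Return ids of nodes whose own text already contains the gold answer.

    gold_answer: the reference answer string for one sample.
    node_texts:  dict mapping node id -> that node's text attribute.
    A non-empty result means the answer can be read off a single node.
    """
    needle = gold_answer.strip().lower()
    return [nid for nid, text in node_texts.items() if needle in text.lower()]


def passes_single_node_screen(gold_answer, node_texts):
    """True when no single node's text contains the gold answer."""
    return not single_node_leaks(gold_answer, node_texts)


# Made-up neighborhood of three text-attributed nodes.
toy_nodes = {
    0: "Paper on graph neural networks",
    1: "Paper on message passing",
    2: "Survey of attention models",
}
print(passes_single_node_screen("graph learning", toy_nodes))   # no node states it
print(single_node_leaks("message passing", toy_nodes))          # node 1 leaks it
```

A substring match only catches the most literal leaks; paraphrased leaks and answers recoverable along a path would need stronger checks than this.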

What would settle it

An LLM-based method achieving higher accuracy than the best GNN baseline on community partition or masked-node prediction within the released GraphInfer-Bench data would falsify the central result.
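The criterion above can be written as a small comparison over per-task scores. The sketch below assumes scores are stored as nested dictionaries (task, then method, then accuracy); the function name `falsifying_tasks`, the method labels, and every number are placeholders invented for illustration and do not come from the paper.

```python
def falsifying_tasks(llm_scores, gnn_scores,
                     tasks=("community partition", "masked-node prediction")):
    """Return the tasks where the best LLM-based score beats the best GNN score.

    llm_scores / gnn_scores: dict mapping task name -> {method name: accuracy}.
    A non-empty result would contradict the claim that plain GNNs match or
    beat every LLM-based method on those tasks.
    """
    hits = []
    for task in tasks:
        best_llm = max(llm_scores[task].values())
        best_gnn = max(gnn_scores[task].values())
        if best_llm > best_gnn:
            hits.append(task)
    return hits


# Placeholder numbers only; they are NOT results from the paper.
toy_llm = {
    "community partition": {"llm_a": 0.40, "llm_b": 0.55},
    "masked-node prediction": {"llm_a": 0.30, "llm_b": 0.35},
}
toy_gnn = {
    "community partition": {"gnn_a": 0.70},
    "masked-node prediction": {"gnn_a": 0.33},
}
print(falsifying_tasks(toy_llm, toy_gnn))
```

A strict inequality is used because the paper's claim is that GNNs match or beat the LLM rows, so a tie would not count against it.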

Figures

Figures reproduced from arXiv: 2606.11562 by Hanlin Gu, Jingzhou Jiang, Lixin Fan, Yi Yang, Zhuoyi Peng.

**Figure 2.** Reasoning-supervision ablation. "full" is the main-table recipe (Answer plus Reasoning); "w/o" strips the Reasoning paragraph. n=300 per task. Per-domain numbers for the full bars are in Tab. 14 (LLaGA, TEA-GLM, GOFA) and Tab. 18 (SFT). The full-LLM gradient of SFT lets the model discriminate from graph-text alone, and removing the free-form rationale removes a competing gradient. T2 (yes/no) al…

**Figure 3.** Projector and encoder pretraining ablation on TEA-GLM, GOFA, and RGLM. Solid bars …

**Figure 4.** Per-task prompt templates. The five tasks share a single skeleton (graph header + ques…

**Figure 5.** Layer-2 judge prompt. Both Llama-3.1-70B-AWQ and Qwen-2.5-72B-AWQ receive an …

**Figure 6.** Per-cell gold-label distribution after the L4 cap (1,400 samples / cell). Each donut is …

Original abstract

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GraphInfer-Bench targets neighborhood-level graph inference with a new task set, but the abstract supplies no metrics or construction details to verify the core claim.

The paper introduces GraphInfer-Bench to test whether LLMs can answer graph questions whose ground truth requires neighborhood aggregation instead of single-node lookup or path retrieval. It defines five description and comparison tasks across six real graphs, generates 42,000 samples, applies a four-layer quality-control filter, and runs four method families: graph-token alignment, zero-shot frontier LLMs, Graph2Text SFT, and plain GNNs.

What is new is the explicit framing that existing graph-QA setups (algorithm simulation, node classification, KG-QA, GraphRAG) all permit answers retrievable from a single node or along a path, so the benchmark tries to close that loophole. The evaluation is straightforward and includes a structural baseline, which is useful. The reported ordering, in which GNNs match or exceed the best LLM rows on every task with the biggest gap on community detection, is the kind of concrete comparison that can guide follow-up work.

The main limitation is that the abstract states the performance ordering and the QC protocol but gives no numbers, error bars, task templates, or sample traces. Without those, it is impossible to check whether the tasks actually force neighborhood inference or whether some fraction still admits local solutions. The stress-test concern about path-retrievable items therefore stands until the full construction details appear.

This is for people building or evaluating LLM systems on relational data in biology, security, or recommendation settings. A reader who wants to see how current LLM families compare to GNNs on multi-node tasks will find the setup worth examining. The paper deserves peer review so referees can inspect the task predicates and the raw results; the idea is clear enough that the details should be checked rather than dismissed outright.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GraphInfer-Bench, a benchmark with 42,000 samples on five tasks (relational, theme, outlier detection, community partition, masked-node prediction) across six graphs. Tasks are constructed so ground truth requires neighborhood aggregation, not single-node lookup or path retrieval. Evaluations of graph-token alignment, frontier LLMs, Graph2Text SFT, and GNNs show no LLM family closes the gap, with GNNs matching or exceeding LLM performance, particularly on community detection.

Significance. Should the benchmark tasks genuinely demand multi-node inference as claimed, the result that plain GNNs outperform LLM-based methods would establish graph inference as an open problem for LLMs, providing a clear baseline and motivating research into better graph reasoning capabilities in language models.

major comments (2)

[Abstract] The headline result that 'plain GNNs match or beat the strongest LLM-based row' is stated without any numerical metrics, error bars, or statistical tests; the abstract supplies only qualitative ordering, preventing verification of the claimed margins (e.g., largest on community detection).
[Abstract] The four-layer quality-control protocol is asserted to ensure 'ground truth lives in no single node' and 'no path retrieves', yet no formal predicate, exhaustive path-enumeration procedure, or concrete sample traces are provided to demonstrate that neighborhood aggregation is indispensable rather than surface properties.
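The error bars and statistical tests requested in the first major comment could, for example, come from a paired bootstrap over the shared evaluation samples. The sketch below is one generic way to do that with the Python standard library; it is not the authors' procedure, and the function name `paired_bootstrap_ci`, its defaults, and the toy correctness vectors are assumptions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a / correct_b: lists of 0/1 flags, one per evaluation sample,
    aligned so index i refers to the same sample for both systems.
    Returns (low, high) bounds of the (1 - alpha) interval.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same samples")
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample sample indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Made-up per-sample correctness flags for two hypothetical systems.
gnn_correct = [1, 1, 0, 1, 0, 1, 1, 0]
llm_correct = [1, 0, 0, 1, 0, 0, 1, 0]
print(paired_bootstrap_ci(gnn_correct, llm_correct))
```

An interval that excludes zero would support a claimed margin; an interval that straddles zero would suggest the ordering on that task is not yet established.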

minor comments (1)

[Abstract] The task names (e.g., 'relational, theme') are introduced without brief definitions or references to their construction details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will incorporate revisions to improve clarity and verifiability while preserving the manuscript's core claims.

Referee: [Abstract] The headline result that 'plain GNNs match or beat the strongest LLM-based row' is stated without any numerical metrics, error bars, or statistical tests; the abstract supplies only qualitative ordering, preventing verification of the claimed margins (e.g., largest on community detection).

Authors: We agree that the abstract would benefit from quantitative support for the headline claim. The full manuscript already reports exact performance metrics, standard deviations across multiple runs, and task-specific margins in Tables 2-4, with the largest gap on community partition. In revision we will condense key numerical results (including approximate margins) and a note on statistical comparisons into the abstract to enable direct verification without altering the qualitative ordering.

Revision: yes

Referee: [Abstract] The four-layer quality-control protocol is asserted to ensure 'ground truth lives in no single node' and 'no path retrieves', yet no formal predicate, exhaustive path-enumeration procedure, or concrete sample traces are provided to demonstrate that neighborhood aggregation is indispensable rather than surface properties.

Authors: The four-layer protocol and its rationale are detailed in the Benchmark Construction section, which specifies the checks used to confirm that answers cannot be obtained from a single node or a simple path. To strengthen the presentation, we will add an explicit formal predicate for the required neighborhood-aggregation property, a high-level description of the path-enumeration verification step, and one or two concrete sample traces (with node/edge annotations) either in the main text or as supplementary material. This addresses the request for more rigorous demonstration while remaining consistent with the existing construction.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

The paper introduces GraphInfer-Bench as an empirical evaluation suite across five tasks on real graphs, reports direct performance numbers for four method families, and draws comparative conclusions from those measurements. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or self-citation load-bearing steps appear in the reported methodology or results. The central claim (no LLM family closes the gap to GNNs) rests on tabulated accuracies rather than any reduction to prior self-referential constructs, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark rather than a theoretical derivation, so it introduces no free parameters or invented entities; it rests on the domain assumption that certain real-world inferences require neighborhood information not contained in any single node.

axioms (1)

Domain assumption: Certain inferences on real-world graphs require information aggregated from a node's neighborhood rather than contained in any single node or path.
This premise underpins the claim that existing graph-QA protocols are insufficient and that the new tasks test a distinct capability.

pith-pipeline@v0.9.1-grok · 5847 in / 1281 out tokens · 43774 ms · 2026-06-27T10:39:40.795401+00:00 · methodology

[1] [1]

Claude Opus 4.7 System Card.https://www.anthropic.com/ claude-opus-4-7-system-card, April 2026

Anthropic. Claude Opus 4.7 System Card.https://www.anthropic.com/ claude-opus-4-7-system-card, April 2026

2026

[2] [2]

Kqa pro: A dataset with explicit compositional programs for complex question answering over knowledge base

Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Hanwang Zhang. Kqa pro: A dataset with explicit compositional programs for complex question answering over knowledge base. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6101–6119, 2022. 9

2022

[3] [3]

Graphwiz: An instruction-following lan- guage model for graph computational problems

Nuo Chen, Yuhan Li, Jianheng Tang, and Jia Li. Graphwiz: An instruction-following lan- guage model for graph computational problems. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 353–364, 2024

2024

[4] [4]

LLaGA: Large language and graph assistant

Runjin Chen, Tong Zhao, Ajay Kumar Jaiswal, Neil Shah, and Zhangyang Wang. LLaGA: Large language and graph assistant. InProceedings of the 41st International Conference on Machine Learning (ICML), pages 7809–7823, 2024

2024

[5] [5]

Exploring the potential of large language models (llms) in learning on graphs.ACM SIGKDD Explorations Newsletter, 25(2):42–61, 2024

Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. Exploring the potential of large language models (llms) in learning on graphs.ACM SIGKDD Explorations Newsletter, 25(2):42–61, 2024

2024

[6] [6]

Talk like a graph: Encoding graphs for large language models.arXiv preprint arXiv:2310.04560, 2023

Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. Talk like a graph: Encoding graphs for large language models.arXiv preprint arXiv:2310.04560, 2023

work page arXiv 2023

[7] [7]

Beyond iid: three levels of generalization for question answering on knowledge bases

Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond iid: three levels of generalization for question answering on knowledge bases. InProceedings of the web conference 2021, pages 3477–3488, 2021

2021

[8] [8]

Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmark- ing,

Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and bench- marking.arXiv preprint arXiv:2305.15066, 2023

work page arXiv 2023

[9] [9]

G-retriever: Retrieval-augmented generation for textual graph un- derstanding and question answering.Advances in Neural Information Processing Systems, 37: 132876–132907, 2024

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-retriever: Retrieval-augmented generation for textual graph un- derstanding and question answering.Advances in Neural Information Processing Systems, 37: 132876–132907, 2024

2024

[10] [10]

Systematic integration of biomedical knowledge prioritizes drugs for repurposing.elife, 6:e26726, 2017

Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing.elife, 6:e26726, 2017

2017

[11] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.

[12] Bowen Jin, Chulin Xie, Jiawei Zhang, Kashob Kumar Roy, Yu Zhang, Zheng Li, Ruirui Li, Xianfeng Tang, Suhang Wang, Yu Meng, et al. Graph chain-of-thought: Augmenting large language models by reasoning on graphs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 163–184, 2024.

[13] Lecheng Kong, Jiarui Feng, Hao Liu, Chengsong Huang, Jiaxin Huang, Yixin Chen, and Muhan Zhang. Gofa: A generative one-for-all model for joint graph language modeling. arXiv preprint arXiv:2407.09709, 2024.

[14] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.

[15] Yuhan Li, Peisong Wang, Xiao Zhu, Aochuan Chen, Haiyun Jiang, Deng Cai, Victor W Chan, and Jia Li. Glbench: A comprehensive benchmark for graph with large language models. Advances in Neural Information Processing Systems, 37:42349–42368, 2024.

[16] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024.

[17] Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all: Towards training one graph model for all classification tasks. arXiv preprint arXiv:2310.00149, 2023.

[18] Zihan Luo, Xiran Song, Hong Huang, Jianxun Lian, Chenhao Zhang, Jinqi Jiang, Xing Xie, and Hai Jin. Graphinstruct: Empowering large language models with graph understanding and reasoning capability. arXiv preprint arXiv:2403.04483, 2024.

[19] Péter Mernyei and Cătălina Cangea. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901, 2020.

[20] Soroor Motie and Bijan Raahemi. Financial fraud detection using graph neural networks: A systematic review. Expert Systems with Applications, 240:122156, 2024.

[21] Mark EJ Newman. Communities, modules and large-scale structure in networks. Nature physics, 8(1):25–31, 2012.

[22] OpenAI. GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.

[23] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, and Jonathan Halcrow. Let your graph do the talking: Encoding structured data for llms. arXiv preprint arXiv:2402.05862, 2024.

[24] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.

[25] Stack Exchange, Inc. Stack Exchange Data Dump. Internet Archive, https://archive.org/details/stackexchange, 2024. Licensed under CC BY-SA 4.0; Physics Stack Exchange subset used in this work.

[26] Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, 2018.

[27] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. Graphgpt: Graph instruction tuning for large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 491–500, 2024.

[28] Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, and Jia Li. Grapharena: Evaluating and exploring large language models on graph computation. arXiv preprint arXiv:2407.00379, 2024.

[29] Duo Wang, Yuan Zuo, Fengzhi Li, and Junjie Wu. Llms as zero-shot graph learners: Alignment of gnn representations with llm token embeddings. Advances in neural information processing systems, 37:5950–5973, 2024.

[30] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. Can language models solve graph problems in natural language? Advances in Neural Information Processing Systems, 36:30840–30861, 2023.

[31] Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N Ioannidis, Karthik Subbian, James Zou, and Jure Leskovec. Stark: Benchmarking llm retrieval on textual and relational knowledge bases. Advances in Neural Information Processing Systems, 37:127129–127153, 2024.

[32] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. Graph neural networks in recommender systems: a survey. ACM computing surveys, 55(5):1–37, 2022.

[33] Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. When to use graphs in rag: A comprehensive analysis for graph retrieval-augmented generation. arXiv preprint arXiv:2506.05690, 2025.

[34] Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian-wen Zhang, Di Yin, Xing Sun, and Xiao Huang. Graphrag-bench: Challenging domain-specific reasoning for evaluating graph retrieval-augmented generation. arXiv preprint arXiv:2506.02404, 2025.

[35] Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. Crag-comprehensive rag benchmark. Advances in Neural Information Processing Systems, 37:10470–10490, 2024.

[36] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. Language is all a graph needs. In Findings of the association for computational linguistics: EACL 2024, pages 1955–1973, 2024.

[37] Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206, 2016.

[38] Zike Yuan, Ming Liu, Hui Wang, and Bing Qin. Gracore: Benchmarking graph comprehension and complex reasoning in large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7925–7948, 2025.

[39] Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander Smola, and Le Song. Variational reasoning for question answering with knowledge graph. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

[40] Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, and Chuan Shi. Toward graph-tokenizing large language models with reconstructive graph instruction tuning. In Proceedings of the ACM Web Conference 2026, pages 430–441, 2026.

A Limitations

GRAPHINFER-BENCH currently spans six text-attributed graph domains: academic citations, e-commerce co-purchase...