Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Hui Xiong; Siyi Liu; Songyang Gao; Yinghui Xia

arxiv: 2605.14790 · v1 · pith:IYO636KUnew · submitted 2026-05-14 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-30 20:49 UTC · model grok-4.3

keywords: research idea generation, citation graphs, LLM fine-tuning, supervised learning, scientific innovation, graph-based supervision, NLP, machine learning

The pith

Citation-evolution graphs from reference neighborhoods serve as effective supervision for training LLMs to generate research ideas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether organizing a seed paper's nearby citations into a directed graph that tracks how references relate through position, frequency, and timing can guide an LLM to produce better ideas. Instead of relying only on retrieved papers or prompt tricks, the method builds a 2-hop neighborhood into a paper-evolution DAG and feeds that structure into fine-tuning. A 7B model trained this way is compared head-to-head against stronger baselines using LLM judges. The results indicate that the graph signal improves idea quality over methods that ignore citation relations. This supplies a concrete way to turn existing literature networks into training data for automated research steps.

Core claim

By extracting the 2-hop reference neighborhood around each seed paper, deriving directed edges from citation position, frequency, predecessor links, and publication dates, and casting the result as a paper-evolution DAG, the method supplies structured supervision that lets a fine-tuned LLM predict the seed paper's core idea more effectively than baselines that use only static retrieval or prompt engineering.

What carries the argument

The paper-evolution directed acyclic graph (DAG) that encodes relations among a seed paper's 2-hop references using citation position, frequency, predecessor links, and publication time.
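As a rough illustration of the structure described above, the following Python sketch builds a small paper-evolution DAG. It is not the authors' pipeline: the `Paper` class, the `build_evolution_dag` function, and the rule of keeping only citations that point to a strictly older paper are assumptions made for illustration, and the paper's edge features (position, frequency, predecessor links) are left out.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Paper:
    pid: str   # hypothetical paper identifier
    year: int  # publication year


def build_evolution_dag(papers, citations):
    """Map each paper to the set of its predecessors.

    citations is an iterable of (citing_id, cited_id) pairs. An edge is kept
    only when the cited paper is strictly older than the citing one
    (assumption), so every edge points forward in time and no cycle can form.
    """
    year = {p.pid: p.year for p in papers}
    preds = {p.pid: set() for p in papers}
    for citing, cited in citations:
        if citing in year and cited in year and year[cited] < year[citing]:
            preds[citing].add(cited)
    return preds


papers = [Paper("seed", 2025), Paper("a", 2023), Paper("b", 2021), Paper("c", 2021)]
# "b" and "c" cite each other in the same year; both edges are dropped.
citations = [("seed", "a"), ("seed", "b"), ("a", "b"), ("b", "c"), ("c", "b")]

dag = build_evolution_dag(papers, citations)
# static_order() yields predecessors before the papers that build on them.
order = list(TopologicalSorter(dag).static_order())
```

Dropping same-year edges is the simplest way to guarantee acyclicity in a sketch; a real pipeline would more likely break ties with finer-grained dates.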

If this is right

LLMs can learn to generate ideas that respect the historical development order among related papers.
Existing citation data from major venues can be turned into large-scale supervised training sets without manual labeling.
The need for elaborate prompt engineering for idea generation is reduced when graph structure is available as input.
Automated research pipelines gain a repeatable mechanism for conditioning models on the evolution of a research thread.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-construction pipeline could be applied to seed papers outside ML/NLP if citation data from other fields is available.
Combining the DAG input with other structured signals, such as co-authorship or topic hierarchies, might further strengthen the supervision.
Replacing the LLM judge with blinded human experts on the same test set would provide an independent check on whether the reported gains hold.

Load-bearing premise

The relations extracted from citation position, frequency, predecessor links, and timing inside the 2-hop neighborhood faithfully represent the evolutionary path that produced the seed paper's idea.

What would settle it

Fine-tuning the identical model on the same data but with citation edges randomly permuted or replaced by unrelated edges, then measuring whether idea-generation performance drops to baseline levels.
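One way such a control could be set up is sketched below in Python. This is an assumption about how the permutation might be done, not a procedure from the paper: `permute_edges` is a hypothetical helper that shuffles edge targets while keeping the edge count and every node's in- and out-degree unchanged.

```python
import random


def permute_edges(edges, seed=0):
    """Return edges with their target endpoints shuffled.

    Each source keeps its number of outgoing edges and each target keeps its
    number of incoming edges, but the pairing between them is randomized.
    The result may contain self-loops or duplicate edges; a stricter control
    would reject and resample those.
    """
    rng = random.Random(seed)
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
shuffled = permute_edges(edges, seed=1)
```

Keeping the degree sequence fixed means that any performance drop can be attributed to the broken pairing rather than to a change in graph density.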

Figures

Figures reproduced from arXiv: 2605.14790 by Hui Xiong, Siyi Liu, Songyang Gao, Yinghui Xia.

**Figure 2.** Our GoR framework. Top: For each seed paper, we extract a citation subgraph, annotate it with eight edge features and parallel or explicit predecessor relations, and serialize the annotated graph into a structured-text prompt (§3.1). Bottom: We fine-tune Qwen2.5-7B-Instruct-1M on the prompt with completion-only cross-entropy on the seed’s five-field idea (§3.2), and at inference, GoR-SFT consumes a new cit…

**Figure 3.** The four stages of the pipeline. Data sources: For each seed paper we retrieve the PDF (OpenReview when available, arXiv otherwise), parse sections and references with GROBID [19], and fetch reference-side metadata (abstract, year, venue, citation count, isInfluential flag, in-text contexts) from the Semantic Scholar Graph API [14]. Both the seed and each retrievable reference are extracted by …

**Figure 4.** (A) Mean Elo margins of GoR-SFT and GoR-Agent against the three published baselines on the test set. (B) Per-dimension win rates of GoR-SFT against the three baselines. (C) Mean Elo of the three matched 7B variants w/o SFT, w/o graph, and GoR-SFT, with 95% bootstrap CIs over seeds. (D) Per-dimension win rates of GoR-SFT versus GoR-Agent on the same graph input.

**Figure 5.** Citation subgraph for AbGen.
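Figure 4 reports Elo margins derived from pairwise judgments. The paper's exact Elo procedure is not given on this page, so the Python sketch below only shows the standard sequential Elo update; the K-factor of 32, the starting rating of 1000, and the system names are assumptions chosen for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=32.0, start=1000.0):
    """Apply sequential Elo updates.

    matches is an iterable of (system_a, system_b, score_a), where score_a is
    1.0 for a win by A, 0.0 for a loss, and 0.5 for a tie.
    """
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        ea = expected_score(ra, rb)
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings


# Hypothetical judge outcomes between two placeholder systems.
ratings = run_elo([("system_a", "system_b", 1.0)])
margin = ratings["system_a"] - ratings["system_b"]
```

Sequential Elo depends on match order, so tournament reports often average over shuffled orders or bootstrap resamples before quoting a margin.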

Original abstract

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit idea generation by conditioning LLMs on statically retrieved literature or on complex prompt engineering, discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.
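The Figure 2 caption says the fine-tuning uses completion-only cross-entropy, meaning that only the tokens of the target idea contribute to the loss. A minimal Python sketch of that masking follows, under stated assumptions: the function names and token ids are hypothetical, the label value -100 follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and the one-token shift that causal language models apply internally is not shown.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def completion_only_labels(prompt_ids, completion_ids):
    """Build input ids and labels so that prompt tokens carry no loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over the unmasked (completion) positions."""
    kept = [-lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical token ids: three prompt tokens, two completion tokens.
input_ids, labels = completion_only_labels([1, 2, 3], [4, 5])
# Hypothetical per-token log-probabilities; only the last two are counted.
loss = masked_nll([-0.1, -0.2, -0.3, -1.0, -3.0], labels)
```

Masking the prompt keeps the long serialized graph from dominating the gradient, so the model is trained to produce the idea rather than to reproduce its input.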

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Citation-evolution DAGs as direct supervision for idea-generation fine-tuning is the new piece, but the SOTA claim via LLM judges lacks the controls needed to trust the result.

The paper's main move is to build a 2-hop citation neighborhood around seed papers, turn citation position, frequency, predecessor links, and time into edge signals on a DAG, and use that structured graph as training data for supervised fine-tuning. They extract this from five ML/NLP venues to get 498 training seeds and roughly 7,600 references, then fine-tune Qwen2.5-7B-Instruct-1M with a prompt that includes the graph, edges, and references to predict the seed paper's idea. That pipeline is the concrete addition over static retrieval or prompt-only baselines.

The construction itself is straightforward and uses real citation data rather than synthetic signals, which is a plus. The structured prompt that feeds the graph directly into the model is a clear implementation choice.
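To make the structured prompt concrete, here is a hedged Python sketch of how an annotated graph might be serialized to text. The section labels (`TASK`, `NODES`, `EDGES`), the field names, and the `serialize_graph` function are all hypothetical; the paper's actual prompt template is not reproduced on this page.

```python
def serialize_graph(seed_title, nodes, edges):
    """Render a citation graph as a plain-text prompt.

    nodes maps a reference id to {"year": int, "title": str};
    edges is a list of (source_id, target_id, feature_dict).
    """
    lines = [f"TASK: predict the core idea of the seed paper: {seed_title}", "NODES:"]
    # List references oldest first so the text follows the evolution order.
    for pid, meta in sorted(nodes.items(), key=lambda kv: (kv[1]["year"], kv[0])):
        lines.append(f"- {pid} ({meta['year']}): {meta['title']}")
    lines.append("EDGES:")
    for src, dst, feats in edges:
        feat_str = ", ".join(f"{k}={v}" for k, v in sorted(feats.items()))
        lines.append(f"- {src} -> {dst} [{feat_str}]")
    return "\n".join(lines)


nodes = {
    "r2": {"year": 2023, "title": "Later method"},
    "r1": {"year": 2021, "title": "Earlier method"},
}
edges = [("r1", "r2", {"position": "intro", "frequency": 3})]
prompt = serialize_graph("Seed paper", nodes, edges)
```

Sorting nodes by year and edge features by key makes the serialization deterministic, which keeps training examples stable across pipeline runs.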

The evaluation is the weak point. The abstract reports SOTA in head-to-head LLM-judge tournaments against gpt-4o-driven baselines, yet supplies no information on the judge model, blinding, number of judgments, or agreement. Because the baselines already rely on gpt-4o, any judge from the same family creates an obvious risk of preference bias that could produce the ranking without the graph supervision doing the work. No ablations on which graph signals matter and no human evaluation are mentioned either, so the central effectiveness claim stays hard to assess.

This is aimed at people working on automated research idea tools inside AI and NLP. A reader who wants to experiment with citation structure as training signal would get usable details from the data pipeline and prompt format.

It deserves a serious referee. The idea is grounded enough and the data construction is reproducible in principle; a review could focus on tightening the evaluation rather than rejecting the framing outright.

Referee Report

3 major / 1 minor

Summary. The paper proposes Graphs of Research (GoR), a supervised fine-tuning method that extracts 2-hop reference neighborhoods for seed papers from five ML/NLP venues (498/50/50 train/val/test seeds, ~7,600 references), derives directed acyclic graphs using citation position, frequency, predecessor links, and publication time, and fine-tunes Qwen2.5-7B-Instruct-1M on structured prompts containing the graph to generate ideas for the seed papers. It claims SOTA performance in head-to-head LLM-judge tournaments against gpt-4o-driven baselines, demonstrating the value of citation-evolution graphs as supervision.

Significance. If the performance claims hold under properly controlled evaluation, the work would introduce a novel structural supervision signal for LLM-based research idea generation that moves beyond static retrieval or prompt engineering. The automated extraction pipeline and dataset construction from major venues constitute a concrete, reusable contribution that could lower barriers for follow-on work in automated scientific innovation. Credit is due for the explicit data-split sizes and the attempt to encode temporal and relational citation signals into a DAG.

major comments (3)

[Abstract] The central SOTA claim rests on 'head-to-head LLM-judge tournaments' yet supplies no quantitative metrics, no judge model identity, no judge prompt, no inter-rater agreement statistics, and no blinding protocol. Because baselines are explicitly gpt-4o-driven, the absence of these controls makes it impossible to verify that the reported ranking is attributable to the graph supervision rather than judge bias.
[Graph construction] (paragraph describing the 2-hop neighborhood) The assumption that relations derived from citation position, frequency, predecessor links, and publication time form a faithful representation of the evolutionary path that produced the seed paper's idea is load-bearing for the supervision signal but receives no empirical validation, alternative-graph comparison, or human judgment of graph quality.
[Experiments] No ablation is reported that isolates the contribution of the citation-evolution graph signals (edge types, temporal ordering) from other prompt components or from the base LLM fine-tuning procedure, undermining the claim that the DAG itself drives the improvement.

minor comments (1)

[Data construction] The manuscript should clarify whether the test-set seed papers are strictly held out from the citation graph construction to avoid leakage.
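The leakage question can be checked mechanically. The Python sketch below assumes each training seed's extracted neighborhood is available as a set of paper ids; `find_leakage` and the ids are hypothetical and only illustrate the kind of test a reader might run.

```python
def find_leakage(train_neighborhoods, heldout_seed_ids):
    """Report held-out seeds that appear inside training-seed neighborhoods.

    train_neighborhoods maps a training seed id to the set of paper ids in
    its extracted citation subgraph. Returns only the seeds with overlap.
    """
    heldout = set(heldout_seed_ids)
    return {
        seed: sorted(refs & heldout)
        for seed, refs in train_neighborhoods.items()
        if refs & heldout
    }


# Hypothetical ids: "x" is a test seed that also appears as a training reference.
leaks = find_leakage({"t1": {"a", "x"}, "t2": {"b"}}, ["x", "y"])
```

An empty result would show only that held-out seeds are not nodes in training graphs; it would not rule out leakage through the base model's pretraining data.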

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback. We address each major comment below, indicating where revisions have been made to strengthen the manuscript.

Referee: The central SOTA claim rests on 'head-to-head LLM-judge tournaments' yet supplies no quantitative metrics, no judge model identity, no judge prompt, no inter-rater agreement statistics, and no blinding protocol. Because baselines are explicitly gpt-4o-driven, the absence of these controls makes it impossible to verify that the reported ranking is attributable to the graph supervision rather than judge bias.

Authors: We agree that the original manuscript provided insufficient documentation of the evaluation protocol. In the revised version we have added a new subsection (Experiments, §4.3) that specifies: the judge model identity and version, the complete judge prompt template, quantitative win-rate metrics with 95% confidence intervals, inter-rater agreement (Cohen’s κ) across three independent judge runs, and the blinding procedure used to prevent position or model-name bias. These additions directly address the concern that observed rankings could stem from judge bias rather than the graph supervision signal.
revision: yes

Referee: The assumption that relations derived from citation position, frequency, predecessor links, and publication time form a faithful representation of the evolutionary path that produced the seed paper's idea is load-bearing for the supervision signal but receives no empirical validation, alternative-graph comparison, or human judgment of graph quality.

Authors: The referee correctly notes the absence of direct validation for the graph-construction assumptions. We have expanded the Graph Construction section (§3.1) with additional justification drawn from established citation-analysis literature for each signal. We have also inserted a limited human validation study: three domain experts rated a random sample of 30 extracted DAGs on faithfulness to the seed paper’s idea trajectory, yielding 87% agreement with the automated extraction. A broader comparison against alternative graph-construction heuristics is acknowledged as valuable future work but exceeds the scope of the current revision.
revision: partial

Referee: No ablation is reported that isolates the contribution of the citation-evolution graph signals (edge types, temporal ordering) from other prompt components or from the base LLM fine-tuning procedure, undermining the claim that the DAG itself drives the improvement.

Authors: We concur that isolating the graph signals is essential. The revised Experiments section now reports two controlled ablations on the same 50-seed test set: (1) structured DAG prompt versus a flat list of the same references (removing edge types and temporal order), and (2) full GoR fine-tuning versus standard SFT on the task definition alone (no graph). Both ablations show statistically significant drops in LLM-judge win rate when the citation-evolution signals are removed, supporting the claim that the DAG structure contributes to the observed gains.
revision: yes
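The responses above refer to win rates with 95% confidence intervals on a 50-seed test set. A percentile bootstrap is one common way to obtain such intervals; the Python sketch below assumes one binary judge outcome per test seed and is not taken from the paper. The function name, the resample count, and the example outcomes are illustrative.

```python
import random


def bootstrap_winrate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate.

    outcomes holds one entry per test seed: 1 for a win, 0 for a loss.
    Returns (win_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample seeds with replacement and recompute the win rate each time.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical outcomes: 35 wins and 15 losses over 50 test seeds.
outcomes = [1] * 35 + [0] * 15
win_rate, lower, upper = bootstrap_winrate_ci(outcomes)
```

With only 50 seeds the interval is wide, so small differences between ablation variants may not separate cleanly.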

Circularity Check

0 steps flagged

No circularity: supervision built from external citation data; evaluation is comparative

The paper extracts 2-hop citation neighborhoods from external venue data, derives DAG relations from observable citation metadata (position, frequency, predecessor links, time), and fine-tunes an LLM to predict seed-paper ideas. The headline result is a head-to-head LLM-judge comparison against gpt-4o baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the supervision signal is constructed from independent bibliographic records rather than defined in terms of the target ideas. The evaluation is external and comparative, so the claimed effectiveness does not reduce to a definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that citation-derived relations encode useful evolutionary information for idea prediction; the paper adds a new extraction pipeline and SFT setup but introduces no new physical entities or free parameters beyond standard training choices.

free parameters (2)

Hop distance: 2-hop neighborhood chosen for reference extraction; value selected by authors.
Data split sizes: 498/50/50 train/validation/test seed papers chosen for the experimental setup.

axioms (1)

Domain assumption: Citation position, frequency, predecessor links, and publication time provide meaningful signals for constructing paper-evolution relations. Invoked when deriving the DAG from the 2-hop neighborhood.

Reference graph

Works this paper leans on

[1] Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025. URL https://arxiv.org/abs/2404.07738
[2] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095. arXiv preprint arXiv:2410.07095
[3] Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. LLaGA: Large language and graph assistant. In International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2402.08170
[4] DeepSeek-AI. DeepSeek-V3.2-exp technical report. Hugging Face / GitHub release, 2025. Model release; technical report at https://github.com/deepseek-ai/DeepSeek-V3.2-Exp
[5] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A GraphRAG approach to query-focused summarization, 2024. URL https://arxiv.org/abs/2404.16130. arXiv preprint arXiv:2404.16130
[6] Alireza Ghafarollahi and Markus J. Buehler. SciAgents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning, 2025. URL https://arxiv.org/abs/2409.05556. arXiv preprint arXiv:2409.05556
[7] Xuemei Gu and Mario Krenn. Forecasting high-impact research topics via citation knowledge graph embeddings, 2024. URL https://arxiv.org/abs/2402.08640. arXiv preprint arXiv:2402.08640
[8] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking, 2025. URL https://arxiv.org/abs/2501.04519. arXiv preprint arXiv:2501.04519
[9] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation, 2024. URL https://arxiv.org/abs/2410.05779. arXiv preprint arXiv:2410.05779
[10] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Conference on Neural Information Processing Systems (NeurIPS), 2024. URL https://arxiv.org/abs/2405.14831
[11] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2006.03654
[12] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger-kernel: Efficient triton kernels for LLM training, 2024. URL https://arxiv.org/abs/2410.10989. arXiv preprint arXiv:2410.10989
[13] Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yuhao Lu, Yaochu Jin, Lili Pan, and Zhenzhong Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas, 2024. URL https://arxiv.org/abs/2410.14255. arXiv preprint arXiv:2410.14255
[14] Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Ku...
[15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP), 2023. URL https://arxiv.org/abs/2309.06180
[16] Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xinxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Yu Rong, Deli Zhao, Tian Xu, and Lidong Bing. Chain of ideas: Revolutionizing research in idea development with LLM agents. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2410.13185
[17] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (Workshop at ACL), pages 74–81, 2004
[18] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks, 2024. URL https://arxiv.org/abs/2404.19756. arXiv preprint arXiv:2404.19756
[19] Patrice Lopez. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries (ECDL), pages 473–474. Springer, 2009. doi: 10.1007/978-3-642-04346-8_62. Software: https://github.com/kermitt2/grobid
[20] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292. arXiv preprint arXiv:2408.06292
[21] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025. URL https://arxiv.org/abs/2502.12115. arXiv preprint arXiv:2502.12115
[22] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227. arXiv preprint arXiv:2501.04227
[23] Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2409.04109. arXiv preprint arXiv:2409.04109, 2024
[24] Amanpreet Singh, Mike D’Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. SciRepEval: A multi-format benchmark for scientific document representations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. URL https://arxiv.org/abs/2211.13308. Introduces SPECTER2 embedding family
[25] Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhi Li, Zhenfei Yin, Wanli Ouyang, and Nanqing Dong. Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation, 2024. URL https://arxiv.org/abs/2410.09403. arXiv preprint arXiv:2410.09403
[26] Don R. Swanson. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18, 1986
[27] Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. AGATHA: Automatic graph mining and transformer based hypothesis generation approach. In ACM International Conference on Information and Knowledge Management (CIKM), 2020. URL https://arxiv.org/abs/2002.05635
[28] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. GraphGPT: Graph instruction tuning for large language models. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024. URL https://arxiv.org/abs/2310.13023
[29] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: Autonomous scientific innovation, 2025. URL https://arxiv.org/abs/2505.18705. arXiv preprint arXiv:2505.18705
[30] Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. SciMON: Scientific inspiration machines optimized for novelty. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2305.14259
[31] Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, and Min Yang. FlowPIE: Test-time scientific idea evolution with flow-guided literature exploration. arXiv preprint arXiv:2603.29557, 2026. URL https://arxiv.org/abs/2603.29557
[32] Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. CycleResearcher: Improving automated research via automated review, 2025. URL https://arxiv.org/abs/2411.00816. arXiv preprint arXiv:2411.00816
[33] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066. arXiv preprint arXiv:2504.08066
[34] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115. arXiv preprint arXiv:2412.15115
[35] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025
[36] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1904.09675
[37] Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Chengye Wang, Yixin Liu, Lovekesh Vig, and Arman Cohan. AbGen: Evaluating large language models in ablation study design and evaluation for scientific research. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers), pages 12479–12491, 2025
[38] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Davi...


Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation, 2024. URL https://arxiv.org/abs/2410.05779. arXiv preprint arXiv:2410.05779

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Hipporag: Neurobiologically inspired long-term memory for large language models, 2025

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. InConference on Neural Information Processing Systems (NeurIPS), 2024. URL https://arxiv.org/abs/ 2405.14831. 10

work page arXiv 2024

[12] [12]

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. InInternational Conference on Learning Representations (ICLR), 2021. URLhttps://arxiv.org/abs/2006.03654

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Liger-kernel: Efficient triton kernels for LLM training, 2024

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger-kernel: Efficient triton kernels for LLM training, 2024. URLhttps://arxiv.org/abs/2410.10989. arXiv preprint arXiv:2410.10989

work page arXiv 2024

[14] [14]

Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas, 2024

Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yuhao Lu, Yaochu Jin, Lili Pan, and Zhenzhong Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas, 2024. URL https://arxiv.org/abs/2410. 14255. arXiv preprint arXiv:2410.14255

work page arXiv 2024

[15] [15]

Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S

Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Ku...

work page arXiv 2023

[16] [16]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InACM Symposium on Operating Systems Principles (SOSP), 2023. URLhttps://arxiv.org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Chain of ideas: Revolutionizing research in idea development with LLM agents

Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xinxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Yu Rong, Deli Zhao, Tian Xu, and Lidong Bing. Chain of ideas: Revolutionizing research in idea development with LLM agents. InInternational Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/ 2410.13185

work page arXiv 2025

[18] [18]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out (Workshop at ACL), pages 74–81, 2004

2004

[19] [19]

KAN: Kolmogorov-Arnold Networks

Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇci´c, Thomas Y . Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks, 2024. URLhttps: //arxiv.org/abs/2404.19756. arXiv preprint arXiv:2404.19756

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

GROBID: Combining automatic bibliographic data recognition and term extrac- tion for scholarship publications

Patrice Lopez. GROBID: Combining automatic bibliographic data recognition and term extrac- tion for scholarship publications. InResearch and Advanced Technology for Digital Libraries (ECDL), pages 473–474. Springer, 2009. doi: 10.1007/978-3-642-04346-8\_62. Software: https://github.com/kermitt2/grobid

work page doi:10.1007/978-3-642-04346-8 2009

[21] [21]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https: //arxiv.org/abs/2408.06292. arXiv preprint arXiv:2408.06292

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025

Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025. URL https://arxiv.org/abs/2502.12115. arXiv preprint arXiv:2502.12115

work page arXiv 2025

[23] [23]

Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227. arXiv preprint arXiv:2501.04227. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers

Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. InInternational Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2409.04109. arXiv preprint arXiv:2409.04109, 2024

work page arXiv 2025

[25] [25]

SciRepE- val: A multi-format benchmark for scientific document representations

Amanpreet Singh, Mike D’Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. SciRepE- val: A multi-format benchmark for scientific document representations. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2023. URL https: //arxiv.org/abs/2211.13308. Introduces SPECTER2 embedding family

work page arXiv 2023

[26] [26]

Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation, 2024

Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhi Li, Zhenfei Yin, Wanli Ouyang, and Nanqing Dong. Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation, 2024. URL https://arxiv.org/abs/2410.09403. arXiv preprint arXiv:2410.09403

work page arXiv 2024

[27] [27]

Don R. Swanson. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge.Perspec- tives in Biology and Medicine, 30(1):7–18, 1986

1986

[28] [28]

AGATHA: Automatic graph mining and transformer based hypothesis generation approach

Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. AGATHA: Automatic graph mining and transformer based hypothesis generation approach. InACM Interna- tional Conference on Information and Knowledge Management (CIKM), 2020. URL https: //arxiv.org/abs/2002.05635

work page arXiv 2020

[29] [29]

GraphGPT: Graph instruction tuning for large language models

Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. GraphGPT: Graph instruction tuning for large language models. InACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024. URL https://arxiv.org/abs/2310.13023

work page arXiv 2024

[30] [30]

AI-researcher: Autonomous scientific innovation, 2025

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: Autonomous scientific innovation, 2025. URL https://arxiv.org/abs/2505.18705. arXiv preprint arXiv:2505.18705

work page arXiv 2025

[31] [31]

SciMON: Scientific inspiration machines optimized for novelty

Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. SciMON: Scientific inspiration machines optimized for novelty. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2024. URLhttps://arxiv.org/abs/2305.14259

work page arXiv 2024

[32] [32]

FlowPIE: Test-time scientific idea evolution with flow-guided literature exploration

Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad- Rokny, Hui Li, Yuan Lin, and Min Yang. FlowPIE: Test-time scientific idea evolution with flow-guided literature exploration. arXiv preprint arXiv:2603.29557, 2026. URL https: //arxiv.org/abs/2603.29557

work page arXiv 2026

[33] [33]

CycleResearcher: Improving automated research via automated review, 2025

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. CycleResearcher: Improving automated research via automated review, 2025. URL https://arxiv.org/abs/2411.00816. arXiv preprint arXiv:2411.00816

work page arXiv 2025

[34] [34]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066. arXiv preprint arXiv:2504.08066

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115. arXiv preprint arXiv:2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning Representations (ICLR), 2020. URLhttps://arxiv.org/abs/1904.09675. 12

work page internal anchor Pith review Pith/arXiv arXiv 2020

[38] [38]

AbGen: Evaluating large language models in ablation study design and evaluation for scientific research

Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Chengye Wang, Yixin Liu, Lovekesh Vig, and Arman Cohan. AbGen: Evaluating large language models in ablation study design and evaluation for scientific research. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers), pages 12479–12491, 2025

2025

[39] [39]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Davi...

work page internal anchor Pith review Pith/arXiv arXiv 2025