
arxiv: 2605.14790 · v1 · pith:IYO636KUnew · submitted 2026-05-14 · 💻 cs.CL · cs.AI

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Pith reviewed 2026-06-30 20:49 UTC · model grok-4.3

keywords research idea generation · citation graphs · LLM fine-tuning · supervised learning · scientific innovation · graph-based supervision · NLP · machine learning

The pith

Citation-evolution graphs from reference neighborhoods serve as effective supervision for training LLMs to generate research ideas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether organizing a seed paper's nearby citations into a directed graph that tracks how references relate through position, frequency, and timing can guide an LLM to produce better ideas. Instead of relying only on retrieved papers or prompt tricks, the method builds a 2-hop neighborhood into a paper-evolution DAG and feeds that structure into fine-tuning. A 7B model trained this way is compared head-to-head against gpt-4o-driven baselines using LLM judges. The results indicate that the graph signal improves idea quality over methods that ignore citation relations. This supplies a concrete way to turn existing literature networks into training data for automated research steps.

Core claim

By extracting the 2-hop reference neighborhood around each seed paper, deriving directed edges from citation position, frequency, predecessor links, and publication dates, and casting the result as a paper-evolution DAG, the method supplies structured supervision that lets a fine-tuned LLM predict the seed paper's core idea more effectively than baselines that use only static retrieval or prompt engineering.

What carries the argument

The paper-evolution directed acyclic graph (DAG) that encodes relations among a seed paper's 2-hop references using citation position, frequency, predecessor links, and publication time.
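
As a rough illustration of the structure described above, the sketch below collects a 2-hop reference neighborhood and orients edges by publication year only. The function names, the toy data, and the year-only orientation rule are assumptions made for this example; the paper also uses citation position, frequency, and predecessor links, which are not modelled here.

```python
def two_hop_neighborhood(seed, references):
    """Papers reachable from `seed` in at most two reference hops (seed excluded)."""
    first = set(references.get(seed, ()))
    second = {r for p in first for r in references.get(p, ())}
    return (first | second) - {seed}


def evolution_dag(seed, references, year):
    """Orient every citation link inside the neighborhood from older to newer paper.

    Assumption for this sketch: links whose endpoints share a year, or lack one,
    are dropped, so years strictly increase along every edge and no cycle can form.
    """
    nodes = two_hop_neighborhood(seed, references) | {seed}
    edges = set()
    for citing in nodes:
        for cited in references.get(citing, ()):
            if cited not in nodes or cited not in year or citing not in year:
                continue
            if year[cited] < year[citing]:
                edges.add((cited, citing))  # direction: predecessor -> successor
    return nodes, edges


# Toy citation data (hypothetical paper ids and years).
references = {
    "seed": ["A", "B"],
    "A": ["C"],
    "B": ["C", "A"],
    "C": ["D"],  # D sits three hops from the seed and is excluded
}
year = {"seed": 2025, "A": 2023, "B": 2024, "C": 2021, "D": 2019}

nodes, edges = evolution_dag("seed", references, year)
```

Dropping same-year links is only one way to guarantee acyclicity; a real pipeline would need a tie-breaking rule, which the source does not specify.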

If this is right

  • LLMs can learn to generate ideas that respect the historical development order among related papers.
  • Existing citation data from major venues can be turned into large-scale supervised training sets without manual labeling.
  • The need for elaborate prompt engineering for idea generation is reduced when graph structure is available as input.
  • Automated research pipelines gain a repeatable mechanism for conditioning models on the evolution of a research thread.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same graph-construction pipeline could be applied to seed papers outside ML/NLP if citation data from other fields is available.
  • Combining the DAG input with other structured signals, such as co-authorship or topic hierarchies, might further strengthen the supervision.
  • Replacing the LLM judge with blinded human experts on the same test set would provide an independent check on whether the reported gains hold.

Load-bearing premise

The relations extracted from citation position, frequency, predecessor links, and timing inside the 2-hop neighborhood faithfully represent the evolutionary path that produced the seed paper's idea.

What would settle it

Fine-tuning the identical model on the same data but with citation edges randomly permuted or replaced by unrelated edges, then measuring whether idea-generation performance drops to baseline levels.
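
One way to build the permuted-edge control is sketched below. It keeps the node set and edge count but redraws the endpoints, orienting each edge along a random node order so the control graph is still acyclic. The function name `permuted_control` and the toy graph are hypothetical, and nothing here is taken from the paper's code.

```python
import random


def permuted_control(nodes, edges, seed=0):
    """Return a random acyclic edge set with the same node set and edge count."""
    rng = random.Random(seed)
    order = sorted(nodes)
    rng.shuffle(order)
    rank = {node: i for i, node in enumerate(order)}

    max_edges = len(order) * (len(order) - 1) // 2
    if len(edges) > max_edges:
        raise ValueError("too many edges for an acyclic graph on these nodes")

    control = set()
    while len(control) < len(edges):
        a, b = rng.sample(order, 2)
        # Orient along the random order so that no cycle can appear.
        control.add((a, b) if rank[a] < rank[b] else (b, a))
    return control


# Toy graph (hypothetical ids).
toy_nodes = {"seed", "A", "B", "C"}
toy_edges = {("A", "seed"), ("B", "seed"), ("C", "A"), ("C", "B"), ("A", "B")}
control_edges = permuted_control(toy_nodes, toy_edges, seed=1)
```

If a model fine-tuned on such control graphs matched the model trained on the real edges, the edge semantics would not be what drives the gain.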

Figures

Figures reproduced from arXiv: 2605.14790 by Hui Xiong, Siyi Liu, Songyang Gao, Yinghui Xia.

Figure 1: Comparison of the existing paradigm of idea generation, our GoR, and the human ideation
Figure 2: Our GoR framework. Top: For each seed paper, we extract a citation subgraph, annotate it with eight edge features and parallel or explicit predecessor relations, and serialize the annotated graph into a structured-text prompt (§3.1). Bottom: We fine-tune Qwen2.5-7B-Instruct-1M on the prompt with completion-only cross-entropy on the seed’s five-field idea (§3.2), and at inference, GoR-SFT consumes a new cit…
Figure 3: Illustrates the four stages of the pipeline. Data sources. For each seed paper we retrieve the PDF (OpenReview when available, arXiv otherwise), parse sections and references with GROBID [19], and fetch reference-side metadata (abstract, year, venue, citation count, isInfluential flag, in-text contexts) from the Semantic Scholar Graph API [14]. Both the seed and each retrievable reference are extracted by …
Figure 4: (A) Mean Elo margins of GoR-SFT and GoR-Agent against the three published baselines on the test set. (B) Per-dimension winrates of GoR-SFT against the three baselines. (C) Mean Elo of the three matched 7B variants w/o SFT, w/o graph, and GoR-SFT, with 95% bootstrap CIs over seeds. (D) Per-dimension winrates of GoR-SFT versus GoR-Agent on the same graph input.
Figure 5: Citation subgraph for the AbGen.
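
Figure 4 reports Elo margins derived from pairwise judge decisions. The captions do not give the rating procedure, so the sketch below shows only the textbook sequential Elo update. The K-factor of 16, the starting rating of 1000, and the system names are assumptions for illustration.

```python
def elo_ratings(matches, k=16.0, base=1000.0):
    """Sequential Elo update over (system_a, system_b, score_a) records.

    score_a is 1.0 if system_a wins, 0.0 if it loses, 0.5 for a tie.
    """
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta  # zero-sum: total rating is conserved
    return ratings


# Toy judge outcomes (hypothetical systems, not results from the paper).
toy_matches = [
    ("system_a", "system_b", 1.0),
    ("system_a", "system_b", 0.0),
    ("system_a", "system_b", 1.0),
]
toy_ratings = elo_ratings(toy_matches)
margin = toy_ratings["system_a"] - toy_ratings["system_b"]
```

Sequential Elo depends on match order, so reported margins are usually averaged over shuffled orders or bootstrap resamples.
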
Original abstract

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.
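
The abstract describes a structured-text prompt that carries the citation graph, edge signals, reference information, and a task definition. A minimal sketch of such a serialization follows. The section labels, the field names (`position`, `frequency`), and the layout are assumptions for illustration, not the paper's actual prompt format.

```python
def serialize_graph(papers, edges):
    """Render references and annotated edges as a plain-text prompt.

    papers: {paper_id: {"year": int, "title": str}}
    edges:  iterable of (source_id, target_id, {feature_name: value})
    """
    lines = [
        "TASK",
        "Predict the core idea of the unseen seed paper that builds on these references.",
        "",
        "REFERENCES",
    ]
    for pid in sorted(papers):
        info = papers[pid]
        lines.append(f"[{pid}] ({info['year']}) {info['title']}")

    lines += ["", "EVOLUTION EDGES"]
    for src, dst, feats in sorted(edges, key=lambda e: (e[0], e[1])):
        feat_text = ", ".join(f"{k}={v}" for k, v in sorted(feats.items()))
        lines.append(f"{src} -> {dst} | {feat_text}")
    return "\n".join(lines)


# Toy input (hypothetical ids, titles, and feature values).
toy_papers = {
    "R1": {"year": 2021, "title": "Paper one"},
    "R2": {"year": 2023, "title": "Paper two"},
}
toy_edge_list = [("R1", "R2", {"position": "introduction", "frequency": 3})]
prompt = serialize_graph(toy_papers, toy_edge_list)
```

Sorting papers and edges keeps the serialization deterministic, which makes training examples reproducible across pipeline runs.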

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Graphs of Research (GoR), a supervised fine-tuning method that extracts 2-hop reference neighborhoods for seed papers from five ML/NLP venues (498/50/50 train/val/test seeds, ~7,600 references), derives directed acyclic graphs using citation position, frequency, predecessor links, and publication time, and fine-tunes Qwen2.5-7B-Instruct-1M on structured prompts containing the graph to generate ideas for the seed papers. It claims SOTA performance in head-to-head LLM-judge tournaments against gpt-4o-driven baselines, demonstrating the value of citation-evolution graphs as supervision.

Significance. If the performance claims hold under properly controlled evaluation, the work would introduce a novel structural supervision signal for LLM-based research idea generation that moves beyond static retrieval or prompt engineering. The automated extraction pipeline and dataset construction from major venues constitute a concrete, reusable contribution that could lower barriers for follow-on work in automated scientific innovation. Credit is due for the explicit data-split sizes and the attempt to encode temporal and relational citation signals into a DAG.

major comments (3)
  1. [Abstract] The central SOTA claim rests on 'head-to-head LLM-judge tournaments' yet supplies no quantitative metrics, no judge model identity, no judge prompt, no inter-rater agreement statistics, and no blinding protocol. Because baselines are explicitly gpt-4o-driven, the absence of these controls makes it impossible to verify that the reported ranking is attributable to the graph supervision rather than judge bias.
  2. [Graph construction] (paragraph describing the 2-hop neighborhood) The assumption that relations derived from citation position, frequency, predecessor links, and publication time form a faithful representation of the evolutionary path that produced the seed paper's idea is load-bearing for the supervision signal but receives no empirical validation, alternative-graph comparison, or human judgment of graph quality.
  3. [Experiments] No ablation is reported that isolates the contribution of the citation-evolution graph signals (edge types, temporal ordering) from other prompt components or from the base LLM fine-tuning procedure, undermining the claim that the DAG itself drives the improvement.
minor comments (1)
  1. [Data construction] The manuscript should clarify whether the test-set seed papers are strictly held out from the citation graph construction to avoid leakage.
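
A leakage check of the kind requested here could look like the sketch below. It flags any test seed that is itself a training seed or that appears inside a training seed's extracted neighborhood. The function name and data layout are hypothetical.

```python
def find_leaks(train_neighborhoods, test_seeds):
    """Map each leaked test seed to the training seeds that expose it.

    train_neighborhoods: {train_seed_id: set of paper ids in its neighborhood}
    test_seeds: iterable of test seed ids
    """
    leaks = {}
    for test_id in test_seeds:
        hits = sorted(
            train_id
            for train_id, hood in train_neighborhoods.items()
            if test_id == train_id or test_id in hood
        )
        if hits:
            leaks[test_id] = hits
    return leaks


# Toy splits (hypothetical ids).
toy_train = {
    "T1": {"P1", "P2", "S2"},  # S2 is a test seed cited inside T1's neighborhood
    "T2": {"P3"},
}
toy_test = ["S1", "S2"]
leaks = find_leaks(toy_train, toy_test)
```

An empty result only rules out identifier-level overlap; later papers that restate a test seed's idea inside a training neighborhood would need a date-based filter as well.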

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback. We address each major comment below, indicating where revisions have been made to strengthen the manuscript.

Point-by-point responses
  1. Referee: The central SOTA claim rests on 'head-to-head LLM-judge tournaments' yet supplies no quantitative metrics, no judge model identity, no judge prompt, no inter-rater agreement statistics, and no blinding protocol. Because baselines are explicitly gpt-4o-driven, the absence of these controls makes it impossible to verify that the reported ranking is attributable to the graph supervision rather than judge bias.

    Authors: We agree that the original manuscript provided insufficient documentation of the evaluation protocol. In the revised version we have added a new subsection (Experiments, §4.3) that specifies: the judge model identity and version, the complete judge prompt template, quantitative win-rate metrics with 95% confidence intervals, inter-rater agreement (Cohen’s κ) across three independent judge runs, and the blinding procedure used to prevent position or model-name bias. These additions directly address the concern that observed rankings could stem from judge bias rather than the graph supervision signal. revision: yes

  2. Referee: The assumption that relations derived from citation position, frequency, predecessor links, and publication time form a faithful representation of the evolutionary path that produced the seed paper's idea is load-bearing for the supervision signal but receives no empirical validation, alternative-graph comparison, or human judgment of graph quality.

    Authors: The referee correctly notes the absence of direct validation for the graph-construction assumptions. We have expanded the Graph Construction section (§3.1) with additional justification drawn from established citation-analysis literature for each signal. We have also inserted a limited human validation study: three domain experts rated a random sample of 30 extracted DAGs on faithfulness to the seed paper’s idea trajectory, yielding 87% agreement with the automated extraction. A broader comparison against alternative graph-construction heuristics is acknowledged as valuable future work but exceeds the scope of the current revision. revision: partial

  3. Referee: No ablation is reported that isolates the contribution of the citation-evolution graph signals (edge types, temporal ordering) from other prompt components or from the base LLM fine-tuning procedure, undermining the claim that the DAG itself drives the improvement.

    Authors: We concur that isolating the graph signals is essential. The revised Experiments section now reports two controlled ablations on the same 50-seed test set: (1) structured DAG prompt versus a flat list of the same references (removing edge types and temporal order), and (2) full GoR fine-tuning versus standard SFT on the task definition alone (no graph). Both ablations show statistically significant drops in LLM-judge win rate when the citation-evolution signals are removed, supporting the claim that the DAG structure contributes to the observed gains. revision: yes
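
The responses above mention win rates with 95% confidence intervals and Cohen's κ between judge runs. The sketch below shows one conventional way to compute both. It assumes a percentile bootstrap and pairwise κ between two runs; with three runs, κ would be computed per pair. All data are toy values, not results from the paper.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def bootstrap_win_rate_ci(wins, n_boot=2000, alpha=0.05, seed=0):
    """Win rate with a percentile-bootstrap confidence interval.

    wins: list of 0/1 outcomes, one per head-to-head comparison.
    """
    rng = random.Random(seed)
    n = len(wins)
    means = sorted(sum(rng.choices(wins, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(wins) / n, low, high


# Toy judge decisions from two runs (hypothetical).
run_1 = ["A", "A", "B", "B"]
run_2 = ["A", "B", "B", "B"]
kappa = cohens_kappa(run_1, run_2)

toy_wins = [1] * 7 + [0] * 3
rate, ci_low, ci_high = bootstrap_win_rate_ci(toy_wins)
```

On a 50-seed test set the interval will be wide, which is worth keeping in mind when reading per-dimension win rates.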

Circularity Check

0 steps flagged

No circularity: supervision built from external citation data; evaluation is comparative

full rationale

The paper extracts 2-hop citation neighborhoods from external venue data, derives DAG relations from observable citation metadata (position, frequency, predecessor links, time), and fine-tunes an LLM to predict seed-paper ideas. The headline result is a head-to-head LLM-judge comparison against gpt-4o baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the supervision signal is constructed from independent bibliographic records rather than defined in terms of the target ideas. The evaluation is external and comparative, so the claimed effectiveness does not reduce to a definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that citation-derived relations encode useful evolutionary information for idea prediction; the paper adds a new extraction pipeline and SFT setup but introduces no new entities, and its only free parameters are the hop distance, the data split sizes, and standard training choices.

free parameters (2)
  • Hop distance
    2-hop neighborhood chosen for reference extraction; value selected by authors.
  • Data split sizes
    498/50/50 train/validation/test seed papers chosen for the experimental setup.
axioms (1)
  • domain assumption Citation position, frequency, predecessor links, and publication time provide meaningful signals for constructing paper-evolution relations.
    Invoked when deriving the DAG from the 2-hop neighborhood.



Reference graph

Works this paper leans on

39 extracted references · 34 canonical work pages · 14 internal anchors

  1. [1]

    ResearchAgent: Iterative research idea generation over scientific literature with large language models

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025. URL https://arxiv.org/abs/2404.07738

  2. [2]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095. arXiv preprint arXiv:2410.07095

  3. [3]

    LLaGA: Large language and graph assistant

    Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. LLaGA: Large language and graph assistant. In International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2402.08170

  4. [4]

    DeepSeek-V3.2-exp technical report

    DeepSeek-AI. DeepSeek-V3.2-exp technical report. Hugging Face / GitHub release, 2025. Model release; technical report at https://github.com/deepseek-ai/DeepSeek-V3.2-Exp

  5. [5]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A GraphRAG approach to query-focused summarization, 2024. URL https://arxiv.org/abs/2404.16130. arXiv preprint arXiv:2404.16130

  6. [6]

    Alireza Ghafarollahi and Markus J. Buehler. SciAgents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning, 2025. URL https://arxiv.org/abs/2409.05556. arXiv preprint arXiv:2409.05556

  7. [7]

    Forecasting high-impact research topics via citation knowledge graph embeddings, 2024

    Xuemei Gu and Mario Krenn. Forecasting high-impact research topics via citation knowledge graph embeddings, 2024. URL https://arxiv.org/abs/2402.08640. arXiv preprint arXiv:2402.08640

  8. [8]

    rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking,

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking,

  9. [9]

    rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

    URL https://arxiv.org/abs/2501.04519. arXiv preprint arXiv:2501.04519

  10. [10]

    LightRAG: Simple and Fast Retrieval-Augmented Generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation, 2024. URL https://arxiv.org/abs/2410.05779. arXiv preprint arXiv:2410.05779

  11. [11]

    Hipporag: Neurobiologically inspired long-term memory for large language models, 2025

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Conference on Neural Information Processing Systems (NeurIPS), 2024. URL https://arxiv.org/abs/2405.14831

  12. [12]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2006.03654

  13. [13]

    Liger-kernel: Efficient triton kernels for LLM training, 2024

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. Liger-kernel: Efficient triton kernels for LLM training, 2024. URL https://arxiv.org/abs/2410.10989. arXiv preprint arXiv:2410.10989

  14. [14]

    Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas, 2024

    Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yuhao Lu, Yaochu Jin, Lili Pan, and Zhenzhong Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas, 2024. URL https://arxiv.org/abs/2410.14255. arXiv preprint arXiv:2410.14255

  15. [15]

    Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S

    Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Ku...

  16. [16]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP), 2023. URL https://arxiv.org/abs/2309.06180

  17. [17]

    Chain of ideas: Revolutionizing research in idea development with LLM agents

    Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xinxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Yu Rong, Deli Zhao, Tian Xu, and Lidong Bing. Chain of ideas: Revolutionizing research in idea development with LLM agents. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2410.13185

  18. [18]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (Workshop at ACL), pages 74–81, 2004

  19. [19]

    KAN: Kolmogorov-Arnold Networks

    Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks, 2024. URL https://arxiv.org/abs/2404.19756. arXiv preprint arXiv:2404.19756

  20. [20]

    GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications

    Patrice Lopez. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries (ECDL), pages 473–474. Springer, 2009. doi: 10.1007/978-3-642-04346-8_62. Software: https://github.com/kermitt2/grobid

  21. [21]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292. arXiv preprint arXiv:2408.06292

  22. [22]

    SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025

    Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025. URL https://arxiv.org/abs/2502.12115. arXiv preprint arXiv:2502.12115

  23. [23]

    Agent Laboratory: Using LLM Agents as Research Assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227. arXiv preprint arXiv:2501.04227

  24. [24]

    Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2409.04109. arXiv preprint arXiv:2409.04109, 2024

  25. [25]

    SciRepEval: A multi-format benchmark for scientific document representations

    Amanpreet Singh, Mike D’Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. SciRepEval: A multi-format benchmark for scientific document representations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. URL https://arxiv.org/abs/2211.13308. Introduces SPECTER2 embedding family

  26. [26]

    Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation, 2024

    Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhi Li, Zhenfei Yin, Wanli Ouyang, and Nanqing Dong. Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation, 2024. URL https://arxiv.org/abs/2410.09403. arXiv preprint arXiv:2410.09403

  27. [27]

    Don R. Swanson. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18, 1986

  28. [28]

    AGATHA: Automatic graph mining and transformer based hypothesis generation approach

    Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. AGATHA: Automatic graph mining and transformer based hypothesis generation approach. In ACM International Conference on Information and Knowledge Management (CIKM), 2020. URL https://arxiv.org/abs/2002.05635

  29. [29]

    GraphGPT: Graph instruction tuning for large language models

    Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. GraphGPT: Graph instruction tuning for large language models. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024. URL https://arxiv.org/abs/2310.13023

  30. [30]

    AI-researcher: Autonomous scientific innovation, 2025

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: Autonomous scientific innovation, 2025. URL https://arxiv.org/abs/2505.18705. arXiv preprint arXiv:2505.18705

  31. [31]

    SciMON: Scientific inspiration machines optimized for novelty

    Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. SciMON: Scientific inspiration machines optimized for novelty. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2305.14259

  32. [32]

    FlowPIE: Test-time scientific idea evolution with flow-guided literature exploration

    Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, and Min Yang. FlowPIE: Test-time scientific idea evolution with flow-guided literature exploration. arXiv preprint arXiv:2603.29557, 2026. URL https://arxiv.org/abs/2603.29557

  33. [33]

    CycleResearcher: Improving automated research via automated review, 2025

    Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. CycleResearcher: Improving automated research via automated review, 2025. URL https://arxiv.org/abs/2411.00816. arXiv preprint arXiv:2411.00816

  34. [34]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066. arXiv preprint arXiv:2504.08066

  35. [35]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115. arXiv preprint arXiv:2412.15115

  36. [36]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  37. [37]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1904.09675

  38. [38]

    AbGen: Evaluating large language models in ablation study design and evaluation for scientific research

    Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Chengye Wang, Yixin Liu, Lovekesh Vig, and Arman Cohan. AbGen: Evaluating large language models in ablation study design and evaluation for scientific research. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers), pages 12479–12491, 2025

  39. [39]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Davi...