IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

Daniel Xavier de Sousa; Hongyu Guo; Ricardo Marçal; Xiaodan Zhu; Yuanhao Shen

arxiv: 2507.15736 · v2 · submitted 2025-07-21 · 💻 cs.CL

Pith reviewed 2026-05-19 03:44 UTC · model grok-4.3

keywords interdisciplinary research · large language models · benchmark · knowledge integration · idea recommendation · evaluation framework · cross-disciplinary tasks

The pith

IDRBench offers the first comprehensive benchmark for large language models in interdisciplinary research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IDRBench to evaluate how well large language models can perform tasks that require combining knowledge from multiple disciplines. This matters because significant innovation often emerges from bridging separate fields, and LLMs may help if their current abilities are properly understood and measured. The framework defines three tasks—IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation—along with datasets to create concrete evaluations. Analysis of ten mainstream LLMs then provides behavioral insights and establishes initial benchmarks and baselines.

Core claim

IDRBench is the first framework to comprehensively investigate LLMs' interdisciplinary research capability through datasets and three tasks: IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation, with evaluations on ten mainstream LLMs providing analysis and setting benchmarks for future work.

What carries the argument

The IDRBench framework consisting of datasets and three specific evaluation tasks designed to measure LLMs' ability to integrate knowledge across disciplines.

If this is right

Establishes standardized benchmarks and baselines that future work on LLMs for cross-disciplinary tasks can track and improve upon.
Reveals specific patterns in how current LLMs handle identification, integration, and recommendation of ideas from different fields.
Creates a foundation for AI systems intended to assist researchers in locating and combining knowledge across disciplinary boundaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that improve on the recommendation task might be deployed to automatically surface potential new research directions that combine insights from distant fields.
The benchmark approach could be adapted to measure AI performance on other forms of creative knowledge synthesis beyond research papers.
Scores on IDRBench could serve as a selection criterion when choosing or adapting LLMs for real collaborative projects involving multiple disciplines.

Load-bearing premise

The three defined tasks serve as valid and sufficient proxies for measuring genuine interdisciplinary research capability in LLMs.

What would settle it

A controlled study in which human domain experts judge that high-scoring LLM outputs on the three IDRBench tasks do not produce or validate genuinely novel interdisciplinary research contributions.

Figures

Figures reproduced from arXiv: 2507.15736 by Daniel Xavier de Sousa, Hongyu Guo, Ricardo Marçal, Xiaodan Zhu, Yuanhao Shen.

**Figure 1.** Triplet data format in IDRBench - Showing that Papers PB and PC are integrated (more than merely referenced) to generate the IDR Paper PA. To obtain a positive triplet, the annotators need to identify PA from the candidate pool, then figure out the key cited papers PB and PC that play central roles in deriving the IDR idea. To contribute to understanding LLMs’ abilities in IDR, we took a small step to int…

**Figure 2.** Visualization of tasks IPI, I3, and I2R within IDRBench. Orange and green arrows stand…

**Figure 3.** Distribution of binary discipline combinations for ArXiv data and positive samples in…

**Figure 7.** Finally, they are asked to annotate the specific sentence(s) in this IDR paper that specifically…

**Figure 4.** List of papers available for the annotator to choose from.

**Figure 5.** Task 1, displayed after the annotator selects a paper.

**Figure 6.** Task 2, displayed if the annotator answers "Yes" in Task 1.

**Figure 7.** Task 3, displayed after completing Task 2.

**Figure 8.** Task 4, displayed after completing Task 3.

**Figure 9.** Review page, where the annotator can review and optionally go back and edit their answers.

**Figure 10.** Qualitative comparison sample on both idea integration reasoning and full abstract…
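The triplet format described in the Figure 1 caption (an IDR paper PA derived by integrating two key cited papers PB and PC) could be represented roughly as follows. This is a minimal sketch, not the authors' data schema: the class names, field names, and the `spans_disciplines` check are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """A single paper record; all field names are illustrative assumptions."""
    paper_id: str
    discipline: str
    abstract: str


@dataclass(frozen=True)
class IDRTriplet:
    """One (P_A, P_B, P_C) triplet in the spirit of the Figure 1 caption."""
    idr_paper: Paper   # P_A: the interdisciplinary paper
    source_b: Paper    # P_B: first key cited paper
    source_c: Paper    # P_C: second key cited paper
    is_positive: bool  # True if annotators judged P_B and P_C as integrated

    def spans_disciplines(self) -> bool:
        # Assumption: a triplet is cross-disciplinary when the two
        # source papers carry different discipline labels.
        return self.source_b.discipline != self.source_c.discipline


# Usage with made-up records (not taken from the IDRBench datasets).
p_b = Paper("B-001", "biology", "placeholder abstract")
p_c = Paper("C-001", "computer science", "placeholder abstract")
p_a = Paper("A-001", "bioinformatics", "placeholder abstract")
triplet = IDRTriplet(p_a, p_b, p_c, is_positive=True)
print(triplet.spans_disciplines())
```

Keeping the positive/negative flag on the triplet itself is one possible design; the actual benchmark may store labels separately.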

read the original abstract

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis of their behavior and establishes benchmarks and baselines for future research. To the best of our knowledge, IDRBench is the first to provide a comprehensive investigation of LLMs' IDR capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IDRBench sets up a new benchmark for LLMs in interdisciplinary research but needs human validation to back its claims.

The main takeaway is that this paper introduces IDRBench as a new benchmark for assessing LLMs on interdisciplinary research through three tasks, but the lack of human expert validation leaves the central claims on shaky ground.

The work is new in creating a unified framework with IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation tasks, along with datasets for evaluating ten different LLMs. This combination hasn't been standardized before, so it provides a practical starting point and baselines for future studies on how models handle cross-field knowledge integration. The systematic evaluation and analysis of model behaviors across these tasks is a clear contribution that others can build on.

What the paper does well is define the tasks explicitly and run consistent tests to establish initial performance numbers. It also makes the datasets part of the benchmark, which supports reproducibility.

The soft spots center on validation. The tasks are meant to proxy real interdisciplinary capability, yet no human baselines, expert assessments of task realism, or checks for data biases are reported. This makes it tough to tell whether strong LLM performance reflects actual synthesis skills or just familiarity with the curated examples. These gaps are significant because the paper positions itself as a comprehensive investigation.

This paper targets researchers developing LLMs for scientific and innovative applications, particularly those interested in benchmarks for complex reasoning. Readers working on evaluation frameworks would find the task definitions and results informative.

I would recommend engaging with it through peer review. The benchmark idea is timely and the execution provides a foundation worth refining with added validation steps.

Referee Report

2 major / 2 minor

Summary. The paper introduces IDRBench, a benchmark framework with three tasks (IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation) and associated datasets to evaluate how well large language models can perform interdisciplinary research. It reports results on ten mainstream LLMs and claims this is the first comprehensive investigation of LLMs' IDR capabilities.

Significance. If the tasks are shown to be valid proxies, the benchmark would provide useful baselines and a starting point for measuring and improving LLMs' ability to integrate knowledge across fields, which is relevant given the role of interdisciplinarity in innovation.

major comments (2)

[Task definitions and evaluation setup] The central claim that the three tasks measure LLMs' IDR capability rests on their validity as proxies, yet the manuscript reports no human expert validation, inter-annotator agreement scores, or correlation analysis showing that model performance tracks actual interdisciplinary synthesis (as opposed to retrieval or pattern matching on the curated data).
[Evaluation and results] No human baselines or expert ratings of task realism are provided for any of the three tasks, which is required to interpret the LLM performance numbers as evidence of genuine IDR capability rather than benchmark-specific artifacts.

minor comments (2)

[Abstract and introduction] The abstract's novelty claim ('to the best of our knowledge, IDRBench is the first') would be strengthened by a dedicated related-work subsection that explicitly contrasts the new tasks against prior LLM benchmarks on cross-domain reasoning or knowledge integration.
[Dataset construction] Clarify the dataset construction process, including source selection criteria and any steps taken to mitigate domain-specific biases, to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of IDRBench as a valid benchmark for LLM interdisciplinary research capabilities.

Referee: [Task definitions and evaluation setup] The central claim that the three tasks measure LLMs' IDR capability rests on their validity as proxies, yet the manuscript reports no human expert validation, inter-annotator agreement scores, or correlation analysis showing that model performance tracks actual interdisciplinary synthesis (as opposed to retrieval or pattern matching on the curated data).

Authors: We agree that explicit validation of the tasks as proxies for genuine IDR is important for interpreting the results. The three tasks were constructed by drawing on established definitions and examples of interdisciplinary research from the scholarly literature, with datasets drawn from real cross-field publications. Nevertheless, we acknowledge that the current manuscript does not include human expert validation or inter-annotator agreement statistics. In the revised version we will add a dedicated validation subsection that reports expert review of task realism on a sampled subset of instances together with inter-annotator agreement scores. Direct correlation analysis with downstream research impact would require longitudinal outcome data that is not yet available for this initial benchmark; we will note this limitation and identify it as an important avenue for follow-up studies.

revision: partial

Referee: [Evaluation and results] No human baselines or expert ratings of task realism are provided for any of the three tasks, which is required to interpret the LLM performance numbers as evidence of genuine IDR capability rather than benchmark-specific artifacts.

Authors: We appreciate the referee's emphasis on the need for human baselines to contextualize the reported LLM scores. In the revised manuscript we will include human performance baselines collected from domain experts on a representative sample of each task, along with expert ratings of task realism. These additions will allow readers to better distinguish benchmark-specific effects from broader IDR capabilities and will be presented alongside the existing LLM results.

revision: yes
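
The inter-annotator agreement scores discussed in the first response are often reported with a chance-corrected statistic such as Cohen's kappa. Neither the paper nor the rebuttal names a specific statistic, so the choice of kappa here is an assumption, and the function name and the example labels below are made up for illustration. A minimal sketch for two annotators:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical "is this an IDR paper?" judgments (1 = yes, 0 = no).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))
```

With more than two annotators, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual alternative.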

Circularity Check

0 steps flagged

No circularity: benchmark creation is self-contained empirical contribution

The paper introduces IDRBench as a new framework consisting of three explicitly defined tasks (IDR Paper Identification, IDR Idea Integration, IDR Idea Recommendation) along with associated datasets and an evaluation of ten LLMs. No equations, fitted parameters, or derivations are present that reduce to prior inputs by construction. The central claim of providing the first comprehensive investigation rests on the novelty of the benchmark itself rather than any self-citation chain or self-definitional loop. The tasks are presented as proxies by design choice, not derived from fitted results or prior author work invoked as uniqueness theorems. This is a standard empirical benchmark paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the three constructed tasks adequately represent interdisciplinary research capability; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Interdisciplinary research capability in LLMs can be meaningfully decomposed into the tasks of paper identification, idea integration, and idea recommendation.
The benchmark framework is built directly on these three tasks as the core evaluation components.


[1] [1]

Claude 3.7 Sonnet

Anthropic. Claude 3.7 Sonnet. https://www.anthropic.com/claude/sonnet. Accessed: 2025-05-15. 2024

work page 2025

[2] [2]

S ci BERT : A Pretrained Language Model for Scientific Text

Iz Beltagy, Kyle Lo, and Arman Cohan. “SciBERT: A Pretrained Language Model for Sci- entific Text”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP). Ed. by Kentaro Inui et al. Hong Kong, China: Association for Computational...

work page doi:10.18653/v1/d19-1371 2019

[3] [3]

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Ben Bogin et al. “SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Ed. by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 12622– 12645. DOI: 10....

work page doi:10.18653/v1/2024.emnlp-main.702 2024

[4] [4]

Mapping the backbone of science

Kevin W. Boyack, Richard Klavans, and Katy Börner. “Mapping the backbone of science”. In: Scientometrics 64.3 (2005), pp. 351–374. ISSN : 1588-2861. DOI: 10.1007/s11192-005- 0255-6. URL: https://doi.org/10.1007/s11192-005-0255-6

work page doi:10.1007/s11192-005- 2005

[5] [5]

Language Models are Few-Shot Learners

Tom Brown et al. “Language Models are Few-Shot Learners”. In:Advances in Neural Infor- mation Processing Systems. Ed. by H. Larochelle et al. V ol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/ file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

work page 2020

[6] [7]

An Overview of Diffusion Models for Text Generation

Helena ˇCeovi´c et al. “An Overview of Diffusion Models for Text Generation”. In:2023 46th MIPRO ICT and Electronics Convention (MIPRO) . 2023, pp. 941–946. DOI: 10 . 23919 / MIPRO57284.2023.10159911

work page arXiv 2023

[7] [8]

Mean Reciprocal Rank

Nick Craswell. “Mean Reciprocal Rank”. In: Encyclopedia of Database Systems . Ed. by LING LIU and M. TAMER ÖZSU. Boston, MA: Springer US, 2009, pp. 1703–1703. ISBN : 978-0-387-39940-9. DOI: 10.1007/978-0-387-39940-9_488 . URL: https://doi.org/ 10.1007/978-0-387-39940-9_488

work page doi:10.1007/978-0-387-39940-9_488 2009

[8] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforce- ment Learning. 2025. arXiv: 2501.12948 [cs.CL]. URL: https://arxiv.org/abs/2501. 12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [10]

Meta-learning Approaches for Few-Shot Learning: A Survey of Recent Advances

Hassan Gharoun et al. “Meta-learning Approaches for Few-Shot Learning: A Survey of Recent Advances”. In: ACM Comput. Surv. 56.12 (July 2024). ISSN : 0360-0300. DOI: 10.1145/ 3659943. URL: https://doi.org/10.1145/3659943

work page doi:10.1145/3659943 2024

[10] [11]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 Herd of Models . 2024. arXiv: 2407.21783 [cs.AI] . URL: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [12]

Ideabench: Benchmarking large language models for research idea generation

Sikun Guo et al. IdeaBench: Benchmarking Large Language Models for Research Idea Generation. 2024. arXiv: 2411.02429 [cs.CL] . URL: https://arxiv.org/abs/2411. 02429

work page arXiv 2024

[12] [13]

Preliminary study on Wilcoxon learning machines

Jer Guang Hsieh, Yih Lon Lin, and Jyh Horng Jeng. “Preliminary study on Wilcoxon learning machines”. In: Journal of IEEE Transactions on Neural Networks and Learning Systems 19.2 (2008), pp. 201–211

work page 2008

[13] [14]

16 Karen Spärck Jones

Kalervo Järvelin and Jaana Kekäläinen. “Cumulated gain-based evaluation of IR techniques”. In: ACM Trans. Inf. Syst. 20.4 (Oct. 2002), pp. 422–446. ISSN : 1046-8188. DOI: 10.1145/ 582415.582418. URL: https://doi.org/10.1145/582415.582418

work page doi:10.1145/582415.582418 2002

[14] [15]

Convergence: Facilitating Transdisciplinary Integration of Life Sciences

Committee on Key Challenge Areas for Convergence and Health; Board on Life Sciences; Division on Earth and Life Studies; National Research Council. Convergence: Facilitating Transdisciplinary Integration of Life Sciences. Washington, DC: The National Academies Press, 2014. ISBN: 978-0-309-30151-0. DOI: 10.17226/18722. URL: https://pubmed.ncbi.nlm.nih.g...

work page doi:10.17226/18722 2014

[15] [16]

The Eureka Moment

Guenther Knoblich and Michael Oellinger. “The Eureka Moment”. In: Scientific American Mind 17.5 (2006), pp. 38–43. ISSN: 15552284, 2331379X. URL: http://www.jstor.org/stable/24921587 (visited on 05/15/2025)

work page arXiv 2006

[16] [17]

Large language models are zero-shot reasoners

Takeshi Kojima et al. “Large language models are zero-shot reasoners”. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22. New Orleans, LA, USA: Curran Associates Inc., 2022. ISBN: 9781713871088

work page 2022

[17] [18]

Advances and challenges in artificial intelligence text generation

Bing Li et al. “Advances and challenges in artificial intelligence text generation”. In: Frontiers of Information Technology & Electronic Engineering 25.1 (Jan. 2024), pp. 64–83. ISSN: 2095-9230. DOI: 10.1631/FITEE.2300410. URL: https://doi.org/10.1631/FITEE.2300410

work page doi:10.1631/fitee.2300410 2024

[18] [19]

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Summaries”. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013/

work page 2004

[19] [20]

Evaluating and enhancing large language models for novelty assessment in scholarly publications

Ethan Lin, Zhiyuan Peng, and Yi Fang. Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications. 2024. arXiv: 2409.16605 [cs.CL]. URL: https://arxiv.org/abs/2409.16605

work page arXiv 2024

[20] [21]

AIGC-Enabled Interdisciplinary Science Measurement

Jiangfeng Liu et al. “AIGC-Enabled Interdisciplinary Science Measurement”. In: Wisdom, Well-Being, Win-Win. Ed. by Isaac Sserwanga et al. Cham: Springer Nature Switzerland, 2024, pp. 161–170. ISBN: 978-3-031-57850-2

work page 2024

[21] [22]

PersonaFlow: Boosting research ideation with LLM-simulated expert personas

Yiren Liu et al. “PersonaFlow: Boosting research ideation with LLM-simulated expert personas”. In: arXiv preprint arXiv:2409.12538 (2024)

work page arXiv 2024

[22] [23]

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu et al. ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition. 2025. arXiv: 2503.21248 [cs.CL]. URL: https://arxiv.org/abs/2503.21248

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu et al. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. 2024. arXiv: 2408.06292 [cs.AI]. URL: https://arxiv.org/abs/2408.06292

work page internal anchor Pith review Pith/arXiv arXiv

[25] [26]

Quantifying and addressing uncertainty in the measurement of interdisciplinarity

Maryam Nakhoda, Peter Whigham, and Sander Zwanenburg. “Quantifying and addressing uncertainty in the measurement of interdisciplinarity”. In: Scientometrics 128.11 (Sept. 2023), pp. 6107–6127. ISSN: 0138-9130. DOI: 10.1007/s11192-023-04822-2. URL: https://doi.org/10.1007/s11192-023-04822-2

work page doi:10.1007/s11192- 2023

[26] [27]

Introducing OpenAI o1-preview

OpenAI. Introducing OpenAI o1-preview. https://openai.com/index/introducing-openai-o1-preview/. Accessed: 2025-05-15. 2024

work page 2025

[27] [28]

GPT-4 Technical Report

OpenAI et al. GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL]. URL: https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [29]

BLEU: a method for automatic evaluation of machine translation

Kishore Papineni et al. “BLEU: a method for automatic evaluation of machine translation”. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2002, pp. 311–318. DOI: 10.3115/1073083.1073135. URL: https://doi.org/10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002

[30] [31]

Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination

Marissa Radensky et al. Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination. 2025. arXiv: 2409.14634 [cs.HC]. URL: https://arxiv.org/abs/2409.14634

work page internal anchor Pith review arXiv 2025

[31] [32]

Multi-, Inter-, and Transdisciplinarity within the Public Health Workforce: A Scoping Review to Assess Definitions and Applications of Concepts

Kerstin Sell et al. “Multi-, Inter-, and Transdisciplinarity within the Public Health Workforce: A Scoping Review to Assess Definitions and Applications of Concepts”. In: International Journal of Environmental Research and Public Health 19.17 (2022). ISSN: 1660-4601. DOI: 10.3390/ijerph191710902. URL: https://www.mdpi.com/1660-4601/19/17/10902

work page doi:10.3390/ijerph191710902 2022

[32] [33]

Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines

Feng Shi and James Evans. “Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines”. In: Conference on Human Factors in Computing Systems | CHI Workshop 2024 (2024). DOI: 10.1038/s41467-023-36741-4

work page doi:10.1038/s41467- 2024

[33] [34]

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers”. In: The Thirteenth International Conference on Learning Representations. 2025. URL: https://openreview.net/forum?id=M23dTGWCZy

work page 2025

[34] [35]

Automated content analysis and crisis communication research

Toni GLA Van Der Meer. “Automated content analysis and crisis communication research”. In: Public Relations Review 42.5 (2016), pp. 952–961.

work page 2016

[35] [36]

A Theoretical Analysis of NDCG Type Ranking Measures

Yining Wang et al. A Theoretical Analysis of NDCG Type Ranking Measures. 2013. arXiv: 1304.6480 [cs.LG]. URL: https://arxiv.org/abs/1304.6480

work page internal anchor Pith review Pith/arXiv arXiv 2013

[36] [37]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei et al. “Chain-of-thought prompting elicits reasoning in large language models”. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22. New Orleans, LA, USA: Curran Associates Inc., 2022. ISBN: 9781713871088

work page 2022

[37] [38]

Identifying multidisciplinary problems from scientific publications based on a text generation method

Ziyan Xu et al. “Identifying multidisciplinary problems from scientific publications based on a text generation method”. In: Journal of Data and Information Science 9.3 (2024), pp. 213–237. DOI: 10.2478/jdis-2024-0021. URL: https://doi.org/10.2478/jdis-2024-0021

work page doi:10.2478/jdis-2024-0021 2024

[38] [39]

Large Language Models for Automated Open-domain Scientific Hypotheses Discovery

Zonglin Yang et al. “Large Language Models for Automated Open-domain Scientific Hypotheses Discovery”. In: Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 13545–13565. DOI: 10.18653/v1/2024.findings-a...

work page doi:10.18653/v1/2024.findings- 2024

[39] [40]

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Zonglin Yang et al. “MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses”. In: The Thirteenth International Conference on Learning Representations. 2025. URL: https://openreview.net/forum?id=X9OfMNNepI

work page 2025

[40] [41]

DiscipLink: Unfolding Interdisciplinary Information Seeking Process via Human-AI Co-Exploration

Chengbo Zheng et al. “DiscipLink: Unfolding Interdisciplinary Information Seeking Process via Human-AI Co-Exploration”. In: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. UIST ’24. Pittsburgh, PA, USA: Association for Computing Machinery, 2024. ISBN: 9798400706288. DOI: 10.1145/3654777.3676366. URL: https://doi.o...

work page doi:10.1145/3654777.3676366 2024