IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research
Pith reviewed 2026-05-19 03:44 UTC · model grok-4.3
The pith
IDRBench offers the first comprehensive benchmark for large language models in interdisciplinary research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IDRBench is the first framework to comprehensively investigate LLMs' interdisciplinary research (IDR) capability. It combines datasets with three tasks: IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation. Its evaluation of ten mainstream LLMs analyzes their behavior and sets benchmarks and baselines for future work.
What carries the argument
The IDRBench framework: a set of datasets plus three evaluation tasks designed to measure LLMs' ability to integrate knowledge across disciplines.
If this is right
- Establishes standardized benchmarks and baselines that future work on LLMs for cross-disciplinary tasks can track and improve upon.
- Reveals specific patterns in how current LLMs handle identification, integration, and recommendation of ideas from different fields.
- Creates a foundation for AI systems intended to assist researchers in locating and combining knowledge across disciplinary boundaries.
Where Pith is reading between the lines
- Models that improve on the recommendation task might be deployed to automatically surface potential new research directions that combine insights from distant fields.
- The benchmark approach could be adapted to measure AI performance on other forms of creative knowledge synthesis beyond research papers.
- Scores on IDRBench could serve as a selection criterion when choosing or adapting LLMs for real collaborative projects involving multiple disciplines.
Load-bearing premise
The three defined tasks serve as valid and sufficient proxies for measuring genuine interdisciplinary research capability in LLMs.
What would settle it
A controlled study in which human domain experts judge that high-scoring LLM outputs on the three IDRBench tasks do not produce or validate genuinely novel interdisciplinary research contributions.
Abstract
Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis of their behavior and establishes benchmarks and baselines for future research. To the best of our knowledge, IDRBench is the first to provide a comprehensive investigation of LLMs' IDR capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IDRBench, a benchmark framework with three tasks (IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation) and associated datasets to evaluate how well large language models can perform interdisciplinary research. It reports results on ten mainstream LLMs and claims this is the first comprehensive investigation of LLMs' IDR capabilities.
Significance. If the tasks are shown to be valid proxies, the benchmark would provide useful baselines and a starting point for measuring and improving LLMs' ability to integrate knowledge across fields, which is relevant given the role of interdisciplinarity in innovation.
major comments (2)
- [Task definitions and evaluation setup] The central claim that the three tasks measure LLMs' IDR capability rests on their validity as proxies, yet the manuscript reports no human expert validation, inter-annotator agreement scores, or correlation analysis showing that model performance tracks actual interdisciplinary synthesis (as opposed to retrieval or pattern matching on the curated data).
- [Evaluation and results] No human baselines or expert ratings of task realism are provided for any of the three tasks, which is required to interpret the LLM performance numbers as evidence of genuine IDR capability rather than benchmark-specific artifacts.
minor comments (2)
- [Abstract and introduction] The abstract's novelty claim ('to the best of our knowledge, IDRBench is the first') would be strengthened by a dedicated related-work subsection that explicitly contrasts the new tasks against prior LLM benchmarks on cross-domain reasoning or knowledge integration.
- [Dataset construction] Clarify the dataset construction process, including source selection criteria and any steps taken to mitigate domain-specific biases, to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of IDRBench as a valid benchmark for LLM interdisciplinary research capabilities.
Point-by-point responses
- Referee: [Task definitions and evaluation setup] The central claim that the three tasks measure LLMs' IDR capability rests on their validity as proxies, yet the manuscript reports no human expert validation, inter-annotator agreement scores, or correlation analysis showing that model performance tracks actual interdisciplinary synthesis (as opposed to retrieval or pattern matching on the curated data).
  Authors: We agree that explicit validation of the tasks as proxies for genuine IDR is important for interpreting the results. The three tasks were constructed by drawing on established definitions and examples of interdisciplinary research from the scholarly literature, with datasets drawn from real cross-field publications. Nevertheless, we acknowledge that the current manuscript does not include human expert validation or inter-annotator agreement statistics. In the revised version we will add a dedicated validation subsection that reports expert review of task realism on a sampled subset of instances together with inter-annotator agreement scores. Direct correlation analysis with downstream research impact would require longitudinal outcome data that is not yet available for this initial benchmark; we will note this limitation and identify it as an important avenue for follow-up studies.
  Revision: partial
- Referee: [Evaluation and results] No human baselines or expert ratings of task realism are provided for any of the three tasks, which is required to interpret the LLM performance numbers as evidence of genuine IDR capability rather than benchmark-specific artifacts.
  Authors: We appreciate the referee's emphasis on the need for human baselines to contextualize the reported LLM scores. In the revised manuscript we will include human performance baselines collected from domain experts on a representative sample of each task, along with expert ratings of task realism. These additions will allow readers to better distinguish benchmark-specific effects from broader IDR capabilities and will be presented alongside the existing LLM results.
  Revision: yes
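The inter-annotator agreement scores promised in the first response are not specified further in the rebuttal. As one possible illustration, the sketch below computes Cohen's kappa for two annotators who each rate the same sampled task instances. The choice of kappa, the function name `cohen_kappa`, the label set, and the example ratings are all assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical realism ratings from two experts on six sampled instances.
expert_1 = ["realistic", "realistic", "unrealistic",
            "realistic", "unrealistic", "realistic"]
expert_2 = ["realistic", "unrealistic", "unrealistic",
            "realistic", "unrealistic", "realistic"]

kappa = cohen_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple match rate when one label dominates. With more than two annotators, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual alternative.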
Circularity Check
No circularity: benchmark creation is a self-contained empirical contribution
The paper introduces IDRBench as a new framework consisting of three explicitly defined tasks (IDR Paper Identification, IDR Idea Integration, IDR Idea Recommendation) along with associated datasets and an evaluation of ten LLMs. No equations, fitted parameters, or derivations are present that reduce to prior inputs by construction. The central claim of providing the first comprehensive investigation rests on the novelty of the benchmark itself rather than any self-citation chain or self-definitional loop. The tasks are presented as proxies by design choice, not derived from fitted results or prior author work invoked as uniqueness theorems. This is a standard empirical benchmark paper with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Interdisciplinary research capability in LLMs can be meaningfully decomposed into the tasks of paper identification, idea integration, and idea recommendation.