Heterogeneous Scientific Foundation Model Collaboration
Pith reviewed 2026-05-07 08:45 UTC · model grok-4.3
The pith
Eywa adds language-based reasoning interfaces to domain-specific foundation models so they can join agentic systems on non-linguistic data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Eywa is a heterogeneous agentic framework that augments domain-specific foundation models with a language-model-based reasoning interface. This interface lets language models guide inference over non-linguistic data modalities, so that predictive foundation models can participate in higher-level reasoning and decision-making. The framework can replace a single-agent pipeline, integrate specialized agents into multi-agent systems, or use planning-based orchestration to coordinate both kinds of agents across modalities.
What carries the argument
The language-model-based reasoning interface added to domain-specific foundation models, which converts language guidance into operations on specialized non-text data while keeping the model's original strengths intact.
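A minimal sketch can make this interface concrete, assuming a tool-style adapter design; the names (`DomainAdapter`, `encode`, `decode`) are invented for illustration and are not the paper's API.

```python
# Illustrative-only sketch: an adapter translates a language request into
# the native input of a frozen domain model, then renders the output back
# as text. Only the boundary is linguistic; the model stays untouched.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DomainAdapter:
    name: str
    description: str                 # natural-language "tool card" shown to the LLM
    encode: Callable[[str], Any]     # language request -> native model input
    model: Callable[[Any], Any]      # the unchanged domain foundation model
    decode: Callable[[Any], str]     # native output -> language summary

    def invoke(self, request: str) -> str:
        # Translation happens only at the boundary.
        return self.decode(self.model(self.encode(request)))

# Toy stand-in for a specialized predictive model.
def toy_regressor(x: float) -> float:
    return 2.0 * x + 1.0

adapter = DomainAdapter(
    name="toy_regressor",
    description="Predicts a numeric property from a numeric feature.",
    encode=lambda req: float(req.split("=")[1]),
    model=toy_regressor,
    decode=lambda y: f"predicted value: {y:.1f}",
)
print(adapter.invoke("x=3"))  # -> predicted value: 7.0
```

The design choice mirrored here is that language appears only at the input/output boundary, which is why the domain model's native accuracy can in principle be preserved.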
If this is right
- EywaAgent can replace a single language-model agent in existing pipelines.
- EywaMAS swaps in specialized agents within multi-agent systems.
- EywaOrchestra uses a planner to route tasks across language and non-language models.
- Tasks involving structured or domain-specific data show measurable accuracy gains.
- Collaboration with specialized foundation models reduces the system's overall reliance on language-only reasoning.
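The orchestration variant among these predictions can be sketched in a few lines, in the spirit of EywaOrchestra: a planner inspects each task and dispatches it to a language agent or a specialized agent. The routing rule and agent stubs below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of planner-style orchestration: route non-linguistic
# payloads to a specialized agent, everything else to a language agent.

def language_agent(task: dict) -> str:
    return f"LLM answer to: {task['query']}"

def timeseries_agent(task: dict) -> str:
    # Stand-in for a forecasting foundation model: naive last-value forecast.
    return f"forecast: {task['series'][-1]}"

def plan(task: dict) -> str:
    # Toy planning rule: structured numeric data goes to the specialist.
    return "timeseries" if "series" in task else "language"

AGENTS = {"language": language_agent, "timeseries": timeseries_agent}

def orchestrate(tasks: list) -> list:
    return [AGENTS[plan(t)](t) for t in tasks]

results = orchestrate([
    {"query": "Summarize the experiment."},
    {"series": [1.0, 2.0, 3.0]},
])
print(results)
```

A real planner would be an LLM rather than a keyword rule, but the dispatch structure is the same.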
Where Pith is reading between the lines
- Similar interfaces could be tested on engineering or medical simulation models outside the paper's science focus.
- Dynamic planning might reduce mismatch errors when agents must choose between text and numeric tools.
- The approach could encourage developers to build lightweight adapters rather than retraining full models for each modality.
- Broader use might shift scientific AI design toward modular interfaces instead of monolithic language models.
Load-bearing premise
That attaching a language-based reasoning interface lets language models effectively direct inference inside domain-specific models without harming their specialized accuracy.
What would settle it
An experiment in which Eywa shows no performance gain or even lower accuracy than either standalone specialized models or pure language-model agents on the same structured-data scientific tasks.
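That falsification criterion can be operationalized as a three-way comparison on the same task. The toy models below are stand-ins invented for illustration; only the comparison logic matters.

```python
# Hedged sketch of the settling experiment: compare a standalone
# specialized model, a pure language-model agent, and an Eywa-style
# combination on the same structured-data task.

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Toy structured-data task: classify whether a value exceeds 0.5.
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

specialist = lambda x: int(x > 0.5)   # domain model: correct decision rule
llm_only   = lambda x: int(x > 0.8)   # language agent: cruder heuristic
eywa_style = lambda x: specialist(x)  # LLM delegates to the specialist

scores = {name: accuracy(fn, data)
          for name, fn in [("specialist", specialist),
                           ("llm_only", llm_only),
                           ("eywa", eywa_style)]}
print(scores)
# The claim fails if the Eywa-style score drops below both baselines.
```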
Original abstract
Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic framework designed to extend language-centric systems to a broader class of scientific foundation models. The key idea of Eywa is to augment domain-specific foundation models with a language-model-based reasoning interface, enabling language models to guide inference over non-linguistic data modalities. This design allows predictive foundation models, which are typically optimized for specialized data and tasks, to participate in higher-level reasoning and decision-making processes within agentic systems. Eywa can serve as a drop-in replacement for a single-agent pipeline (EywaAgent) or be integrated into existing multi-agent systems by replacing traditional agents with specialized agents (EywaMAS). We further investigate a planning-based orchestration framework in which a planner dynamically coordinates traditional agents and Eywa agents to solve complex tasks across heterogeneous data modalities (EywaOrchestra). We evaluate Eywa across a diverse set of scientific domains spanning physical, life, and social sciences. Experimental results demonstrate that Eywa improves performance on tasks involving structured and domain-specific data, while reducing reliance on language-based reasoning through effective collaboration with specialized foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Eywa, a heterogeneous agentic framework that augments domain-specific scientific foundation models with language-model-based reasoning interfaces. This enables language models to guide inference over non-linguistic data modalities, allowing specialized predictive models to participate in higher-level reasoning and decision-making within agentic systems. The framework is presented in three forms: EywaAgent as a drop-in single-agent replacement, EywaMAS for integration into multi-agent systems, and EywaOrchestra as a planning-based orchestration layer that dynamically coordinates traditional and Eywa agents. The authors evaluate the approach across physical, life, and social science domains and claim that it improves performance on structured and domain-specific data tasks while reducing reliance on language-based reasoning.
Significance. If the empirical claims hold under rigorous validation, the work could meaningfully advance integration of specialized scientific foundation models into agentic AI systems. By providing a general interface layer rather than requiring end-to-end retraining, Eywa addresses a practical gap between general-purpose language agents and high-performance domain models. The orchestration variant further suggests a path toward dynamic, modality-aware planning. These contributions would be of interest to researchers working on scientific AI, multi-agent systems, and foundation-model collaboration, provided the performance gains are shown to be robust across baselines and tasks.
Major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The central claim that 'Eywa improves performance on tasks involving structured and domain-specific data, while reducing reliance on language-based reasoning' is load-bearing, yet the abstract supplies no information on experimental design, baselines, metrics, datasets, error bars, or statistical tests. If the full manuscript does not contain a complete experimental section with quantitative comparisons (e.g., against standard LLM agents, direct fine-tuning, or modality-specific pipelines) and ablation studies isolating the reasoning-interface contribution, the support for the performance and 'reduced language reliance' assertions cannot be evaluated. This must be addressed with concrete tables, figures, and reproducibility details.
- [§3] §3 (Framework Description): The weakest assumption—that a language-model-based reasoning interface can effectively guide inference over non-linguistic data modalities without compromising the specialized capabilities of the domain foundation models—is stated but not formally characterized. The manuscript should provide either a precise interface specification (e.g., input/output formats, prompt templates, or API contracts) or empirical evidence that the interface preserves the original model's accuracy on its native tasks. Without this, it is unclear whether the collaboration mechanism is general or task-specific.
Minor comments (2)
- [Throughout] The acronyms EywaAgent, EywaMAS, and EywaOrchestra are introduced without an explicit nomenclature table or consistent usage pattern across sections; a short table mapping names to roles would improve readability.
- [Abstract and §4] The abstract states evaluation 'across a diverse set of scientific domains' but does not list the specific tasks or datasets; the experimental section should include an explicit enumeration (e.g., Table 1) for traceability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below, providing clarifications on the experimental rigor and framework formalization. Where appropriate, we have revised the manuscript to strengthen the presentation of results and interface details.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that 'Eywa improves performance on tasks involving structured and domain-specific data, while reducing reliance on language-based reasoning' is load-bearing, yet the abstract supplies no information on experimental design, baselines, metrics, datasets, error bars, or statistical tests. If the full manuscript does not contain a complete experimental section with quantitative comparisons (e.g., against standard LLM agents, direct fine-tuning, or modality-specific pipelines) and ablation studies isolating the reasoning-interface contribution, the support for the performance and 'reduced language reliance' assertions cannot be evaluated. This must be addressed with concrete tables, figures, and reproducibility details.
Authors: The full manuscript contains a comprehensive §4 (Experiments) section with quantitative evaluations across physical, life, and social science domains. This includes direct comparisons to standard LLM agents, fine-tuned baselines, and modality-specific pipelines, along with ablation studies that isolate the contribution of the reasoning interface. Tables report performance metrics with error bars and statistical significance tests; datasets and reproducibility details (including code and hyperparameters) are provided in the appendix. We agree that the abstract is high-level and will expand it in the revision to briefly summarize the experimental design, key baselines, main metrics, and core findings while preserving its concise nature. revision: yes
-
Referee: [§3] §3 (Framework Description): The weakest assumption—that a language-model-based reasoning interface can effectively guide inference over non-linguistic data modalities without compromising the specialized capabilities of the domain foundation models—is stated but not formally characterized. The manuscript should provide either a precise interface specification (e.g., input/output formats, prompt templates, or API contracts) or empirical evidence that the interface preserves the original model's accuracy on its native tasks. Without this, it is unclear whether the collaboration mechanism is general or task-specific.
Authors: Section 3 describes the Eywa reasoning interface as a modular augmentation layer that translates between language-based agent instructions and the native input/output formats of domain-specific foundation models. Empirical evidence that this interface preserves (and in many cases improves) native task accuracy is presented in §4 through side-by-side comparisons showing that Eywa-augmented models retain or exceed the performance of standalone domain models on their original tasks while enabling higher-level agentic reasoning. To address the request for formal characterization, we will add a dedicated subsection in the revised §3 that specifies the interface contract, including standardized input/output schemas, prompt templates for the language-model wrapper, and API-level contracts that ensure generality across modalities. revision: yes
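One plausible rendering of such an interface contract is a typed request/response schema with a validation step. Every field name below is an assumption for illustration, not taken from the manuscript.

```python
# Hypothetical interface contract: a uniform request/response schema that
# any Eywa-style adapter could expose, regardless of modality. Field
# names and the validation rule are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class InterfaceRequest:
    task: str          # e.g. "property_prediction"
    modality: str      # e.g. "smiles", "timeseries", "tabular"
    payload: dict      # native-format inputs for the domain model

@dataclass
class InterfaceResponse:
    summary: str                              # language rendering for the calling LLM
    raw: dict = field(default_factory=dict)   # native outputs retained for auditing

def validate(req: InterfaceRequest, supported: set) -> None:
    # Contract check: an adapter rejects modalities it did not declare.
    if req.modality not in supported:
        raise ValueError(f"unsupported modality: {req.modality}")

req = InterfaceRequest("property_prediction", "smiles", {"smiles": "CCO"})
validate(req, supported={"smiles"})   # passes silently
resp = InterfaceResponse(summary="illustrative logP estimate: 0.2", raw={"logP": 0.2})
print(resp.summary)
```

Pinning the contract down at this level would let the generality claim be checked adapter by adapter rather than task by task.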
Circularity Check
No significant circularity; derivation chain absent
Full rationale
The manuscript introduces an architectural framework (Eywa) for interfacing language models with domain-specific scientific foundation models via a reasoning layer. No equations, derivations, fitted parameters, or mathematical claims appear in the abstract or are indicated in the full text. All performance assertions rest on experimental evaluations across physical, life, and social science tasks rather than any reduction to self-defined inputs or self-citations. The central design (augmenting models with a language-based interface) is presented as an engineering choice, not derived from prior results by the same authors. This satisfies the default expectation of a non-circular empirical/architectural paper.
Axiom & Free-Parameter Ledger
Invented entities (1)
-
Eywa framework and its variants (EywaAgent, EywaMAS, EywaOrchestra)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.
Reference graph
Works this paper leans on
-
[1]
OpenAI. GPT-4 technical report.CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URLhttps://doi.org/10.48550/arXiv.2303.08774
-
[2]
Gemma 3 technical report
Gemma Team. Gemma 3 technical report.CoRR, abs/2503.19786, 2025. doi: 10.48550/ARXIV.2503. 19786. URLhttps://doi.org/10.48550/arXiv.2503.19786
-
[3]
Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407. 21783. URLhttps://doi.org/10.48550/arXiv.2407.21783
-
[4]
arXiv preprint arXiv:2601.12538 (2026)
Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, and Jingrui He. Agentic reas...
-
[5]
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Hen...
-
[6]
Latent collaboration in multi-agent systems
Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, and Ling Yang. Latent collaboration in multi-agent systems.CoRR, abs/2511.20639, 2025. doi: 10.48550/ARXIV.2511.20639. URL https://doi.org/10.48550/arXiv.2511.20639
-
[7]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.CoRR, abs/2307.03109, 2023. doi: 10.48550/ARXIV.2307.03109. URLhttps://doi.org/10.48550/arXiv.2307.03109
-
[8]
How far are we from AGI: Are LLMs all we need?
Tao Feng, Chuanyang Jin, Jingyu Liu, Kunlun Zhu, Haoqin Tu, Zirui Cheng, Guanyu Lin, and Jiaxuan You. How far are we from AGI: Are LLMs all we need? Trans. Mach. Learn. Res., 2024. URL https://openreview.net/forum?id=H2ZKqfNd0U
2024
-
[9]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...
-
[10]
Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji...
-
[11]
SMILES, a chemical language and information system
David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules.J. Chem. Inf. Comput. Sci., 28(1):31–36, 1988. doi: 10.1021/CI00057A005. URL https://doi.org/10.1021/ci00057a005
-
[12]
The era5 global reanalysis
Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis. Quarterly journal of the royal meteorological society, 146(730):1999–2049, 2020
-
[13]
LLM-SRBench: A new benchmark for scientific equation discovery with large language models
Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D. Doan, and Chandan K. Reddy. LLM-SRBench: A new benchmark for scientific equation discovery with large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Forty-second Internat...
2025
-
[14]
UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res., 51(D1):523–531, 2023. doi: 10.1093/NAR/GKAC1052. URL https://doi.org/10.1093/nar/gkac1052
-
[15]
Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, and Bowen Zhou. From AI for science to agentic science: A survey on autonomous scientific...
-
[16]
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, and Zhiyong Wu. Scienceboard: Evaluating multimodal autonomous agents in realistic scientific workflo...
-
[17]
AI scientists produce results without reasoning scientifically
Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, NM Krishnan, and Kevin Maik Jablonka. Ai scientists produce results without reasoning scientifically.arXiv preprint arXiv:2604.18805, 2026
-
[18]
Sidharth S. Menon, Trishit Mondal, Shuvayan Brahmachary, Aniruddha Panda, Subodh M. Joshi, Kaushic Kalyanaraman, and Ameya D. Jagtap. On scientific foundation models: Rigorous definitions, key applications, and a comprehensive survey.Neural Networks, 198:108567, 2026. doi: 10.1016/J. NEUNET.2026.108567. URLhttps://doi.org/10.1016/j.neunet.2026.108567
-
[19]
Shengchao Chen, Guodong Long, Jing Jiang, Dikai Liu, and Chengqi Zhang. Foundation models for weather and climate data understanding: A comprehensive survey.CoRR, abs/2312.03014, 2023. doi: 10.48550/ARXIV.2312.03014. URLhttps://doi.org/10.48550/arXiv.2312.03014
-
[20]
Foundational models defining a new era in vision: A survey and outlook
Muhammad Awais, Muzammal Naseer, Salman H. Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook.CoRR, abs/2307.13721, 2023. doi: 10.48550/ARXIV.2307.13721. URLhttps://doi.org/10.48550/arXiv.2307.13721
-
[21]
Foundation models for time series: A survey
Siva Rama Krishna Kottapalli, Karthik Hubli, Sandeep Chandrashekhara, Garima Jain, Sunayana Hubli, Gayathri Botla, and Ramesh Doddaiah. Foundation models for time series: A survey.CoRR, abs/2504.04011, 2025. doi: 10.48550/ARXIV.2504.04011. URLhttps://doi.org/10.48550/ arXiv.2504.04011
-
[22]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin...
-
[23]
A foundation model for clinician-centered drug repurposing
Kexin Huang, Payal Chandak, Qianwen Wang, Shreyas Havaldar, Akhil Vaid, Jure Leskovec, Girish N Nadkarni, Benjamin S Glicksberg, Nils Gehlenborg, and Marinka Zitnik. A foundation model for clinician-centered drug repurposing.Nature Medicine, 30(12):3601–3613, 2024
2024
-
[24]
Rémi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Alexander Pritzel, Suman V. Ravuri, Timo Ewalds, Ferran Alet, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Jacklynn Stott, Oriol Vinyals, Shakir Mohamed, and Peter W. Battaglia. GraphCast: Learning skillful medium-range global weather forecasting. CoRR, abs/22...
-
[25]
On the foundations of earth and climate foundation models
Xiao Xiang Zhu, Zhitong Xiong, Yi Wang, Adam J. Stewart, Konrad Heidler, Yuanyuan Wang, Zhenghang Yuan, Thomas Dujardin, Qingsong Xu, and Yilei Shi. On the foundations of earth and climate foundation models. CoRR, abs/2405.04285, 2024. doi: 10.48550/ARXIV.2405.04285. URL https://doi.org/10.48550/arXiv.2405.04285
-
[26]
Foundation models for the electric power grid
Hendrik F Hamann, Blazhe Gjorgiev, Thomas Brunschwiler, Leonardo SA Martins, Alban Puech, Anna Varbella, Jonas Weiss, Juan Bernabe-Moreno, Alexandre Blondin Massé, Seong Lok Choi, et al. Foundation models for the electric power grid.Joule, 8(12):3245–3258, 2024
2024
-
[27]
OlmoEarth : Stable latent image modeling for multimodal earth observation
Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, et al. Olmoearth: Stable latent image modeling for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025
-
[28]
Large language model based multi-agents: A survey of progress and challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8048–8057. ij...
2024
-
[29]
MetaGPT: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, ICLR 2024, ...
2024
-
[30]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Sean Follmer, Jeff Han, Jürgen Steimle, and Nathalie Henry Riche, editors,Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Fra...
-
[31]
Bingyu Yan, Xiaoming Zhang, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, and Chaozhuo Li. Beyond self-talk: A communication-centric survey of llm-based multi-agent systems.CoRR, abs/2502.14321, 2025. doi: 10.48550/ARXIV.2502.14321. URLhttps://doi.org/10.48550/ arXiv.2502.14321
-
[32]
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. Multi-agent collaboration mechanisms: A survey of llms.CoRR, abs/2501.06322, 2025. doi: 10.48550/ARXIV.2501.06322. URLhttps://doi.org/10.48550/arXiv.2501.06322
-
[33]
A survey on large language model based autonomous agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers Comput. Sci., 18(6):186345, 2024. doi: 10. 1007/S11704-024-40231-1. URLhttps://doi.org/10.1007/s11704-024-40231-1
-
[34]
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...
-
[35]
Model context protocol
Anthropic. Model context protocol. https://docs.anthropic.com/en/docs/agents-and-tools/mcp, 2024
2024
-
[36]
MoleculeNet: A benchmark for molecular machine learning
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. Moleculenet: A benchmark for molecular machine learning.CoRR, abs/1703.00564, 2017. URLhttp://arxiv.org/abs/1703.00564
-
[37]
SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines
M-A-P Team. Supergpqa: Scaling LLM evaluation across 285 graduate disciplines.CoRR, abs/2502.14739, 2025. doi: 10.48550/ARXIV.2502.14739. URLhttps://doi.org/10.48550/ arXiv.2502.14739
-
[38]
Physicsarena: The first multimodal physics reasoning benchmark exploring variable, process, and solution dimensions
Song Dai, Yibo Yan, Jiamin Su, Dongfang Zihao, Yubo Gao, Yonghua Hei, Jungang Li, Junyan Zhang, Sicheng Tao, Zhuoran Gao, and Xuming Hu. Physicsarena: The first multimodal physics reasoning benchmark exploring variable, process, and solution dimensions. In Christos Christodoulopoulos, Tan- moy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings o...
2025
-
[39]
Phybench: Holistic evaluation of physical perception and reasoning in large language models
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li,...
-
[40]
Ming Yin, Yuanhao Qu, Dyllan Liu, Ling Yang, Le Cong, and Mengdi Wang. Genome-bench: A scientific reasoning benchmark from real-world expert discussions.CoRR, abs/2505.19501, 2025. doi: 10.48550/ARXIV.2505.19501. URLhttps://doi.org/10.48550/arXiv.2505.19501
-
[41]
SciVid: Cross-domain evaluation of video models in scientific applications
Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, and Andrew Zisserman. SciVid: Cross-domain evaluation of video models in scientific applications. CoRR, abs/2507.03578, 2025. doi: 10.48550/ARXIV.2507.03578. URL https://doi...
-
[42]
OceanBench: The sea surface height edition
J. Emmanuel Johnson, Quentin Febvre, Anastasiia Gorbunova, Sammy Metref, Maxime Ballarotta, Julien Le Sommer, and Ronan Fablet. OceanBench: The sea surface height edition. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural...
2023
-
[43]
Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, and Maosong Sun. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, T...
-
[44]
Evaluating Large Language Models in Scientific Discovery
Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, et al. Evaluating large language models in scientific discovery.arXiv preprint arXiv:2512.15567, 2025
-
[45]
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024
2024
-
[46]
fev-bench: A realistic benchmark for time series forecasting
Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, and Yuyang Wang. fev-bench: A realistic benchmark for time series forecasting.CoRR, abs/2509.26468, 2025. doi: 10.48550/ARXIV.2509.26468. URLhttps://doi. org/10.48550/arXiv.2509.26468
-
[47]
Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.CoRR, abs/2506.16791, 2025. doi: 10.48550/ARXIV.2506.16791. URLhttps://doi.org/10.48550/ arXiv.2506.16791
-
[48]
Self-refine: Iterative refinement with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Alice Oh, Tristan Naumann, Amir Globerson, K...
2023
-
[49]
Improving factuality and reasoning in language models through multiagent debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024... URL https://proceedings.mlr.press/v235/du24e.html
2024
-
[51]
Mixture-of-agents enhances large language model capabilities
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=h0ZfDIrj7T
2025
-
[52]
X-MAS: Towards building multi-agent systems with heterogeneous LLMs
Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, and Siheng Chen. X-MAS: Towards building multi-agent systems with heterogeneous LLMs. CoRR, abs/2505.16997, 2025. doi: 10.48550/ARXIV.2505.16997. URL https://doi.org/10.48550/arXiv.2505.16997
-
[53]
Chronos: Learning the language of time series
Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Türkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda-Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Bernie Wang. Chronos: Learning the langu...
2024
-
[54] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, et al. Chronos-2: From univariate to universal forecasting, 2025. doi: 10.48550/arXiv.2510.15821.
[55] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=cp5PvcI6w8_.
[56] OpenAI. OpenAI GPT-5 system card. CoRR, abs/2601.03267, 2026. doi: 10.48550/arXiv.2601.03267. URL https://doi.org/10.48550/arXiv.2601.03267.
[57] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[58] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[59] Anthropic. Claude family models. https://platform.claude.com/docs/en/about-claude/models/overview, 2025.
[60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35, NeurIPS 2022, 2022.
[61] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
[62] Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, et al. A survey of scientific large language models: From data foundations to agent frontiers. arXiv preprint arXiv:2508.21148, 2025.
[63] Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, and Jiawei Han. A comprehensive survey of scientific large language models and their applications in scientific discovery. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8783–8817, 2024.
[64] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
[65] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
[66] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 2022.
[67] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
[68] Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, et al. ChemLLM: A chemical large language model. arXiv preprint arXiv:2402.06852, 2024.
[69] Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, and Huan Sun. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391, 2024.
[70] Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, and Jie Tang. SciGLM: Training scientific language models with self-reflective instruction annotation and tuning. arXiv preprint arXiv:2401.07950, 2024.
[71] Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
[72] Alireza Ghafarollahi and Markus J. Buehler. SciAgents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Advanced Materials, 37(22):2413523, 2025.
[73] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
[74] Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 645–654, 2024.
[75] Shaghayegh Sadeghi, Alan Bui, Ali Forooghi, Jianguo Lu, and Alioune Ngom. Can large language models understand molecules? BMC Bioinformatics, 25(1):225, 2024.
[76] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024.
[77] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning, ICML 2024, 2024.
[78] Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Bilos, Hena Ghonia, Nadhir Vincent Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-Llama: Towards foundation models for time series forecasting. CoRR, abs/2310.08278, 2023. doi: 10.48550/arXiv.2310.08278.
[79] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8044):319–326, 2025. doi: 10.1038/S41586-024-08328-6. URL https://doi.org/10.1038/s41586-024-08328-6.
[80] Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schölk...