HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL
Pith reviewed 2026-05-20 13:22 UTC · model grok-4.3
The pith
The paper argues that a small domain-tuned model trained with simulator-grounded RL can outperform frontier LLMs at hydrologic model calibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The performance ceiling of frontier LLMs on CREST calibration reflects a domain-grounding deficit, not a parameter-count deficit. HydroAgent is proposed to close this gap in two stages: supervised fine-tuning of an open-weight 4B model (Qwen3-4B) on 2,576 expert calibration trajectories, followed by Group-Relative Policy Optimization (GRPO) in which the reward is the Nash-Sutcliffe Efficiency (NSE) computed from online CREST runs. The paper claims the resulting policy approaches human-expert calibration quality while remaining far smaller and cheaper to run than the tested frontier systems.
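The group-relative step in GRPO can be sketched as follows. GRPO, as introduced in the DeepSeekMath paper, samples a group of candidate outputs for the same prompt and standardizes each reward against the group's mean and standard deviation instead of using a learned value function. The function name, the `eps` constant, and the example reward values below are illustrative assumptions; the summary does not state the exact normalization used in HydroAgent.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage).

    `rewards` holds one scalar per sampled candidate for the same prompt,
    e.g. the NSE obtained by each proposed parameter set.
    `eps` is an assumed stabilizer for groups with identical rewards.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical NSE rewards for four sampled calibration proposals.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Candidates that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. When every candidate in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.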
What carries the argument
Reinforcement learning with simulation feedback (RLSF), in which Nash-Sutcliffe Efficiency scores from live CREST model executions serve as the reward signal to optimize a domain-tuned policy.
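For readers unfamiliar with the metric, a minimal sketch of the standard Nash-Sutcliffe Efficiency follows: NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2). The function name and the handling of missing values are assumptions made for illustration; how the paper preprocesses streamflow before scoring is not described here.

```python
import numpy as np

def nash_sutcliffe_efficiency(observed, simulated):
    """Return the NSE of a simulated series against observations.

    1.0 is a perfect fit, 0.0 matches the skill of predicting the
    observed mean, and negative values are worse than the mean.
    """
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    if obs.shape != sim.shape:
        raise ValueError("observed and simulated must have the same shape")
    # Assumption: time steps with a missing value in either series are skipped.
    valid = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[valid], sim[valid]
    if obs.size == 0:
        raise ValueError("no overlapping valid time steps")
    denom = np.sum((obs - obs.mean()) ** 2)
    if denom == 0.0:
        raise ValueError("observed series has zero variance; NSE is undefined")
    return 1.0 - np.sum((obs - sim) ** 2) / denom
```

Because NSE is bounded above by 1 but unbounded below, a single badly wrong simulation can produce a very large negative value, which matters when the score is used directly as a reward.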
If this is right
- Domain-tuned small models can achieve expert-level performance in hydrologic calibration without the compute cost of frontier LLMs.
- Simulator-in-the-loop training ensures that learned adjustments respect physical constraints of the hydrologic model.
- The workflow eliminates the need for per-basin expert intervention by learning transferable calibration strategies.
- Earth system applications benefit from combining expert data with verifiable simulation rewards over pure language model scaling.
Where Pith is reading between the lines
- Similar simulator-grounded RL could improve AI performance in other fields with accurate physics simulators, such as climate modeling.
- Integrating satellite or in-situ observations directly into the reward signal might increase robustness without collecting more expert trajectories.
- The result suggests specialized grounded agents may be more practical than continued scaling of generic models for many physical-science tasks.
Load-bearing premise
The 2,576 expert calibration trajectories sufficiently represent the diversity of basins and conditions encountered in real-world deployment.
What would settle it
Running HydroAgent, without any retraining, on a fresh collection of 10+ basins that span new size and climate ranges, and checking whether its NSE consistently exceeds the 0.65-0.75 band reached by the strongest frontier models.
Original abstract
Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management: streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents (Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash) on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329 to 40,792 km² ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HydroAgent, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations: reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data (remote sensing, in-situ time series, and forecaster narrative) makes domain agents a leveraged direction for AI in physical science.
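The "best-of-twenty-rounds" statistic in the abstract can be read as the maximum NSE an agent attains over its calibration rounds on one gauge. A minimal sketch of that aggregation follows; the function name, the convention of marking failed rounds with `None`, and the example values are all hypothetical and are not taken from the paper.

```python
def best_of_rounds(nse_by_round):
    """Return (round_index, nse) for the best-scoring calibration round.

    `nse_by_round` lists one NSE per round in order; a round whose
    simulation failed is assumed to be recorded as None and is skipped.
    Ties resolve to the earliest round.
    """
    scored = [(i, nse) for i, nse in enumerate(nse_by_round) if nse is not None]
    if not scored:
        raise ValueError("no successful rounds to score")
    return max(scored, key=lambda pair: pair[1])

# Hypothetical per-round NSE values for one agent on one gauge.
best_round, best_nse = best_of_rounds([-0.3, 0.1, None, 0.62, 0.41])
```

Taking the maximum over rounds is an optimistic summary: it reports the best attempt rather than the last or the average, so it tends to rise with the number of rounds allowed.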
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks nine frontier LLMs (Claude Opus/Sonnet, GPT-5 variants, Gemini variants) on calibrating the CREST distributed hydrologic model across four held-out gauges (329–40,792 km²), reporting best-of-20-rounds NSE values from -0.16 to 0.75 with no model consistently matching human-expert performance. It then introduces HydroAgent: supervised fine-tuning of Qwen3-4B on 2,576 expert calibration trajectories followed by Group-Relative Policy Optimization (GRPO) that uses online NSE from the CREST simulator as a verifiable reward. The central claim is that a small domain-tuned policy with simulator-in-the-loop RL is more compute-efficient and physically faithful than scaling generic frontier models for Earth-system calibration tasks.
Significance. If HydroAgent were shown to reach or exceed the reported LLM ceiling (0.65–0.75 NSE) while approaching human-expert levels at far lower inference cost, the work would supply concrete evidence that simulator-grounded RL on modest open-weight models can outperform pure scaling for physically grounded inverse problems. This would be a leveraged direction for AI in hydrology and related geosciences where high-fidelity simulators exist.
major comments (2)
- [Abstract / Results] Abstract and Results section: The manuscript provides detailed benchmark numbers and error-free descriptions of the frontier-LLM evaluation but reports no quantitative results, error bars, ablation studies (e.g., on number of trajectories or GRPO steps), or head-to-head comparisons for HydroAgent itself against either the nine LLMs or strong non-LLM baselines such as Bayesian optimization or evolutionary algorithms. This omission is load-bearing for the central claim that the proposed method closes the gap.
- [Methods / Data] Methods / Data section (description of the 2,576 trajectories): The paper does not characterize the distribution of basin sizes, hydrograph signatures, forcing conditions, or geographic coverage represented by the expert trajectories used for SFT. Without this, it is impossible to assess whether GRPO updates driven solely by online NSE will generalize beyond the training distribution or merely reinforce simulator-specific artifacts, directly affecting the generalization and efficiency arguments versus larger pre-trained models.
minor comments (2)
- [Methods] The GRPO objective is described at a high level; an explicit equation or pseudocode for the group-relative advantage and the exact reward normalization would improve reproducibility.
- [Figures] Figure captions and axis labels for any hydrograph or NSE plots should explicitly state the number of independent runs and whether shaded regions represent standard deviation or inter-quartile range.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify key areas where additional detail will strengthen the manuscript, and we have revised accordingly.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results section: The manuscript provides detailed benchmark numbers and error-free descriptions of the frontier-LLM evaluation but reports no quantitative results, error bars, ablation studies (e.g., on number of trajectories or GRPO steps), or head-to-head comparisons for HydroAgent itself against either the nine LLMs or strong non-LLM baselines such as Bayesian optimization or evolutionary algorithms. This omission is load-bearing for the central claim that the proposed method closes the gap.
  Authors: We agree that the original submission under-emphasized quantitative evaluation of HydroAgent. The revised manuscript adds a dedicated Results subsection that reports mean NSE (with standard deviation across five independent runs) for HydroAgent on the four held-out gauges, ablation tables varying the number of SFT trajectories (500/1,000/2,576) and GRPO steps, and head-to-head comparisons against the nine frontier LLMs plus two strong non-LLM baselines (Bayesian optimization and differential evolution). These additions directly support the efficiency and performance claims.
  Revision: yes
- Referee: [Methods / Data] Methods / Data section (description of the 2,576 trajectories): The paper does not characterize the distribution of basin sizes, hydrograph signatures, forcing conditions, or geographic coverage represented by the expert trajectories used for SFT. Without this, it is impossible to assess whether GRPO updates driven solely by online NSE will generalize beyond the training distribution or merely reinforce simulator-specific artifacts, directly affecting the generalization and efficiency arguments versus larger pre-trained models.
  Authors: We concur that explicit characterization of the expert trajectories is necessary for evaluating generalization. The revised Methods section now includes summary statistics and supplementary figures describing the distribution of basin areas, key hydrograph signatures (peak magnitude, timing, baseflow index), meteorological forcing statistics, and geographic spread of the 2,576 trajectories. This documentation shows coverage across the size and regime range of the held-out gauges and helps address concerns about simulator-specific artifacts.
  Revision: yes
Circularity Check
Derivation chain is self-contained with external simulator reward and held-out evaluation
The paper benchmarks frontier LLMs on four held-out gauges with CREST, establishing a gap to human-expert NSE. HydroAgent is then constructed via SFT on 2,576 expert trajectories followed by GRPO that directly uses verifiable NSE from online CREST simulations as the reward. Because the reward is produced by an external simulator and evaluation occurs on held-out gauges separate from the training trajectories, the reported performance and efficiency claims do not reduce by construction to the input data or fitted parameters. No self-definitional loops, fitted-input predictions, or load-bearing self-citations are present in the described chain.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Expert calibration trajectories (2,576) are sufficient to initialize a policy that can be further improved by RL on the target simulator.
- Domain assumption: Nash-Sutcliffe Efficiency computed from online CREST runs is an adequate scalar reward for policy optimization.
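One practical wrinkle behind the second assumption is that raw NSE lies in (-inf, 1] and a simulator run can fail outright, so a reward built on it usually needs bounding. The sketch below shows one possible wrapper. The function name, the floor of -1.0, and the failure penalty are illustrative assumptions only; the paper's actual reward normalization is not described in this summary.

```python
import math

def nse_to_reward(nse, floor=-1.0, failure_reward=-1.0):
    """Map a raw NSE, or a failed run, to a bounded scalar reward.

    Assumptions (not from the paper): a failed simulation is passed as
    None or NaN and receives `failure_reward`; valid scores are clipped
    to the interval [floor, 1.0].
    """
    if nse is None or math.isnan(nse):
        return failure_reward
    return max(floor, min(1.0, nse))
```

Clipping keeps one catastrophic parameter set from dominating the reward statistics of its sampled group, at the cost of no longer distinguishing between very poor simulations that fall below the floor.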