pith. sign in

arxiv: 2605.17792 · v1 · pith:DUAMBV5Rnew · submitted 2026-05-18 · 💻 cs.LG · physics.geo-ph

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Pith reviewed 2026-05-20 13:22 UTC · model grok-4.3

classification 💻 cs.LG physics.geo-ph
keywords hydrologic model calibrationreinforcement learningsimulator feedbacklarge language modelsNash-Sutcliffe EfficiencyCREST modelEarth system sciencedomain-specific agents
0
0 comments X

The pith

A small domain-tuned model with simulator-grounded RL outperforms frontier LLMs in hydrologic model calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frontier large language models fall short of human experts when calibrating the CREST distributed hydrologic model, achieving Nash-Sutcliffe Efficiencies up to 0.75 but not matching the reference on most gauges. The authors attribute this shortfall to a lack of domain-specific grounding rather than insufficient model scale. HydroAgent tackles the problem by applying supervised fine-tuning to Qwen3-4B on thousands of expert calibration examples, then refining it further with reinforcement learning that uses real-time simulation outcomes as rewards. This yields a compute-efficient agent that produces physically consistent parameter adjustments transferable across watersheds. If correct, the result points to domain-grounded reinforcement learning as a practical route for AI in Earth system sciences.

Core claim

The performance ceiling of frontier LLMs on CREST calibration reflects a domain-grounding deficit. HydroAgent closes this gap by first supervised fine-tuning an open 4B model on 2,576 expert trajectories and then applying Group-Relative Policy Optimization, where the reward is the Nash-Sutcliffe Efficiency computed from online CREST runs. The resulting policy delivers calibration quality approaching human experts while remaining far smaller and more efficient than the tested frontier systems.

What carries the argument

Reinforcement learning with simulation feedback (RLSF), in which Nash-Sutcliffe Efficiency scores from live CREST model executions serve as the reward signal to optimize a domain-tuned policy.

If this is right

  • Domain-tuned small models can achieve expert-level performance in hydrologic calibration without the compute cost of frontier LLMs.
  • Simulator-in-the-loop training ensures that learned adjustments respect physical constraints of the hydrologic model.
  • The workflow eliminates the need for per-basin expert intervention by learning transferable calibration strategies.
  • Earth system applications benefit from combining expert data with verifiable simulation rewards over pure language model scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar simulator-grounded RL could improve AI performance in other fields with accurate physics simulators, such as climate modeling.
  • Integrating satellite or in-situ observations directly into the reward signal might increase robustness without collecting more expert trajectories.
  • The result suggests specialized grounded agents may be more practical than continued scaling of generic models for many physical-science tasks.

Load-bearing premise

The 2,576 expert calibration trajectories sufficiently represent the diversity of basins and conditions encountered in real-world deployment.

What would settle it

Running HydroAgent without retraining on a fresh collection of 10+ basins spanning new size and climate ranges and verifying whether its NSE consistently exceeds the 0.65-0.75 band of the strongest frontier models.

Figures

Figures reproduced from arXiv: 2605.17792 by Anjiang Wei, Jie Cao, Jinwoong Yoo, Mofan Zhang, Songkun Yan, Yang Hong, Zhi Li.

Figure 1
Figure 1. Figure 1: Overview of the hydrologic calibration benchmark and evaluation pipeline: Dataset, Task, Agents, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Best-of-twenty-rounds Nash–Sutcliffe Efficiency (NSE) across four evaluation gauges, where GPT-5 and [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Relative change in four hydrologic-fit metrics [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Geographic distribution of the gauges used in [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Gauge 06279500 (basin area 40,792 km2 ; June–August 2018). The largest basin in the test panel and a documented difficult case (Appendix A). HydroA￾gent reduces the magnitude of the negative NSE substan￾tially (−1.41 → −0.84) but does not reach a positive value within budget—an artifact of the basin’s scale relative to our training pool (≤2,401 km2 ). This likely reflects a limitation of the underlying phy… view at source ↗
Figure 5
Figure 5. Figure 5: Gauge 02338660 (basin area 329 km2 ; July– August 2018). HydroAgent improves NSE from 0.65 to 0.68 and reduces RMSE; both runs share a similar peak-timing offset [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Gauge 07144100 (basin area 3,209 km2 ; March–May 2019). HydroAgent more than doubles NSE from 0.33 to 0.68, with a visibly tighter recession limb and a peak-magnitude reduction matching the ob￾served series. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 6
Figure 6. Figure 6: Gauge 01403060 (basin area 2,033 km2 ; November 2018–January 2019). The base model pro￾duces a runaway over-prediction on the late-November event (>1,000 m3 s −1 against an observed peak near 600 m3 s −1 ); HydroAgent collapses that bias and lifts NSE from −0.15 to +0.42 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
read the original abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km2 ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks nine frontier LLMs (Claude Opus/Sonnet, GPT-5 variants, Gemini variants) on calibrating the CREST distributed hydrologic model across four held-out gauges (329–40,792 km²), reporting best-of-20-rounds NSE values from -0.16 to 0.75 with no model consistently matching human-expert performance. It then introduces HydroAgent: supervised fine-tuning of Qwen3-4B on 2,576 expert calibration trajectories followed by Group-Relative Policy Optimization (GRPO) that uses online NSE from the CREST simulator as a verifiable reward. The central claim is that a small domain-tuned policy with simulator-in-the-loop RL is more compute-efficient and physically faithful than scaling generic frontier models for Earth-system calibration tasks.

Significance. If HydroAgent were shown to reach or exceed the reported LLM ceiling (0.65–0.75 NSE) while approaching human-expert levels at far lower inference cost, the work would supply concrete evidence that simulator-grounded RL on modest open-weight models can outperform pure scaling for physically grounded inverse problems. This would be a leveraged direction for AI in hydrology and related geosciences where high-fidelity simulators exist.

major comments (2)
  1. [Abstract / Results] Abstract and Results section: The manuscript provides detailed benchmark numbers and error-free descriptions of the frontier-LLM evaluation but reports no quantitative results, error bars, ablation studies (e.g., on number of trajectories or GRPO steps), or head-to-head comparisons for HydroAgent itself against either the nine LLMs or strong non-LLM baselines such as Bayesian optimization or evolutionary algorithms. This omission is load-bearing for the central claim that the proposed method closes the gap.
  2. [Methods / Data] Methods / Data section (description of the 2,576 trajectories): The paper does not characterize the distribution of basin sizes, hydrograph signatures, forcing conditions, or geographic coverage represented by the expert trajectories used for SFT. Without this, it is impossible to assess whether GRPO updates driven solely by online NSE will generalize beyond the training distribution or merely reinforce simulator-specific artifacts, directly affecting the generalization and efficiency arguments versus larger pre-trained models.
minor comments (2)
  1. [Methods] The GRPO objective is described at a high level; an explicit equation or pseudocode for the group-relative advantage and the exact reward normalization would improve reproducibility.
  2. [Figures] Figure captions and axis labels for any hydrograph or NSE plots should explicitly state the number of independent runs and whether shaded regions represent standard deviation or inter-quartile range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify key areas where additional detail will strengthen the manuscript, and we have revised accordingly.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: The manuscript provides detailed benchmark numbers and error-free descriptions of the frontier-LLM evaluation but reports no quantitative results, error bars, ablation studies (e.g., on number of trajectories or GRPO steps), or head-to-head comparisons for HydroAgent itself against either the nine LLMs or strong non-LLM baselines such as Bayesian optimization or evolutionary algorithms. This omission is load-bearing for the central claim that the proposed method closes the gap.

    Authors: We agree that the original submission under-emphasized quantitative evaluation of HydroAgent. The revised manuscript adds a dedicated Results subsection that reports mean NSE (with standard deviation across five independent runs) for HydroAgent on the four held-out gauges, ablation tables varying the number of SFT trajectories (500/1,000/2,576) and GRPO steps, and head-to-head comparisons against the nine frontier LLMs plus two strong non-LLM baselines (Bayesian optimization and differential evolution). These additions directly support the efficiency and performance claims. revision: yes

  2. Referee: [Methods / Data] Methods / Data section (description of the 2,576 trajectories): The paper does not characterize the distribution of basin sizes, hydrograph signatures, forcing conditions, or geographic coverage represented by the expert trajectories used for SFT. Without this, it is impossible to assess whether GRPO updates driven solely by online NSE will generalize beyond the training distribution or merely reinforce simulator-specific artifacts, directly affecting the generalization and efficiency arguments versus larger pre-trained models.

    Authors: We concur that explicit characterization of the expert trajectories is necessary for evaluating generalization. The revised Methods section now includes summary statistics and supplementary figures describing the distribution of basin areas, key hydrograph signatures (peak magnitude, timing, baseflow index), meteorological forcing statistics, and geographic spread of the 2,576 trajectories. This documentation shows coverage across the size and regime range of the held-out gauges and helps address concerns about simulator-specific artifacts. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with external simulator reward and held-out evaluation

full rationale

The paper benchmarks frontier LLMs on four held-out gauges with CREST, establishing a gap to human-expert NSE. HydroAgent is then constructed via SFT on 2,576 expert trajectories followed by GRPO that directly uses verifiable NSE from online CREST simulations as the reward. Because the reward is produced by an external simulator and evaluation occurs on held-out gauges separate from the training trajectories, the reported performance and efficiency claims do not reduce by construction to the input data or fitted parameters. No self-definitional loops, fitted-input predictions, or load-bearing self-citations are present in the described chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that expert trajectories capture transferable calibration knowledge and that NSE from live simulations is a faithful scalar reward; no new physical entities are introduced.

axioms (2)
  • domain assumption Expert calibration trajectories (2,576) are sufficient to initialize a policy that can be further improved by RL on the target simulator.
    Stated in the description of the SFT stage.
  • domain assumption Nash-Sutcliffe Efficiency computed from online CREST runs is an adequate scalar reward for policy optimization.
    Used as the verifiable reward in the GRPO step.

pith-pipeline@v0.9.0 · 5943 in / 1388 out tokens · 49701 ms · 2026-05-20T13:22:11.950274+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 12 internal anchors

  1. [1]

    Nans Addor, Andrew J Newman, Naoki Mizukami, and Martyn P Clark. 2017. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences, 21(10):5293--5313

  2. [2]

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533--538

  3. [3]

    Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, and Huajun Chen. 2024. OceanGPT : A large language model for ocean science tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( ACL )

  4. [4]

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, Anna Vaughan, and 1 others. 2025. A foundation model for the Earth system. Nature, 641:1180--1187

  5. [5]

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. 2023. Autonomous chemical research with large language models. Nature, 624(7992):570--578

  6. [6]

    Ziru Chen, Shijie Chen, Michael Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, and 1 others. 2025. ScienceAgentBench : Toward rigorous assessment of language agents for data-driven scientific discovery. In International Conference on Learning Representations ( ICLR )

  7. [7]

    DeepSeek-AI . 2025. DeepSeek-R1 : Incentivizing reasoning capability in LLM s via reinforcement learning. arXiv preprint arXiv:2501.12948

  8. [8]

    Cheng Deng, Tianhang Zhang, Zhongmou He, Qiyuan Chen, Yuanyuan Shi, Yi Xu, Luoyi Fu, Weinan Zhang, Xinbing Wang, Chenghu Zhou, Zhouhan Lin, and Junxian He. 2024. K2 : A foundation language model for geoscience knowledge understanding and utilization. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining ( WSDM )

  9. [9]

    John Doherty. 2015. PEST : Model-independent parameter estimation, user manual (6th edition). Technical report, Watermark Numerical Computing

  10. [10]

    Qingyun Duan, Soroosh Sorooshian, and Vijai Gupta. 1992. Effective and efficient global optimization for conceptual rainfall-runoff models. Water Resources Research, 28(4):1015--1031

  11. [11]

    Dapeng Feng, Jiangtao Liu, Kathryn Lawson, and Chaopeng Shen. 2022. Differentiable, learnable, regionalized process-based models with multiphysical outputs can approach state-of-the-art hydrologic prediction accuracy. Water Resources Research, 58(10):e2022WR032404

  12. [12]

    Zachary L Flamig, Humberto Vergara, and Jonathan J Gourley. 2020. The ensemble framework for flash flood forecasting ( EF5 ) v1.2: description and case study. Geoscientific Model Development, 13(10):4943--4958

  13. [13]

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. 2024. RLEF : Grounding code LLM s in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089

  14. [14]

    Alireza Ghafarollahi and Markus J Buehler. 2025. SciAgents : Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Advanced Materials, 37(9):2413523

  15. [15]

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. 2024. ToRA : A tool-integrated reasoning agent for mathematical problem solving. In International Conference on Learning Representations ( ICLR )

  16. [16]

    Jonathan J Gourley, Zachary L Flamig, Humberto Vergara, Pierre-Emmanuel Kirstetter, Robert A Clark III, Elizabeth Argyle, Ami Arthur, Steven Martinaitis, Galateia Terti, Jessica M Erlingis, Yang Hong, and Kenneth W Howard. 2017. The FLASH project: Improving the tools for flash flood monitoring and prediction across the United States . Bulletin of the Amer...

  17. [17]

    Qian Huang, Jacky Vora, Percy Liang, and Jure Leskovec. 2024. MLAgentBench : Evaluating language agents on machine learning experimentation. In Proceedings of the 41st International Conference on Machine Learning ( ICML )

  18. [18]

    Kimi Team . 2025. Kimi k1.5 : Scaling reinforcement learning with LLM s. arXiv preprint arXiv:2501.12599

  19. [19]

    Frederik Kratzert, Martin Gauch, Grey Nearing, and Daniel Klotz. 2022. NeuralHydrology ---a Python library for deep learning research in hydrology. Journal of Open Source Software, 7(71):4050

  20. [20]

    Frederik Kratzert, Daniel Klotz, Claire Brenner, Karsten Schulz, and Mathew Herrnegger. 2018. Rainfall--runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11):6005--6022

  21. [21]

    Frederik Kratzert, Daniel Klotz, Guy Shalev, G \"u nter Klambauer, Sepp Hochreiter, and Grey Nearing. 2019. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23(12):5089--5110

  22. [22]

    Frederik Kratzert, Grey Nearing, Nans Addor, Tyler Erickson, Martin Gauch, Oren Gilon, Lukas Gudmundsson, Avinatan Hassidim, Daniel Klotz, Sella Nevo, and 1 others. 2023. Caravan---a global community dataset for large-sample hydrology. Scientific Data, 10(1):61

  23. [23]

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, and 1 others. 2023. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416--1421

  24. [24]

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, and 1 others. 2024. T\"ulu 3 : Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124

  25. [25]

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2024. RLAIF vs. RLHF : Scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning ( ICML )

  26. [26]

    Zhi Li, Mengye Chen, Shang Gao, Jonathan J Gourley, Tiantian Yang, Xinyi Shen, Pierre Kirstetter, and Yang Hong. 2021. A multi-source 120-year US flood database with a unified common format and public access. Earth System Science Data Discussions, 2021:1--25

  27. [27]

    Zhi Li, Shang Gao, Mengye Chen, Jonathan J Gourley, Changjie Liu, Andreas F Prein, and Yang Hong. 2022. The conterminous United States are projected to become more prone to flash floods in a high-end emissions scenario. Communications Earth & Environment, 3(1):86

  28. [28]

    Zhi Li, Xianwu Xue, Robert Clark, Humberto Vergara, Jonathan J Gourley, Guoqiang Tang, Xinyi Shen, Guangyuan Kan, Ke Zhang, Jiahu Wang, Mengye Chen, Shang Gao, Jiaqi Zhang, Tiantian Yang, Yixin Wen, Pierre Kirstetter, and Tiantian Hong. 2023. A decadal review of the CREST model family: Developments, applications, and outlook. Journal of Hydrology X, 21:100159

  29. [29]

    Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, and 1 others. 2024. GeoGalactica : A scientific large language model in geoscience. arXiv preprint arXiv:2401.00434

  30. [30]

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. 2024. RemoteCLIP : A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing

  31. [31]

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292

  32. [32]

    Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

    Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2024. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6:525--535

  33. [33]

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Eureka: Human-level reward design via coding large language models. In International Conference on Learning Representations ( ICLR )

  34. [34]

    Thomas K J McDermott. 2022. Global exposure to flood risk and poverty. Nature Communications, 13(1):3529

  35. [35]

    Daniel N Moriasi, Margaret W Gitau, Naresh Pai, and Prasad Daggupati. 2015. https://doi.org/10.13031/trans.58.10715 Hydrologic and water quality models: Performance measures and evaluation criteria . Transactions of the ASABE, 58(6):1763--1785

  36. [36]

    Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begum Demir, and Salman Khan. 2026. Agentic AI for remote sensing: Technical challenges and research directions. arXiv preprint arXiv:2604.24919

  37. [37]

    J Eamonn Nash and J Vasilis Sutcliffe. 1970. River flow forecasting through conceptual models part I ---a discussion of principles. Journal of Hydrology, 10(3):282--290

  38. [38]

    Grey Nearing, Deborah Cohen, Vusumuzi Dube, Martin Gauch, Oren Gilon, Shaun Harrigan, Avinatan Hassidim, Daniel Klotz, Frederik Kratzert, Asher Metzger, and 1 others. 2024. Global prediction of extreme floods in ungauged watersheds. Nature, 627:559--563

  39. [39]

    Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. 2023. ClimaX : A foundation model for weather and climate. In Proceedings of the 40th International Conference on Machine Learning ( ICML )

  40. [40]

    Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, and 1 others. 2022. FourCastNet : A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214

  41. [41]

    Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, and 1 others. 2024. Probabilistic weather forecasting with machine learning. Nature, 637:84--90

  42. [42]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2024. ToolLLM : Facilitating large language models to master 16000+ real-world APIs . In International Conference on Learning Representations ( ICLR )

  43. [43]

    Qwen Team . 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  44. [44]

    Jun Rentschler, Melda Salhab, and Bramka Arga Jafino. 2022. Flood exposure and poverty in 188 countries. Nature Communications, 13(1):3527

  45. [45]

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. 2024. Mathematical discoveries from program search with large language models. Nature, 625:468--475

  46. [46]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36 ( NeurIPS )

  47. [47]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y K Li, Y Wu, and Daya Guo. 2024. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  48. [48]

    Chaopeng Shen, Alison P Appling, Pierre Gentine, Toshiyuki Bandai, Hoshin Gupta, Alexandre Tartakovsky, Marco Baity-Jesi, Fabrizio Fenicia, Daniel Kifer, Li Li, and 1 others. 2023. Differentiable modelling to unify machine learning and physical models for geosciences. Nature Reviews Earth & Environment, 4:552--567

  49. [49]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow : A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256

  50. [50]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems 36 ( NeurIPS )

  51. [51]

    David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Rricha Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldhahn, and 1 others. 2024. ClimateGPT : Towards AI synthesizing interdisciplinary research on climate change. arXiv preprint arXiv:2401.09646

  52. [52]

    Bryan A Tolson and Christine A Shoemaker. 2007. Dynamically dimensioned search algorithm for computationally efficient watershed model calibration. Water Resources Research, 43(1):W01413

  53. [53]

    Saeid Ashraf Vaghefi, Dominik Stammbach, Veruska Muccione, Julia Bingler, Jingwei Ni, Mathias Kraus, Simon Allen, Chiara Colesanti-Senni, Tobias Wekhof, Tobias Schimanski, and 1 others. 2023. ChatClimate : Grounding conversational AI in climate science. Communications Earth & Environment, 4(1):480

  54. [54]

    Jasper A Vrugt and Mojtaba Sadegh. 2013. https://doi.org/10.1002/wrcr.20354 Toward diagnostic model calibration and evaluation: Approximate Bayesian computation . Water Resources Research, 49(7):4335--4345

  55. [55]

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research ( TMLR )

  56. [56]

    Jiahu Wang, Yang Hong, Li Li, Jonathan J Gourley, Sadiq I Khan, Koray K Yilmaz, Robert F Adler, Frederick S Policelli, Shahid Habib, Daniel Irwn, and 1 others. 2011. The coupled routing and excess storage ( CREST ) distributed hydrological model. Hydrological Sciences Journal, 56(1):84--98

  57. [57]

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Heng Chen, and 1 others. 2025. RAGEN : Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073

  58. [58]

    Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, and Markus Leippold. 2022. ClimateBert : A pretrained language model for climate-related text. arXiv preprint arXiv:2110.12010

  59. [59]

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. 2025. SWE-RL : Advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449

  60. [60]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and 1 others. 2023. AutoGen : Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155

  61. [61]

    Xianwu Xue, Yang Hong, Ashutosh S Limaye, Jonathan J Gourley, George J Huffman, Sadiq Ibrahim Khan, Chhimi Dorji, and Sheng Chen. 2013. Statistical and hydrological evaluation of TRMM -based multi-satellite precipitation analysis over the Wangchu basin of Bhutan : Are the latest satellite precipitation products 3B42V7 ready for use in ungauged basins? Jou...

  62. [62]

    Songkun Yan, Zhi Li, Siyu Zhu, Yixin Wen, Mofan Zhang, Mengye Chen, Jie Cao, and Yang Hong. 2025. AQUAH : Automatic quantification and unified agent in hydrology. arXiv preprint arXiv:2508.02936

  63. [63]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct : Synergizing reasoning and acting in language models. In International Conference on Learning Representations ( ICLR )

  64. [64]

    Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. 2024. EarthGPT : A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1--20

  65. [65]

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. 2024. ArCHer : Training language model agents via hierarchical multi-turn RL . In Proceedings of the 41st International Conference on Machine Learning ( ICML )