
arxiv: 2605.17856 · v1 · pith:ATW2UB6Snew · submitted 2026-05-18 · 💻 cs.AI

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

Pith reviewed 2026-05-20 10:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords across · process-based · agents · knowledge · modelling · scientific · simulation · barrier

The pith

Knowledge infrastructure (KI) enables AI agents to run process-based Earth science simulations with up to 84% success in a 3,000-trial hydrology benchmark, versus under 40% without it, and a toolkit produces KI for 117 further models, giving 119 KIs across 14 domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Process-based models in Earth science are complex computer programs that simulate processes such as water cycles and climate from physical laws. Using them requires substantial expert knowledge, which many people in at-risk communities lack. This paper proposes a Knowledge Infrastructure (KI) that packages this expert knowledge into actionable pieces that AI agents can use. The KI includes validated modelling operators, protocols for the different stages of a simulation, and diagnostic mechanisms that let the agent identify and recover from failures. In tests with a coupled hydrology workflow, agents using KI produced physically plausible simulations in up to 84 percent of trials, compared with less than 40 percent without it. The authors also built a toolkit that creates similar KI for 117 other models across 14 domains. The way experts decide on modelling choices and fix errors turned out to be similar across all these models, even though the underlying science varies. The authors take this to suggest that operational expertise has a common structure that can be captured and shared. If so, this could make advanced simulations available to more people and help scientists from different areas work together more easily.

Core claim

Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines, with modelling decisions and failure remedies converging across 119 KIs from 14 domains.
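The claim above gives headline rates only, without per-condition trial counts or uncertainty estimates. As a rough illustration of how such a gap could be assessed, the sketch below computes a binomial standard error per condition and a pooled two-proportion z-statistic. The trial counts used in the usage note are placeholder assumptions, not figures from the paper, and the function names are hypothetical.

```python
from __future__ import annotations

import math


def success_rate_se(successes: int, trials: int) -> tuple[float, float]:
    """Return (success rate, binomial standard error) for one condition."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    return p, math.sqrt(p * (1.0 - p) / trials)


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> float:
    """z-statistic for the null hypothesis that both conditions share one rate.

    Uses the pooled standard error, the usual choice for this test.
    """
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("standard error is zero; rates are degenerate")
    return (p1 - p2) / se
```

For example, under the assumed split of 1,500 trials per condition, `success_rate_se(1260, 1500)` describes an 84% arm and `two_proportion_z(1260, 1500, 585, 1500)` compares it with a 39% arm. A real analysis would also need to account for trials that share a basin or agent, which this independent-trials sketch ignores.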

Load-bearing premise

That the convergence of modelling decisions and failure remedies across models with different underlying physics demonstrates that operational expertise is structured and extractable in a general, non-ad-hoc form that can be autonomously dissected by the KDT toolkit.

Figures

Figures reproduced from arXiv: 2605.17856 by Birk Li, Jianyun Zhang, Junliang Jin, Liujun Zhu, Ruiqi Wu, Yichen Zhao, Yuchen Liu, Ziwei Li.

Figure 1: Knowledge infrastructure is evaluated through depth to scale. (a): Scale. KI construction and validation across 119 process-based models spanning 14 Earth-science domains tests whether the scaffold can be built beyond a single hand-authored workflow. Each column represents one model, coloured by the depth at which its KI was validated. Dark blue (n = 2): VIC and Lohmann routing, hand-built and tested under…
Figure 2: Knowledge dissection converts operational expertise into agent-usable knowledge infrastructure. (a): The Knowledge Dissection Toolkit (KDT) converts model source code, documentation and example cases into a self-contained knowledge infrastructure (KI) package. Knowledge dissection extracts three forms of operational expertise: procedural knowledge, encoded as validated modelling operators; evaluative know…
Figure 3: KI enables reliable agentic simulation in a coupled hydrological workflow. (a) Milestone-level completion across the 14-step VIC–Lohmann workflow, showing where agents exit the pipeline. Attrition is concentrated around the preparation for forcing and VIC setup, where cross-component dependencies (M1–M3) must be resolved. (b) Success rates across three Huai River basins. Agent rankings are broadly consiste…
Figure 4: End-to-end validation across both KI cohorts: 25 expert-supervised packages (a, b) and 92 autonomously dissected packages (c). (a): Per-site best performance. Left, 21 non-crop models on a Pearson r axis (threshold r = 0.5); right, 4 crop models on an |PBIAS| axis (threshold 25%). Each dot is one of the three best-performing sites selected per model among sites with a complete 3 × 3 (three dots per row). G…
Figure 5: Structural convergence across 119 models and 14 domains. (a) Tool category proportions by domain. Stacked horizontal bars show the distribution of 835 tools across seven conserved functional categories plus an OTHER residual. Annotation: all 119 models share the same seven operational stages. (b) Triplet counts per model by domain. Each point is one model (blue circles = auto-dissected, red diamonds = expe…
Figure 6: KI agents as model interfaces. (a): Left: financial, geographic, temporal, linguistic, and institutional barriers limit the use of process-based models across user types, from farmers to policymakers. Centre: KI architecture (validated modelling operators, staged domain protocols, diagnostic recovery mechanisms) lets AI agents operate canonical physics models end-to-end through natural-language interaction…
Figure 7: Extended Data
Figure 8: Extended Data
Figure 9: Extended Data
Figure 10: Extended Data
Figure 11: Extended Data
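The Figure 4 caption quotes two acceptance thresholds: Pearson r = 0.5 for non-crop models and |PBIAS| = 25% for crop models. A minimal sketch of how such scores could be computed from observed and simulated series follows. The function names are hypothetical, and two details are assumptions rather than statements from the paper: that a model passes at or above r = 0.5 and at or below 25% |PBIAS|, and that PBIAS is normalised by the sum of observations.

```python
import math
from typing import Sequence


def pearson_r(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Pearson correlation between observed and simulated values."""
    n = len(obs)
    if n != len(sim) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    if var_o == 0.0 or var_s == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_o * var_s)


def abs_pbias(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Absolute percent bias; the sign convention drops out under abs()."""
    total_obs = sum(obs)
    if len(obs) != len(sim) or total_obs == 0.0:
        raise ValueError("need equal-length series and non-zero observed sum")
    return abs(100.0 * sum(s - o for o, s in zip(obs, sim)) / total_obs)


def passes_threshold(obs: Sequence[float], sim: Sequence[float],
                     crop_model: bool) -> bool:
    """Apply the caption's thresholds (inclusive bounds are an assumption)."""
    if crop_model:
        return abs_pbias(obs, sim) <= 25.0
    return pearson_r(obs, sim) >= 0.5
```

The two metrics answer different questions: correlation rewards getting the timing and shape right even when magnitudes are off, while percent bias rewards getting totals right even when timing is off, which may be why the caption reports them on separate axes.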
Original abstract

Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.
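The abstract names three components of a KI: validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. The source does not give a schema, so the following is only a hypothetical sketch of how the three pieces might fit together in code. Every class, field, and function name here is an assumption made for illustration, and matching a recovery rule by stage name is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Operator:
    """A validated modelling step: takes workflow state, returns new state."""
    name: str
    run: Callable[[State], State]


@dataclass
class ProtocolStage:
    """One stage of a domain protocol: operators to run, then a check."""
    name: str
    operators: List[str]
    check: Callable[[State], bool]


@dataclass
class RecoveryRule:
    """A remedy to apply when the named stage fails its check."""
    stage: str
    remedy: Callable[[State], State]


@dataclass
class KnowledgeInfrastructure:
    model: str
    operators: Dict[str, Operator] = field(default_factory=dict)
    protocol: List[ProtocolStage] = field(default_factory=list)
    recovery: List[RecoveryRule] = field(default_factory=list)


def execute(ki: KnowledgeInfrastructure, state: State,
            max_retries: int = 1) -> State:
    """Run stages in order; on a failed check, apply a remedy and retry."""
    for stage in ki.protocol:
        for attempt in range(max_retries + 1):
            for op_name in stage.operators:
                state = ki.operators[op_name].run(state)
            if stage.check(state):
                break  # stage passed, move on to the next one
            rule = next((r for r in ki.recovery if r.stage == stage.name), None)
            if rule is None or attempt == max_retries:
                raise RuntimeError(f"stage {stage.name!r} failed its check")
            state = rule.remedy(state)
    return state
```

The point of separating the three pieces is that an agent can call operators without knowing their internals, follow the protocol without inventing an order of steps, and recover from a failed check without diagnosing it from scratch.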

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Knowledge Infrastructure (KI) as an agent-actionable scaffold externalizing expertise from process-based Earth science models into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. In a 3,000-trial coupled-hydrology benchmark, KI-equipped agents achieve up to 84% success in producing physically plausible, verifiable end-to-end simulations versus below 40% without KI. A Knowledge Dissection Toolkit (KDT) autonomously generates KIs for 117 additional models (total 119 across 14 domains), with modelling decisions and failure remedies converging despite differing physics; this is taken to show operational expertise is structured and extractable rather than ad hoc. KI is positioned to lower access barriers for non-specialists and integration barriers between modelling communities, enabling process-based science as a living scientific commons.

Significance. If the empirical results and generalization hold after addressing verification gaps, this could meaningfully advance AI-assisted scientific simulation by making complex process-based models more accessible and extensible. The scale of the benchmark (3,000 trials) and autonomous construction of 119 KIs via KDT are practical strengths that could support broader adoption in Earth sciences, particularly for climate-risk communities. The framing of expertise as extractable infrastructure aligns with open-science goals and offers a concrete mechanism for agentic workflows.

major comments (2)
  1. [Abstract and benchmark/results section] The central performance claim (up to 84% physically plausible simulations with KI vs. <40% without) is load-bearing for the efficacy argument, yet no details are provided on the specific criteria or automated/manual procedures used to verify physical plausibility, the construction and randomization of the 3,000 trials, presence of error bars or statistical tests, or trial-level controls. This leaves the quantitative support for the reported gap thin and difficult to evaluate.
  2. [KDT description and cross-domain generalization section] The claim that convergence of modelling decisions and failure remedies across 119 KIs from 14 domains demonstrates that 'operational expertise is structured and extractable rather than ad hoc' relies on KDT-produced artifacts. However, the manuscript reports no control condition (e.g., KDT applied to unstructured/non-scientific sources or side-by-side comparison with human-expert KIs constructed without the KDT template). This makes it impossible to distinguish discovery of intrinsic structure from imposition by the uniform extraction template defined in KDT.
minor comments (2)
  1. [Terminology and notation] Ensure all acronyms (KI, KDT) receive explicit first-use definitions and consistent expansion in the main text; a short table or glossary would aid readers from non-AI Earth-science backgrounds.
  2. [References] The manuscript would benefit from additional references to prior agentic AI frameworks for scientific workflows and knowledge-extraction methods to better contextualize the novelty of the scaffolding approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript on Knowledge Infrastructure (KI). The comments identify key areas where additional methodological transparency would strengthen the presentation of results. We respond to each major comment below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract and benchmark/results section] The central performance claim (up to 84% physically plausible simulations with KI vs. <40% without) is load-bearing for the efficacy argument, yet no details are provided on the specific criteria or automated/manual procedures used to verify physical plausibility, the construction and randomization of the 3,000 trials, presence of error bars or statistical tests, or trial-level controls. This leaves the quantitative support for the reported gap thin and difficult to evaluate.

    Authors: We agree that expanded methodological detail is required for readers to fully assess the benchmark. In the revised manuscript we will add a dedicated subsection describing: (i) the automated verification criteria (mass balance, boundary consistency, output range checks against domain protocols) together with the protocol for manual expert review on a 10% random sample of trials; (ii) the trial-generation procedure, including the randomization scheme for initial states, meteorological forcings, and parameter perturbations constrained to physically plausible intervals; (iii) statistical reporting with per-condition success rates, standard-error bars across repeated runs, and paired significance tests; and (iv) stratified results by simulation complexity to serve as trial-level controls. These additions will appear in the main text with supporting code and example verification logs placed in the supplementary material.

    revision: yes

  2. Referee: [KDT description and cross-domain generalization section] The claim that convergence of modelling decisions and failure remedies across 119 KIs from 14 domains demonstrates that 'operational expertise is structured and extractable rather than ad hoc' relies on KDT-produced artifacts. However, the manuscript reports no control condition (e.g., KDT applied to unstructured/non-scientific sources or side-by-side comparison with human-expert KIs constructed without the KDT template). This makes it impossible to distinguish discovery of intrinsic structure from imposition by the uniform extraction template defined in KDT.

    Authors: The referee correctly notes that the current text presents convergence as an observational result without explicit controls for template bias. We will revise the cross-domain section to include a brief control experiment: the KDT will be run on a corpus of non-scientific technical documents to test whether the same structured operator/protocol/recovery format emerges in the absence of domain content. We will also report quantitative agreement metrics between KDT-generated KIs and a small set of independently authored human-expert KIs for three representative models. These additions will allow readers to evaluate the extent to which observed convergence reflects intrinsic structure versus template influence. A full 119-model human comparison remains outside the scope of the present study but is noted as valuable future work.

    revision: partial
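The simulated rebuttal names mass balance and output range checks among the automated verification criteria but does not define them. A minimal sketch of what such checks might look like for a hydrological run is given below, assuming a simple water balance of precipitation minus evapotranspiration, runoff, and storage change. The function names, the 1% relative tolerance, and the choice of balance terms are all illustrative assumptions, not the authors' procedure.

```python
from typing import Sequence


def water_balance_residual(precip: Sequence[float], evap: Sequence[float],
                           runoff: Sequence[float],
                           storage_start: float, storage_end: float) -> float:
    """Residual of P - ET - Q - dS over the run, all in the same units (e.g. mm)."""
    storage_change = storage_end - storage_start
    return sum(precip) - sum(evap) - sum(runoff) - storage_change


def mass_balance_ok(precip: Sequence[float], evap: Sequence[float],
                    runoff: Sequence[float],
                    storage_start: float, storage_end: float,
                    rel_tol: float = 0.01) -> bool:
    """Pass if the residual is small relative to total precipitation.

    rel_tol is an assumed tolerance; a real protocol would set it per model.
    """
    residual = water_balance_residual(precip, evap, runoff,
                                      storage_start, storage_end)
    scale = max(sum(precip), 1e-9)  # avoid dividing the budget by zero input
    return abs(residual) <= rel_tol * scale


def within_range(values: Sequence[float], lower: float, upper: float) -> bool:
    """Pass if every output value lies inside a physically plausible interval."""
    return all(lower <= v <= upper for v in values)
```

Checks of this kind are cheap to run on every trial, which is what would make them usable as an automated gate, with manual review reserved for a sample as the rebuttal proposes.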

Circularity Check

1 step flagged

Convergence claim partially reduces to KDT template by construction

specific steps
  1. self-definitional [Abstract]
    "We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc."

    KI is defined as the scaffold consisting of 'validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms' produced by KDT. The observed convergence in modelling decisions and failure remedies is then used to conclude that expertise is intrinsically structured; because KDT applies a uniform extraction template, the convergence is enforced by the toolkit rather than independently confirming pre-existing non-ad-hoc structure.

full rationale

The 84% vs <40% benchmark results are presented as direct empirical outcomes from agent trials and do not reduce to fitted parameters or self-referential definitions. The load-bearing generalization—that convergence across 119 KIs demonstrates 'structured and extractable' expertise rather than ad hoc knowledge—relies on KIs produced by the KDT, which standardizes the very operators, protocols, and recovery mechanisms whose convergence is then interpreted as evidence of intrinsic structure. This creates moderate circularity in the interpretive step but leaves the core performance metrics independently falsifiable.
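The rationale above turns on whether convergence reflects the models or the extraction template, and the rebuttal mentions agreement metrics without naming one. One simple candidate is Jaccard similarity between the sets of decision or remedy labels attached to each KI, sketched below. The choice of Jaccard, the representation of a KI as a set of string labels, and the function names are assumptions for illustration only.

```python
from itertools import combinations
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(label_sets: Dict[str, Set[str]]) -> float:
    """Average Jaccard similarity over all unordered pairs of KIs."""
    pairs = list(combinations(label_sets.values(), 2))
    if not pairs:
        raise ValueError("need at least two label sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, the comparison of interest would be between the mean similarity among KDT-generated KIs and their similarity to independently authored expert KIs for the same models. A large gap between the two would point toward template influence, while comparable values would be more consistent with structure in the models themselves.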

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that decades of scientific understanding can be externalized into discrete, agent-actionable components, and that cross-model convergence is evidence of general structure rather than coincidence or selection effects.

axioms (1)
  • domain assumption: Process-based simulation models encode decades of scientific understanding across the Earth sciences
    Opening premise of the abstract that underpins the value of externalizing expertise.
invented entities (2)
  • Knowledge Infrastructure (KI) · no independent evidence
    purpose: Agent-actionable scaffold externalizing expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms
    Core new construct introduced to enable end-to-end agent execution.
  • Knowledge Dissection Toolkit (KDT) · no independent evidence
    purpose: Autonomously produces KI for additional process-based models
    Tool introduced to demonstrate generalizability across 117 models.



