KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science
Pith reviewed 2026-05-20 10:42 UTC · model grok-4.3
The pith
Knowledge infrastructure (KI) lets AI agents run process-based Earth science simulations with up to 84% success in a 3,000-trial hydrology benchmark, versus under 40% without it, and a toolkit extracts KI for 117 further models, giving 119 KIs across 14 domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines, with modelling decisions and failure remedies converging across 119 KIs from 14 domains.
Load-bearing premise
That the convergence of modelling decisions and failure remedies across models with different underlying physics demonstrates that operational expertise is structured and extractable in a general, non-ad-hoc form that can be autonomously dissected by the KDT toolkit.
Original abstract
Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Knowledge Infrastructure (KI) as an agent-actionable scaffold externalizing expertise from process-based Earth science models into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. In a 3,000-trial coupled-hydrology benchmark, KI-equipped agents achieve up to 84% success in producing physically plausible, verifiable end-to-end simulations versus below 40% without KI. A Knowledge Dissection Toolkit (KDT) autonomously generates KIs for 117 additional models (total 119 across 14 domains), with modelling decisions and failure remedies converging despite differing physics; this is taken to show operational expertise is structured and extractable rather than ad hoc. KI is positioned to lower access barriers for non-specialists and integration barriers between modelling communities, enabling process-based science as a living scientific commons.
Significance. If the empirical results and generalization hold after addressing verification gaps, this could meaningfully advance AI-assisted scientific simulation by making complex process-based models more accessible and extensible. The scale of the benchmark (3,000 trials) and autonomous construction of 119 KIs via KDT are practical strengths that could support broader adoption in Earth sciences, particularly for climate-risk communities. The framing of expertise as extractable infrastructure aligns with open-science goals and offers a concrete mechanism for agentic workflows.
major comments (2)
- [Abstract and benchmark/results section] The central performance claim (up to 84% physically plausible simulations with KI vs. <40% without) is load-bearing for the efficacy argument, yet no details are provided on the specific criteria or automated/manual procedures used to verify physical plausibility, the construction and randomization of the 3,000 trials, presence of error bars or statistical tests, or trial-level controls. This leaves the quantitative support for the reported gap thin and difficult to evaluate.
- [KDT description and cross-domain generalization section] The claim that convergence of modelling decisions and failure remedies across 119 KIs from 14 domains demonstrates that 'operational expertise is structured and extractable rather than ad hoc' relies on KDT-produced artifacts. However, the manuscript reports no control condition (e.g., KDT applied to unstructured/non-scientific sources or side-by-side comparison with human-expert KIs constructed without the KDT template). This makes it impossible to distinguish discovery of intrinsic structure from imposition by the uniform extraction template defined in KDT.
minor comments (2)
- [Terminology and notation] Ensure all acronyms (KI, KDT) receive explicit first-use definitions and consistent expansion in the main text; a short table or glossary would aid readers from non-AI Earth-science backgrounds.
- [References] The manuscript would benefit from additional references to prior agentic AI frameworks for scientific workflows and knowledge-extraction methods to better contextualize the novelty of the scaffolding approach.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript on Knowledge Infrastructure (KI). The comments identify key areas where additional methodological transparency would strengthen the presentation of results. We respond to each major comment below and indicate the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract and benchmark/results section] The central performance claim (up to 84% physically plausible simulations with KI vs. <40% without) is load-bearing for the efficacy argument, yet no details are provided on the specific criteria or automated/manual procedures used to verify physical plausibility, the construction and randomization of the 3,000 trials, presence of error bars or statistical tests, or trial-level controls. This leaves the quantitative support for the reported gap thin and difficult to evaluate.
Authors: We agree that expanded methodological detail is required for readers to fully assess the benchmark. In the revised manuscript we will add a dedicated subsection describing: (i) the automated verification criteria (mass balance, boundary consistency, output range checks against domain protocols) together with the protocol for manual expert review on a 10% random sample of trials; (ii) the trial-generation procedure, including the randomization scheme for initial states, meteorological forcings, and parameter perturbations constrained to physically plausible intervals; (iii) statistical reporting with per-condition success rates, standard-error bars across repeated runs, and paired significance tests; and (iv) stratified results by simulation complexity to serve as trial-level controls. These additions will appear in the main text with supporting code and example verification logs placed in the supplementary material.
Revision: yes
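To make the promised automated verification criteria concrete, the following is a minimal sketch of what a mass-balance and range check could look like. It is not the authors' implementation: the `WaterBudget` fields, the function names, and the 1% tolerance are all assumptions made here for illustration, and the paper's actual criteria are not described on this page.

```python
from dataclasses import dataclass


@dataclass
class WaterBudget:
    """Hypothetical catchment totals over a simulation period, in millimetres."""
    precipitation: float
    evapotranspiration: float
    runoff: float
    storage_change: float


def mass_balance_residual(b: WaterBudget) -> float:
    """Residual of P - ET - Q - dS; zero for a perfectly closed budget."""
    return b.precipitation - b.evapotranspiration - b.runoff - b.storage_change


def is_plausible(b: WaterBudget, rel_tol: float = 0.01) -> bool:
    """Pass only if fluxes are non-negative and the budget closes within rel_tol of P.

    rel_tol is an assumed threshold, not a value taken from the paper.
    """
    # Range check: these three fluxes cannot be negative totals.
    if min(b.precipitation, b.evapotranspiration, b.runoff) < 0:
        return False
    # Mass-balance check, scaled by precipitation (guarding against P == 0).
    scale = max(b.precipitation, 1e-9)
    return abs(mass_balance_residual(b)) <= rel_tol * scale


closed = WaterBudget(1000.0, 600.0, 380.0, 20.0)   # budget closes exactly
leaky = WaterBudget(1000.0, 600.0, 300.0, 20.0)    # 80 mm unaccounted for
print(is_plausible(closed), is_plausible(leaky))
```

A check of this kind only shows that a run is internally consistent; it does not show that the simulated fluxes match observations, which is why the rebuttal pairs it with manual expert review.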
- Referee: [KDT description and cross-domain generalization section] The claim that convergence of modelling decisions and failure remedies across 119 KIs from 14 domains demonstrates that 'operational expertise is structured and extractable rather than ad hoc' relies on KDT-produced artifacts. However, the manuscript reports no control condition (e.g., KDT applied to unstructured/non-scientific sources or side-by-side comparison with human-expert KIs constructed without the KDT template). This makes it impossible to distinguish discovery of intrinsic structure from imposition by the uniform extraction template defined in KDT.
Authors: The referee correctly notes that the current text presents convergence as an observational result without explicit controls for template bias. We will revise the cross-domain section to include a brief control experiment: the KDT will be run on a corpus of non-scientific technical documents to test whether the same structured operator/protocol/recovery format emerges in the absence of domain content. We will also report quantitative agreement metrics between KDT-generated KIs and a small set of independently authored human-expert KIs for three representative models. These additions will allow readers to evaluate the extent to which observed convergence reflects intrinsic structure versus template influence. A full 119-model human comparison remains outside the scope of the present study but is noted as valuable future work.
Revision: partial
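The rebuttal does not say which agreement metric would be used. One simple candidate is Jaccard similarity over the sets of modelling decisions listed in each KI; the sketch below assumes decisions can be reduced to comparable string labels, and every label in it is invented for illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical decision labels for one model; not taken from the paper.
kdt_decisions = {"calibrate_soil_depth", "spin_up_one_year", "check_routing_units"}
expert_decisions = {"calibrate_soil_depth", "spin_up_one_year", "mask_lakes"}

# Two shared labels out of four distinct labels overall.
print(jaccard(kdt_decisions, expert_decisions))
```

The same function applied to KDT output from the non-scientific control corpus would give a baseline: if overlap there is comparable to overlap across Earth-science models, the template rather than the domain is the likelier source of convergence.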
Circularity Check
Convergence claim partially reduces to KDT template by construction
specific steps
- self-definitional [Abstract]
"We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc."
KI is defined as the scaffold consisting of 'validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms' produced by KDT. The observed convergence in modelling decisions and failure remedies is then used to conclude that expertise is intrinsically structured; because KDT applies a uniform extraction template, the convergence is enforced by the toolkit rather than independently confirming pre-existing non-ad-hoc structure.
full rationale
The 84% vs <40% benchmark results are presented as direct empirical outcomes from agent trials and do not reduce to fitted parameters or self-referential definitions. The load-bearing generalization—that convergence across 119 KIs demonstrates 'structured and extractable' expertise rather than ad hoc knowledge—relies on KIs produced by the KDT, which standardizes the very operators, protocols, and recovery mechanisms whose convergence is then interpreted as evidence of intrinsic structure. This creates moderate circularity in the interpretive step but leaves the core performance metrics independently falsifiable.
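As an illustration of how the performance gap could be tested once per-condition counts are published, here is a sketch of a standard error and a two-proportion z-test using only the standard library. The counts used (84 and 40 successes out of 100 trials each) are placeholders chosen to mirror the headline percentages; the page reports only the 3,000-trial total, "up to 84%" and "below 40%", not how trials were split across conditions.

```python
import math


def standard_error(successes: int, n: int) -> float:
    """Standard error of a binomial proportion."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts, NOT the paper's data.
z, p = two_proportion_z(84, 100, 40, 100)
print(f"SE with KI: {standard_error(84, 100):.3f}, z = {z:.2f}, p = {p:.2e}")
```

This test treats trials as independent. If the benchmark pairs the same scenario across conditions, as the rebuttal's mention of paired significance tests suggests, a paired test such as McNemar's would be the more appropriate choice.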
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Process-based simulation models encode decades of scientific understanding across the Earth sciences
invented entities (2)
- Knowledge Infrastructure (KI): no independent evidence
- Knowledge Dissection Toolkit (KDT): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.