arxiv: 2604.06567 · v2 · pith:KKDDJM6Hnew · submitted 2026-04-08 · ⚛️ physics.ao-ph

A PMP-inspired Evaluation Framework for Assessing Deep-Learning Earth System Models

Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3

classification ⚛️ physics.ao-ph
keywords deep learning · earth system models · climate model evaluation · climatology · climate variability · precipitation · monsoon · model assessment

The pith

A framework using standardized climate diagnostics evaluates deep-learning Earth system models for their ability to reproduce key climate features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an evaluation framework for deep-learning Earth system models by drawing on diagnostics commonly applied to traditional physics-based models. The framework measures how well the models reproduce observed climatology, major variability patterns, monsoon behavior, and precipitation variability, using reference observations and conventional model benchmarks for comparison. It shifts the focus from short-term forecast accuracy toward suitability for longer climate studies. A sympathetic reader would care because it offers a consistent way to gauge whether these computationally efficient models can support Earth system applications and to spot where further development is needed.

Core claim

The paper claims that adapting a collection of established climate diagnostics permits deep-learning Earth system models to be tested for reproduction of climatology, major modes of variability, monsoon systems, and precipitation behavior relative to observations and conventional model benchmarks, revealing strengths in large-scale fields alongside persistent difficulties with precipitation, tropical variability, and long-run stability in some cases.

What carries the argument

The set of standardized diagnostics that quantify a model's skill at reproducing climatology, variability modes, monsoon behavior, and precipitation patterns against observational references and benchmark simulations.
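As a concrete illustration of what one such diagnostic computes, the following is a minimal sketch of an area-weighted spatial RMSE between a model field and a reference field. It is not the PMP implementation: the function name `area_weighted_rmse`, the cos(latitude) weighting on a regular latitude-longitude grid, and the synthetic fields are all assumptions made for this example.

```python
import numpy as np

def area_weighted_rmse(model, reference, lat_deg):
    """Spatial RMSE of a (lat, lon) field, weighted by cos(latitude).

    Hypothetical helper for illustration; assumes a regular lat-lon grid.
    """
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # cos(lat) approximates the relative grid-cell area on a regular grid.
    weights = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    weights_2d = np.broadcast_to(weights[:, None], model.shape)
    squared_error = (model - reference) ** 2
    return float(np.sqrt(np.sum(weights_2d * squared_error) / np.sum(weights_2d)))

# Synthetic example: a reference field and a model with a uniform 0.5 bias.
lat = np.linspace(-87.5, 87.5, 36)
lon = np.linspace(0.0, 357.5, 144)
reference_field = 3.0 + 2.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones((1, lon.size))
model_field = reference_field + 0.5
rmse = area_weighted_rmse(model_field, reference_field, lat)
```

With a spatially uniform bias the weighted RMSE equals the bias itself, which makes this a convenient sanity check before applying the function to real fields.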

If this is right

  • Direct comparison becomes possible between deep-learning models and traditional climate models using the same metrics.
  • Strengths appear in several large-scale climate fields and modes of variability.
  • Challenges are identified in precipitation simulation, tropical variability, and long-run stability for certain model versions.
  • The framework supports guiding future development of deep-learning models toward climate-relevant uses.
  • It serves as a step toward establishing trust in these models for Earth system science tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostics could be applied to hybrid models that combine deep-learning components with physics-based ones to check consistency across approaches.
  • If deep-learning models exhibit systematically different error patterns, the framework might need supplementary tests focused on stability over multi-year runs.
  • Widespread adoption could allow these models to contribute to ensemble climate projections alongside established simulations.

Load-bearing premise

Diagnostics originally created for traditional physics-based models remain suitable even when applied to deep-learning models that may have different error structures and stability properties.

What would settle it

A deep-learning model that scores well on the framework's metrics but produces unstable or unrealistic behavior in extended free-running simulations would indicate the diagnostics are insufficient.
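One simple way to probe for such behavior is to fit a linear trend to a global-mean time series from a free-running simulation. The sketch below is only an illustration under stated assumptions: the function name `linear_drift_per_year`, the synthetic series, and the idea of using a plain least-squares slope as a drift indicator are not taken from the paper.

```python
import numpy as np

def linear_drift_per_year(annual_means):
    """Least-squares slope of an annual-mean series, in units per year.

    Hypothetical drift indicator; the paper does not prescribe this test.
    """
    values = np.asarray(annual_means, dtype=float)
    years = np.arange(values.size)
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, _intercept = np.polyfit(years, values, 1)
    return float(slope)

# Synthetic 50-year series (illustrative values, in kelvin).
years = np.arange(50)
stable_series = 287.0 + 0.1 * np.sin(2.0 * np.pi * years / 5.0)
drifting_series = 287.0 + 0.05 * years

stable_drift = linear_drift_per_year(stable_series)
drifting_drift = linear_drift_per_year(drifting_series)
```

A model could pass climatology-style metrics computed over a fixed epoch while still showing a non-zero slope here, which is the kind of mismatch the paragraph above describes.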

Figures

Figures reproduced from arXiv: 2604.06567 by Céline Bonfils, Giuliana Pallotta, Jiwoo Lee, Paul Ullrich, Seth Goodnight, Shiheng Duan.

Figure 1
Figure 1. Performance of annual mean precipitation simulated in (a) ACE2 and (b) NeuralGCM-evap, compared to the reference observational dataset of monthly precipitation GPCP 3.2 (Adler et al., 2018) as in …
Figure 2
Figure 2. Portrait plot for normalized spatial Root Mean Square Error (RMSE) across different seasons. Negative normalized error indicates performance better than the multi-model median (as in Ullrich et al., 2025, and Lee et al., 2024). The climatology metric is computed with respect to the observed climatological fields provided by the reference dataset product corresponding to each examined variable as repor…
Figure 3
Figure 3. Parallel coordinate plot for spatiotemporal RMSE from PMP mean climate metrics. Each vertical axis represents a different variable. The distributions of RMSE from CMIP6 models are visualized as violin plots shaded in gray. The colored markers represent the DL-ESMs: NeuralGCM, ACE2, and NeuralGCM-evap. The time epoch used for this analysis is 1981–2013. The middle of each vertical axis is aligned with the me…
Figure 4
Figure 4. Portrait plot of the amplitude of extra-tropical modes of variability simulated by CMIP6 models and DL-ESMs in their historical or equivalent simulations. The amplitude ratio metric is the ratio of spatiotemporal standard deviations of the model versus the observed principal components (PCs), obtained using the CBF method in the PMP (Lee et al., 2024). Rows correspond to mode and season, and columns to mod…
Figure 5
Figure 5. Parallel coordinate plot for spatiotemporal RMSE for CMIP6 model ensembles (gray shaded) with individual models plotted as colored markers. The metrics for the two DL-ESMs are shown as solid lines (ACE2, blue and NeuralGCM, red In …
Figure 6
Figure 6. NAO skill comparison in DJF period: (a) ACE2 simulation as compared to the observations in the NAO domain (top row), ability to reproduce the associated teleconnections (middle row) and time series comparison, with a representation of the standard deviations from the model and the observations (bottom row). Given that the analyzed DL-ESMs are forced with SST, their ability to reproduce important modes of v…
Figure 7
Figure 7. MJO propagation metrics (wavenumber–frequency power spectra) from (a) observations (GPCP v1.3; Huffman et al., 2001), (b) ACE2, (c) NeuralGCM-evap and (d) NeuralGCM-precip. The EWR is defined as the ratio of eastward power (the average power in the dashed box on the right) to westward power (the average power in the dashed box on the left) from the 2-dimensional wavenumber–frequency power …
Figure 8
Figure 8. MJO east–west power ratio (EWR; unitless) from CMIP6 models (orange) compared to the DL-ESMs (red) for boreal winter. The EWR corresponding to the observational dataset is shown in gray (GPCP v1.3; Huffman et al., 2001), and a horizontal line is added to facilitate the comparison with the reference observation (i.e., GPCP v1.3; black). The ensemble members and number are the same as in …
Figure 9
Figure 9. Precipitation pentads compared between model and observations. The monsoon metrics are computed against observational datasets (GPCP v1.3 and CMORPH v1.0; Joyce et al. (2004); Xie et al. (2017)) and Historical simulations are performed with (a) ACE2 and (b) NeuralGCM-precip. Results are shown for six monsoon regions: all-India rainfall (AIR), northern Australia (AUS), Sahel, Gulf of Guinea (GoG), North Ame…
Figure 10
Figure 10. Annual precipitation range (shading, in mm/day) in six different monsoon domains in GPCP 3.2 (top, 1998–2017) and the four analyzed DL-ESMs. Blue stars show regions where the model agrees with observations, red dots represent locations where the monsoon definition threshold (the default threshold for the threat score is 2.5 mm/day) is met in the observations but missed in the model, green triangles are lo…
Figure 11
Figure 11. Power spectrum of precipitation variability associated with different timescales: total variability in left panels and variability anomaly on the right panels, obtained by removing the long-term mean and seasonal cycle for the (a) Global domain and (b) Tropical domain.
Figure 12
Figure 12. Normalized annual precipitation variability. The PMP metric value that is a ratio of spatial standard deviation (i.e., model/observation) is in the top right corner of each panel. 4.6 Taylor Diagrams We here show Taylor diagrams (Taylor, 2001) to simultaneously depict standard deviation, correlation, and RMSE for key variables across different seasons and domains, as usually combined to PMP metrics (Lee …
Figure 13
Figure 13. Taylor diagrams summarizing the performance of DL-ESM models compared to CMIP6 models for globally averaged climate variable in Spring: (a) Air temperature at 850 hPa; (b) precipitation; (c) Zonal wind at 200 hPa. 5 Discussion and Conclusions This study examines several leading DL-ESMs, including ACE2, NeuralGCM, NeuralGCM-evap and NeuralGCM-precip through the lens of the PMP metrics. These metrics cover …
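Two of the quantities named in the captions, the normalized error relative to the multi-model median (Figure 2) and the MJO east-west power ratio (Figures 7 and 8), can be sketched in a few lines. This is an illustrative sketch, not the PMP code. The function names, the (E - median) / median form of the normalization, the sign convention that positive wavenumbers mean eastward propagation, and the default box limits (wavenumbers 1 to 3, periods of 30 to 60 days) are assumptions chosen for the example; the captions only state that the boxes are drawn on the spectrum.

```python
import numpy as np

def normalize_by_median(rmse_values):
    """Error relative to the multi-model median; negative means better than median.

    Assumed form: (E - median) / median.
    """
    rmse_values = np.asarray(rmse_values, dtype=float)
    median = np.median(rmse_values)
    return (rmse_values - median) / median

def east_west_power_ratio(power, wavenumber, frequency,
                          k_range=(1, 3), period_range_days=(30.0, 60.0)):
    """Ratio of mean eastward to mean westward power inside two mirrored boxes.

    `power` has shape (frequency, wavenumber). Positive wavenumbers are taken
    as eastward here (an assumption), and the box limits are placeholders.
    """
    power = np.asarray(power, dtype=float)
    wavenumber = np.asarray(wavenumber)
    frequency = np.asarray(frequency, dtype=float)
    f_low = 1.0 / period_range_days[1]
    f_high = 1.0 / period_range_days[0]
    freq_mask = (frequency >= f_low) & (frequency <= f_high)
    east_mask = (wavenumber >= k_range[0]) & (wavenumber <= k_range[1])
    west_mask = (wavenumber <= -k_range[0]) & (wavenumber >= -k_range[1])
    east_power = power[np.ix_(freq_mask, east_mask)].mean()
    west_power = power[np.ix_(freq_mask, west_mask)].mean()
    return float(east_power / west_power)

# Synthetic inputs: three RMSE values and a spectrum with 3x more eastward power.
normalized = normalize_by_median([1.0, 2.0, 4.0])
wavenumber = np.arange(-5, 6)
frequency = np.linspace(0.0, 0.1, 21)  # cycles per day
power = np.ones((frequency.size, wavenumber.size))
power[:, wavenumber > 0] = 3.0
ewr = east_west_power_ratio(power, wavenumber, frequency)
```

An EWR above 1 indicates that eastward-propagating power dominates in the chosen band, which is the property the MJO figures compare across models and observations.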
Original abstract

In recent years, Deep-Learning Earth System Models (DL-ESMs) have emerged as promising, computationally efficient complements to traditional Earth system models. Here, we present an evaluation framework for testing DL-ESMs from a climate-model-development perspective using standardized diagnostics from the PCMDI Metrics Package (PMP). This framework allows DL-ESMs, including Ai2's ACE2 and Google's NeuralGCM, to be assessed with metrics that quantify their ability to reproduce climatology, major modes of variability, monsoon behavior, and precipitation variability relative to observational reference datasets and CMIP-class benchmarks. By evaluating DL-ESMs with tools commonly used for traditional models, we extend their assessment beyond short-range forecast skill and toward climate-relevant applications. The results identify encouraging strengths in several large-scale fields and modes of variability, while also highlighting persistent challenges in precipitation, tropical variability, and long-run stability for some model versions. This evaluation is a critical step toward building trust in DL-ESMs, guiding future model development, and clarifying their fit-for-purpose for Earth system science applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a PMP-inspired evaluation framework for deep-learning Earth system models (DL-ESMs), applying standardized PCMDI Metrics Package diagnostics to assess climatology, major modes of variability (e.g., ENSO, MJO), monsoon behavior, and precipitation variability. It evaluates models including Ai2's ACE2 and Google's NeuralGCM against observational references and CMIP benchmarks, reporting strengths in large-scale fields alongside challenges in precipitation, tropical variability, and long-run stability for certain versions. The central claim is that these established diagnostics extend DL-ESM assessment beyond short-range forecasts to climate-relevant applications.

Significance. If the framework holds, it offers a practical bridge between DL model development and traditional climate-model evaluation practices, enabling consistent benchmarking that could guide iterative improvements and clarify suitability for Earth system applications. The use of external observational datasets and CMIP benchmarks avoids circularity and provides reproducible, community-standard metrics.

major comments (2)
  1. [§4] §4 (Results on long-run stability): The abstract and results note persistent challenges in long-run stability for some DL-ESM versions, yet the manuscript does not demonstrate that the selected PMP diagnostics (e.g., ENSO pattern correlation or monsoon onset metrics) actually detect or quantify accumulating non-physical drift versus merely documenting it post hoc. A concrete test—such as comparing PMP scores on short vs. multi-year integrations for a drifting model version—would strengthen the claim that PMP is sufficient for DL-specific failure modes.
  2. [§3.2] §3.2 (Metric selection and data): The choice of PMP diagnostics is justified by their use in traditional ESMs, but the paper should explicitly address whether DL-ESM error structures (e.g., spectral artifacts or mode collapse) require supplementary diagnostics beyond the current set; without this, the sufficiency claim for climate-relevant assessment rests on an untested assumption of comparable error structures.
minor comments (2)
  1. [Figure 2] Figure 2 caption: Clarify the exact observational reference dataset and CMIP ensemble version used for each panel to improve reproducibility.
  2. [§2.1] §2.1: The description of NeuralGCM and ACE2 model versions lacks explicit citation to the original model papers or release versions; add these for traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to better demonstrate the value of the PMP-based framework for DL-ESMs. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§4] §4 (Results on long-run stability): The abstract and results note persistent challenges in long-run stability for some DL-ESM versions, yet the manuscript does not demonstrate that the selected PMP diagnostics (e.g., ENSO pattern correlation or monsoon onset metrics) actually detect or quantify accumulating non-physical drift versus merely documenting it post hoc. A concrete test—such as comparing PMP scores on short vs. multi-year integrations for a drifting model version—would strengthen the claim that PMP is sufficient for DL-specific failure modes.

    Authors: We agree that an explicit demonstration of the diagnostics' sensitivity to drift would strengthen the paper. The current results apply PMP metrics to long integrations and report stability issues for certain versions, but we did not include a controlled short-versus-long comparison. In the revised manuscript we will add such an analysis for one drifting model version, comparing PMP scores (including ENSO and monsoon metrics) between short-range and multi-year runs to show how the diagnostics capture accumulating non-physical drift. revision: yes

  2. Referee: [§3.2] §3.2 (Metric selection and data): The choice of PMP diagnostics is justified by their use in traditional ESMs, but the paper should explicitly address whether DL-ESM error structures (e.g., spectral artifacts or mode collapse) require supplementary diagnostics beyond the current set; without this, the sufficiency claim for climate-relevant assessment rests on an untested assumption of comparable error structures.

    Authors: We acknowledge that DL-ESMs can exhibit error structures (such as spectral artifacts) that differ from those of traditional ESMs. The PMP diagnostics were selected for their established relevance to climate processes rather than an assumption of identical error statistics. Our results already show that these metrics successfully flag important deficiencies in precipitation and tropical variability. In the revision we will expand the discussion in §3.2 to explicitly consider potential DL-specific errors, explain the rationale for the current metric set as a standardized bridge to the climate-modeling community, and note that supplementary diagnostics may be added in future extensions of the framework. revision: partial
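The short-versus-long comparison discussed in the first response can be illustrated by scoring successive windows of one long run against a fixed reference and checking whether the score degrades. The sketch below uses synthetic data and makes several assumptions: the function name `windowed_rmse`, the unweighted spatial mean (a real analysis would use area weights), the 10-year window, and the imposed uniform drift of 0.05 per year are all hypothetical and are not taken from the paper or from the authors' planned revision.

```python
import numpy as np

def windowed_rmse(model, reference_clim, window):
    """RMSE of each non-overlapping window mean against a fixed reference.

    `model` has shape (time, space). Unweighted spatial mean for brevity.
    """
    model = np.asarray(model, dtype=float)
    n_windows = model.shape[0] // window
    scores = []
    for i in range(n_windows):
        window_mean = model[i * window:(i + 1) * window].mean(axis=0)
        scores.append(float(np.sqrt(np.mean((window_mean - reference_clim) ** 2))))
    return np.array(scores)

# Synthetic 40-year runs on 500 grid points (all values are illustrative).
rng = np.random.default_rng(0)
n_years, n_points = 40, 500
reference_clim = rng.normal(15.0, 5.0, n_points)
noise = rng.normal(0.0, 0.3, (n_years, n_points))
drift = 0.05 * np.arange(n_years)[:, None]  # hypothetical uniform drift per year

stable_run = reference_clim + noise
drifting_run = reference_clim + noise + drift

stable_scores = windowed_rmse(stable_run, reference_clim, window=10)
drifting_scores = windowed_rmse(drifting_run, reference_clim, window=10)
```

In this toy setup the stable run keeps a roughly constant score across windows, while the drifting run's score grows from one window to the next, which is the signature a short-versus-long comparison would look for.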

Circularity Check

0 steps flagged

No significant circularity: framework applies external PMP metrics to DL-ESMs

Full rationale

The paper introduces an evaluation framework that applies the pre-existing PCMDI Metrics Package (PMP) diagnostics—originally developed for physics-based ESMs—to DL-ESMs such as ACE2 and NeuralGCM. All metrics (climatology, modes of variability, monsoon behavior, precipitation) are computed against independent observational reference datasets and CMIP-class benchmarks, with no parameters fitted inside the paper and no predictions derived from internal fits. The central claim is an application of external standardized tools rather than a derivation that reduces to self-definition, fitted inputs, or self-citation chains. Because the assessment is benchmarked entirely outside the paper's own fitted values or assumptions, the evaluation chain remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that PMP metrics transfer directly to DL models; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption PMP diagnostics designed for traditional ESMs are suitable for assessing DL-ESMs
    The entire evaluation framework depends on this transferability assumption stated in the abstract.

pith-pipeline@v0.9.0 · 5727 in / 1148 out tokens · 36864 ms · 2026-05-21T10:27:46.409005+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    This framework allows DL-ESMs... to be assessed with metrics that quantify their ability to reproduce climatology, major modes of variability, monsoon behavior, and precipitation variability relative to observational reference datasets and CMIP-class benchmarks.

  • IndisputableMonolith/Foundation/AlexanderDuality alexander_duality_circle_linking unclear

    Relation between the paper passage and the cited Recognition theorem.

    The PMP framework computes hundreds of summary statistics including mean state metrics, variability metrics, and process-oriented diagnostics.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

  1. Adler, R. F., Sapiano, M. R., Huffman, G. J., Wang, J. J., Gu, G., Bolvin, D., Chiu, L., Schneider, U., Becker, A., Nelkin, E., Xie, P., Ferraro, R., and Shin, D.-B.: The Global Precipitation Climatology Project (GPCP) monthly analysis (new version 2.3) and a review of 2017 global precipitation, Atmosphere, 9, 138, https://doi.org/10.3390/atmos9040138,

  2. Ahn, M.-S., Kim, D. H., Sperber, K. R., Kang, I.-S., Maloney, E. D., Waliser, D. E., and Hendon, H. H.: MJO simulation in CMIP5 climate models: MJO skill metrics and process-oriented diagnosis, Climate Dynamics, 49, 4023–4045, https://doi.org/10.1007/s00382-017-3558-4,

  3. Ahn, M.-S., Gleckler, P. J., Lee, J., Pendergrass, A. G., and Jakob, C.: Benchmarking Simulated Precipitation Variability Amplitude across Time Scales, Journal of Climate, 35, 3173–3196, https://doi.org/10.1175/JCLI-D-21-0542.1,

  4. Ai2: ACE2-ERA5 (Revision a4ca6cc), https://doi.org/10.57967/hf/5377,

  5. Back, S.-Y., Kim, D., and Son, S.-W.: MJO Diversity in CMIP6 Models, Journal of Climate, 37, 4835–4850, https://doi.org/10.1175/JCLI-D-23-0656.1,

  6. Baxter, I., Pahlavan, H. A., Hassanzadeh, P., Rucker, K., and Shaw, T. A.: Benchmarking Atmospheric Circulation Variability in an AI Emulator, ACE2, and a Hybrid Model, NeuralGCM, Geophysical Research Letters, 53, e2025GL119877, https://doi.org/10.1029/2025GL119877,

  7. Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q.: Accurate medium-range global weather forecasting with 3D neural networks, Nature, 619, 533–538, https://doi.org/10.1038/s41586-023-06185-3,

  8. Bodnar, C., Bruinsma, W. P., Lucic, A., Stanley, M., Allen, A., Brandstetter, J., Garvan, P., Riechert, M., Weyn, J. A., Dong, H., Gupta, J. K., Thambiratnam, K., Archibald, A. T., Wu, C.-C., Heider, E., Welling, M., Turner, R. E., and Perdikaris, P.: A foundation model for the Earth system, Nature, 641, 1180–1187, https://doi.org/10.1038/s41586-025-09005-y,

  9. Brenowitz, N. D., Ge, T., Subramaniam, A., Manshausen, P., Gupta, A., Hall, D. M., Mardani, M., Vahdat, A., Kashinath, K., and Pritchard, M. S.: Climate in a Bottle: Towards a Generative Foundation Model for the Kilometer-Scale Global Atmosphere, arXiv preprint, https://arxiv.org/abs/2505.06474,

  10. Bretherton, C., Watt-Meyer, O., Henn, B., and Koldunov, N.: AIMIP Phase 1 Specification, https://github.com/ai2cm/AIMIP, version 1.2.3, accessed 19 February 2026,

  11. Camps-Valls, G., Fernández-Torres, M.-Á., Cohrs, K.-H., Höhl, A., Castelletti, A., Pacal, A., Robin, C., Martinuzzi, F., Papoutsis, I., Prapas, I., Pérez-Aracil, J., Weigel, K., Gonzalez-Calabuig, M., Reichstein, M., Rabel, M., Giuliani, M., Mahecha, M. D., Popescu, O.-I., Pellicer-Valero, O. J., Ouala, S., Salcedo-Sanz, S., Sippel, S., Kondylatos, S., H...

  12. Chien, M.-T., Barnes, E. A., and Maloney, E. D.: Modulation of tropical cyclogenesis on subseasonal-to-interannual timescales in the deep-learning climate emulator ACE2, Machine Learning: Earth, 1, 015008, https://doi.org/10.1088/3049-4753/adfd61,

  13. Clark, S. K., Watt-Meyer, O., Kwa, A., McGibbon, J., Henn, B., Perkins, W. A., Wu, E., Harris, L. M., and Bretherton, C. S.: ACE2-SOM: Coupling an ML Atmospheric Emulator to a Slab Ocean and Learning the Sensitivity of Climate to Changed CO2, Journal of Geophysical Research: Machine Learning and Computation, 2, e2024JH000575, https://doi.org/https://doi...

  14. Coats, S., Smerdon, J. E., Stevenson, S., Fasullo, J. T., Otto-Bliesner, B., and Ault, T. R.: Paleoclimate Constraints on the Spatiotemporal Character of Past and Future Droughts, Journal of Climate, 33, 9883–9903, https://doi.org/10.1175/JCLI-D-20-0004.1,

  15. Cresswell-Clay, N., Liu, B., Durran, D. R., Liu, Z., Espinosa, Z. I., Moreno, R. A., and Karlbauer, M.: A Deep Learning Earth System Model for Efficient Simulation of the Observed Climate, AGU Advances, 6, e2025AV001706, https://doi.org/10.1029/2025AV001706,

  16. Doutriaux, C., Nadeau, D., Bradshaw, T., Kettleborough, J., Weigel, T., Hogan, E., and Durack, P. J.: PCMDI/cmor: CMOR version 3.2.2, https://cmor.llnl.gov/, software release, March 2017,

  17. Duan, S., Zhang, J., Bonfils, C., and Pallotta, G.: Testing NeuralGCM’s capability to simulate future heatwaves based on the 2021 Pacific Northwest heatwave event, npj Climate and Atmospheric Science, 8, 251, https://doi.org/10.1038/s41612-025-01137-2,

  18. Eyring, V., Collins, W. D., Gentine, P., Barnes, E. A., Barreiro, M., Beucler, T., Bocquet, M., Bretherton, C. S., Christensen, H. M., Dagon, K., Gagne, D. J., Hall, D., Hammerling, D., Hoyer, S., Iglesias-Suarez, F., Lopez-Gomez, I., McGraw, M. C., Meehl, G. A., Molina, M. J., Monteleoni, C., Mueller, J., Pritchard, M. S., Rolnick, D., Runge, J., Stier,...

  19. Gleckler, P. J., Taylor, K. E., and Doutriaux, C.: Performance metrics for climate models, Journal of Geophysical Research: Atmospheres, 113, https://doi.org/10.1029/2007JD008972,

  20. Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., De Chiara, G., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes, M....

  21. Jiang, X., Maloney, E., and Su, H.: Large-scale controls of propagation of the Madden-Julian Oscillation, npj Climate and Atmospheric Science, 3, 29, https://doi.org/10.1038/s41612-020-00134-x,

  22. Kageyama, M., Braconnot, P., Chiessi, C. M., Rehfeld, K., Ait Brahim, Y., Dütsch, M., Gwinneth, B., Hou, A., Loutre, M.-F., Hendrizan, M., Meissner, K., Mongwe, P., Otto-Bliesner, B., Pezzi, L. P., Rovere, A., Seltzer, A., Sime, L., and Zhu, J.: Lessons from paleoclimates for recent and future climate change: opportunities and insights, Frontiers in Cl...

  23. Kent, C., Scaife, A. A., Dunstone, N. J., Smith, D., Hardiman, S. C., Dunstan, T., and Watt-Meyer, O.: Skilful global seasonal predictions from a machine learning weather model trained on reanalysis data, npj Climate and Atmospheric Science, 8, 314, https://doi.org/10.1038/s41612-025-01198-3,

  24. Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Klöwer, M., Lottes, J., Rasp, S., Düben, P., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., and Hoyer, S.: Neural general circulation models for weather and climate, Nature, 632, 1060–1066, https://doi.org/10.1038/s41586-024-07744-y,

  25. Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S., and Battaglia, P.: GraphCast: Learning skillful medium-range global weather forecasting, https://arxiv.org/abs/2212.12794,

  26. Lee, J., Sperber, K. R., Gleckler, P. J., Bonfils, C., and Taylor, K. E.: Quantifying the agreement between observed and simulated extratropical modes of interannual variability, Climate Dynamics, 52, 4057–4089, https://doi.org/10.1007/s00382-018-4355-4,

  27. Lee, J., Gleckler, P. J., Ahn, M.-S., Ordonez, A., Ullrich, P. A., Sperber, K. R., Taylor, K. E., Planton, Y. Y., Guilyardi, E., Durack, P., Bonfils, C., Zelinka, M. D., Chao, L.-W., Dong, B., Doutriaux, C., Zhang, C., Vo, T., Boutte, J., Wehner, M. F., Pendergrass, A. G., Kim, D., Xue, Z., Wittenberg, A. T., and Krasting, J.: Systematic and objective ...

  28. Meng, Z., Hakim, G. J., Yang, W., and Vecchi, G. A.: Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies, Geophysical Research Letters, 53, e2025GL117990, https://doi.org/10.1029/2025GL117990,

  29. Nikumbh, A. C., Lin, P., Paynter, D., and Ming, Y.: Does Increasing Horizontal Resolution Improve the Simulation of Intense Tropical Rainfall in GFDL’s AM4 Model?, Geophysical Research Letters, 51, e2023GL106708, https://doi.org/10.1029/2023GL106708,

  30. Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A.: FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators, https://arxiv.org/abs/2202.11214,

  31. Peings, Y., Dong, C., Mahesh, A., Pritchard, M., Collins, W., and Magnusdottir, G.: Subseasonal Forecasting and MJO Teleconnections in Machine Learning Weather Prediction Models, Journal of Geophysical Research: Atmospheres, 131, e2025JD044910, https://doi.org/10.1029/2025JD044910,

  32. Pithan, F., Athanase, M., Dahlke, S., Sánchez-Benítez, A., Shupe, M. D., Sledd, A., Streffing, J., Svensson, G., and Jung, T.: Nudging allows direct evaluation of coupled climate models with in situ observations: a case study from the MOSAiC expedition, Geoscientific Model Development, 16, 1857–1873, https://doi.org/10.5194/gmd-16-1857-2023,

  33. Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M.: Probabilistic weather forecasting with machine learning, Nature, 637, 84–90, https://doi.org/10.1038/s41586-024-08252-9,

  34. Rasp, S., Hoyer, S., Merose, A., Langmore, I., Battaglia, P., Russell, T., Sanchez-Gonzalez, A., Yang, V., Carver, R., Agrawal, S., Chantry, M., Ben Bouallegue, Z., Dueben, P., Bromberg, C., Sisk, J., Barrington, L., Bell, A., and Sha, F.: WeatherBench 2: A Benchmark for the Next Generation of Data-Driven Global Weather Models, Journal of Advances in Mod...

  35. Rucker, K., Baxter, I., Hassanzadeh, P., Shaw, T. A., and Pahlavan, H. A.: Benchmarking Regional Thermodynamic Trends in an AI emulator, ACE2, and a hybrid model, NeuralGCM, https://arxiv.org/abs/2511.00274,

  36. [36]

    G., and Coauthors, 2018: Storylines: an alternative approach to representing uncertainty in physical aspects of climate change

    Shepherd, T. G., Boyd, E., Calel, R., Chapman, S. C., Dessai, S., Dima-West, I., Fowler, H. J., James, R., Maraun, D., Martius, O., Senior, C. A., Sobel, A. H., and Stainforth, D. A.: Storylines: an alternative approach to representing uncertainty in physical aspects of climate change, Climatic Change, 151, 555–571, https://link.springer.com/article/10.10...

  37. [37]

    Sønderby, C. K., Espeholt, L., Heek, J., Dehghani, M., Oliver, A., Salimans, T., Bronstein, M., Kalchbrenner, N., and van den Oord, A.: MetNet: A Neural Weather Model for Precipitation Forecasting, https://arxiv.org/abs/2003.12140,

  38. [38]

    R.: Madden–Julian variability in NCAR CAM2.0 and CCSM2.0, Climate Dynamics, 23, 259–278, https://doi.org/10.1007/s00382-004-0447-4,

    Sperber, K. R.: Madden–Julian variability in NCAR CAM2.0 and CCSM2.0, Climate Dynamics, 23, 259–278, https://doi.org/10.1007/s00382-004-0447-4,

  39. [39]

    Stephens, G. L., L’Ecuyer, T., Forbes, R., Gettlemen, A., Golaz, J.-C., Bodas-Salcedo, A., Suzuki, K., Gabriel, P., and Haynes, J.: The dreary state of precipitation in global models, Journal of Geophysical Research: Atmospheres, 115, https://doi.org/10.1029/2010JD014532,

  40. [40]

    Taylor, K. E.: Summarizing multiple aspects of model performance in a single diagram, Journal of Geophysical Research, 106, 7183–7192, https://doi.org/10.1029/2000JD900719,

  41. [41]

    Ullrich, P. A., Barnes, E. A., Collins, W. D., Dagon, K., Duan, S., Elms, J., Lee, J., Leung, L. R., Lu, D., Molina, M. J., O’Brien, T. A., and Rebassoo, F. O.: Recommendations for Comprehensive and Independent Evaluation of Machine Learning-Based Earth System Models, Journal of Geophysical Research: Machine Learning and Computation, 2, e2024JH000496, ...

  42. [42]

    Waliser, D., Gleckler, P. J., Ferraro, R., Taylor, K. E., Ames, S., Biard, J., Bosilovich, M. G., Brown, O., Chepfer, H., Cinquini, L., Durack, P. J., Eyring, V., Mathieu, P.-P., Lee, T., Pinnock, S., Potter, G. L., Rixen, M., Saunders, R., Schulz, J., Thépaut, J.-N., and Tuma, M.: Observations for Model Intercomparison Project (Obs4MIPs): status for CMI...

  43. [43]

    Wang, B., Kim, H.-J., Kikuchi, K., and Kitoh, A.: Diagnostic metrics for evaluation of annual and diurnal cycles, Climate Dynamics, 37, 941–955, https://doi.org/10.1007/s00382-010-0877-0,

  44. [44]

    Watson-Parris, D., Rao, Y., Olivié, D., Seland, Ø., Nowack, P., Camps-Valls, G., Stier, P., Bouabid, S., Dewey, M., Fons, E., Gonzalez, J., Harder, P., Jeggle, K., Lenhardt, J., Manshausen, P., Novitasari, M., Ricard, L., and Roesch, C.: ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections, Journal of Advances in Modeling Earth Systems, 14,...

  45. [45]

    Watt-Meyer, O., Dresdner, G., McGibbon, J., Clark, S. K., Henn, B., Duncan, J., Brenowitz, N. D., Kashinath, K., Pritchard, M. S., Bonev, B., Peters, M. E., and Bretherton, C. S.: ACE: A fast, skillful learned global atmospheric model for climate prediction, https://arxiv.org/abs/2310.02074,

  46. [46]

    Xiang, B., Zhao, M., Held, I. M., and Golaz, J.-C.: Predicting the severity of spurious “double ITCZ” problem in CMIP5 coupled models from AMIP simulations, Geophysical Research Letters, 44, 1520–1527, https://doi.org/10.1002/2016GL071992,

  47. [47]

    Xie, P., Joyce, R., Wu, S., Yoo, S. H., Yarosh, Y., Sun, F., and Lin, R.: Reprocessed, bias-corrected CMORPH global high-resolution precipitation estimates from 1998, Journal of Hydrometeorology, 18, 1617–1641,

  48. [48]

    Yuval, J., Langmore, I., Kochkov, D., and Hoyer, S.: Neural general circulation models for modeling precipitation, Science Advances, 12, eadv6891, https://doi.org/10.1126/sciadv.adv6891,

  49. [49]

    Zhang, B. and Merlis, T. M.: The Equilibrium Response of Atmospheric Machine-Learning Models to Uniform Sea Surface Temperature Warming, https://arxiv.org/abs/2510.02415,

  50. [50]

    Zhang, G., Rao, M., Yuval, J., and Zhao, M.: Advancing seasonal prediction of tropical cyclone activity with a hybrid AI-physics climate model, Environmental Research Letters, 20, 094031, https://doi.org/10.1088/1748-9326/adf864, 2025a.

    Zhang, Q., Cheng, S., Liu, L., Zhang, L., Xu, J., She, D., and Yuan, Z.: Projections of climate change and its impacts ...

  51. [51]

    and (Lee et al., 2024). The climatology metric is computed with respect to the observed climatological fields provided by the reference dataset product corresponding to each examined variable as reported in Table

  52. [52]

    and (b) ACE2, (c) NeuralGCM-evap and (d) NeuralGCM-precip. The EWR is defined as the ratio of eastward power (as the average power in the dashed box on the right) to westward power (as the average power in the dashed box on the left) from the 2-dimensional wavenumber–frequency power spectra of daily 10°N–10°S averaged precipitation in May to October (sha...

  53. [53]

    Figure S23. Comparing the precipitation pentads between model and observations in NeuralGCM-evap. The monsoon metrics obtained from observation datasets (GPCP v1.3 and CMORPH v1.0; Joyce et al. (2004); Xie et al. (2017)) and Historical simulation conducted via (a) ACE2 and (b) NeuralGCM-precip. For each model, we analyzed results for six monsoon regions:...