
arxiv: 2605.06944 · v2 · pith:ONFNJHF4new · submitted 2026-05-07 · ⚛️ physics.ao-ph

AIMIP Phase 1: systematic evaluations of AI weather and climate models

Pith reviewed 2026-05-20 23:02 UTC · model grok-4.3

classification ⚛️ physics.ao-ph
keywords AI weather models · climate model intercomparison · historical climate simulation · sea surface temperature forcing · El Niño response · out-of-sample generalization · reanalysis data · AI climate models

The pith

AI models simulate historical climate and forcing responses as well as conventional physics-based models, though some underestimate warming trends.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up AIMIP Phase 1 as a standardized intercomparison that tests AI weather and climate models under identical conditions. Every model is trained on reanalysis data and must simulate the atmosphere while driven by the same historical sea surface temperatures from 1979 to 2024. Performance is checked across biases, trends, El Niño responses, temporal variability, and out-of-sample generalization. The central finding is that the AI models match a conventional physics-based model in many respects, which matters because it tests whether data-driven approaches can serve as practical alternatives to physics-based climate models. Differences appear in trend underestimation and generalization behavior, pointing to specific areas where further work is needed.

Core claim

The authors establish that AI models are able to simulate the historical climate and response to forcing as well as a conventional physically-based model. At the same time some AI models underestimate historical warming trends and their predictions diverge in the out-of-sample generalization tests. This result comes from running the models under a shared experiment that supplies specified historical sea surface temperatures over 1979-2024 and requires training against reanalysis data, with evaluation on five major criteria.

What carries the argument

The AIMIP Phase 1 common experiment, which forces models with specified historical sea surface temperatures over 1979-2024 and requires training against reanalysis data, provides the shared protocol that makes differences in AI frameworks and architectures directly comparable.

If this is right

  • AI models can be treated as viable options for reproducing historical climate behavior at the level of traditional models.
  • Underestimation of warming trends in some AI models indicates a need to improve their representation of long-term climate changes.
  • Divergence among models in out-of-sample tests shows that generalization performance is not uniform across AI architectures.
  • The publicly released dataset allows the wider community to conduct additional checks and refine the models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same protocol could be extended to future forcing scenarios to check whether the current performance levels hold under changing conditions.
  • Hybrid models that combine AI components with selected physical constraints might reduce the observed trend biases.
  • Systematic intercomparisons like this one could become a standard step before deploying AI models in operational climate services.

Load-bearing premise

The comparison assumes that training on reanalysis data and forcing all models with the same historical sea surface temperatures gives a sufficient and unbiased basis for judging fundamentally different AI modeling systems.

What would settle it

A decisive test would be whether independent observational records or withheld recent years show that the AI models produce systematically larger errors in warming trends or El Niño responses than the conventional model across multiple metrics.

Figures

Figures reproduced from arXiv: 2605.06944 by Antonia Jost, Brian Henn, Christian Lessig, Christopher S. Bretherton, Dale Durran, Dmitrii Kochkov, Guillaume Couairon, Ignacio Lopez-Gomez, Janni Yuval, Kyle Joseph Chen Hall, Maria J. Molina, Nathaniel Cresswell-Clay, Nikolay Koldunov, Noah Brenowitz, Oliver Watt-Meyer, Peter Manshausen, Renu Singh, Robert Brunstein, Stephan Hoyer, Troy Arcomano, Yana Hasson.

Figure 1
Biases at 1° resolution versus ERA5, for the AIWCMs and a CMIP6 model (GFDL-CM4, bottom row). (a), (b): 2-meter air temperature biases over the training (1979-2014) and test (2015-2024) periods, respectively. GFDL-CM4 data end in 2014 and so are only available over the training period. (c), (d): surface precipitation biases over the same periods, for models that included surface precipitation outputs (Arch…
Figure 2
RMSB area-weighted over the globe on the 1° grid. (a) through (g): surface variables; (h) 500 hPa geopotential height; (i) through (l), (m) through (p): temperature, specific humidity, and u, v wind at 850 hPa and 250 hPa, respectively. Bars indicate the ensemble medians and error bars indicate the ensemble ranges.
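
The area-weighted RMSB named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that RMSB is the root-mean-square of a time-mean bias map and that the area weight on a regular latitude-longitude grid is the cosine of latitude; the caption does not spell out either point. The function name, grid, and toy values are hypothetical and are not taken from the paper.

```python
import math

def area_weighted_rmsb(bias, lats_deg):
    """Root-mean-square of a time-mean bias map, weighted by cos(latitude).

    bias: list of rows, one row per latitude, each row a list over longitude.
    lats_deg: latitude of each row in degrees.
    Assumption: regular lat-lon grid, so cell area is proportional to cos(lat).
    """
    num = 0.0
    den = 0.0
    for row, lat in zip(bias, lats_deg):
        w = math.cos(math.radians(lat))
        for b in row:
            num += w * b * b
            den += w
    return math.sqrt(num / den)

# Toy 3x2 bias map (model minus reference), in kelvin
lats = [-60.0, 0.0, 60.0]
bias = [[2.0, 2.0], [1.0, 1.0], [2.0, 2.0]]
rmsb = area_weighted_rmsb(bias, lats)
```

In this toy case the high-latitude rows carry half the weight of the equatorial row, so the large polar biases count for less than they would in an unweighted average.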
Figure 3
Global- and annual-mean 2-meter air temperature, shown as anomalies from the training period (1979-2014) average. ERA5 is in black; AIWCM model ensemble means are shown, along with the CMIP6 GFDL-CM4 single-member prediction. The AIMIP test period (2015-2024) is shaded at right.
4.3 E2: Trends. We compute trends first by computing global area-weighted annual mean series, and then fitting linear trends to th…
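
The trend procedure quoted in the Figure 3 entry (global area-weighted annual means first, then a linear fit) can be sketched as follows. This is a minimal illustration that assumes cos-latitude weights on a regular grid and an ordinary least-squares fit; the function names and the synthetic warming data are hypothetical.

```python
import math

def global_mean(field, lats_deg):
    """Area-weighted mean of one lat-lon map using cos(latitude) weights."""
    num = den = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den

def linear_trend(years, values):
    """Ordinary least-squares slope of values against years (units per year)."""
    n = len(years)
    ybar = sum(years) / n
    vbar = sum(values) / n
    sxy = sum((y - ybar) * (v - vbar) for y, v in zip(years, values))
    sxx = sum((y - ybar) ** 2 for y in years)
    return sxy / sxx

# Synthetic annual-mean maps that warm uniformly by 0.02 K per year
lats = [-45.0, 0.0, 45.0]
years = list(range(1979, 1989))
annual_maps = [[[0.02 * (yr - 1979)] * 4 for _ in lats] for yr in years]

# Step 1: reduce each map to a global mean; step 2: fit a line to the series
series = [global_mean(m, lats) for m in annual_maps]
trend_per_decade = 10.0 * linear_trend(years, series)
```

Averaging before fitting keeps the trend estimate a single number per model, which is the form compared across models in the trend bar charts.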
Figure 4
Trends of global- and annual-mean variables. (a through e) surface variables, (f) 500 hPa geopotential height, (g), (h) 850 hPa temperature and humidity, and (i), (j) 250 hPa temperature and humidity. In (d) mean sea level pressure trend is shown for all models that submitted this variable, but for ACE2.1-ERA5, MD-1.5 v0.9 and NeuralGCM surface pressure trend is shown. The dark background bar is ERA5. GFDL…
Figure 5
Trend maps at 1° resolution over the training period for (a) 2-meter temperature and (b) surface precipitation.
We also show maps of trends computed at the gridpoint scale. In …
Figure 6
ENSO coefficient maps at 1° resolution for ERA5 (upper left panels) and model coefficient errors versus ERA5 coefficients (subsequent panels) over the training period, for (a) 2-meter temperature and (b) surface precipitation.
… 6-hourly predictions), which may influence their ability to capture the daily average variability evaluated here. MD-1.5 v0.9 makes predictions only at a monthly timestep and is not …
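
One common way to build an ENSO coefficient map like the one in Figure 6 is to regress each gridpoint's anomaly series on an ENSO index and keep the slope. The caption does not state the paper's exact definition, so the sketch below is an assumption: the index, the series, and the function name are all hypothetical, and only a single gridpoint is shown.

```python
def regression_coefficient(index, series):
    """Least-squares slope of a local series regressed on an ENSO index."""
    n = len(index)
    ibar = sum(index) / n
    sbar = sum(series) / n
    cov = sum((i - ibar) * (s - sbar) for i, s in zip(index, series))
    var = sum((i - ibar) ** 2 for i in index)
    return cov / var

# Hypothetical ENSO index (K) and 2-meter temperature anomalies (K) at one gridpoint
enso_index = [-1.0, -0.5, 0.0, 0.5, 1.0]
reference_t2m = [-0.8, -0.4, 0.0, 0.4, 0.8]   # responds at 0.8 K per K of index
model_t2m = [-0.5, -0.25, 0.0, 0.25, 0.5]     # responds at 0.5 K per K of index

ref_coef = regression_coefficient(enso_index, reference_t2m)
model_coef = regression_coefficient(enso_index, model_t2m)

# The "coefficient error" panels would then show model minus reference
coef_error = model_coef - ref_coef
```

A negative error at a gridpoint means the model responds more weakly to the index than the reference does at that location.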
Figure 7
Standard deviation of daily anomalies from monthly mean at 1° resolution over 1979, for 2-meter air temperature (a) and surface precipitation (b). Upper left panels show the anomaly standard deviation in ERA5, and subsequent panels show the error in model anomaly standard deviations.
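
The daily variability metric in the Figure 7 caption can be sketched for a single gridpoint. The version below assumes that each day's anomaly is taken from the mean of its own calendar month and that anomalies from all months are pooled into one population standard deviation; the caption does not confirm the pooling choice. The function name and values are hypothetical.

```python
import math

def daily_anomaly_std(daily_by_month):
    """Standard deviation of daily values about their own monthly mean.

    daily_by_month: list of months, each a list of daily values at one gridpoint.
    Anomalies are pooled over all months (population standard deviation).
    """
    sq_sum = 0.0
    count = 0
    for days in daily_by_month:
        month_mean = sum(days) / len(days)
        sq_sum += sum((d - month_mean) ** 2 for d in days)
        count += len(days)
    return math.sqrt(sq_sum / count)

# Two short "months" with different means; only within-month spread matters
reference = [[1.0, 3.0, 1.0, 3.0], [10.0, 14.0, 10.0, 14.0]]
model = [[1.5, 2.5, 1.5, 2.5], [11.0, 13.0, 11.0, 13.0]]

std_ref = daily_anomaly_std(reference)
std_model = daily_anomaly_std(model)
std_error = std_model - std_ref   # negative: model is too smooth day to day
```

Removing the monthly mean first keeps the seasonal cycle from inflating the result, so the metric isolates day-to-day weather variability.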
Figure 8
Global area-weighted mean of model daily anomaly standard deviation errors, relative to global-mean ERA5 daily variability, at 1° resolution for the set of variables shown in …
Figure 9
Time-mean response to +2 K and +4 K SST perturbations, for 2-meter air temperature (a), (b) and surface precipitation (c), (d), respectively. Only +4 K SST perturbations are available for the GFDL-CM4 model.
… reliably predict future climate trends using historical information and reliable physical knowledge is a key challenge for the AIWCM community over the next few years.
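
The time-mean response in the Figure 9 caption can be illustrated with a minimal sketch. It assumes the response is the difference between the time mean of a perturbed-SST run and the time mean of an unperturbed control run, and it adds a per-kelvin normalization so the +2 K and +4 K cases can be compared; the normalization is an illustrative choice, not something the caption describes. All names and values are hypothetical.

```python
def time_mean(series):
    """Arithmetic mean over the time axis of a single-gridpoint series."""
    return sum(series) / len(series)

def response_per_kelvin(control, perturbed, delta_sst_k):
    """Time-mean change between runs, divided by the imposed SST increase."""
    return (time_mean(perturbed) - time_mean(control)) / delta_sst_k

# Hypothetical 2-meter air temperature series (K) from three runs
control = [288.0, 288.2, 287.8]
plus2k = [290.4, 290.6, 290.2]
plus4k = [292.8, 293.0, 292.6]

r2 = response_per_kelvin(control, plus2k, 2.0)
r4 = response_per_kelvin(control, plus4k, 4.0)
```

If a model's per-kelvin response differs strongly between the two perturbation sizes, that would point to nonlinear behavior outside the training range.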
Original abstract

We present the AI weather and climate model intercomparison project (AIMIP), phase 1. Drawing from the rich tradition of intercomparisons in climate model development, we specify a common experiment, output data format, and training constraints (namely, training against historical reanalysis data) for AIMIP Phase 1 models. We aim to identify differences in modeling frameworks and AI architectural choices that influence model behavior, and build trust in AI weather and climate models through open data and evaluation. AIMIP Phase 1 models must simulate the atmosphere given specified historical sea surface temperatures over 1979-2024. We evaluate the models' performance using five major evaluation criteria: biases, trends, response to El Niño-related sea surface temperature anomalies, temporal variability, and out-of-sample generalization tests. We find that the AI models are able to simulate the historical climate and response to forcing as well as a conventional physically-based model, but some AI models underestimate historical warming trends, and their predictions diverge in the out-of-sample generalization tests. We describe the AIMIP Phase 1 dataset that is publicly available for additional evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AIMIP Phase 1, an intercomparison project for AI weather and climate models. It specifies a common experiment where models simulate the atmosphere with prescribed historical sea surface temperatures from 1979 to 2024, trained on reanalysis data. The models are evaluated on five criteria: biases, trends, response to El Niño-related SST anomalies, temporal variability, and out-of-sample generalization. The main results indicate that AI models can simulate historical climate and forcing responses comparably to a conventional physically-based model, although some underestimate historical warming trends and their predictions diverge in out-of-sample tests. The AIMIP Phase 1 dataset is made publicly available for further evaluations.

Significance. If the findings hold, this work is significant for establishing standardized benchmarks in the rapidly developing field of AI-based climate modeling, helping to build trust and identify strengths and weaknesses of different AI architectures. A notable strength is the commitment to open data and evaluation, which facilitates community scrutiny and additional analyses. This aligns with the tradition of model intercomparisons but adapts it to AI frameworks, potentially accelerating progress in the area.

major comments (2)
  1. Section 2 (Experiment Setup): The central claim that AI models perform 'as well as' conventional models rests on the common AMIP-style experiment with prescribed SSTs and reanalysis training. However, without an ablation study comparing performance when the conventional model is similarly constrained to reanalysis fields, it is unclear whether the equivalence reflects true dynamical skill or reproduction of reanalysis-embedded patterns. This is load-bearing for the abstract's performance claims.
  2. Section 4 (Evaluation Criteria and Results): The qualitative assessment that some AI models underestimate historical warming trends and diverge in out-of-sample tests lacks supporting quantitative details such as specific trend magnitudes, RMSE values, or statistical tests. This makes it difficult to gauge the practical significance of these differences and assess robustness across the five evaluation criteria.
minor comments (2)
  1. Abstract: The expansion of the AIMIP acronym is provided, but a brief mention of the specific conventional model used for comparison would improve clarity.
  2. Data Availability: While the dataset is stated to be publicly available, including a direct link or DOI in the main text would enhance accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and positive evaluation of the significance of our work on AIMIP Phase 1. We address the major comments below and have updated the manuscript accordingly to improve clarity and provide additional quantitative information.

Point-by-point responses
  1. Referee: Section 2 (Experiment Setup): The central claim that AI models perform 'as well as' conventional models rests on the common AMIP-style experiment with prescribed SSTs and reanalysis training. However, without an ablation study comparing performance when the conventional model is similarly constrained to reanalysis fields, it is unclear whether the equivalence reflects true dynamical skill or reproduction of reanalysis-embedded patterns. This is load-bearing for the abstract's performance claims.

    Authors: The conventional model is a physics-based GCM run in standard AMIP configuration with prescribed SSTs but without being constrained or nudged to reanalysis atmospheric fields. This is the appropriate baseline for comparison, as it represents a traditional dynamical model simulating the atmosphere under the same boundary conditions. The AI models, while trained on reanalysis, do more than reproduce it: they exhibit specific shortcomings, such as underestimating warming trends even though those trends are present in the training data. This suggests limitations in capturing certain dynamical processes. We have added text in Section 2 to explicitly describe the setup differences between the AI and conventional models to avoid any ambiguity. We note that a full ablation with the conventional model constrained to reanalysis would require additional experiments (e.g., data assimilation) that are not standard in AMIP; this is planned for future work.

    Revision: partial

  2. Referee: Section 4 (Evaluation Criteria and Results): The qualitative assessment that some AI models underestimate historical warming trends and diverge in out-of-sample tests lacks supporting quantitative details such as specific trend magnitudes, RMSE values, or statistical tests. This makes it difficult to gauge the practical significance of these differences and assess robustness across the five evaluation criteria.

    Authors: We agree that quantitative details enhance the interpretability of the results. In the revised manuscript, we have expanded Section 4 to include specific numerical values: for example, global surface temperature trends (in K per decade) for each AI model and the conventional model, RMSE metrics for biases and temporal variability, and p-values from statistical significance tests comparing trends and out-of-sample performance. These have been added to the text, a new table, and updated figure captions for clarity.

    Revision: yes
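
The exchange above asks for trend magnitudes with statistical support. As a minimal sketch of what such a number could look like, the code below fits an ordinary least-squares trend and reports its standard error and t-statistic. It assumes independent residuals, which annual climate series often violate, and it stops short of a p-value because that requires a Student-t distribution not included here. The rebuttal does not say which tests the authors used; the function name and synthetic data are hypothetical.

```python
import math

def trend_with_stderr(years, values):
    """OLS slope and its standard error, assuming independent residuals."""
    n = len(years)
    ybar = sum(years) / n
    vbar = sum(values) / n
    sxx = sum((y - ybar) ** 2 for y in years)
    slope = sum((y - ybar) * (v - vbar) for y, v in zip(years, values)) / sxx
    intercept = vbar - slope * ybar
    resid_ss = sum((v - (intercept + slope * y)) ** 2 for y, v in zip(years, values))
    # Residual variance uses n - 2 degrees of freedom (slope and intercept fitted)
    stderr = math.sqrt(resid_ss / (n - 2) / sxx)
    return slope, stderr

# Synthetic series: 0.02 K per year plus alternating +/-0.05 K noise
years = list(range(2000, 2010))
noise = [0.05, -0.05] * 5
values = [0.02 * (y - 2000) + e for y, e in zip(years, noise)]

slope, stderr = trend_with_stderr(years, values)
t_stat = slope / stderr
trend_k_per_decade = 10.0 * slope
```

Reporting the standard error next to each K-per-decade value would let a reader judge whether a model's trend shortfall relative to the reference is larger than the fit uncertainty.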

Circularity Check

0 steps flagged

No circularity: evaluation intercomparison without self-referential derivations

full rationale

This paper describes an intercomparison project (AIMIP Phase 1) that specifies a common experimental protocol—prescribed historical SSTs from 1979-2024 and training against reanalysis data—then reports direct performance comparisons of AI models against a conventional physics-based model on biases, trends, ENSO response, variability, and out-of-sample tests. No derivation chain, first-principles prediction, or fitted parameter is presented that reduces by construction to its own inputs. The central claims rest on empirical evaluation outputs rather than any self-definition, ansatz smuggling, or load-bearing self-citation. The setup is externally benchmarked against an independent conventional model and publicly released data, satisfying the criteria for a self-contained, non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The project rests on standard climate modeling assumptions about reanalysis data quality and the representativeness of chosen metrics; no new entities are postulated and no free parameters are fitted in the reported work.

axioms (1)
  • Domain assumption: Historical reanalysis data accurately represents past atmospheric states for the purpose of training and evaluating models.
    Models are required to train against this data under the project constraints described in the abstract.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI

    physics.ao-ph · 2026-05 · unverdicted · novelty 7.0

    An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month ...
