AIMIP Phase 1: systematic evaluations of AI weather and climate models
Pith reviewed 2026-05-20 23:02 UTC · model grok-4.3
The pith
AI models simulate historical climate and forcing responses as well as conventional physics-based models, though some underestimate warming trends.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that AI models simulate the historical climate and its response to forcing as well as a conventional physically based model. At the same time, some AI models underestimate historical warming trends, and their predictions diverge in the out-of-sample generalization tests. This result comes from running the models under a shared experiment that supplies specified historical sea surface temperatures over 1979-2024 and requires training against reanalysis data, with evaluation on five major criteria.
What carries the argument
The AIMIP Phase 1 common experiment, which forces models with specified historical sea surface temperatures over 1979-2024 and requires training against reanalysis data, provides the shared protocol that makes differences in AI frameworks and architectures directly comparable.
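The shared protocol can be pictured as a small, fixed configuration that every participating model receives. The following is a minimal sketch only: the class name `Phase1Protocol` and its field names are hypothetical and do not come from the paper; the values (1979-2024, specified SSTs, reanalysis training, five criteria) are the ones stated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase1Protocol:
    """Hypothetical container for the shared experiment settings named in the text."""

    forcing: str = "specified historical sea surface temperatures"
    start_year: int = 1979
    end_year: int = 2024
    training_target: str = "historical reanalysis data"
    criteria: tuple = (
        "biases",
        "trends",
        "response to El Nino-related SST anomalies",
        "temporal variability",
        "out-of-sample generalization",
    )

    def n_years(self) -> int:
        # Inclusive count of simulated years.
        return self.end_year - self.start_year + 1


protocol = Phase1Protocol()
print(protocol.n_years(), len(protocol.criteria))
```

Freezing the dataclass mirrors the point of the protocol: the settings are identical for every model, so differences in output can be attributed to the modeling framework rather than to the experiment.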
If this is right
- AI models can be treated as viable options for reproducing historical climate behavior at the level of traditional models.
- Underestimation of warming trends in some AI models indicates a need to improve their representation of long-term climate changes.
- Divergence among models in out-of-sample tests shows that generalization performance is not uniform across AI architectures.
- The publicly released dataset allows the wider community to conduct additional checks and refine the models.
Where Pith is reading between the lines
- The same protocol could be extended to future forcing scenarios to check whether the current performance levels hold under changing conditions.
- Hybrid models that combine AI components with selected physical constraints might reduce the observed trend biases.
- Systematic intercomparisons like this one could become a standard step before deploying AI models in operational climate services.
Load-bearing premise
The comparison assumes that training on reanalysis data and forcing all models with the same historical sea surface temperatures gives a sufficient and unbiased basis for judging fundamentally different AI modeling systems.
What would settle it
A decisive test would be whether independent observational records or withheld recent years show that the AI models produce systematically larger errors in warming trends or El Niño responses than the conventional model across multiple metrics.
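One common way to quantify an El Niño response is to regress a variable's anomalies onto an SST index and compare the resulting slopes between a model and a reference. The sketch below assumes that approach; it is not confirmed to be the paper's method. The function name `regression_slope` is hypothetical and all series are synthetic.

```python
import numpy as np


def regression_slope(index, field):
    """Least-squares slope of `field` anomalies per unit of `index` anomaly."""
    index = np.asarray(index, dtype=float)
    field = np.asarray(field, dtype=float)
    i_anom = index - index.mean()
    f_anom = field - field.mean()
    return float(np.sum(i_anom * f_anom) / np.sum(i_anom ** 2))


# Synthetic monthly data (assumption: 240 months, unit-variance SST index).
rng = np.random.default_rng(0)
sst_index = rng.normal(size=240)
reference = 0.5 * sst_index + 0.1 * rng.normal(size=240)
model = 0.3 * sst_index + 0.1 * rng.normal(size=240)

ref_slope = regression_slope(sst_index, reference)
model_slope = regression_slope(sst_index, model)
print(f"reference slope: {ref_slope:.2f}, model slope: {model_slope:.2f}")
```

A model slope well below the reference slope on independent or withheld data would be the kind of systematic error the test above looks for.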
Original abstract
We present the AI weather and climate model intercomparison project (AIMIP), phase 1. Drawing from the rich tradition of intercomparisons in climate model development, we specify a common experiment, output data format, and training constraints (namely, training against historical reanalysis data) for AIMIP Phase 1 models. We aim to identify differences in modeling frameworks and AI architectural choices that influence model behavior, and build trust in AI weather and climate models through open data and evaluation. AIMIP Phase 1 models must simulate the atmosphere given specified historical sea surface temperatures over 1979-2024. We evaluate the models' performance using five major evaluation criteria: biases, trends, response to El Niño-related sea surface temperature anomalies, temporal variability, and out-of-sample generalization tests. We find that the AI models are able to simulate the historical climate and response to forcing as well as a conventional physically-based model, but some AI models underestimate historical warming trends, and their predictions diverge in the out-of-sample generalization tests. We describe the AIMIP Phase 1 dataset that is publicly available for additional evaluations.
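The first criterion, bias, is typically summarized as an area-weighted mean difference between a model field and a reference field. The sketch below assumes a regular latitude-longitude grid and cos(latitude) weights, which is a standard convention but not confirmed to be the paper's procedure. The function name `area_weighted_stats` and the toy fields are hypothetical.

```python
import numpy as np


def area_weighted_stats(model, reference, lat_deg):
    """Return (bias, rmse) of model minus reference, weighted by cos(latitude).

    model, reference: arrays of shape (n_lat, n_lon)
    lat_deg: latitudes in degrees, shape (n_lat,)
    """
    diff = np.asarray(model, dtype=float) - np.asarray(reference, dtype=float)
    # Broadcast one weight per latitude row across all longitudes.
    weights = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(diff)
    bias = np.sum(weights * diff) / np.sum(weights)
    rmse = np.sqrt(np.sum(weights * diff ** 2) / np.sum(weights))
    return float(bias), float(rmse)


# Toy example: a uniform 0.5-unit offset on a 3 x 4 grid.
lat = np.array([-60.0, 0.0, 60.0])
reference_field = np.zeros((3, 4))
model_field = np.full((3, 4), 0.5)
bias, rmse = area_weighted_stats(model_field, reference_field, lat)
print(bias, rmse)
```

For a uniform offset, bias and RMSE coincide; they separate once errors vary in sign or magnitude across the grid.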
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AIMIP Phase 1, an intercomparison project for AI weather and climate models. It specifies a common experiment where models simulate the atmosphere with prescribed historical sea surface temperatures from 1979 to 2024, trained on reanalysis data. The models are evaluated on five criteria: biases, trends, response to El Niño-related SST anomalies, temporal variability, and out-of-sample generalization. The main results indicate that AI models can simulate historical climate and forcing responses comparably to a conventional physically-based model, although some underestimate historical warming trends and their predictions diverge in out-of-sample tests. The AIMIP Phase 1 dataset is made publicly available for further evaluations.
Significance. If the findings hold, this work is significant for establishing standardized benchmarks in the rapidly developing field of AI-based climate modeling, helping to build trust and identify strengths and weaknesses of different AI architectures. A notable strength is the commitment to open data and evaluation, which facilitates community scrutiny and additional analyses. This aligns with the tradition of model intercomparisons but adapts it to AI frameworks, potentially accelerating progress in the area.
major comments (2)
- Section 2 (Experiment Setup): The central claim that AI models perform 'as well as' conventional models rests on the common AMIP-style experiment with prescribed SSTs and reanalysis training. However, without an ablation study comparing performance when the conventional model is similarly constrained to reanalysis fields, it is unclear whether the equivalence reflects true dynamical skill or reproduction of reanalysis-embedded patterns. This is load-bearing for the abstract's performance claims.
- Section 4 (Evaluation Criteria and Results): The qualitative assessment that some AI models underestimate historical warming trends and diverge in out-of-sample tests lacks supporting quantitative details such as specific trend magnitudes, RMSE values, or statistical tests. This makes it difficult to gauge the practical significance of these differences and assess robustness across the five evaluation criteria.
minor comments (2)
- Abstract: The expansion of the AIMIP acronym is provided, but a brief mention of the specific conventional model used for comparison would improve clarity.
- Data Availability: While the dataset is stated to be publicly available, including a direct link or DOI in the main text would enhance accessibility.
Simulated Author's Rebuttal
We thank the referee for their detailed review and positive evaluation of the significance of our work on AIMIP Phase 1. We address the major comments below and have updated the manuscript accordingly to improve clarity and provide additional quantitative information.
- Referee: Section 2 (Experiment Setup): The central claim that AI models perform 'as well as' conventional models rests on the common AMIP-style experiment with prescribed SSTs and reanalysis training. However, without an ablation study comparing performance when the conventional model is similarly constrained to reanalysis fields, it is unclear whether the equivalence reflects true dynamical skill or reproduction of reanalysis-embedded patterns. This is load-bearing for the abstract's performance claims.
  Authors: The conventional model is a physics-based GCM run in standard AMIP configuration with prescribed SSTs, without being constrained or nudged to reanalysis atmospheric fields. This is the appropriate baseline for comparison, as it represents a traditional dynamical model simulating the atmosphere under the same boundary conditions. The AI models, although trained on reanalysis, do not merely reproduce it: they exhibit specific shortcomings, such as underestimating warming trends even though those trends are present in the training data. This suggests limitations in capturing certain dynamical processes. We have added text in Section 2 that explicitly describes the setup differences between the AI and conventional models. A full ablation with the conventional model constrained to reanalysis would require additional experiments (e.g., data assimilation) that are not standard in AMIP; it is planned for future work.
  Revision: partial
- Referee: Section 4 (Evaluation Criteria and Results): The qualitative assessment that some AI models underestimate historical warming trends and diverge in out-of-sample tests lacks supporting quantitative details such as specific trend magnitudes, RMSE values, or statistical tests. This makes it difficult to gauge the practical significance of these differences and assess robustness across the five evaluation criteria.
  Authors: We agree that quantitative details enhance the interpretability of the results. In the revised manuscript, we have expanded Section 4 to include specific numerical values: for example, global surface temperature trends (in K per decade) for each AI model and the conventional model, RMSE metrics for biases and temporal variability, and p-values from statistical significance tests comparing trends and out-of-sample performance. These have been added to the text, a new table, and updated figure captions for clarity.
  Revision: yes
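A trend in K per decade, as requested by the referee, can be obtained from an ordinary least-squares fit to an annual-mean series. The following is a minimal sketch under that assumption. The function name `trend_per_decade` is hypothetical, and the two series are synthetic placeholders, not values from the paper.

```python
import numpy as np


def trend_per_decade(years, values):
    """Ordinary least-squares linear trend, expressed per decade."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    # Center the time axis to keep the fit well conditioned.
    slope, _intercept = np.polyfit(years - years.mean(), values, 1)
    return 10.0 * float(slope)


years = np.arange(1979, 2025)  # 1979-2024 inclusive
# Synthetic annual-mean temperature anomalies in K (illustrative only).
reference_series = 0.020 * (years - 1979)
model_series = 0.012 * (years - 1979)

ref_trend = trend_per_decade(years, reference_series)
model_trend = trend_per_decade(years, model_series)
ratio = model_trend / ref_trend
print(f"reference: {ref_trend:.2f} K/decade, model: {model_trend:.2f} K/decade")
```

A ratio below 1 is what "underestimating the warming trend" would look like in this framing; a significance test on the slope difference would still be needed before drawing conclusions from real data.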
Circularity Check
No circularity: evaluation intercomparison without self-referential derivations
This paper describes an intercomparison project (AIMIP Phase 1) that specifies a common experimental protocol—prescribed historical SSTs from 1979-2024 and training against reanalysis data—then reports direct performance comparisons of AI models against a conventional physics-based model on biases, trends, ENSO response, variability, and out-of-sample tests. No derivation chain, first-principles prediction, or fitted parameter is presented that reduces by construction to its own inputs. The central claims rest on empirical evaluation outputs rather than any self-definition, ansatz smuggling, or load-bearing self-citation. The setup is externally benchmarked against an independent conventional model and publicly released data, satisfying the criteria for a self-contained, non-circular analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Historical reanalysis data accurately represent past atmospheric states for the purpose of training and evaluating models.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "AIMIP Phase 1 models must simulate the atmosphere given specified historical sea surface temperatures over 1979-2024... evaluate... biases, trends, response to El Niño... temporal variability, and out-of-sample generalization tests."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "We find that the AI models are able to simulate the historical climate and response to forcing as well as a conventional physically-based model"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI
  An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month ...