pith. machine review for the scientific record.

arxiv: 2601.17636 · v2 · submitted 2026-01-25 · ⚛️ physics.ao-ph

Recognition: 2 theorem links

· Lean Theorem

HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts

Pith reviewed 2026-05-16 11:50 UTC · model grok-4.3

classification ⚛️ physics.ao-ph
keywords: machine learning, data assimilation, weather forecasting, initial conditions, forecast skill, HEALPix, AI weather models

The pith

A simple machine learning data assimilation system provides initial conditions for off-the-shelf AI weather models, and the resulting forecasts lose less than one day of effective lead time when scored against ERA5.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HealDA, a neural network that directly maps a short window of satellite and conventional observations to a global 1-degree atmospheric state on the HEALPix grid. When these analyses initialize existing ML forecast models such as FourCastNet3, Aurora, and FengWu without any retraining, the resulting forecasts trail those started from ERA5 by under one day of effective lead time. Forecast error growth is unchanged relative to ERA5-initialized runs, so the skill difference traces back to larger starting errors in the HealDA analyses. Spectral analysis shows these initial errors concentrate at large scales and in upper-tropospheric fields, which the paper attributes to overfitting. Small changes to the verification setup alone can shift the apparent skill gap by 12 to 24 hours.

Core claim

HealDA functions strictly as a data assimilation module whose analyses initialize off-the-shelf ML forecast models. For models including FourCastNet3 (FCN3), Aurora, and FengWu, the HealDA-initialized forecasts lose less than one day of effective lead time when scored against ERA5, while HealDA-initialized FCN3 ensembles trail the ECMWF IFS ENS system by less than 24 hours. Forecast error growth is unchanged under HealDA initialization, and the skill gap arises primarily from larger initial errors that spectral analysis attributes to overfitting on large scales and upper-tropospheric fields.
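
One way to make the "effective lead time" comparison concrete is sketched below. It is a minimal illustration under stated assumptions, not the paper's scoring code: the function names `interp_lead_time` and `lead_time_loss`, the use of linear interpolation, and the toy RMSE values are all hypothetical. For each lead time of the test run, the sketch finds the later lead time at which the control curve reaches the same error and reports the difference in hours.

```python
from bisect import bisect_left


def interp_lead_time(leads, errors, target):
    """Lead time (hours) at which an increasing error curve reaches `target`.

    Uses linear interpolation between samples; returns None when `target`
    lies outside the range covered by the curve.
    """
    if target < errors[0] or target > errors[-1]:
        return None
    i = bisect_left(errors, target)
    if errors[i] == target:
        return leads[i]
    # Here i >= 1 because target > errors[0].
    frac = (target - errors[i - 1]) / (errors[i] - errors[i - 1])
    return leads[i - 1] + frac * (leads[i] - leads[i - 1])


def lead_time_loss(leads, control_err, test_err):
    """Hours of skill lost by the test run relative to the control, per lead time.

    A value of 12 at lead 24 h means the test run at 24 h is as wrong as
    the control run at 36 h.
    """
    losses = []
    for lead, err in zip(leads, test_err):
        equivalent = interp_lead_time(leads, control_err, err)
        losses.append(None if equivalent is None else equivalent - lead)
    return losses


# Toy numbers (hypothetical, not taken from the paper), sampled every 24 h.
leads = [0, 24, 48, 72, 96]
control = [0.0, 10.0, 20.0, 30.0, 40.0]  # stand-in for an ERA5-initialized run
perturbed = [5.0, 15.0, 25.0, 35.0, 45.0]  # stand-in for a HealDA-initialized run
losses = lead_time_loss(leads, control, perturbed)
print(losses)
```

In this toy case the loss is a constant 12 hours wherever it is defined, which is the signature of a larger starting error with unchanged growth. The last entry is `None` because the test error exceeds anything on the control curve.
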

What carries the argument

HealDA, the direct observation-to-state neural network that converts a short window of observations into a 1° HEALPix atmospheric analysis without iterative steps.
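
For a sense of scale, the sketch below works out the pixel count and approximate resolution of a HEALPix grid. It assumes that the "HPX64" grid named in the figure captions corresponds to the HEALPix resolution parameter nside = 64, which this page does not state explicitly. The function names are illustrative. Because HEALPix pixels are equal-area, a global RMSE on this grid is a plain mean over pixels with no latitude weighting.

```python
import math


def healpix_npix(nside):
    """Number of equal-area pixels on a HEALPix grid with parameter nside."""
    return 12 * nside * nside


def healpix_resolution_deg(nside):
    """Approximate pixel width in degrees: square root of the per-pixel solid angle."""
    pixel_area_sr = 4.0 * math.pi / healpix_npix(nside)
    return math.degrees(math.sqrt(pixel_area_sr))


def global_rmse(forecast, truth):
    """RMSE over HEALPix pixels; equal-area pixels need no latitude weights."""
    if len(forecast) != len(truth):
        raise ValueError("fields must have the same number of pixels")
    mse = sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth)
    return math.sqrt(mse)


nside = 64  # assumption: "HPX64" denotes nside = 64
print(healpix_npix(nside), healpix_resolution_deg(nside))
```

Under that assumption the grid has 49152 pixels of a little under one degree across, consistent with the 1° resolution quoted for the analysis.
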

If this is right

  • Error growth rates in the ML forecast models stay the same whether initialized by HealDA or by NWP analyses.
  • The skill gap originates mainly from higher initial errors concentrated at large scales and in the upper troposphere.
  • Verification setup variations can alter apparent skill differences by 12-24 hours, requiring consistent scoring.
  • A direct-mapping ML DA system already supplies initial conditions usable by current state-of-the-art ML forecast models.
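
The first two points can be tied together with a toy error-growth model. This is a simplifying assumption made here for illustration (pure exponential growth at a shared rate $\lambda$, no saturation), not a model taken from the paper. The symbols $e_{\mathrm{ref}}$, $e_{\mathrm{DA}}$, and $\Delta t$ are introduced for this sketch, and the reference analysis is assumed to have a nonzero initial error.

```latex
\begin{align}
e_{\mathrm{ref}}(t) &= e_{\mathrm{ref}}(0)\, e^{\lambda t},
\qquad
e_{\mathrm{DA}}(t) = e_{\mathrm{DA}}(0)\, e^{\lambda t}, \\
e_{\mathrm{DA}}(t) &= e_{\mathrm{ref}}(t + \Delta t)
\quad\Longrightarrow\quad
\Delta t = \frac{1}{\lambda} \ln \frac{e_{\mathrm{DA}}(0)}{e_{\mathrm{ref}}(0)} .
\end{align}
```

In this idealization the lead-time loss $\Delta t$ does not depend on $t$: it is set entirely by the ratio of initial errors. As a hypothetical example, an error-doubling time of 2 days gives $\lambda = \ln 2 / (2\ \text{days})$, so an initial error larger by a factor of $\sqrt{2}$ costs $\Delta t = 1$ day.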

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reducing overfitting on large scales inside HealDA would likely close most of the remaining skill gap.
  • This direct-mapping approach could support faster, lower-cost end-to-end ML weather pipelines by cutting dependence on full NWP assimilation infrastructure.
  • Future progress in AI weather forecasting may depend more on improving initial-condition quality than on further model architecture changes.

Load-bearing premise

The selected off-the-shelf ML forecast models represent the broader class of AI weather models, and the chosen verification metrics and observation window fairly capture operational differences.

What would settle it

A side-by-side plot of error-growth curves for HealDA-initialized versus ERA5-initialized runs in an additional ML model, or the same comparison repeated with verification metrics focused on small-scale fields.

Figures

Figures reproduced from arXiv: 2601.17636 by Aayush Gupta (1), Akshay Subramaniam (1), Christopher Miller (3), Karthik Kashinath (1), Kelsey Lieberman (3), Michael S. Pritchard (1), Nicholas Silverman (3), Noah D. Brenowitz (1), Sergey Frolov (2). Affiliations: (1) NVIDIA Corporation, (2) NOAA, (3) MITRE Corporation.

Figure 1
Figure 1: End-to-end HealDA system and forecasting pipeline. Observations from various remote-sensing instruments (ATMS, MHS, etc.) and in-situ sources (radiosondes, buoys, etc.) in the time window [𝑡0 − 21 h, 𝑡0 + 3 h] are processed by HealDA, which consists of an Observation Encoder (Obs Encoder) followed by an HPX ViT backbone, to produce an analysis state on the HPX grid at the target time 𝑡0. This analysis can …
Figure 2
Figure 2: RMSE of HealDA analysis vs IFS. Time series of global RMSE for both HealDA and IFS against ERA5 in the 2022 test period, computed every 6 hours (00/06/12/18 UTC). The original data is shown with reduced opacity to reduce noise, and the solid line represents the 7-day moving average.
Figure 3
Figure 3: Probabilistic FCN3 skill with HealDA and ERA5 initial conditions. CRPS of FCN3 forecasts initialized by HealDA and ERA5, both verified against ERA5 on the HPX64 grid and averaged over 128 initial conditions at 06/18 UTC in 2022. The inset panels zoom into the 6-48 h lead time range.
Figure 4
Figure 4: Probabilistic skill of HealDA-initialized FCN3 vs IFS ENS. CRPS of IFS ENS forecasts and FCN3 forecasts initialized from HealDA, verified against ERA5 on the HPX64 grid and averaged over 128 initial conditions at 00/12 UTC in 2022.
Figure 5
Figure 5: Analysis error spectral decomposition. Spherical power spectra of HealDA and IFS HRES analysis errors on the HPX64 grid, scored relative to ERA5. The HealDA error spectra are shown, averaged over the test year (2022), in solid lines, and a year from the training period (2021), in dashed lines. For IFS, the error spectra averaged across 2021-2022 are shown. Spectra are shown as a function of spherical harmo…
Figure 6
Figure 6: Error growth. Error power spectra of FCN3 forecasts initialized with HealDA analysis versus ERA5, shown as a function of spherical harmonic degree for (a) Z500 and (b) T850 at multiple forecast lead times. Power is visualized as 10 log10 𝐶ℓ.
Figure 7
Figure 7: HealDA can initialize FengWu and Aurora. RMSE of deterministic Aurora and FengWu forecasts initialized from either ERA5 (solid) or HealDA (dashed). Scores are averaged over 128 initial conditions at 06/18 UTC in 2022 and verified against ERA5 on the HPX64 grid.
Figure 8
Figure 8: Availability of observations across the test period. Observation counts at each 6-hour window centered at 00/06/12/18 UTC for HealDA's sensor suite in the test period: (a) AMSU-A, (b) MHS, (c) ATMS microwave sounders, and (d) all conventional observations. Solid lines show the number of observations per window; dashed lines show the annual mean.
Figure 9
Figure 9: HealDA network architecture. Observation streams are flattened and then passed through sensor-specific embedders (detailed in …
Figure 10
Figure 10: HealDA Sensor Embedder. Each raw observation is described by integer metadata (e.g., HPX pixel, channel, platform), floating-point metadata (e.g., satellite scan angles, local solar time, pressure, height), and the measurement itself. Integer metadata are mapped through embedding tables and combined with featurized floating-point metadata along with the measurement through an Obs tokenizer MLP, yielding …
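
Figures 3 and 4 score ensembles with the continuous ranked probability score (CRPS). For readers unfamiliar with it, the sketch below computes the standard empirical ensemble estimator, CRPS = E|X − y| − ½ E|X − X′|, for a single verification value. Whether the paper uses this estimator or a "fair" finite-ensemble variant is not stated on this page, so treat the choice, the function name, and the toy values as assumptions.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble for one scalar verification value.

    CRPS = E|X - y| - 0.5 * E|X - X'|, with both expectations taken over
    ensemble members. Lower is better; for a single member the score
    reduces to the absolute error.
    """
    n = len(members)
    if n == 0:
        raise ValueError("ensemble must contain at least one member")
    accuracy = sum(abs(x - observation) for x in members) / n
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (n * n)
    return accuracy - 0.5 * spread


# Toy ensemble (hypothetical values) verified against a single value.
score = ensemble_crps([1.0, 2.0, 3.0], 2.0)
print(score)
```

The spread term rewards ensembles whose members disagree by about as much as they miss the verification value, which is why CRPS is used for probabilistic rather than deterministic skill. In practice the score would be averaged over grid points and initial conditions.
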
Original abstract

AI weather models now rival leading numerical weather prediction (NWP) systems in medium-range skill. However, almost all still rely on NWP data assimilation (DA) to provide initial conditions, tying them to expensive infrastructure and limiting the practical speed and accuracy gains of ML. More recently, ML-based DA systems have been proposed, which are often trained and evaluated end-to-end with a forecast model, making it difficult to assess the quality of their analysis fields. We introduce HealDA, a global ML-based DA system that maps a short window of satellite and conventional observations directly to a 1° atmospheric state on the HEALPix grid, using a smaller sensor suite than operational NWP. We treat HealDA strictly as a DA module: its analyses are used to initialize off-the-shelf ML forecast models without any fine-tuning of either. For a variety of off-the-shelf ML forecast models, including FourCastNet3 (FCN3), Aurora, and FengWu, HealDA-initialized forecasts lose less than one day of effective lead time when scored against ERA5. HealDA-initialized FCN3 ensembles similarly trail those of the ECMWF IFS ENS system by < 24 h. We find that forecast error growth in these models is unchanged from HealDA initialization, and the skill gap primarily arises from the larger initial error of the HealDA analysis. Spectral analysis reveals that this stems from overfitting to the large scales and upper-tropospheric fields. We also demonstrate that small changes in the verification setup can shift apparent skill by 12-24 h, underscoring the need for consistent scoring. Taken together, these results clarify the current performance of ML-based DA systems and show that a relatively simple, direct observation-to-state network can already provide initial conditions that are usable by state-of-the-art ML forecast models with only modest loss in medium-range skill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces HealDA, a global ML-based data assimilation system that maps a short window of satellite and conventional observations directly to a 1° atmospheric state on the HEALPix grid. Treating HealDA strictly as a DA module, the authors initialize off-the-shelf ML forecast models (FourCastNet3, Aurora, FengWu) without fine-tuning and report that the resulting forecasts lose less than one day of effective lead time when scored against ERA5, with unchanged error growth rates relative to ERA5-initialized runs. The skill gap is attributed primarily to larger initial errors arising from overfitting to large scales and upper-tropospheric fields, supported by spectral analysis. HealDA-initialized FCN3 ensembles trail ECMWF IFS ENS by <24 h. The paper also shows that small changes in verification setup can shift apparent skill by 12-24 h.

Significance. If the central quantitative claims prove robust, the work is significant for clarifying the role of initial-condition errors in AI weather models and demonstrating that a relatively simple, direct observation-to-state ML DA system can deliver usable initial conditions for state-of-the-art forecast models with only modest medium-range skill loss. The cross-model empirical tests, ensemble comparisons, and spectral diagnosis of error sources provide concrete evidence that initial-error magnitude, rather than altered error growth, drives the performance gap. This could reduce reliance on expensive NWP DA infrastructure.

major comments (3)
  1. [Results (lead-time and error-growth comparisons)] The central claim that HealDA-initialized forecasts lose <1 day of effective lead time (and exhibit unchanged error growth) is load-bearing for the paper's conclusions, yet the manuscript itself reports that small changes in verification setup shift apparent skill by 12-24 h. Without systematic sensitivity tests across the specific choices of scoring metric, reference threshold, pressure levels/variables, and ERA5 vs. independent observations for the HealDA vs. control comparisons, it is unclear whether the quantitative bound holds under alternative but plausible protocols.
  2. [Error growth analysis subsection] The statement that forecast error growth remains unchanged from HealDA initialization requires explicit quantitative support, such as fitted growth rates with confidence intervals or statistical tests comparing HealDA-initialized vs. ERA5-initialized trajectories, to confirm the difference is not significant given the larger initial errors.
  3. [Spectral analysis section] The spectral analysis attributing initial errors to overfitting on large scales and upper-tropospheric fields is used to explain the skill gap; the manuscript should specify the exact spectral bands, variables, and quantitative metric (e.g., power spectrum ratio or scale-dependent RMSE) used to identify this overfitting and demonstrate it is not an artifact of the chosen verification window.
minor comments (3)
  1. Figure captions should explicitly label all curves (model, initialization method, ensemble vs. deterministic) and include the verification metric and reference dataset for immediate readability.
  2. The abstract states results for 'a variety of off-the-shelf ML forecast models' but the main text should list all tested models and any selection criteria if additional models beyond FCN3, Aurora, and FengWu were evaluated.
  3. Consider adding a summary table of effective lead-time losses broken down by model, variable, and pressure level to complement the narrative claims.
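
Major comment 2 asks for fitted growth rates with confidence intervals. A sketch of what such an analysis could look like follows. It is not the authors' code: it assumes log-linear (exponential) growth fitted by ordinary least squares, resamples initial-condition cases with replacement, and reports a percentile interval. The function names, the synthetic error curves, the 5% noise level, and the seed are all hypothetical.

```python
import math
import random


def fit_growth_rate(leads, errors):
    """Least-squares slope of log(error) against lead time (the exponential rate)."""
    logs = [math.log(e) for e in errors]
    n = len(leads)
    mean_t = sum(leads) / n
    mean_y = sum(logs) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(leads, logs))
    var = sum((t - mean_t) ** 2 for t in leads)
    return cov / var


def bootstrap_growth_ci(leads, cases, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the growth rate of the case-mean error curve.

    `cases` is a list of per-initial-condition error curves aligned with `leads`.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        sample = [rng.choice(cases) for _ in cases]  # resample cases with replacement
        mean_curve = [sum(col) / len(sample) for col in zip(*sample)]
        rates.append(fit_growth_rate(leads, mean_curve))
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def synthetic_cases(leads, e0, rate, n_cases, rng):
    """Hypothetical curves: e0 * exp(rate * t) with up to 5% multiplicative noise."""
    return [
        [e0 * math.exp(rate * t) * rng.uniform(0.95, 1.05) for t in leads]
        for _ in range(n_cases)
    ]


leads = [24, 48, 72, 96, 120]  # lead times in hours
rng = random.Random(1)
control = synthetic_cases(leads, e0=1.0, rate=0.01, n_cases=20, rng=rng)
perturbed = synthetic_cases(leads, e0=1.5, rate=0.01, n_cases=20, rng=rng)  # larger start, same rate
ci_control = bootstrap_growth_ci(leads, control)
ci_perturbed = bootstrap_growth_ci(leads, perturbed)
overlap = ci_control[0] <= ci_perturbed[1] and ci_perturbed[0] <= ci_control[1]
print(ci_control, ci_perturbed, overlap)
```

Overlapping intervals are only an informal check. A formal comparison would bootstrap the difference in growth rates directly, ideally pairing the two runs by initial condition.
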

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important aspects of robustness in our quantitative claims. We have revised the manuscript to incorporate additional analyses addressing each major point, as detailed below.

Point-by-point responses
  1. Referee: [Results (lead-time and error-growth comparisons)] The central claim that HealDA-initialized forecasts lose <1 day of effective lead time (and exhibit unchanged error growth) is load-bearing for the paper's conclusions, yet the manuscript itself reports that small changes in verification setup shift apparent skill by 12-24 h. Without systematic sensitivity tests across the specific choices of scoring metric, reference threshold, pressure levels/variables, and ERA5 vs. independent observations for the HealDA vs. control comparisons, it is unclear whether the quantitative bound holds under alternative but plausible protocols.

    Authors: We agree that systematic sensitivity testing strengthens the central claim. In the revised manuscript we have added a dedicated sensitivity analysis subsection. This includes tests varying the scoring metric (RMSE versus anomaly correlation coefficient), reference thresholds for effective lead time, and pressure levels/variables (Z500, T850, U200). The <1-day effective lead-time loss remains consistent across these choices, with variations of 12-24 h as previously noted. All comparisons use ERA5 as the common reference for both HealDA and control runs to maintain fairness. We also discuss the practical limitations of independent global observations and why ERA5 provides the most consistent benchmark. revision: yes

  2. Referee: [Error growth analysis subsection] The statement that forecast error growth remains unchanged from HealDA initialization requires explicit quantitative support, such as fitted growth rates with confidence intervals or statistical tests comparing HealDA-initialized vs. ERA5-initialized trajectories, to confirm the difference is not significant given the larger initial errors.

    Authors: We appreciate this request for quantitative rigor. We have added fitted exponential growth rates (with 95% confidence intervals obtained via bootstrap resampling) to the error-growth subsection. For each model and variable, the growth rates from HealDA and ERA5 initializations are statistically indistinguishable (two-sample t-test on bootstrap replicates, p > 0.05). The confidence intervals overlap substantially, confirming that the larger initial error, rather than altered growth, accounts for the skill gap. These results and the associated statistical tests are now reported explicitly. revision: yes

  3. Referee: [Spectral analysis section] The spectral analysis attributing initial errors to overfitting on large scales and upper-tropospheric fields is used to explain the skill gap; the manuscript should specify the exact spectral bands, variables, and quantitative metric (e.g., power spectrum ratio or scale-dependent RMSE) used to identify this overfitting and demonstrate it is not an artifact of the chosen verification window.

    Authors: We have expanded the spectral analysis section with the requested details. We compute the power-spectrum ratio (HealDA/ERA5) integrated over zonal wavenumber bands 1-10 (large scales) and 11-50 (mesoscales) for variables Z500, T850, and U200. Overfitting is identified by excess power ratios >1.2 in the large-scale band and upper-tropospheric levels. To rule out verification-window artifacts, we repeated the analysis over five independent 10-day windows spanning different seasons; the scale-dependent excess remains consistent. These specifications and robustness checks are now stated explicitly in the text and figure captions. revision: yes
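
The band-integrated power ratio described in the third response could be computed along the following lines. This is a sketch under assumptions rather than the authors' implementation: it treats the band index as spherical harmonic degree ℓ, as in Figures 5 and 6 (the response itself says "zonal wavenumber"), weights each degree by (2ℓ + 1), and uses synthetic spectra. The function names are hypothetical; the bands and the 1.2 threshold are the ones quoted in the response.

```python
import math


def band_power_ratio(cl_test, cl_ref, l_min, l_max):
    """Ratio of power summed over spherical harmonic degrees l_min..l_max (inclusive).

    `cl_test` and `cl_ref` are lists indexed by degree l (index 0 is l = 0).
    Each degree is weighted by (2l + 1), the number of orders m it contains.
    """
    num = sum((2 * l + 1) * cl_test[l] for l in range(l_min, l_max + 1))
    den = sum((2 * l + 1) * cl_ref[l] for l in range(l_min, l_max + 1))
    return num / den


def to_decibels(cl):
    """Convert power values to decibels, 10 * log10(C_l), as in the Figure 6 caption."""
    return [10.0 * math.log10(c) for c in cl]


# Synthetic spectra (hypothetical): 50% excess power at degrees <= 10.
l_top = 50
cl_ref = [1.0] * (l_top + 1)
cl_test = [1.5 if l <= 10 else 1.0 for l in range(l_top + 1)]

THRESHOLD = 1.2  # excess-power threshold quoted in the response
bands = {"large scales": (1, 10), "mesoscales": (11, 50)}
ratios = {name: band_power_ratio(cl_test, cl_ref, lo, hi) for name, (lo, hi) in bands.items()}
flags = {name: ratio > THRESHOLD for name, ratio in ratios.items()}
print(ratios, flags)
```

With these synthetic inputs only the large-scale band is flagged. Real spectra would come from a spherical harmonic transform of the analysis-error fields, which is outside the scope of this sketch.
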

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation

full rationale

The manuscript presents HealDA as a trained ML mapping from observations to analysis state, then reports direct empirical comparisons of initialized forecasts against ERA5 and ECMWF ensembles using standard skill metrics. No equations, uniqueness theorems, or derivations are invoked; all headline claims (effective lead-time loss <1 day, unchanged error growth) are measured outcomes on held-out data rather than quantities forced by construction from fitted parameters or self-citations. Verification sensitivity is acknowledged but does not alter the non-circular status of the reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of ML weather modeling and ERA5 as ground truth; no new physical axioms or invented entities are introduced.

axioms (1)
  • domain assumption: ERA5 reanalysis serves as a reliable verification target for medium-range skill.
    Used throughout the evaluation without additional justification in the abstract.

pith-pipeline@v0.9.0 · 5717 in / 1278 out tokens · 34938 ms · 2026-05-16T11:50:53.868651+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards accurate extreme event likelihoods from diffusion model climate emulators

    physics.ao-ph · 2026-05 · unverdicted · novelty 6.0

    Diffusion model climate emulators provide probability density estimates that allow likelihood calculations and odds-ratio-based importance sampling for extreme events such as tropical cyclones.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Deterministic nonperiodic flow.J

    Edward N Lorenz. Deterministic nonperiodic flow.J. Atmos. Sci., 20(2):130–141, March 1963. ISSN 0022-4928. doi: 10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2. 1

  2. [2]

    Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023. 1, 2, 3

  3. [3]

    Pangu-weather: A 3D high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556,

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-weather: A 3D high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556,

  4. [4]

    ClimaX: A foundation model for weather and climate.arXiv [cs.LG], January 2023

    Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. ClimaX: A foundation model for weather and climate.arXiv [cs.LG], January 2023. 1 19 HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts

  5. [5]

    Prognostic validation of a neural network unified physics parameteriza- tion.Geophysicak Research Letters, 17:2493, June 2018

    N D Brenowitz and C S Bretherton. Prognostic validation of a neural network unified physics parameteriza- tion.Geophysicak Research Letters, 17:2493, June 2018. ISSN 0094-8276. doi: 10.1029/2018GL078510. 1

  6. [6]

    Can machines learn to predict weather? using deep learning to predict gridded 500-hPa geopotential height from historical weather data.J

    Jonathan A Weyn, Dale R Durran, and Rich Caruana. Can machines learn to predict weather? using deep learning to predict gridded 500-hPa geopotential height from historical weather data.J. Adv. Model. Earth Syst., 11(8):2680–2693, August 2019. ISSN 1942-2466,1942-2466. doi: 10.1029/2019MS001705. 1

  7. [7]

    The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time.Commun

    Kang Chen, Tao Han, Fenghua Ling, Junchao Gong, Lei Bai, Xinyu Wang, Jing-Jia Luo, Ben Fei, Wenlong Zhang, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, and Wanli Ouyang. The operational medium-range deterministic weather forecasting can be extended beyond a 10-day lead time.Commun. Earth Environ., 6(1):518, July 2025. ...

  8. [8]

    Brenowitz, Yair Cohen, Jaideep Pathak, Ankur Mahesh, Boris Bonev, Thorsten Kurth, Dale R

    Noah D. Brenowitz, Yair Cohen, Jaideep Pathak, Ankur Mahesh, Boris Bonev, Thorsten Kurth, Dale R. Durran, Peter Harrington, and Michael S. Pritchard. A practical probabilistic benchmark for ai weather models.Geophysical Research Letters, 52(7), April 2025. ISSN 1944-8007. doi: 10.1029/2024gl113656. URLhttp://dx.doi.org/10.1029/2024GL113656. 2, 3

  9. [9]

    WeatherBench 2: A benchmark for the next generation of data-driven global weather models.J

    Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. WeatherBench 2: A benchmark for the next generation of data-driven global we...

  10. [10]

    Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023

    Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, Remi Lam, and Matthew Willson. Gencast: Diffusion-based ensemble forecasting for medium-range weather.arXiv preprint arXiv:2312.15796, 2023. 2, 3

  11. [11]

    Brenner, and Stephan Hoyer

    Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, Sam Hatfield, Peter Battaglia, Alvaro Sanchez- Gonzalez, Matthew Willson, Michael P. Brenner, and Stephan Hoyer. Neural general circulation models for weather and climate.Nature, 632(8027):1060–1066, July 2024. IS...

  12. [12]

    Observations and data assimilation.https: //www.ecmwf.int/en/research/data-assimilation/observations, 2023

    European Centre for Medium-Range Weather Forecasts. Observations and data assimilation.https: //www.ecmwf.int/en/research/data-assimilation/observations, 2023. Accessed: 2026-01-08. 2

  13. [13]

    Klinker, J.-F

    Florence Rabier, H Järvinen, E. Klinker, J.-F. Mahfouf, and Adrian Simmons. The ecmwf operational implementation of four dimensional variational assimilation. part i: Experimental results with simplified physics, 02/1999 1999. URLhttps://www.ecmwf.int/node/11794. 2

  14. [14]

    Buizza, Magdalena Alonso Balmaseda, Andrew Brown, S

    R. Buizza, Magdalena Alonso Balmaseda, Andrew Brown, S. J. English, Richard Forbes, Alan Geer, T. Haiden, Martin Leutbecher, Linus Magnusson, Mark Rodwell, M. Sleigh, Tim Stockdale, Frédéric Vitart, and N. Wedi. The development and evaluation process followed at ecmwf to upgrade the integrated forecasting system (ifs). ECMWF Techni- cal Memorandum No. 829...

  15. [15]

    End-to-enddata-drivenweatherprediction.Nature, 641:1172–1179,

    A.Allen, S.Markou, W.Tebbutt, etal. End-to-enddata-drivenweatherprediction.Nature, 641:1172–1179,

  16. [16]

    URL https://doi.org/10.1038/s41586-025-08897-0

    doi: 10.1038/s41586-025-08897-0. URL https://doi.org/10.1038/s41586-025-08897-0. Published online: 20 March 2025; Version of record: 21 May 2025. 2, 3, 5, 8

  17. [17]

    Huracan: A skillful end-to-end data-driven system for ensemble data assimilation and weather prediction, 2025

    ZekunNi, JonathanWeyn,HangZhang, YanfeiXiang, JiangBian,WeixinJin, KitThambiratnam, QiZhang, Haiyu Dong, and Hongyu Sun. Huracan: A skillful end-to-end data-driven system for ensemble data assimilation and weather prediction, 2025. URLhttps://arxiv.org/abs/2508.18486. 2, 3, 5, 8, 9 20 HealDA: Highlighting the importance of initial errors in end-to-end AI ...

  18. [18]

    Xichen: An observation-scalable fully ai-driven global weather forecasting system with 4D variational knowledge, 2025

    Wuxin Wang, Weicheng Ni, Lilan Huang, Tao Hao, Ben Fei, Shuo Ma, Taikang Yuan, Yanlai Zhao, Kefeng Deng, Xiaoyong Li, Boheng Duan, Lei Bai, and Kaijun Ren. Xichen: An observation-scalable fully ai-driven global weather forecasting system with 4D variational knowledge, 2025. URLhttps: //arxiv.org/abs/2507.09202. 2, 3, 8

  19. [19]

    X. Sun, X. Zhong, X. Xu, et al. A data-to-forecast machine learning system for global weather.Nature Communications, 16:6658, 2025. doi: 10.1038/s41467-025-62024-1. URLhttps://doi.org/10.1038/ s41467-025-62024-1. Published online: 19 July 2025. 2, 3, 5, 8, 12

  20. [20]

    Collins, Michael S

    Boris Bonev, Thorsten Kurth, Ankur Mahesh, Mauro Bisson, Jean Kossaifi, Karthik Kashinath, Anima Anandkumar, William D. Collins, Michael S. Pritchard, and Alexander Keller. Fourcastnet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale, 2025. URLhttps://arxiv. org/abs/2507.12144. 2, 3

  21. [21]

    Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A

    Cristian Bodnar, Wessel P. Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A. Weyn, Haiyu Dong, Jayesh K. Gupta, Kit Thambiratnam, Alexander T. Archibald, Chun-Chieh Wu, Elizabeth Heider, Max Welling, Richard E. Turner, and Paris Perdikaris. A foundation model for the earth system.Nature, May ...

  22. [22]

    FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

    Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators, 2022. URLhttps...

  23. [23]

    Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead, 2023

    Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, and Wanli Ouyang. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead, 2023. URLhttps://arxiv.org/abs/2304. 02948. 3

  24. [24]

    Score-based data assimilation, 2023

    François Rozet and Gilles Louppe. Score-based data assimilation, 2023. URLhttps://arxiv.org/abs/ 2306.10574. 3

  25. [25]

    Generative data assimilation of sparse weather station observations at kilometer scales, 2025

    Peter Manshausen, Yair Cohen, Peter Harrington, Jaideep Pathak, Mike Pritchard, Piyush Garg, Morteza Mardani, Karthik Kashinath, Simon Byrne, and Noah Brenowitz. Generative data assimilation of sparse weather station observations at kilometer scales, 2025. URLhttps://arxiv.org/abs/2406.16947. 3

  26. [26]

    Dueben, and Torsten Hoefler

    Langwen Huang, Lukas Gianinazzi, Yuejiang Yu, Peter D. Dueben, and Torsten Hoefler. Diffda: a diffusion model for weather-scale data assimilation, 2024. URLhttps://arxiv.org/abs/2401.05932. 3

  27. [27]

    Appa: Bending weather dynamics with latent diffusion models for global data assimilation, 2025

    Gérôme Andry, Sacha Lewin, François Rozet, Omer Rochman, Victor Mangeleer, Matthias Pirlet, Elise Faulx, Marilaure Grégoire, and Gilles Louppe. Appa: Bending weather dynamics with latent diffusion models for global data assimilation, 2025. URLhttps://arxiv.org/abs/2504.18720. 3

  28. [28]

    Lo-sda: Latent optimization for score-based atmospheric data assimilation, 2025

    Jing-An Sun, Hang Fan, Junchao Gong, Ben Fei, Kun Chen, Fenghua Ling, Wenlong Zhang, Wanghan Xu, Li Yan, Pierre Gentine, and Lei Bai. Lo-sda: Latent optimization for score-based atmospheric data assimilation, 2025. URLhttps://arxiv.org/abs/2510.22562. 3

  29. [29]

    Data driven weather forecasts trained and initialised directly from observations, 2024

    Anthony McNally, Christian Lessig, Peter Lean, Eulalie Boucher, Mihai Alexe, Ewan Pinnington, Matthew Chantry, Simon Lang, Chris Burrows, Marcin Chrust, Florian Pinault, Ethel Villeneuve, Niels Bormann, and Sean Healy. Data driven weather forecasts trained and initialised directly from observations, 2024. URLhttps://arxiv.org/abs/2407.15586. 4

  30. [30]

    An update on ai–dop: skil- ful weather forecasts produced directly from observations.ECMWF Newsletter, (182): 15–18, 2025

    Tony McNally, Christian Lessig, Peter Lean, Eulalie Boucher, Mihai Alexe, Ewan Pinning- ton, Patrick Laloyaux, Simon Lang, Florian Pinault, Matt Chantry, Chris Burrows, Ethel Villeneuve, Marcin Chrust, Niels Bormann, and Sean Healy. An update on ai–dop: skil- ful weather forecasts produced directly from observations.ECMWF Newsletter, (182): 15–18, 2025. d...

  31. [31]

    Dawp: A framework for global observation forecasting via data assimilation and weather prediction in satellite observation space, 2025

    Junchao Gong, Jingyi Xu, Ben Fei, Fenghua Ling, Wenlong Zhang, Kun Chen, Wanghan Xu, Weidong Yang, Xiaokang Yang, and Lei Bai. Dawp: A framework for global observation forecasting via data assimilation and weather prediction in satellite observation space, 2025. URL https://arxiv.org/abs/2510.15978. 4

  32. [32]

    Forecast performance of the ecmwf operational forecasting system in 2022. ECMWF Newsletter, (175):5–12, 2023

    Thomas Haiden, Matthieu Chevallier, and David Richardson. Forecast performance of the ecmwf operational forecasting system in 2022. ECMWF Newsletter, (175):5–12, 2023. 5

  33. [33]

    The era5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020

    Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The era5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020. 8, 12, 17

  34. [34]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv [cs.LG], October 2025. doi: 10.48550/arXiv.2303.08797. 8

  35. [35]

    Climate in a bottle: Towards a generative foundation model for the kilometer-scale global atmosphere. arXiv [physics.ao-ph], May 2025

    Noah D Brenowitz, Tao Ge, Akshay Subramaniam, Aayush Gupta, David M Hall, Morteza Mardani, Arash Vahdat, Karthik Kashinath, and Michael S Pritchard. Climate in a bottle: Towards a generative foundation model for the kilometer-scale global atmosphere. arXiv [physics.ao-ph], May 2025. URL https://arxiv.org/abs/2505.06474. 9, 13

  36. [36]

    SamudrACE: Fast and accurate coupled climate modeling with 3D ocean and atmosphere emulators

    James P C Duncan, Elynn Wu, Surya Dheeshjith, Adam Subel, Troy Arcomano, Spencer K Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W Andre Perkins, William Gregory, Carlos Fernandez-Granda, Julius Busecke, Oliver Watt-Meyer, William J Hurlin, Alistair Adcroft, Laure Zanna, and Christopher Bretherton. SamudrACE: Fast and accurate coupled climate modeling wit...

  37. [37]

    ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses. arXiv [physics.ao-ph], November 2024

    Oliver Watt-Meyer, Brian Henn, Jeremy McGibbon, Spencer K Clark, Anna Kwa, W Andre Perkins, Elynn Wu, Lucas Harris, and Christopher S Bretherton. ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses. arXiv [physics.ao-ph], November 2024. 9

  38. [38]

    Ace: A fast, skillful learned global atmospheric model for climate prediction, 2023

    Oliver Watt-Meyer, Gideon Dresdner, Jeremy McGibbon, Spencer K. Clark, Brian Henn, James Duncan, Noah D. Brenowitz, Karthik Kashinath, Michael S. Pritchard, Boris Bonev, Matthew E. Peters, and Christopher S. Bretherton. Ace: A fast, skillful learned global atmospheric model for climate prediction, 2023. URL https://arxiv.org/abs/2310.02074. 12

  40. [40]

    K. M. Gorski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. Healpix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759–771, April 2005. ISSN 1538-4357. doi: 10.1086/427976. URL http://dx.doi.org/10.1086/427976. 13

  41. [41]

    Advancing parsimonious deep learning weather prediction using the healpix mesh, 2024

    Matthias Karlbauer, Nathaniel Cresswell-Clay, Dale R. Durran, Raul A. Moreno, Thorsten Kurth, Boris Bonev, Noah Brenowitz, and Martin V. Butz. Advancing parsimonious deep learning weather prediction using the healpix mesh, 2024. URL https://arxiv.org/abs/2311.06253. 13

  42. [42]

    Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. 16

  43. [43]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748. 16

  44. [45]

    URL https://arxiv.org/abs/1910.07467. 17

  45. [46]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7. 17

  46. [47]

    Deep Networks with Stochastic Depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016. doi: 10.48550/arXiv.1603.09382. 17

  47. [48]

    Weatherbench 2: A benchmark for the next generation of data-driven global weather models, 2024

    Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. Weatherbench 2: A benchmark for the next generation of data-driven global we...

  48. [49]

    The Global Ensemble Forecast System (version 13) Replay dataset

    NOAA. The Global Ensemble Forecast System (version 13) Replay dataset. NOAA Open Data Dissemination Program. Available at: https://psl.noaa.gov/data/ufs_replay/, 2024. URL https://psl.noaa.gov/data/ufs_replay/. Subset used: January 2000 – December 2023. Accessed: December 20 2025. 18

  49. [50]

    Methods for assessing the impact of current and future components of the global observing system, 04/2024 2024

    Sean Healy, Niels Bormann, Alan Geer, Elias Holm, Bruce Ingleby, Katie Lean, Katrin Lonitz, and Cristina Lupu. Methods for assessing the impact of current and future components of the global observing system, 04/2024 2024. 18

  50. [51]

    Ascat wind data processing manual

    KNMI and OSI SAF and EUMETSAT. Ascat wind data processing manual. Technical report, KNMI, 2009. URL https://scatterometer.knmi.nl/old_manuals/ss3_pm_ascat_1.0.pdf. Accessed: 2025-12-01. 19

  51. [52]

    Active techniques in wind observations: Scatterometer

    ECMWF. Active techniques in wind observations: Scatterometer. URL https://www.ecmwf.int/sites/default/files/elibrary/2015/8918-active-techniques-wind-observations-scatterometer.pdf. Accessed: 2025-12-01. 19

  53. [54]

    Atmospheric motion vectors: Past, present and future

    Mary Forsythe. Atmospheric motion vectors: Past, present and future. Technical report, ECMWF / Met Office Seminar on Recent Developments in Use of Satellite Observations in NWP, 2008. URL https://www.ecmwf.int/sites/default/files/elibrary/2008/74512-atmospheric-motion-vectors-past-present-and-future_0.pdf. ECMWF Seminar on Satellite Observations i...

  54. [55]

    Gps radio occultation lecture notes, 2015

    ECMWF. Gps radio occultation lecture notes, 2015. URL https://www.ecmwf.int/sites/default/files/gpsro_lecture_2015_nwpsaf.pdf. ECMWF / NWPSAF training material. 19

  55. [56]

    Earth2studio: Open-source deep-learning framework for ai weather/climate workflows

    Nick Geneva and the NVIDIA Earth2Studio Team. Earth2studio: Open-source deep-learning framework for ai weather/climate workflows. URL https://github.com/NVIDIA/earth2studio/releases/tag/0.9.0. 19

  56. [57]

    Michaël Zamo and Philippe Naveau. Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50(2):209–234, February 2018. doi: 10.1007/s11004-017-9709-7. URL https://doi.org/10.1007/s11004-017-9709-7. 24

  57. [58]

    Strictly proper scoring rules, prediction, and estimation

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc., 102(477):359–378, March 2007. ISSN 0162-1459. doi: 10.1198/016214506000001437. 24

  58. [59]

    Wmo integrated processing and prediction system activities – part ii: Specifications of wmo integrated processing and prediction system activities

    World Meteorological Organization. Wmo integrated processing and prediction system activities – part ii: Specifications of wmo integrated processing and prediction system activities. Wmo-no. 485, World Meteorological Organization, 2023. URL https://library.wmo.int/idurl/4/35703. Part II: Specifications of WMO Integrated Processing and Prediction System Act...

  59. [60]

    IFS Documentation CY48R1 – Part V: Ensemble Prediction System

    ECMWF. IFS Documentation CY48R1 – Part V: Ensemble Prediction System. Number 5 in IFS Documentation. European Centre for Medium-Range Weather Forecasts, 2023. doi: 10.21957/e529074162. 26

  60. [61]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021. URL https://arxiv.org/abs/2103.14030. 29

[Figure caption fragment: Red dotted lines mark reference thresholds (ACC = 0.6; SSR = 1).]

[Figure: FCN3 RMSE versus lead time (hours). Panels: (a) Z500 [m² s⁻²], (b) T850 [K], ...]