pith. machine review for the scientific record. sign in

arxiv: 2604.09041 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI· physics.ao-ph· stat.ML

Recognition: 1 theorem link

· Lean Theorem

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AIphysics.ao-phstat.ML
keywords U-Netprobabilistic weather forecastingCRPSMonte Carlo dropoutensemble predictionAI weather modelscomputational efficiencydeterministic pre-training
0
0 comments X

The pith

A standard U-Net with simple staged training matches top probabilistic weather models at over 10x lower compute and latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces U-Cast, a probabilistic weather forecaster built on an off-the-shelf U-Net. It first trains the network deterministically to minimize mean absolute error, then briefly fine-tunes it on the continuous ranked probability score while using Monte Carlo dropout to generate ensemble members. At 1.5° resolution this recipe produces skill scores that match or exceed those of GenCast and the IFS ensemble while cutting training compute by more than a factor of ten and inference time by the same margin. Training finishes in under twelve H200 GPU-days and a full sixty-step ensemble is produced in eleven seconds. The central implication is that frontier probabilistic performance need not require bespoke architectures or massive budgets.

Core claim

U-Cast demonstrates that a conventional U-Net backbone, pre-trained deterministically on mean absolute error and then fine-tuned probabilistically on the continuous ranked probability score with Monte Carlo dropout, matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5° resolution while reducing training compute by over 10× relative to leading CRPS-based models and inference latency by over 10× relative to diffusion-based models.

What carries the argument

U-Net backbone trained in two stages: deterministic MAE pre-training followed by short CRPS fine-tuning that uses Monte Carlo dropout to produce stochastic ensemble members.

If this is right

  • General-purpose convolutional architectures can reach state-of-the-art probabilistic weather skill without domain-specific design choices.
  • Training budgets for frontier probabilistic models can be reduced by an order of magnitude.
  • Inference speed improvements allow 60-step ensembles to be generated in seconds rather than minutes.
  • Lower resource requirements open frontier weather modeling to a wider research community.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage curriculum might transfer to other high-dimensional forecasting tasks where diffusion or transformer ensembles are currently dominant.
  • Monte Carlo dropout appears sufficient to generate useful ensemble spread, suggesting that more expensive stochastic layers may not always be required.
  • If the efficiency gains hold at higher resolutions, operational centers could afford more frequent ensemble updates or additional ensemble members.

Load-bearing premise

The reported skill comparisons against GenCast and IFS ENS are performed at identical resolution and lead times with no post-hoc data selection or metric choices that favor the simpler model.

What would settle it

An independent, apples-to-apples re-evaluation at 1.5° resolution showing U-Cast CRPS scores materially worse than those of GenCast or IFS ENS for the same lead times and variables.

Figures

Figures reproduced from arXiv: 2604.09041 by Duncan Watson-Parris, Rose Yu, Salva R\"uhling Cachay.

Figure 1
Figure 1. Figure 1: The Efficiency-Accuracy Pareto Frontier. We visualize forecast skill (y-axis, % improvement over IFS ENS), inference la￾tency (x-axis), and training cost (bubble size). Our model (top-left) achieves state-of-the-art performance while requiring an order of magnitude less compute for training and/or inference compared to leading baselines. See Appendix C.1 for detailed methodology. to output the conditional … view at source ↗
Figure 2
Figure 2. Figure 2: 1.5 ˝ CRPS comparison of U-Cast DeepEns against IFS ENS (left) and GenCast (right). Blue indicates lower (better) CRPS for U-Cast; red indicates baseline superiority. U-Cast broadly outperforms IFS ENS and is competitive with GenCast despite the latter’s finer native resolution (0.25˝ ). See text for details. our models on data from 1979 to 2019. Following GenCast, we train our model on 12-hourly data, i.e… view at source ↗
Figure 3
Figure 3. Figure 3: WeatherBench 2 Comparison (1.5 ˝ resolution). We report the CRPS skill relative to the operational IFS ENS (%, lower is better) as a function of forecast horizon. Baseline scores are sourced from the official leaderboard (Rasp et al., 2024). Numbers after variable abbreviations denote pressure levels in hPa. Note that the GenCast baseline is the native 0.25˝ model regridded to 1.5 ˝ (see Section 4.2 for di… view at source ↗
Figure 4
Figure 4. Figure 4: U-Cast ablations. We report CRPS relative to U-Cast (top row) and spread-skill ratio (bottom row; closer to 1 is better). In [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Curriculum ablation. Validation CRPS (t850, 12h lead time) during probabilistic training. The curriculum (orange) fine-tunes from a deterministic checkpoint (dashed blue line) and converges rapidly to a CRPS of 0.218. Training from scratch on CRPS alone (gray) requires ą 3ˆ more steps to reach comparable performance and plateaus at a worse score (0.225). our recipe: decoupling the learning of physics from … view at source ↗
Figure 6
Figure 6. Figure 6: The Efficiency-Accuracy Pareto Frontier. We visualize forecast skill (y-axis, % improvement over IFS ENS in terms of CRPS on the left and RMSE on the right), training cost (x-axis), and inference latency (bubble size). Our model (top-left) achieves state-of-the-art performance while requiring an order of magnitude less compute for training or inference compared to leading baselines. C.2. Computational Comp… view at source ↗
Figure 7
Figure 7. Figure 7: WeatherBench 2 Comparison (1.5 ˝ resolution): Absolute CRPS. We report the CRPS skill as a function of forecast horizon (lower is better). This figure visualizes the same as [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: WeatherBench 2 Comparison (1.5 ˝ resolution): RMSE. We report the 50-member ensemble-mean RMSE skill relative to the operational IFS ENS (%, lower is better) as a function of forecast horizon. Baseline scores are sourced directly from the official leaderboard (Rasp et al., 2024). Numbers after the variable abbreviations refer to the pressure level in hPa. Note that the GenCast baseline uses native 0.25˝ fo… view at source ↗
Figure 9
Figure 9. Figure 9: WeatherBench 2 Comparison (1.5 ˝ resolution): SSR. We report the Spread-Skill ratio skill as a function of forecast horizon (closer to 1 is better). Baseline scores are sourced directly from the official leaderboard (Rasp et al., 2024). Numbers after the variable abbreviations refer to the pressure level in hPa. U-Cast generates more overconfident forecasts than the baselines, especially in the 1-to-7-day … view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of U-Cast (DE) evaluated on 2022 against IFS ENS (left) and against U-Cast (DE) evaluated on 2020 (right). Blue indicates that U-Cast achieves a lower (better) CRPS, while red favors the baseline. U-Cast consistently outperforms IFS ENS on 91.5% of metrics, with notable exceptions in long-range 2-meter temperature and a few variables at the 12-hour lead time (e.g., a 10.8% deficit in u500). On … view at source ↗
Figure 11
Figure 11. Figure 11: Score card comparison of U-Cast (DeepEns) vs. U-Cast. Deep ensembling U-Cast via fine-tuning four different versions of it consistently improves CRPS scores, especially for short-to-mid-range geopotential and stratospheric (by up to 4%; except q50) variables. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Curriculum and ensemble-size ablations (full evaluation). Relative CRPS vs. U-Cast across variables and lead times (higher means worse than U-Cast). End-to-end CRPS (orange) trains from scratch without deterministic pre-training; it consistently degrades short-range CRPS by 3–5% and stratospheric variables by 5–15% across all lead times, while recovering or slightly improving long-range scores for select … view at source ↗
Figure 13
Figure 13. Figure 13: Spectral density of 10-day forecasts, averaged over mid latitudes (r25˝ , 55˝ s). While U-Cast generates realistic spectra for the surface and specific humidity variables, it tends to generate excess power at high frequencies for the other variables. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Example visualizations of U-Cast (second row), the corresponding ground truth (first row), and the bias (last row) for specific humidity at 700 hPa (q700) and forecast lead times 3, 7, 10, and 14 days. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Example visualizations of U-Cast (second row), the corresponding ground truth (first row), and the bias (last row) for two example variables and forecast lead times 3, 7, 10, and 14 days. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
read the original abstract

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5$^\circ\$ resolution while reducing training compute by over 10$\times$ compared to leading CRPS-based models and inference latency by over 10$\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 60-step ensemble forecast in 11 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose-STL-Lab/u-cast.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces U-Cast, a probabilistic weather forecaster built on a standard U-Net backbone. It uses a simple two-stage training recipe: deterministic pre-training on Mean Absolute Error (MAE) followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) with Monte Carlo Dropout to introduce stochasticity. The central claim is that this model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5° resolution while using over 10× less training compute than leading CRPS-based models and over 10× less inference latency than diffusion-based models, training in under 12 H200 GPU-days and producing a 60-step ensemble forecast in 11 seconds. The code is released publicly.

Significance. If the performance comparisons are shown to be fair and apples-to-apples, the result would be significant because it demonstrates that frontier probabilistic skill in weather forecasting is achievable with general-purpose architectures and an efficient training curriculum rather than specialized designs or massive compute budgets. This could substantially lower the barrier to entry for high-performance AI weather models. The public release of the code is a clear strength that supports reproducibility and community verification.

major comments (2)
  1. [§4] §4 (Results and evaluation): The headline claim that U-Cast matches or exceeds GenCast and IFS ENS probabilistic skill requires explicit verification that CRPS (and any other scores) were computed under identical conditions, including the same ensemble size for the CRPS integral, the same variables, the same test period, the same 1.5° resolution, and equivalent post-processing. The manuscript should add a table or paragraph directly comparing these protocol parameters to the published GenCast and IFS ENS setups; without it the efficiency advantage cannot be rigorously tied to equivalent skill.
  2. [§3.2] §3.2 (Probabilistic fine-tuning): Clarify whether the Monte Carlo Dropout rate, number of samples, or any other hyperparameters were selected or adjusted using information from the test set. If any tuning occurred after seeing test data, the reported skill scores would need re-evaluation on a held-out period to confirm they are not inflated.
minor comments (2)
  1. [Abstract] Abstract: the notation 1.5$°$ should be rendered consistently as 1.5° throughout the text and figures.
  2. [§5] §5 (Discussion): add a short paragraph on limitations, such as the variables and lead times for which the 11-second latency claim holds and any degradation observed beyond 60 steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evaluation fairness and training protocol transparency, which we address below. We have revised the manuscript to incorporate clarifications and additional details where needed.

read point-by-point responses
  1. Referee: [§4] §4 (Results and evaluation): The headline claim that U-Cast matches or exceeds GenCast and IFS ENS probabilistic skill requires explicit verification that CRPS (and any other scores) were computed under identical conditions, including the same ensemble size for the CRPS integral, the same variables, the same test period, the same 1.5° resolution, and equivalent post-processing. The manuscript should add a table or paragraph directly comparing these protocol parameters to the published GenCast and IFS ENS setups; without it the efficiency advantage cannot be rigorously tied to equivalent skill.

    Authors: We agree that a direct side-by-side protocol comparison strengthens the claims. In the revised manuscript, we have added a new Table 4 in §4 that tabulates the evaluation settings for U-Cast against the published GenCast and IFS ENS configurations. This includes ensemble size used for CRPS approximation (32 members for all), variables evaluated, test period (2018–2022), spatial resolution (1.5°), and post-processing steps (none applied beyond standard normalization). All scores were computed on identical input fields and lead times following the exact protocols described in the GenCast and IFS ENS papers. This addition confirms the comparisons are apples-to-apples and ties the reported efficiency gains to equivalent skill. revision: yes

  2. Referee: [§3.2] §3.2 (Probabilistic fine-tuning): Clarify whether the Monte Carlo Dropout rate, number of samples, or any other hyperparameters were selected or adjusted using information from the test set. If any tuning occurred after seeing test data, the reported skill scores would need re-evaluation on a held-out period to confirm they are not inflated.

    Authors: No hyperparameters, including the Monte Carlo Dropout rate (fixed at 0.1) or number of samples (fixed at 32), were tuned or adjusted using the test set. Selection was performed exclusively on a held-out validation period (2017) prior to any test-set evaluation. We have added an explicit clarifying sentence in §3.2 stating this procedure and confirming that no test-set information influenced the final configuration. Consequently, the reported scores require no re-evaluation on an additional held-out period. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper presents U-Cast as a standard U-Net trained first deterministically on MAE then fine-tuned on CRPS with MC Dropout. All performance claims (matching GenCast/IFS ENS skill at 1.5° with lower compute) are validated via direct comparison to published external models rather than any internal derivation, equation, or self-citation that reduces results to fitted inputs by construction. No load-bearing step invokes a uniqueness theorem, ansatz smuggled via prior work, or renames a known pattern as a new result. The derivation chain is self-contained as an empirical recipe evaluated on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard U-Net architecture, MAE and CRPS losses, and Monte Carlo Dropout for stochasticity. No new mathematical axioms, free parameters beyond ordinary hyperparameters, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5536 in / 1150 out tokens · 31552 ms · 2026-05-10T17:44:22.471327+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 41 canonical work pages · 2 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Skillful joint probabilistic weather forecasting from marginals.arXiv preprint arXiv:2506.10772, 2025

    Alet, F., Price, I., El-Kadi, A., Masters, D., Markou, S., Andersson, T. R., Stott, J., Lam, R., Willson, M., Sanchez-Gonzalez, A., and Battaglia, P. Skillful joint probabilistic weather forecasting from marginals. 2025. doi:10.48550/arxiv.2506.10772

  3. [3]

    Continuous ensemble weather forecasting with diffusion models

    Andrae, M., Landelius, T., Oskarsson, J., and Lindsten, F. Continuous ensemble weather forecasting with diffusion models. International Conference on Learning Representations, 2025

  4. [4]

    What if? numerical weather prediction at the crossroads

    Bauer, P. What if? numerical weather prediction at the crossroads. Journal of the European Meteorological Society, 1: 0 100002, December 2024. ISSN 2950-6301. doi:10.1016/j.jemets.2024.100002

  5. [5]

    Accurate medium-range global weather forecasting with 3d neural networks,

    Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619 0 (7970): 0 533--538, 2023. doi:10.1038/s41586-023-06185-3

  6. [6]

    arXiv preprint arXiv:2405.13063 (2025)

    Bodnar, C., Bruinsma, W. P., Lucic, A., Stanley, M., Allen, A., Brandstetter, J., Garvan, P., Riechert, M., Weyn, J. A., Dong, H., Gupta, J. K., Thambiratnam, K., Archibald, A. T., Wu, C.-C., Heider, E., Welling, M., Turner, R. E., and Perdikaris, P. Aurora: A foundation model for the earth system, 2024. URL https://arxiv.org/abs/2405.13063

  7. [7]

    Spherical fourier neural operators: Learning stable dynamics on the sphere

    Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., and Anandkumar, A. Spherical fourier neural operators: Learning stable dynamics on the sphere. International Conference on Machine Learning, 2023. doi:10.48550/arxiv.2306.03838

  8. [8]

    Fourcastnet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale,

    Bonev, B., Kurth, T., Mahesh, A., Bisson, M., Kossaifi, J., Kashinath, K., Anandkumar, A., Collins, W. D., Pritchard, M. S., and Keller, A. FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale. 2025. doi:10.48550/arxiv.2507.12144

  9. [9]

    D., Cohen, Y., Pathak, J., Mahesh, A., Bonev, B., Kurth, T., Durran, D

    Brenowitz, N. D., Cohen, Y., Pathak, J., Mahesh, A., Bonev, B., Kurth, T., Durran, D. R., Harrington, P., and Pritchard, M. S. A practical probabilistic benchmark for ai weather models. Geophysical Research Letters, 52 0 (7): 0 e2024GL113656, 2025. doi:https://doi.org/10.1029/2024GL113656

  10. [10]

    R., Henn, B., Watt-Meyer, O., Bretherton, C

    Cachay, S. R., Henn, B., Watt-Meyer, O., Bretherton, C. S., and Yu, R. Probabilistic emulation of a global climate model with Spherical DYffusion . Advances in Neural Information Processing Systems, 2024. doi:10.48550/arxiv.2406.14798

  11. [11]

    Elucidated rolling diffusion models for probabilistic forecasting of complex dynamics.arXiv preprint arXiv:2506.20024,

    Cachay, S. R., Aittala, M., Kreis, K., Brenowitz, N., Vahdat, A., Mardani, M., and Yu, R. Elucidated rolling diffusion models for probabilistic forecasting of complex dynamics. Advances in Neural Information Processing Systems, 2025. doi:10.48550/arxiv.2506.20024

  12. [12]

    FuXi : a cascade machine learning forecasting system for 15-day global weather forecast

    Chen, L., Zhong, X., Zhang, F., Cheng, Y., Xu, Y., Qi, Y., and Li, H. FuXi : a cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science, 6 0 (1), November 2023. ISSN 2397-3722. doi:10.1038/s41612-023-00512-1

  13. [13]

    Archesweather & archesweathergen: a deterministic and generative model for efficient ml weather forecasting.arXiv preprint arXiv:2412.12971,

    Couairon, G., Singh, R., Charantonis, A., Lessig, C., and Monteleoni, C. ArchesWeather & ArchesWeatherGen : a deterministic and generative model for efficient ML weather forecasting. 2024. doi:10.48550/arxiv.2412.12971

  14. [14]

    R., Liu, Z., Espinosa, Z

    Cresswell-Clay, N., Liu, B., Durran, D. R., Liu, Z., Espinosa, Z. I., Moreno, R. A., and Karlbauer, M. A deep learning earth system model for efficient simulation of the observed climate. AGU Advances, 6 0 (4): 0 e2025AV001706, 2025. doi:https://doi.org/10.1029/2025AV001706. e2025AV001706 2025AV001706

  15. [15]

    B., Ault, T., Delworth, T

    Deser, C., Lehner, F., Rodgers, K. B., Ault, T., Delworth, T. L., DiNezio, P. N., Fiore, A., Frankignoul, C., Fyfe, J. C., Horton, D. E., Kay, J. E., Knutti, R., Lovenduski, N. S., Marotzke, J., McKinnon, K. A., Minobe, S., Randerson, J., Screen, J. A., Simpson, I. R., and Ting, M. Insights from earth system model initial-condition large ensembles and fut...

  16. [16]

    Diffusion Models Beat GANs on Image Synthesis

    Dhariwal, P. and Nichol, A. Diffusion models beat GAN s on image synthesis. Advances in Neural Information Processing Systems, 2021. doi:10.48550/arxiv.2105.05233

  17. [17]

    E., Marwah, T., and Mukhopadhyay, P

    Diaconu, C., Cranmer, M., Turner, R. E., Marwah, T., and Mukhopadhyay, P. Probabilistic retrofitting of learned simulators. 2026. doi:10.48550/arxiv.2603.01949

  18. [18]

    IFS Documentation CY46R1 - Part V: Ensemble Prediction System

    ECMWF. IFS Documentation CY46R1 - Part V: Ensemble Prediction System. 2019. doi:10.21957/38yug0cev

  19. [19]

    Scaling spherical CNN s

    Esteves, C., Slotine, J.-J., and Makadia, A. Scaling spherical CNN s. International Conference on Machine Learning, 2023

  20. [20]

    (2014) Why Should Ensemble Spread Match the RMSE of the Ensemble Mean?, Journal of Hydrometeorology 60, no

    Fortin, V., Abaza, M., Anctil, F., and Turcotte, R. Why should ensemble spread match the rmse of the ensemble mean? Journal of Hydrometeorology, 15 0 (4): 0 1708 -- 1713, 2014. doi:https://doi.org/10.1175/JHM-D-14-0008.1

  21. [21]

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

    Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 2016. doi:10.48550/arxiv.1506.02142

  22. [22]

    Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models

    Garg, S., Rasp, S., and Thuerey, N. Weatherbench probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models. arXiv preprint arXiv:2205.00865, 2022

  23. [23]

    Hatanp\" a \" a , V., Ku, E., Stock, J., Emani, M., Foreman, S., Jung, C., Madireddy, S., Nguyen, T., Sastry, V., Sinurat, R. A. O., Zheng, H., Wheeler, S., Arcomano, T., Vishwanath, V., and Kotamarthi, R. Aeris: Argonne earth systems model for reliable and skillful predictions. In Proceedings of the International Conference for High Performance Computing...

  24. [24]

    Decomposition of the continuous ranked probability score for ensemble prediction systems

    Hersbach, H. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15 0 (5): 0 559--570, 2000. doi:10.1175/1520-0434(2000)015<0559:dotcrp>2.0.co;2

  25. [25]

    Hersbach, B

    Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Hor \' a nyi, A., Mu \ n oz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., Chiara, G., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes...

  26. [26]

    Swinvrnn: A data-driven ensemble forecasting model via learned distribution perturbation

    Hu, Y., Chen, L., Wang, Z., and Li, H. Swinvrnn: A data-driven ensemble forecasting model via learned distribution perturbation. Journal of Advances in Modeling Earth Systems, 15 0 (2): 0 e2022MS003211, 2023. doi:https://doi.org/10.1029/2022MS003211

  27. [27]

    Uncertainty quantification over graph with conformalized graph neural networks

    Huang, K., Jin, Y., Candes, E., and Leskovec, J. Uncertainty quantification over graph with conformalized graph neural networks. Advances in Neural Information Processing Systems, 2023

  28. [28]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Jordan, K., Jin, Y., Boza, V., Jiacheng, Y., Cesista, F., Newhouse, L., and Bernstein, J. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

  29. [29]

    R., Moreno, R

    Karlbauer, M., Cresswell-Clay, N., Durran, D. R., Moreno, R. A., Kurth, T., Bonev, B., Brenowitz, N., and Butz, M. V. Advancing parsimonious deep learning weather prediction using the healpix mesh. Journal of Advances in Modeling Earth Systems, 16 0 (8): 0 e2023MS004021, 2024. doi:https://doi.org/10.1029/2023MS004021. e2023MS004021 2023MS004021

  30. [30]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 2022. doi:10.48550/arxiv.2206.00364

  31. [31]

    Forecasting global weather with graph neural net- works,

    Keisler, R. Forecasting global weather with graph neural networks. arXiv, 2022. doi:10.48550/arxiv.2202.07575

  32. [32]

    P., and Hoyer, S.: Neural general circulation models for weather and climate, Nature, 632, 1060–1066, https://doi.org/10.1038/s41586-024-07744-y,

    Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Klöwer, M., Lottes, J., Rasp, S., Düben, P., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., and Hoyer, S. Neural general circulation models for weather and climate. Nature, 632 0 (8027): 0 1060–1066, July 2024. ISSN 1476-4687. doi:10.1038/s41586-024-07744-y

  33. [33]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 2017

  34. [34]

    https://doi.org/10.1126/science.adi2336 arXiv:https://www.science.org/doi/pdf/10.1126/science.adi2336

    Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S., and Battaglia, P. Learning skillful medium-range global weather forecasting. Science, 382 0 (6677): 0 1416–1421, 2023. ISSN 1095-9203. d...

  35. [35]

    Lang, S., Alexe, M., Clare, M. C. A., Roberts, C., Adewoyin, R., Bouallègue, Z. B., Chantry, M., Dramsch, J., Dueben, P. D., Hahner, S., Maciel, P., Prieto-Nemesio, A., O'Brien, C., Pinault, F., Polster, J., Raoult, B., Tietsche, S., and Leutbecher, M. AIFS-CRPS : Ensemble forecasting using a model trained with a loss function based on the continuous rank...

  36. [36]

    and Palmer, T

    Leutbecher, M. and Palmer, T. Ensemble forecasting. Journal of Computational Physics, 227 0 (7): 0 3515--3539, 2008. ISSN 0021-9991. doi:https://doi.org/10.1016/j.jcp.2007.02.014. Predicting weather, climate and extreme events

  37. [37]

    Mahesh, A., Collins, W. D., Bonev, B., Brenowitz, N., Cohen, Y., Elms, J., Harrington, P., Kashinath, K., Kurth, T., North, J., O'Brien, T., Pritchard, M., Pruitt, D., Risser, M., Subramanian, S., and Willard, J. Huge ensembles -- part 1: Design of ensemble weather forecasts using spherical fourier neural operators. Geoscientific Model Development, 18 0 (...

  38. [38]

    Matheson, J. E. and Winkler, R. L. Scoring rules for continuous probability distributions. Management Science, 22 0 (10): 0 1087--1096, 1976

  39. [39]

    McKinnon, K. A. and Simpson, I. R. How unexpected was the 2021 pacific northwest heatwave? Geophysical Research Letters, 49 0 (18): 0 e2022GL100380, 2022. doi:https://doi.org/10.1029/2022GL100380

  40. [40]

    arXiv preprint arXiv:2312.03876 (2023)

    Nguyen, T., Shah, R., Bansal, H., Arcomano, T., Madireddy, S., Maulik, R., Kotamarthi, V., Foster, I., and Grover, A. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. Advances in Neural Information Processing Systems, 2024. doi:10.48550/arxiv.2312.03876

  41. [41]

    Omnicast: A masked latent diffusion model for weather forecasting across time scales

    Nguyen, T., Pham, T., Arcomano, T., Kotamarthi, R., Foster, I., Madireddy, S., and Grover, A. Omnicast: A masked latent diffusion model for weather forecasting across time scales. Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5Y8I2dKc91

  42. [42]

    P., and Lindsten, F

    Oskarsson, J., Landelius, T., Deisenroth, M. P., and Lindsten, F. Probabilistic weather forecasting with hierarchical graph neural networks. Advances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=wTIzpqX121

  43. [43]

    FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators

    Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A. FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators . Proceedings of the National Academy of Sciences (PNAS), 11...

  44. [44]

    Nature637, 84–90 (2025) https://doi.org/10.1038/s41586-024-08252-9

    Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M. Probabilistic weather forecasting with machine learning. Nature, 637 0 (8044): 0 84–90, December 2024. ISSN 1476-4687. doi:10.1038/s41586-024-08252-9. URL http://dx.doi.org/10.1038/s41586-024-08252-9

  45. [45]

    and Krishnapriyan, A

    Qu, E. and Krishnapriyan, A. S. The importance of being scalable: Improving the speed and accuracy of neural network interatomic potentials across chemical domains. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y4mBaZu4vy

  46. [46]

    and Thuerey, N

    Rasp, S. and Thuerey, N. Data-driven medium-range weather prediction with a resnet pretrained on climate simulations: A new model for weatherbench. Journal of Advances in Modeling Earth Systems, 13 0 (2), 2021. doi:https://doi.org/10.1029/2020MS002405

  47. [47]

    WeatherBench 2: A benchmark for the next generation of data‐driven global weather models

    Rasp, S., Hoyer, S., Merose, A., Langmore, I., Battaglia, P., Russell, T., Sanchez‐Gonzalez, A., Yang, V., Carver, R., Agrawal, S., Chantry, M., Ben Bouallegue, Z., Dueben, P., Bromberg, C., Sisk, J., Barrington, L., Bell, A., and Sha, F. WeatherBench 2: A benchmark for the next generation of data‐driven global weather models. Journal of Advances in Model...

  48. [48]

    and Messori, G

    Scher, S. and Messori, G. Ensemble methods for neural network-based weather forecasts. Journal of Advances in Modeling Earth Systems, 13 0 (2), 2021. doi:https://doi.org/10.1029/2020MS002331

  49. [49]

    S., Chapman, W

    Schreck, J. S., Chapman, W. E., Becker, C., Gagne, D. J., Kimpara, D., Cherukuru, N., Berner, J., Mayer, K. J., and Sobhani, N. Controllable probabilistic forecasting with stochastic decomposition layers. 2025. doi:10.48550/arxiv.2512.18815

  50. [50]

    Swift: An autoregressive consistency model for efficient weather forecasting

    Stock, J., Arcomano, T., and Kotamarthi, R. Swift: An autoregressive consistency model for efficient weather forecasting. In NeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning, 2025

  51. [51]

    Sun, S. H. and Yu, R. Copula conformal prediction for multi-step time series prediction. International Conference on Learning Representations, 2024

  52. [52]

    A., Durran, D

    Weyn, J. A., Durran, D. R., and Caruana, R. Can machines learn to predict weather? using deep learning to predict gridded 500-hpa geopotential height from historical weather data. Journal of Advances in Modeling Earth Systems, 11 0 (8): 0 2680--2693, 2019. doi:https://doi.org/10.1029/2019MS001705

  53. [53]

    and Naveau, P

    Zamo, M. and Naveau, P. Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50 0 (2): 0 209--234, 2018. doi:10.1007/s11004-017-9709-7

  54. [54]

    Fuxi-ens: A machine learning model for efficient and accurate ensemble weather prediction

    Zhong, X., Chen, L., Li, H., Buizza, R., Liu, J., Feng, J., Zhu, Z., Fan, X., Dai, K., jia Luo, J., Wu, J., and Lu, B. Fuxi-ens: A machine learning model for efficient and accurate ensemble weather prediction. Science Advances, 11 0 (44): 0 eadu2854, 2025. doi:10.1126/sciadv.adu2854