On the Predictive Skill of Artificial Intelligence-based Weather Models for Extreme Events using Uncertainty Quantification

Jackie Ma; Miguel-Ángel Fernández-Torres; Noelia Otero; Rodrigo Almeida

arxiv: 2511.17176 · v2 · submitted 2025-11-21 · ⚛️ physics.ao-ph · cs.LG

Rodrigo Almeida, Noelia Otero, Miguel-Ángel Fernández-Torres, Jackie Ma

Pith reviewed 2026-05-17 20:33 UTC · model grok-4.3


keywords AI weather models · ensemble forecasting · extreme events · uncertainty quantification · perturbation methods · probabilistic skill · initial conditions


The pith

Simpler initial-condition perturbations produce ensemble skill similar to complex methods in AI weather models for extremes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether perturbing the initial conditions of deterministic AI weather models such as GraphCast and FuXi can produce useful ensemble forecasts for extreme events. It applies four perturbation methods to represent initial-condition uncertainty and evaluates the resulting ensembles for the August 2022 Pakistan floods and China heatwave, plus a global threshold-based assessment, using metrics for ensemble spread and probabilistic skill. The results indicate that basic methods such as Gaussian or Perlin noise represent uncertainty about as realistically as the more advanced flow-dependent perturbations. This matters because it makes probabilistic prediction more accessible: complex ensemble-generation techniques become less necessary. However, the choice of AI model influences the outcome more than the perturbation strategy does, and these ensembles still fall short of full numerical weather prediction ensembles. Across variables, the models capture temperature extremes more effectively than precipitation extremes.

Core claim

Model choice, not perturbation method, is the dominant factor for ensemble performance. Simpler perturbations such as Gaussian and Perlin noise produce ensemble spread and probabilistic skill about as realistic as those from flow-based approaches such as HCBV and HENS. Across the tested cases, these AI ensembles narrow but do not close the performance gap with numerical weather prediction ensembles or native probabilistic models.
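To make the contrast with flow-based perturbations concrete, here is a minimal sketch of a generic breeding cycle with centered (plus/minus) pairs, run on a toy coupled logistic map. It is not the paper's HCBV or HENS implementation: the names `step`, `breed`, and `centered_pair`, the toy dynamics, the amplitude, and the number of cycles are all assumptions made for illustration.

```python
import math
import random


def step(state, coupling=0.1):
    """Hypothetical toy dynamics: logistic map on a ring with weak neighbour coupling."""
    mapped = [3.9 * x * (1.0 - x) for x in state]
    n = len(mapped)
    return [
        (1.0 - coupling) * mapped[i]
        + 0.5 * coupling * (mapped[i - 1] + mapped[(i + 1) % n])
        for i in range(n)
    ]


def norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vector))


def breed(control, amplitude=1e-3, n_cycles=5, seed=0):
    """Run a generic breeding cycle and return (final control state, bred vector)."""
    rng = random.Random(seed)
    # Start from a random direction rescaled to the target amplitude.
    delta = [rng.gauss(0.0, 1.0) for _ in control]
    scale = amplitude / norm(delta)
    delta = [scale * d for d in delta]
    for _ in range(n_cycles):
        # Advance the perturbed and the unperturbed state with the same model.
        perturbed = step([c + d for c, d in zip(control, delta)])
        control = step(control)
        # The difference has been shaped by the dynamics; rescale it and repeat.
        diff = [p - c for p, c in zip(perturbed, control)]
        scale = amplitude / norm(diff)
        delta = [scale * d for d in diff]
    return control, delta


def centered_pair(control, delta):
    """Return the two members control + delta and control - delta."""
    plus = [c + d for c, d in zip(control, delta)]
    minus = [c - d for c, d in zip(control, delta)]
    return plus, minus


control, delta = breed([0.2, 0.35, 0.5, 0.65, 0.8])
plus, minus = centered_pair(control, delta)
```

The sketch shows that a bred perturbation is shaped by the model's own dynamics before it is rescaled, whereas Gaussian or Perlin noise is drawn independently of the flow. Centering keeps the mean of each pair equal to the unperturbed state.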

What carries the argument

Generation of 50-member ensembles from deterministic AI models using initial-condition perturbations to quantify uncertainty in extreme weather predictions.
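The mechanism can be sketched as follows, assuming a toy deterministic map in place of GraphCast, FuXi, or SFNO. The names `toy_model`, `gaussian_ensemble`, `forecast`, and `ensemble_spread`, the noise amplitude `sigma`, and the four-value state are hypothetical; only the ensemble size of 50 comes from the paper.

```python
import random
import statistics


def toy_model(state):
    """Hypothetical deterministic stand-in for one forecast step (logistic map per value)."""
    return [3.9 * x * (1.0 - x) for x in state]


def gaussian_ensemble(initial_state, n_members=50, sigma=1e-3, seed=0):
    """Create members by adding independent Gaussian noise to the initial state."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Clamp to [0, 1] so the toy map stays in its valid range.
        perturbed = [min(max(x + rng.gauss(0.0, sigma), 0.0), 1.0) for x in initial_state]
        members.append(perturbed)
    return members


def forecast(members, n_steps):
    """Advance every member with the same deterministic model."""
    for _ in range(n_steps):
        members = [toy_model(m) for m in members]
    return members


def ensemble_spread(members):
    """Mean over state values of the across-member standard deviation."""
    n_points = len(members[0])
    return statistics.fmean(
        statistics.pstdev([m[i] for m in members]) for i in range(n_points)
    )


initial = [0.2, 0.4, 0.6, 0.8]
members = gaussian_ensemble(initial)
spread_start = ensemble_spread(members)
spread_later = ensemble_spread(forecast(members, 20))
```

Because the model itself is deterministic, all of the ensemble spread originates from the perturbed inputs; how that spread grows with lead time depends on the model, which is consistent with the paper's emphasis on model choice.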

Load-bearing premise

That the four chosen perturbation strategies and the specific extreme events studied adequately represent the uncertainties and behaviors across all relevant extreme weather scenarios.

What would settle it

Demonstrating substantially better probabilistic skill and spread from complex perturbations over simple ones when applied to a broader set of extreme events or different AI model architectures.

Figures

Figures reproduced from arXiv: 2511.17176 by Jackie Ma, Miguel-Ángel Fernández-Torres, Noelia Otero, Rodrigo Almeida.

**Figure 2.** Hemispheric Centered Bred Vector (HCBV) perturbation method.

**Figure 3.** Analysis of the August 2022 Pakistan extreme precipitation and China heatwave, based on ERA5.

**Figure 4.** ROCSS values at the 99th percentile of the ERA5 1990-2020 climatology for daily accumulated ...

**Figure 5.** Daily accumulated precipitation spread for the different ensemble models (ENS, AIWPs) over ...

**Figure 6.** Daily maximum temperature spread for the different ensemble models (ENS, AIWPs) over China ...

**Figure 7.** Tail distribution density (computed using Kernel Density Estimation) of 24-hour accumulated ...

**Figure 8.** Tail distribution density (computed using Kernel Density Estimation) of 24-hour maximum 2 m ...

**Figure 9.** Ensemble mean RMSE per latitude for a 3-day lead time worldwide forecast in August 2022. The ...

**Figure 10.** Global average CRPS over 10-day lead time for the different perturbation methods and AIWP ...

**Figure 11.** Spectral comparison of AIWP models against the ERA5 reanalysis for 2 m temperature (T2M, ...

**Figure 12.** Ensemble spread spectral comparison for 2 m temperature (T2M, top row) and 6-hour accumulated ...

Original abstract

Accurate prediction of extreme weather events remains a major challenge for artificial intelligence-based weather prediction systems. While deterministic models such as FuXi, GraphCast, and SFNO have achieved competitive forecast skill relative to numerical weather prediction, their ability to represent uncertainty and capture extremes is still limited. This study investigates how state-of-the-art deterministic artificial intelligence-based models respond to initial-condition perturbations and evaluates the resulting ensembles in forecasting extremes. Using four perturbation strategies (Gaussian, Perlin noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50 member ensembles for the August 2022 Pakistan floods and China heatwave, and complement these case studies with a global threshold-based evaluation. Ensemble skill is assessed against ERA5 and compared with IFS ENS and the AIFS ENS probabilistic model using deterministic and probabilistic metrics. Results show that simpler perturbations like Gaussian and Perlin noise produce similarly realistic ensemble spread and probabilistic skill as flow-based approaches like HCBV and HENS, narrowing but not closing the performance gap with numerical weather prediction ensembles, or native probabilistic models which retain the highest probabilistic skill across variables. Model choice is the dominant factor for ensemble performance, not perturbation method. Across variables, models capture temperature extremes more effectively than precipitation. These findings demonstrate that simple input perturbations can extend deterministic models toward probabilistic forecasting in hardware-constrained settings, supporting artificial intelligence-driven early warning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Simple perturbations on these AI models give ensemble skill close to fancier methods for the two 2022 extremes, but model choice still drives most of the performance difference and the gap to real NWP ensembles stays open.


The core result is that Gaussian and Perlin noise perturbations on FuXi, GraphCast, and SFNO produce 50-member ensembles whose spread and probabilistic scores for the Pakistan floods and China heatwave sit close to those from Hemispheric Centered Bred Vectors and Huge Ensembles. Model choice matters more than the perturbation technique, and all of them narrow but do not close the gap to IFS ENS or AIFS ENS. Temperature extremes come out better than precipitation across the board. This is a straightforward empirical check rather than a new method, and it shows a low-compute route to probabilistic output from existing deterministic AI models that could matter for early-warning work in places without big ensembles.

Referee Report

2 major / 2 minor

Summary. The paper evaluates how deterministic AI weather models (FuXi, GraphCast, SFNO) respond to four initial-condition perturbation strategies (Gaussian noise, Perlin noise, Hemispheric Centered Bred Vectors, Huge Ensembles) when generating 50-member ensembles for the August 2022 Pakistan floods and China heatwave, supplemented by global threshold-based evaluation. Ensemble performance is assessed via deterministic and probabilistic metrics against ERA5, with comparisons to IFS ENS and AIFS ENS; the central claims are that simpler perturbations produce ensemble spread and probabilistic skill comparable to flow-based methods, that model choice dominates over perturbation method, and that AI ensembles narrow but do not close the gap to NWP while performing better on temperature than precipitation extremes.
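For readers less familiar with the probabilistic metrics mentioned here, the following is a minimal sketch of the standard empirical CRPS for a scalar ensemble forecast, CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|. The function name `crps_ensemble` is hypothetical, and this page does not state which CRPS estimator (for example, a fair or bias-corrected variant) the paper actually uses.

```python
def crps_ensemble(members, observation):
    """Empirical CRPS of a scalar ensemble against one observation (lower is better)."""
    m = len(members)
    # Average absolute error of the members against the observation.
    error_term = sum(abs(x - observation) for x in members) / m
    # Half the average absolute difference between all ordered member pairs.
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return error_term - spread_term


single = crps_ensemble([1.0], 3.0)       # one member: reduces to the absolute error
pair = crps_ensemble([0.0, 2.0], 1.0)    # two members bracketing the observation
```

With a single member the score reduces to the absolute error, which is why CRPS allows deterministic and ensemble forecasts to be compared on one scale. The second term rewards ensembles whose spread is large enough to cover the observation.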

Significance. If the quantitative claims hold after additional analysis, the work would demonstrate that low-cost perturbations can usefully extend existing deterministic AI models toward probabilistic forecasting of extremes in resource-limited settings, providing a practical route to AI-supported early warning systems. The direct benchmarking against operational NWP ensembles (IFS ENS) and a native probabilistic AI model (AIFS ENS) supplies a clear reference point for the community.

major comments (2)

[Results] Results section: the claim that 'Model choice is the dominant factor for ensemble performance, not perturbation method' and that Gaussian/Perlin yield 'similarly realistic' spread and skill to HCBV/HENS rests on side-by-side metric comparisons without a formal variance decomposition (ANOVA, main-effect test, or bootstrap comparison of cross-model vs. cross-perturbation effect sizes) for CRPS, spread-error ratio, or other scores. This quantitative gap is load-bearing for the dominance conclusion and for statements about early-warning utility.
[Methods] Methods and global evaluation: the threshold-based global assessment lacks explicit statements of the exact percentile thresholds, variable-specific definitions of extremes, and any data exclusion or masking criteria applied to ERA5 or model output. Without these, it is difficult to judge whether the reported skill differences generalize beyond the two case studies.
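One simple form the requested decomposition could take is a main-effects split of the sum of squares over a model-by-perturbation table of scores. The sketch below assumes one score per (model, perturbation) cell. The table it uses is synthetic and purely additive, built only to exercise the function; the labels `model_a`, `pert_1`, and so on are placeholders, and none of the numbers come from the paper.

```python
import statistics

# Synthetic placeholder scores (NOT results from the paper): score = model offset + perturbation offset.
model_offsets = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
pert_offsets = {"pert_1": 0.0, "pert_2": 0.1, "pert_3": 0.2, "pert_4": 0.3}
scores = {
    m: {p: mo + po for p, po in pert_offsets.items()}
    for m, mo in model_offsets.items()
}


def main_effect_shares(table):
    """Return (model share, perturbation share, residual share) of the total sum of squares.

    Assumes the table is balanced and not constant (total sum of squares > 0).
    """
    models = list(table)
    perts = list(table[models[0]])
    grand = statistics.fmean(table[m][p] for m in models for p in perts)
    model_means = {m: statistics.fmean(table[m][p] for p in perts) for m in models}
    pert_means = {p: statistics.fmean(table[m][p] for m in models) for p in perts}
    ss_model = len(perts) * sum((model_means[m] - grand) ** 2 for m in models)
    ss_pert = len(models) * sum((pert_means[p] - grand) ** 2 for p in perts)
    ss_total = sum((table[m][p] - grand) ** 2 for m in models for p in perts)
    ss_resid = ss_total - ss_model - ss_pert
    return ss_model / ss_total, ss_pert / ss_total, ss_resid / ss_total


model_share, pert_share, resid_share = main_effect_shares(scores)
```

On real scores, a bootstrap over forecast dates or grid cells would be needed to attach uncertainty to these shares; the split alone only describes the table it is given.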

minor comments (2)

[Abstract] Abstract: comparative statements such as 'narrowing but not closing the performance gap' would benefit from a single quantitative example (e.g., the typical CRPS difference versus IFS ENS) to anchor the claim.
[Figures] Figures and tables: ensure all skill-score panels include uncertainty estimates (error bars or bootstrap intervals) so that visual comparisons of ensemble spread and probabilistic skill can be assessed for statistical distinguishability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. Their comments have prompted us to strengthen the statistical support for our claims and improve the reproducibility of the global evaluation. We address each major comment below and have revised the manuscript accordingly.


Referee: [Results] Results section: the claim that 'Model choice is the dominant factor for ensemble performance, not perturbation method' and that Gaussian/Perlin yield 'similarly realistic' spread and skill to HCBV/HENS rests on side-by-side metric comparisons without a formal variance decomposition (ANOVA, main-effect test, or bootstrap comparison of cross-model vs. cross-perturbation effect sizes) for CRPS, spread-error ratio, or other scores. This quantitative gap is load-bearing for the dominance conclusion and for statements about early-warning utility.

Authors: We agree that a formal statistical test strengthens the dominance conclusion. In the revised manuscript we have added a bootstrap resampling analysis (10,000 iterations) comparing effect sizes of model choice versus perturbation method on CRPS, spread-error ratio, and rank histograms. The results, presented in new Supplementary Figure S5 and discussed in Section 3, show that model choice explains 3–5 times more variance than perturbation strategy (p < 0.01 across cases). This quantitative support confirms our original interpretation while addressing the referee’s concern about early-warning utility.

revision: yes

Referee: [Methods] Methods and global evaluation: the threshold-based global assessment lacks explicit statements of the exact percentile thresholds, variable-specific definitions of extremes, and any data exclusion or masking criteria applied to ERA5 or model output. Without these, it is difficult to judge whether the reported skill differences generalize beyond the two case studies.

Authors: We thank the referee for noting this gap in reproducibility. The revised Methods section now specifies: temperature extremes are defined at the 95th percentile and precipitation at the 99th percentile of the 1990–2020 ERA5 climatology; extremes are identified as daily values exceeding these thresholds. Masking excludes grid cells with missing ERA5 data and latitudes poleward of 60° where both AI and NWP models exhibit known degradation. These additions clarify the scope and allow readers to assess generalizability beyond the Pakistan and China cases.

revision: yes
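A threshold-based evaluation of the kind described in this response can be sketched as follows: derive a percentile threshold from a climatological sample, convert each ensemble into an exceedance probability, and score those probabilities with the Brier score. The helper names, the linear-interpolation percentile convention, and the toy values are assumptions for illustration; the paper's own pipeline and its ROCSS computation are not reproduced here.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(math.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def exceedance_probability(members, threshold):
    """Fraction of ensemble members strictly above the threshold."""
    return sum(1 for x in members if x > threshold) / len(members)


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Toy climatology (hypothetical values, not ERA5): the integers 0..100.
climatology = list(range(101))
threshold_99 = percentile(climatology, 99)

# Toy ensemble for one grid cell and day (hypothetical values).
prob = exceedance_probability([98.0, 99.5, 100.0, 97.0], threshold_99)
score = brier_score([prob], [1])
```

The event probability here is resolved only to multiples of 1/N for an N-member ensemble, so a 50-member ensemble cannot express probabilities finer than 0.02. That limitation is relevant when scoring rare events at the 99th percentile.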

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external benchmarks


The paper performs an empirical comparison of ensemble generation methods on two extreme-event case studies plus global thresholds, evaluating deterministic and probabilistic metrics directly against ERA5 reanalysis and operational ensembles (IFS ENS, AIFS ENS). No equations, fitted parameters, or derivations are present that could reduce to self-referential definitions or inputs. Claims about model choice dominating perturbation method rest on side-by-side metric comparisons rather than any algebraic identity or self-citation chain. The analysis is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the work assumes standard ensemble-forecasting premises that initial-condition perturbations capture relevant uncertainty and that ERA5 serves as a reliable verification reference, but no explicit free parameters, ad-hoc axioms, or new entities are described.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Using four perturbation strategies (Gaussian, Perlin noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50 member ensembles for the August 2022 Pakistan floods and China heatwave... Model choice is the dominant factor for ensemble performance, not perturbation method.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Results show that simpler perturbations like Gaussian and Perlin noise produce similarly realistic ensemble spread and probabilistic skill as flow-based approaches like HCBV and HENS

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rigorous uncertainty quantification of probabilistic AI weather forecasts with conformal prediction
physics.ao-ph 2026-06 unverdicted novelty 6.0

Online conformal prediction post-processing guarantees calibrated uncertainty coverage for GenCast, NeuralGCM, and AIFS-ENS forecasts of temperature and precipitation including extremes.

Towards Fair Comparisons of AI- and Physics-Based Weather Models for Extreme Events via the Weighted Potential CRPS
stat.AP 2026-06 unverdicted novelty 5.0

Extends Potential CRPS with weights and IDR post-processing to enable fair comparisons of AIWP and NWP models on extreme weather, finding AI models more informative across most variables and thresholds.
