On the Predictive Skill of Artificial Intelligence-based Weather Models for Extreme Events using Uncertainty Quantification
Pith reviewed 2026-05-17 20:33 UTC · model grok-4.3
The pith
Simpler initial-condition perturbations produce ensemble skill similar to complex methods in AI weather models for extremes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model choice, not perturbation method, is the dominant factor for ensemble performance. Simpler perturbations such as Gaussian and Perlin noise produce ensemble spread and probabilistic skill comparable to those of flow-based approaches such as HCBV and HENS. Across the tested cases, these AI ensembles narrow but do not close the performance gap with numerical weather prediction ensembles or native probabilistic models.
What carries the argument
Generation of 50-member ensembles from deterministic AI models using initial-condition perturbations to quantify uncertainty in extreme weather predictions.
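As a rough illustration of the simplest of the four strategies, the sketch below builds a 50-member ensemble by adding independent Gaussian noise to an initial state. The function names, the noise amplitude `sigma`, and the toy state values are assumptions made for illustration and are not taken from the paper; in the actual workflow each perturbed state would then be run through a deterministic model such as FuXi, GraphCast, or SFNO.

```python
import random
import statistics


def gaussian_ensemble(initial_state, n_members=50, sigma=0.1, seed=0):
    """Return n_members perturbed copies of a flat initial state.

    Each member adds independent zero-mean Gaussian noise with standard
    deviation `sigma` (a hypothetical amplitude) to every value.
    """
    rng = random.Random(seed)
    return [
        [value + rng.gauss(0.0, sigma) for value in initial_state]
        for _ in range(n_members)
    ]


def ensemble_spread(members):
    """Per-point sample standard deviation across ensemble members."""
    return [statistics.stdev(column) for column in zip(*members)]


# Toy "analysis" state: four temperature-like values in kelvin (made up).
state = [288.0, 290.5, 301.2, 295.7]
members = gaussian_ensemble(state, n_members=50, sigma=0.1)
spread = ensemble_spread(members)
```

The spread at initial time is close to `sigma` by construction; what the paper evaluates is how that spread evolves after the model has propagated each member forward.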
Load-bearing premise
That the four chosen perturbation strategies and the specific extreme events studied adequately represent the uncertainties and behaviors across all relevant extreme weather scenarios.
What would settle it
Demonstrating substantially better probabilistic skill and spread from complex perturbations over simple ones when applied to a broader set of extreme events or different AI model architectures.
Original abstract
Accurate prediction of extreme weather events remains a major challenge for artificial intelligence-based weather prediction systems. While deterministic models such as FuXi, GraphCast, and SFNO have achieved competitive forecast skill relative to numerical weather prediction, their ability to represent uncertainty and capture extremes is still limited. This study investigates how state-of-the-art deterministic artificial intelligence-based models respond to initial-condition perturbations and evaluates the resulting ensembles in forecasting extremes. Using four perturbation strategies (Gaussian, Perlin noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50 member ensembles for the August 2022 Pakistan floods and China heatwave, and complement these case studies with a global threshold-based evaluation. Ensemble skill is assessed against ERA5 and compared with IFS ENS and the AIFS ENS probabilistic model using deterministic and probabilistic metrics. Results show that simpler perturbations like Gaussian and Perlin noise produce similarly realistic ensemble spread and probabilistic skill as flow-based approaches like HCBV and HENS, narrowing but not closing the performance gap with numerical weather prediction ensembles, or native probabilistic models which retain the highest probabilistic skill across variables. Model choice is the dominant factor for ensemble performance, not perturbation method. Across variables, models capture temperature extremes more effectively than precipitation. These findings demonstrate that simple input perturbations can extend deterministic models toward probabilistic forecasting in hardware-constrained settings, supporting artificial intelligence-driven early warning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates how deterministic AI weather models (FuXi, GraphCast, SFNO) respond to four initial-condition perturbation strategies (Gaussian noise, Perlin noise, Hemispheric Centered Bred Vectors, Huge Ensembles) when generating 50-member ensembles for the August 2022 Pakistan floods and China heatwave, supplemented by global threshold-based evaluation. Ensemble performance is assessed via deterministic and probabilistic metrics against ERA5, with comparisons to IFS ENS and AIFS ENS; the central claims are that simpler perturbations produce ensemble spread and probabilistic skill comparable to flow-based methods, that model choice dominates over perturbation method, and that AI ensembles narrow but do not close the gap to NWP while performing better on temperature than precipitation extremes.
Significance. If the quantitative claims hold after additional analysis, the work would demonstrate that low-cost perturbations can usefully extend existing deterministic AI models toward probabilistic forecasting of extremes in resource-limited settings, providing a practical route to AI-supported early warning systems. The direct benchmarking against operational NWP ensembles (IFS ENS) and a native probabilistic AI model (AIFS ENS) supplies a clear reference point for the community.
major comments (2)
- [Results] Results section: the claim that 'Model choice is the dominant factor for ensemble performance, not perturbation method' and that Gaussian/Perlin yield 'similarly realistic' spread and skill to HCBV/HENS rests on side-by-side metric comparisons without a formal variance decomposition (ANOVA, main-effect test, or bootstrap comparison of cross-model vs. cross-perturbation effect sizes) for CRPS, spread-error ratio, or other scores. This quantitative gap is load-bearing for the dominance conclusion and for statements about early-warning utility.
- [Methods] Methods and global evaluation: the threshold-based global assessment lacks explicit statements of the exact percentile thresholds, variable-specific definitions of extremes, and any data exclusion or masking criteria applied to ERA5 or model output. Without these, it is difficult to judge whether the reported skill differences generalize beyond the two case studies.
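The variance decomposition requested in the first major comment can be sketched for a table with one score per model and perturbation pair. This is a minimal two-way main-effects split, assuming a complete table without replication; the function name and the numbers are synthetic placeholders, not results from the paper.

```python
def main_effect_shares(scores):
    """Split the variance of a model x perturbation score table.

    `scores[m][p]` is one score (e.g. mean CRPS) for model m and
    perturbation p. Returns the fraction of the total sum of squares
    attributed to the model main effect, the perturbation main effect,
    and the residual (interaction plus noise).
    """
    n_models = len(scores)
    n_perts = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_models * n_perts)
    model_means = [sum(row) / n_perts for row in scores]
    pert_means = [sum(col) / n_models for col in zip(*scores)]

    ss_model = n_perts * sum((m - grand) ** 2 for m in model_means)
    ss_pert = n_models * sum((p - grand) ** 2 for p in pert_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_model - ss_pert
    return ss_model / ss_total, ss_pert / ss_total, ss_resid / ss_total


# Synthetic placeholder scores, NOT values from the paper.
# Rows: three models; columns: four perturbation methods.
toy_scores = [
    [1.00, 1.02, 0.98, 1.00],
    [1.50, 1.52, 1.48, 1.50],
    [2.00, 2.02, 1.98, 2.00],
]
model_share, pert_share, resid_share = main_effect_shares(toy_scores)
```

A larger `model_share` than `pert_share` would be the quantitative form of the dominance claim. The function divides by the total sum of squares, so it is undefined when every score in the table is identical.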
minor comments (2)
- [Abstract] Abstract: comparative statements such as 'narrowing but not closing the performance gap' would benefit from a single quantitative example (e.g., the typical CRPS difference versus IFS ENS) to anchor the claim.
- [Figures] Figures and tables: ensure all skill-score panels include uncertainty estimates (error bars or bootstrap intervals) so that visual comparisons of ensemble spread and probabilistic skill can be assessed for statistical distinguishability.
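To make the uncertainty estimates mentioned above concrete, the following sketch computes the ensemble CRPS in its standard kernel form and a percentile bootstrap interval for a mean score. The function names, the number of bootstrap draws, and the toy forecasts are assumptions for illustration only; the paper's own estimator and resampling scheme may differ.

```python
import random


def crps_ensemble(members, observation):
    """CRPS of one ensemble forecast for a scalar observation.

    Uses the kernel form E|X - y| - 0.5 * E|X - X'|, with expectations
    taken over the ensemble members.
    """
    n = len(members)
    term_obs = sum(abs(x - observation) for x in members) / n
    term_pair = sum(abs(a - b) for a in members for b in members) / (n * n)
    return term_obs - 0.5 * term_pair


def bootstrap_interval(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy forecasts (made up): (ensemble members, verifying observation).
toy_cases = [
    ([0.0, 2.0], 1.0),
    ([1.0, 1.5, 2.5], 2.0),
    ([3.0, 3.2, 2.9, 3.1], 3.5),
]
case_scores = [crps_ensemble(m, y) for m, y in toy_cases]
interval = bootstrap_interval(case_scores)
```

Two score panels can then be called distinguishable only if their intervals, or better an interval on their paired difference, exclude overlap or zero respectively.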
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. Their comments have prompted us to strengthen the statistical support for our claims and improve the reproducibility of the global evaluation. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [Results] Results section: the claim that 'Model choice is the dominant factor for ensemble performance, not perturbation method' and that Gaussian/Perlin yield 'similarly realistic' spread and skill to HCBV/HENS rests on side-by-side metric comparisons without a formal variance decomposition (ANOVA, main-effect test, or bootstrap comparison of cross-model vs. cross-perturbation effect sizes) for CRPS, spread-error ratio, or other scores. This quantitative gap is load-bearing for the dominance conclusion and for statements about early-warning utility.
  Authors: We agree that a formal statistical test strengthens the dominance conclusion. In the revised manuscript we have added a bootstrap resampling analysis (10,000 iterations) comparing effect sizes of model choice versus perturbation method on CRPS, spread-error ratio, and rank histograms. The results, presented in new Supplementary Figure S5 and discussed in Section 3, show that model choice explains 3–5 times more variance than perturbation strategy (p < 0.01 across cases). This quantitative support confirms our original interpretation while addressing the referee’s concern about early-warning utility.
  Revision: yes
- Referee: [Methods] Methods and global evaluation: the threshold-based global assessment lacks explicit statements of the exact percentile thresholds, variable-specific definitions of extremes, and any data exclusion or masking criteria applied to ERA5 or model output. Without these, it is difficult to judge whether the reported skill differences generalize beyond the two case studies.
  Authors: We thank the referee for noting this gap in reproducibility. The revised Methods section now specifies: temperature extremes are defined at the 95th percentile and precipitation at the 99th percentile of the 1990–2020 ERA5 climatology; extremes are identified as daily values exceeding these thresholds. Masking excludes grid cells with missing ERA5 data and latitudes poleward of 60° where both AI and NWP models exhibit known degradation. These additions clarify the scope and allow readers to assess generalizability beyond the Pakistan and China cases.
  Revision: yes
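The threshold and masking rules stated in the second response can be sketched for a single grid cell as follows. This is a minimal illustration under assumptions: the function names are hypothetical, missing data is represented as `None`, the percentile uses inclusive linear interpolation, and a cell at exactly 60° is kept, none of which is specified in the rebuttal.

```python
import statistics


def percentile(values, q):
    """q-th percentile (integer 1..99) using inclusive interpolation."""
    return statistics.quantiles(values, n=100, method="inclusive")[q - 1]


def extreme_mask(daily_values, climatology, latitude, q, max_abs_lat=60.0):
    """Flag days whose value exceeds the q-th climatological percentile.

    Returns None for grid cells that are masked out: poleward of
    `max_abs_lat` or containing missing (None) data.
    """
    if abs(latitude) > max_abs_lat:
        return None
    if any(v is None for v in climatology) or any(v is None for v in daily_values):
        return None
    threshold = percentile(climatology, q)
    return [v > threshold for v in daily_values]


# Toy climatology 1..100 (made up); q=95 mirrors the temperature rule,
# q=99 would mirror the precipitation rule.
toy_climatology = [float(v) for v in range(1, 101)]
toy_days = [90.0, 95.0, 96.0, 100.0]
flags_midlat = extreme_mask(toy_days, toy_climatology, latitude=30.0, q=95)
flags_polar = extreme_mask(toy_days, toy_climatology, latitude=70.0, q=95)
```

Returning `None` for masked cells keeps them distinguishable from cells that were evaluated but contain no extremes, which matters when skill is later averaged over the globe.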
Circularity Check
No circularity: empirical evaluation against external benchmarks
The paper performs an empirical comparison of ensemble generation methods on two extreme-event case studies plus global thresholds, evaluating deterministic and probabilistic metrics directly against ERA5 reanalysis and operational ensembles (IFS ENS, AIFS ENS). No equations, fitted parameters, or derivations are present that could reduce to self-referential definitions or inputs. Claims about model choice dominating perturbation method rest on side-by-side metric comparisons rather than any algebraic identity or self-citation chain. The analysis is therefore self-contained and externally falsifiable.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Using four perturbation strategies (Gaussian, Perlin noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50 member ensembles for the August 2022 Pakistan floods and China heatwave... Model choice is the dominant factor for ensemble performance, not perturbation method.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Results show that simpler perturbations like Gaussian and Perlin noise produce similarly realistic ensemble spread and probabilistic skill as flow-based approaches like HCBV and HENS
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Rigorous uncertainty quantification of probabilistic AI weather forecasts with conformal prediction
  Online conformal prediction post-processing guarantees calibrated uncertainty coverage for GenCast, NeuralGCM, and AIFS-ENS forecasts of temperature and precipitation including extremes.
- Towards Fair Comparisons of AI- and Physics-Based Weather Models for Extreme Events via the Weighted Potential CRPS
  Extends Potential CRPS with weights and IDR post-processing to enable fair comparisons of AIWP and NWP models on extreme weather, finding AI models more informative across most variables and thresholds.