Towards Fair Comparisons of AI- and Physics-Based Weather Models for Extreme Events via the Weighted Potential CRPS

Annika Alber; Sam Allen; Sebastian Lerch; Tobias Biegert

arxiv: 2606.21170 · v1 · pith:WFXRHQVYnew · submitted 2026-06-19 · 📊 stat.AP

Pith reviewed 2026-06-26 12:58 UTC · model grok-4.3

classification 📊 stat.AP

keywords: weather forecasting, extreme events, CRPS, post-processing, AI weather prediction, numerical weather prediction, forecast verification, isotonic distributional regression

The pith

After IDR post-processing, deterministic AI weather prediction models score better than the ECMWF high-resolution NWP model on the weighted potential CRPS for extreme events in most settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether deterministic AI weather prediction models produce more informative forecasts for extreme weather events than physics-based numerical weather prediction models. Deterministic outputs from both types of models are converted to probabilistic forecasts using isotonic distributional regression before evaluation with weighted versions of the potential CRPS that emphasize extremes. This method leverages optimality properties of IDR for the weighted scores to support fair comparisons across model classes. Applied to WeatherBench 2 forecasts for mean sea level pressure, temperature, wind speed, and precipitation, the results show AIWP models, especially FuXi, leading in most settings, with similar patterns for record-breaking events.
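The conversion step described above can be illustrated with a minimal sketch of isotonic distributional regression for a single real-valued covariate (the deterministic forecast). For each threshold z, the conditional CDF value P(Y ≤ z | x) is fitted as a non-increasing function of x with the pool-adjacent-violators algorithm. The function names `pava_decreasing` and `fit_idr` and the toy data are hypothetical; this is not the authors' implementation, and it omits prediction at new forecast values.

```python
def pava_decreasing(values, weights):
    """Weighted least-squares fit that is non-increasing along the sequence."""
    blocks = []  # each block: [mean, weight, number of pooled entries]
    for v, w in zip(values, weights):
        blocks.append([float(v), w, 1])
        # Pool neighbours while an earlier block has a smaller mean than a later one.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def fit_idr(x, y):
    """Return (xs, thresholds, cdf) with cdf[j][k] = fitted P(Y <= thresholds[k] | xs[j])."""
    xs = sorted(set(x))
    thresholds = sorted(set(y))
    groups = {u: [] for u in xs}
    for xi, yi in zip(x, y):
        groups[xi].append(yi)  # tied forecasts share one fitted distribution
    weights = [len(groups[u]) for u in xs]
    cdf = [[0.0] * len(thresholds) for _ in xs]
    for k, z in enumerate(thresholds):
        # Empirical frequency of {Y <= z} for each distinct forecast value.
        means = [sum(yi <= z for yi in groups[u]) / len(groups[u]) for u in xs]
        # Larger forecasts should imply stochastically larger outcomes,
        # so the CDF value must not increase with x.
        for j, value in enumerate(pava_decreasing(means, weights)):
            cdf[j][k] = value
    return xs, thresholds, cdf


# Made-up toy data: forecasts 2 and 3 are out of order relative to the outcomes.
xs, thresholds, cdf = fit_idr([1, 2, 3, 4], [1, 3, 2, 4])
# cdf[1] -> [0.0, 0.5, 1.0, 1.0]: forecasts 2 and 3 are pooled at threshold 2
```

The sketch makes the design point visible: no tuning parameters are chosen, and the only assumption is that larger forecasts should go with larger outcomes.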

Core claim

When deterministic forecasts are post-processed via isotonic distributional regression and assessed using weighted potential CRPS focused on thresholds from historical data, the AIWP models GraphCast, Pangu-Weather, and particularly FuXi produce more informative probabilistic forecasts for extreme events than the ECMWF high-resolution NWP model across most variables and settings.

What carries the argument

The weighted Potential CRPS obtained after isotonic distributional regression post-processing of deterministic outputs, which evaluates forecast quality with emphasis on extreme exceedances or non-exceedances.
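As a rough illustration of the scoring step, the threshold-weighted CRPS of a discrete predictive distribution (such as the step-function CDF that IDR produces) can be computed through the chaining representation: with w(z) = 1{z>t} the score equals the ordinary CRPS after mapping every value through v(z) = max(z, t), and with w(z) = 1{z<t} through v(z) = min(z, t). The function name `tw_crps` and the example numbers are hypothetical and are not taken from the paper.

```python
def tw_crps(points, probs, y, t, upper=True):
    """Threshold-weighted CRPS of a discrete distribution against observation y.

    points, probs: support points and their probabilities (probs sum to 1).
    upper=True  uses the weight w(z) = 1{z > t} (threshold exceedances).
    upper=False uses the weight w(z) = 1{z < t} (values below the threshold).
    """
    chain = (lambda z: max(z, t)) if upper else (lambda z: min(z, t))
    vx = [chain(p) for p in points]
    vy = chain(y)
    # E|v(X) - v(y)|
    term_obs = sum(p * abs(a - vy) for a, p in zip(vx, probs))
    # 0.5 * E|v(X) - v(X')| for independent X, X' from the same distribution
    term_spread = 0.5 * sum(
        pi * pj * abs(a - b)
        for a, pi in zip(vx, probs)
        for b, pj in zip(vx, probs)
    )
    return term_obs - term_spread


# Made-up example: two equally likely values, observation in between.
upper_tail = tw_crps([0.0, 2.0], [0.5, 0.5], y=1.0, t=1.0)               # 0.25
lower_tail = tw_crps([0.0, 2.0], [0.5, 0.5], y=1.0, t=1.0, upper=False)  # 0.25
```

In this example the upper- and lower-tail scores add up to the unweighted CRPS of 0.5, as expected when the two indicator weights partition the real line.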

If this is right

AIWP models have the potential to outperform NWP models when forecasting extremes.
The relative ordering of the models is largely insensitive to the choice of extreme thresholds.
The evaluation framework supports fair comparisons between data-driven and physics-based models for extreme weather events.
Forecast performance is compared across mean sea level pressure, temperature, wind speed, and precipitation extremes defined from historical observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be extended to test performance on additional variables or geographic regions not covered in WeatherBench 2.
If the advantage persists at longer lead times, it would support greater reliance on AIWP outputs for early warning systems focused on extremes.
Hybrid models combining AI and physics components might be evaluated under the same weighted scoring to identify complementary strengths.

Load-bearing premise

That isotonic distributional regression post-processing combined with weighted potential CRPS scoring produces sufficiently unbiased comparisons between AI and physics-based models to eliminate residual effects from differences in architecture or training data.

What would settle it

A reversal in performance ranking where NWP models score higher than AIWP models on the same weighted potential CRPS when using an alternative post-processing method or direct probabilistic outputs.

Figures

Figures reproduced from arXiv: 2606.21170 by Annika Alber, Sam Allen, Sebastian Lerch, Tobias Biegert.

**Figure 1.** twPCRPS-S_t for the four forecasting models as a function of the quantile of the historical data that is used to define the threshold in the twCRPS_t. Solid lines correspond to the twCRPS_t with weight function w(z) = 1{z>t}, where interest is on threshold exceedances, while dashed lines correspond to the twCRPS_t with weight function w(z) = 1{z<t}, where interest is on values not exceeding the threshold. Lowe…

**Figure 2.** twPCRPS_t for the four forecasting models when interest is on exceedances of record high values and non-exceedances of record low values. Records are calculated for each month separately. Results have been aggregated across all grid points and are displayed as a function of lead time. ERA5 reanalyses are used as observation data.

**Figure 3.** twPCRPS-S_t of FuXi. Results are shown at each grid point using ERA5 reanalyses as observation data. The twCRPS focuses on exceedances of the historical 99th percentile, computed separately for each grid point. Columns correspond to weather variables and rows correspond to lead times. Skill is measured relative to the twPCRPS_{0,t} baseline. Brighter colours indicate higher potential skill relative to this bas…

**Figure 4.** Best-performing model at each grid point according to twPCRPS_t when predicting exceedances of monthly record thresholds. At each grid point, the colour indicates the model with the lowest twPCRPS_t. Columns correspond to weather variables and rows correspond to lead times. The percentages in the legend correspond to the total proportion of cases where each model performs best, aggregated across all grid poi…

**Figure 5.** Scatterplots comparing the twCRPS_t of the GenCast ensemble with the twPCRPS_t of the deterministic GraphCast forecast. Columns correspond to weather variables and rows to thresholds used in the evaluation: the historical 99th percentile (top), the monthly record threshold (middle), and the overall record threshold (bottom). Points correspond to grid point and lead time combinations, with colours indicating …

**Figure 6.** Illustration of the CRPS (left), qwCRPS (centre), and twCRPS (right) of an EasyUQ predictive distribution F̂ and observation y. The qwCRPS uses the weight function w(α) = 1{α>0.75}, with the dotted horizontal lines indicating the corresponding transformed probability levels. The twCRPS uses the weight function w(z) = 1{z>t}. For an indicator weight function of the form w(α) = 1{α>τ} for upper-tail emphasis…

**Figure 7.** qwPCRPS-S_τ for the four forecasting models as a function of the quantile level. Solid lines show upper-tail scores, dashed lines show lower-tail scores. Lower-tail scores are shown only for mean sea level pressure and 2 m temperature. The columns correspond to different weather variables, and the rows correspond to different lead times. Results are aggregated across all grid points, using ERA5 reanalyses a…

**Figure 8.** qwPCRPS_τ with weight function w(α) = 1{α>0.99} (upper 1%) and w(α) = 1{α<0.01} (lower 1%). Results have been aggregated across all grid points and are displayed as a function of lead time. ERA5 reanalyses are used as observation data. B Easy Uncertainty Quantification (EasyUQ) The PCRPS and twPCRPS correspond respectively to the CRPS and twCRPS applied to predictive distributions obtained using Easy Uncert…

**Figure 9.** qwPCRPS-S_τ of FuXi. Results are shown at each grid point using ERA5 reanalyses as observation data. The qwCRPS employs the weight function w(α) = 1{α>0.99}, focusing on the upper tail at quantile level τ = 0.99. Columns correspond to weather variables and rows correspond to lead times. Skill is measured relative to the qwPCRPS_{0,τ} baseline. Brighter colours indicate higher potential skill relative to this b…

**Figure 10.** qwPCRPS_τ skill of FuXi relative to the seasonally varying ERA5 climatology forecast from WeatherBench 2. Results are shown at each grid point using ERA5 reanalyses as observation data. The qwCRPS employs the weight function w(α) = 1{α>0.99}, focusing on the upper tail at quantile level τ = 0.99. Columns correspond to weather variables and rows correspond to lead times. Positive values indicate improvement…

**Figure 11.** Best-performing model according to qwPCRPS_τ with weight function w(α) = 1{α>0.99}, focusing on the upper tail at quantile level τ = 0.99. Results are shown at each grid point using ERA5 reanalyses as observation data. The colour at each grid point indicates the model with the lowest qwPCRPS_τ. Columns correspond to weather variables and rows correspond to lead times. PW-ERA5 is not available for TP24hr an…

**Figure 12.** Significance of the best-performing model according to the qwPCRPS_τ with weight function w(α) = 1{α>0.99}, focusing on the upper tail at quantile level τ = 0.99. Results are shown at each grid point using ERA5 reanalyses as observation data. The colour at each grid point indicates whether a model receives a qwPCRPS_τ that is significantly lower than that of all other available models at a 5% significance l…

**Figure 13.** Historical record thresholds at each grid point in the ERA5 reanalysis data from 1979 to 2019. Both maximum and minimum values are shown for mean sea level pressure and 2 m temperature, while only maximum values are shown for 10 m wind speed and 24-hour precipitation accumulation. The colour scale differs between panels, with yellow values always denoting more extreme values.

**Figure 14.** The total number of exceedances of monthly record high values and non-exceedances of monthly record low values at each grid point during the evaluation period.

**Figure 15.** twPCRPS_t skill of FuXi relative to the seasonally varying ERA5 climatology forecast from WeatherBench 2. Results are shown at each grid point using ERA5 reanalyses as observation data. The threshold-weighted score focuses on exceedances of the historical 99th percentile, computed separately for each grid point. Columns correspond to weather variables and rows correspond to lead times. Positive values indi…

**Figure 16.** Significantly best-performing model at each grid point according to the twPCRPS_t when predicting exceedances of the monthly record thresholds. At each grid point, the colour indicates whether a model receives a twPCRPS_t that is significantly lower than that of all other available models at the 5% significance level. Columns correspond to weather variables and rows correspond to lead times. The percentages…

**Figure 17.** twPCRPS_t for the four forecasting models when interest is on exceedances of overall record high values and non-exceedances of overall record low values. Results have been aggregated across all grid points and are displayed as a function of lead time. ERA5 reanalyses are used as observation data. Axes: lead time [d] (x), latitude-weighted twPCRPS [Pa] (y).

**Figure 18.** twPCRPS_t for the four forecasting models when interest is on exceedances of the historical 99th percentile and non-exceedances of the historical 1st percentile. Results have been aggregated across all grid points and are displayed as a function of lead time. ERA5 reanalyses are used as observation data.

**Figure 19.** twPCRPS_t for HRES and the IFS-initialised AIWP variants GC-IFS and PW-IFS, when interest is on exceedances of monthly record high values and non-exceedances of monthly record low values. The records are calculated using ERA5 reanalysis data, since this is available for a longer historical period than the IFS analyses. Scores are averaged across all grid points using a latitude weighting, and are shown as …

**Figure 20.** Scatterplots comparing the twCRPS_t of the IFS ensemble with the twPCRPS_t of the deterministic HRES forecast. Columns correspond to weather variables and rows to thresholds used in the evaluation: the historical 99th percentile (top), the monthly record threshold (middle), and the overall record threshold (bottom). Points correspond to grid point and lead time combinations, with colours indicating lead tim…
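Several captions above refer to scores aggregated over grid points with a latitude weighting and to skill measured relative to a baseline. A minimal sketch of both operations follows. It assumes weights proportional to the cosine of latitude on a regular grid and the usual skill-score form 1 − S/S_ref; neither detail is confirmed by the captions, and the function names are hypothetical.

```python
import math


def latitude_weights(lats_deg):
    """cos(latitude) weights normalised to mean 1 (assumed weighting scheme)."""
    raw = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def latitude_weighted_mean(scores, lats_deg):
    """scores[i][j] is the score at latitude i and longitude j of a regular grid."""
    weights = latitude_weights(lats_deg)
    total = sum(w * sum(row) for w, row in zip(weights, scores))
    count = sum(len(row) for row in scores)
    return total / count


def skill_score(score, reference_score):
    """Positive when the forecast scores lower (better) than the reference."""
    return 1.0 - score / reference_score


# Made-up two-latitude grid: the row at 60 degrees counts half as much as the equator.
grid_mean = latitude_weighted_mean([[1.0, 1.0], [3.0, 3.0]], [0.0, 60.0])
```

The weighting keeps the densely sampled high-latitude grid points from dominating a global average.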

Original abstract

We study whether deterministic AI weather prediction (AIWP) models issue more informative forecasts for extreme weather events than deterministic numerical weather prediction (NWP) models. The deterministic model output is subjected to statistical post-processing via isotonic distributional regression (IDR), or EasyUQ, before the resulting probabilistic forecasts are assessed using weighted versions of the continuous ranked probability score (CRPS). This extends the Potential CRPS (PCRPS) measure proposed by Gneiting et al. (2026) to focus on extreme outcomes. Since IDR exhibits optimality properties with respect to weighted versions of the CRPS, the proposed approach inherits desirable properties of the PCRPS, and, in particular, facilitates fair comparisons between data-driven and physics-based models when forecasting extreme weather events. We apply this evaluation framework to forecasts in the WeatherBench 2 dataset issued by the AIWP models GraphCast, Pangu-Weather, and FuXi, with the ECMWF's high-resolution NWP model serving as a physics-based reference. The forecast models are compared when predicting mean sea level pressure, temperature, wind speed, and precipitation extremes, defined as exceedances or non-exceedances of thresholds obtained from historical observation data. We additionally study forecast performance when predicting record-breaking events, though the ordering of the different methods is largely insensitive to the thresholds on which emphasis is placed. We find that AIWP models, particularly FuXi, result in the most informative forecasts for extreme weather events across most settings, suggesting that AIWP models have the potential to outperform NWP models when forecasting extremes.
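For reference, the scores named in the abstract can be written out as follows. These are the standard definitions of the CRPS and its threshold- and quantile-weighted versions. The symbols F, y, w, v and t are chosen here for illustration and may differ from the paper's notation, and the last line, which averages the score over the IDR-fitted distributions, is an assumed reading of how the potential score is formed.

```latex
\begin{align*}
\mathrm{CRPS}(F, y) &= \int_{-\infty}^{\infty} \bigl(F(z) - \mathbf{1}\{y \le z\}\bigr)^2 \, dz, \\
\mathrm{twCRPS}(F, y) &= \int_{-\infty}^{\infty} \bigl(F(z) - \mathbf{1}\{y \le z\}\bigr)^2 \, w(z) \, dz, \\
\mathrm{qwCRPS}(F, y) &= 2 \int_0^1 \bigl(\mathbf{1}\{y \le F^{-1}(\alpha)\} - \alpha\bigr)
  \bigl(F^{-1}(\alpha) - y\bigr) \, w(\alpha) \, d\alpha .
\end{align*}
% With w(z) = 1{z > t}, v(z) = max(z, t), and X, X' independent draws from F:
\[
\mathrm{twCRPS}_t(F, y) = \mathbb{E}\lvert v(X) - v(y)\rvert
  - \tfrac{1}{2}\,\mathbb{E}\lvert v(X) - v(X')\rvert .
\]
% Assumed form of the potential score over n forecast-observation pairs (x_i, y_i):
\[
\mathrm{twPCRPS}_t = \frac{1}{n} \sum_{i=1}^{n} \mathrm{twCRPS}_t\bigl(\hat F_{x_i}, y_i\bigr).
\]
```

Setting w ≡ 1 in either weighted score recovers the unweighted CRPS, which is why the weighted versions can be read as the same score restricted to the region of interest.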

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable weighted extension of the PCRPS via IDR to score deterministic models on extremes and applies it to WeatherBench 2, but the claim of a fair comparison between AIWP and NWP rests on thin evidence.

The core contribution is a weighted version of the Potential CRPS that emphasizes extremes, paired with IDR post-processing of deterministic outputs, then used to rank GraphCast, Pangu-Weather, and FuXi against the ECMWF high-resolution model on WeatherBench 2 variables. The authors also check record-breaking events. The ordering favors FuXi in most cases.

What works is the grounding in IDR optimality for the weighted score, which the authors cite from prior work. This gives a defensible reason to treat the post-processed forecasts as comparable without arbitrary calibration choices. The application to real extremes data is straightforward and the insensitivity to threshold choice is at least asserted.

The soft spot is the fairness argument. IDR is optimal for each model's own forecast-observation pairs, but that does not automatically remove model-class differences in tail error structure or training data effects. The abstract states that the ordering is largely insensitive to thresholds yet gives no numbers to back it. The central concern stands: if AIWP and NWP retain distinct biases after IDR, the comparison still mixes informativeness with architecture artifacts, especially for precipitation and wind.

This is for people who evaluate probabilistic weather forecasts or decide on operational model use. It is a targeted methodological note rather than a broad advance.

Send it for peer review. The method is usable and the data application is relevant; referees can press on the missing checks for the fairness claim.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes using isotonic distributional regression (IDR/EasyUQ) to convert deterministic outputs from AIWP models (GraphCast, Pangu-Weather, FuXi) and an NWP reference (ECMWF) into probabilistic forecasts, then scoring them with weighted potential CRPS (PCRPS) focused on extremes defined via historical thresholds. It concludes that AIWP models, particularly FuXi, yield more informative forecasts than NWP for extremes in mean sea level pressure, temperature, wind speed, and precipitation, with the ordering largely insensitive to threshold choice.

Significance. If the fairness and optimality claims hold, the work supplies a practical framework for head-to-head evaluation of data-driven versus physics-based models on tail events and supplies evidence that AIWP can outperform NWP on extremes, which would be relevant for operational forecasting and model development.

major comments (3)

[Abstract] Abstract: the statement that IDR optimality with respect to weighted CRPS 'facilitates fair comparisons' between AIWP and NWP is asserted without any quantitative verification or diagnostic that the post-processed distributions remove model-class-specific tail biases; the central claim therefore rests on an untested transfer of the cited property.
[Abstract] Abstract and Results: the claim that 'the ordering of the different methods is largely insensitive to the thresholds' is presented without accompanying sensitivity tables, plots, or quantitative measures of rank stability across threshold choices; likewise, the procedure for selecting thresholds from historical data is not detailed enough to allow replication or robustness checks.
[Methods] Methods: while IDR is optimal for the weighted CRPS on a given set of forecast-observation pairs, no analysis is supplied to confirm that the resulting probabilistic forecasts equalize the comparison when the underlying deterministic models have different error structures or training distributions (especially for precipitation and wind speed); this leaves open the possibility that residual architecture-specific biases remain in the tails.

minor comments (1)

[Abstract] Abstract: the citation 'Gneiting et al. (2026)' should be checked for the correct year.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive report. The comments identify areas where additional clarification and supporting material would strengthen the presentation of our framework. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract: the statement that IDR optimality with respect to weighted CRPS 'facilitates fair comparisons' between AIWP and NWP is asserted without any quantitative verification or diagnostic that the post-processed distributions remove model-class-specific tail biases; the central claim therefore rests on an untested transfer of the cited property.

Authors: The optimality of IDR for weighted CRPS is a general theoretical result (see the cited IDR and proper scoring rule literature) that holds for any deterministic input once the isotonic regression is fitted to the same observations. Because the post-processing step is applied independently to each model using identical training data and the same proper scoring rule, the comparison is fair by construction: each model receives its own optimal probabilistic representation under the chosen score. We acknowledge, however, that the manuscript does not include explicit empirical diagnostics (e.g., tail-quantile plots or bias comparisons) demonstrating the removal of architecture-specific tail biases. In the revised version we will add a short appendix with representative post-processed CDFs and quantile-quantile diagnostics for the four variables, allowing readers to inspect the tail adjustments directly. revision: yes

Referee: [Abstract] Abstract and Results: the claim that 'the ordering of the different methods is largely insensitive to the thresholds' is presented without accompanying sensitivity tables, plots, or quantitative measures of rank stability across threshold choices; likewise, the procedure for selecting thresholds from historical data is not detailed enough to allow replication or robustness checks.

Authors: We agree that both the threshold-selection procedure and the supporting evidence for rank stability require more detail. Thresholds are defined as the empirical quantiles (e.g., 0.95, 0.99, 0.995) of the historical observation record in the training period for each variable and location; the exact quantiles and the number of events retained will be stated explicitly in the Methods section. To quantify the claimed insensitivity, we will add a supplementary table reporting model ranks for a grid of thresholds together with a simple stability metric (e.g., the fraction of threshold choices that preserve the overall ordering). These additions will also be referenced in the abstract and Results. revision: yes

Referee: [Methods] Methods: while IDR is optimal for the weighted CRPS on a given set of forecast-observation pairs, no analysis is supplied to confirm that the resulting probabilistic forecasts equalize the comparison when the underlying deterministic models have different error structures or training distributions (especially for precipitation and wind speed); this leaves open the possibility that residual architecture-specific biases remain in the tails.

Authors: The referee correctly observes that the manuscript relies on the theoretical optimality property without additional empirical checks for residual biases arising from differing error structures. While the optimality guarantee is model-agnostic once the forecasts and observations are fixed, we accept that readers may wish to see evidence that architecture-specific tail discrepancies are adequately corrected, particularly for precipitation and wind speed. In revision we will expand the Methods discussion to note this assumption explicitly and will include a brief comparison of raw deterministic error distributions versus the post-processed tails for the two most challenging variables. If space allows, we will also add a short paragraph in the Discussion acknowledging that complete equalization cannot be guaranteed without further model-specific diagnostics. revision: partial
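The stability metric mentioned in the second response (the fraction of threshold choices that preserve the overall ordering) could be computed along the following lines. This is a sketch under stated assumptions: the function names, the choice of a reference threshold, and the scores for the placeholder models A, B and C are all hypothetical and are not results from the paper.

```python
def ranking(scores):
    """Model names ordered from best (lowest score) to worst."""
    return tuple(sorted(scores, key=scores.get))


def rank_stability(scores_by_threshold, reference_threshold):
    """Fraction of thresholds whose model ordering matches the reference ordering."""
    reference = ranking(scores_by_threshold[reference_threshold])
    matches = sum(ranking(s) == reference for s in scores_by_threshold.values())
    return matches / len(scores_by_threshold)


# Made-up scores for three placeholder models at three quantile thresholds.
toy_scores = {
    0.95: {"A": 1.0, "B": 2.0, "C": 3.0},
    0.99: {"A": 0.5, "B": 0.7, "C": 0.9},
    0.995: {"A": 0.4, "B": 0.3, "C": 0.6},  # A and B swap places here
}
stability = rank_stability(toy_scores, reference_threshold=0.95)  # 2 of 3 orderings match
```

A single fraction hides which models swap places, so a table of the rankings themselves would still be needed alongside it.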

Circularity Check

0 steps flagged

No significant circularity; derivation relies on externally cited optimality properties.

The paper's key claim that IDR post-processing enables fair comparisons rests on optimality properties of IDR w.r.t. weighted CRPS, which are explicitly cited from prior work rather than derived or fitted within this manuscript. The extension of PCRPS to weighted versions for extremes is presented as inheriting those properties without any of the present equations reducing the fairness result to a self-defined quantity, a fitted input renamed as prediction, or a self-citation chain. The comparison framework between AIWP and NWP models therefore remains self-contained against external benchmarks, with no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the optimality of IDR for weighted CRPS (external citation) and on the representativeness of historical thresholds for defining extremes. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: IDR is optimal for weighted versions of the CRPS.
Invoked to justify that the post-processing step enables fair comparisons; referenced to Gneiting et al. (2026).

Reference graph

Works this paper leans on

53 extracted references · 12 canonical work pages

[1] Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 2023.
[2] Weighted verification tools to evaluate univariate and multivariate probabilistic forecasts for high-impact weather events. Weather and Forecasting.
[3] On the Predictive Skill of Artificial Intelligence-based Weather Models for Extreme Events using Uncertainty Quantification. arXiv preprint arXiv:2511.17176.
[4] Decompositions of the mean continuous ranked probability score. Electronic Journal of Statistics, 2024.
[5] The quiet revolution of numerical weather prediction. Nature, 2015.
[6] Forecast score distributions with imperfect observations. Advances in Statistical Climatology, Meteorology and Oceanography, 2021.
[7] Bi, Kaifeng; Xie, Lingxi; Zhang, Hengheng; Chen, Xin; Gu, Xiaotao; Tian, Qi. Accurate Medium-Range Global Weather Forecasting with… 2023. doi:10.1038/s41586-023-06185-3.
[8] Bonev, Boris; Kurth, Thorsten; Mahesh, Ankur; Bisson, Mauro; Kossaifi, Jean; Kashinath, Karthik; Anandkumar, Anima; Collins, William D; Pritchard, Michael S; Keller, Alexander. FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale.
[9] Brenowitz, Noah D; Cohen, Yair; Pathak, Jaideep; Mahesh, Ankur; Bonev, Boris; Kurth, Thorsten; Durran, Dale R; Harrington, Peter; Pritchard, Michael S. A practical probabilistic benchmark for AI weather models. 2025.
[10] Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950.
[11] Chen, Lei; Zhong, Xiaohui; Zhang, Feng; Cheng, Yuan; Xu, Yinghui; Qi, Yuan; Li, Hao. 2023. doi:10.1038/s41612-023-00512-1.
[12] Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 2011.
[13] Ennis, Kelsey E; Barnes, Elizabeth A; Arcodia, Marybeth C; Fernandez, Martin A; Maloney, Eric D. Turning Up the Heat: Assessing 2-m Temperature Forecast Errors in…
[14] Measuring forecast performance in the presence of observation error. Quarterly Journal of the Royal Meteorological Society, 2017.
[15] Glasser, Gerald J. Variance… 1962. doi:10.2307/2282402.
[16] Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 2007.
[17] Making and evaluating point forecasts. Journal of the American Statistical Association, 2011.
[18] Comparing Density Forecasts Using Threshold- and Quantile-Weighted Scoring Rules. Journal of Business & Economic Statistics, 2010. doi:10.1198/jbes.2010.08110.
[19] Gneiting, Tilmann; Biegert, Tobias; Kraus, Kristof; Walz, Eva-Maria; Jordan, Alexander I; Lerch, Sebastian. Probabilistic measures afford fair comparisons of… 2026.
[20] The continuous ranked probability score for circular variables and its application to mesoscale forecast ensemble verification. Quarterly Journal of the Royal Meteorological Society, 2006.
[21] Isotonic distributional regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2021.
[22] Hersbach, Hans; Bell, Bill; Berrisford, Paul; Hirahara, Shoji; Hor… The… Quarterly Journal of the Royal Meteorological Society. doi:10.1002/qj.3803.
[23] Focusing on regions of interest in forecast evaluation. Annals of Applied Statistics.
[24] Facets of forecast evaluation. 2016.
[25] Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences, 2007.
[26]

Science, 382 (6677), 1416--1421, doi:10.1126/science.adi2336

Learning Skillful Medium-Range Global Weather Forecasting , author =. Science , volume =. doi:10.1126/science.adi2336 , urldate =

work page doi:10.1126/science.adi2336
[27]

Lang, Simon and Alexe, Mihai and Chantry, Matthew and Dramsch, Jesper and Pinault, Florian and Raoult, Baudouin and Clare, Mariana CA and Lessig, Christian and Maier-Gerber, Michael and Magnusson, Linus and others , journal=
[28]

AIFS-CRPS: Ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score

Lang, Simon and Alexe, Mihai and Clare, Mariana CA and Roberts, Christopher and Adewoyin, Rilwan and Ben Bouall. AIFS-CRPS: Ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score. npj Artificial Intelligence , volume=. 2026 , publisher=

2026
[29]

and Ravazzolo, Francesco and Gneiting, Tilmann , year = 2017, journal =

Lerch, Sebastian and Thorarinsdottir, Thordis L. and Ravazzolo, Francesco and Gneiting, Tilmann , year = 2017, journal =. Forecaster's Dilemma: Extreme Events and Forecast Evaluation , shorttitle =. 26408123 , eprinttype =

2017
[30]

Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an

Loveday, Nicholas and Hertneky, Tracy , journal=. Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an
[31]

Scoring rules for continuous probability distributions. Management Science, 1976.
[32]

Impact forecasting to support emergency management of natural hazards. Reviews of Geophysics, 2020.
[33]

Olivetti, Leonardo and Messori, Gabriele. Do Data-Driven Models Beat Numerical Models in Forecasting Weather Extremes? 2024. doi:10.5194/gmd-17-7915-2024
[34]

Pathak, Jaideep and Subramanian, Shashank and Harrington, Peter and Raja, Sanjeev and Chattopadhyay, Ashesh and Mardani, Morteza and Kurth, Thorsten and Hall, David and Li, Zongyi and Azizzadenesheli, Kamyar and others. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators.
[35]

Probabilistic Weather Forecasting with Machine Learning. Nature. doi:10.1038/s41586-024-08252-9
[36]

Rasp, Stephan and Dueben, Peter D and Scher, Sebastian and Weyn, Jonathan A and Mouatadid, Soukayna and Thuerey, Nils. 2020.
[37]

Rasp, Stephan and Hoyer, Stephan and Merose, Alexander and Langmore, Ian and Battaglia, Peter and Russell, Tyler and others. Journal of Advances in Modeling Earth Systems. doi:10.1029/2023MS004019
[38]

Rivoire, Pauline and Martius, Olivia and Naveau, Philippe. A comparison of moderate and extreme. 2021.
[39]

Evaluation of point forecasts for extreme events using consistent scoring functions. Quarterly Journal of the Royal Meteorological Society, 2022.
[40]

Walz, Eva-Maria and Henzi, Alexander and Ziegel, Johanna and Gneiting, Tilmann. Easy. 2024. doi:10.1137/22M1541915
[41]

Improving Probabilistic Forecasts of Extreme Wind Speeds by Training Statistical Postprocessing Models with Weighted Scoring Rules. Monthly Weather Review. doi:10.1175/MWR-D-24-0151.1
[42]

Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts. Mathematical Geosciences. doi:10.1007/s11004-017-9709-7
[43]

Zhang, Yuchen and Long, Mingsheng and Chen, Kaiyuan and Xing, Lanxiang and Jin, Ronghua and Jordan, Michael I and Wang, Jianmin. Skilful nowcasting of extreme precipitation with. 2023.
[44]

Zhongwei Zhang and Erich Fischer and Jakob Zscheischler and Sebastian Engelke. Science Advances.
[45]

Zhong, Xiaohui and Chen, Lei and Li, Hao and Buizza, Roberto and Liu, Jun and Feng, Jie and Zhu, Zijian and Fan, Xu and Dai, Kan and Luo, Jing-jia and others. 2025.
[46]

A typology of compound weather and climate events. Nature Reviews Earth & Environment, 2020.
[47]

B. Uncertainty. Artificial Intelligence for the Earth Systems.
[48]

Do. npj Climate and Atmospheric Science.
[49]

Olivier C. Pasche and Jonathan Wider and Zhongwei Zhang and Jakob Zscheischler and Sebastian Engelke. Validating Deep Learning Weather Forecast Models on Recent High-Impact Extreme Events. Artificial Intelligence for the Earth Systems, 2025.
[50]

McGovern, Amy and Mandelbaum, Taylor and Rothenberg, Daniel and Loveday, Nicholas and Potvin, Corey and Flora, Montgomery and Magnusson, Linus and Gilleland, Eric and Allen, John. arXiv preprint arXiv:2605.01126.
[51]

Demystifying Data-Driven Probabilistic Medium-Range Weather Forecasting. arXiv preprint arXiv:2601.18111.
[52]

Skillful joint probabilistic weather forecasting from marginals. arXiv preprint arXiv:2506.10772.
[53]

Neural general circulation models for weather and climate. Nature, 2024.
