Quantifying Very Extreme Precipitation and Temperature Using Huge Ensembles Generated by Machine Learning-based Climate Model Emulators

Christopher J. Paciorek; Daniel Cooley

arxiv: 2510.08893 · v2 · submitted 2025-10-10 · 📊 stat.AP

Pith reviewed 2026-05-18 08:29 UTC · model grok-4.3

classification 📊 stat.AP

keywords extreme precipitation · temperature extremes · machine learning emulators · extreme value theory · huge ensembles · Probable Maximum Precipitation · threshold exceedance · climate statistics

The pith

Huge ensembles from a machine learning climate emulator allow practical estimation of very extreme precipitation and temperature quantiles using threshold-exceedance methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether machine learning emulators can generate climate data volumes large enough to characterize extremely rare weather events that short observational records cannot capture. It applies extreme value statistics to a 10,560-year ensemble of precipitation and temperature produced by the ACE2 emulator over the contiguous United States. The work shows that high-threshold exceedance techniques produce stable quantile estimates, that results hold across seasons and storm types, and that the ensemble size keeps statistical uncertainty tightly bounded. The emulator also generates values outside its training range, which opens the possibility of using such ensembles for infrastructure design quantities like Probable Maximum Precipitation.

Core claim

A state-of-the-art machine learning emulator trained on reanalysis can produce a 10,560-year ensemble whose extremes, when analyzed with threshold-exceedance extreme value methods at sufficiently high thresholds, yield reliable and robust estimates of very low-probability precipitation and temperature quantiles, with statistical uncertainty well constrained by the ensemble size.

What carries the argument

The 10,560-year ensemble generated by the ACE2 machine learning emulator, analyzed via peak-over-threshold extreme value techniques.
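
For readers less familiar with peak-over-threshold analysis, the standard textbook form of the model is sketched below. The notation ($u$ for the threshold, $\sigma_u$ and $\xi$ for the generalized Pareto scale and shape, $\zeta_u$ for the exceedance probability, $m$ for the number of observations per return period) is chosen here for illustration and is an assumption, not necessarily the notation the paper uses.

```latex
% Generalized Pareto approximation for excesses over a high threshold u
\[
  \Pr(X - u > y \mid X > u) \approx \left(1 + \frac{\xi\, y}{\sigma_u}\right)^{-1/\xi},
  \qquad y > 0,\quad 1 + \frac{\xi\, y}{\sigma_u} > 0 .
\]
% Level exceeded on average once every m observations, with \zeta_u = \Pr(X > u)
\[
  x_m = u + \frac{\sigma_u}{\xi}\left[(m\,\zeta_u)^{\xi} - 1\right], \qquad \xi \neq 0,
\]
\[
  x_m = u + \sigma_u \log(m\,\zeta_u), \qquad \xi = 0 .
\]
```

An $N$-year return level follows by setting $m = N n_y$, where $n_y$ is the number of observations per year. The approximation is only expected to hold for sufficiently high $u$, which is why threshold choice matters.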

If this is right

Threshold-exceedance methods with high thresholds are required for reliable estimation of precipitation extremes.
Extreme quantile estimates remain consistent when stratified by season or storm type.
An ensemble of roughly ten thousand years is large enough to produce well-constrained statistical uncertainty for these tails.
The emulator approach can be used to generate values beyond the range seen in the original reanalysis training data.
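
The threshold-exceedance workflow these points refer to can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic draws from a generalized Pareto distribution standing in for daily values, and the function name `fit_pot`, the 1000-year record length, and the three threshold quantiles are all assumptions made for illustration. It uses `scipy.stats.genpareto` to fit the excesses and converts the fit to a return level.

```python
import numpy as np
from scipy.stats import genpareto


def fit_pot(values, threshold_quantile, obs_per_year, return_period_years):
    """Fit a GPD to excesses over a high threshold.

    Returns (threshold, shape, scale, return_level).
    """
    values = np.asarray(values, dtype=float)
    u = np.quantile(values, threshold_quantile)
    excess = values[values > u] - u
    # floc=0 fixes the GPD location at zero: excesses are already relative to u.
    shape, _, scale = genpareto.fit(excess, floc=0)
    rate = excess.size / values.size        # empirical P(X > u)
    m = return_period_years * obs_per_year  # observations per return period
    # Return level: threshold plus the GPD quantile with exceedance
    # probability 1 / (m * rate) among the excesses.
    level = u + genpareto.ppf(1.0 - 1.0 / (m * rate), shape, loc=0, scale=scale)
    return u, shape, scale, level


rng = np.random.default_rng(0)
# Synthetic stand-in for daily data (NOT emulator output):
# 1000 "years" of 365 values with GPD shape 0.1 and scale 5.
daily = genpareto.rvs(0.1, scale=5.0, size=1000 * 365, random_state=rng)

# Refit at several thresholds to see how stable the estimates are.
results = {
    q: fit_pot(daily, q, obs_per_year=365, return_period_years=1000)
    for q in (0.95, 0.99, 0.999)
}
for q, (u, shape, scale, level) in results.items():
    print(f"q={q}: u={u:.2f} shape={shape:.3f} scale={scale:.2f} "
          f"1000-yr level={level:.1f}")
```

Because the synthetic data are exactly generalized Pareto, every threshold should recover a similar shape and return level here. With real precipitation, estimates typically drift as the threshold is lowered, which is the behavior a threshold-stability check of this kind is meant to expose.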

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This workflow could be applied to update Probable Maximum Precipitation estimates used in critical infrastructure design.
If future emulators better capture physical tail behavior, the same ensemble-size strategy might reduce reliance on traditional climate-model runs for extreme-event studies.
The method offers a route to quantify how climate change alters the far tails of precipitation and temperature distributions at regional scales.

Load-bearing premise

The machine learning emulator produces realistic extremes outside the range of its training data that match the true tail behavior of precipitation and temperature.

What would settle it

Direct comparison of the emulator-derived quantiles against independent long observational records or physical upper bounds that shows systematic mismatch at the highest return periods.

Original abstract

Weather extremes produce major impacts on society and ecosystems and are likely to change in likelihood and magnitude with climate change. However, very low probability events are hard to characterize statistically using observations or even climate model output because of short records/runs. For precipitation, consideration of such events arises in quantifying Probable Maximum Precipitation (PMP), namely estimating extreme precipitation magnitudes for designing and assessing critical infrastructure. A recent National Academies report on modernizing PMP estimation proposed using very large climate model-based ensembles to estimate extreme quantiles, possibly through machine learning-based ensemble boosting. Here we assess statistical aspects of such an approach for the contiguous United States using a huge ensemble (10560 years) produced by a state-of-the-art emulator (ACE2) trained on ERA5 reanalysis. The results indicate that one can practically estimate very extreme precipitation and temperature quantiles, provided one uses appropriate statistical extreme value techniques. More specifically, the results provide evidence for (1) the use of threshold-exceedance methods with a sufficiently high threshold (necessary for precipitation) for reliable estimation, (2) the robustness of results to variation in extremes by season and storm type, and (3) the sufficiency of the ensemble for well-constrained statistical uncertainty. Our results also show that the emulator produces extremes outside the range of the ERA5 training data. While encouraging for emulators' potential use for quantifying the climatology of extremes, more investigation is needed to assess whether emulators are fit for this purpose. Our focus is on how to use huge ensembles to estimate very extreme statistics; we expect the results to be relevant for future improved emulators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Huge ML emulator ensembles let you run standard EVT on very rare precipitation and temperature events with tight uncertainty, but the emulator's tail behavior still needs external checks.

The punchline is that a 10,000-plus-year ensemble from the ACE2 emulator makes it practical to estimate very extreme quantiles with threshold-exceedance methods, and the paper shows the ensemble size keeps statistical uncertainty low while results hold up across seasons and storm types. They also confirm the emulator generates values past the ERA5 training range, which is a basic requirement for this kind of work. This is a direct test of the National Academies proposal on modernizing PMP estimates using large synthetic ensembles rather than just theorizing about it. The robustness checks and the demonstration that high thresholds are needed for precipitation are the parts that feel most useful on the statistical side. The main soft spot is exactly what the abstract flags: there is no direct validation that the emulator's extremes match real tail behavior from independent observations or full physics models. Without that comparison, the reported quantiles remain conditional on the emulator getting rare storm dynamics and moisture transport right, and the stress-test note on tail realism holds up. The paper is aimed at people doing extreme-value work for infrastructure design or climate impacts who want to see how these methods behave at this scale. It deserves a serious referee because it supplies concrete empirical results on the workflow and is upfront about the remaining emulator question. I would send it to review and ask referees to focus on whether the tails can be checked against other data sources.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates the feasibility of estimating very extreme precipitation and temperature quantiles over the contiguous United States by applying extreme-value theory to a 10560-year ensemble generated by the ACE2 machine-learning emulator trained on ERA5 reanalysis. It reports that threshold-exceedance methods with sufficiently high thresholds yield stable estimates, that results are robust across seasons and storm types, and that the ensemble size produces well-constrained statistical uncertainty. The emulator is shown to generate values outside the ERA5 training range, but the authors explicitly note that further work is required to determine whether such emulators are fit for purpose in extreme-value applications.

Significance. If the emulator faithfully reproduces physical tail behavior, the work would provide a concrete demonstration that machine-learning ensemble boosting can supply the sample sizes needed for reliable estimation of Probable Maximum Precipitation and other rare-event statistics. The explicit focus on statistical methodology rather than emulator development, together with the large ensemble size, constitutes a useful contribution to the modernization of PMP methods proposed by the National Academies report.

major comments (1)

[Abstract] Abstract and final paragraph: the claim that the results demonstrate one can 'practically estimate very extreme precipitation and temperature quantiles' rests on the untested assumption that ACE2 reproduces the true tail behavior outside the ERA5 training distribution. No quantitative comparison of emulator-generated extremes against independent long observational records or high-resolution physics-based model integrations is presented for the rare-event regime, leaving open the possibility that reported quantiles reflect emulator artifacts rather than climatological reality.

minor comments (2)

[Methods] The manuscript would benefit from an explicit statement of the precise threshold-selection criterion (e.g., mean residual life plot or 95th percentile) and the number of exceedances retained for each variable and season.
[Figures] Figure captions should clarify whether the plotted quantiles are return levels or normalized exceedance probabilities, and should include the effective sample size after declustering.
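
The two diagnostics named in these minor comments, a mean residual life plot and declustering, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `mean_residual_life` and `runs_decluster`, the run length, and the toy series are assumptions, and the manuscript may use different procedures.

```python
import numpy as np


def mean_residual_life(values, thresholds):
    """Mean excess over each threshold.

    Where a GPD describes the tail, this is roughly linear in the threshold.
    Thresholds with no exceedances give NaN.
    """
    values = np.asarray(values, dtype=float)
    return np.array([(values[values > u] - u).mean() for u in thresholds])


def runs_decluster(values, threshold, run_length):
    """Indices of cluster maxima under runs declustering.

    Two exceedances belong to different clusters when at least
    `run_length` consecutive non-exceedances separate them.
    """
    values = np.asarray(values, dtype=float)
    exceed_idx = np.flatnonzero(values > threshold)
    if exceed_idx.size == 0:
        return np.array([], dtype=int)
    # An index gap larger than run_length means >= run_length values
    # at or below the threshold lie in between.
    breaks = np.flatnonzero(np.diff(exceed_idx) > run_length)
    clusters = np.split(exceed_idx, breaks + 1)
    return np.array([c[np.argmax(values[c])] for c in clusters])


# Toy series with three separated bursts above the threshold of 4.
series = np.array([0, 5, 6, 0, 0, 0, 7, 0, 8, 0, 0, 0, 0, 9], dtype=float)
peaks = runs_decluster(series, threshold=4.0, run_length=2)
mrl = mean_residual_life([1, 2, 3, 4, 5], thresholds=[2, 3])
print("cluster maxima at indices:", peaks)
print("mean excess:", mrl)
```

The number of cluster maxima, `peaks.size`, is one simple notion of the effective sample size after declustering that a figure caption could report.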

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful review, which highlights an important consideration regarding the scope of our claims. We respond to the major comment below and have made revisions to improve the clarity and precision of our presentation.

Referee: [Abstract] Abstract and final paragraph: the claim that the results demonstrate one can 'practically estimate very extreme precipitation and temperature quantiles' rests on the untested assumption that ACE2 reproduces the true tail behavior outside the ERA5 training distribution. No quantitative comparison of emulator-generated extremes against independent long observational records or high-resolution physics-based model integrations is presented for the rare-event regime, leaving open the possibility that reported quantiles reflect emulator artifacts rather than climatological reality.

Authors: We appreciate the referee drawing attention to this point. Our manuscript is deliberately scoped to examine the statistical feasibility of applying extreme-value methods to very large ensembles generated by an existing emulator, as proposed in the National Academies report on modernizing PMP estimation; it is not intended as a validation study of the emulator's tail fidelity. We already include explicit caveats in both the abstract and final paragraph stating that further investigation is needed to assess whether emulators are fit for extreme-value applications. We agree that the current wording could be tightened to avoid any implication that tail realism has been demonstrated. Accordingly, we will revise the abstract and concluding paragraph to more precisely state that the results show one can practically estimate very extreme quantiles from such ensembles using appropriate threshold-exceedance methods, provided the emulator produces realistic tails, while reiterating the need for future comparisons against independent long records or physics-based integrations. This change clarifies the contribution without affecting the reported statistical findings on threshold selection, robustness, or uncertainty.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; standard EVT applied to external emulator output

The paper generates a large ensemble (10560 years) via the ACE2 emulator trained on ERA5 and then applies established threshold-exceedance extreme value methods to estimate quantiles. No derivation step reduces by construction to a fitted parameter or self-citation; the central results concern robustness of standard EVT to threshold choice, season, and storm type on this external ensemble, with explicit caveats that further validation of emulator tails is needed. The analysis remains self-contained against external reanalysis benchmarks without renaming known results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim depends on the emulator faithfully extending the distribution tails beyond training data and on extreme value theory applying directly to the generated fields without emulator-specific biases.

axioms (2)

[domain assumption] Extreme value theory threshold-exceedance methods remain valid when applied to output from a machine learning climate emulator.
Invoked when recommending high thresholds for reliable estimation of precipitation extremes.
[ad hoc to paper] The ACE2 emulator produces statistically representative extremes outside the ERA5 training range.
Stated as observed result but flagged as requiring further investigation for fitness-for-purpose.

pith-pipeline@v0.9.0 · 5829 in / 1417 out tokens · 31536 ms · 2026-05-18T08:29:34.140037+00:00 · methodology
