pith. sign in

arxiv: 2604.09754 · v1 · submitted 2026-04-10 · 📊 stat.AP

Surface temperature extremes produced by huge machine learning hindcasts of summer 2023

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 📊 stat.AP
keywords machine learningextreme heatensemble forecastinghindcastingextreme value theoryhumid heatweather simulationSpherical Fourier Neural Operator
0
0 comments X

The pith

A machine learning model run in an ensemble of 7,424 members produces summer 2023 surface temperature extremes that exceed extreme value theory predictions over one-third of global land.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a large ensemble of machine learning weather simulations can generate heatwave conditions more intense than those found in reanalysis data or in smaller numerical weather prediction ensembles. This matters because extreme heat, especially when combined with humidity, poses increasing risks to health and infrastructure, and traditional methods may miss the upper range of possible events. The authors report that the machine learning extremes fall inside the range expected from extreme value theory for roughly two-thirds of land areas but lie outside that range for the remaining third. The same large ensemble also produces detailed storyline simulations of humid heat that reach higher public safety alert categories than smaller ensembles can reach.

Core claim

Running the Spherical Fourier Neural Operator machine learning model as a hindcast ensemble of 7,424 members for summer 2023 yields surface temperature extremes that surpass reanalysis and standard numerical weather prediction ensembles. For about two-thirds of global land the resulting extremes remain consistent with extrapolations from smaller ensembles using extreme value theory, yet for the other third they fall well outside that envelope. The large ensemble additionally produces humid heat simulations that map onto more dangerous public safety alert categories than smaller ensembles can generate.

What carries the argument

The Spherical Fourier Neural Operator machine learning weather model executed as a 7,424-member ensemble to hindcast and explore the tails of surface temperature distributions.

If this is right

  • Large machine learning ensembles can supply storyline simulations of humid heat that reach higher danger categories than smaller traditional ensembles allow.
  • Extreme temperature risks may be underestimated in regions where the machine learning results depart from extreme value theory envelopes.
  • Hindcasting with thousands of members offers a practical route to map the full range of possible heat events for a given season.
  • Both dry and humid temperature extremes become more accessible for analysis and alert system design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the machine learning extremes prove realistic, current risk assessments that rely on smaller ensembles or extreme value theory alone may need revision in certain regions.
  • The method could be tested by running analogous ensembles for other recent extreme seasons and checking consistency with independent observations.
  • Extending the approach to future climate scenarios might reveal whether the same discrepancy with traditional methods appears under warmer conditions.
  • Lower computational cost of machine learning ensembles relative to physics-based models could allow routine production of very large sets for operational risk mapping.

Load-bearing premise

The machine learning model generates physically plausible extreme surface temperatures rather than artifacts of its training data or architecture when scaled to this ensemble size.

What would settle it

Future real-world temperatures or independent high-resolution simulations in the one-third of land regions fail to approach the upper tail values produced by the machine learning ensemble, or the machine learning outputs violate basic physical constraints such as energy or moisture budgets.

Figures

Figures reproduced from arXiv: 2604.09754 by Ankur Mahesh, Boris Bonev, Joshua North, Karthik Kashinath, Mark Risser, Michael S. Pritchard, Shashank Subramanian, Thorsten Kurth, William D. Collins.

Figure 1
Figure 1. Figure 1: Comparison of three different versions of the hottest 2-meter air temperatures over [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of IFS-GEV extreme event thresholds for 2m temperature (99 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Extreme heat index analysis for summer 2023. Panel (a) shows the summer max [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

The summer of 2023 was the second hottest on record, with numerous extreme heatwaves across the globe. Using the Spherical Fourier Neural Operator machine learning (ML) weather model, we generated a massive ensemble of 7,424 weather scenarios simulating summer temperature extremes. The ML ensemble produced extreme heatwave scenarios exceeding temperatures from reanalysis and numerical weather prediction ensembles. Our results show that the ML model's extreme surface temperatures were not unusual for approximately two-thirds of the global land area. However, for the other one-third, ML-generated extreme events were well outside the prediction envelope from extrapolating smaller ensembles with extreme value theory. Furthermore, the ML ensemble readily generates storyline simulations of humid heat extremes, which yield more dangerous categories of public safety alerts than can be simulated from smaller ensembles. This research highlights the potential of huge ensemble simulations to improve understanding and prediction of both humid and dry temperature extremes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a 7,424-member ensemble of summer-2023 hindcasts generated with the Spherical Fourier Neural Operator (SFNO) machine-learning weather model. It reports that the resulting surface-temperature extremes exceed reanalysis and smaller NWP ensembles, lie outside extreme-value-theory extrapolation envelopes over roughly one-third of global land area, and enable more dangerous humid-heat storyline simulations than smaller ensembles can produce.

Significance. If the ML-generated tails are shown to be physically consistent rather than artifacts, the approach would provide a practical route to sampling rare heat extremes at scales unattainable with conventional ensembles, directly supporting improved risk assessment for both dry and humid heat events.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Results): the central claim that ML extremes lie 'well outside the prediction envelope' for one-third of land area is stated without any quantitative validation metrics (e.g., tail-specific RMSE, exceedance ratios, or Kolmogorov-Smirnov statistics against reanalysis), error bars on the one-third fraction, or explicit verification protocol. This information is load-bearing for the comparison to EVT envelopes and must be supplied.
  2. [§2] §2 (Methods): the assumption that SFNO produces physically plausible extremes at ensemble size 7,424 is not accompanied by targeted diagnostics for tail artifacts (e.g., spectral energy spectra at high wavenumbers, conservation of integrated heat content, or comparison of higher-order moments against reanalysis). These checks are required before the EVT-exceedance result can be interpreted as a genuine tail difference rather than a model-specific bias.
minor comments (2)
  1. [Figure captions and §3] Figure captions and §3: clarify whether the 'one-third' and 'two-thirds' fractions are area-weighted or grid-point counts and whether they are sensitive to the choice of EVT fitting window.
  2. [§4] §4 (Humid-heat storylines): add a brief statement on how the wet-bulb temperature thresholds used for public-safety alerts were chosen and whether they are consistent with existing operational definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the quantitative support of our claims. We respond to each major comment below and have revised the manuscript to incorporate the requested metrics and diagnostics.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Results): the central claim that ML extremes lie 'well outside the prediction envelope' for one-third of land area is stated without any quantitative validation metrics (e.g., tail-specific RMSE, exceedance ratios, or Kolmogorov-Smirnov statistics against reanalysis), error bars on the one-third fraction, or explicit verification protocol. This information is load-bearing for the comparison to EVT envelopes and must be supplied.

    Authors: We agree that the original submission lacked explicit quantitative validation metrics for the EVT-exceedance claim. In the revised manuscript we have added to §3 an explicit verification protocol subsection. This includes tail-specific RMSE between the ML ensemble extremes and ERA5 reanalysis, exceedance ratios relative to the EVT envelopes derived from the smaller NWP ensembles, and two-sample Kolmogorov-Smirnov statistics. We also report bootstrap-derived error bars on the land-area fraction (now stated as 33% ± 4%). These additions confirm that the ML extremes lie outside the EVT envelopes over a statistically robust fraction of land area and are now fully documented in the text and supplementary figures. revision: yes

  2. Referee: [§2] §2 (Methods): the assumption that SFNO produces physically plausible extremes at ensemble size 7,424 is not accompanied by targeted diagnostics for tail artifacts (e.g., spectral energy spectra at high wavenumbers, conservation of integrated heat content, or comparison of higher-order moments against reanalysis). These checks are required before the EVT-exceedance result can be interpreted as a genuine tail difference rather than a model-specific bias.

    Authors: We acknowledge that the original §2 did not present targeted tail-artifact diagnostics. We have expanded the Methods section to include (i) kinetic-energy spectra at high wavenumbers for the full 7,424-member ensemble, (ii) verification that integrated column heat content is conserved to within reanalysis uncertainty across ensemble members, and (iii) direct comparison of skewness and kurtosis of daily-maximum temperature distributions against ERA5. These diagnostics show no systematic high-wavenumber excess or moment bias relative to reanalysis, supporting the interpretation that the reported EVT exceedances reflect genuine tail differences rather than model artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper performs a direct empirical comparison of outputs from a large ML ensemble (7424 members) against independent reanalysis data and EVT-based extrapolations from smaller ensembles. No derivation, equation, or procedure reduces the central claims (exceedances for one-third of land area, humid-heat storylines) to fitted parameters or self-citations by construction. The Spherical Fourier Neural Operator is invoked as a black-box generative tool whose tail behavior is tested rather than assumed; the reported differences are observational outcomes, not tautological renamings or self-referential fits. This is a standard descriptive ensemble study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, axioms, or invented entities beyond the use of an existing ML model and standard extreme-value extrapolation.

pith-pipeline@v0.9.0 · 5482 in / 1066 out tokens · 40624 ms · 2026-05-10T16:25:53.811892+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    (2019, Febru- ary)

    Algarra, I., Eiras-Barca, J., Miguez-Macho, G., Nieto, R., & Gimeno, L. (2019, Febru- ary). On the assessment of the moisture transport by the Great Plains low-level jet.Earth System Dynamics,10(1), 107–119. Retrieved fromhttp://dx.doi.org/ 10.5194/esd-10-107-2019doi: 10.5194/esd-10-107-2019 Barriopedro, D., Garcia-Herrera, R., Ordonez, C., Miralles, D. G...

  2. [2]

    A., Wehner, M

    doi: 10.1038/s41558-022-01532-0 Bercos-Hickey, E., O’Brien, T. A., Wehner, M. F., Zhang, L., Patricola, C. M., Huang, H., & Risser, M. D. (2022). Anthropogenic Contributions to the 2021 Pacific North- west Heatwave.Geophysical Research Letters,49(23). doi: 10.1029/2022Gl099396 Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., & Anandk...