Towards accurate extreme event likelihoods from diffusion model climate emulators
Pith reviewed 2026-05-07 12:35 UTC · model grok-4.3
The pith
Diffusion model climate emulators quantify extreme event likelihoods by comparing guided and unguided probability densities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Diffusion models such as cBottle approximate the probability density of their training data. When the model is guided toward states that include tropical cyclones (TCs), the ratio of the guided to the unguided probability density quantifies how much more likely guidance has made a given cyclone state. These odds ratios then enable importance sampling from the TC distribution, which reduces the standard error of the probability estimate relative to ordinary Monte Carlo sampling.
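The reweighting step in this claim can be written as a standard importance-sampling identity. The notation here is an assumption introduced for illustration and is not taken from the paper: p is the unguided model density, q the guided model density, A the set of atmospheric states that contain a TC, and N the number of guided samples.

```latex
\begin{align}
P(A) &= \int \mathbf{1}_A(x)\, p(x)\, \mathrm{d}x
      = \mathbb{E}_{x \sim q}\!\left[ \mathbf{1}_A(x)\, \frac{p(x)}{q(x)} \right], \\
\hat{P}_N(A) &= \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_A(x_i)\, w_i,
\qquad w_i = \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q .
\end{align}
```

The identity holds when q(x) > 0 wherever p(x) > 0 on A. The weight w_i is the reciprocal of the guided-to-unguided odds ratio: a state that guidance made 100 times more likely counts for 1/100 of a sample when estimating the unguided probability.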
What carries the argument
The odds ratio between guided and unguided model probability densities, which reweights samples to importance-sample the distribution of extreme events.
If this is right
- Fewer emulator runs are needed to obtain reliable probability estimates for rare events.
- Guidance can target specific locations or boundary conditions while still delivering corrected likelihoods.
- Model densities open a route to attribution-style calculations that compare event likelihoods under different forcings.
- Emulators shift from pure generation tools to sources of quantitative probabilistic information.
Where Pith is reading between the lines
- The same density-ratio technique could be tested on other extremes such as heat waves or extreme precipitation.
- Combining the approach with observational constraints might improve calibration of the underlying density estimate.
- If the approximation remains reliable, the method could lower the cost of tail-risk assessment in long climate projections.
Load-bearing premise
The diffusion model accurately approximates the true probability density of atmospheric states and the guidance mechanism does not distort that density in ways that invalidate likelihood comparisons for extremes.
What would settle it
Importance-sampled estimates of tropical cyclone occurrence rates that systematically diverge from rates obtained by a much larger set of unguided Monte Carlo runs or from direct observational records would falsify the accuracy claim.
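The shape of such a check can be illustrated on a toy problem where the answer is known exactly. This is a minimal sketch under loud assumptions: a 1-D standard normal stands in for the unguided density, a shifted normal stands in for the guided density, and a threshold exceedance stands in for "the state contains a TC". The function name and all parameters are hypothetical, and nothing here touches cBottle itself.

```python
import math
import numpy as np


def tail_probability_estimates(n=20_000, threshold=4.0, shift=4.0, seed=0):
    """Toy comparison of plain Monte Carlo and importance sampling.

    Stand-ins (assumptions, not the paper's setup):
      p = N(0, 1)       plays the role of the unguided density
      q = N(shift, 1)   plays the role of the guided density
      x > threshold     plays the role of the rare event
    """
    rng = np.random.default_rng(seed)

    # Plain Monte Carlo: sample from p and count threshold exceedances.
    x_p = rng.standard_normal(n)
    mc_est = float(np.mean(x_p > threshold))

    # Importance sampling: sample from q, reweight each hit by p(x) / q(x).
    x_q = rng.standard_normal(n) + shift
    log_w = -0.5 * x_q**2 + 0.5 * (x_q - shift) ** 2  # log p(x) - log q(x)
    terms = (x_q > threshold) * np.exp(log_w)
    is_est = float(terms.mean())
    is_se = float(terms.std(ddof=1) / math.sqrt(n))

    # Closed-form reference values, available only because the toy is Gaussian.
    exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
    mc_se_theory = math.sqrt(exact * (1.0 - exact) / n)
    return {
        "exact": exact,
        "mc_est": mc_est,
        "mc_se_theory": mc_se_theory,
        "is_est": is_est,
        "is_se": is_se,
    }


if __name__ == "__main__":
    for name, value in tail_probability_estimates().items():
        print(f"{name:>13s}: {value:.3e}")
```

In the toy, the densities are exact, so the importance-sampled estimate agrees with the closed-form value. For an emulator, a systematic gap between the two estimators at matched compute would point to errors in the density estimates rather than to sampling noise.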
Original abstract
ML climate model emulators are useful for scenario planning and adaptation, allowing for cost-efficient experimentation. Recently, the diffusion model Climate in a Bottle (cBottle) has been proposed for generation of atmospheric states compatible with boundary conditions of solar position and sea surface temperatures. Crucially, cBottle can be guided to generate extreme events such as Tropical Cyclones (TCs) over locations of interest. Diffusion models such as cBottle work by approximating the probability density of the training data. Here, we show use cases of the probability density estimates of atmospheric states obtained from this climate emulator. Most importantly, these estimates allow us to calculate likelihoods of extreme events under guidance. When guiding the model towards states including TCs, comparing the probability density under the guided and unguided model enables us to quantify how much more likely the guidance has made the TC. We show how these odds ratios allow us to importance-sample from the TC distribution, reducing the standard error of the probability estimate compared to simple Monte Carlo sampling. Furthermore, we discuss results and limitations of the application of model probability densities to extreme event attribution-like experiments. We present these early but encouraging results hoping they will spur more research into probabilistic information that can be gained from diffusion models of the atmosphere.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes using probability density estimates from the diffusion model climate emulator cBottle to compute likelihoods of extreme events such as tropical cyclones (TCs) under guidance. By comparing the density under guided versus unguided conditions, odds ratios are derived to enable importance sampling from the TC distribution, which is claimed to reduce the standard error of probability estimates relative to simple Monte Carlo sampling. The work further discusses applications and limitations of these densities for extreme event attribution-like experiments.
Significance. If the density approximations hold for rare events, the approach could provide an efficient means to extract probabilistic information on extremes directly from climate emulators, reducing the need for large ensembles in scenario planning and attribution. The method capitalizes on the generative and density-estimating properties of diffusion models in a novel way for atmospheric science, though its practical impact depends on validation of the tail estimates.
major comments (3)
- [Abstract] The central claim that odds ratios from guided/unguided densities enable importance sampling with reduced standard error is presented without any quantitative demonstration (e.g., reported variance reduction factors, effective sample sizes, or direct Monte Carlo comparisons), leaving the practical benefit of the method unverified.
- [Methods (density estimation and guidance)] The approach requires evaluating p_unguided(x) at rare TC states x drawn from the guided distribution. No details are given on the likelihood computation method (e.g., probability-flow ODE integration or ELBO) nor any calibration of absolute or relative density values against empirical frequencies, which is load-bearing because diffusion likelihood approximations are known to have larger relative errors in low-density regions.
- [Results and Discussion] The discussion of results and limitations for extreme event attribution-like experiments does not include any benchmark or sensitivity test confirming that guidance does not distort the density ratios in ways that bias the importance weights, undermining the reliability of the derived likelihoods for rare events.
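The weight diagnostics requested in these comments are cheap to compute once log-weights are available. The following is a minimal sketch, assuming the log-weights log p_unguided(x_i) - log p_guided(x_i) have already been evaluated for each guided sample; the function name and return keys are hypothetical, and the effective sample size used is the common Kish form.

```python
import numpy as np


def importance_weight_diagnostics(log_w):
    """Summarise importance weights from log w_i = log p(x_i) - log q(x_i)."""
    log_w = np.asarray(log_w, dtype=float)

    # Subtract the maximum before exponentiating to avoid overflow;
    # the shift cancels when the weights are normalised.
    w = np.exp(log_w - log_w.max())
    w_norm = w / w.sum()

    # Kish effective sample size: n for uniform weights, close to 1 when a
    # single sample dominates.
    ess = 1.0 / np.sum(w_norm**2)
    return {
        "ess": float(ess),
        "ess_fraction": float(ess / log_w.size),
        "max_weight_share": float(w_norm.max()),
    }
```

An effective sample size far below the number of guided samples, or a single sample carrying most of the normalised weight, would suggest that the guided and unguided densities overlap poorly and that the reported standard error is optimistic.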
minor comments (1)
- [Title] The title uses 'accurate' while the abstract describes 'early but encouraging results'; aligning the title with the preliminary nature of the validation would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods and results.
- Referee: [Abstract] The central claim that odds ratios from guided/unguided densities enable importance sampling with reduced standard error is presented without any quantitative demonstration (e.g., reported variance reduction factors, effective sample sizes, or direct Monte Carlo comparisons), leaving the practical benefit of the method unverified.
  Authors: We agree that the abstract would benefit from explicit quantitative support for the claimed reduction in standard error. The manuscript illustrates the importance sampling approach through examples of TC generation, but does not report specific metrics such as variance reduction factors or effective sample sizes. In the revised version, we will update the abstract to include these quantitative results and expand the results section with direct Monte Carlo comparisons.
  Revision: yes
- Referee: [Methods (density estimation and guidance)] The approach requires evaluating p_unguided(x) at rare TC states x drawn from the guided distribution. No details are given on the likelihood computation method (e.g., probability-flow ODE integration or ELBO) nor any calibration of absolute or relative density values against empirical frequencies, which is load-bearing because diffusion likelihood approximations are known to have larger relative errors in low-density regions.
  Authors: We will revise the Methods section to provide explicit details on the likelihood computation, specifying the probability-flow ODE integration approach used to evaluate the densities. We will also add a calibration subsection comparing the estimated densities (both absolute and relative) to empirical frequencies derived from the training and validation data, with particular attention to behavior in low-density regions relevant to rare events.
  Revision: yes
- Referee: [Results and Discussion] The discussion of results and limitations for extreme event attribution-like experiments does not include any benchmark or sensitivity test confirming that guidance does not distort the density ratios in ways that bias the importance weights, undermining the reliability of the derived likelihoods for rare events.
  Authors: We acknowledge that the current discussion of limitations is primarily qualitative. To address this, we will add benchmark and sensitivity tests in the revised Results and Discussion sections. These will include comparisons of density ratios with and without guidance for both common and rare events, along with checks on the stability of the resulting importance weights.
  Revision: yes
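For orientation, the probability-flow ODE likelihood named in the second response has a standard form in the score-based SDE framework. The notation below is an assumption for illustration and is not taken from the manuscript: f and g are the drift and diffusion coefficients of the forward SDE, s_θ is the learned score, and p_T is the tractable prior at the final noise level.

```latex
\begin{align}
\frac{\mathrm{d}x}{\mathrm{d}t} &= \tilde{f}_\theta(x, t)
  := f(x, t) - \tfrac{1}{2}\, g(t)^2\, s_\theta(x, t), \\
\log p_0\big(x(0)\big) &= \log p_T\big(x(T)\big)
  + \int_0^T \nabla_x \cdot \tilde{f}_\theta\big(x(t), t\big)\, \mathrm{d}t .
\end{align}
```

In high dimensions the divergence is usually estimated stochastically rather than computed exactly, which adds noise to each log-density on top of any error in the learned score. A log importance weight would then be the difference of two such integrals, one for the unguided and one for the guided model, assuming the guided sampler also defines a valid ODE; whether that holds for the guidance used here is not stated on this page.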
Circularity Check
No circularity: likelihood ratios derived from model's own density estimates without reduction to inputs
full rationale
The paper proposes applying the diffusion model's learned probability density p(x) (approximated via the score function) to compute odds ratios p_guided(x)/p_unguided(x) for importance sampling of TCs. This is a direct use of the model's internal density estimator under different conditioning, not a self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The derivation chain starts from the standard diffusion model training objective and guidance mechanism (already established in prior literature) and applies it to extreme-event likelihoods without equations that equate the output back to the fitted inputs by construction. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known empirical patterns are invoked as load-bearing steps. The method is self-contained against the model's own density approximation, with acknowledged limitations on tail accuracy.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Diffusion models can accurately approximate the probability density function of the training data distribution.