pith. machine review for the scientific record.

arxiv: 2604.21903 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.AI

Recognition: unknown

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords spatiotemporal super-resolution · diffusion models · scale-adaptive framework · precipitation downscaling · climate reanalysis · conditional diffusion · mass conservation

The pith

The same architecture handles joint spatiotemporal super-resolution for factors from 1 to 25 in space and 1 to 6 in time by retuning only three hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reusable framework for jointly increasing both spatial and temporal resolution in climate data such as precipitation sequences. It splits the task into a deterministic attention-based prediction of the expected high-resolution output followed by a conditional diffusion model that adds realistic residual variability. Scale adaptivity is achieved by adjusting the diffusion noise amplitude, the length of temporal context, and optionally a mass-conservation transform rather than redesigning the network for each new pair of upscaling factors. This matters for climate applications because different studies and operational systems require downscaling at widely varying spatial resolutions and temporal cadences, making a single reusable architecture and tuning recipe valuable.
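The decomposition can be sketched in a few lines. Everything here is an illustrative stand-in: `deterministic_mean` plays the role of the paper's attention-based U-Net and `sample_residual` that of its conditional diffusion model; the names and the nearest-neighbor upsampling are mine, not the paper's code.

```python
import numpy as np

def upsample(lr, s_space, s_time):
    """Nearest-neighbor upsampling of an LR sequence (T, H, W) -- illustrative only."""
    return lr.repeat(s_time, axis=0).repeat(s_space, axis=1).repeat(s_space, axis=2)

def deterministic_mean(lr, s_space, s_time):
    """Stand-in for the attention-based U-Net predicting the conditional mean."""
    return upsample(lr, s_space, s_time)  # a real model would refine this estimate

def sample_residual(shape, beta, rng):
    """Stand-in for the conditional diffusion model: a stochastic residual
    whose amplitude grows with the noise-schedule amplitude beta."""
    return beta * rng.standard_normal(shape)

def super_resolve(lr, s_space, s_time, beta, rng):
    """Two-stage pipeline: deterministic conditional mean + stochastic residual."""
    mean = deterministic_mean(lr, s_space, s_time)
    return mean + sample_residual(mean.shape, beta, rng)

rng = np.random.default_rng(0)
lr = rng.random((4, 8, 8))  # low-res sequence: 4 frames of 8x8
hr = super_resolve(lr, s_space=5, s_time=2, beta=0.1, rng=rng)
print(hr.shape)  # (8, 40, 40)
```

The point of the split is that the stochastic knob (beta) lives entirely in the residual term, so scale adaptation never touches the mean predictor's architecture.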

Core claim

By decomposing joint spatiotemporal super-resolution into a deterministic prediction of the conditional mean with attention plus a residual conditional diffusion model, and by retuning only the diffusion noise schedule amplitude beta, the temporal context length L, and optionally the mass-conservation function f, the identical architecture successfully spans super-resolution factors from 1 to 25 spatially and 1 to 6 temporally on reanalysis precipitation data over France.

What carries the argument

Scale-adaptive framework that decomposes spatiotemporal SR into deterministic conditional-mean prediction with attention plus residual conditional diffusion model, adapted by retuning beta, L and optional mass-conservation f.

If this is right

  • The identical network architecture works for any super-resolution factor in the tested range without structural changes.
  • Increasing the diffusion noise schedule amplitude beta produces the greater output diversity needed at larger factors.
  • Adjusting the temporal context length L maintains comparable attention horizons as the temporal cadence changes.
  • Optional tapered mass-conservation preserves total precipitation amounts while limiting extreme-value amplification at large factors.
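The tapered mass-conservation idea can be made concrete with a toy transform: each high-resolution block is rescaled toward its low-resolution total, with a taper exponent softening the correction. The power-law form of the taper is my guess at what "tapered" means here, not the paper's actual function f.

```python
import numpy as np

def mass_conserve(hr, lr, s_space, s_time, taper=1.0):
    """Rescale each HR block so its mean matches the corresponding LR cell.
    taper=1 enforces exact conservation; taper<1 softens the correction,
    limiting amplification of extremes at large factors; taper=0 is a no-op.
    The functional form is illustrative -- the paper's f may differ."""
    T, H, W = lr.shape
    # Group HR pixels into (s_time x s_space x s_space) blocks, one per LR cell
    blocks = hr.reshape(T, s_time, H, s_space, W, s_space)
    block_mean = blocks.mean(axis=(1, 3, 5))
    ratio = np.where(block_mean > 0, lr / np.maximum(block_mean, 1e-8), 1.0)
    ratio = ratio ** taper  # tapered correction
    return (blocks * ratio[:, None, :, None, :, None]).reshape(hr.shape)
```

With taper=1 the block means of the output equal the LR field exactly; intermediate values trade conservation error against suppression of spurious extremes.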

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tuning recipe could transfer to other geophysical variables or regions where multi-scale joint downscaling is required.
  • It might reduce the need to retrain separate models when moving between different climate datasets or operational resolutions.
  • Similar hyperparameter-based adaptation could be tested on alternative diffusion backbones or non-diffusion generators.

Load-bearing premise

Larger super-resolution factors primarily increase underdetermination and required residual uncertainty without changing the structure of the conditional mean.

What would settle it

Finding that for some large SR factor the optimal deterministic predictor requires a substantially different architecture or attention mechanism than for small factors.

Figures

Figures reproduced from arXiv: 2604.21903 by Filippo Quarenghi, Mathieu Vrac, Max Defez, Stephan Mandt, Tom Beucler.

Figure 1. Scale-adaptive spatiotemporal VSR: a deterministic U-Net …
Figure 2. Qualitative example of spatiotemporal precipitation super-resolution produced by the proposed model. …
Figure 3. Proposed architecture for the deterministic U-Net. The encoder–decoder structure with convolutions, skip connections …
Figure 4. Schematic overview of the proposed super-resolution pipeline for precipitation. The workflow consists of …
Figure 5. Spatial partitioning of the study domain into four cross-validation folds. Each fold contains one geographical …
Figure 6. Illustration of Probability Integral Transform (PIT) behavior for under-dispersive and over-dispersive predictive …
read the original abstract

Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio between the low-resolution sequence and the high-resolution sequence), limiting transfer across spatial resolutions and temporal cadences (frame rates). We present a scale-adaptive framework that reuses the same architecture across factors by decomposing spatiotemporal SR into a deterministic prediction of the conditional mean, with attention, and a residual conditional diffusion model, with an optional mass-conservation (same precipitation amount in inputs and outputs) transform to preserve aggregated totals. Assuming that larger SR factors primarily increase underdetermination (hence required context and residual uncertainty) rather than changing the conditional-mean structure, scale adaptivity is achieved by retuning three factor-dependent hyperparameters before retraining: the diffusion noise schedule amplitude beta (larger for larger factors to increase diversity), the temporal context length L (set to maintain comparable attention horizons across cadences) and optionally a third, the mass-conservation function f (tapered to limit the amplification of extremes for large factors). Demonstrated on reanalysis precipitation over France (Comephore), the same architecture spans super-resolution factors from 1 to 25 in space and 1 to 6 in time, yielding a reusable architecture and tuning recipe for joint spatiotemporal super-resolution across scales.
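The perfect-model setting implied by the abstract pairs HR fields with LR inputs produced by deterministic block-average coarsening (the paper's appendix formalizes this, assuming the spatial factor divides the grid dimensions). A minimal numpy sketch, with the function name my own:

```python
import numpy as np

def coarsen(hr, s_space, s_time):
    """Block-average an HR sequence (T, H, W) down by the given SR factors,
    assuming the factors divide the sequence dimensions (perfect-model setting)."""
    T, H, W = hr.shape
    assert T % s_time == 0 and H % s_space == 0 and W % s_space == 0
    return hr.reshape(T // s_time, s_time,
                      H // s_space, s_space,
                      W // s_space, s_space).mean(axis=(1, 3, 5))
```

Averaging (rather than subsampling) is what makes a mass-conservation constraint on the inverse map well posed: the LR cell value is exactly the mean of the HR block it came from.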

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a scale-adaptive framework for joint spatiotemporal super-resolution of fields such as precipitation using diffusion models. The approach decomposes the task into an attention-based deterministic prediction of the conditional mean and a residual conditional diffusion model, with an optional mass-conservation transform. By retuning only three hyperparameters—the diffusion noise schedule amplitude beta, the temporal context length L, and optionally the mass-conservation function f—the same architecture is reused for super-resolution factors ranging from 1 to 25 in space and 1 to 6 in time, as demonstrated on the Comephore reanalysis dataset over France.

Significance. If the central assumption holds and the empirical results support the reusability, this work could have high significance for climate science applications by providing a flexible, reusable model that avoids the need to design and train separate models for each combination of spatial and temporal upscaling factors. The emphasis on physical consistency through mass conservation and the decomposition strategy are notable strengths.

major comments (2)
  1. The reusability claim depends on the assumption that larger SR factors primarily increase underdetermination rather than changing the conditional-mean structure; however, the abstract provides no quantitative metrics, ablation studies, or cross-factor comparisons of the deterministic outputs to validate this invariance, which is load-bearing for the scale-adaptive framework.
  2. The decomposition into deterministic mean predictor and residual diffusion is presented as enabling adaptivity via retuning beta, L, and f, but without evidence that the attention mechanism's ability to capture the mean structure remains consistent across scales (e.g., 2x vs 25x spatial), the claim that only these three parameters need adjustment is not yet substantiated.
minor comments (2)
  1. The description of the mass-conservation function f as 'tapered to limit the amplification of extremes for large factors' could be clarified with a specific functional form or equation.
  2. Comparison to existing joint spatiotemporal SR methods for specific factors would strengthen the motivation for the scale-adaptive approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive evaluation of the work's potential significance for climate applications. We address each major comment below and will revise the manuscript to provide the requested evidence.

read point-by-point responses
  1. Referee: The reusability claim depends on the assumption that larger SR factors primarily increase underdetermination rather than changing the conditional-mean structure; however, the abstract provides no quantitative metrics, ablation studies, or cross-factor comparisons of the deterministic outputs to validate this invariance, which is load-bearing for the scale-adaptive framework.

    Authors: We agree that the central assumption requires explicit validation beyond the abstract statement. Although the manuscript demonstrates successful application across scales 1-25 spatially and 1-6 temporally, we will add a dedicated subsection with quantitative metrics (MSE, SSIM, and bias on conditional-mean predictions) and cross-factor comparisons of the deterministic outputs to directly support the invariance of the mean structure. revision: yes

  2. Referee: The decomposition into deterministic mean predictor and residual diffusion is presented as enabling adaptivity via retuning beta, L, and f, but without evidence that the attention mechanism's ability to capture the mean structure remains consistent across scales (e.g., 2x vs 25x spatial), the claim that only these three parameters need adjustment is not yet substantiated.

    Authors: We acknowledge that additional evidence is needed to substantiate consistency of the attention-based mean predictor. We will include new ablation results and attention-map visualizations comparing performance at small (e.g., 2x) and large (e.g., 25x) spatial scales, showing that the core mean-structure capture remains stable while scale-dependent effects are absorbed by the retuned diffusion component. revision: yes

Circularity Check

0 steps flagged

No circularity; reusability rests on explicit assumption and empirical demonstration

full rationale

The paper states an assumption that larger SR factors increase underdetermination without altering conditional-mean structure, then achieves scale adaptivity via retuning of beta, L, and optionally f. This is presented as a modeling choice followed by demonstration on Comephore precipitation data across factors 1-25 (space) and 1-6 (time). No equations reduce the architecture or reusability claim to a self-definition, fitted input renamed as prediction, or self-citation chain. The derivation chain is self-contained against the external benchmark of multi-factor performance on held-out reanalysis fields.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about how SR factors affect uncertainty structure plus three tunable hyperparameters whose values are chosen per factor; no new entities are postulated.

free parameters (3)
  • beta (diffusion noise schedule amplitude)
    Retuned larger for larger factors to increase output diversity
  • L (temporal context length)
    Adjusted to maintain comparable attention horizons across different temporal cadences
  • f (mass-conservation function)
    Tapered optionally for large factors to limit extreme amplification
axioms (1)
  • domain assumption Larger SR factors primarily increase underdetermination rather than changing the conditional-mean structure
    This premise justifies reusing the same architecture and only retuning the three hyperparameters
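Taken together, the ledger amounts to a per-factor tuning recipe. The scaling rules below are plausible placeholders consistent with the stated directions (beta grows with factor, L shrinks as cadence coarsens, conservation tapered at large factors); the constants and functional forms are not the paper's published settings.

```python
import math

def tuning_recipe(s_space, s_time, base_step_h=1.0, horizon_h=6.0):
    """Hedged sketch of the three-knob recipe for SR factors (s_space, s_time)."""
    # Larger joint factor -> more underdetermination -> more residual diversity
    beta = 0.02 * math.sqrt(s_space * s_time)
    # Keep the attention horizon (in physical hours) roughly constant:
    # coarser cadence means each LR frame spans more time, so fewer frames needed
    L = max(1, round(horizon_h / (base_step_h * s_time)))
    # Soften mass conservation at large spatial factors to limit extreme amplification
    taper = 1.0 if s_space <= 5 else 0.5
    return {"beta": beta, "L": L, "taper": taper}
```

For example, `tuning_recipe(1, 1)` keeps a long context (L = 6) with minimal noise, while `tuning_recipe(25, 6)` shortens the context to a single LR frame and raises beta, mirroring the qualitative behavior the ledger describes.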

pith-pipeline@v0.9.0 · 5569 in / 1322 out tokens · 53269 ms · 2026-05-09T21:49:49.821563+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references

  1. [1] Hongying Liu, Zhubo Ruan, Peng Zhao, Chao Dong, Fanhua Shang, Yuanyuan Liu, Linlin Yang, and Radu Timofte. Video super-resolution based on deep learning: a comprehensive survey. Artificial Intelligence Review, 55(8):5981–6035, Dec 2022.

  2. [2] Le Zhang, Ao Li, Qibin Hou, Ce Zhu, and Yonina C. Eldar. Deep learning empowered super-resolution: A comprehensive survey and future prospects, 2025.

  3. [3] Subhadra Gopalakrishnan and Anustup Choudhury. A 'deep' review of video super-resolution. Signal Processing: Image Communication, 129:117175, 2024.

  4. [4] Daniel Schertzer and Shaun Lovejoy. Physical modeling and analysis of rain and clouds by anisotropic scaling multiplicative processes. Journal of Geophysical Research: Atmospheres, 92(D8):9693–9714, 1987.

  5. [5] Susana Ochoa-Rodriguez, Li-Pen Wang, Auguste Gires, Rui Daniel Pina, Ricardo Reinoso-Rondinel, Guendalina Bruni, Abdellah Ichiba, Santiago Gaitan, Elena Cristiano, Johan van Assel, Stefan Kroll, Damian Murlà-Tuyls, Bruno Tisserand, Daniel Schertzer, Ioulia Tchiguirinskaia, Christian Onof, Patrick Willems, and Marie-Claire ten Veldhuis. Impact of spatial and temporal resolution of rainfall inputs on urban hydrodynamic modelling outputs: A multi-catchment investigation. Journal of Hydrology, 531:389–407, 2015.

  6. [6] E. Cristiano, M.-C. ten Veldhuis, and N. van de Giesen. Spatial and temporal variability of rainfall and their effects on hydrological response in urban areas – a review. Hydrology and Earth System Sciences, 21(7):3859–3878, 2017.

  7. [7] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, 2015.

  8. [8] Jussi Leinonen, Daniele Nerini, and Alexis Berne. Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Transactions on Geoscience and Remote Sensing, 59(9):7211–7223, September 2021.

  9. [9] Luca Glawion, Julius Polz, Harald Kunstmann, Benjamin Fersch, and Christian Chwala. Global spatio-temporal ERA5 precipitation downscaling to km and sub-hourly scale using generative AI. npj Climate and Atmospheric Science, 8(1):219, 2025.

  10. [10] E. Tomasi, G. Franch, and M. Cristoforetti. Can AI be enabled to perform dynamical downscaling? A latent diffusion model to mimic kilometer-scale COSMO5.0_CLM9 simulations. Geoscientific Model Development, 18(6):2051–2078, 2025.

  11. [11] Demin Yu, Xutao Li, Yunming Ye, Baoquan Zhang, Chuyao Luo, Kuai Dai, Rui Wang, and Xunlai Chen. DiffCast: A unified framework via residual diffusion for precipitation nowcasting, 2024.

  12. [12] Prakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Bretherton, and Stephan Mandt. Precipitation downscaling with spatiotemporal video diffusion, 2024.

  13. [13] Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash Vahdat, Mohammad Amin Nabian, Tao Ge, Akshay Subramaniam, Karthik Kashinath, Jan Kautz, and Mike Pritchard. Residual corrective diffusion modeling for km-scale atmospheric downscaling, 2024.

  14. [14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation, 2015.

  15. [15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

  16. [16] Paula Harder, Alex Hernandez-Garcia, Venkatesh Ramesh, Qidong Yang, Prasanna Sattigeri, Daniela Szwarcman, Campbell Watson, and David Rolnick. Hard-constrained deep learning for climate downscaling, 2024.

  17. [17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.

  18. [18] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022.

  19. [19] Cristian Martinez-Villalobos and J. David Neelin. Why do precipitation intensities tend to follow gamma distributions? Journal of the Atmospheric Sciences, 76(11):3611–3631, 2019.

  20. [20] Alan Basist, Gerald D. Bell, and Vernon Meentemeyer. Statistical relationships between topography and precipitation patterns. Journal of Climate, 7(9):1305–1315, 1994.

  21. [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

  22. [22] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts, 2017.

  23. [23] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.

  24. [24] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models, 2021.

  25. [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.

  26. [26] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution, 2017.