Physics-Constrained Adaptive Flow Matching for Climate Downscaling

Aytaç Paçal; Kevin Debeire; Luis Medrano-Navarro; Nils Thuerey; Pierre Gentine; Veronika Eyring

arxiv: 2604.03459 · v1 · submitted 2026-04-03 · ⚛️ physics.ao-ph · cs.LG


Pith reviewed 2026-05-13 18:00 UTC · model grok-4.3


keywords: climate downscaling · flow matching · physics constraints · generative modeling · precipitation bias · out-of-distribution generalization · conservation laws


The pith

Physics-constrained adaptive flow matching halves precipitation wet bias in out-of-distribution climate downscaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution regional climate data is needed to assess change impacts, yet global models are too costly to run at kilometer scales. Machine learning alternatives often break physical laws like conservation and produce large errors when applied to climates unlike their training data. PC-AFM starts from adaptive flow matching and adds soft constraints that enforce consistency between downscaled precipitation and humidity fields and the coarse input. Gradient surgery via the ConFIG algorithm prevents these constraints from harming the generative training objective. On Central European training data the model improves conservation and ensemble calibration while matching baselines on skill scores; on two held-out regions it cuts precipitation wet bias in half, lowers conservation error, and sharpens extreme-quantile accuracy without any access to target-climate statistics at inference time.

Core claim

PC-AFM augments adaptive flow matching with soft conservation constraints on precipitation and specific humidity, resolved against the generative objective by ConFIG gradient surgery. Trained on Central Europe data for 10-fold downscaling of six variables, the model matches or exceeds the unconstrained baseline inside the training distribution on standard metrics while reducing conservation errors. On two held-out climate regions it halves precipitation wet bias, reduces conservation error, and improves extreme-quantile accuracy without receiving any information about the target climate at inference.

What carries the argument

Soft conservation constraints on precipitation and humidity combined with ConFIG gradient surgery inside an adaptive flow matching generator.
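
To make the idea of a soft conservation constraint concrete, here is a minimal, dependency-free sketch. It assumes the constraint compares a block average of the high-resolution output with the coarse input and penalizes the mean squared mismatch. The paper's exact operator and norm are not given in the text above, so `coarse_grain`, `conservation_penalty`, and the toy factor of 2 are illustrative assumptions (the paper's task uses a factor of 10).

```python
def coarse_grain(field, factor):
    """Block-average a 2D field (list of rows) by an integer factor.

    Assumption: a plain block mean stands in for the coarse-graining operator.
    """
    ny, nx = len(field), len(field[0])
    if ny % factor or nx % factor:
        raise ValueError("field shape must be divisible by factor")
    pooled = []
    for i in range(0, ny, factor):
        row = []
        for j in range(0, nx, factor):
            block = [field[a][b]
                     for a in range(i, i + factor)
                     for b in range(j, j + factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def conservation_penalty(high_res, low_res, factor):
    """Mean squared mismatch between the pooled output and the coarse input."""
    pooled = coarse_grain(high_res, factor)
    sq = [(p - c) ** 2
          for prow, crow in zip(pooled, low_res)
          for p, c in zip(prow, crow)]
    return sum(sq) / len(sq)


# Toy example: a 2x2 coarse field and two 4x4 candidate outputs.
low = [[1.0, 2.0], [3.0, 4.0]]
consistent = [[low[i // 2][j // 2] for j in range(4)] for i in range(4)]
biased = [[v + 0.5 for v in row] for row in consistent]  # uniform wet bias

print(conservation_penalty(consistent, low, 2))  # 0.0
print(conservation_penalty(biased, low, 2))      # 0.25
```

Because the penalty is computed from the coarse input alone, it needs no statistics of the target climate, which is consistent with the inference-time claim above.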

If this is right

Downscaled fields remain consistent with large-scale mass and moisture budgets even under unseen climate conditions.
Extreme precipitation quantiles are recovered more accurately without explicit training on target-region extremes.
Generative downscaling becomes usable for future climate scenarios without requiring retraining on those scenarios.
Ensemble calibration improves because systematic extrapolation errors are suppressed by the constraints.
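
Ensemble calibration of the kind mentioned above is commonly inspected with a rank histogram. The sketch below is a generic illustration, not the paper's evaluation code: `rank_histogram` is a hypothetical helper, and ties are ignored for brevity (for precipitation, with many exact zeros, a random tie-break would be needed in practice).

```python
def rank_histogram(ensembles, observations):
    """Count where each observation falls within its sorted ensemble.

    ensembles: list of equally sized lists of member values, one per case.
    observations: list of observed values, one per case.
    Returns a list of length n_members + 1; a flat histogram suggests a
    well-calibrated ensemble.
    """
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # Rank = number of members strictly below the observation.
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts


# Toy example with a three-member ensemble.
ens = [[1.0, 2.0, 3.0]] * 4
print(rank_histogram(ens, [0.5, 1.5, 2.5, 3.5]))  # [1, 1, 1, 1]
print(rank_histogram(ens, [0.5, 0.5, 0.5, 0.5]))  # [4, 0, 0, 0]
```

In the second call every observation lies below all members, the signature of a systematic wet bias: the histogram piles up in the lowest bin.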

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same constraint-plus-surgery pattern could be added to other generative architectures used for physics-constrained simulation tasks.
Quantifying the distance between training and test climates would allow clearer statements about how far the generalization extends.
If the constraints prove robust across many regions, high-resolution impact studies could be run on demand without region-specific fine-tuning.

Load-bearing premise

The soft conservation constraints continue to enforce physical consistency effectively when the input climate lies outside the training distribution.

What would settle it

A test on a held-out region whose precipitation statistics differ markedly from Central Europe in which the halved wet bias disappears or conservation error rises above the unconstrained baseline.

Figures

Figures reproduced from arXiv: 2604.03459 by Aytaç Paçal, Kevin Debeire, Luis Medrano-Navarro, Nils Thuerey, Pierre Gentine, Veronika Eyring.

**Figure 1.** Geographic domains used for training and evaluation. Evaluation diagnostics for all three domains are computed using ESMValTool. The Central Europe domain (blue) is used for training. The Iberian Peninsula (green) and Northern Europe region (orange) are withheld from training and used exclusively for out-of-distribution evaluation.

**Figure 2.** Overview of the PC-AFM architecture and training procedure. Top: At inference, the low-resolution input (32×32, 63 km) is bilinearly upsampled and passed through the learned encoder Eψ to produce an initial high-resolution estimate x̂0. A stochastic interpolant xt = (1 − t)x̂0 + t·x1 + σt·ε is constructed and refined by the denoiser Dθ, conditioned on the low-resolution input and noise level σt. Fifty denoi…

**Figure 3.** Relative performance of PC-AFM versus AFM-baseline for the Central Europe training region. Each cell shows the ratio of PC-AFM to AFM-baseline; values below 1 (green) indicate improvement. Conservation error is not applicable (“–”) for variables without an explicit conservation constraint. Bold entries summarize row and column averages.

**Figure 4.** Precipitation (pr) evaluation for the Central Europe training region. (A) Spatial maps of time-mean bias, relative bias, CRPS, and conservation error for AFM-baseline (top) and PC-AFM (bottom). (B) Radially averaged power spectral density. (C) Log-transformed marginal PDF. (D) Rank histograms with MCB. PC-AFM halves the conservation error and improves ensemble calibration (MCB: 0.767 to 0.523) while mainta…

**Figure 5.** Quantile MAE relative performance (PC-AFM / AFM-baseline) for impact-relevant diagnostics in the Central Europe training region. Average ratio: 0.71.

**Figure 6.** Relative performance of PC-AFM versus AFM-baseline for the Northern Europe region (unseen during training). Layout as in …

**Figure 7.** Precipitation evaluation for the Northern Europe region (unseen during training). Layout as in …

**Figure 8.** Quantile MAE relative performance for the Northern Europe region (unseen during training). Average ratio: 0.77.

**Figure 9.** Relative performance of PC-AFM versus AFM-baseline for the Iberian Peninsula (unseen during training). Layout as in …

**Figure 10.** Precipitation evaluation for the Iberian Peninsula (unseen during training). Layout as in …

**Figure 11.** Quantile MAE relative performance for the Iberian Peninsula (unseen during training). Average ratio: 0.80.
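
The stochastic interpolant quoted in the Figure 2 caption, xt = (1 − t)x̂0 + t·x1 + σt·ε, can be sketched in a few lines. This is an illustration under stated assumptions: fields are flattened to plain lists, ε is standard Gaussian noise, and the noise level `sigma_t` is passed in by the caller because its schedule is not given in the text above. The function name is hypothetical.

```python
import random


def stochastic_interpolant(x0_hat, x1, t, sigma_t, rng):
    """Blend the encoder estimate x0_hat with the target x1 at time t.

    Implements x_t = (1 - t) * x0_hat + t * x1 + sigma_t * eps elementwise,
    with eps drawn from a standard normal distribution.
    sigma_t is supplied by the caller (assumption: schedule unspecified here).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b + sigma_t * rng.gauss(0.0, 1.0)
            for a, b in zip(x0_hat, x1)]


rng = random.Random(0)
x0_hat = [1.0, 2.0]   # stand-in for the encoder's initial estimate
x1 = [3.0, 4.0]       # stand-in for the clean high-resolution target

# Without noise the interpolant is a straight line between the endpoints.
print(stochastic_interpolant(x0_hat, x1, 0.5, 0.0, rng))  # [2.0, 3.0]
# With noise the midpoint is perturbed.
print(stochastic_interpolant(x0_hat, x1, 0.5, 0.1, rng))
```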

Original abstract

Regional climate information at kilometer scales is essential for assessing the impacts of climate change, but generating it with global climate models is too expensive due to their high computational costs. Machine learning models offer a fast alternative, yet they often violate basic physical laws and degrade when applied to climates outside of their training distribution. We present Physics-Constrained Adaptive Flow Matching (PC-AFM), a generative downscaling model that addresses both problems. Building on the Adaptive Flow Matching (AFM) model of Fotiadis et al. (2025) as our baseline, we add soft conservation constraints that keep the downscaled output consistent with the large-scale input for precipitation and humidity, and use gradient surgery via the ConFIG algorithm to prevent these constraints from interfering with the generative objective. We train the model on Central Europe climate data, evaluate it on a 10-time downscaling task (63km to 6.3km) over six variables (near-surface temperature, precipitation, specific humidity, surface pressure, and horizontal wind components) across a comprehensive set of metrics including bias, ensemble skill scores, power spectra, and conservation error, and test the generalization on two held-out climate regions. Within the training distribution, PC-AFM reduces conservation errors and improves ensemble calibration while matching the baseline on standard skill metrics. Outside the training distribution, where unconstrained models develop large systematic errors by extrapolating learned statistics, PC-AFM halves precipitation wet bias, reduces conservation error and improves extreme-quantile accuracy, all without any information about the target climate at inference time. These results indicate that physical consistency is a practical requirement for deploying generative downscaling models in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PC-AFM layers conservation penalties and ConFIG surgery onto adaptive flow matching, delivering lower conservation errors and halved wet bias on held-out regions, but the OOD claim needs explicit domain-shift numbers to hold up.


PC-AFM layers soft conservation penalties on precipitation and humidity, plus ConFIG gradient surgery, onto the adaptive flow matching backbone. The result is a downscaler that keeps outputs physically consistent with the coarse input while cutting errors when the climate moves away from training conditions. That specific combination is the concrete addition over the cited AFM baseline.

Inside the Central Europe training domain the model matches the baseline on skill scores while lowering conservation error and improving ensemble calibration. On the two held-out regions it halves precipitation wet bias, reduces conservation violations, and improves extreme-quantile accuracy without any target-climate information at inference. Those are the usable gains for impact work.

The soft spot is the missing quantification of how different the held-out regions actually are. The abstract calls them held-out climate regions, yet supplies no distributional distances or moment comparisons on the six input variables. Without that, the observed improvements could still be interpolation rather than the claimed extrapolation robustness. If the full paper includes those diagnostics, the case strengthens; otherwise the central OOD story stays plausible but not fully pinned down.

Readers who build or apply generative downscaling for regional climate impacts will find the recipe worth testing. The approach is concrete, the constraints are defined from the input rather than the target, and the results point to a practical fix for unphysical drift. The paper shows clear engagement with the mechanics of the model and the application constraints, so it deserves a serious referee. Send it to review and ask for the domain-shift metrics and the exact constraint-weight values in the methods.
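
The domain-shift numbers requested in this letter could look like the sketch below, computed per input variable. It is a generic illustration, not something the paper reports: `wasserstein_1d` uses the fact that, for two equally sized one-dimensional samples, the Wasserstein-1 distance equals the mean absolute difference of the sorted values, and `shift_report` is a hypothetical helper name.

```python
from statistics import fmean, pvariance


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equally sized 1D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized samples")
    return fmean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def shift_report(train, test):
    """Moment differences and transport distance for one input variable."""
    return {
        "mean_diff": fmean(test) - fmean(train),
        "var_ratio": pvariance(test) / pvariance(train),
        "wasserstein": wasserstein_1d(train, test),
    }


# Toy example: the held-out sample is the training sample shifted by 2.
train = [0.0, 1.0, 2.0, 3.0]
held_out = [2.0, 3.0, 4.0, 5.0]
print(shift_report(train, held_out))
# {'mean_diff': 2.0, 'var_ratio': 1.0, 'wasserstein': 2.0}
```

A pure mean shift shows up in `mean_diff` and `wasserstein` while leaving `var_ratio` at 1; reporting all three per variable helps separate a shifted climate from a merely wider or narrower one.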

Referee Report

2 major / 2 minor

Summary. The paper presents Physics-Constrained Adaptive Flow Matching (PC-AFM), extending the AFM baseline of Fotiadis et al. (2025) by adding soft conservation constraints on precipitation and humidity to a generative flow-matching model for 10× climate downscaling (63 km to 6.3 km), with the ConFIG gradient-surgery algorithm used to keep the constraint gradients from interfering with the generative objective. Trained on Central European data for six near-surface variables, the model is evaluated on bias, ensemble skill, power spectra, and conservation error. The central claim is that PC-AFM matches the baseline on standard skill metrics inside the training distribution while halving precipitation wet bias, lowering conservation error, and improving extreme-quantile accuracy on two held-out climate regions, all without target-climate information at inference.

Significance. If the OOD robustness result holds, the work is significant for practical climate-impact applications, where generative downscalers must remain physically consistent and avoid large systematic errors when applied to unseen climates. The explicit use of ConFIG to prevent constraint–generative-objective interference is a concrete technical contribution, and the breadth of reported metrics (including conservation error) provides a stronger basis for assessing physical fidelity than is common in the literature.

major comments (2)

[Abstract and OOD evaluation section] The headline OOD claim (halving of wet bias and improved extremes outside the training distribution) rests on results from only two held-out regions, yet the manuscript supplies no quantitative distributional-shift diagnostics (Wasserstein distance, mean/variance differences, or similar) on the six input variables between the Central Europe training domain and the test regions. Without such metrics it is impossible to determine whether the held-out cases constitute genuine extrapolation or lie inside a similar climate manifold, directly undermining the generalization argument.
[Methods (constraint implementation and loss formulation)] The description of the physics constraints (soft penalties on precipitation and humidity consistency with large-scale inputs, resolved by ConFIG) omits the numerical values of the constraint weights, the precise additive form of the composite loss, and any ablation or sensitivity analysis on those weights. These omissions make it impossible to reproduce the reported factor-of-two bias reduction or to assess whether the constraints remain non-degrading under stronger distributional shifts.

minor comments (2)

[Evaluation metrics] The conservation-error metric should be defined explicitly (including the exact variables and integration domain) in the methods or appendix so that readers can interpret the numerical reductions.
[Results figures] Power-spectrum figures would benefit from ensemble-spread shading or error bars to allow visual assessment of whether the reported improvements are statistically distinguishable from the baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of our work. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.


Referee: [Abstract and OOD evaluation section] The headline OOD claim (halving of wet bias and improved extremes outside the training distribution) rests on results from only two held-out regions, yet the manuscript supplies no quantitative distributional-shift diagnostics (Wasserstein distance, mean/variance differences, or similar) on the six input variables between the Central Europe training domain and the test regions. Without such metrics it is impossible to determine whether the held-out cases constitute genuine extrapolation or lie inside a similar climate manifold, directly undermining the generalization argument.

Authors: We agree that quantitative distributional-shift diagnostics are necessary to strengthen the OOD generalization claims. In the revised manuscript we will add Wasserstein distances together with mean and variance differences computed on all six input variables between the Central European training domain and each of the two held-out test regions. These metrics will be reported in a new table or figure in the OOD evaluation section, allowing readers to assess the degree of extrapolation. While the two regions were deliberately chosen to span distinct climate regimes (different precipitation climatologies and temperature ranges), we acknowledge that the explicit diagnostics will make the extrapolation argument more rigorous. revision: yes
Referee: [Methods (constraint implementation and loss formulation)] The description of the physics constraints (soft penalties on precipitation and humidity consistency with large-scale inputs, resolved by ConFIG) omits the numerical values of the constraint weights, the precise additive form of the composite loss, and any ablation or sensitivity analysis on those weights. These omissions make it impossible to reproduce the reported factor-of-two bias reduction or to assess whether the constraints remain non-degrading under stronger distributional shifts.

Authors: We thank the referee for identifying these omissions. In the revised Methods section we will explicitly state the numerical values of the constraint weights used for the soft penalties on precipitation and humidity. We will also write out the precise additive form of the composite loss (generative flow-matching term plus the two constraint terms after ConFIG gradient surgery). Finally, we will add a sensitivity analysis and ablation study varying the constraint weights, reporting the resulting changes in bias, conservation error, and extreme-quantile accuracy. These additions will enable full reproducibility and allow assessment of robustness under distributional shifts. revision: yes
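
For readers unfamiliar with gradient surgery, the sketch below illustrates the two-gradient case of a ConFIG-style update as it is commonly described: the update direction is the normalized sum of the unit vectors of each gradient's component orthogonal to the other, scaled by the sum of the gradients' projections onto that direction. It is an assumption-laden illustration, not the authors' implementation. The manuscript combines a generative term with two constraint terms, which requires the general multi-gradient form, and the function names here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def orthogonal_part(g_ref, g):
    """Component of g that is orthogonal to g_ref."""
    scale = dot(g_ref, g) / dot(g_ref, g_ref)
    return [x - scale * r for x, r in zip(g, g_ref)]


def config_two(g1, g2):
    """Conflict-free update for two gradients (sketch of the two-term case).

    Assumes g1 and g2 are non-zero and not parallel; otherwise an
    orthogonal part vanishes and the normalization fails.
    """
    direction = unit([a + b for a, b in zip(unit(orthogonal_part(g1, g2)),
                                            unit(orthogonal_part(g2, g1)))])
    magnitude = dot(g1, direction) + dot(g2, direction)
    return [magnitude * d for d in direction]


# Toy example: the two gradients conflict (negative inner product).
g_gen = [1.0, 0.0]    # stand-in for the generative-loss gradient
g_con = [-0.5, 1.0]   # stand-in for a constraint-loss gradient
update = config_two(g_gen, g_con)
print(dot(g_gen, g_con) < 0)                          # True: conflicting
print(dot(g_gen, update) > 0, dot(g_con, update) > 0)  # True True
```

The resulting update has a positive inner product with both gradients, so a small step along it does not increase either loss to first order; this is the sense in which the surgery keeps the constraints from harming the generative objective.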

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper extends the prior AFM baseline (Fotiadis et al. 2025) by adding soft conservation constraints defined directly from large-scale input fields for precipitation and humidity, combined with ConFIG gradient surgery. These constraints are not fitted to target outputs or defined in terms of the claimed performance metrics. Generalization results on held-out regions are empirical evaluations rather than predictions forced by construction from training data. No self-definitional equations, renamed known results, or load-bearing self-citations that reduce the central claims to tautologies appear in the provided text. The design choices for constraint variables and penalties are acknowledged as modeling decisions but do not collapse the reported improvements into input equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the standard assumption that large-scale climate fields already satisfy conservation to within acceptable error, plus two design choices (which variables receive constraints and the relative weighting of the constraint loss) that are not derived from first principles.

free parameters (1)

constraint_weight
Relative strength of the soft conservation penalty versus the generative loss; value not stated in abstract.

axioms (1)

Domain assumption: Large-scale input fields conserve total precipitation and humidity mass to within model error.
Invoked when the soft constraint is defined to match the coarse input.



Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Addison, H., Kendon, E., Ravuri, S., Aitchison, L., & Watson, P. A. (2024, July). Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model. Retrieved 2025-06-10, from https://arxiv.org/abs/2407.14158v2
Albergo, M. S., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. I...

work page · doi:10.1063/5.0304492 · 2024
[2]

doi: 10.48550/arXiv.2506.08604
Baño-Medina, J., Manzanas, R., & Gutiérrez, J. M. (2020, April). Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development, 13(4), 2109–2124. Retrieved from https://gmd.copernicus.org/articles/13/2109/2020/ doi: 10.5194/gmd-13-2109-2020
Bernini, L., Laga...

work page · doi:10.48550/arxiv.2506.08604 · 2020
[3]

Retrieved 2024-11-21, from https://www.nature.com/articles/s41597-023-02805-9 doi: 10.1038/s41597-023-02805-9
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. In International Conference on Learning Representations (ICLR).
Liu, Q., Cai, Z., & Zhu, Y. (2024). ConFIG: Towards conflict-free training...

work page · internal anchor · doi:10.1038/s41597-023-02805-9 · 2024
[4]

Retrieved 2025-04-10, from https://doi.org/10.1186/s40645-019-0304-z doi: 10.1186/s40645-019-0304-z
Vandal, T., Kodra, E., Ganguly, S., Michaelis, A., Nemani, R., & Ganguly, A. R. (2017). DeepSD: Generating high fidelity daily climate projections using deep learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery a...

work page · doi:10.1186/s40645-019-0304-z · 2025
