pith. sign in

arxiv: 2603.03239 · v3 · submitted 2026-03-03 · 💻 cs.CV

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Pith reviewed 2026-05-15 16:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords Earth observationlatent diffusiongenerative modelingmultimodalconditional generationSentinel-2data synthesiscross-modal translation
0
0 comments X

The pith

COP-GEN models cross-modal Earth observation relationships as conditional distributions to generate diverse, physically consistent samples across sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COP-GEN, a multimodal latent diffusion transformer that treats relationships between optical, radar, elevation, and land-cover data as conditional distributions instead of single deterministic outputs. This addresses the non-injective nature of sensor mappings, where the same input can correspond to multiple valid observations, and avoids the collapse to conditional means seen in prior models. The approach supports any-to-any generation, including zero-shot modality translation, while preserving physical consistency. On a new multi-temporal Sentinel-2 benchmark, the model covers 90% of the real observation manifold and 63% of per-band reflectance range, compared to 2.8% and 18% for the strongest baseline.

Core claim

COP-GEN is a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at native spatial resolutions by parameterizing cross-modal mappings as conditional distributions, enabling flexible any-to-any conditional generation including zero-shot modality translation without task-specific retraining, while producing diverse yet physically consistent realizations that capture meaningful cross-modal structure and adapt output uncertainty to the conditioning information.

What carries the argument

The multimodal latent diffusion transformer that parameterizes cross-modal mappings as conditional distributions for joint modeling of Earth observation modalities.

Load-bearing premise

That the multi-temporal Sentinel-2 benchmark represents the full joint distribution of heterogeneous Earth observation modalities and that visual plus quantitative checks confirm physical consistency of the generated samples.

What would settle it

A new expanded benchmark or physical-law violation test showing that COP-GEN samples fall outside plausible reflectance or backscatter ranges or cover less of the real manifold than the strongest competing method.

Figures

Figures reproduced from arXiv: 2603.03239 by Elliot J. Crowley, Eva Gmelich Meijling, Miguel Espinosa, Mikolaj Czerkawski, Valerio Marsocci.

Figure 1
Figure 1. Figure 1: Conditional generation of Sentinel-2 L2A imagery from topography (DEM) and land-cover (LULC) inputs. We condition COP-GEN generations on DEM and LULC inputs (geographic location is provided solely for visual reference). COP-GEN produces diverse and physically consistent outputs, demonstrating variability in spectral appearance, illumination, and atmospheric conditions while preserving topographic and land-… view at source ↗
Figure 2
Figure 2. Figure 2: COP-GEN architecture, training, and inference overview. Multimodal inputs (optical, radar, elevation, land￾cover, geolocation, and timestamps) are encoded into latent representations using modality-specific VAEs (or directly tokenized for scalar inputs). All tokens, augmented with modality-specific diffusion timestep embeddings, are processed by a shared transformer diffusion backbone. The model is trained… view at source ↗
Figure 3
Figure 3. Figure 3: Geospatial Distribution Analysis. We predict latitude–longitude coordinates conditioned on DEM and LULC inputs (n = 50 runs). TerraMind (blue) collapses to a few locations, whereas COP-GEN (green) predicts a distribution of plausible locations that share similar topographic and biome characteristics, accurately reflecting the non-injective nature of the mapping. A Koppen–Geiger climate classification basem… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution spread narrowing by progressively increasing input conditioning. As additional modalities are provided as input, the generated samples better align with the ground-truth (GT) distribution. For each conditioning setup, we generate 25 stochastic samples of Sentinel-2 L2A (S2L2A) imagery and report the predicted band distributions using histograms and kernel density estimates (KDEs). One spectral… view at source ↗
Figure 5
Figure 5. Figure 5: Per-pixel spectral reflectance profiles across multiple LULC classes. Conditioned on DEM and LULC inputs, COP-GEN generates multispectral S2L2A imagery that captures physically consistent spectral relationships. The plots com￾pare spectral profiles from selected pixel locations in both real and generated images across the Sentinel-2 bands. The close alignment for trees, bare soil, water, crops, built-up ar… view at source ↗
Figure 6
Figure 6. Figure 6: Band infilling via resolution-specific latent groups. By grouping Sentinel-2 bands according to resolution, COP￾GEN treats spectral groups as independent modalities. Here, the model is conditioned only on the high-resolution visible band group (B2, B3, B4, B8) and successfully reconstructs the remaining spectral bands, auxiliary sensors (S1RTC, DEM), LULC maps, cloud mask, timestamp, and location. The high… view at source ↗
Figure 7
Figure 7. Figure 7: Multiple conditional Sentinel-2 L2A generations from DEM and LULC inputs. Generated outputs per scene demonstrates COP-GEN’s ability to model multimodal variability in spectral response, illumination, and atmospheric condi￾tions, while maintaining consistency with underlying terrain and land-cover information [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Conditional generation of Sentinel-2 L2A imagery from DEM and LULC inputs. For each location, a grid of generated examples illustrates the diversity of COP-GEN outputs under varying spectral, illumination, and atmospheric conditions, while respecting topographic and land-cover constraints. This demonstrates the model’s ability to represent one￾to-many relationships in multimodal Earth observation [PITH_FU… view at source ↗
Figure 9
Figure 9. Figure 9: Per-pixel spectral reflectance profiles across multiple LULC classes (560U 34R tile). The COP-GEN model reproduces characteristic Sentinel-2 band signatures for LULC classes, indicating robust learning of physical spectral patterns. REAL S2L2A COP-GEN INPUTS DEM LULC S2L2A SPECTRAL PROFILE BY CLASS (shown for reference) OUTPUTS 272D_1525R [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-pixel spectral reflectance profiles across multiple LULC classes (502U 263R tile). COP-GEN closely matches Sentinel-2 band responses, indicating physically meaningful spectral signatures across land-cover types. DIGITAL ELEVATION MODEL LAND USE LAND COVER REAL S2L2A IMAGE COP-GEN S2L2A GENERATION INPUTS OUTPUTS S2L2A SPECTRAL PROFILE BY CLASS 560U_34R (shown for reference) [PITH_FULL_IMAGE:figures/fu… view at source ↗
Figure 11
Figure 11. Figure 11: Per-pixel spectral reflectance profiles across multiple LULC classes (256U 1125L tile). COP-GEN preserves characteristic Sentinel-2 spectral responses, illustrating consistent physical fidelity across land-cover types [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Geospatial distribution analysis. Latitude–longitude coordinates are predicted from DEM and LULC inputs over n = 50 runs. TerraMind (blue) collapses to a small set of locations, whereas COP-GEN (green) produces a broader distribution of plausible sites sharing similar topographic and biome characteristics, reflecting the non-injective nature of the mapping. Real thumbnail visualisations of predicted locat… view at source ↗
Figure 13
Figure 13. Figure 13: Geospatial distribution analysis. Conditional latitude–longitude predictions from DEM and LULC inputs (n = 50) reveal that TerraMind (blue) concentrates on a few modes, while COP-GEN (green) captures multiple plausible geographic locations with similar terrain and biome properties, consistent with a non-injective mapping. Example real-image thumbnails are provided for comparison [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 14
Figure 14. Figure 14: Geospatial distribution analysis. Given DEM and LULC inputs, latitude–longitude predictions over n = 50 runs show TerraMind (blue) collapsing to a limited set of locations, while COP-GEN (green) captures a diverse distribution of plausible sites with similar topographic and biome characteristics, reflecting the non-injective nature of the task. This behavior is further supported by the fact that most COP-… view at source ↗
Figure 15
Figure 15. Figure 15: Geospatial distribution analysis. We evaluate conditional latitude–longitude prediction from DEM and LULC inputs across n = 50 samples. TerraMind (blue) tends to collapse to a few specific locations, whereas COP-GEN (green) generates a spatially distributed set of plausible locations with similar terrain and biome attributes, consistent with a non￾injective mapping. Real thumbnail visualizations of predic… view at source ↗
Figure 16
Figure 16. Figure 16: Effect of input conditioning on spectral distribution spread (195D 669L). As additional input modalities are incorporated, generated samples increasingly align with the ground-truth (GT) distribution. For each conditioning configu￾ration, 25 stochastic Sentinel-2 L2A (S2L2A) samples are generated, and predicted band distributions are summarized using histograms and kernel density estimates (KDEs) [PITH_F… view at source ↗
Figure 17
Figure 17. Figure 17: Effect of input conditioning on spectral distribution spread (248U 978R). Increasing the number of conditioning modalities progressively constrains the generated Sentinel-2 L2A (S2L2A) outputs toward the ground-truth (GT) distribution. For each setting, 25 stochastic samples are generated and evaluated using histograms and kernel density estimates (KDEs) [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Effect of input conditioning on spectral distribution spread (250U 409R). As additional modalities are used for conditioning, the variability of generated Sentinel-2 L2A (S2L2A) samples decreases and better matches the ground￾truth (GT) distribution. For each configuration, 25 stochastic generations are summarized via histograms and kernel density estimates (KDEs) [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Effect of input conditioning on spectral distribution spread (143D 1481R). Providing additional conditioning modalities narrows the distribution of generated Sentinel-2 L2A (S2L2A) samples toward the ground truth. For each setup, 25 stochastic samples are evaluated using histograms and KDEs [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Class-conditional geolocation priors for trees. COP-GEN is conditioned solely on a land-cover tile that is entirely labeled as trees. The model generates multiple plausible latitude–longitude samples, which are plotted as points on the map. The background shows global tree cover, normalised by percentage, where darker green indicates higher tree density. Predicted locations concentrate in regions with sub… view at source ↗
Figure 21
Figure 21. Figure 21: Class-conditional geolocation priors for snow and ice. The model is conditioned on a tile fully assigned to the snow/ice land-cover class and outputs multiple plausible latitude–longitude samples. Predictions are overlaid on a global mountain-range basemap to provide geographic context. The majority of samples align with high-latitude and high-elevation regions, including the Himalayas, Alaska, Patagonia,… view at source ↗
Figure 22
Figure 22. Figure 22: Class-conditional geolocation priors for rangeland. COP-GEN is conditioned on a homogeneous rangeland tile and generates a distribution of plausible geographic locations. The predictions are visualised on a Koppen–Geiger climate ¨ classification basemap. Generated locations predominantly fall within arid climate zone and exhibit broad global coverage, indicating that the model has learned large-scale clim… view at source ↗
Figure 23
Figure 23. Figure 23: Class-conditional geolocation priors for flooded vegetation. Given a tile entirely labeled as flooded vegetation, COP-GEN outputs multiple candidate latitude–longitude samples. Predictions are plotted on a global tree cover basemap (normalised by percentage) to contextualise vegetation density. The predicted locations tend to cluster in forested and wetland￾rich regions, consistent with the ecological con… view at source ↗
Figure 24
Figure 24. Figure 24: Class-conditional geolocation priors for crops. The model is conditioned on a land-cover tile fully classified as crops and produces multiple plausible geographic locations. Predictions are overlaid on a Koppen–Geiger climate basemap. ¨ Samples concentrate in temperate and continental climate zones, with notable densities over Central Europe, North America, and Central Asia. The inferred latitude distribu… view at source ↗
Figure 25
Figure 25. Figure 25: Class-conditional geolocation priors for clouds. COP-GEN is conditioned on a tile entirely labeled as clouds and generates multiple latitude–longitude predictions. The points are visualised on a global mountain-range basemap. Many samples align with major orographic regions such as the Himalayas and the Andes, where cloud formation is frequent due to topographic lifting, suggesting that the model has lear… view at source ↗
Figure 26
Figure 26. Figure 26: Class-conditional geolocation priors for built-up areas. Conditioned on a homogeneous built-up land-cover tile, COP-GEN generates a distribution of plausible geographic locations. Predictions are overlaid on a global population density basemap. The model places most samples in densely populated regions, particularly across Asia, with additional concentrations in Europe and North America, reflecting realis… view at source ↗
Figure 27
Figure 27. Figure 27: Class-conditional geolocation priors for bare ground. Given a tile fully classified as bare ground, COP-GEN outputs multiple candidate latitude–longitude locations. Predictions are shown on a Koppen–Geiger climate basemap. The ¨ majority of samples fall within arid climate zones and are broadly distributed across global desert regions, aligning well with the expected geographic distribution of bare soil a… view at source ↗
read the original abstract

Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover. Relationships between modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations, and should be parametrised as conditional distributions. Deterministic models, by contrast, collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. We release a stochastic benchmark built from multi-temporal Sentinel-2 observations that enables distribution-level comparison of generative EO models. On this benchmark, COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range, while the strongest competing method collapses to 2.8% and 18%, respectively. These results highlight the importance of stochastic generative modeling for EO and motivate evaluation protocols beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Copernicus Earth observation modalities (optical, radar, elevation, land-cover) at native resolutions. It parameterizes cross-modal mappings as conditional distributions to enable any-to-any generation, including zero-shot translation, and releases a multi-temporal Sentinel-2 benchmark on which it reports 90% coverage of the real observation manifold and 63% per-band reflectance range versus 2.8% and 18% for the strongest baseline.

Significance. If the multimodal claims are substantiated, the work would meaningfully advance stochastic generative modeling for EO by capturing uncertainty and variability in cross-sensor tasks, where deterministic approaches collapse to conditional means. The released benchmark protocol is a constructive contribution that shifts evaluation beyond pointwise metrics.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The headline quantitative results (90% manifold coverage, 63% reflectance range) are obtained exclusively on the optical multi-temporal Sentinel-2 benchmark. No equivalent coverage or range metrics are reported for cross-modal generation (e.g., optical-to-radar or optical-to-DEM), leaving the central any-to-any multimodal claim without direct quantitative support.
  2. [§3] §3 (Model Architecture): The description of the latent diffusion transformer does not specify how heterogeneous modalities are tokenized or conditioned at native resolutions, nor the precise form of the training loss and sampling schedule. These omissions make it impossible to assess whether the reported physical consistency arises from the architecture or from benchmark-specific tuning.
  3. [§4.2] §4.2 (Quantitative Evaluation): The manifold-coverage and reflectance-range metrics are defined only for the optical marginal; the paper provides no ablation or extension showing that the same metrics remain high when the model is conditioned on or generates non-optical modalities, which directly tests the multimodal joint-distribution claim.
minor comments (2)
  1. [Abstract] The abstract states that the model 'adapts its output uncertainty as conditioning information increases,' but no quantitative plot or table quantifies this adaptation (e.g., variance vs. number of conditioning bands).
  2. [§4.1] Figure captions and §4.1 should explicitly state the number of samples drawn per conditioning input when computing coverage statistics, as this affects the interpretation of the 90% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps clarify the scope of our quantitative claims and the need for additional architectural details. We address each major comment point by point below, indicating revisions where appropriate to strengthen the manuscript's support for the multimodal claims.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The headline quantitative results (90% manifold coverage, 63% reflectance range) are obtained exclusively on the optical multi-temporal Sentinel-2 benchmark. No equivalent coverage or range metrics are reported for cross-modal generation (e.g., optical-to-radar or optical-to-DEM), leaving the central any-to-any multimodal claim without direct quantitative support.

    Authors: The referee correctly notes that the manifold coverage and reflectance range metrics are reported specifically for the optical Sentinel-2 benchmark, as these metrics are defined with respect to the multi-temporal optical observation distribution. Cross-modal results in the manuscript are supported by qualitative visualizations and per-modality fidelity metrics (e.g., PSNR/SSIM for radar and DEM) demonstrating physical consistency and diversity. To better substantiate the any-to-any claim, we will revise the manuscript to include additional quantitative results for cross-modal tasks using adapted distribution-level metrics where feasible, and we will explicitly state the scope of the 90% figure as applying to the optical marginal. revision: yes

  2. Referee: [§3] §3 (Model Architecture): The description of the latent diffusion transformer does not specify how heterogeneous modalities are tokenized or conditioned at native resolutions, nor the precise form of the training loss and sampling schedule. These omissions make it impossible to assess whether the reported physical consistency arises from the architecture or from benchmark-specific tuning.

    Authors: We agree that §3 requires expanded detail for full reproducibility and to allow assessment of the architecture's contribution. In the revised manuscript we will add: (i) modality-specific tokenization at native resolutions using patch embeddings (16×16 for optical/radar, adjusted for DEM/land-cover); (ii) conditioning via cross-attention in the transformer backbone enabling any-to-any mappings; (iii) the training loss as the standard diffusion noise-prediction objective with modality-weighted terms; and (iv) the sampling schedule (linear beta schedule, 1000 timesteps, DDIM inference). These additions will be integrated into the main text rather than left to supplementary material. revision: yes

  3. Referee: [§4.2] §4.2 (Quantitative Evaluation): The manifold-coverage and reflectance-range metrics are defined only for the optical marginal; the paper provides no ablation or extension showing that the same metrics remain high when the model is conditioned on or generates non-optical modalities, which directly tests the multimodal joint-distribution claim.

    Authors: We acknowledge that the current evaluation focuses the manifold metrics on the optical marginal due to the availability of multi-temporal references for that modality. We will add an ablation study in the revised §4.2 that reports results when the model is conditioned on non-optical inputs (radar, DEM) for optical generation and vice versa, using the same or suitably adapted coverage and range metrics. This will provide direct quantitative support for the joint-distribution modeling across modalities. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on independent benchmark

full rationale

The paper defines COP-GEN as a latent diffusion transformer adapted for multimodal EO data and evaluates it on a separately released multi-temporal Sentinel-2 benchmark. No equations reduce by construction to fitted parameters from the same data, no predictions are statistically forced by input fits, and no load-bearing self-citations or uniqueness theorems collapse the central claims to tautologies. The 90%/63% coverage numbers are presented as empirical measurements on an external benchmark rather than derived from the model's definition. Minor related-work citations do not carry the central multimodal claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms or invented entities; the approach inherits standard diffusion and transformer assumptions from prior literature without additional ad-hoc constructs stated.

pith-pipeline@v0.9.0 · 5593 in / 1044 out tokens · 26465 ms · 2026-05-15T16:37:24.575804+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

  1. [1]

    Segdiff: Image segmentation with diffusion probabilistic models

    Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models.arXiv preprint arXiv:2112.00390, 2021. 3

  2. [2]

    Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023

    Tai An, Bin Xue, Chunlei Huo, Shiming Xiang, and Chunhong Pan. Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023. 3

  3. [3]

    Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning

    Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero. Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 2

  4. [4]

    OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024

    Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024. 2, 4

  5. [5]

    Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

    Wele Gedara Chaminda Bandara, Nithin Gopalakrish- nan Nair, and Vishal M Patel. Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

  6. [6]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 3, 6

  7. [7]

    One transformer fits all distributions in multi- modal diffusion at scale

    Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi- modal diffusion at scale. InInternational Confer- ence on Machine Learning, pages 1692–1717. PMLR,

  8. [8]

    Brown, Michal R

    Christopher F. Brown, Michal R. Kazmierski, Va- lerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Es- tefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. Al- phaearth found...

  9. [9]

    Terrafm: A scalable foundation model for unified multisensor earth observation

    Muhammad Sohail Danish, Muhammad Akhtar Mu- nir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fa- had Shahbaz Khan, and Salman Khan. Terrafm: A scalable foundation model for unified multisensor earth observation. 2025. 2

  10. [10]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems, 2021. 3

  11. [11]

    Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model

    Runmin Dong, Shuai Yuan, Bin Luo, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Weijia Li, Juepeng Zheng, and Haohuan Fu. Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2...

  12. [12]

    Remote sensing image super-resolution via enhanced back-projection networks

    Xiaoyu Dong, Zhihong Xi, Xu Sun, and Lina Yang. Remote sensing image super-resolution via enhanced back-projection networks. InIGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, pages 1480–1483. IEEE, 2020. 3

  13. [13]

    Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails

    Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot Crowley, and Mikolaj Czerkawski. Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 3

  14. [14]

    Taming transformers for high-resolution image syn- thesis

    Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image syn- thesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 3

  15. [15]

    Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025

    Daniela Szwarcman et al. Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025. 2

  16. [16]

    Copernicus: Europes eyes on Earth, 2025

    European Commission. Copernicus: Europes eyes on Earth, 2025. Accessed: 2024-12-30. 1

  17. [17]

    Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav

    Zhengpeng Feng, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, James Ball, Clement Atzberger, David A. Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav. Tessera: Temporal embeddings of surface spectra for earth representation and analysis,

  18. [18]

    Major tom: Expandable datasets for earth observation

    Alistair Francis and Mikolaj Czerkawski. Major tom: Expandable datasets for earth observation. InIGARSS 2024-2024 IEEE International Geoscience and Re- mote Sensing Symposium, pages 2935–2940. IEEE,

  19. [19]

    Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 23164–23173, 2023. 3

  20. [20]

    Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014. 3

  21. [21]

    Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023

    Jiang He, Yajie Li, Qiangqiang Yuan, et al. Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023. 3

  22. [22]

    Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025

    Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Far- ley, et al. Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025. 4

  23. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020. 3

  24. [24]

    TerraMind: Large-scale generative multimodality for Earth observation,

    Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation.arXiv preprint arXiv:2504.11171, 2025. 4

  25. [25]

    Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery

    Jia Jia, Geunho Lee, Zhibo Wang, Lyu Zhi, and Yuchu He. Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery. IEEE Journal of Selected Topics in Applied Earth Ob- servations and Remote Sensing, 2024. 3

  26. [26]

    Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025

    Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, and Andrea Nascetti. Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025. 2, 3

  27. [27]

    Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

    Minghao Jin, Pengwei Wang, and Yusong Li. Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

  28. [28]

    Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023

    Ran Jing, Fuzhou Duan, Fengxian Lu, Miao Zhang, and Wenji Zhao. Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023. 3

  29. [29]

    Analyz- ing and improving the image quality of StyleGAN

    Tero Karras, Samuli Laine, and Timo Aila. Analyz- ing and improving the image quality of StyleGAN. In IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2020. 3

  30. [30]

    Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023

    Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lo- bell, and Stefano Ermon. Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023. 2, 3

  31. [31]

    Multi-class segmentation from aerial views using recursive noise diffusion

    Benedikt Kolbeinsson and Krystian Mikolajczyk. Multi-class segmentation from aerial views using recursive noise diffusion. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8439–8449, 2024. 3

  32. [32]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 7

  33. [33]

    Lawrence, Victoria L

    Bryan N. Lawrence, Victoria L. Bennett, James Churchill, Martin Juckes, Philip Kershaw, Stephen Pascoe, Sam Pepler, Matthew Pritchard, and Ag Stephens. Storing and manipulating environmental big data with jasmin. InIEEE Big Data, pages 1–5, San Francisco, 2013. IEEE. 14

  34. [34]

    Detecting out-of-distribution earth observation images with dif- fusion models

    Georges Le Bellier and Nicolas Audebert. Detecting out-of-distribution earth observation images with dif- fusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 481–491, 2024. 3

  35. [35]

    Mdfl: Multi-domain diffusion-driven feature learn- ing

    Daixun Li, Weiying Xie, Jiaqing Zhang, and Yunsong Li. Mdfl: Multi-domain diffusion-driven feature learn- ing. InProceedings of the AAAI conference on artifi- cial intelligence, 2024. 3

  36. [36]

    A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022

    Yang Liu, Yexin Wang, Kaichang Di, Man Peng, Wen- hui Wan, and Zhaoqin Liu. A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022. 3

  37. [37]

    Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024

    Yidan Liu, Jun Yue, Shaobo Xia, Pedram Ghamisi, Weiying Xie, and Leyuan Fang. Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024. 3

  38. [38]

    Revisiting clas- sifier two-sample tests

    David Lopez-Paz and Maxime Oquab. Revisiting clas- sifier two-sample tests. InInternational Conference on Learning Representations (ICLR), 2017. 7

  39. [39]

    Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020

    Jiayi Ma, Wei Yu, Chen Chen, Pengwei Liang, Xiao- jie Guo, and Junjun Jiang. Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020. 3

  40. [40]

    Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020

    Andrea Meraner, Patrick Ebel, Xiao Xiang Zhu, and Michael Schmitt. Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020. 3

  41. [41]

    Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning

    Vishal Nedungadi, Ankit Kariryaa, Stefan Oehm- cke, Serge Belongie, Christian Igel, and Nico Lang. Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning. InEuropean Con- ference on Computer Vision, pages 164–182. Springer,

  42. [42]

    Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models

    Li Pang, Xiangyu Rui, Long Cui, Hongzhong Wang, Deyu Meng, and Xiangyong Cao. Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 3005–3014, 2024. 3

  43. [43]

    Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023

    Z ´arate L Paola, L ´opez S Jes ´us, Arroyo H Christian, and Rinc ´on U Sonia. Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023. 3

  44. [44]

    Scalable diffu- sion models with transformers

    William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4195–4205, 2023. 3

  45. [45]

    Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities

    Jiahui Qu, Yuanbo Yang, Wenqian Dong, and Yufei Yang. Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 14731–14739, 2024. 3

  46. [46]

    Zero-shot text-to-image generation

    Aditya Ramesh, Pavel Pavlov, Gabriel Goh, et al. Zero-shot text-to-image generation. InInternational Conference on Machine Learning, 2021. 3

  47. [47]

    High- resolution image synthesis with latent diffusion mod- els

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High- resolution image synthesis with latent diffusion mod- els. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 3, 5

  48. [48]

    U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015. 3

  49. [49]

    Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealis- tic text-to-image diffusion models with deep language understanding, 2022. 3

  50. [50]

    Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024

    Neetu Sigger, Quoc-Tuan Vien, Sinh Van Nguyen, Gi- anluca Tozzi, and Tuan Thanh Nguyen. Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024. 3

  51. [51]

    Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015

    Jascha Sohl-Dickstein, Eric Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015. 3

  52. [52]

    Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024

    Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Junmin Liu, and Deyu Meng. Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024. 2, 3

  53. [53]

    Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024

    Jiayuan Tian, Jie Lei, Jiaqing Zhang, Weiying Xie, and Yunsong Li. Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024. 3

  54. [54]

    Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation

    Aysim Toker, Marvin Eisenberger, Daniel Cremers, and Laura Leal-Taix´e. Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 27695–27705, 2024. 3

  55. [55]

    Galileo: Learning global & local features of many remote sensing modalities

    Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global & local features of many remote sensing modalities. InForty-second International Conference on Machine Learning, 2025. 2, 4

  56. [56]

    Panop- ticon: Advancing any-sensor foundation models for earth observation

    Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam Stewart, Zhitong Xiong, Xiao Xi- ang Zhu, Stefan Bauer, and John Chuang. Panop- ticon: Advancing any-sensor foundation models for earth observation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 2204– 2214, 2025. 2

  57. [57]

    Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

    Ce Wang and Wanjie Sun. Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

  58. [58]

    Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019

    Lei Wang, Xin Xu, Yue Yu, Rui Yang, Rong Gui, Zhaozhuo Xu, and Fangling Pu. Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019. 3

  59. [59]

    Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024

    Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, and Liang Lin. Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

  60. [60]

    Es- rgan: Enhanced super-resolution generative adversar- ial networks

    Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Es- rgan: Enhanced super-resolution generative adversar- ial networks. InProceedings of the European confer- ence on computer vision (ECCV) workshops, pages 0– 0, 2018. 3

  61. [61]

    Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu

    Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu. Towards a unified copernicus foundation model for earth vision,

  62. [62]

    Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm

    Yihan Wen, Xianping Ma, Xiaokang Zhang, and Man- On Pun. Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm. IEEE Transactions on Geoscience and Remote Sens- ing, 2024. 3

  63. [63]

    Xiong, Y

    Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Jo ¨elle Hanna, Damian Borth, Ioannis Pa- poutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired founda- tion model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024. 2, 4

  64. [64]

    Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025

    Zhiping Yu, Chenyang Liu, Liqin Liu, Zhenwei Shi, and Zhengxia Zou. Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025. 2, 3

  65. [65]

    Diffucd: Unsuper- vised hyperspectral image change detection with semantic correlation diffusion model,

    Xiangrong Zhang, Shunli Tian, Guanchun Wang, Huiyu Zhou, and Licheng Jiao. Diffucd: Unsuper- vised hyperspectral image change detection with se- mantic correlation diffusion model.arXiv preprint arXiv:2305.12410, 2023. 3

  66. [66]

    Zhang, G

    Y . Zhang, G. Tseng, J. Redmon, H. Herzog, F. Bas- tani, H. Sablon, R. Park, J. Morrison, A. Buraczyn- ski, K. Farley, J. Hansen, A. Howe, P. Johnson, M. Otterlee, H. Pitelka, R. Ratner, T. Schmitt, C. Wil- helm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema. OlmoEarth: Earth observation foundation model.https : / / w...

  67. [67]

    Changen2: Multi-temporal remote sensing generative change foundation model

    Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, and Yanfei Zhong. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

  68. [68]

    Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024

    Jingyi Zhou, Jiamu Sheng, Peng Ye, Jiayuan Fan, Tong He, Bin Wang, and Tao Chen. Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

  69. [69]

    Condition

    Xuechao Zou, Kai Li, Junliang Xing, Yu Zhang, Shiy- ing Wang, Lei Jin, and Pin Tao. Diffcr: A fast con- ditional diffusion framework for cloud removal from optical satellite images.IEEE Transactions on Geo- science and Remote Sensing, 62:1–14, 2024. 3 A. Supplementary Material We provide additional qualitative, quantitative, and architectural results that...

  70. [70]

    Example real-image thumbnails are provided for comparison

    reveal that TerraMind (blue) concentrates on a few modes, while COP-GEN (green) captures multiple plausible geographic locations with similar terrain and biome properties, consistent with a non-injective mapping. Example real-image thumbnails are provided for comparison. INPUT MODALITIES S2L2A 293U_659R 407U_358R 456U_995L 352U_1041L 388U_486R Figure 14.G...