COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Elliot J. Crowley; Eva Gmelich Meijling; Miguel Espinosa; Mikolaj Czerkawski; Valerio Marsocci

arxiv: 2603.03239 · v3 · submitted 2026-03-03 · 💻 cs.CV

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Miguel Espinosa , Eva Gmelich Meijling , Valerio Marsocci , Elliot J. Crowley , Mikolaj Czerkawski This is my paper

Pith reviewed 2026-05-15 16:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords Earth observationlatent diffusiongenerative modelingmultimodalconditional generationSentinel-2data synthesiscross-modal translation

0 comments

The pith

COP-GEN models cross-modal Earth observation relationships as conditional distributions to generate diverse, physically consistent samples across sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COP-GEN, a multimodal latent diffusion transformer that treats relationships between optical, radar, elevation, and land-cover data as conditional distributions instead of single deterministic outputs. This addresses the non-injective nature of sensor mappings, where the same input can correspond to multiple valid observations, and avoids the collapse to conditional means seen in prior models. The approach supports any-to-any generation, including zero-shot modality translation, while preserving physical consistency. On a new multi-temporal Sentinel-2 benchmark, the model covers 90% of the real observation manifold and 63% of per-band reflectance range, compared to 2.8% and 18% for the strongest baseline.

Core claim

COP-GEN is a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at native spatial resolutions by parameterizing cross-modal mappings as conditional distributions, enabling flexible any-to-any conditional generation including zero-shot modality translation without task-specific retraining, while producing diverse yet physically consistent realizations that capture meaningful cross-modal structure and adapt output uncertainty to the conditioning information.

What carries the argument

The multimodal latent diffusion transformer that parameterizes cross-modal mappings as conditional distributions for joint modeling of Earth observation modalities.

Load-bearing premise

That the multi-temporal Sentinel-2 benchmark represents the full joint distribution of heterogeneous Earth observation modalities and that visual plus quantitative checks confirm physical consistency of the generated samples.

What would settle it

A new expanded benchmark or physical-law violation test showing that COP-GEN samples fall outside plausible reflectance or backscatter ranges or cover less of the real manifold than the strongest competing method.

Figures

Figures reproduced from arXiv: 2603.03239 by Elliot J. Crowley, Eva Gmelich Meijling, Miguel Espinosa, Mikolaj Czerkawski, Valerio Marsocci.

**Figure 1.** Figure 1: Conditional generation of Sentinel-2 L2A imagery from topography (DEM) and land-cover (LULC) inputs. We condition COP-GEN generations on DEM and LULC inputs (geographic location is provided solely for visual reference). COP-GEN produces diverse and physically consistent outputs, demonstrating variability in spectral appearance, illumination, and atmospheric conditions while preserving topographic and land-… view at source ↗

**Figure 2.** Figure 2: COP-GEN architecture, training, and inference overview. Multimodal inputs (optical, radar, elevation, landcover, geolocation, and timestamps) are encoded into latent representations using modality-specific VAEs (or directly tokenized for scalar inputs). All tokens, augmented with modality-specific diffusion timestep embeddings, are processed by a shared transformer diffusion backbone. The model is trained… view at source ↗

**Figure 3.** Figure 3: Geospatial Distribution Analysis. We predict latitude–longitude coordinates conditioned on DEM and LULC inputs (n = 50 runs). TerraMind (blue) collapses to a few locations, whereas COP-GEN (green) predicts a distribution of plausible locations that share similar topographic and biome characteristics, accurately reflecting the non-injective nature of the mapping. A Koppen–Geiger climate classification basem… view at source ↗

**Figure 4.** Figure 4: Distribution spread narrowing by progressively increasing input conditioning. As additional modalities are provided as input, the generated samples better align with the ground-truth (GT) distribution. For each conditioning setup, we generate 25 stochastic samples of Sentinel-2 L2A (S2L2A) imagery and report the predicted band distributions using histograms and kernel density estimates (KDEs). One spectral… view at source ↗

**Figure 5.** Figure 5: Per-pixel spectral reflectance profiles across multiple LULC classes. Conditioned on DEM and LULC inputs, COP-GEN generates multispectral S2L2A imagery that captures physically consistent spectral relationships. The plots compare spectral profiles from selected pixel locations in both real and generated images across the Sentinel-2 bands. The close alignment for trees, bare soil, water, crops, built-up ar… view at source ↗

**Figure 6.** Figure 6: Band infilling via resolution-specific latent groups. By grouping Sentinel-2 bands according to resolution, COPGEN treats spectral groups as independent modalities. Here, the model is conditioned only on the high-resolution visible band group (B2, B3, B4, B8) and successfully reconstructs the remaining spectral bands, auxiliary sensors (S1RTC, DEM), LULC maps, cloud mask, timestamp, and location. The high… view at source ↗

**Figure 7.** Figure 7: Multiple conditional Sentinel-2 L2A generations from DEM and LULC inputs. Generated outputs per scene demonstrates COP-GEN’s ability to model multimodal variability in spectral response, illumination, and atmospheric conditions, while maintaining consistency with underlying terrain and land-cover information [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Conditional generation of Sentinel-2 L2A imagery from DEM and LULC inputs. For each location, a grid of generated examples illustrates the diversity of COP-GEN outputs under varying spectral, illumination, and atmospheric conditions, while respecting topographic and land-cover constraints. This demonstrates the model’s ability to represent oneto-many relationships in multimodal Earth observation [PITH_FU… view at source ↗

**Figure 9.** Figure 9: Per-pixel spectral reflectance profiles across multiple LULC classes (560U 34R tile). The COP-GEN model reproduces characteristic Sentinel-2 band signatures for LULC classes, indicating robust learning of physical spectral patterns. REAL S2L2A COP-GEN INPUTS DEM LULC S2L2A SPECTRAL PROFILE BY CLASS (shown for reference) OUTPUTS 272D_1525R [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Per-pixel spectral reflectance profiles across multiple LULC classes (502U 263R tile). COP-GEN closely matches Sentinel-2 band responses, indicating physically meaningful spectral signatures across land-cover types. DIGITAL ELEVATION MODEL LAND USE LAND COVER REAL S2L2A IMAGE COP-GEN S2L2A GENERATION INPUTS OUTPUTS S2L2A SPECTRAL PROFILE BY CLASS 560U_34R (shown for reference) [PITH_FULL_IMAGE:figures/fu… view at source ↗

**Figure 11.** Figure 11: Per-pixel spectral reflectance profiles across multiple LULC classes (256U 1125L tile). COP-GEN preserves characteristic Sentinel-2 spectral responses, illustrating consistent physical fidelity across land-cover types [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Geospatial distribution analysis. Latitude–longitude coordinates are predicted from DEM and LULC inputs over n = 50 runs. TerraMind (blue) collapses to a small set of locations, whereas COP-GEN (green) produces a broader distribution of plausible sites sharing similar topographic and biome characteristics, reflecting the non-injective nature of the mapping. Real thumbnail visualisations of predicted locat… view at source ↗

**Figure 13.** Figure 13: Geospatial distribution analysis. Conditional latitude–longitude predictions from DEM and LULC inputs (n = 50) reveal that TerraMind (blue) concentrates on a few modes, while COP-GEN (green) captures multiple plausible geographic locations with similar terrain and biome properties, consistent with a non-injective mapping. Example real-image thumbnails are provided for comparison [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 14.** Figure 14: Geospatial distribution analysis. Given DEM and LULC inputs, latitude–longitude predictions over n = 50 runs show TerraMind (blue) collapsing to a limited set of locations, while COP-GEN (green) captures a diverse distribution of plausible sites with similar topographic and biome characteristics, reflecting the non-injective nature of the task. This behavior is further supported by the fact that most COP-… view at source ↗

**Figure 15.** Figure 15: Geospatial distribution analysis. We evaluate conditional latitude–longitude prediction from DEM and LULC inputs across n = 50 samples. TerraMind (blue) tends to collapse to a few specific locations, whereas COP-GEN (green) generates a spatially distributed set of plausible locations with similar terrain and biome attributes, consistent with a noninjective mapping. Real thumbnail visualizations of predic… view at source ↗

**Figure 16.** Figure 16: Effect of input conditioning on spectral distribution spread (195D 669L). As additional input modalities are incorporated, generated samples increasingly align with the ground-truth (GT) distribution. For each conditioning configuration, 25 stochastic Sentinel-2 L2A (S2L2A) samples are generated, and predicted band distributions are summarized using histograms and kernel density estimates (KDEs) [PITH_F… view at source ↗

**Figure 17.** Figure 17: Effect of input conditioning on spectral distribution spread (248U 978R). Increasing the number of conditioning modalities progressively constrains the generated Sentinel-2 L2A (S2L2A) outputs toward the ground-truth (GT) distribution. For each setting, 25 stochastic samples are generated and evaluated using histograms and kernel density estimates (KDEs) [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗

**Figure 18.** Figure 18: Effect of input conditioning on spectral distribution spread (250U 409R). As additional modalities are used for conditioning, the variability of generated Sentinel-2 L2A (S2L2A) samples decreases and better matches the groundtruth (GT) distribution. For each configuration, 25 stochastic generations are summarized via histograms and kernel density estimates (KDEs) [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗

**Figure 19.** Figure 19: Effect of input conditioning on spectral distribution spread (143D 1481R). Providing additional conditioning modalities narrows the distribution of generated Sentinel-2 L2A (S2L2A) samples toward the ground truth. For each setup, 25 stochastic samples are evaluated using histograms and KDEs [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗

**Figure 20.** Figure 20: Class-conditional geolocation priors for trees. COP-GEN is conditioned solely on a land-cover tile that is entirely labeled as trees. The model generates multiple plausible latitude–longitude samples, which are plotted as points on the map. The background shows global tree cover, normalised by percentage, where darker green indicates higher tree density. Predicted locations concentrate in regions with sub… view at source ↗

**Figure 21.** Figure 21: Class-conditional geolocation priors for snow and ice. The model is conditioned on a tile fully assigned to the snow/ice land-cover class and outputs multiple plausible latitude–longitude samples. Predictions are overlaid on a global mountain-range basemap to provide geographic context. The majority of samples align with high-latitude and high-elevation regions, including the Himalayas, Alaska, Patagonia,… view at source ↗

**Figure 22.** Figure 22: Class-conditional geolocation priors for rangeland. COP-GEN is conditioned on a homogeneous rangeland tile and generates a distribution of plausible geographic locations. The predictions are visualised on a Koppen–Geiger climate ¨ classification basemap. Generated locations predominantly fall within arid climate zone and exhibit broad global coverage, indicating that the model has learned large-scale clim… view at source ↗

**Figure 23.** Figure 23: Class-conditional geolocation priors for flooded vegetation. Given a tile entirely labeled as flooded vegetation, COP-GEN outputs multiple candidate latitude–longitude samples. Predictions are plotted on a global tree cover basemap (normalised by percentage) to contextualise vegetation density. The predicted locations tend to cluster in forested and wetlandrich regions, consistent with the ecological con… view at source ↗

**Figure 24.** Figure 24: Class-conditional geolocation priors for crops. The model is conditioned on a land-cover tile fully classified as crops and produces multiple plausible geographic locations. Predictions are overlaid on a Koppen–Geiger climate basemap. ¨ Samples concentrate in temperate and continental climate zones, with notable densities over Central Europe, North America, and Central Asia. The inferred latitude distribu… view at source ↗

**Figure 25.** Figure 25: Class-conditional geolocation priors for clouds. COP-GEN is conditioned on a tile entirely labeled as clouds and generates multiple latitude–longitude predictions. The points are visualised on a global mountain-range basemap. Many samples align with major orographic regions such as the Himalayas and the Andes, where cloud formation is frequent due to topographic lifting, suggesting that the model has lear… view at source ↗

**Figure 26.** Figure 26: Class-conditional geolocation priors for built-up areas. Conditioned on a homogeneous built-up land-cover tile, COP-GEN generates a distribution of plausible geographic locations. Predictions are overlaid on a global population density basemap. The model places most samples in densely populated regions, particularly across Asia, with additional concentrations in Europe and North America, reflecting realis… view at source ↗

**Figure 27.** Figure 27: Class-conditional geolocation priors for bare ground. Given a tile fully classified as bare ground, COP-GEN outputs multiple candidate latitude–longitude locations. Predictions are shown on a Koppen–Geiger climate basemap. The ¨ majority of samples fall within arid climate zones and are broadly distributed across global desert regions, aligning well with the expected geographic distribution of bare soil a… view at source ↗

read the original abstract

Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover. Relationships between modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations, and should be parametrised as conditional distributions. Deterministic models, by contrast, collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. We release a stochastic benchmark built from multi-temporal Sentinel-2 observations that enables distribution-level comparison of generative EO models. On this benchmark, COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range, while the strongest competing method collapses to 2.8% and 18%, respectively. These results highlight the importance of stochastic generative modeling for EO and motivate evaluation protocols beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

COP-GEN offers a practical latent diffusion setup for any-to-any EO generation at native resolutions plus a new stochastic benchmark, but the headline coverage numbers rest on optical data alone.

read the letter

COP-GEN is a latent diffusion transformer that models joint distributions over heterogeneous Earth observation modalities at native resolutions and supports zero-shot any-to-any conditional generation. The central move is to treat cross-modal mappings as conditional distributions instead of deterministic outputs, which matches the non-injective character of sensor relationships better than mean-seeking models. They also release a multi-temporal Sentinel-2 benchmark that measures manifold coverage and per-band reflectance range, where the model reaches 90% coverage and 63% range against 2.8% and 18% for the strongest baseline. Releasing that benchmark is a concrete step that lets others compare generative EO models on distribution-level metrics rather than single-reference scores. The paper shows clear engagement with why stochastic modeling matters for data completion and cross-sensor translation, and the abstract claims experiments that maintain physical consistency across optical, radar, and elevation. The main limitation is that the quantitative benchmark uses only optical Sentinel-2 time series, so the strong numbers do not directly test performance when conditioning or generating across modality boundaries such as S2-to-S1 or S2-to-DEM. If the full methods contain separate cross-modal checks with statistical support, that would strengthen the multimodal claim; without those details the central numbers stay narrower than the positioning suggests. This is for remote-sensing groups that already work with diffusion models or need tools for uncertainty-aware data fusion. A reader focused on practical EO generation would find the benchmark and any-to-any framing worth examining. The work deserves a serious referee because the benchmark is new and the problem framing is honest, even though revisions on evaluation scope and method transparency would be expected.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Copernicus Earth observation modalities (optical, radar, elevation, land-cover) at native resolutions. It parameterizes cross-modal mappings as conditional distributions to enable any-to-any generation, including zero-shot translation, and releases a multi-temporal Sentinel-2 benchmark on which it reports 90% coverage of the real observation manifold and 63% per-band reflectance range versus 2.8% and 18% for the strongest baseline.

Significance. If the multimodal claims are substantiated, the work would meaningfully advance stochastic generative modeling for EO by capturing uncertainty and variability in cross-sensor tasks, where deterministic approaches collapse to conditional means. The released benchmark protocol is a constructive contribution that shifts evaluation beyond pointwise metrics.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): The headline quantitative results (90% manifold coverage, 63% reflectance range) are obtained exclusively on the optical multi-temporal Sentinel-2 benchmark. No equivalent coverage or range metrics are reported for cross-modal generation (e.g., optical-to-radar or optical-to-DEM), leaving the central any-to-any multimodal claim without direct quantitative support.
[§3] §3 (Model Architecture): The description of the latent diffusion transformer does not specify how heterogeneous modalities are tokenized or conditioned at native resolutions, nor the precise form of the training loss and sampling schedule. These omissions make it impossible to assess whether the reported physical consistency arises from the architecture or from benchmark-specific tuning.
[§4.2] §4.2 (Quantitative Evaluation): The manifold-coverage and reflectance-range metrics are defined only for the optical marginal; the paper provides no ablation or extension showing that the same metrics remain high when the model is conditioned on or generates non-optical modalities, which directly tests the multimodal joint-distribution claim.

minor comments (2)

[Abstract] The abstract states that the model 'adapts its output uncertainty as conditioning information increases,' but no quantitative plot or table quantifies this adaptation (e.g., variance vs. number of conditioning bands).
[§4.1] Figure captions and §4.1 should explicitly state the number of samples drawn per conditioning input when computing coverage statistics, as this affects the interpretation of the 90% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps clarify the scope of our quantitative claims and the need for additional architectural details. We address each major comment point by point below, indicating revisions where appropriate to strengthen the manuscript's support for the multimodal claims.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Experiments): The headline quantitative results (90% manifold coverage, 63% reflectance range) are obtained exclusively on the optical multi-temporal Sentinel-2 benchmark. No equivalent coverage or range metrics are reported for cross-modal generation (e.g., optical-to-radar or optical-to-DEM), leaving the central any-to-any multimodal claim without direct quantitative support.

Authors: The referee correctly notes that the manifold coverage and reflectance range metrics are reported specifically for the optical Sentinel-2 benchmark, as these metrics are defined with respect to the multi-temporal optical observation distribution. Cross-modal results in the manuscript are supported by qualitative visualizations and per-modality fidelity metrics (e.g., PSNR/SSIM for radar and DEM) demonstrating physical consistency and diversity. To better substantiate the any-to-any claim, we will revise the manuscript to include additional quantitative results for cross-modal tasks using adapted distribution-level metrics where feasible, and we will explicitly state the scope of the 90% figure as applying to the optical marginal. revision: yes
Referee: [§3] §3 (Model Architecture): The description of the latent diffusion transformer does not specify how heterogeneous modalities are tokenized or conditioned at native resolutions, nor the precise form of the training loss and sampling schedule. These omissions make it impossible to assess whether the reported physical consistency arises from the architecture or from benchmark-specific tuning.

Authors: We agree that §3 requires expanded detail for full reproducibility and to allow assessment of the architecture's contribution. In the revised manuscript we will add: (i) modality-specific tokenization at native resolutions using patch embeddings (16×16 for optical/radar, adjusted for DEM/land-cover); (ii) conditioning via cross-attention in the transformer backbone enabling any-to-any mappings; (iii) the training loss as the standard diffusion noise-prediction objective with modality-weighted terms; and (iv) the sampling schedule (linear beta schedule, 1000 timesteps, DDIM inference). These additions will be integrated into the main text rather than left to supplementary material. revision: yes
Referee: [§4.2] §4.2 (Quantitative Evaluation): The manifold-coverage and reflectance-range metrics are defined only for the optical marginal; the paper provides no ablation or extension showing that the same metrics remain high when the model is conditioned on or generates non-optical modalities, which directly tests the multimodal joint-distribution claim.

Authors: We acknowledge that the current evaluation focuses the manifold metrics on the optical marginal due to the availability of multi-temporal references for that modality. We will add an ablation study in the revised §4.2 that reports results when the model is conditioned on non-optical inputs (radar, DEM) for optical generation and vice versa, using the same or suitably adapted coverage and range metrics. This will provide direct quantitative support for the joint-distribution modeling across modalities. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on independent benchmark

full rationale

The paper defines COP-GEN as a latent diffusion transformer adapted for multimodal EO data and evaluates it on a separately released multi-temporal Sentinel-2 benchmark. No equations reduce by construction to fitted parameters from the same data, no predictions are statistically forced by input fits, and no load-bearing self-citations or uniqueness theorems collapse the central claims to tautologies. The 90%/63% coverage numbers are presented as empirical measurements on an external benchmark rather than derived from the model's definition. Minor related-work citations do not carry the central multimodal claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms or invented entities; the approach inherits standard diffusion and transformer assumptions from prior literature without additional ad-hoc constructs stated.

pith-pipeline@v0.9.0 · 5593 in / 1044 out tokens · 26465 ms · 2026-05-15T16:37:24.575804+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

COP-GEN is a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions... unified transformer diffusion backbone... independent timestep control per modality
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We release a stochastic benchmark... COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Segdiff: Image segmentation with diffusion probabilistic models

Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models.arXiv preprint arXiv:2112.00390, 2021. 3

work page arXiv 2021
[2]

Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023

Tai An, Bin Xue, Chunlei Huo, Shiming Xiang, and Chunhong Pan. Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023. 3

work page 2023
[3]

Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning

Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero. Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 2

work page 2025
[4]

OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024. 2, 4

work page 2024
[5]

Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

Wele Gedara Chaminda Bandara, Nithin Gopalakrish- nan Nair, and Vishal M Patel. Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

work page arXiv
[6]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 3, 6

work page 2023
[7]

One transformer fits all distributions in multi- modal diffusion at scale

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi- modal diffusion at scale. InInternational Confer- ence on Machine Learning, pages 1692–1717. PMLR,

work page
[8]

Brown, Michal R

Christopher F. Brown, Michal R. Kazmierski, Va- lerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Es- tefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. Al- phaearth found...

work page 2025
[9]

Terrafm: A scalable foundation model for unified multisensor earth observation

Muhammad Sohail Danish, Muhammad Akhtar Mu- nir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fa- had Shahbaz Khan, and Salman Khan. Terrafm: A scalable foundation model for unified multisensor earth observation. 2025. 2

work page 2025
[10]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems, 2021. 3

work page 2021
[11]

Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model

Runmin Dong, Shuai Yuan, Bin Luo, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Weijia Li, Juepeng Zheng, and Haohuan Fu. Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2...

work page 2024
[12]

Remote sensing image super-resolution via enhanced back-projection networks

Xiaoyu Dong, Zhihong Xi, Xu Sun, and Lina Yang. Remote sensing image super-resolution via enhanced back-projection networks. InIGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, pages 1480–1483. IEEE, 2020. 3

work page 2020
[13]

Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails

Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot Crowley, and Mikolaj Czerkawski. Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 3

work page 2025
[14]

Taming transformers for high-resolution image syn- thesis

Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image syn- thesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 3

work page 2021
[15]

Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025

Daniela Szwarcman et al. Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025. 2

work page 2025
[16]

Copernicus: Europes eyes on Earth, 2025

European Commission. Copernicus: Europes eyes on Earth, 2025. Accessed: 2024-12-30. 1

work page 2025
[17]

Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav

Zhengpeng Feng, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, James Ball, Clement Atzberger, David A. Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav. Tessera: Temporal embeddings of surface spectra for earth representation and analysis,

work page
[18]

Major tom: Expandable datasets for earth observation

Alistair Francis and Mikolaj Czerkawski. Major tom: Expandable datasets for earth observation. InIGARSS 2024-2024 IEEE International Geoscience and Re- mote Sensing Symposium, pages 2935–2940. IEEE,

work page 2024
[19]

Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 23164–23173, 2023. 3

work page 2023
[20]

Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014. 3

work page 2014
[21]

Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023

Jiang He, Yajie Li, Qiangqiang Yuan, et al. Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023. 3

work page arXiv 2023
[22]

Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Far- ley, et al. Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025. 4

work page arXiv 2025
[23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020. 3

work page 2020
[24]

TerraMind: Large-scale generative multimodality for Earth observation,

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation.arXiv preprint arXiv:2504.11171, 2025. 4

work page arXiv 2025
[25]

Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery

Jia Jia, Geunho Lee, Zhibo Wang, Lyu Zhi, and Yuchu He. Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery. IEEE Journal of Selected Topics in Applied Earth Ob- servations and Remote Sensing, 2024. 3

work page 2024
[26]

Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025

Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, and Andrea Nascetti. Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025. 2, 3

work page arXiv 2025
[27]

Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

Minghao Jin, Pengwei Wang, and Yusong Li. Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

work page
[28]

Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023

Ran Jing, Fuzhou Duan, Fengxian Lu, Miao Zhang, and Wenji Zhao. Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023. 3

work page 2023
[29]

Analyz- ing and improving the image quality of StyleGAN

Tero Karras, Samuli Laine, and Timo Aila. Analyz- ing and improving the image quality of StyleGAN. In IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2020. 3

work page 2020
[30]

Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023

Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lo- bell, and Stefano Ermon. Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023. 2, 3

work page arXiv 2023
[31]

Multi-class segmentation from aerial views using recursive noise diffusion

Benedikt Kolbeinsson and Krystian Mikolajczyk. Multi-class segmentation from aerial views using recursive noise diffusion. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8439–8449, 2024. 3

work page 2024
[32]

Improved precision and recall metric for assessing generative models

Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 7

work page 2019
[33]

Lawrence, Victoria L

Bryan N. Lawrence, Victoria L. Bennett, James Churchill, Martin Juckes, Philip Kershaw, Stephen Pascoe, Sam Pepler, Matthew Pritchard, and Ag Stephens. Storing and manipulating environmental big data with jasmin. InIEEE Big Data, pages 1–5, San Francisco, 2013. IEEE. 14

work page 2013
[34]

Detecting out-of-distribution earth observation images with dif- fusion models

Georges Le Bellier and Nicolas Audebert. Detecting out-of-distribution earth observation images with dif- fusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 481–491, 2024. 3

work page 2024
[35]

Mdfl: Multi-domain diffusion-driven feature learn- ing

Daixun Li, Weiying Xie, Jiaqing Zhang, and Yunsong Li. Mdfl: Multi-domain diffusion-driven feature learn- ing. InProceedings of the AAAI conference on artifi- cial intelligence, 2024. 3

work page 2024
[36]

A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022

Yang Liu, Yexin Wang, Kaichang Di, Man Peng, Wen- hui Wan, and Zhaoqin Liu. A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022. 3

work page 2022
[37]

Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024

Yidan Liu, Jun Yue, Shaobo Xia, Pedram Ghamisi, Weiying Xie, and Leyuan Fang. Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024. 3

work page arXiv 2024
[38]

Revisiting clas- sifier two-sample tests

David Lopez-Paz and Maxime Oquab. Revisiting clas- sifier two-sample tests. InInternational Conference on Learning Representations (ICLR), 2017. 7

work page 2017
[39]

Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020

Jiayi Ma, Wei Yu, Chen Chen, Pengwei Liang, Xiao- jie Guo, and Junjun Jiang. Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020. 3

work page 2020
[40]

Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020

Andrea Meraner, Patrick Ebel, Xiao Xiang Zhu, and Michael Schmitt. Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020. 3

work page 2020
[41]

Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehm- cke, Serge Belongie, Christian Igel, and Nico Lang. Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning. InEuropean Con- ference on Computer Vision, pages 164–182. Springer,

work page
[42]

Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models

Li Pang, Xiangyu Rui, Long Cui, Hongzhong Wang, Deyu Meng, and Xiangyong Cao. Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 3005–3014, 2024. 3

work page 2024
[43]

Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023

Z ´arate L Paola, L ´opez S Jes ´us, Arroyo H Christian, and Rinc ´on U Sonia. Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023. 3

work page 2023
[44]

Scalable diffu- sion models with transformers

William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4195–4205, 2023. 3

work page 2023
[45]

Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities

Jiahui Qu, Yuanbo Yang, Wenqian Dong, and Yufei Yang. Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 14731–14739, 2024. 3

work page 2024
[46]

Zero-shot text-to-image generation

Aditya Ramesh, Pavel Pavlov, Gabriel Goh, et al. Zero-shot text-to-image generation. InInternational Conference on Machine Learning, 2021. 3

work page 2021
[47]

High- resolution image synthesis with latent diffusion mod- els

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High- resolution image synthesis with latent diffusion mod- els. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 3, 5

work page 2022
[48]

U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015. 3

work page 2015
[49]

Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealis- tic text-to-image diffusion models with deep language understanding, 2022. 3

work page 2022
[50]

Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024

Neetu Sigger, Quoc-Tuan Vien, Sinh Van Nguyen, Gi- anluca Tozzi, and Tuan Thanh Nguyen. Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024. 3

work page 2024
[51]

Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015

Jascha Sohl-Dickstein, Eric Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015. 3

work page 2015
[52]

Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024

Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Junmin Liu, and Deyu Meng. Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024. 2, 3

work page 2024
[53]

Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024

Jiayuan Tian, Jie Lei, Jiaqing Zhang, Weiying Xie, and Yunsong Li. Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024. 3

work page 2024
[54]

Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation

Aysim Toker, Marvin Eisenberger, Daniel Cremers, and Laura Leal-Taix´e. Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 27695–27705, 2024. 3

work page 2024
[55]

Galileo: Learning global & local features of many remote sensing modalities

Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global & local features of many remote sensing modalities. InForty-second International Conference on Machine Learning, 2025. 2, 4

work page 2025
[56]

Panop- ticon: Advancing any-sensor foundation models for earth observation

Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam Stewart, Zhitong Xiong, Xiao Xi- ang Zhu, Stefan Bauer, and John Chuang. Panop- ticon: Advancing any-sensor foundation models for earth observation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 2204– 2214, 2025. 2

work page 2025
[57]

Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

Ce Wang and Wanjie Sun. Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

work page
[58]

Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019

Lei Wang, Xin Xu, Yue Yu, Rui Yang, Rong Gui, Zhaozhuo Xu, and Fangling Pu. Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019. 3

work page 2019
[59]

Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024

Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, and Liang Lin. Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

work page 2024
[60]

Es- rgan: Enhanced super-resolution generative adversar- ial networks

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Es- rgan: Enhanced super-resolution generative adversar- ial networks. InProceedings of the European confer- ence on computer vision (ECCV) workshops, pages 0– 0, 2018. 3

work page 2018
[61]

Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu

Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu. Towards a unified copernicus foundation model for earth vision,

work page
[62]

Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm

Yihan Wen, Xianping Ma, Xiaokang Zhang, and Man- On Pun. Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm. IEEE Transactions on Geoscience and Remote Sens- ing, 2024. 3

work page 2024
[63]

Xiong, Y

Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Jo ¨elle Hanna, Damian Borth, Ioannis Pa- poutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired founda- tion model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024. 2, 4

work page arXiv 2024
[64]

Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025

Zhiping Yu, Chenyang Liu, Liqin Liu, Zhenwei Shi, and Zhengxia Zou. Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025. 2, 3

work page 2025
[65]

Diffucd: Unsuper- vised hyperspectral image change detection with semantic correlation diffusion model,

Xiangrong Zhang, Shunli Tian, Guanchun Wang, Huiyu Zhou, and Licheng Jiao. Diffucd: Unsuper- vised hyperspectral image change detection with se- mantic correlation diffusion model.arXiv preprint arXiv:2305.12410, 2023. 3

work page arXiv 2023
[66]

Zhang, G

Y . Zhang, G. Tseng, J. Redmon, H. Herzog, F. Bas- tani, H. Sablon, R. Park, J. Morrison, A. Buraczyn- ski, K. Farley, J. Hansen, A. Howe, P. Johnson, M. Otterlee, H. Pitelka, R. Ratner, T. Schmitt, C. Wil- helm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema. OlmoEarth: Earth observation foundation model.https : / / w...

work page 2025
[67]

Changen2: Multi-temporal remote sensing generative change foundation model

Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, and Yanfei Zhong. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

work page 2024
[68]

Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024

Jingyi Zhou, Jiamu Sheng, Peng Ye, Jiayuan Fan, Tong He, Bin Wang, and Tao Chen. Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

work page 2024
[69]

Condition

Xuechao Zou, Kai Li, Junliang Xing, Yu Zhang, Shiy- ing Wang, Lei Jin, and Pin Tao. Diffcr: A fast con- ditional diffusion framework for cloud removal from optical satellite images.IEEE Transactions on Geo- science and Remote Sensing, 62:1–14, 2024. 3 A. Supplementary Material We provide additional qualitative, quantitative, and architectural results that...

work page arXiv 2024
[70]

Example real-image thumbnails are provided for comparison

reveal that TerraMind (blue) concentrates on a few modes, while COP-GEN (green) captures multiple plausible geographic locations with similar terrain and biome properties, consistent with a non-injective mapping. Example real-image thumbnails are provided for comparison. INPUT MODALITIES S2L2A 293U_659R 407U_358R 456U_995L 352U_1041L 388U_486R Figure 14.G...

work page

[1] [1]

Segdiff: Image segmentation with diffusion probabilistic models

Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models.arXiv preprint arXiv:2112.00390, 2021. 3

work page arXiv 2021

[2] [2]

Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023

Tai An, Bin Xue, Chunlei Huo, Shiming Xiang, and Chunhong Pan. Efficient remote sensing image super- resolution via lightweight diffusion models.IEEE Geoscience and Remote Sensing Letters, 2023. 3

work page 2023

[3] [3]

Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning

Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero. Joint-embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 2

work page 2025

[4] [4]

OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. OmniSat: Self-supervised modal- ity fusion for Earth observation.ECCV, 2024. 2, 4

work page 2024

[5] [5]

Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

Wele Gedara Chaminda Bandara, Nithin Gopalakrish- nan Nair, and Vishal M Patel. Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection.arXiv preprint arXiv:2206.11892,

work page arXiv

[6] [6]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 3, 6

work page 2023

[7] [7]

One transformer fits all distributions in multi- modal diffusion at scale

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi- modal diffusion at scale. InInternational Confer- ence on Machine Learning, pages 1692–1717. PMLR,

work page

[8] [8]

Brown, Michal R

Christopher F. Brown, Michal R. Kazmierski, Va- lerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Es- tefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. Al- phaearth found...

work page 2025

[9] [9]

Terrafm: A scalable foundation model for unified multisensor earth observation

Muhammad Sohail Danish, Muhammad Akhtar Mu- nir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fa- had Shahbaz Khan, and Salman Khan. Terrafm: A scalable foundation model for unified multisensor earth observation. 2025. 2

work page 2025

[10] [10]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems, 2021. 3

work page 2021

[11] [11]

Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model

Runmin Dong, Shuai Yuan, Bin Luo, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Weijia Li, Juepeng Zheng, and Haohuan Fu. Building bridges across spa- tial and temporal resolutions: Reference-based super- resolution via change priors and conditional diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2...

work page 2024

[12] [12]

Remote sensing image super-resolution via enhanced back-projection networks

Xiaoyu Dong, Zhihong Xi, Xu Sun, and Lina Yang. Remote sensing image super-resolution via enhanced back-projection networks. InIGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, pages 1480–1483. IEEE, 2020. 3

work page 2020

[13] [13]

Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails

Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot Crowley, and Mikolaj Czerkawski. Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 3

work page 2025

[14] [14]

Taming transformers for high-resolution image syn- thesis

Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image syn- thesis. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 3

work page 2021

[15] [15]

Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025

Daniela Szwarcman et al. Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025. 2

work page 2025

[16] [16]

Copernicus: Europes eyes on Earth, 2025

European Commission. Copernicus: Europes eyes on Earth, 2025. Accessed: 2024-12-30. 1

work page 2025

[17] [17]

Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav

Zhengpeng Feng, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, James Ball, Clement Atzberger, David A. Coomes, Anil Madhavapeddy, Andrew Blake, and Srinivasan Keshav. Tessera: Temporal embeddings of surface spectra for earth representation and analysis,

work page

[18] [18]

Major tom: Expandable datasets for earth observation

Alistair Francis and Mikolaj Czerkawski. Major tom: Expandable datasets for earth observation. InIGARSS 2024-2024 IEEE International Geoscience and Re- mote Sensing Symposium, pages 2935–2940. IEEE,

work page 2024

[19] [19]

Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 23164–23173, 2023. 3

work page 2023

[20] [20]

Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversar- ial nets.Advances in Neural Information Processing Systems, 27, 2014. 3

work page 2014

[21] [21]

Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023

Jiang He, Yajie Li, Qiangqiang Yuan, et al. Td- iffde: A truncated diffusion model for remote sens- ing hyperspectral image denoising.arXiv preprint arXiv:2311.13622, 2023. 3

work page arXiv 2023

[22] [22]

Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Far- ley, et al. Olmoearth: Stable latent image model- ing for multimodal earth observation.arXiv preprint arXiv:2511.13655, 2025. 4

work page arXiv 2025

[23] [23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020. 3

work page 2020

[24] [24]

TerraMind: Large-scale generative multimodality for Earth observation,

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation.arXiv preprint arXiv:2504.11171, 2025. 4

work page arXiv 2025

[25] [25]

Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery

Jia Jia, Geunho Lee, Zhibo Wang, Lyu Zhi, and Yuchu He. Siamese meets diffusion network: Smdnet for en- hanced change detection in high-resolution rs imagery. IEEE Journal of Selected Topics in Applied Earth Ob- servations and Remote Sensing, 2024. 3

work page 2024

[26] [26]

Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025

Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, and Andrea Nascetti. Can gen- erative geospatial diffusion models excel as discrimi- native geospatial foundation models?arXiv preprint arXiv:2503.07890, 2025. 2, 3

work page arXiv 2025

[27] [27]

Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

Minghao Jin, Pengwei Wang, and Yusong Li. Hya- gan: remote sensing image cloud removal based on hy- brid attention generation adversarial network.Interna- tional Journal of Remote Sensing, 45(6):1755–1773,

work page

[28] [28]

Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023

Ran Jing, Fuzhou Duan, Fengxian Lu, Miao Zhang, and Wenji Zhao. Denoising diffusion probabilistic feature-based network for cloud removal in sentinel-2 imagery.Remote Sensing, 15(9):2217, 2023. 3

work page 2023

[29] [29]

Analyz- ing and improving the image quality of StyleGAN

Tero Karras, Samuli Laine, and Timo Aila. Analyz- ing and improving the image quality of StyleGAN. In IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2020. 3

work page 2020

[30] [30]

Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023

Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lo- bell, and Stefano Ermon. Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023. 2, 3

work page arXiv 2023

[31] [31]

Multi-class segmentation from aerial views using recursive noise diffusion

Benedikt Kolbeinsson and Krystian Mikolajczyk. Multi-class segmentation from aerial views using recursive noise diffusion. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8439–8449, 2024. 3

work page 2024

[32] [32]

Improved precision and recall metric for assessing generative models

Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 7

work page 2019

[33] [33]

Lawrence, Victoria L

Bryan N. Lawrence, Victoria L. Bennett, James Churchill, Martin Juckes, Philip Kershaw, Stephen Pascoe, Sam Pepler, Matthew Pritchard, and Ag Stephens. Storing and manipulating environmental big data with jasmin. InIEEE Big Data, pages 1–5, San Francisco, 2013. IEEE. 14

work page 2013

[34] [34]

Detecting out-of-distribution earth observation images with dif- fusion models

Georges Le Bellier and Nicolas Audebert. Detecting out-of-distribution earth observation images with dif- fusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 481–491, 2024. 3

work page 2024

[35] [35]

Mdfl: Multi-domain diffusion-driven feature learn- ing

Daixun Li, Weiying Xie, Jiaqing Zhang, and Yunsong Li. Mdfl: Multi-domain diffusion-driven feature learn- ing. InProceedings of the AAAI conference on artifi- cial intelligence, 2024. 3

work page 2024

[36] [36]

A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022

Yang Liu, Yexin Wang, Kaichang Di, Man Peng, Wen- hui Wan, and Zhaoqin Liu. A generative adversar- ial network for pixel-scale lunar dem generation from high-resolution monocular imagery and low-resolution dem.Remote Sensing, 14(21):5420, 2022. 3

work page 2022

[37] [37]

Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024

Yidan Liu, Jun Yue, Shaobo Xia, Pedram Ghamisi, Weiying Xie, and Leyuan Fang. Diffusion models meet remote sensing: Principles, methods, and per- spectives.arXiv preprint arXiv:2404.08926, 2024. 3

work page arXiv 2024

[38] [38]

Revisiting clas- sifier two-sample tests

David Lopez-Paz and Maxime Oquab. Revisiting clas- sifier two-sample tests. InInternational Conference on Learning Representations (ICLR), 2017. 7

work page 2017

[39] [39]

Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020

Jiayi Ma, Wei Yu, Chen Chen, Pengwei Liang, Xiao- jie Guo, and Junjun Jiang. Pan-gan: An unsupervised pan-sharpening method for remote sensing image fu- sion.Information Fusion, 62:110–120, 2020. 3

work page 2020

[40] [40]

Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020

Andrea Meraner, Patrick Ebel, Xiao Xiang Zhu, and Michael Schmitt. Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion.ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020. 3

work page 2020

[41] [41]

Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehm- cke, Serge Belongie, Christian Igel, and Nico Lang. Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning. InEuropean Con- ference on Computer Vision, pages 164–182. Springer,

work page

[42] [42]

Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models

Li Pang, Xiangyu Rui, Long Cui, Hongzhong Wang, Deyu Meng, and Xiangyong Cao. Hir-diff: Unsu- pervised hyperspectral image restoration via improved diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 3005–3014, 2024. 3

work page 2024

[43] [43]

Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023

Z ´arate L Paola, L ´opez S Jes ´us, Arroyo H Christian, and Rinc ´on U Sonia. Correction of banding errors in satellite images with generative adversarial networks (gan).IEEE Access, 11:51960–51970, 2023. 3

work page 2023

[44] [44]

Scalable diffu- sion models with transformers

William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4195–4205, 2023. 3

work page 2023

[45] [45]

Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities

Jiahui Qu, Yuanbo Yang, Wenqian Dong, and Yufei Yang. Lds2ae: Local diffusion shared-specific autoen- coder for multimodal remote sensing image classifi- cation with arbitrary missing modalities. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 14731–14739, 2024. 3

work page 2024

[46] [46]

Zero-shot text-to-image generation

Aditya Ramesh, Pavel Pavlov, Gabriel Goh, et al. Zero-shot text-to-image generation. InInternational Conference on Machine Learning, 2021. 3

work page 2021

[47] [47]

High- resolution image synthesis with latent diffusion mod- els

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High- resolution image synthesis with latent diffusion mod- els. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 3, 5

work page 2022

[48] [48]

U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation.International Conference on Medical Image Computing and Computer-Assisted Interven- tion, 2015. 3

work page 2015

[49] [49]

Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealis- tic text-to-image diffusion models with deep language understanding, 2022. 3

work page 2022

[50] [50]

Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024

Neetu Sigger, Quoc-Tuan Vien, Sinh Van Nguyen, Gi- anluca Tozzi, and Tuan Thanh Nguyen. Unveiling the potential of diffusion model-based framework with transformer for hyperspectral image classification.Sci- entific Reports, 14(1):8438, 2024. 3

work page 2024

[51] [51]

Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015

Jascha Sohl-Dickstein, Eric Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.Inter- national Conference on Machine Learning, 2015. 3

work page 2015

[52] [52]

Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024

Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Junmin Liu, and Deyu Meng. Crs-diff: Controllable remote sensing image gener- ation with diffusion model.IEEE Transactions on Geoscience and Remote Sensing, 2024. 2, 3

work page 2024

[53] [53]

Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024

Jiayuan Tian, Jie Lei, Jiaqing Zhang, Weiying Xie, and Yunsong Li. Swimdiff: Scene-wide matching con- trastive learning with diffusion constraint for remote sensing image.IEEE Transactions on Geoscience and Remote Sensing, 2024. 3

work page 2024

[54] [54]

Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation

Aysim Toker, Marvin Eisenberger, Daniel Cremers, and Laura Leal-Taix´e. Satsynth: Augmenting image- mask pairs through diffusion models for aerial seman- tic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 27695–27705, 2024. 3

work page 2024

[55] [55]

Galileo: Learning global & local features of many remote sensing modalities

Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global & local features of many remote sensing modalities. InForty-second International Conference on Machine Learning, 2025. 2, 4

work page 2025

[56] [56]

Panop- ticon: Advancing any-sensor foundation models for earth observation

Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam Stewart, Zhitong Xiong, Xiao Xi- ang Zhu, Stefan Bauer, and John Chuang. Panop- ticon: Advancing any-sensor foundation models for earth observation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 2204– 2214, 2025. 2

work page 2025

[57] [57]

Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

Ce Wang and Wanjie Sun. Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior.ISPRS Journal of Photogrammetry and Remote Sensing, 220:125–138,

work page

[58] [58]

Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019

Lei Wang, Xin Xu, Yue Yu, Rui Yang, Rong Gui, Zhaozhuo Xu, and Fangling Pu. Sar-to-optical image translation using supervised cycle-consistent adversar- ial networks.Ieee Access, 7:129136–129149, 2019. 3

work page 2019

[59] [59]

Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024

Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, and Liang Lin. Idf-cr: Iterative dif- fusion process for divide-and-conquer cloud removal in remote-sensing images.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

work page 2024

[60] [60]

Es- rgan: Enhanced super-resolution generative adversar- ial networks

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Es- rgan: Enhanced super-resolution generative adversar- ial networks. InProceedings of the European confer- ence on computer vision (ECCV) workshops, pages 0– 0, 2018. 3

work page 2018

[61] [61]

Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu

Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taix ´e, and Xiao Xiang Zhu. Towards a unified copernicus foundation model for earth vision,

work page

[62] [62]

Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm

Yihan Wen, Xianping Ma, Xiaokang Zhang, and Man- On Pun. Gcd-ddpm: A generative change detec- tion model based on difference-feature guided ddpm. IEEE Transactions on Geoscience and Remote Sens- ing, 2024. 3

work page 2024

[63] [63]

Xiong, Y

Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Jo ¨elle Hanna, Damian Borth, Ioannis Pa- poutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired founda- tion model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024. 2, 4

work page arXiv 2024

[64] [64]

Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025

Zhiping Yu, Chenyang Liu, Liqin Liu, Zhenwei Shi, and Zhengxia Zou. Metaearth: A generative founda- tion model for global-scale remote sensing image gen- eration.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1764–1781, 2025. 2, 3

work page 2025

[65] [65]

Diffucd: Unsuper- vised hyperspectral image change detection with semantic correlation diffusion model,

Xiangrong Zhang, Shunli Tian, Guanchun Wang, Huiyu Zhou, and Licheng Jiao. Diffucd: Unsuper- vised hyperspectral image change detection with se- mantic correlation diffusion model.arXiv preprint arXiv:2305.12410, 2023. 3

work page arXiv 2023

[66] [66]

Zhang, G

Y . Zhang, G. Tseng, J. Redmon, H. Herzog, F. Bas- tani, H. Sablon, R. Park, J. Morrison, A. Buraczyn- ski, K. Farley, J. Hansen, A. Howe, P. Johnson, M. Otterlee, H. Pitelka, R. Ratner, T. Schmitt, C. Wil- helm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema. OlmoEarth: Earth observation foundation model.https : / / w...

work page 2025

[67] [67]

Changen2: Multi-temporal remote sensing generative change foundation model

Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, and Yanfei Zhong. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

work page 2024

[68] [68]

Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024

Jingyi Zhou, Jiamu Sheng, Peng Ye, Jiayuan Fan, Tong He, Bin Wang, and Tao Chen. Exploring multi- timestep multi-stage diffusion features for hyperspec- tral image classification.IEEE Transactions on Geo- science and Remote Sensing, 2024. 3

work page 2024

[69] [69]

Condition

Xuechao Zou, Kai Li, Junliang Xing, Yu Zhang, Shiy- ing Wang, Lei Jin, and Pin Tao. Diffcr: A fast con- ditional diffusion framework for cloud removal from optical satellite images.IEEE Transactions on Geo- science and Remote Sensing, 62:1–14, 2024. 3 A. Supplementary Material We provide additional qualitative, quantitative, and architectural results that...

work page arXiv 2024

[70] [70]

Example real-image thumbnails are provided for comparison

reveal that TerraMind (blue) concentrates on a few modes, while COP-GEN (green) captures multiple plausible geographic locations with similar terrain and biome properties, consistent with a non-injective mapping. Example real-image thumbnails are provided for comparison. INPUT MODALITIES S2L2A 293U_659R 407U_358R 456U_995L 352U_1041L 388U_486R Figure 14.G...

work page