MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Chenyang Liu; Jinqi Cao; Qinzhe Yang; Siwei Yu; Zhengxia Zou; Zhenwei Shi; Zhiping Yu

arxiv: 2605.20090 · v1 · pith:3GWGC6HHnew · submitted 2026-05-19 · 💻 cs.CV

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

Pith reviewed 2026-05-20 05:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal image generation · remote sensing · cross-modal translation · latent scene representation · unified generative model · Earth observation

The pith

A unified model generates and translates between any of five remote sensing modalities by first inferring a shared latent scene.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MetaEarth-MM as a single generative model that performs both paired joint generation and any-to-any translation across five remote sensing modalities. It replaces direct appearance-level mappings with a scene-centered approach: the model first extracts a latent scene representation from whatever observations are available, then conditions the generation of target modalities on that intermediate representation. This design is trained on the new EarthMM dataset of 2.8 million multi-resolution images containing 2.2 million aligned pairs. The authors show that the resulting model generalizes across diverse generation tasks and can support downstream Earth observation work at both the data-synthesis and representation levels.

Core claim

MetaEarth-MM enables paired joint generation and any-to-any translation across five modalities within a unified model by adopting a scene-centered joint modeling paradigm that infers a latent scene representation from available observations and then generates target modalities conditioned on this intermediate state.

What carries the argument

The scene-centered joint modeling paradigm, a decoupled architecture that first infers a latent scene representation from available observations and then generates target modalities conditioned on that representation.
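A minimal sketch of the two-stage interface described above, not the paper's implementation. The averaging encoder, the per-modality offsets, and the names `infer_scene`, `generate`, and `translate` are hypothetical stand-ins for learned networks. The point is only that any subset of observed modalities maps to one scene vector, and every target is produced from that vector rather than from another modality directly.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical per-modality offsets standing in for learned, modality-specific decoders.
MODALITY_OFFSET: Dict[str, float] = {"RGB": 0.0, "SAR": 1.0, "NIR": 2.0}


def infer_scene(observations: Dict[str, Vector]) -> Vector:
    """Toy scene inference: average whatever modality vectors are available."""
    if not observations:
        raise ValueError("at least one observed modality is required")
    vectors = list(observations.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def generate(scene: Vector, target: str) -> Vector:
    """Toy generator: render the scene vector in the target modality."""
    return [s + MODALITY_OFFSET[target] for s in scene]


def translate(observations: Dict[str, Vector], targets: List[str]) -> Dict[str, Vector]:
    """Any-to-any translation routed through the inferred scene, never modality to modality."""
    scene = infer_scene(observations)
    return {t: generate(scene, t) for t in targets}


# One observed modality or several: the call is the same.
single = translate({"RGB": [1.0, 3.0]}, ["SAR"])
multi = translate({"RGB": [1.0, 3.0], "NIR": [3.0, 5.0]}, ["SAR", "RGB"])
```

Because the generator only ever sees the scene vector, supporting an additional target in this toy means adding one decoder entry rather than one translator per source modality.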

If this is right

The same model can fill in missing modalities by generating from the inferred scene state rather than requiring complete paired inputs.
Adding more modalities does not require training new pairwise models, improving scalability.
Generated images and learned scene representations can both be used to improve performance on downstream remote sensing tasks.
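The scalability point can be made concrete with a small count. The sketch assumes a pairwise approach trains one translator per ordered (source, target) pair; methods that learn both directions of a pair at once would halve the figure. `directed_pairwise_models` is an illustrative name, and the modality labels are placeholders.

```python
from itertools import permutations


def directed_pairwise_models(num_modalities: int) -> int:
    """Number of ordered (source, target) pairs among distinct modalities."""
    return num_modalities * (num_modalities - 1)


# Placeholder labels; enumerate the pairs explicitly to cross-check the closed form.
modalities = ["A", "B", "C", "D", "E"]
pairs = list(permutations(modalities, 2))

print(len(pairs), directed_pairwise_models(len(modalities)))  # 20 20
```

Under this assumption five modalities need 20 directed translators, and a sixth modality would add 10 more, whereas a unified model adds one modality to a single network.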

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The latent-scene intermediate step may reduce error accumulation when chaining multiple translations compared with direct modality-to-modality pipelines.
Similar scene-centered decoupling could be tested in other multi-modal domains such as medical imaging where aligned observations are also scarce.

Load-bearing premise

Multi-modal remote sensing observations share an intrinsic scene consistency that can be captured as a latent representation separate from direct appearance-level mappings.

What would settle it

A controlled test in which the model is given partial modality sets from new geographic regions or sensor types outside the EarthMM training distribution and its output accuracy falls below that of separately trained pairwise translators.

Figures

Figures reproduced from arXiv: 2605.20090 by Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi, Zhiping Yu.

**Figure 2.** Overall architecture of MetaEarth-MM. The model adopts a decoupled architecture consisting of two components: a scene inference module that infers the latent scene representation from paired noisy observations, and a modality-aware routed generator that predicts modality-specific velocity fields conditioned on the inferred scene representation. During training, a scene consistency regularization is further…

**Figure 3.** Overview of the EarthMM dataset. (a) Global geographic distribution of collected multi-modal samples. (b) Multi-modal data volume matrix, where diagonal entries denote the number of samples for each modality and off-diagonal entries denote the number of aligned modality pairs. (c) Spatial resolution distribution and modality composition across different resolution ranges.

**Figure 4.** Qualitative results of MetaEarth-MM under different generation settings. (a) Examples of unconditional paired joint generation. (b) Examples of cross-modal translation. (c) Zero-shot generation on unseen generation tasks. […] realism, LPIPS [85] for perceptual fidelity, and SSIM [86] for structural similarity. However, for SAR-related generative tasks, the inherent speckle noise makes local intensity statistic…

**Figure 5.** Qualitative comparison on cross-modal translation. MetaEarth-MM generates more realistic and cross-modally consistent images compared with representative image-to-image translation and diffusion-based methods across SAR/RGB, OSM/RGB, NIR/RGB, and PAN/RGB translation tasks. […] all NIR/RGB directions. These results indicate that MetaEarth-MM effectively maintains strict cross-modal consistency and avoids the ov…

**Figure 6.** FID evaluation curves during training for MetaEarth-MM and its ablated variants on cross-modal translation and paired joint generation tasks.
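The Figure 2 caption mentions velocity fields and noisy observations, which suggests a flow-matching style objective. The sketch below assumes the common rectified-flow convention (a straight path from noise at t = 0 to data at t = 1); the paper's exact parameterization is not given in this summary, and all function names are illustrative. It shows the interpolated noisy sample, the regression target for the velocity, and Euler integration of a velocity field at sampling time.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Noisy sample on the straight path: x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> Vector:
    """Regression target along the straight path: dx_t/dt = data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise: Vector,
                 velocity_fn: Callable[[Vector, float], Vector],
                 steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumption: an oracle velocity replaces the learned network. In a model like the
# one in the caption, velocity_fn would also take the scene latent and a modality id.
noise, data = [0.0, 0.0], [1.0, 2.0]
sample = euler_sample(noise, lambda x, t: target_velocity(noise, data))
```

With the oracle velocity the integration recovers the data point up to floating-point error; a trained network would only approximate that field.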

Original abstract

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a unified any-to-any model for five remote sensing modalities via a latent scene representation plus a large new dataset, but the abstract supplies no numbers to show whether the latent actually improves on direct mappings.

The main point is that MetaEarth-MM tries to replace separate pairwise translations with one model that infers a shared latent scene state from available modalities and then generates the rest from that state. They support it with the EarthMM dataset of 2.8 million images and 2.2 million aligned pairs, which is a practical step for a field where complete multi-modal observations are rare.
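To make the "images versus aligned pairs" bookkeeping concrete, here is a sketch of the kind of volume matrix the Figure 3(b) caption describes: diagonal entries count samples per modality, off-diagonal entries count aligned pairs. The modality names are taken from the Figure 5 caption and are assumed to be the five modalities; the records are toy data, not EarthMM statistics, and `volume_matrix` is a hypothetical helper.

```python
from itertools import combinations
from typing import List, Set

# Assumption: these are the five modalities (names as they appear in the Figure 5 caption).
MODALITIES = ["RGB", "SAR", "OSM", "NIR", "PAN"]


def volume_matrix(scenes: List[Set[str]]) -> List[List[int]]:
    """Each scene is the set of modalities available for one aligned location."""
    index = {m: i for i, m in enumerate(MODALITIES)}
    n = len(MODALITIES)
    matrix = [[0] * n for _ in range(n)]
    for available in scenes:
        for m in available:
            matrix[index[m]][index[m]] += 1          # diagonal: samples per modality
        for a, b in combinations(sorted(available), 2):
            matrix[index[a]][index[b]] += 1          # off-diagonal: aligned pairs
            matrix[index[b]][index[a]] += 1          # keep the matrix symmetric
    return matrix


# Toy records only.
toy_scenes = [{"RGB", "SAR"}, {"RGB", "NIR", "PAN"}, {"OSM"}]
matrix = volume_matrix(toy_scenes)
total_images = sum(matrix[i][i] for i in range(len(MODALITIES)))
```

A scene with a single modality contributes an image but no pair, which is one way an image count can exceed a pair count.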

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MetaEarth-MM, a unified generative foundation model for multi-modal remote sensing imagery across five modalities. It proposes a scene-centered joint modeling paradigm in which a latent scene representation is inferred from available observations and used to condition generation of target modalities, enabling paired joint generation and any-to-any translation within a single model. Training is supported by the newly constructed EarthMM dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. The authors claim that extensive experiments demonstrate strong generative capability, robust generalization across tasks, and utility for downstream applications at both data and representation levels.

Significance. If the central claims are substantiated with quantitative evidence, the work could offer a meaningful advance over pairwise cross-modal translation methods by providing a scalable, unified framework that leverages intrinsic scene consistency. The construction and planned release of the large-scale EarthMM dataset, together with code availability, represent concrete strengths that would facilitate reproducibility and further research in remote sensing foundation models.

major comments (2)

Abstract and §3 (architecture description): The claim that inferring a latent scene representation enables superior any-to-any generation rests on the assumption that this representation disentangles shared scene content from modality-specific cues. No explicit mechanism (e.g., adversarial loss, cycle-consistency constraint on the latent, or modality-agnostic regularization) is described to enforce this separation, leaving open the possibility that the intermediate state functions as a simple bottleneck rather than a true scene-centered pivot.
Experiments section: The abstract asserts strong generative capability and robust generalization, yet the provided summary supplies no quantitative metrics, error bars, ablation studies, or direct comparisons against pairwise baselines. Without these, the support for the central claim that the scene-centered paradigm outperforms direct appearance-level mappings cannot be evaluated.
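For concreteness, one form a modality-agnostic regularizer on the latent could take is a consistency penalty that pulls the latents inferred from different modalities of the same scene toward their centroid. This illustrates the kind of mechanism the first comment asks about, using assumed names and a toy vector latent; whether it matches the scene consistency regularization mentioned in the Figure 2 caption cannot be determined from the material here.

```python
from typing import Dict, List

Vector = List[float]


def scene_consistency_penalty(latents: Dict[str, Vector]) -> float:
    """Mean squared distance of each per-modality latent from the shared centroid."""
    vectors = list(latents.values())
    if len(vectors) < 2:
        return 0.0  # nothing to compare with a single modality
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    total = sum((v[i] - centroid[i]) ** 2 for v in vectors for i in range(dim))
    return total / (len(vectors) * dim)


# Latents inferred separately from two modalities of the same (toy) scene.
disagreeing = scene_consistency_penalty({"RGB": [0.0, 0.0], "SAR": [2.0, 2.0]})
agreeing = scene_consistency_penalty({"RGB": [1.0, 1.0], "SAR": [1.0, 1.0]})
```

A penalty of this form is minimized trivially if every latent collapses to a constant, so on its own it does not show disentanglement; it would have to be paired with a generation loss that forces the latent to stay informative.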

minor comments (2)

A table listing the five modalities and their typical resolutions or characteristics would improve clarity when describing the any-to-any translation tasks.
The notation used for the latent scene representation and conditioning process should be introduced with an explicit equation or diagram reference in the methodology section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's potential as a unified framework and the value of the EarthMM dataset. We address each major comment in detail below, providing clarifications based on the manuscript content and indicating revisions where they strengthen the presentation without altering the core contributions.

Referee: Abstract and §3 (architecture description): The claim that inferring a latent scene representation enables superior any-to-any generation rests on the assumption that this representation disentangles shared scene content from modality-specific cues. No explicit mechanism (e.g., adversarial loss, cycle-consistency constraint on the latent, or modality-agnostic regularization) is described to enforce this separation, leaving open the possibility that the intermediate state functions as a simple bottleneck rather than a true scene-centered pivot.

Authors: We appreciate this insightful observation on the need for explicit disentanglement. The manuscript describes a decoupled architecture where a shared latent scene representation is inferred via modality-specific encoders and then used to condition a unified generator for target modalities. Training relies on a joint reconstruction objective across all five modalities from this latent state, which encourages capture of shared scene content because the representation must support accurate synthesis of diverse observations (e.g., optical, SAR, and hyperspectral). This is not a simple bottleneck, as the latent is explicitly optimized to be sufficient for cross-modal generation rather than modality-specific encoding. However, we acknowledge that additional mechanisms such as cycle-consistency on the latent or modality-agnostic regularization could further substantiate the claim. We will revise §3 to elaborate on the training dynamics and implicit disentanglement effects, and add a brief discussion of this design choice.
revision: partial

Referee: Experiments section: The abstract asserts strong generative capability and robust generalization, yet the provided summary supplies no quantitative metrics, error bars, ablation studies, or direct comparisons against pairwise baselines. Without these, the support for the central claim that the scene-centered paradigm outperforms direct appearance-level mappings cannot be evaluated.

Authors: The full experiments section (Section 4) presents quantitative results supporting the claims, including FID, PSNR, and SSIM metrics for joint generation and any-to-any translation tasks, with standard deviations reported across multiple runs, ablation studies isolating the scene-centered components versus direct mapping baselines, and comparisons to adapted pairwise methods (e.g., variants of CycleGAN and Pix2Pix for multi-modal settings). These demonstrate consistent improvements in generative quality and generalization. We agree that the initial presentation could be clearer for readers. We will revise the experiments section to include a consolidated summary table of key metrics and ablations at the beginning of the section, and ensure error bars and baseline comparisons are more prominently featured.
revision: yes
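As a reference point for the metrics named in the response, the sketch below computes PSNR for flat pixel lists and summarizes scores across runs as mean and sample standard deviation. The helper names and the toy values are illustrative, not results from the paper. FID and SSIM need feature extractors or windowed statistics and are left out.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def psnr(reference: List[float], generated: List[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across runs (needs at least two runs)."""
    return mean(scores), stdev(scores)


# Toy values only.
toy_psnr = psnr([0.0, 255.0], [0.0, 0.0])   # MSE = 255^2 / 2, so PSNR = 10 * log10(2)
run_mean, run_std = summarize([30.0, 32.0, 34.0])
```

Reporting the pair returned by `summarize` alongside each metric is one simple way to provide the error bars the referee asks for.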

Circularity Check

0 steps flagged

No circularity in model architecture or derivation

The paper introduces a new decoupled architecture for MetaEarth-MM that infers a latent scene representation from observations and conditions generation on it, presented as a design choice recognizing intrinsic scene consistency. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. The dataset construction and any-to-any translation capability are independent contributions without reduction to prior inputs by construction. This is a standard new-model proposal with self-contained claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a domain assumption of scene consistency and introduces a latent scene representation as an intermediate entity; standard deep-learning hyperparameters are expected but unspecified in the abstract.

free parameters (1)

latent scene representation dimensionality
Architectural hyperparameter required for the decoupled inference-generation pipeline; value not stated in abstract.

axioms (1)

domain assumption: Multi-modal remote sensing observations share an intrinsic scene consistency
Invoked in the abstract as the basis for organizing generation around scene content instead of direct appearance mapping.

invented entities (1)

latent scene representation (no independent evidence)
purpose: Intermediate state that decouples observation inference from target modality generation
New postulated representation introduced in the decoupled architecture; no independent falsifiable evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5791 in / 1180 out tokens · 45329 ms · 2026-05-20T05:24:35.136066+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: scene consistency regularization ... encouraging different modal observations of the same scene to yield a latent representation centered on the underlying scene

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 9 internal anchors

[1] H. Bi, Y. Feng, B. Tong, M. Wang, H. Yu, Y. Mao, H. Chang, W. Diao, P. Wang, Y. Yu et al., “Ringmoe: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[2] Z. Gong, Z. Wei, D. Wang, X. Hu, X. Ma, H. Chen, Y. Jia, Y. Deng, Z. Ji, X. Zhu et al., “Crossearth: Geospatial vision foundation model for domain generalizable remote sensing semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[3] C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi, “Remote sensing spatiotemporal vision–language models: A comprehensive survey,” IEEE Geoscience and Remote Sensing Magazine, 2025.
[4] K. Wu, Y. Zhang, L. Ru, B. Dang, J. Lao, L. Yu, J. Luo, Z. Zhu, Y. Sun, J. Zhang et al., “A semantic-enhanced multi-modal remote sensing foundation model for earth observation,” Nature Machine Intelligence, vol. 7, no. 8, pp. 1235–1249, 2025.
[5] D. Wang, M. Hu, Y. Jin, Y. Miao, J. Yang, Y. Xu, X. Qin, J. Ma, L. Sun, C. Li et al., “Hypersigma: Hyperspectral intelligence comprehension foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[6] C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi, “Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
[7] C. Liu, K. Chen, B. Chen, H. Zhang, Z. Zou, and Z. Shi, “Rscama: Remote sensing image change captioning with state space model,” IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1–5, 2024.
[8] Z. Huang, X. Zhang, Z. Tang, F. Xu, M. Datcu, and J. Han, “Generative artificial intelligence meets synthetic aperture radar: A survey,” IEEE Geoscience and Remote Sensing Magazine, vol. 14, no. 1, pp. 6–48, 2026.
[9] X. Zhang, Y. Zhuang, Q. Guo, H. Yang, X. Qian, G. Cheng, J. Han, and Z. Huang, “Ph-gan: Physics-inspired gan for generating sar images under limited data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 29 075–29 085.
[10] Y. Liu, J. Yue, S. Xia, P. Ghamisi, W. Xie, and L. Fang, “Diffusion models meet remote sensing: Principles, methods, and perspectives,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–22, 2024.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
[12] C. Liu, K. Chen, R. Zhao, Z. Zou, and Z. Shi, “Text2earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model,” IEEE Geoscience and Remote Sensing Magazine, 2025.
[13] Z. Dong, Y. Sun, T. Liu, W. Zuo, and Y. Gu, “Earthmapper: Visual autoregressive models for controllable bidirectional satellite-map translation,” arXiv preprint arXiv:2504.19432, 2025.
[14] J. He, L. Chen, H. Shi, Y. Chen, J. Yang, and W. Li, “Dogan: Dino-based optical-prior-driven gan for sar-to-optical image translation,” IEEE Transactions on Geoscience and Remote Sensing, 2025.
[15] L. Pang, X. Cao, D. Tang, S. Xu, X. Bai, F. Zhou, and D. Meng, “Hsigene: a foundation model for hyperspectral image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2025.
[16] F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” in International Conference on Machine Learning. PMLR, 2023, pp. 1692–1717.
[17] S. Li, K. Kallidromitis, A. Gokul, Z. Liao, Y. Kato, K. Kozuka, and A. Grover, “Omniflow: Any-to-any generation with multi-modal rectified flows,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 13 178–13 188.
[18] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8798–8807.
[19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[20] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8188–8197.
[21] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for unpaired image-to-image translation,” in European conference on computer vision. Springer, 2020, pp. 319–345.
[22] S. Wu, Y. Chen, S. Mermet, L. Hurni, K. Schindler, N. Gonthier, and L. Landrieu, “Stegogan: Leveraging steganography for non-bijective image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7922–7931.
[23] Z. Qi, G. Huang, C. Liu, and F. Ye, “Layered rendering diffusion model for controllable zero-shot image synthesis,” in European Conference on Computer Vision. Springer, 2024, pp. 426–443.
[24] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 conference proceedings, 2022, pp. 1–10.
[25] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 392–18 402.
[26] K. Zhang, L. Mo, W. Chen, H. Sun, and Y. Su, “Magicbrush: A manually annotated dataset for instruction-guided image editing,” Advances in Neural Information Processing Systems, vol. 36, pp. 31 428–31 449, 2023.
[27] C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y. Wang, W. Li, X. Jiang, Y. Liu, J. Zhou et al., “Omnigen2: Exploration to advanced multimodal generation,” arXiv preprint arXiv:2506.18871, 2025. Listed on Pith as “OmniGen2: Towards Instruction-Aligned Multimodal Generation”.
[28] W. Lin, X. Wei, R. Zhang, L. Zhuo, S. Zhao, S. Huang, H. Teng, J. Xie, Y. Qiao, P. Gao et al., “Pixwizard: Versatile image-to-image visual assistant with open-language instructions,” arXiv preprint arXiv:2409.15278, 2024.
[29] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847.
[30] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
[31] L. Huang, D. Chen, Y. Liu, Y. Shen, D. Zhao, and J. Zhou, “Composer: Creative and controllable image synthesis with composable conditions,” arXiv preprint arXiv:2302.09778, 2023.
[32] Z. Tan, S. Liu, X. Yang, Q. Xue, and X. Wang, “Ominicontrol: Minimal and universal control for diffusion transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14 940–14 950.
[33] B. F. Labs, S. Batifol, A. Blattmann, F. Boesel, S. Consul, C. Diagne, T. Dockhorn, J. English, Z. English, P. Esser et al., “Flux.1 kontext: Flow matching for in-context image generation and editing in latent space,” arXiv preprint arXiv:2506.15742, 2025.
[34] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205.
[35] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024.
[36] C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese et al., “Unicontrol: A unified diffusion model for controllable visual generation in the wild,” arXiv preprint arXiv:2305.11147, 2023.
[37] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-controlnet: All-in-one control to text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, pp. 11 127–11 150, 2023.
[38] Q. Song, F. Xu, X. X. Zhu, and Y.-Q. Jin, “Learning to generate sar images with adversarial autoencoder,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
[39] L. Liu, W. Li, Z. Shi, and Z. Zou, “Physics-informed hyperspectral remote sensing image synthesis with deep conditional generative adversarial networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
[40] L. Liu, B. Chen, H. Chen, Z. Zou, and Z. Shi, “Diverse hyperspectral remote sensing image synthesis with diffusion models,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[41]

Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,

B. Chen, L. Liu, C. Liu, Z. Zou, and Z. Shi, “Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

work page 2024
[42]

Hybrid cgan: Coupling global and local features for sar-to-optical image translation,

Z. Wang, Y . Ma, and Y . Zhang, “Hybrid cgan: Coupling global and local features for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022

work page 2022
[43]

Multiscale generative adversarial network based on wavelet feature learning for sar-to-optical image translation,

H. Li, C. Gu, D. Wu, G. Cheng, L. Guo, and H. Liu, “Multiscale generative adversarial network based on wavelet feature learning for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

work page 2022
[44] C. Dong, G. Yang, Y. Wang, W. Sun, X. Meng, and B. Chen, “Integrating multitemporal sar and optical information for missing optical imagery generation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
[45] J. Qin, K. Wang, B. Zou, L. Zhang, and J. van de Weijer, “Conditional diffusion model with spatial-frequency refinement for sar-to-optical image translation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
[46] J. Qin, B. Zou, H. Li, and L. Zhang, “Efficient end-to-end diffusion model for one-step sar-to-optical translation,” IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2024.
[47] Z. Guo, J. Liu, Q. Cai, Z. Zhang, and S. Mei, “Learning sar-to-optical image translation via diffusion models with color memory,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 14 454–14 470, 2024.
[48] M. Espinosa and E. J. Crowley, “Generate your own scotland: Satellite image generation conditioned on maps,” arXiv preprint arXiv:2308.16648, 2023.
[49] S. Sastry, S. Khanal, A. Dhakal, and N. Jacobs, “Geosynth: Contextually-aware high-resolution satellite image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 460–470.
[50] M. Espinosa, V. Marsocci, Y. Jia, E. Crowley, and M. Czerkawski, “Cop-gen-beta: Unified generative modelling of copernicus imagery thumbnails,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3085–3095.
[51] A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya, “Foundation models for remote sensing and earth observation: A survey,” IEEE Geoscience and Remote Sensing Magazine, 2025.
[52] Z. Zheng, S. Ermon, D. Kim, L. Zhang, and Y. Zhong, “Changen2: Multi-temporal remote sensing generative change foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[53] Q. Liu, Y. Kuang, J. Yue, P. Ghamisi, W. Xie, and L. Fang, “Generating any changes in the noise domain,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[54] S. Khanna, P. Liu, L. Zhou, C. Meng, R. Rombach, M. Burke, D. Lobell, and S. Ermon, “Diffusionsat: A generative foundation model for satellite imagery,” arXiv preprint arXiv:2312.03606, 2023.
[55] A. Sebaq and M. ElHelw, “Rsdiff: Remote sensing image generation from text using diffusion model,” Neural Computing and Applications, vol. 36, no. 36, pp. 23 103–23 111, 2024.
[56] C. Liu, R. Zhao, J. Chen, Z. Qi, Z. Zou, and Z. Shi, “A decoupling paradigm with prompt learning for remote sensing image change captioning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–18, 2023.
[57] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[58] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022.
[59] Z. Yu, C. Liu, L. Liu, Z. Shi, and Z. Zou, “Metaearth: A generative foundation model for global-scale remote sensing image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[60] J. Cao, Z. Yu, B. Lin, C. Liu, Z. Shi, and Z. Zou, “Metaearth3d: Unlocking world-scale 3d generation with spatially scalable generative modeling,” arXiv preprint arXiv:2604.22828, 2026.
[61] D. Tang, X. Cao, X. Hou, Z. Jiang, J. Liu, and D. Meng, “Crs-diff: Controllable remote sensing image generation with diffusion model,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[62] Z. Yu, C. Liu, C. Zhong, Z. Zou, and Z. Shi, “Multi-grained guided diffusion for quantity-controlled remote sensing object generation,” IEEE Geoscience and Remote Sensing Letters, 2025.
[63] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022.
[64] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022.
[65] M. Schmitt, L. H. Hughes, and X. X. Zhu, “The sen1-2 dataset for deep learning in sar-optical data fusion,” arXiv preprint arXiv:1807.01569, 2018.
[66] Y. Zhao, T. Celik, N. Liu, and H.-C. Li, “A comparative analysis of gan-based methods for sar-to-optical image translation,” IEEE Geoscience and Remote Sensing Letters, 2022.
[67] J. Luo, Y. Wang, Z. Gu, Y. Qiu, S. Yao, F. Wang, C. Xu, W. Zhang, D. Wang, and Z. Cui, “Mmm-rs: A multi-modal, multi-gsd, multi-scene remote sensing dataset and benchmark for text-to-image generation,” arXiv preprint arXiv:2410.22362, 2024.
[68] H. Li, F. Zhu, X. Zheng, M. Liu, and G. Chen, “Mscdunet: A deep learning framework for built-up area change detection integrating multispectral, sar, and vhr data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 5163–5176, 2022.
[69] Y. Liu, Q. Han, H. Yang, and H. Hu, “High-resolution sar-to-multispectral image translation based on s2ms-gan,” Remote Sensing, vol. 16, no. 21, p. 4045, 2024.
[70] X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang, “Land-cover classification with high-resolution remote sensing images using transferable deep models,” Remote Sensing of Environment, vol. 237, p. 111322, 2020.
[71] J. Xia, H. Chen, C. Broni-Bediako, Y. Wei, J. Song, and N. Yokoya, “Openearthmap-sar: A benchmark synthetic aperture radar dataset for global high-resolution land cover mapping [software and data sets],” IEEE Geoscience and Remote Sensing Magazine, vol. 13, no. 4, pp. 476–487, 2025.
[72] C. Persello, R. Hänsch, G. Vivone, K. Chen, Z. Yan, D. Tang, H. Huang, M. Schmitt, and X. Sun, “2023 ieee grss data fusion contest: Large-scale fine-grained building classification for semantic urban reconstruction,”
[73] [Online]. Available: https://dx.doi.org/10.21227/mrnt-8w27
[74] B. Ren, S. Ma, B. Hou, D. Hong, J. Chanussot, J. Wang, and L. Jiao, “A dual-stream high resolution network: Deep fusion of gf-2 and gf-3 data for land cover classification,” International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102896, 2022.
[75] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map of the world,” in CVPR, 2018.
[76] W. Zhang, R. Zhao, Y. Yao, Y. Wan, P. Wu, J. Li, Y. Li, and Y. Zhang, “Multi-resolution sar and optical remote sensing image registration methods: A review, datasets, and future perspectives,” arXiv preprint arXiv:2502.01002, 2025.
[77] Y. Xiang, R. Tao, F. Wang, H. You, and B. Han, “Automatic registration of optical and sar images via improved phase congruency model,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 5847–5861, 2020.
[78] M. Huang, Y. Xu, L. Qian, W. Shi, Y. Zhang, W. Bao, N. Wang, X. Liu, and X. Xiang, “The qxs-saropt dataset for deep learning in sar-optical data fusion,” arXiv preprint arXiv:2103.08259, 2021.
[79] M. Schmitt, L. H. Hughes, C. Qiu, and X. X. Zhu, “Sen12ms–a curated dataset of georeferenced multi-spectral sentinel-1/2 imagery for deep learning and data fusion,” arXiv preprint arXiv:1906.07789, 2019.
[80] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet: A remote sensing dataset and challenge series,” arXiv preprint arXiv:1807.01232, 2018.

[39] [39]

Physics-informed hyperspectral remote sensing image synthesis with deep conditional generative adver- sarial networks,

L. Liu, W. Li, Z. Shi, and Z. Zou, “Physics-informed hyperspectral remote sensing image synthesis with deep conditional generative adver- sarial networks,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

work page 2022

[40] [40]

Diverse hyperspectral remote sensing image synthesis with diffusion models,

L. Liu, B. Chen, H. Chen, Z. Zou, and Z. Shi, “Diverse hyperspectral remote sensing image synthesis with diffusion models,”IEEE Transac- tions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023

work page 2023

[41] [41]

Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,

B. Chen, L. Liu, C. Liu, Z. Zou, and Z. Shi, “Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

work page 2024

[42] [42]

Hybrid cgan: Coupling global and local features for sar-to-optical image translation,

Z. Wang, Y . Ma, and Y . Zhang, “Hybrid cgan: Coupling global and local features for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022

work page 2022

[43] [43]

Multiscale generative adversarial network based on wavelet feature learning for sar-to-optical image translation,

H. Li, C. Gu, D. Wu, G. Cheng, L. Guo, and H. Liu, “Multiscale generative adversarial network based on wavelet feature learning for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

work page 2022

[44] [44]

Integrating multitemporal sar and optical information for missing optical imagery generation,

C. Dong, G. Yang, Y . Wang, W. Sun, X. Meng, and B. Chen, “Integrating multitemporal sar and optical information for missing optical imagery generation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

work page 2024

[45] [45]

Conditional diffusion model with spatial-frequency refinement for sar-to-optical im- age translation,

J. Qin, K. Wang, B. Zou, L. Zhang, and J. van de Weijer, “Conditional diffusion model with spatial-frequency refinement for sar-to-optical im- age translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

work page 2024

[46] [46]

Efficient end-to-end diffusion model for one-step sar-to-optical translation,

J. Qin, B. Zou, H. Li, and L. Zhang, “Efficient end-to-end diffusion model for one-step sar-to-optical translation,”IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2024

work page 2024

[47] [47]

Learning sar-to- optical image translation via diffusion models with color memory,

Z. Guo, J. Liu, Q. Cai, Z. Zhang, and S. Mei, “Learning sar-to- optical image translation via diffusion models with color memory,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 14 454–14 470, 2024

work page 2024

[48] [48]

Generate your own scotland: Satellite image generation conditioned on maps,

M. Espinosa and E. J. Crowley, “Generate your own scotland: Satellite image generation conditioned on maps,”arXiv preprint arXiv:2308.16648, 2023

work page arXiv 2023

[49] [49]

Geosynth: Contextually-aware high-resolution satellite image synthesis,

S. Sastry, S. Khanal, A. Dhakal, and N. Jacobs, “Geosynth: Contextually-aware high-resolution satellite image synthesis,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 460–470

work page 2024

[50] [50]

Cop- gen-beta: Unified generative modelling of copernicus imagery thumb- nails,

M. Espinosa, V . Marsocci, Y . Jia, E. Crowley, and M. Czerkawski, “Cop- gen-beta: Unified generative modelling of copernicus imagery thumb- nails,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3085–3095

work page 2025

[51] [51]

Foundation models for remote sensing and earth observation: A survey,

A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya, “Foundation models for remote sensing and earth observation: A survey,” IEEE Geoscience and Remote Sensing Magazine, 2025

work page 2025

[52] [52]

Changen2: Multi-temporal remote sensing generative change foundation model,

Z. Zheng, S. Ermon, D. Kim, L. Zhang, and Y . Zhong, “Changen2: Multi-temporal remote sensing generative change foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024

[53] [53]

Generating any changes in the noise domain,

Q. Liu, Y . Kuang, J. Yue, P. Ghamisi, W. Xie, and L. Fang, “Generating any changes in the noise domain,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[54] [54]

Diffusionsat: A generative foundation model for satellite imagery.arXiv preprint arXiv:2312.03606, 2023

S. Khanna, P. Liu, L. Zhou, C. Meng, R. Rombach, M. Burke, D. Lobell, and S. Ermon, “Diffusionsat: A generative foundation model for satellite imagery,”arXiv preprint arXiv:2312.03606, 2023

work page arXiv 2023

[55] [55]

Rsdiff: Remote sensing image generation from text using diffusion model,

A. Sebaq and M. ElHelw, “Rsdiff: Remote sensing image generation from text using diffusion model,”Neural Computing and Applications, vol. 36, no. 36, pp. 23 103–23 111, 2024

work page 2024

[56] [56]

A decoupling paradigm with prompt learning for remote sensing image change cap- tioning,

C. Liu, R. Zhao, J. Chen, Z. Qi, Z. Zou, and Z. Shi, “A decoupling paradigm with prompt learning for remote sensing image change cap- tioning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–18, 2023

work page 2023

[57] [57]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022

[58] [58]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,”Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022

work page 2022

[59] [59]

Metaearth: A generative foundation model for global-scale remote sensing image generation,

Z. Yu, C. Liu, L. Liu, Z. Shi, and Z. Zou, “Metaearth: A generative foundation model for global-scale remote sensing image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024

[60] [60]

MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

J. Cao, Z. Yu, B. Lin, C. Liu, Z. Shi, and Z. Zou, “Metaearth3d: Unlocking world-scale 3d generation with spatially scalable generative modeling,”arXiv preprint arXiv:2604.22828, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

Crs-diff: Controllable remote sensing image generation with diffusion model,

D. Tang, X. Cao, X. Hou, Z. Jiang, J. Liu, and D. Meng, “Crs-diff: Controllable remote sensing image generation with diffusion model,” IEEE Transactions on Geoscience and Remote Sensing, 2024

work page 2024

[62] [62]

Multi-grained guided diffusion for quantity-controlled remote sensing object generation,

Z. Yu, C. Liu, C. Zhong, Z. Zou, and Z. Shi, “Multi-grained guided diffusion for quantity-controlled remote sensing object generation,” IEEE Geoscience and Remote Sensing Letters, 2025

work page 2025

[63] [63]

Flow Matching for Generative Modeling

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,”arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[64] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022

[65] M. Schmitt, L. H. Hughes, and X. X. Zhu, “The sen1-2 dataset for deep learning in sar-optical data fusion,” arXiv preprint arXiv:1807.01569, 2018

[66] Y. Zhao, T. Celik, N. Liu, and H.-C. Li, “A comparative analysis of gan-based methods for sar-to-optical image translation,” IEEE Geoscience and Remote Sensing Letters, 2022

[67] J. Luo, Y. Wang, Z. Gu, Y. Qiu, S. Yao, F. Wang, C. Xu, W. Zhang, D. Wang, and Z. Cui, “Mmm-rs: A multi-modal, multi-gsd, multi-scene remote sensing dataset and benchmark for text-to-image generation,” arXiv preprint arXiv:2410.22362, 2024

[68] H. Li, F. Zhu, X. Zheng, M. Liu, and G. Chen, “Mscdunet: A deep learning framework for built-up area change detection integrating multi-spectral, sar, and vhr data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 5163–5176, 2022

[69] Y. Liu, Q. Han, H. Yang, and H. Hu, “High-resolution sar-to-multispectral image translation based on s2ms-gan,” Remote Sensing, vol. 16, no. 21, p. 4045, 2024

[70] X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang, “Land-cover classification with high-resolution remote sensing images using transferable deep models,” Remote Sensing of Environment, vol. 237, p. 111322, 2020

[71] J. Xia, H. Chen, C. Broni-Bediako, Y. Wei, J. Song, and N. Yokoya, “Openearthmap-sar: A benchmark synthetic aperture radar dataset for global high-resolution land cover mapping [software and data sets],” IEEE Geoscience and Remote Sensing Magazine, vol. 13, no. 4, pp. 476–487, 2025

[72] C. Persello, R. Hänsch, G. Vivone, K. Chen, Z. Yan, D. Tang, H. Huang, M. Schmitt, and X. Sun, “2023 ieee grss data fusion contest: Large-scale fine-grained building classification for semantic urban reconstruction,”

[73] [Online]. Available: https://dx.doi.org/10.21227/mrnt-8w27

[74] B. Ren, S. Ma, B. Hou, D. Hong, J. Chanussot, J. Wang, and L. Jiao, “A dual-stream high resolution network: Deep fusion of gf-2 and gf-3 data for land cover classification,” International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102896, 2022

[75] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map of the world,” in CVPR, 2018

[76] W. Zhang, R. Zhao, Y. Yao, Y. Wan, P. Wu, J. Li, Y. Li, and Y. Zhang, “Multi-resolution sar and optical remote sensing image registration methods: A review, datasets, and future perspectives,” arXiv preprint arXiv:2502.01002, 2025

[77] Y. Xiang, R. Tao, F. Wang, H. You, and B. Han, “Automatic registration of optical and sar images via improved phase congruency model,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 5847–5861, 2020

[78] M. Huang, Y. Xu, L. Qian, W. Shi, Y. Zhang, W. Bao, N. Wang, X. Liu, and X. Xiang, “The qxs-saropt dataset for deep learning in sar-optical data fusion,” arXiv preprint arXiv:2103.08259, 2021

[79] M. Schmitt, L. H. Hughes, C. Qiu, and X. X. Zhu, “Sen12ms–a curated dataset of georeferenced multi-spectral sentinel-1/2 imagery for deep learning and data fusion,” arXiv preprint arXiv:1906.07789, 2019

[80] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet: A remote sensing dataset and challenge series,” arXiv preprint arXiv:1807.01232, 2018