MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
Pith reviewed 2026-05-20 05:24 UTC · model grok-4.3
The pith
A unified model generates and translates between any of five remote sensing modalities by first inferring a shared latent scene.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaEarth-MM enables paired joint generation and any-to-any translation across five modalities within a unified model by adopting a scene-centered joint modeling paradigm that infers a latent scene representation from available observations and then generates target modalities conditioned on this intermediate state.
What carries the argument
The scene-centered joint modeling paradigm, a decoupled architecture that first infers a latent scene representation from available observations and then generates target modalities conditioned on that representation.
If this is right
- The same model can fill in missing modalities by generating from the inferred scene state rather than requiring complete paired inputs.
- Adding more modalities does not require training new pairwise models, improving scalability.
- Generated images and learned scene representations can both be used to improve performance on downstream remote sensing tasks.
Where Pith is reading between the lines
- The latent-scene intermediate step may reduce error accumulation when chaining multiple translations compared with direct modality-to-modality pipelines.
- Similar scene-centered decoupling could be tested in other multi-modal domains such as medical imaging where aligned observations are also scarce.
Load-bearing premise
Multi-modal remote sensing observations share an intrinsic scene consistency that can be captured as a latent representation separate from direct appearance-level mappings.
What would settle it
A controlled test in which the model is given partial modality sets from new geographic regions or sensor types outside the EarthMM training distribution and its output accuracy falls below that of separately trained pairwise translators.
Figures
read the original abstract
Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MetaEarth-MM, a unified generative foundation model for multi-modal remote sensing imagery across five modalities. It proposes a scene-centered joint modeling paradigm in which a latent scene representation is inferred from available observations and used to condition generation of target modalities, enabling paired joint generation and any-to-any translation within a single model. Training is supported by the newly constructed EarthMM dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. The authors claim that extensive experiments demonstrate strong generative capability, robust generalization across tasks, and utility for downstream applications at both data and representation levels.
Significance. If the central claims are substantiated with quantitative evidence, the work could offer a meaningful advance over pairwise cross-modal translation methods by providing a scalable, unified framework that leverages intrinsic scene consistency. The construction and planned release of the large-scale EarthMM dataset, together with code availability, represent concrete strengths that would facilitate reproducibility and further research in remote sensing foundation models.
major comments (2)
- [Abstract and §3] Abstract and §3 (architecture description): The claim that inferring a latent scene representation enables superior any-to-any generation rests on the assumption that this representation disentangles shared scene content from modality-specific cues. No explicit mechanism (e.g., adversarial loss, cycle-consistency constraint on the latent, or modality-agnostic regularization) is described to enforce this separation, leaving open the possibility that the intermediate state functions as a simple bottleneck rather than a true scene-centered pivot.
- [Experiments] Experiments section: The abstract asserts strong generative capability and robust generalization, yet the provided summary supplies no quantitative metrics, error bars, ablation studies, or direct comparisons against pairwise baselines. Without these, the support for the central claim that the scene-centered paradigm outperforms direct appearance-level mappings cannot be evaluated.
minor comments (2)
- A table listing the five modalities and their typical resolutions or characteristics would improve clarity when describing the any-to-any translation tasks.
- The notation used for the latent scene representation and conditioning process should be introduced with an explicit equation or diagram reference in the methodology section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the positive assessment of the work's potential as a unified framework and the value of the EarthMM dataset. We address each major comment in detail below, providing clarifications based on the manuscript content and indicating revisions where they strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (architecture description): The claim that inferring a latent scene representation enables superior any-to-any generation rests on the assumption that this representation disentangles shared scene content from modality-specific cues. No explicit mechanism (e.g., adversarial loss, cycle-consistency constraint on the latent, or modality-agnostic regularization) is described to enforce this separation, leaving open the possibility that the intermediate state functions as a simple bottleneck rather than a true scene-centered pivot.
Authors: We appreciate this insightful observation on the need for explicit disentanglement. The manuscript describes a decoupled architecture where a shared latent scene representation is inferred via modality-specific encoders and then used to condition a unified generator for target modalities. Training relies on a joint reconstruction objective across all five modalities from this latent state, which encourages capture of shared scene content because the representation must support accurate synthesis of diverse observations (e.g., optical, SAR, and hyperspectral). This is not a simple bottleneck, as the latent is explicitly optimized to be sufficient for cross-modal generation rather than modality-specific encoding. However, we acknowledge that additional mechanisms such as cycle-consistency on the latent or modality-agnostic regularization could further substantiate the claim. We will revise §3 to elaborate on the training dynamics and implicit disentanglement effects, and add a brief discussion of this design choice. revision: partial
-
Referee: [Experiments] Experiments section: The abstract asserts strong generative capability and robust generalization, yet the provided summary supplies no quantitative metrics, error bars, ablation studies, or direct comparisons against pairwise baselines. Without these, the support for the central claim that the scene-centered paradigm outperforms direct appearance-level mappings cannot be evaluated.
Authors: The full experiments section (Section 4) presents quantitative results supporting the claims, including FID, PSNR, and SSIM metrics for joint generation and any-to-any translation tasks, with standard deviations reported across multiple runs, ablation studies isolating the scene-centered components versus direct mapping baselines, and comparisons to adapted pairwise methods (e.g., variants of CycleGAN and Pix2Pix for multi-modal settings). These demonstrate consistent improvements in generative quality and generalization. We agree that the initial presentation could be clearer for readers. We will revise the experiments section to include a consolidated summary table of key metrics and ablations at the beginning of the section, and ensure error bars and baseline comparisons are more prominently featured. revision: yes
Circularity Check
No circularity in model architecture or derivation
full rationale
The paper introduces a new decoupled architecture for MetaEarth-MM that infers a latent scene representation from observations and conditions generation on it, presented as a design choice recognizing intrinsic scene consistency. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. The dataset construction and any-to-any translation capability are independent contributions without reduction to prior inputs by construction. This is a standard new-model proposal with self-contained claims.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent scene representation dimensionality
axioms (1)
- domain assumption Multi-modal remote sensing observations share an intrinsic scene consistency
invented entities (1)
-
latent scene representation
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
scene consistency regularization ... encouraging different modal observations of the same scene to yield a latent representation centered on the underlying scene
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
H. Bi, Y . Feng, B. Tong, M. Wang, H. Yu, Y . Mao, H. Chang, W. Diao, P. Wang, Y . Yuet al., “Ringmoe: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[2]
Z. Gong, Z. Wei, D. Wang, X. Hu, X. Ma, H. Chen, Y . Jia, Y . Deng, Z. Ji, X. Zhuet al., “Crossearth: Geospatial vision foundation model for domain generalizable remote sensing semantic segmentation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[3]
Remote sens- ing spatiotemporal vision–language models: A comprehensive survey,
C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi, “Remote sens- ing spatiotemporal vision–language models: A comprehensive survey,” IEEE Geoscience and Remote Sensing Magazine, 2025
work page 2025
-
[4]
A semantic-enhanced multi-modal remote sensing foundation model for earth observation,
K. Wu, Y . Zhang, L. Ru, B. Dang, J. Lao, L. Yu, J. Luo, Z. Zhu, Y . Sun, J. Zhanget al., “A semantic-enhanced multi-modal remote sensing foundation model for earth observation,”Nature Machine Intelligence, vol. 7, no. 8, pp. 1235–1249, 2025
work page 2025
-
[5]
Hypersigma: Hyperspectral intelligence comprehension foundation model,
D. Wang, M. Hu, Y . Jin, Y . Miao, J. Yang, Y . Xu, X. Qin, J. Ma, L. Sun, C. Liet al., “Hypersigma: Hyperspectral intelligence comprehension foundation model,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[6]
Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,
C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi, “Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024
work page 2024
-
[7]
Rscama: Remote sensing image change captioning with state space model,
C. Liu, K. Chen, B. Chen, H. Zhang, Z. Zou, and Z. Shi, “Rscama: Remote sensing image change captioning with state space model,”IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1–5, 2024
work page 2024
-
[8]
Generative artificial intelligence meets synthetic aperture radar: A survey,
Z. Huang, X. Zhang, Z. Tang, F. Xu, M. Datcu, and J. Han, “Generative artificial intelligence meets synthetic aperture radar: A survey,”IEEE Geoscience and Remote Sensing Magazine, vol. 14, no. 1, pp. 6–48, 2026
work page 2026
-
[9]
Ph-gan: Physics-inspired gan for generating sar images under limited data,
X. Zhang, Y . Zhuang, Q. Guo, H. Yang, X. Qian, G. Cheng, J. Han, and Z. Huang, “Ph-gan: Physics-inspired gan for generating sar images under limited data,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 29 075–29 085
work page 2025
-
[10]
Diffusion models meet remote sensing: Principles, methods, and perspectives,
Y . Liu, J. Yue, S. Xia, P. Ghamisi, W. Xie, and L. Fang, “Diffusion models meet remote sensing: Principles, methods, and perspectives,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1– 22, 2024
work page 2024
-
[11]
Image-to-image translation with conditional adversarial networks,
P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125– 1134
work page 2017
-
[12]
C. Liu, K. Chen, R. Zhao, Z. Zou, and Z. Shi, “Text2earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model,”IEEE Geoscience and Remote Sensing Mag- azine, 2025
work page 2025
-
[13]
Z. Dong, Y . Sun, T. Liu, W. Zuo, and Y . Gu, “Earthmapper: Visual autoregressive models for controllable bidirectional satellite-map trans- lation,”arXiv preprint arXiv:2504.19432, 2025
-
[14]
Dogan: Dino- based optical-prior-driven gan for sar-to-optical image translation,
J. He, L. Chen, H. Shi, Y . Chen, J. Yang, and W. Li, “Dogan: Dino- based optical-prior-driven gan for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, 2025
work page 2025
-
[15]
Hsigene: a foundation model for hyperspectral image generation,
L. Pang, X. Cao, D. Tang, S. Xu, X. Bai, F. Zhou, and D. Meng, “Hsigene: a foundation model for hyperspectral image generation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2025
work page 2025
-
[16]
One transformer fits all distributions in multi-modal diffusion at scale,
F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y . Wang, G. Yue, Y . Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 1692–1717
work page 2023
-
[17]
Omniflow: Any-to-any generation with multi-modal rectified flows,
S. Li, K. Kallidromitis, A. Gokul, Z. Liao, Y . Kato, K. Kozuka, and A. Grover, “Omniflow: Any-to-any generation with multi-modal rectified flows,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 13 178–13 188
work page 2025
-
[18]
High-resolution image synthesis and semantic manipulation with condi- tional gans,
T.-C. Wang, M.-Y . Liu, J.-Y . Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with condi- tional gans,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8798–8807
work page 2018
-
[19]
Unpaired image-to-image translation using cycle-consistent adversarial networks,
J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232
work page 2017
-
[20]
Stargan v2: Diverse image synthesis for multiple domains,
Y . Choi, Y . Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8188– 8197
work page 2020
-
[21]
Contrastive learning for unpaired image-to-image translation,
T. Park, A. A. Efros, R. Zhang, and J.-Y . Zhu, “Contrastive learning for unpaired image-to-image translation,” inEuropean conference on computer vision. Springer, 2020, pp. 319–345
work page 2020
-
[22]
Stegogan: Leveraging steganography for non-bijective image-to-image translation,
S. Wu, Y . Chen, S. Mermet, L. Hurni, K. Schindler, N. Gonthier, and L. Landrieu, “Stegogan: Leveraging steganography for non-bijective image-to-image translation,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024, pp. 7922–7931
work page 2024
-
[23]
Layered rendering diffusion model for controllable zero-shot image synthesis,
Z. Qi, G. Huang, C. Liu, and F. Ye, “Layered rendering diffusion model for controllable zero-shot image synthesis,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 426–443
work page 2024
-
[24]
Palette: Image-to-image diffusion models,
C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” inACM SIGGRAPH 2022 conference proceedings, 2022, pp. 1–10
work page 2022
-
[25]
Instructpix2pix: Learning to follow image editing instructions,
T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 392–18 402
work page 2023
-
[26]
Magicbrush: A manually annotated dataset for instruction-guided image editing,
K. Zhang, L. Mo, W. Chen, H. Sun, and Y . Su, “Magicbrush: A manually annotated dataset for instruction-guided image editing,”Advances in Neural Information Processing Systems, vol. 36, pp. 31 428–31 449, 2023
work page 2023
-
[27]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y . Wang, W. Li, X. Jiang, Y . Liu, J. Zhouet al., “Omnigen2: Exploration to advanced multimodal generation,”arXiv preprint arXiv:2506.18871, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Pixwizard: Versatile image-to-image visual assis- 15 tant with open-language instructions,
W. Lin, X. Wei, R. Zhang, L. Zhuo, S. Zhao, S. Huang, H. Teng, J. Xie, Y . Qiao, P. Gaoet al., “Pixwizard: Versatile image-to-image visual assis- 15 tant with open-language instructions,”arXiv preprint arXiv:2409.15278, 2024
-
[29]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847
work page 2023
-
[30]
C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan, “T2i- adapter: Learning adapters to dig out more controllable ability for text- to-image diffusion models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, no. 5, 2024, pp. 4296–4304
work page 2024
-
[31]
L. Huang, D. Chen, Y . Liu, Y . Shen, D. Zhao, and J. Zhou, “Composer: Creative and controllable image synthesis with composable conditions,” arXiv preprint arXiv:2302.09778, 2023
-
[32]
Ominicontrol: Minimal and universal control for diffusion transformer,
Z. Tan, S. Liu, X. Yang, Q. Xue, and X. Wang, “Ominicontrol: Minimal and universal control for diffusion transformer,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14 940–14 950
work page 2025
-
[33]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
B. F. Labs, S. Batifol, A. Blattmann, F. Boesel, S. Consul, C. Diagne, T. Dockhorn, J. English, Z. English, P. Esseret al., “Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space,”arXiv preprint arXiv:2506.15742, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Scalable diffusion models with transformers,
W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205
work page 2023
-
[35]
Scaling rectified flow transformers for high-resolution image synthesis,
P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inForty-first international conference on machine learning, 2024
work page 2024
-
[36]
Unicontrol: A unified diffusion model for controllable visual generation in the wild,
C. Qin, S. Zhang, N. Yu, Y . Feng, X. Yang, Y . Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savareseet al., “Unicontrol: A unified diffusion model for controllable visual generation in the wild,”arXiv preprint arXiv:2305.11147, 2023
-
[37]
Uni-controlnet: All-in-one control to text-to-image diffusion models,
S. Zhao, D. Chen, Y .-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y . K. Wong, “Uni-controlnet: All-in-one control to text-to-image diffusion models,”Advances in Neural Information Processing Systems, vol. 36, pp. 11 127–11 150, 2023
work page 2023
-
[38]
Learning to generate sar images with adversarial autoencoder,
Q. Song, F. Xu, X. X. Zhu, and Y .-Q. Jin, “Learning to generate sar images with adversarial autoencoder,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022
work page 2022
-
[39]
L. Liu, W. Li, Z. Shi, and Z. Zou, “Physics-informed hyperspectral remote sensing image synthesis with deep conditional generative adver- sarial networks,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022
work page 2022
-
[40]
Diverse hyperspectral remote sensing image synthesis with diffusion models,
L. Liu, B. Chen, H. Chen, Z. Zou, and Z. Shi, “Diverse hyperspectral remote sensing image synthesis with diffusion models,”IEEE Transac- tions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023
work page 2023
-
[41]
Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,
B. Chen, L. Liu, C. Liu, Z. Zou, and Z. Shi, “Spectral-cascaded diffusion model for remote sensing image spectral super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024
work page 2024
-
[42]
Hybrid cgan: Coupling global and local features for sar-to-optical image translation,
Z. Wang, Y . Ma, and Y . Zhang, “Hybrid cgan: Coupling global and local features for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022
work page 2022
-
[43]
H. Li, C. Gu, D. Wu, G. Cheng, L. Guo, and H. Liu, “Multiscale generative adversarial network based on wavelet feature learning for sar-to-optical image translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022
work page 2022
-
[44]
Integrating multitemporal sar and optical information for missing optical imagery generation,
C. Dong, G. Yang, Y . Wang, W. Sun, X. Meng, and B. Chen, “Integrating multitemporal sar and optical information for missing optical imagery generation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024
work page 2024
-
[45]
J. Qin, K. Wang, B. Zou, L. Zhang, and J. van de Weijer, “Conditional diffusion model with spatial-frequency refinement for sar-to-optical im- age translation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024
work page 2024
-
[46]
Efficient end-to-end diffusion model for one-step sar-to-optical translation,
J. Qin, B. Zou, H. Li, and L. Zhang, “Efficient end-to-end diffusion model for one-step sar-to-optical translation,”IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2024
work page 2024
-
[47]
Learning sar-to- optical image translation via diffusion models with color memory,
Z. Guo, J. Liu, Q. Cai, Z. Zhang, and S. Mei, “Learning sar-to- optical image translation via diffusion models with color memory,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 14 454–14 470, 2024
work page 2024
-
[48]
Generate your own scotland: Satellite image generation conditioned on maps,
M. Espinosa and E. J. Crowley, “Generate your own scotland: Satellite image generation conditioned on maps,”arXiv preprint arXiv:2308.16648, 2023
-
[49]
Geosynth: Contextually-aware high-resolution satellite image synthesis,
S. Sastry, S. Khanal, A. Dhakal, and N. Jacobs, “Geosynth: Contextually-aware high-resolution satellite image synthesis,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 460–470
work page 2024
-
[50]
Cop- gen-beta: Unified generative modelling of copernicus imagery thumb- nails,
M. Espinosa, V . Marsocci, Y . Jia, E. Crowley, and M. Czerkawski, “Cop- gen-beta: Unified generative modelling of copernicus imagery thumb- nails,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3085–3095
work page 2025
-
[51]
Foundation models for remote sensing and earth observation: A survey,
A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya, “Foundation models for remote sensing and earth observation: A survey,” IEEE Geoscience and Remote Sensing Magazine, 2025
work page 2025
-
[52]
Changen2: Multi-temporal remote sensing generative change foundation model,
Z. Zheng, S. Ermon, D. Kim, L. Zhang, and Y . Zhong, “Changen2: Multi-temporal remote sensing generative change foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
work page 2024
-
[53]
Generating any changes in the noise domain,
Q. Liu, Y . Kuang, J. Yue, P. Ghamisi, W. Xie, and L. Fang, “Generating any changes in the noise domain,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[54]
S. Khanna, P. Liu, L. Zhou, C. Meng, R. Rombach, M. Burke, D. Lobell, and S. Ermon, “Diffusionsat: A generative foundation model for satellite imagery,”arXiv preprint arXiv:2312.03606, 2023
-
[55]
Rsdiff: Remote sensing image generation from text using diffusion model,
A. Sebaq and M. ElHelw, “Rsdiff: Remote sensing image generation from text using diffusion model,”Neural Computing and Applications, vol. 36, no. 36, pp. 23 103–23 111, 2024
work page 2024
-
[56]
A decoupling paradigm with prompt learning for remote sensing image change cap- tioning,
C. Liu, R. Zhao, J. Chen, Z. Qi, Z. Zou, and Z. Shi, “A decoupling paradigm with prompt learning for remote sensing image change cap- tioning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–18, 2023
work page 2023
-
[57]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695
work page 2022
-
[58]
Photorealistic text-to-image diffusion models with deep language understanding,
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,”Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022
work page 2022
-
[59]
Metaearth: A generative foundation model for global-scale remote sensing image generation,
Z. Yu, C. Liu, L. Liu, Z. Shi, and Z. Zou, “Metaearth: A generative foundation model for global-scale remote sensing image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
work page 2024
-
[60]
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
J. Cao, Z. Yu, B. Lin, C. Liu, Z. Shi, and Z. Zou, “Metaearth3d: Unlocking world-scale 3d generation with spatially scalable generative modeling,”arXiv preprint arXiv:2604.22828, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[61]
Crs-diff: Controllable remote sensing image generation with diffusion model,
D. Tang, X. Cao, X. Hou, Z. Jiang, J. Liu, and D. Meng, “Crs-diff: Controllable remote sensing image generation with diffusion model,” IEEE Transactions on Geoscience and Remote Sensing, 2024
work page 2024
-
[62]
Multi-grained guided diffusion for quantity-controlled remote sensing object generation,
Z. Yu, C. Liu, C. Zhong, Z. Zou, and Z. Shi, “Multi-grained guided diffusion for quantity-controlled remote sensing object generation,” IEEE Geoscience and Remote Sensing Letters, 2025
work page 2025
-
[63]
Flow Matching for Generative Modeling
Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,”arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[64]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,”arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[65]
The SEN1-2 Dataset for Deep Learning in SAR-Optical Data Fusion
M. Schmitt, L. H. Hughes, and X. X. Zhu, “The sen1-2 dataset for deep learning in sar-optical data fusion,”arXiv preprint arXiv:1807.01569, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[66]
A comparative analysis of gan- based methods for sar-to-optical image translation,
Y . Zhao, T. Celik, N. Liu, and H.-C. Li, “A comparative analysis of gan- based methods for sar-to-optical image translation,”IEEE Geoscience and Remote Sensing Letters, 2022
work page 2022
-
[67]
J. Luo, Y . Wang, Z. Gu, Y . Qiu, S. Yao, F. Wang, C. Xu, W. Zhang, D. Wang, and Z. Cui, “Mmm-rs: A multi-modal, multi-gsd, multi-scene remote sensing dataset and benchmark for text-to-image generation,” arXiv preprint arXiv:2410.22362, 2024
-
[68]
H. Li, F. Zhu, X. Zheng, M. Liu, and G. Chen, “Mscdunet: A deep learning framework for built-up area change detection integrating multi- spectral, sar, and vhr data,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 5163–5176, 2022
work page 2022
-
[69]
High-resolution sar-to- multispectral image translation based on s2ms-gan,
Y . Liu, Q. Han, H. Yang, and H. Hu, “High-resolution sar-to- multispectral image translation based on s2ms-gan,”Remote Sensing, vol. 16, no. 21, p. 4045, 2024
work page 2024
-
[70]
Land-cover classification with high-resolution remote sensing images using transferable deep models,
X.-Y . Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang, “Land-cover classification with high-resolution remote sensing images using transferable deep models,”Remote Sensing of Environment, vol. 237, p. 111322, 2020
work page 2020
-
[71]
J. Xia, H. Chen, C. Broni-Bediako, Y . Wei, J. Song, and N. Yokoya, “Openearthmap-sar: A benchmark synthetic aperture radar dataset for global high-resolution land cover mapping [software and data sets],” 16 IEEE Geoscience and Remote Sensing Magazine, vol. 13, no. 4, pp. 476–487, 2025
work page 2025
-
[72]
C. Persello, R. H ¨ansch, G. Vivone, K. Chen, Z. Yan, D. Tang, H. Huang, M. Schmitt, and X. Sun, “2023 ieee grss data fusion contest: Large-scale fine-grained building classification for semantic urban reconstruction,”
work page 2023
-
[73]
Available: https://dx.doi.org/10.21227/mrnt-8w27
[Online]. Available: https://dx.doi.org/10.21227/mrnt-8w27
-
[74]
B. Ren, S. Ma, B. Hou, D. Hong, J. Chanussot, J. Wang, and L. Jiao, “A dual-stream high resolution network: Deep fusion of gf-2 and gf- 3 data for land cover classification,”International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102896, 2022
work page 2022
-
[75]
G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map of the world,” inCVPR, 2018
work page 2018
-
[76]
W. Zhang, R. Zhao, Y . Yao, Y . Wan, P. Wu, J. Li, Y . Li, and Y . Zhang, “Multi-resolution sar and optical remote sensing image registration methods: A review, datasets, and future perspectives,”arXiv preprint arXiv:2502.01002, 2025
-
[77]
Automatic registration of optical and sar images via improved phase congruency model,
Y . Xiang, R. Tao, F. Wang, H. You, and B. Han, “Automatic registration of optical and sar images via improved phase congruency model,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 5847–5861, 2020
work page 2020
-
[78]
The qxs-saropt dataset for deep learning in sar-optical data fusion. arxiv 2021,
M. Huang, Y . Xu, L. Qian, W. Shi, Y . Zhang, W. Bao, N. Wang, X. Liu, and X. Xiang, “The qxs-saropt dataset for deep learning in sar-optical data fusion. arxiv 2021,”arXiv preprint arXiv:2103.08259, 2021
-
[79]
M. Schmitt, L. H. Hughes, C. Qiu, and X. X. Zhu, “Sen12ms–a curated dataset of georeferenced multi-spectral sentinel-1/2 imagery for deep learning and data fusion,”arXiv preprint arXiv:1906.07789, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[80]
SpaceNet: A Remote Sensing Dataset and Challenge Series
A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet: A remote sensing dataset and challenge series,”arXiv preprint arXiv:1807.01232, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.