GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing
Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3
The pith
A 2.5-million-sample multimodal dataset with agentic semantic captions enables foundation models with improved transfer and cross-sensor performance in remote sensing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoMeld is a large-scale multimodal dataset consisting of approximately 2.5 million spatially aligned samples across diverse modalities and resolutions, constructed under a unified alignment protocol. It incorporates semantically grounded language supervision generated by an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, thereby encoding measurable cross-modality relationships. The GeoMeld-FM pretraining framework integrates multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment to produce representations that capture both reliable cross-sensor physical consistency and grounded semantics.
What carries the argument
The agentic captioning framework that generates verifiable textual descriptions encoding cross-modality relationships from spectral, terrain, and geographic data, combined with the joint pretraining objective of masked autoencoding, JEPA, and contrastive alignment.
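To make the shape of that joint objective concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the source does not give the loss forms or weights, so the mean-squared-error terms, the one-directional InfoNCE term, the temperature, and the names `joint_loss`, `w_mae`, `w_jepa`, and `w_clip` are all hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE: caption i is the positive for image i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)

def joint_loss(recon, pixels, pred_latent, target_latent,
               image_embs, text_embs,
               w_mae=1.0, w_jepa=1.0, w_clip=1.0):
    """Weighted sum of three terms (weights are assumed, not from the paper)."""
    l_mae = mse(recon, pixels)               # masked-patch reconstruction
    l_jepa = mse(pred_latent, target_latent)  # latent-space prediction
    l_clip = info_nce(image_embs, text_embs)  # caption-vision alignment
    return w_mae * l_mae + w_jepa * l_jepa + w_clip * l_clip
```

The sketch only shows how three separately computed terms could be combined into one scalar; how GeoMeld-FM actually balances or schedules them is not stated in the material summarized here.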
If this is right
- Pretrained models exhibit consistent improvements in performance on various downstream remote sensing tasks.
- The learned representations demonstrate enhanced robustness when applied to data collected by different sensors.
- The dataset and framework together provide a scalable reference for developing semantically grounded multimodal foundation models in remote sensing.
Where Pith is reading between the lines
- Similar approaches combining agentic annotation with multimodal pretraining could be extended to other fields involving heterogeneous sensor data, such as autonomous driving or medical imaging.
- The emphasis on encoding measurable physical relationships in text may reduce hallucinations in generative models for geospatial applications.
- Future work might test whether these representations support zero-shot inference on novel sensor combinations not seen during pretraining.
Load-bearing premise
The agentic captioning framework generates accurate, verifiable annotations that meaningfully encode measurable cross-modality relationships from the input signals.
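One way to picture what "verifiable" could mean for a caption is a check against a spectral index. The sketch below computes NDVI, the standard index (NIR - Red) / (NIR + Red), and tests whether a vegetation phrase in a caption is consistent with it. Everything else is an assumption made for illustration: the phrase list, the threshold ranges, and the function `verify_caption` are hypothetical and are not taken from the GeoMeld pipeline.

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / denom

# Hypothetical phrase -> allowed mean-NDVI range; thresholds are illustrative only.
CLAIM_RANGES = {
    "dense vegetation": (0.6, 1.0),
    "sparse vegetation": (0.2, 0.6),
    "bare soil": (-0.1, 0.2),
}

def verify_caption(caption, red_band, nir_band):
    """Return (phrase, mean_ndvi, is_consistent) for each known phrase found."""
    values = [ndvi(r, n) for r, n in zip(red_band, nir_band)]
    mean_ndvi = sum(values) / len(values)
    results = []
    for phrase, (low, high) in CLAIM_RANGES.items():
        if phrase in caption.lower():
            results.append((phrase, mean_ndvi, low <= mean_ndvi <= high))
    return results
```

A check of this kind only covers claims that map onto a measurable quantity; captions that describe land use or context would need a different form of verification.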
What would settle it
Demonstrating that models trained without the agentic captions or without the full joint pretraining objective achieve equivalent gains in downstream transfer and cross-sensor robustness would challenge the paper's central claim.
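A comparison like this needs some way to decide whether two variants are "equivalent". As a rough sketch, assuming each variant yields one score per downstream task, a paired sign-flip permutation test is one option. The function name, the resample count, and the choice of test are assumptions for illustration; the source does not say which statistical tests the paper uses.

```python
import random

def paired_permutation_test(full_scores, ablated_scores,
                            n_resamples=10000, seed=0):
    """Two-sided sign-flip test on paired per-task score differences."""
    diffs = [f - a for f, a in zip(full_scores, ablated_scores)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each difference is equally likely
        # to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)
```

A large p-value from such a test would be consistent with the ablated model matching the full one, though with only a handful of tasks the test has little power either way.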
Original abstract
Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoMeld, a dataset of approximately 2.5 million spatially aligned multimodal remote sensing samples spanning diverse modalities and resolutions under a unified alignment protocol. It provides semantically grounded language supervision via an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and geographic metadata to encode cross-modality relationships. The authors also present GeoMeld-FM, a pretraining framework combining multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment, claiming that this joint objective yields representations capturing both physical consistency and grounded semantics, with experiments demonstrating consistent gains in downstream transfer and cross-sensor robustness.
Significance. If the reported gains hold under rigorous validation and the agentic captions prove to be accurate encodings of measurable physical relationships, the work could establish a useful reference for scalable, semantically grounded multimodal pretraining in remote sensing. The integration of physical consistency objectives with language supervision addresses a recognized gap in current foundation models for the domain. The dataset scale and alignment protocol are potentially valuable contributions, though their impact hinges on the reliability of the supervision and the strength of the empirical evidence.
Major comments (2)
- Abstract: The claim that 'experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness' provides no information on baselines, evaluation metrics, statistical tests, data splits, or potential confounding factors. This absence prevents verification of the central empirical claim and is load-bearing for assessing whether the proposed pretraining framework delivers the stated improvements.
- Abstract: The agentic captioning framework is presented as synthesizing and verifying annotations that encode 'measurable cross-modality relationships' from spectral signals, terrain statistics, and geographic metadata, yet the manuscript supplies no quantitative validation metrics such as caption error rates, human agreement scores, or ablations on caption fidelity. Without these, it is unclear whether the language supervision is reliable or merely plausible, directly undermining the semantic grounding premise of both the dataset and the pretraining objectives.
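The validation metrics named in the second comment are simple to compute once judgments exist. The sketch below shows Cohen's kappa for agreement between two human raters and a plain caption error rate. The label strings "correct" and "incorrect" and both function names are assumptions for illustration; the manuscript's eventual protocol may differ.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same captions."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each rater labeled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention used here: both raters gave one identical label
    return (observed - expected) / (1.0 - expected)

def caption_error_rate(judgments):
    """Fraction of sampled captions judged 'incorrect'."""
    return sum(1 for j in judgments if j == "incorrect") / len(judgments)
```

Kappa near 1 indicates that raters agree well beyond chance, while a value near 0 means their agreement is about what independent labeling would produce.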
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We provide point-by-point responses to the major comments and outline the revisions we will make to address the concerns about the abstract's description of experiments and the validation of the captioning framework.
- Referee: Abstract: The claim that 'experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness' provides no information on baselines, evaluation metrics, statistical tests, data splits, or potential confounding factors. This absence prevents verification of the central empirical claim and is load-bearing for assessing whether the proposed pretraining framework delivers the stated improvements.
  Authors: We agree that the abstract lacks sufficient detail on the experimental setup, which is necessary for independent verification of the claims. The dedicated Experiments section of the complete manuscript details the baselines (standard masked autoencoding and contrastive learning approaches adapted to our multimodal setting), the evaluation metrics (including accuracy, mIoU, and retrieval metrics), the data splits (with geographic separation to ensure robustness), the statistical tests, and an analysis of confounding factors such as sensor variability. To make this information more immediately accessible, we will revise the abstract to include a brief reference to the evaluation protocol and the nature of the gains observed. This change will be incorporated in the revised manuscript.
  Revision: yes
- Referee: Abstract: The agentic captioning framework is presented as synthesizing and verifying annotations that encode 'measurable cross-modality relationships' from spectral signals, terrain statistics, and geographic metadata, yet the manuscript supplies no quantitative validation metrics such as caption error rates, human agreement scores, or ablations on caption fidelity. Without these, it is unclear whether the language supervision is reliable or merely plausible, directly undermining the semantic grounding premise of both the dataset and the pretraining objectives.
  Authors: We recognize that quantitative metrics on caption quality are important to substantiate the semantic grounding. The manuscript describes the agentic framework and its verification mechanisms based on cross-referencing with spectral signals and metadata, but does not provide aggregate quantitative metrics or ablations. In the revised manuscript, we will include a new analysis subsection reporting caption validation metrics, such as error rates from automated consistency checks, human agreement scores on a sampled set of captions, and ablations demonstrating the effect of caption quality on pretraining outcomes. These additions will directly address the concern and provide evidence for the reliability of the language supervision.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper introduces a new large-scale dataset (GeoMeld) constructed via a unified alignment protocol and an agentic captioning process, then defines a pretraining framework (GeoMeld-FM) that combines standard components: multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment. No equations, derivations, or load-bearing steps are shown that reduce any claimed result (e.g., downstream gains or cross-sensor robustness) to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain. The central claims rest on the novelty of the collected data and the joint objective applied to it; these are independent of the reported outputs and do not exhibit any of the enumerated circularity patterns.