arxiv: 2503.16683 · v2 · submitted 2025-03-20 · 💻 cs.CV · cs.AI

GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations

Pith reviewed 2026-05-22 22:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords geospatial representations · implicit neural representations · self-supervised contrastive learning · remote sensing · street view imagery · geo-aligned embeddings · vision transformers · location-aware pre-training

The pith

GAIR uses an implicit neural interpolation module to produce continuous geo-aligned representations from unlabeled multi-modal geospatial data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard Vision Transformers fall short for geospatial work because they cannot supply detailed representations at arbitrary positions inside an image. It introduces GAIR, which adds a Neural Implicit Local Interpolation module to ViT so that representations remain defined and aligned at any queried location. Three separate encoders handle overhead imagery, street-view photos, and geolocation metadata; contrastive losses then pull matching locations together across modalities without labels. The resulting embeddings are tested on nine different geospatial tasks covering twenty-two datasets, where they exceed both prior geo-foundation models and standard self-supervised baselines that lack the fine-grained spatial alignment. If the claim holds, downstream models for remote-sensing classification, street-view retrieval, and location embedding can be initialized from a single pre-trained checkpoint rather than task-specific training.

Core claim

GAIR extends ViT with a Neural Implicit Local Interpolation module that yields a continuous representation over any point in an overhead image; this module is trained jointly with factorized encoders for remote-sensing imagery, street-view imagery, and geolocation metadata under a location-aware contrastive objective, producing embeddings that remain geographically aligned at arbitrary query locations and that improve accuracy on downstream geospatial benchmarks.

What carries the argument

The Neural Implicit Local Interpolation module, which converts discrete ViT patch features into a continuous function that can be queried at any geographic coordinate inside the image.
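
The idea of turning a discrete patch grid into a function that can be queried at any point can be illustrated with a minimal sketch. It is not the paper's module: it uses plain bilinear (area-weighted) blending of the four nearest patch features, whereas the module described here is learned. The function name `interpolate_patch_features` and the patch-unit coordinate convention are assumptions made for illustration.

```python
import math


def interpolate_patch_features(grid, x, y):
    """Bilinearly interpolate a grid of patch features at a continuous point.

    grid: H x W nested list; grid[r][c] is the feature vector (list of floats)
          of the patch whose centre lies at (c + 0.5, r + 0.5) in patch units.
    x, y: query position in patch units, with 0 <= x <= W and 0 <= y <= H.
    This is an illustrative stand-in, not the learned module from the paper.
    """
    h, w = len(grid), len(grid[0])
    # Shift so patch centres sit on integer coordinates, then clamp so that
    # queries near the image border reuse the edge patches.
    u = min(max(x - 0.5, 0.0), w - 1.0)
    v = min(max(y - 0.5, 0.0), h - 1.0)
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    fu, fv = u - c0, v - r0
    # Each neighbour is weighted by the area of the rectangle opposite to it;
    # the four weights always sum to 1.
    neighbours = [
        (r0, c0, (1 - fu) * (1 - fv)),
        (r0, c1, fu * (1 - fv)),
        (r1, c0, (1 - fu) * fv),
        (r1, c1, fu * fv),
    ]
    dim = len(grid[0][0])
    out = [0.0] * dim
    for r, c, weight in neighbours:
        for k in range(dim):
            out[k] += weight * grid[r][c][k]
    return out
```

A query at a patch centre returns that patch's feature unchanged, and a query between centres returns a blend, so the output varies continuously with the query position rather than jumping at patch boundaries.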

If this is right

  • A single pre-trained GAIR checkpoint can initialize models for both overhead image classification and street-view place recognition without separate fine-tuning pipelines.
  • Representations remain usable at any spatial scale because the interpolation is continuous rather than tied to fixed patch grids.
  • Temporal transfer improves because the geographic alignment is learned from metadata rather than from visual appearance alone.
  • Multi-modal fusion becomes simpler since all modalities are projected into a shared embedding space indexed by location.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interpolation trick could be applied to other grid-based sensors such as LiDAR or weather station arrays to create continuous fields from discrete observations.
  • If the alignment holds under distribution shift, GAIR-style pre-training might reduce the need for expensive labeled geospatial datasets in low-resource regions.
  • One could test whether adding a temporal dimension to the implicit module would allow the model to interpolate across time as well as space.

Load-bearing premise

The implicit interpolation module, trained only with contrastive losses on unlabeled image pairs, will keep producing representations that stay aligned to real geographic positions instead of fitting only the training image layouts.

What would settle it

A controlled test in which query locations are shifted slightly from the training image centers and performance on a held-out geospatial task drops below the level of a standard ViT baseline that does not use the interpolation module.

Figures

Figures reproduced from arXiv: 2503.16683 by Gengchen Mai, Junfeng Jiao, Ni Lao, Zeping Liu, Zhangyu Wang.

Figure 1: Overview of the proposed GAIR architecture, which ...
Figure 2: Overview of the proposed GAIR architecture, which encodes data of three geospatial modalities: geolocation coordinate ...
Figure 3: The model evaluation pipelines on three benchmarks covering 10 tasks. Specifically, after pretraining GAIR, we fine-tune the ...
Figure 4: Visualization of spatial alignment across different modalities in GAIR using heat maps. The red star indicates the geographic ...
Figure 5: An illustration on the pipeline to construct our ...
Figure 6: The architecture of one ablation setting, GAIR-MAE. In this variant, feature representations are extracted independently using ...
Figure 7: Comparison of the number of remote sensing patches ...
Figure 9: More results of cosine similarities between a SV image embedding ...
Figure 10: More results of cosine similarities between a SV image embedding ...
Original abstract

Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We proposed to solve this representation problem with an implicit neural representation (INR) module extending ViT with Neural Implicit Local Interpolation, which produces a continuous RS image representation covering arbitrary location in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project different modalities into the embedding space, and the INR module is used to further align these representations geographically, which are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAIR, a self-supervised contrastive pre-training method that augments a ViT backbone with an implicit neural representation (INR) module, Neural Implicit Local Interpolation, to produce continuous, geo-aligned representations at arbitrary locations within remote-sensing images. Three factorized encoders process RS imagery, street-view imagery, and geolocation metadata; the INR module is used to enforce geographic alignment, and the entire system is trained with contrastive losses on unlabeled multi-modal pairs. The central claim is that GAIR outperforms existing GeoFMs and non-geo-aligned SSL baselines (MoCo V3, MAE) across 9 geospatial tasks and 22 datasets spanning RS, SV, and location-embedding benchmarks. Code is released at https://github.com/zpl99/GAIR.

Significance. If the reported gains are reproducible and the INR module truly generalizes beyond training geometries, the work would constitute a meaningful step toward fine-grained, location-aware geospatial foundation models. The explicit release of training code is a clear strength that supports reproducibility. The significance remains conditional on verification that the INR does not overfit to the spatial sampling patterns of the pre-training pairs.

major comments (2)
  1. [§3.2] §3.2 (Neural Implicit Local Interpolation module): the claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.
  2. [§5] §5 (Experimental results): while aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'We proposed to solve this representation problem' should be revised to present tense for consistency with the rest of the manuscript.
  2. [§3.3] The manuscript should clarify the precise form of the contrastive loss (e.g., temperature, number of negatives) and whether any geo-specific regularization terms are added beyond standard InfoNCE.
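
For reference, the standard InfoNCE loss named in the last comment can be sketched as follows. This is a generic form, not the loss used in the paper: the function name `info_nce`, the use of in-batch negatives, and the default temperature of 0.07 are all assumptions, and they are exactly the details the referee asks the authors to state.

```python
import math


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] and keys[i] are the positive pair.

    All other keys in the batch act as negatives for query i.
    The temperature default is an assumed value, not taken from the paper.
    Input vectors are assumed to be non-zero.
    """
    def normalize(vec):
        norm = math.sqrt(sum(a * a for a in vec))
        return [a / norm for a in vec]

    q = [normalize(vec) for vec in queries]
    k = [normalize(vec) for vec in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching key.
        total += log_denominator - logits[i]
    return total / len(q)
```

In a location-aware setting, `queries` could hold embeddings of one modality and `keys` the embeddings of another modality at the same locations, so that the loss is small only when matching locations are closest in the shared space.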

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the INR module and experimental reporting. We address each major comment below and will make the necessary revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Neural Implicit Local Interpolation module): the claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.

    Authors: We agree that explicit validation of generalization under coordinate shifts and out-of-distribution query points is missing. The contrastive objective aligns features at geolocated points, but this does not directly test arbitrary or shifted locations. In the revision we will add targeted experiments evaluating INR performance on perturbed coordinates and held-out spatial regions, reporting downstream task metrics to substantiate the geographic alignment claim. revision: yes

  2. Referee: [§5] §5 (Experimental results): while aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.

    Authors: We acknowledge the absence of per-task statistical tests, confidence intervals, and a dedicated INR ablation. The current comparisons to MoCo V3 and MAE provide indirect evidence, but do not isolate the INR. In the revised manuscript we will add per-task significance tests (e.g., paired t-tests), 95% confidence intervals where multiple runs exist, and an explicit ablation removing only the INR while retaining the contrastive framework. revision: yes
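
The per-task analysis promised in the second response can be sketched in a few lines. This is a generic paired t-test over matched runs, not the authors' evaluation code: the function names are hypothetical, and the critical value must be looked up by the caller for the chosen confidence level and degrees of freedom (for example, 2.776 for a two-sided 95% interval with 4 degrees of freedom).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (mean difference, t statistic, degrees of freedom).

    scores_a[i] and scores_b[i] must come from matched runs, for example
    the same random seed or the same data split for two methods.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff, mean_diff / std_err, len(diffs) - 1


def paired_confidence_interval(scores_a, scores_b, t_critical):
    """Confidence interval for the mean paired difference.

    t_critical is the two-sided critical value of Student's t for the
    desired level and len(scores_a) - 1 degrees of freedom.
    """
    mean_diff, t_stat, _ = paired_t_statistic(scores_a, scores_b)
    std_err = mean_diff / t_stat if t_stat != 0 else None
    if std_err is None:
        # Mean difference is exactly zero; recompute the standard error.
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = t_critical * abs(std_err)
    return mean_diff - half_width, mean_diff + half_width
```

An interval that excludes zero on a given task indicates that the difference between the two methods is unlikely to be explained by run-to-run variation alone, which is the evidence the referee asks for.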

Circularity Check

0 steps flagged

No circularity: derivation relies on independent empirical evaluation

full rationale

The paper introduces GAIR by extending ViT with an INR module based on Neural Implicit Local Interpolation, factorized encoders for the RS, SV, and geolocation modalities, and contrastive objectives on unlabeled geo-aligned pairs. No equations, self-citations, or uniqueness theorems are invoked that reduce the claimed representations or performance gains to fitted inputs or prior self-referential results by construction. The central claims rest on reported outperformance across 9 tasks and 22 datasets, which constitutes external empirical content rather than a definitional loop. This is the common case of a self-contained architectural proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The implicit representation module itself functions as a learned continuous function whose parameters are optimized during pre-training.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial ...

  2. UNIGEOCLIP: Unified Geospatial Contrastive Learning

    cs.CV 2026-04 unverdicted novelty 7.0

    UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    Pretrain a remote sensing foundation model by pro- moting intra-instance similarity

    Xiao An, Wei He, Jiaqi Zou, Guangyi Yang, and Hongyan Zhang. Pretrain a remote sensing foundation model by pro- moting intra-instance similarity. IEEE Transactions on Geo- science and Remote Sensing, 2024. 5, 7, 13

  2. [2]

    Omnisat: Self-supervised modality fusion for earth observation

    Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2024. 2, 16

  3. [3]

    Omnisat: Self-supervised modality fusion for earth observation

    Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2025. 2

  4. [4]

    Geography-aware self-supervised learning

    Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tan- may, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021. 2

  5. [5]

    Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries

    George Azzari, Meha Jain, and David B Lobell. Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries. Remote Sensing of Environment, 202:129–141, 2017. 1

  6. [6]

    Satlaspretrain: A large-scale dataset for remote sensing image understanding

    Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. Satlaspretrain: A large-scale dataset for remote sensing image understanding. In Proceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 16772–16782, 2023. 7

  7. [7]

    Bird- snap: Large-scale fine-grained visual categorization of birds

    Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2011–2018, 2014. 7, 15

  8. [8]

    Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting

    Ling Cai, Krzysztof Janowicz, Gengchen Mai, Bo Yan, and Rui Zhu. Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting. Transactions in GIS, 24(3):736–755, 2020. 1

  9. [9]

    Ciaosr: Continuous implicit attention-in- attention network for arbitrary-scale image super-resolution

    Jiezhang Cao, Qin Wang, Yongqin Xian, Yawei Li, Bingbing Ni, Zhiming Pi, Kai Zhang, Yulun Zhang, Radu Timofte, and Luc Van Gool. Ciaosr: Continuous implicit attention-in- attention network for arbitrary-scale image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1796–1807, 2023. 3

  10. [10]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

  11. [11]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

  12. [12]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen*, Saining Xie*, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021. 2, 5, 7

  13. [13]

    Learning contin- uous image representation with local implicit image function

    Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning contin- uous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021. 2, 3, 13

  14. [14]

    Spatial implicit neural representations for global-scale species mapping

    Elijah Cole, Grant Van Horn, Christian Lange, Alexander Shepard, Patrick Leary, Pietro Perona, Scott Loarie, and Oisin Mac Aodha. Spatial implicit neural representations for global-scale species mapping. In International Confer- ence on Machine Learning, pages 6320–6342. PMLR, 2023. 1, 3

  15. [15]

    Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery

    Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022. 1, 2, 5, 7, 13, 16

  16. [16]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 5, 7

  17. [17]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3

  18. [18]

    Coin: Compression with implicit neural representations,

    Emilien Dupont, Adam Goli´nski, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. Coin: Compression with implicit neural representations. arXiv preprint arXiv:2103.03123 ,

  19. [19]

    Urban visual intelligence: Uncovering hidden city profiles with street view images

    Zhuangyuan Fan, Fan Zhang, Becky PY Loo, and Carlo Ratti. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences, 120(27):e2220417120, 2023. 6

  20. [20]

    Croma: Remote sensing representations with contrastive radar-optical masked autoencoders

    Anthony Fuller, Koreen Millard, and James Green. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders. Advances in Neural Information Pro- cessing Systems, 36, 2024. 2, 5, 7, 13, 16

  21. [21]

    Implicit diffusion models for continuous super-resolution

    Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yan- jing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 10021– 10030, 2023. 2, 3

  22. [22]

    Lightweight temporal self-attention for classifying satellite images time series

    Vivien Sainte Fare Garnot and Loic Landrieu. Lightweight temporal self-attention for classifying satellite images time series. In Advanced Analytics and Learning on Temporal Data: 5th ECML PKDD Workshop, AALTD 2020, Ghent, Belgium, September 18, 2020, Revised Selected Papers 6 , pages 171–181. Springer, 2020. 5, 7

  23. [23]

    Bootstrap your own latent-a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doer- sch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020. 2

  24. [24]

    Skysense: A multi-modal remote sensing 9 foundation model towards universal interpretation for earth observation imagery

    Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingx- iang Hu, et al. Skysense: A multi-modal remote sensing 9 foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 27672–27683, ...

  25. [25]

    Deep learning for multi-year enso forecasts

    Yoo-Geun Ham, Jeong-Hwan Kim, and Jing-Jia Luo. Deep learning for multi-year enso forecasts. Nature, 573(7775): 568–572, 2019. 1

  26. [26]

    Momentum contrast for unsupervised visual rep- resentation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9729–9738, 2020. 2, 4

  27. [27]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 16000– 16009, 2022. 2, 5, 6, 7, 13

  28. [28]

    Spectralgpt: Spectral remote sensing foundation model

    Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiup- ing Jia, et al. Spectralgpt: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 7, 16

  29. [29]

    Global streetscapes—a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics

    Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. Global streetscapes—a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogramme- try and Remote Sensing, 215:216–238, 2024. 5, 13, 14

  30. [30]

    Chia-Yu Hsu, Wenwen Li, and Sizhe Wang. Geospatial foun- dation models for image analysis: Evaluating and enhancing nasa-ibm prithvi’s domain adaptability.International Journal of Geographical Information Science, pages 1–30, 2024. 2

  31. [31]

    Spatial transformer networks

    Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural information processing systems, 28, 2015. 3

  32. [32]

    Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting

    Pallavi Jain, Dino Ienco, Roberto Interdonato, Tristan Berchoux, and Diego Marcos. Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting. arXiv preprint arXiv:2412.08536, 2024. 16

  33. [33]

    Multimodal contrastive learning for remote sensing tasks

    Umangi Jain, Alex Wilson, and Varun Gulshan. Multimodal contrastive learning for remote sensing tasks. arXiv preprint arXiv:2209.02329, 2022. 2

  34. [34]

    org/abs/2310.18660

    Johannes Jakubik, Sujit Roy, CE Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023. 6, 7, 14

  35. [35]

    Combining satellite imagery and machine learning to predict poverty

    Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016. 1

  36. [36]

    Klemmer, E

    Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. Satclip: Global, general- purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179, 2023. 1, 2, 3

  37. [37]

    Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa

    Hao Li, Benjamin Herfort, Sven Lautenbach, Jiaoyan Chen, and Alexander Zipf. Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa. Transactions in GIS, 26(8):3125–3146, 2022. 1

  38. [38]

    Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa

    Hao Li, Jiapan Wang, Johann Maximilian Zollner, Gengchen Mai, Ni Lao, and Martin Werner. Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa. In Proceedings of the 31st ACM International Confer- ence on Advances in Geographic Information Systems, pages ...

  39. [39]

    Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024. 7

  40. [40]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

  41. [41]

    Semantic segmen- tation of crop type in africa: A novel dataset and analysis of deep learning methods

    Rose M Rustowicz, Robin Cheong, Lijing Wang, Stefano Ermon, Marshall Burke, and David Lobell. Semantic segmen- tation of crop type in africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/cvf confer- ence on computer vision and pattern recognition workshops, pages 75–82, 2019. 7, 14

  42. [42]

    Presence- only geographical priors for fine-grained image classification

    Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence- only geographical priors for fine-grained image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9596–9606, 2019. 3, 5, 15

  43. [43]

    Se-kge: A location- aware knowledge graph embedding model for geographic question answering and spatial semantic lifting

    Gengchen Mai, Krzysztof Janowicz, Ling Cai, Rui Zhu, Blake Regalia, Bo Yan, Meilin Shi, and Ni Lao. Se-kge: A location- aware knowledge graph embedding model for geographic question answering and spatial semantic lifting. Transactions in GIS, 24(3):623–655, 2020. 1, 3

  44. [44]

    Multi-scale representation learning for spatial feature distributions using grid cells

    Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. In International Conference on Learning Representations, 2020. 3, 5

  45. [45]

    Multi-scale representation learning for spatial feature distributions using grid cells

    Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. arXiv preprint arXiv:2003.00824, 2020. 2

  46. [46]

    Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations

    Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, and Ste- fano Ermon. Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations. In Interna- tional Conference on Machine Learning, pages 23498–23515. PMLR, 2023. 1, 2, 3

  47. [47]

    Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions

    Gengchen Mai, Yao Xuan, Wenyun Zuo, Yutong He, Ji- aming Song, Stefano Ermon, Krzysztof Janowicz, and Ni Lao. Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions. ISPRS Journal of Photogrammetry and Remote Sensing, 202:439–462, 2023. 1, 3

  48. [48]

    On the opportunities and challenges of foundation models for geoai (vision paper)

    Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. On the opportunities and challenges of foundation models for geoai (vision paper). ACM Transactions on Spatial Algorithms and Systems, 2024. 2

  49. [49]

    Towards the next generation of geospatial artificial intelligence

    Gengchen Mai, Yiqun Xie, Xiaowei Jia, Ni Lao, Jinmeng Rao, Qing Zhu, Zeping Liu, Yao-Yi Chiang, and Junfeng Jiao. Towards the next generation of geospatial artificial intelligence. International Journal of Applied Earth Observation and Geoinformation, 136:104368, 2025. 1

  50. [50]

    Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data

    Oscar Manas, Alexandre Lacoste, Xavier Giró-i-Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2

  51. [51]

    Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024

    Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024. 5, 16

  52. [52]

    Gfm: Building geospatial foundation models via continual pretraining

    Matías Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen, and Mu Li. Gfm: Building geospatial foundation models via continual pretraining. arXiv preprint arXiv:2302.04476, 3, 2023. 7

  53. [53]

    Occupancy networks: Learning 3d reconstruction in function space

    Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019. 3

  54. [54]

    Climax: A foundation model for weather and climate

    Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023. 2

  55. [55]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

  56. [56]

    Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms

    Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny Hänsch, Mila Koeva, and Andrew Nelson. Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023. 6, 14

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 3, 4

  58. [58]

    Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

    Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023. 7

  59. [59]

    A generalizable and accessible approach to machine learning with global satellite imagery

    Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bolliger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, and Solomon Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. Nature communications, 12(1):4392, 2021. 7, 14

  60. [60]

    Implicit neural representations with periodic activation functions

    Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020. 2

  61. [61]

    Ssl4eo-l: Datasets and foundation models for landsat imagery

    Adam Stewart, Nils Lehmann, Isaac Corley, Yi Wang, Yi-Chia Chang, Nassim Ait Ali Braham, Shradha Sehgal, Caleb Robinson, and Arindam Banerjee. Ssl4eo-l: Datasets and foundation models for landsat imagery. Advances in Neural Information Processing Systems, 36:59787–59807,

  62. [62]

    Self-supervised learning of remote sensing scene representations using contrastive multiview coding

    Vladan Stojnic and Vladimir Risojevic. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2

  63. [63]

    ebird: A citizen-based bird observation network in the biological sciences

    Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. ebird: A citizen-based bird observation network in the biological sciences. Biological conservation, 142(10):2282–2292, 2009. 15

  64. [64]

    Ringmo: A remote sensing foundation model with masked image modeling

    Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61:1–22, 2022. 2

  65. [65]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 5

  66. [66]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020. 3

  67. [67]

    Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

    Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, ...

  68. [68]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778,

  69. [69]

    Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization

    Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36, 2024. 3, 4, 6

  70. [70]

    Image as a foreign language: Beit pretraining for vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023. 2

  71. [71]

    Dino-mc: Self-supervised contrastive learning for remote sensing imagery with multi-sized local crops

    Xinye Wanyan, Sachith Seneviratne, Shuchang Shen, and Michael Kirley. Dino-mc: Self-supervised contrastive learning for remote sensing imagery with multi-sized local crops. arXiv preprint arXiv:2303.06670, 2023. 2

  72. [72]

    Mapping human perception of urban landscape from street-view images: A deep-learning approach

    Jingxian Wei, Wenze Yue, Mengmeng Li, and Jiabin Gao. Mapping human perception of urban landscape from street-view images: A deep-learning approach. International Journal of Applied Earth Observation and Geoinformation, 112:102886, 2022. 6

  73. [73]

    Visual transformers: Token-based image representation and processing for computer vision, 2020

    Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020. 5, 7

  74. [74]

    Torchspatial: A location encoding framework and benchmark for spatial representation learning

    Nemin Wu, Qian Cao, Zhangyu Wang, Zeping Liu, Yanlin Qi, Jielu Zhang, Joshua Ni, Xiaobai Yao, Hongxu Ma, Lan Mu, et al. Torchspatial: A location encoding framework and benchmark for spatial representation learning. arXiv preprint arXiv:2406.15658, 2024. 1, 2, 3, 5, 8, 16

  75. [75]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 5

  76. [76]

    Neural plasticity-inspired foundation model for observing the earth crossing modalities

    Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the earth crossing modalities. arXiv e-prints, pages arXiv–2403, 2024. 2, 7, 16

  77. [77]

    Yahoo flickr creative commons 100m

    Yahoo. Yahoo flickr creative commons 100m. http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67. Accessed: 2024-06-03. 7, 15

  78. [78]

    Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning

    Christopher Yeh, Chenlin Meng, Sherrie Wang, Anne Driscoll, Erik Rozi, Patrick Liu, Jihyeon Lee, Marshall Burke, David B Lobell, and Stefano Ermon. Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning. arXiv preprint arXiv:2111.04724, 2021. 14

  79. [79]

    Deep gaussian process for crop yield prediction based on remote sensing data

    Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Proceedings of the AAAI conference on artificial intelligence, 2017. 1

  80. [80]

    Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions

    Dazhou Yu, Riyang Bao, Gengchen Mai, and Liang Zhao. Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions. arXiv preprint arXiv:2502.18470, 2025. 1

Showing first 80 references.