GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations

Gengchen Mai; Junfeng Jiao; Ni Lao; Zeping Liu; Zhangyu Wang

arxiv: 2503.16683 · v2 · submitted 2025-03-20 · 💻 cs.CV · cs.AI

Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai

Pith reviewed 2026-05-22 22:37 UTC · model grok-4.3

keywords geospatial representations · implicit neural representations · self-supervised contrastive learning · remote sensing · street view imagery · geo-aligned embeddings · vision transformers · location-aware pre-training

The pith

GAIR uses an implicit neural interpolation module to produce continuous geo-aligned representations from unlabeled multi-modal geospatial data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard Vision Transformers fall short for geospatial work because they cannot supply detailed representations at arbitrary positions inside an image. It introduces GAIR, which adds a Neural Implicit Local Interpolation module to ViT so that representations remain defined and aligned at any queried location. Three separate encoders handle overhead imagery, street-view photos, and geolocation metadata; contrastive losses then pull matching locations together across modalities without labels. The resulting embeddings are tested on nine different geospatial tasks covering twenty-two datasets, where they exceed both prior geo-foundation models and standard self-supervised baselines that lack the fine-grained spatial alignment. If the claim holds, downstream models for remote-sensing classification, street-view retrieval, and location embedding can be initialized from a single pre-trained checkpoint rather than task-specific training.

Core claim

GAIR extends ViT with a Neural Implicit Local Interpolation module that yields a continuous representation over any point in an overhead image; this module is trained jointly with factorized encoders for remote-sensing imagery, street-view imagery, and geolocation metadata under a location-aware contrastive objective, producing embeddings that remain geographically aligned at arbitrary query locations and that improve accuracy on downstream geospatial benchmarks.

What carries the argument

The Neural Implicit Local Interpolation module, which converts discrete ViT patch features into a continuous function that can be queried at any geographic coordinate inside the image.
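As an illustration of what "a continuous function that can be queried at any coordinate" can mean, the sketch below interpolates a feature vector between the four nearest patch centers using area weights, in the spirit of the local ensemble used by local implicit image functions. It is not the authors' module: the function name `query_feature`, the patch-unit coordinate convention, and the omission of the learned decoder and of feature unfolding are all simplifying assumptions.

```python
import math


def query_feature(grid, x, y):
    """Interpolate a feature vector at the continuous location (x, y).

    grid[r][c] is the feature vector of the patch whose center sits at
    (c + 0.5, r + 0.5) in patch units; (x, y) uses the same units.
    Hypothetical helper: a stand-in for a learned interpolation module.
    """
    rows, cols = len(grid), len(grid[0])
    # Shift so patch centers fall on integer coordinates, then clamp to the
    # span of the centers so border queries reuse the nearest patches.
    u = min(max(x - 0.5, 0.0), cols - 1.0)
    v = min(max(y - 0.5, 0.0), rows - 1.0)
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    du, dv = u - c0, v - r0
    # Each neighbor is weighted by the area of the rectangle diagonally
    # opposite to it, so closer patch centers contribute more.
    neighbors = [
        (grid[r0][c0], (1 - du) * (1 - dv)),
        (grid[r0][c1], du * (1 - dv)),
        (grid[r1][c0], (1 - du) * dv),
        (grid[r1][c1], du * dv),
    ]
    dim = len(grid[0][0])
    return [sum(w * f[i] for f, w in neighbors) for i in range(dim)]
```

With the learned decoder removed, this reduces to bilinear interpolation of patch features. The point is only that the output is defined between patch centers rather than on the patch grid alone.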

If this is right

A single pre-trained GAIR checkpoint can initialize models for both overhead image classification and street-view place recognition without separate pre-training runs; task-specific fine-tuning is still applied.
Representations remain usable at any spatial scale because the interpolation is continuous rather than tied to fixed patch grids.
Temporal transfer improves because the geographic alignment is learned from metadata rather than from visual appearance alone.
Multi-modal fusion becomes simpler since all modalities are projected into a shared embedding space indexed by location.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same interpolation trick could be applied to other grid-based sensors such as LiDAR or weather station arrays to create continuous fields from discrete observations.
If the alignment holds under distribution shift, GAIR-style pre-training might reduce the need for expensive labeled geospatial datasets in low-resource regions.
One could test whether adding a temporal dimension to the implicit module would allow the model to interpolate across time as well as space.

Load-bearing premise

The implicit interpolation module, trained only with contrastive losses on unlabeled image pairs, will keep producing representations that stay aligned to real geographic positions instead of fitting only the training image layouts.

What would settle it

A controlled test in which query locations are shifted slightly from the training image centers and performance on a held-out geospatial task drops below the level of a standard ViT baseline that does not use the interpolation module.
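One way to organize such a test, sketched under assumptions: generate query points at a fixed offset from each reference location, score both models at each offset, and list the offsets at which the interpolating model falls below the baseline. The helper names `shifted_queries` and `failing_radii`, the circular sampling pattern, and the per-radius score dictionaries are hypothetical and do not come from the paper.

```python
import math


def shifted_queries(x, y, radius, n):
    """Return n query points evenly spaced on a circle around (x, y)."""
    return [
        (x + radius * math.cos(2 * math.pi * k / n),
         y + radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def failing_radii(candidate, baseline):
    """Radii at which the candidate model scores below the baseline.

    Both arguments map a shift radius to a held-out task score
    (higher is better) measured at that radius.
    """
    return sorted(r for r in candidate if candidate[r] < baseline[r])
```

An empty list from `failing_radii` would be consistent with the alignment claim over the tested offsets; a non-empty list identifies where it breaks.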

Figures

Figures reproduced from arXiv: 2503.16683 by Gengchen Mai, Junfeng Jiao, Ni Lao, Zeping Liu, Zhangyu Wang.

**Figure 2.** Overview of the proposed GAIR architecture, which encodes data of three geospatial modalities: geolocation coordinate ...

**Figure 3.** The model evaluation pipelines on three benchmarks covering 10 tasks. Specifically, after pretraining GAIR, we fine-tune the ...

**Figure 4.** Visualization of spatial alignment across different modalities in GAIR using heat maps. The red star indicates the geographic ...

**Figure 5.** An illustration on the pipeline to construct our ...

**Figure 6.** The architecture of one ablation setting, GAIR-MAE. In this variant, feature representations are extracted independently using ...

**Figure 7.** Comparison of the number of remote sensing patches ...

**Figure 9.** More results of cosine similarities between a SV image embedding ...

**Figure 10.** More results of cosine similarities between a SV image embedding ...

Original abstract

Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We proposed to solve this representation problem with an implicit neural representation (INR) module extending ViT with Neural Implicit Local Interpolation, which produces a continuous RS image representation covering arbitrary location in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project different modalities into the embedding space, and the INR module is used to further align these representations geographically, which are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GAIR adds an INR interpolation head to align RS, street-view, and geo metadata in SSL, but the abstract gives no evidence the module generalizes beyond training geometries.

The main takeaway is that GAIR adds a Neural Implicit Local Interpolation module to ViT-based models so they can produce representations at arbitrary locations within remote sensing images, then trains it with a contrastive objective that pulls together overhead imagery, street-view photos, and their shared geolocation metadata. What stands out as new is the combination of factorized encoders for the three modalities with the INR head to enforce geographic alignment during self-supervised pre-training. The evaluation covers a lot of ground with 9 tasks and 22 datasets, which gives a decent sense of where it might help. The paper does a reasonable job of stating the motivation for continuous spatial representations in geospatial work and of making the code available.

The soft spot is the lack of any visible check on whether the INR actually generalizes to query points outside the training image geometries. The abstract claims better performance than GeoFMs and non-geo SSL methods, but without ablations on coordinate shifts or out-of-distribution locations it is difficult to know whether the claimed advantage holds for the fine-grained tasks the authors target. The assumption that the module produces useful aligned representations at arbitrary locations is central, and it is untested in the summary we have.

This work is aimed at researchers building or using foundation models for earth observation, urban planning, or any pipeline that mixes satellite and ground-level data with precise locations. It shows clear thinking about the representation gap in current ViTs for geo data, so it deserves a serious referee to examine the full experiments and controls. I would send this to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAIR, a self-supervised contrastive pre-training method that augments a ViT backbone with an implicit neural representation (INR) module, based on Neural Implicit Local Interpolation, to produce continuous, geo-aligned representations at arbitrary locations within remote-sensing images. Three factorized encoders process RS imagery, street-view imagery, and geolocation metadata; the INR module is used to enforce geographic alignment, and the entire system is trained with contrastive losses on unlabeled multi-modal pairs. The central claim is that GAIR outperforms existing GeoFMs and non-geo-aligned SSL baselines (MoCo V3, MAE) across 9 geospatial tasks and 22 datasets spanning RS, SV, and location-embedding benchmarks. Code is released at https://github.com/zpl99/GAIR.

Significance. If the reported gains are reproducible and the INR module truly generalizes beyond training geometries, the work would constitute a meaningful step toward fine-grained, location-aware geospatial foundation models. The explicit release of training code is a clear strength that supports reproducibility. The significance remains conditional on verification that the INR does not overfit to the spatial sampling patterns of the pre-training pairs.

major comments (2)

[§3.2] (Neural Implicit Local Interpolation module) The claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.
[§5] (Experimental results) While aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.

minor comments (2)

[Abstract] The sentence 'We proposed to solve this representation problem' should be revised to present tense for consistency with the rest of the manuscript.
[§3.3] The manuscript should clarify the precise form of the contrastive loss (e.g., temperature, number of negatives) and whether any geo-specific regularization terms are added beyond standard InfoNCE.
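To make the quantities named in the last comment concrete, here is a minimal sketch of a one-directional InfoNCE loss with in-batch negatives. It is a generic textbook form, not the paper's loss: the temperature of 0.07, the use of cosine similarity, and the function names `cosine` and `info_nce` are assumptions chosen for illustration, and inputs are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    Every other entry of `positives` serves as a negative for anchor i,
    so the number of negatives is the batch size minus one.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

A symmetric variant would average `info_nce(a, b)` and `info_nce(b, a)`. Whether GAIR uses this form, a different temperature, or extra geo-specific terms is exactly what the comment asks the authors to state.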

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the INR module and experimental reporting. We address each major comment below and will make the necessary revisions to strengthen the manuscript.

Referee: [§3.2] (Neural Implicit Local Interpolation module) The claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.

Authors: We agree that explicit validation of generalization under coordinate shifts and out-of-distribution query points is missing. The contrastive objective aligns features at geolocated points, but this does not directly test arbitrary or shifted locations. In the revision we will add targeted experiments evaluating INR performance on perturbed coordinates and held-out spatial regions, reporting downstream task metrics to substantiate the geographic alignment claim.

Revision: yes

Referee: [§5] (Experimental results) While aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.

Authors: We acknowledge the absence of per-task statistical tests, confidence intervals, and a dedicated INR ablation. The current comparisons to MoCo V3 and MAE provide indirect evidence, but do not isolate the INR. In the revised manuscript we will add per-task significance tests (e.g., paired t-tests), 95% confidence intervals where multiple runs exist, and an explicit ablation removing only the INR while retaining the contrastive framework.

Revision: yes
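The paired test and confidence interval mentioned in the last response can be sketched as follows, assuming each model is scored on the same runs or seeds so that scores pair up. The function name `paired_t` is hypothetical, and the critical value is passed in by the caller rather than computed, since the Python standard library has no Student t distribution.

```python
import math
import statistics


def paired_t(scores_a, scores_b, t_crit):
    """Return (t statistic, (ci_low, ci_high)) for the mean of a - b.

    scores_a[i] and scores_b[i] must come from the same run or seed.
    t_crit is the two-sided critical value for len(scores_a) - 1 degrees
    of freedom (for example 2.776 for 5 pairs at the 95% level).
    Requires at least two pairs whose differences are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean
    return mean / sem, (mean - t_crit * sem, mean + t_crit * sem)
```

A confidence interval that excludes zero indicates a difference at the chosen level for that task; running this once per task, rather than on pooled scores, matches the per-task reporting the referee requests.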

Circularity Check

0 steps flagged

No circularity: derivation relies on independent empirical evaluation

The paper introduces GAIR by extending ViT with an implicit neural representation (INR) module based on Neural Implicit Local Interpolation, factorized encoders for the RS, SV, and geolocation modalities, and contrastive objectives on unlabeled geo-aligned pairs. No equations, self-citations, or uniqueness theorems are invoked that reduce the claimed representations or performance gains to fitted inputs or prior self-referential results by construction. The central claims rest on reported outperformance across 9 tasks and 22 datasets, which constitutes external empirical content rather than a definitional loop. This is the common case of a self-contained architectural proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The implicit representation module itself functions as a learned continuous function whose parameters are optimized during pre-training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We utilize three factorized neural encoders... novel implicit neural representations (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image's geolocation... trained with contrastive learning objectives from unlabeled data.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: The INR module refines f(r_i) into a localized embedding z^(q)_i through feature unfolding and local ensemble interpolation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
cs.CV · 2026-05 · unverdicted · novelty 7.0

TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial ...

UNIGEOCLIP: Unified Geospatial Contrastive Learning
cs.CV · 2026-04 · unverdicted · novelty 7.0

UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

Pretrain a remote sensing foundation model by pro- moting intra-instance similarity

Xiao An, Wei He, Jiaqi Zou, Guangyi Yang, and Hongyan Zhang. Pretrain a remote sensing foundation model by pro- moting intra-instance similarity. IEEE Transactions on Geo- science and Remote Sensing, 2024. 5, 7, 13

work page 2024
[2]

Omnisat: Self-supervised modality fusion for earth observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2024. 2, 16

work page 2024
[3]

Omnisat: Self-supervised modality fusion for earth observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2025. 2

work page 2025
[4]

Geography-aware self-supervised learning

Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tan- may, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021. 2

work page 2021
[5]

Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries

George Azzari, Meha Jain, and David B Lobell. Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries. Remote Sensing of Environment, 202:129–141, 2017. 1

work page 2017
[6]

Satlaspretrain: A large-scale dataset for remote sensing image understanding

Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. Satlaspretrain: A large-scale dataset for remote sensing image understanding. In Proceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 16772–16782, 2023. 7

work page 2023
[7]

Bird- snap: Large-scale fine-grained visual categorization of birds

Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2011–2018, 2014. 7, 15

work page 2011
[8]

Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting

Ling Cai, Krzysztof Janowicz, Gengchen Mai, Bo Yan, and Rui Zhu. Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting. Transactions in GIS, 24(3):736–755, 2020. 1

work page 2020
[9]

Ciaosr: Continuous implicit attention-in- attention network for arbitrary-scale image super-resolution

Jiezhang Cao, Qin Wang, Yongqin Xian, Yawei Li, Bingbing Ni, Zhiming Pi, Kai Zhang, Yulun Zhang, Radu Timofte, and Luc Van Gool. Ciaosr: Continuous implicit attention-in- attention network for arbitrary-scale image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1796–1807, 2023. 3

work page 2023
[10]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

work page 2021
[11]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

work page 2020
[12]

An empirical study of training self-supervised vision transformers

Xinlei Chen*, Saining Xie*, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021. 2, 5, 7

work page arXiv 2021
[13]

Learning contin- uous image representation with local implicit image function

Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning contin- uous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021. 2, 3, 13

work page 2021
[14]

Spatial implicit neural representations for global-scale species mapping

Elijah Cole, Grant Van Horn, Christian Lange, Alexander Shepard, Patrick Leary, Pietro Perona, Scott Loarie, and Oisin Mac Aodha. Spatial implicit neural representations for global-scale species mapping. In International Confer- ence on Machine Learning, pages 6320–6342. PMLR, 2023. 1, 3

work page 2023
[15]

Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022. 1, 2, 5, 7, 13, 16

work page 2022
[16]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 5, 7

work page 2009
[17]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[18]

Coin: Compression with implicit neural representations,

Emilien Dupont, Adam Goli´nski, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. Coin: Compression with implicit neural representations. arXiv preprint arXiv:2103.03123 ,

work page arXiv
[19]

Urban visual intelligence: Uncovering hidden city profiles with street view images

Zhuangyuan Fan, Fan Zhang, Becky PY Loo, and Carlo Ratti. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences, 120(27):e2220417120, 2023. 6

work page 2023
[20]

Croma: Remote sensing representations with contrastive radar-optical masked autoencoders

Anthony Fuller, Koreen Millard, and James Green. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders. Advances in Neural Information Pro- cessing Systems, 36, 2024. 2, 5, 7, 13, 16

work page 2024
[21]

Implicit diffusion models for continuous super-resolution

Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yan- jing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 10021– 10030, 2023. 2, 3

work page 2023
[22]

Lightweight temporal self-attention for classifying satellite images time series

Vivien Sainte Fare Garnot and Loic Landrieu. Lightweight temporal self-attention for classifying satellite images time series. In Advanced Analytics and Learning on Temporal Data: 5th ECML PKDD Workshop, AALTD 2020, Ghent, Belgium, September 18, 2020, Revised Selected Papers 6 , pages 171–181. Springer, 2020. 5, 7

work page 2020
[23]

Bootstrap your own latent-a new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doer- sch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020. 2

work page 2020
[24]

Skysense: A multi-modal remote sensing 9 foundation model towards universal interpretation for earth observation imagery

Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingx- iang Hu, et al. Skysense: A multi-modal remote sensing 9 foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 27672–27683, ...

work page 2024
[25]

Deep learning for multi-year enso forecasts

Yoo-Geun Ham, Jeong-Hwan Kim, and Jing-Jia Luo. Deep learning for multi-year enso forecasts. Nature, 573(7775): 568–572, 2019. 1

work page 2019
[26]

Momentum contrast for unsupervised visual rep- resentation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9729–9738, 2020. 2, 4

work page 2020
[27]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 16000– 16009, 2022. 2, 5, 6, 7, 13

work page 2022
[28]

Spectralgpt: Spectral remote sensing foundation model

Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiup- ing Jia, et al. Spectralgpt: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 7, 16

work page 2024
[29]

Global streetscapes—a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics

Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. Global streetscapes—a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogramme- try and Remote Sensing, 215:216–238, 2024. 5, 13, 14

work page 2024
[30]

Chia-Yu Hsu, Wenwen Li, and Sizhe Wang. Geospatial foun- dation models for image analysis: Evaluating and enhancing nasa-ibm prithvi’s domain adaptability.International Journal of Geographical Information Science, pages 1–30, 2024. 2

work page 2024
[31]

Spatial transformer networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural information processing systems, 28, 2015. 3

work page 2015
[32]

Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting

Pallavi Jain, Dino Ienco, Roberto Interdonato, Tristan Berchoux, and Diego Marcos. Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting. arXiv preprint arXiv:2412.08536, 2024. 16

work page arXiv 2024
[33]

Multimodal contrastive learning for remote sensing tasks

Umangi Jain, Alex Wilson, and Varun Gulshan. Multimodal contrastive learning for remote sensing tasks. arXiv preprint arXiv:2209.02329, 2022. 2

work page arXiv 2022
[34]

org/abs/2310.18660

Johannes Jakubik, Sujit Roy, CE Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023. 6, 7, 14

work page arXiv 2023
[35]

Combining satellite imagery and machine learning to predict poverty

Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016. 1

work page 2016
[36]

Klemmer, E

Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. Satclip: Global, general- purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179, 2023. 1, 2, 3

work page arXiv 2023
[37]

Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa

Hao Li, Benjamin Herfort, Sven Lautenbach, Jiaoyan Chen, and Alexander Zipf. Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa. Transactions in GIS, 26(8):3125–3146, 2022. 1

work page 2022
[38]

Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa

Hao Li, Jiapan Wang, Johann Maximilian Zollner, Gengchen Mai, Ni Lao, and Martin Werner. Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa. In Proceedings of the 31st ACM International Confer- ence on Advances in Geographic Information Systems, pages ...

work page 2023
[39]

Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024. 7

work page 2024
[40]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

work page 2021
[41]

Semantic segmentation of crop type in africa: A novel dataset and analysis of deep learning methods

Rose M Rustowicz, Robin Cheong, Lijing Wang, Stefano Ermon, Marshall Burke, and David Lobell. Semantic segmentation of crop type in africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 75–82, 2019. 7, 14

[42]

Presence-only geographical priors for fine-grained image classification

Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence-only geographical priors for fine-grained image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9596–9606, 2019. 3, 5, 15

[43]

Se-kge: A location-aware knowledge graph embedding model for geographic question answering and spatial semantic lifting

Gengchen Mai, Krzysztof Janowicz, Ling Cai, Rui Zhu, Blake Regalia, Bo Yan, Meilin Shi, and Ni Lao. Se-kge: A location-aware knowledge graph embedding model for geographic question answering and spatial semantic lifting. Transactions in GIS, 24(3):623–655, 2020. 1, 3

[44]

Multi-scale representation learning for spatial feature distributions using grid cells

Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. In International Conference on Learning Representations, 2020. 3, 5

[45]

Multi-scale representation learning for spatial feature distributions using grid cells

Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. arXiv preprint arXiv:2003.00824, 2020. 2

[46]

Csp: Self-supervised contrastive spatial pre-training for geospatial-visual representations

Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, and Stefano Ermon. Csp: Self-supervised contrastive spatial pre-training for geospatial-visual representations. In International Conference on Machine Learning, pages 23498–23515. PMLR, 2023. 1, 2, 3

[47]

Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions

Gengchen Mai, Yao Xuan, Wenyun Zuo, Yutong He, Jiaming Song, Stefano Ermon, Krzysztof Janowicz, and Ni Lao. Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions. ISPRS Journal of Photogrammetry and Remote Sensing, 202:439–462, 2023. 1, 3

[48]

On the opportunities and challenges of foundation models for geoai (vision paper)

Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. On the opportunities and challenges of foundation models for geoai (vision paper). ACM Transactions on Spatial Algorithms and Systems, 2024. 2

[49]

Towards the next generation of geospatial artificial intelligence

Gengchen Mai, Yiqun Xie, Xiaowei Jia, Ni Lao, Jinmeng Rao, Qing Zhu, Zeping Liu, Yao-Yi Chiang, and Junfeng Jiao. Towards the next generation of geospatial artificial intelligence. International Journal of Applied Earth Observation and Geoinformation, 136:104368, 2025. 1

[50]

Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data

Oscar Manas, Alexandre Lacoste, Xavier Giró-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2

[51]

Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024

Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024. 5, 16

[52]

Gfm: Building geospatial foundation models via continual pretraining

Matías Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen, and Mu Li. Gfm: Building geospatial foundation models via continual pretraining. arXiv preprint arXiv:2302.04476, 3, 2023. 7

[53]

Occupancy networks: Learning 3d reconstruction in function space

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019. 3

[54]

Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023

Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023. 2

[55]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

[56]

Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms

Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny Hänsch, Mila Koeva, and Andrew Nelson. Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023. 6, 14

[57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 3, 4

[58]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023. 7

[59]

A generalizable and accessible approach to machine learning with global satellite imagery

Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bolliger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, and Solomon Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. Nature communications, 12(1):4392, 2021. 7, 14

[60]

Implicit neural representations with periodic activation functions

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020. 2

[61]

Ssl4eo-l: Datasets and foundation models for landsat imagery

Adam Stewart, Nils Lehmann, Isaac Corley, Yi Wang, Yi-Chia Chang, Nassim Ait Ali Braham, Shradha Sehgal, Caleb Robinson, and Arindam Banerjee. Ssl4eo-l: Datasets and foundation models for landsat imagery. Advances in Neural Information Processing Systems, 36:59787–59807,

[62]

Self-supervised learning of remote sensing scene representations using contrastive multiview coding

Vladan Stojnic and Vladimir Risojevic. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2

[63]

ebird: A citizen-based bird observation network in the biological sciences

Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. ebird: A citizen-based bird observation network in the biological sciences. Biological conservation, 142(10):2282–2292, 2009. 15

[64]

Ringmo: A remote sensing foundation model with masked image modeling

Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61:1–22, 2022. 2

[65]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 5

[66]

Fourier features let networks learn high frequency functions in low dimensional domains

Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020. 3

[67]

Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, ...

[68]

The inaturalist species classification and detection dataset

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778,

[69]

Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization

Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36, 2024. 3, 4, 6

[70]

Image as a foreign language: Beit pretraining for vision and vision-language tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023. 2

[71]

Dino-mc: Self-supervised contrastive learning for remote sensing imagery with multi-sized local crops

Xinye Wanyan, Sachith Seneviratne, Shuchang Shen, and Michael Kirley. Dino-mc: Self-supervised contrastive learning for remote sensing imagery with multi-sized local crops. arXiv preprint arXiv:2303.06670, 2023. 2

[72]

Mapping human perception of urban landscape from street-view images: A deep-learning approach

Jingxian Wei, Wenze Yue, Mengmeng Li, and Jiabin Gao. Mapping human perception of urban landscape from street-view images: A deep-learning approach. International Journal of Applied Earth Observation and Geoinformation, 112:102886, 2022. 6

[73]

Visual transformers: Token-based image representation and processing for computer vision, 2020

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020. 5, 7

[74]

Torchspatial: A location encoding framework and benchmark for spatial representation learning

Nemin Wu, Qian Cao, Zhangyu Wang, Zeping Liu, Yanlin Qi, Jielu Zhang, Joshua Ni, Xiaobai Yao, Hongxu Ma, Lan Mu, et al. Torchspatial: A location encoding framework and benchmark for spatial representation learning. arXiv preprint arXiv:2406.15658, 2024. 1, 2, 3, 5, 8, 16

[75]

Unified perceptual parsing for scene understanding

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 5

[76]

Neural plasticity-inspired foundation model for observing the earth crossing modalities

Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the earth crossing modalities. arXiv e-prints, pages arXiv–2403, 2024. 2, 7, 16

[77]

Yahoo flickr creative commons 100m

Yahoo. Yahoo flickr creative commons 100m. http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67. Accessed: 2024-06-03. 7, 15

[78]

Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning

Christopher Yeh, Chenlin Meng, Sherrie Wang, Anne Driscoll, Erik Rozi, Patrick Liu, Jihyeon Lee, Marshall Burke, David B Lobell, and Stefano Ermon. Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning. arXiv preprint arXiv:2111.04724, 2021. 14

[79]

Deep gaussian process for crop yield prediction based on remote sensing data

Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Proceedings of the AAAI conference on artificial intelligence, 2017. 1

[80]

Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions

Dazhou Yu, Riyang Bao, Gengchen Mai, and Liang Zhao. Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions. arXiv preprint arXiv:2502.18470, 2025. 1



[36] [36]

Klemmer, E

Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. Satclip: Global, general- purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179, 2023. 1, 2, 3

work page arXiv 2023

[37] [37]

Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa

Hao Li, Benjamin Herfort, Sven Lautenbach, Jiaoyan Chen, and Alexander Zipf. Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa. Transactions in GIS, 26(8):3125–3146, 2022. 1

work page 2022

[38] [38]

Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa

Hao Li, Jiapan Wang, Johann Maximilian Zollner, Gengchen Mai, Ni Lao, and Martin Werner. Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa. In Proceedings of the 31st ACM International Confer- ence on Advances in Geographic Information Systems, pages ...

work page 2023

[39] [39]

Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024. 7

work page 2024

[40] [40]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

work page 2021

[41] [41]

Semantic segmen- tation of crop type in africa: A novel dataset and analysis of deep learning methods

Rose M Rustowicz, Robin Cheong, Lijing Wang, Stefano Ermon, Marshall Burke, and David Lobell. Semantic segmen- tation of crop type in africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/cvf confer- ence on computer vision and pattern recognition workshops, pages 75–82, 2019. 7, 14

work page 2019

[42] [42]

Presence- only geographical priors for fine-grained image classification

Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence- only geographical priors for fine-grained image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9596–9606, 2019. 3, 5, 15

work page 2019

[43] [43]

Se-kge: A location- aware knowledge graph embedding model for geographic question answering and spatial semantic lifting

Gengchen Mai, Krzysztof Janowicz, Ling Cai, Rui Zhu, Blake Regalia, Bo Yan, Meilin Shi, and Ni Lao. Se-kge: A location- aware knowledge graph embedding model for geographic question answering and spatial semantic lifting. Transactions in GIS, 24(3):623–655, 2020. 1, 3

work page 2020

[44] [44]

Multi-scale representation learning for spatial feature distributions using grid cells

Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. In International Conference on Learning Representations, 2020. 3, 5

work page 2020

[45] [45]

Multi-scale representation learning for spatial feature distributions using grid cells

Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. arXiv preprint arXiv:2003.00824, 2020. 2

work page arXiv 2003

[46] [46]

Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations

Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, and Ste- fano Ermon. Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations. In Interna- tional Conference on Machine Learning, pages 23498–23515. PMLR, 2023. 1, 2, 3

work page 2023

[47] [47]

Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions

Gengchen Mai, Yao Xuan, Wenyun Zuo, Yutong He, Ji- aming Song, Stefano Ermon, Krzysztof Janowicz, and Ni Lao. Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions. ISPRS Journal of Photogrammetry and Remote Sensing, 202:439–462, 2023. 1, 3

work page 2023

[48] [48]

On the opportunities and challenges of foundation models for geoai (vision paper)

Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. On the opportunities and challenges of foundation models for geoai (vision paper). ACM Transac- tions on Spatial Algorithms and Systems, 2024. 2 10

work page 2024

[49] Gengchen Mai, Yiqun Xie, Xiaowei Jia, Ni Lao, Jinmeng Rao, Qing Zhu, Zeping Liu, Yao-Yi Chiang, and Junfeng Jiao. Towards the next generation of geospatial artificial intelligence. International Journal of Applied Earth Observation and Geoinformation, 136:104368, 2025. 1

[50] Oscar Manas, Alexandre Lacoste, Xavier Giró-i-Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2

[51] Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024. 5, 16

[52] Matías Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen, and Mu Li. Gfm: Building geospatial foundation models via continual pretraining. arXiv preprint arXiv:2302.04476, 3, 2023. 7

[53] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019. 3

[54] Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023. 2

[55] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

[56] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny Hänsch, Mila Koeva, and Andrew Nelson. Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023. 6, 14

[57] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 3, 4

[58] Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023. 7

[59] Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bolliger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, and Solomon Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. Nature communications, 12(1):4392, 2021. 7, 14

[60] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020. 2

[61] Adam Stewart, Nils Lehmann, Isaac Corley, Yi Wang, Yi-Chia Chang, Nassim Ait Ali Braham, Shradha Sehgal, Caleb Robinson, and Arindam Banerjee. Ssl4eo-l: Datasets and foundation models for landsat imagery. Advances in Neural Information Processing Systems, 36:59787–59807,

[62] Vladan Stojnic and Vladimir Risojevic. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2

[63] Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. ebird: A citizen-based bird observation network in the biological sciences. Biological conservation, 142(10):2282–2292, 2009. 15

[64] Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61:1–22, 2022. 2

[65] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 5

[66] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020. 3

[67] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, ...

work page 2015

[68] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778,

[69] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36, 2024. 3, 4, 6

[70] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023. 2

[71] Xinye Wanyan, Sachith Seneviratne, Shuchang Shen, and Michael Kirley. Dino-mc: Self-supervised contrastive learning for remote sensing imagery with multi-sized local crops. arXiv preprint arXiv:2303.06670, 2023. 2

[72] Jingxian Wei, Wenze Yue, Mengmeng Li, and Jiabin Gao. Mapping human perception of urban landscape from street-view images: A deep-learning approach. International Journal of Applied Earth Observation and Geoinformation, 112:102886, 2022. 6

[73] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020. 5, 7

[74] Nemin Wu, Qian Cao, Zhangyu Wang, Zeping Liu, Yanlin Qi, Jielu Zhang, Joshua Ni, Xiaobai Yao, Hongxu Ma, Lan Mu, et al. Torchspatial: A location encoding framework and benchmark for spatial representation learning. arXiv preprint arXiv:2406.15658, 2024. 1, 2, 3, 5, 8, 16

[75] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 5

[76] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the earth crossing modalities. arXiv e-prints, pages arXiv–2403, 2024. 2, 7, 16

[77] Yahoo. Yahoo flickr creative commons 100m. http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67. Accessed: 2024-06-03. 7, 15

[78] Christopher Yeh, Chenlin Meng, Sherrie Wang, Anne Driscoll, Erik Rozi, Patrick Liu, Jihyeon Lee, Marshall Burke, David B Lobell, and Stefano Ermon. Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning. arXiv preprint arXiv:2111.04724, 2021. 14

[79] Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Proceedings of the AAAI conference on artificial intelligence, 2017. 1

[80] Dazhou Yu, Riyang Bao, Gengchen Mai, and Liang Zhao. Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions. arXiv preprint arXiv:2502.18470, 2025. 1