GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations
Pith reviewed 2026-05-22 22:37 UTC · model grok-4.3
The pith
GAIR uses an implicit neural interpolation module to produce continuous geo-aligned representations from unlabeled multi-modal geospatial data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAIR extends ViT with a Neural Implicit Local Interpolation module that yields a continuous representation over any point in an overhead image; this module is trained jointly with factorized encoders for remote-sensing imagery, street-view imagery, and geolocation metadata under a location-aware contrastive objective, producing embeddings that remain geographically aligned at arbitrary query locations and that improve accuracy on downstream geospatial benchmarks.
What carries the argument
The Neural Implicit Local Interpolation module, which converts discrete ViT patch features into a continuous function that can be queried at any geographic coordinate inside the image.
If this is right
- A single pre-trained GAIR checkpoint can initialize models for both overhead image classification and street-view place recognition without separate fine-tuning pipelines.
- Representations remain usable at any spatial scale because the interpolation is continuous rather than tied to fixed patch grids.
- Temporal transfer improves because the geographic alignment is learned from metadata rather than from visual appearance alone.
- Multi-modal fusion becomes simpler since all modalities are projected into a shared embedding space indexed by location.
Where Pith is reading between the lines
- The same interpolation trick could be applied to other grid-based sensors such as LiDAR or weather station arrays to create continuous fields from discrete observations.
- If the alignment holds under distribution shift, GAIR-style pre-training might reduce the need for expensive labeled geospatial datasets in low-resource regions.
- One could test whether adding a temporal dimension to the implicit module would allow the model to interpolate across time as well as space.
Load-bearing premise
The implicit interpolation module, trained only with contrastive losses on unlabeled image pairs, will keep producing representations that stay aligned to real geographic positions instead of fitting only the training image layouts.
What would settle it
A controlled test in which query locations are shifted slightly from the training image centers and performance on a held-out geospatial task drops below the level of a standard ViT baseline that does not use the interpolation module.
Figures
read the original abstract
Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We proposed to solve this representation problem with an implicit neural representation (INR) module extending ViT with Neural Implicit Local Interpolation, which produces a continuous RS image representation covering arbitrary location in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project different modalities into the embedding space, and the INR module is used to further align these representations geographically, which are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GAIR, a self-supervised contrastive pre-training method that augments a ViT backbone with a Neural Implicit Local Interpolation (INR) module to produce continuous, geo-aligned representations at arbitrary locations within remote-sensing images. Three factorized encoders process RS imagery, street-view imagery, and geolocation metadata; the INR module is used to enforce geographic alignment, and the entire system is trained with contrastive losses on unlabeled multi-modal pairs. The central claim is that GAIR outperforms existing GeoFMs and non-geo-aligned SSL baselines (MoCo V3, MAE) across 9 geospatial tasks and 22 datasets spanning RS, SV, and location-embedding benchmarks. Code is released at https://github.com/zpl99/GAIR.
Significance. If the reported gains are reproducible and the INR module truly generalizes beyond training geometries, the work would constitute a meaningful step toward fine-grained, location-aware geospatial foundation models. The explicit release of training code is a clear strength that supports reproducibility. The significance remains conditional on verification that the INR does not overfit to the spatial sampling patterns of the pre-training pairs.
major comments (2)
- [§3.2] §3.2 (Neural Implicit Local Interpolation module): the claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.
- [§5] §5 (Experimental results): while aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.
minor comments (2)
- [Abstract] Abstract: the sentence 'We proposed to solve this representation problem' should be revised to present tense for consistency with the rest of the manuscript.
- [§3.3] The manuscript should clarify the precise form of the contrastive loss (e.g., temperature, number of negatives) and whether any geo-specific regularization terms are added beyond standard InfoNCE.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the INR module and experimental reporting. We address each major comment below and will make the necessary revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Neural Implicit Local Interpolation module): the claim that the INR produces geographically aligned representations at arbitrary query locations rests on an untested assumption. No experiment or analysis is described that evaluates performance under coordinate shifts or on query points lying outside the spatial sampling distribution of the training RS/SV pairs; without such evidence the claimed advantage over standard ViT-based GeoFMs cannot be substantiated.
Authors: We agree that explicit validation of generalization under coordinate shifts and out-of-distribution query points is missing. The contrastive objective aligns features at geolocated points, but this does not directly test arbitrary or shifted locations. In the revision we will add targeted experiments evaluating INR performance on perturbed coordinates and held-out spatial regions, reporting downstream task metrics to substantiate the geographic alignment claim. revision: yes
-
Referee: [§5] §5 (Experimental results): while aggregate outperformance is stated across 9 tasks and 22 datasets, the manuscript provides no per-task statistical significance tests, confidence intervals, or ablation isolating the contribution of the INR module versus the contrastive objective alone. This omission makes it impossible to determine whether the reported gains are load-bearing or attributable to other factors.
Authors: We acknowledge the absence of per-task statistical tests, confidence intervals, and a dedicated INR ablation. The current comparisons to MoCo V3 and MAE provide indirect evidence, but do not isolate the INR. In the revised manuscript we will add per-task significance tests (e.g., paired t-tests), 95% confidence intervals where multiple runs exist, and an explicit ablation removing only the INR while retaining the contrastive framework. revision: yes
Circularity Check
No circularity: derivation relies on independent empirical evaluation
full rationale
The paper introduces GAIR by extending ViT with a Neural Implicit Local Interpolation INR module, factorized encoders for RS/SV modalities, and contrastive objectives on unlabeled geo-aligned pairs. No equations, self-citations, or uniqueness theorems are invoked that reduce the claimed representations or performance gains to fitted inputs or prior self-referential results by construction. The central claims rest on reported outperformance across 9 tasks and 22 datasets, which constitutes external empirical content rather than a definitional loop. This is the common case of a self-contained architectural proposal.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We utilize three factorized neural encoders... novel implicit neural representations (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image’s geolocation... trained with contrastive learning objectives from unlabeled data.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The INR module refines f(ri) into a localized embedding z(q)i through feature unfolding and local ensemble interpolation.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial ...
-
UNIGEOCLIP: Unified Geospatial Contrastive Learning
UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...
Reference graph
Works this paper leans on
-
[1]
Pretrain a remote sensing foundation model by pro- moting intra-instance similarity
Xiao An, Wei He, Jiaqi Zou, Guangyi Yang, and Hongyan Zhang. Pretrain a remote sensing foundation model by pro- moting intra-instance similarity. IEEE Transactions on Geo- science and Remote Sensing, 2024. 5, 7, 13
work page 2024
-
[2]
Omnisat: Self-supervised modality fusion for earth observation
Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2024. 2, 16
work page 2024
-
[3]
Omnisat: Self-supervised modality fusion for earth observation
Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2025. 2
work page 2025
-
[4]
Geography-aware self-supervised learning
Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tan- may, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021. 2
work page 2021
-
[5]
George Azzari, Meha Jain, and David B Lobell. Towards fine resolution global maps of crop yields: Testing multiple methods and satellites in three countries. Remote Sensing of Environment, 202:129–141, 2017. 1
work page 2017
-
[6]
Satlaspretrain: A large-scale dataset for remote sensing image understanding
Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. Satlaspretrain: A large-scale dataset for remote sensing image understanding. In Proceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 16772–16782, 2023. 7
work page 2023
-
[7]
Bird- snap: Large-scale fine-grained visual categorization of birds
Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2011–2018, 2014. 7, 15
work page 2011
-
[8]
Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting
Ling Cai, Krzysztof Janowicz, Gengchen Mai, Bo Yan, and Rui Zhu. Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting. Transactions in GIS, 24(3):736–755, 2020. 1
work page 2020
-
[9]
Jiezhang Cao, Qin Wang, Yongqin Xian, Yawei Li, Bingbing Ni, Zhiming Pi, Kai Zhang, Yulun Zhang, Radu Timofte, and Luc Van Gool. Ciaosr: Continuous implicit attention-in- attention network for arbitrary-scale image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1796–1807, 2023. 3
work page 2023
-
[10]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2
work page 2021
-
[11]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020
work page 2020
-
[12]
An empirical study of training self-supervised vision transformers
Xinlei Chen*, Saining Xie*, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021. 2, 5, 7
-
[13]
Learning contin- uous image representation with local implicit image function
Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning contin- uous image representation with local implicit image function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021. 2, 3, 13
work page 2021
-
[14]
Spatial implicit neural representations for global-scale species mapping
Elijah Cole, Grant Van Horn, Christian Lange, Alexander Shepard, Patrick Leary, Pietro Perona, Scott Loarie, and Oisin Mac Aodha. Spatial implicit neural representations for global-scale species mapping. In International Confer- ence on Machine Learning, pages 6320–6342. PMLR, 2023. 1, 3
work page 2023
-
[15]
Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery
Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for tempo- ral and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022. 1, 2, 5, 7, 13, 16
work page 2022
-
[16]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 2, 5, 7
work page 2009
-
[17]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[18]
Coin: Compression with implicit neural representations,
Emilien Dupont, Adam Goli´nski, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. Coin: Compression with implicit neural representations. arXiv preprint arXiv:2103.03123 ,
-
[19]
Urban visual intelligence: Uncovering hidden city profiles with street view images
Zhuangyuan Fan, Fan Zhang, Becky PY Loo, and Carlo Ratti. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences, 120(27):e2220417120, 2023. 6
work page 2023
-
[20]
Croma: Remote sensing representations with contrastive radar-optical masked autoencoders
Anthony Fuller, Koreen Millard, and James Green. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders. Advances in Neural Information Pro- cessing Systems, 36, 2024. 2, 5, 7, 13, 16
work page 2024
-
[21]
Implicit diffusion models for continuous super-resolution
Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yan- jing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 10021– 10030, 2023. 2, 3
work page 2023
-
[22]
Lightweight temporal self-attention for classifying satellite images time series
Vivien Sainte Fare Garnot and Loic Landrieu. Lightweight temporal self-attention for classifying satellite images time series. In Advanced Analytics and Learning on Temporal Data: 5th ECML PKDD Workshop, AALTD 2020, Ghent, Belgium, September 18, 2020, Revised Selected Papers 6 , pages 171–181. Springer, 2020. 5, 7
work page 2020
-
[23]
Bootstrap your own latent-a new approach to self-supervised learning
Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doer- sch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020. 2
work page 2020
-
[24]
Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingx- iang Hu, et al. Skysense: A multi-modal remote sensing 9 foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 27672–27683, ...
work page 2024
-
[25]
Deep learning for multi-year enso forecasts
Yoo-Geun Ham, Jeong-Hwan Kim, and Jing-Jia Luo. Deep learning for multi-year enso forecasts. Nature, 573(7775): 568–572, 2019. 1
work page 2019
-
[26]
Momentum contrast for unsupervised visual rep- resentation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9729–9738, 2020. 2, 4
work page 2020
-
[27]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 16000– 16009, 2022. 2, 5, 6, 7, 13
work page 2022
-
[28]
Spectralgpt: Spectral remote sensing foundation model
Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiup- ing Jia, et al. Spectralgpt: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 7, 16
work page 2024
-
[29]
Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. Global streetscapes—a comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogramme- try and Remote Sensing, 215:216–238, 2024. 5, 13, 14
work page 2024
-
[30]
Chia-Yu Hsu, Wenwen Li, and Sizhe Wang. Geospatial foun- dation models for image analysis: Evaluating and enhancing nasa-ibm prithvi’s domain adaptability.International Journal of Geographical Information Science, pages 1–30, 2024. 2
work page 2024
-
[31]
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural information processing systems, 28, 2015. 3
work page 2015
-
[32]
Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting
Pallavi Jain, Dino Ienco, Roberto Interdonato, Tristan Berchoux, and Diego Marcos. Senclip: Enhancing zero-shot land-use mapping for sentinel-2 with ground-level prompting. arXiv preprint arXiv:2412.08536, 2024. 16
-
[33]
Multimodal contrastive learning for remote sensing tasks
Umangi Jain, Alex Wilson, and Varun Gulshan. Multimodal contrastive learning for remote sensing tasks. arXiv preprint arXiv:2209.02329, 2022. 2
-
[34]
Johannes Jakubik, Sujit Roy, CE Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023. 6, 7, 14
-
[35]
Combining satellite imagery and machine learning to predict poverty
Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016. 1
work page 2016
-
[36]
Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. Satclip: Global, general- purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179, 2023. 1, 2, 3
-
[37]
Hao Li, Benjamin Herfort, Sven Lautenbach, Jiaoyan Chen, and Alexander Zipf. Improving openstreetmap missing build- ing detection using few-shot transfer learning in sub-saharan africa. Transactions in GIS, 26(8):3125–3146, 2022. 1
work page 2022
-
[38]
Hao Li, Jiapan Wang, Johann Maximilian Zollner, Gengchen Mai, Ni Lao, and Martin Werner. Rethink geographical gener- alizability with unsupervised self-attention model ensemble: A case study of openstreetmap missing building detection in africa. In Proceedings of the 31st ACM International Confer- ence on Advances in Geographic Information Systems, pages ...
work page 2023
-
[39]
Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024. 7
work page 2024
-
[40]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2
work page 2021
-
[41]
Rose M Rustowicz, Robin Cheong, Lijing Wang, Stefano Ermon, Marshall Burke, and David Lobell. Semantic segmen- tation of crop type in africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/cvf confer- ence on computer vision and pattern recognition workshops, pages 75–82, 2019. 7, 14
work page 2019
-
[42]
Presence- only geographical priors for fine-grained image classification
Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence- only geographical priors for fine-grained image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9596–9606, 2019. 3, 5, 15
work page 2019
-
[43]
Gengchen Mai, Krzysztof Janowicz, Ling Cai, Rui Zhu, Blake Regalia, Bo Yan, Meilin Shi, and Ni Lao. Se-kge: A location- aware knowledge graph embedding model for geographic question answering and spatial semantic lifting. Transactions in GIS, 24(3):623–655, 2020. 1, 3
work page 2020
-
[44]
Multi-scale representation learning for spatial feature distributions using grid cells
Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. In International Conference on Learning Representations, 2020. 3, 5
work page 2020
-
[45]
Multi-scale representation learning for spatial feature distributions using grid cells
Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale representation learning for spatial feature distributions using grid cells. arXiv preprint arXiv:2003.00824, 2020. 2
-
[46]
Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations
Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, and Ste- fano Ermon. Csp: Self-supervised contrastive spatial pre- training for geospatial-visual representations. In Interna- tional Conference on Machine Learning, pages 23498–23515. PMLR, 2023. 1, 2, 3
work page 2023
-
[47]
Gengchen Mai, Yao Xuan, Wenyun Zuo, Yutong He, Ji- aming Song, Stefano Ermon, Krzysztof Janowicz, and Ni Lao. Sphere2vec: A general-purpose location representation learning over a spherical surface for large-scale geospatial predictions. ISPRS Journal of Photogrammetry and Remote Sensing, 202:439–462, 2023. 1, 3
work page 2023
-
[48]
On the opportunities and challenges of foundation models for geoai (vision paper)
Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. On the opportunities and challenges of foundation models for geoai (vision paper). ACM Transac- tions on Spatial Algorithms and Systems, 2024. 2 10
work page 2024
-
[49]
Towards the next generation of geospatial artificial intel- ligence
Gengchen Mai, Yiqun Xie, Xiaowei Jia, Ni Lao, Jinmeng Rao, Qing Zhu, Zeping Liu, Yao-Yi Chiang, and Junfeng Jiao. Towards the next generation of geospatial artificial intel- ligence. International Journal of Applied Earth Observation and Geoinformation, 136:104368, 2025. 1
work page 2025
-
[50]
Seasonal contrast: Unsuper- vised pre-training from uncurated remote sensing data
Oscar Manas, Alexandre Lacoste, Xavier Gir´o-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsuper- vised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021. 2
work page 2021
-
[51]
Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024
Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2024. 5, 16
work page 2024
-
[52]
Gfm: Building geospatial founda- tion models via continual pretraining
Mat´ıas Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen, and Mu Li. Gfm: Building geospatial founda- tion models via continual pretraining. arXiv preprint arXiv:2302.04476, 3, 2023. 7
-
[53]
Occupancy networks: Learning 3d reconstruction in function space
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se- bastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019. 3
work page 2019
-
[54]
Climax: A foundation model for weather and climate.arXiv preprint arXiv:2301.10343, 2023
Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foun- dation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023. 2
-
[55]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Ai4smallfarms: A dataset for crop field delineation in south- east asian smallholder farms
Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and Andrew Nelson. Ai4smallfarms: A dataset for crop field delineation in south- east asian smallholder farms. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023. 6, 14
work page 2023
-
[57]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 3, 4
work page 2021
-
[58]
Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning
Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Can- dido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088– 4099, 2023. 7
work page 2023
-
[59]
A generalizable and accessible approach to machine learning with global satellite imagery
Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bol- liger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, and Solomon Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. Nature communications, 12(1):4392, 2021. 7, 14
work page 2021
-
[60]
Implicit neural representa- tions with periodic activation functions
Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representa- tions with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020. 2
work page 2020
-
[61]
Ssl4eo-l: Datasets and foundation models for landsat imagery
Adam Stewart, Nils Lehmann, Isaac Corley, Yi Wang, Yi- Chia Chang, Nassim Ait Ait Ali Braham, Shradha Sehgal, Caleb Robinson, and Arindam Banerjee. Ssl4eo-l: Datasets and foundation models for landsat imagery. Advances in Neural Information Processing Systems , 36:59787–59807,
-
[62]
Vladan Stojnic and Vladimir Risojevic. Self-supervised learn- ing of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1182–1191, 2021. 2
work page 2021
-
[63]
ebird: A citizen- based bird observation network in the biological sciences
Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. ebird: A citizen- based bird observation network in the biological sciences. Biological conservation, 142(10):2282–2292, 2009. 15
work page 2009
-
[64]
Ringmo: A remote sensing foundation model with masked image modeling
Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61:1–22, 2022. 2
work page 2022
-
[65]
Rethinking the inception ar- chitecture for computer vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception ar- chitecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 5
work page 2016
-
[66]
Fourier features let networks learn high frequency functions in low dimen- sional domains
Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ra- mamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimen- sional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020. 3
work page 2020
-
[67]
Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Be- longie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on com- puter vision and pattern recognition, pages 595–604, ...
work page 2015
-
[68]
The inaturalist species classification and detection dataset
Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778,
-
[69]
Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Ad- vances in Neural Information Processing Systems, 36, 2024. 3, 4, 6
work page 2024
-
[70]
Image as a foreign lan- guage: Beit pretraining for vision and vision-language tasks
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign lan- guage: Beit pretraining for vision and vision-language tasks. 11 In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023. 2
work page 2023
-
[71]
Xinye Wanyan, Sachith Seneviratne, Shuchang Shen, and Michael Kirley. Dino-mc: Self-supervised contrastive learn- ing for remote sensing imagery with multi-sized local crops. arXiv preprint arXiv:2303.06670, 2023. 2
-
[72]
Mapping human perception of urban landscape from street- view images: A deep-learning approach
Jingxian Wei, Wenze Yue, Mengmeng Li, and Jiabin Gao. Mapping human perception of urban landscape from street- view images: A deep-learning approach. International Jour- nal of Applied Earth Observation and Geoinformation, 112: 102886, 2022. 6
work page 2022
-
[73]
Visual transformers: Token-based image representation and processing for com- puter vision, 2020
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gon- zalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for com- puter vision, 2020. 5, 7
work page 2020
-
[74]
Torchspatial: A location encoding framework and benchmark for spatial representation learning
Nemin Wu, Qian Cao, Zhangyu Wang, Zeping Liu, Yanlin Qi, Jielu Zhang, Joshua Ni, Xiaobai Yao, Hongxu Ma, Lan Mu, et al. Torchspatial: A location encoding framework and benchmark for spatial representation learning. arXiv preprint arXiv:2406.15658, 2024. 1, 2, 3, 5, 8, 16
-
[75]
Unified perceptual parsing for scene understanding
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 5
work page 2018
-
[76]
Neural plasticity-inspired foundation model for observing the earth crossing modalities
Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Jo¨elle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the earth crossing modalities. arXiv e-prints, pages arXiv–2403, 2024. 2, 7, 16
work page 2024
-
[77]
Yahoo flickr creative commons 100m
Yahoo. Yahoo flickr creative commons 100m. http:// webscope.sandbox.yahoo.com/catalog.php? datatype=i&did=67. Accessed: 2024-06-03. 7, 15
work page 2024
-
[78]
Sustainbench: Bench- marks for monitoring the sustainable development goals with machine learning
Christopher Yeh, Chenlin Meng, Sherrie Wang, Anne Driscoll, Erik Rozi, Patrick Liu, Jihyeon Lee, Marshall Burke, David B Lobell, and Stefano Ermon. Sustainbench: Bench- marks for monitoring the sustainable development goals with machine learning. arXiv preprint arXiv:2111.04724, 2021. 14
-
[79]
Deep gaussian process for crop yield pre- diction based on remote sensing data
Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian process for crop yield pre- diction based on remote sensing data. In Proceedings of the AAAI conference on artificial intelligence, 2017. 1
work page 2017
-
[80]
Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions
Dazhou Yu, Riyang Bao, Gengchen Mai, and Liang Zhao. Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions. arXiv preprint arXiv:2502.18470, 2025. 1
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.