pith. machine review for the scientific record.

arxiv: 2604.11668 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

UNIGEOCLIP: Unified Geospatial Contrastive Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal learning · contrastive learning · geospatial · embedding space · latitude longitude · aerial imagery · street view

The pith

UNIGEOCLIP creates a unified embedding space by aligning five geospatial modalities through all-to-all contrastive learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a contrastive learning method that places aerial imagery, street-level views, elevation models, text descriptions, and geographic coordinates into the same representation space. It achieves this by contrasting every modality against every other one, rather than merging them or centering on a single reference modality. A scaled version of the latitude-longitude encoding helps the model represent locations across different geographic scales. This setup makes it possible to retrieve or reason between any combination of these data types without extra steps. Tests on several geospatial tasks show consistent improvements over models that use only one data type or simple coordinates.

Core claim

UNIGEOCLIP is a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, the method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. It further proposes a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments demonstrate that the approach consistently outperforms single-modality contrastive models and coordinate-only baselines.
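The all-to-all objective described here, averaged InfoNCE over every ordered pair of modalities, can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the batch size, temperature, and modality names are stand-ins.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """InfoNCE from modality a to modality b: row i of each batch is a
    co-located positive pair; other rows in the batch act as negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives on the diagonal

def all_to_all_loss(embeddings, temperature=0.07):
    """Average InfoNCE over all ordered modality pairs, so every modality
    is contrasted against every other one without a pivot."""
    pair_losses = [info_nce(embeddings[m], embeddings[n], temperature)
                   for m in embeddings for n in embeddings if m != n]
    return float(np.mean(pair_losses))
```

With five modalities this averages over 20 ordered pairs; perfectly aligned embeddings drive the loss toward zero, while unaligned ones sit near log of the batch size.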

What carries the argument

The all-to-all contrastive alignment mechanism across five modalities in a shared embedding space, together with the scaled latitude-longitude encoder for multi-scale spatial features.

If this is right

  • Any pair of modalities can be directly compared or retrieved in the shared space without additional training.
  • The framework supports reasoning tasks that combine information from arbitrary subsets of the five modalities.
  • Downstream geospatial applications benefit from richer representations that integrate complementary information from images, views, elevation, text, and coordinates.
  • The scaled encoder allows better handling of geographic structures at varying resolutions compared to standard coordinate inputs.
  • Overall performance gains validate the value of holistic multimodal alignment over isolated modality training.
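The text does not spell out the scaled latitude-longitude encoder, so the following is only a plausible sketch in the spirit of multi-scale sinusoidal (Fourier-feature) coordinate encodings; the scale schedule and feature count are invented for illustration.

```python
import numpy as np

def scaled_latlon_encoding(lat, lon, n_scales=8, base_wavelength_deg=360.0):
    """Hypothetical multi-scale encoding of a (lat, lon) pair in degrees.

    Each scale halves the wavelength, so early features vary over
    continents while later ones vary over city blocks; concatenating them
    exposes geographic structure at several resolutions at once."""
    feats = []
    for k in range(n_scales):
        wavelength = base_wavelength_deg / (2.0 ** k)  # 360, 180, 90, ... deg
        for coord in (lat, lon):
            angle = 2.0 * np.pi * coord / wavelength
            feats.append(np.sin(angle))
            feats.append(np.cos(angle))
    return np.asarray(feats)                           # shape (4 * n_scales,)
```

Nearby points then agree on the coarse-scale features while still being separable at the fine scales, which is the multi-scale behavior the claim attributes to the encoder.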

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the unified space generalizes well, it could enable new applications in geospatial AI that mix data sources on the fly, such as generating elevation profiles from text queries paired with imagery.
  • Similar all-to-all strategies might apply to other fields with co-located multimodal datasets, like combining medical scans, reports, and lab results.
  • Testing the model on modalities not seen during training or on sparsely co-located data would reveal the robustness of the alignment approach.
  • The method suggests that avoiding a central pivot modality preserves more information across all inputs.

Load-bearing premise

Sufficient quantities of accurately co-located data exist across all five modalities to support effective all-to-all contrastive training without relying on a dominant pivot modality.

What would settle it

An experiment showing that retrieval performance between text and elevation data in the unified model equals or falls below that of a model trained only on those two modalities would falsify the benefit of the full all-to-all alignment.
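That settling experiment reduces to comparing cross-modal retrieval between the two models. A minimal recall@k scorer for such a comparison might look like this; cosine retrieval over index-paired embeddings is an assumption, and the paper's actual evaluation protocol may differ.

```python
import numpy as np

def recall_at_k(queries, gallery, k=5):
    """Fraction of queries whose co-located item (same row index in the
    gallery) appears among the top-k most cosine-similar gallery rows."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                   # (N_q, N_g) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of best matches
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())
```

Scoring text-to-elevation retrieval once with the unified model's embeddings and once with a two-modality model's would make the comparison concrete.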

Figures

Figures reproduced from arXiv: 2604.11668 by Eduard Trulls, Guillaume Astruc, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin.

Figure 1
Figure 1: Unified contrastive learning of geospatial data. We jointly train encoders for five modalities (text, aerial imagery, street-level imagery, elevation, and geographic coordinates), which are simultaneously contrasted across all modality pairs. This yields a single unified embedding space that represents heterogeneous geospatial information.
Figure 2
Figure 2: Sample from our multimodal geospatial dataset. Each location is represented through five complementary modalities: aerial imagery, street-level imagery, a Digital Surface Model (DSM), geographic coordinates, and an automatically generated text description. All modalities are spatially co-registered and jointly contrasted during training.
Figure 4
Figure 4: Geographic Coverage. Spatial distribution of sampled locations across the continental United States. Green regions indicate areas containing samples after spatial filtering and farthest-point sampling.
Figure 5
Figure 5: PCA of Coordinate Embeddings. Embeddings computed over a dense grid in Manhattan, NYC are projected using PCA, with the top three principal components mapped to RGB. UniGeoCLIP produces spatial patterns that reflect underlying urban structure (e.g., Central Park and surrounding neighborhoods), indicating semantically informed representations. In contrast, SatCLIP and GeoCLIP exhibit smoother, predominantly…
Figure 6
Figure 6: Location Embedding Visualization. t-SNE projection of embeddings for 48 distinct locations. Each cluster corresponds to a single geographic location and contains embeddings from all modalities.
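The PCA-to-RGB rendering shown in Figure 5 is straightforward to reproduce. This sketch assumes plain PCA via SVD and per-channel min-max scaling, which may differ in detail from the figure's exact normalization.

```python
import numpy as np

def pca_rgb(embeddings):
    """Map each embedding to an RGB triple: project onto the top three
    principal components, then min-max scale each channel to [0, 1]."""
    x = embeddings - embeddings.mean(axis=0)       # center before PCA
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    scores = x @ vt[:3].T                          # (N, 3) PC scores
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    return (scores - lo) / (hi - lo + 1e-12)       # avoid divide-by-zero
```

Applied to coordinate embeddings over a dense spatial grid, the resulting triples can be plotted directly as pixel colors.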
read the original abstract

The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces UNIGEOCLIP, a multimodal contrastive learning framework that jointly aligns five geospatial modalities (aerial imagery, street-level views, elevation models, text, and geographic coordinates) in a single embedding space via all-to-all contrastive alignment rather than fusion or a pivot modality. It additionally proposes a scaled latitude-longitude encoder to capture multi-scale geographic structure and reports that the resulting model outperforms single-modality contrastive baselines and coordinate-only models on downstream geospatial tasks, with a reference implementation released.

Significance. If the empirical claims hold, the all-to-all formulation could enable more flexible cross-modal retrieval and reasoning in geospatial applications without requiring a central pivot. The open reference implementation is a clear strength for reproducibility.

major comments (3)
  1. [Abstract] The central claim that UNIGEOCLIP 'consistently outperforms single-modality contrastive models and coordinate-only baselines' is presented without dataset sizes, concrete metrics, statistical tests, or ablation details, leaving the strength of the superiority assertion impossible to evaluate from the evidence provided.
  2. [Method (implied by abstract description of the framework)] The all-to-all contrastive alignment across five modalities presupposes sufficiently dense co-located 5-tuples to supply positive pairs for every combination; the manuscript does not specify how missing modalities (common in geospatial corpora) are handled during batch construction or loss computation, which directly affects whether the 'seamless comparison across arbitrary combinations' follows from the training objective.
  3. [Method (scaled latitude-longitude encoder description)] The scaled latitude-longitude encoder is asserted to improve spatial representation by capturing multi-scale structure, yet no ablation against standard positional encodings or coordinate baselines is referenced, making it impossible to isolate whether the claimed benefit is genuine or incremental.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' but supplies no summary statistics or result highlights; adding a compact results table or key metric improvements would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments. Below we provide point-by-point responses and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that UNIGEOCLIP 'consistently outperforms single-modality contrastive models and coordinate-only baselines' is presented without dataset sizes, concrete metrics, statistical tests, or ablation details, leaving the strength of the superiority assertion impossible to evaluate from the evidence provided.

    Authors: The abstract is designed to be brief and high-level. Detailed information on datasets, metrics, statistical significance, and ablations is provided in the Experiments section of the full manuscript. To address this, we will update the abstract to include key quantitative results and a mention of the evaluation setup. revision: yes

  2. Referee: [Method (implied by abstract description of the framework)] The all-to-all contrastive alignment across five modalities presupposes sufficiently dense co-located 5-tuples to supply positive pairs for every combination; the manuscript does not specify how missing modalities (common in geospatial corpora) are handled during batch construction or loss computation, which directly affects whether the 'seamless comparison across arbitrary combinations' follows from the training objective.

    Authors: Our framework assumes access to co-located multimodal tuples during training, consistent with the datasets described. In practice, the loss is computed over available modality pairs when some are missing. We will expand the Methods section to detail the batch sampling strategy and loss computation for incomplete tuples. revision: yes
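The strategy the response describes, computing the loss only over modality pairs that are jointly present, could be implemented as a per-pair availability mask. This is a sketch of that idea under assumed data structures, not the authors' actual batching code.

```python
import numpy as np

def masked_pair_loss(embeddings, present, pair_loss):
    """Average `pair_loss` over ordered modality pairs, restricted per pair
    to the batch samples where both modalities exist.

    embeddings: dict modality -> (N, D) array (absent rows are placeholders)
    present:    dict modality -> (N,) boolean availability mask
    pair_loss:  callable taking two index-aligned (M, D) arrays"""
    losses = []
    for m in embeddings:
        for n in embeddings:
            if m == n:
                continue
            both = present[m] & present[n]
            if both.sum() >= 2:                  # need in-batch negatives
                losses.append(pair_loss(embeddings[m][both],
                                        embeddings[n][both]))
    return float(np.mean(losses)) if losses else 0.0
```

Pairs with fewer than two jointly present samples are simply skipped, so incomplete tuples contribute to whichever pairwise terms they can support.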

  3. Referee: [Method (scaled latitude-longitude encoder description)] The scaled latitude-longitude encoder is asserted to improve spatial representation by capturing multi-scale structure, yet no ablation against standard positional encodings or coordinate baselines is referenced, making it impossible to isolate whether the claimed benefit is genuine or incremental.

    Authors: The paper includes comparisons to coordinate-only baselines, which demonstrate the benefit of the full model including the scaled encoder. To more precisely isolate the contribution of the scaling mechanism versus standard encodings, we will add a dedicated ablation study in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical contrastive training with independent architectural choice

full rationale

The paper defines UNIGEOCLIP as a contrastive framework that performs all-to-all alignment across five modalities via standard InfoNCE-style losses on co-located tuples. This is a modeling choice, not a derivation that reduces to its own inputs. The scaled lat-long encoder is presented as an architectural proposal to capture multi-scale structure; its benefit is evaluated empirically rather than assumed by definition. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claims rest on downstream task performance, which is external to the training objective itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on standard contrastive learning assumptions plus one new architectural component. No explicit free parameters are named in the abstract, but training necessarily involves loss hyperparameters and encoder scaling choices.

free parameters (2)
  • contrastive loss temperature
    Standard hyperparameter in contrastive objectives; must be chosen or tuned for the five-modality setting.
  • scaling parameters in lat-long encoder
    Introduced to capture multi-scale structure; values are either learned or set during model design.
axioms (1)
  • domain assumption Co-located multimodal geospatial data can be aligned effectively through pairwise contrastive objectives without requiring a central pivot modality.
    Invoked by the choice of all-to-all alignment across imagery, street views, elevation, text, and coordinates.
invented entities (1)
  • scaled latitude-longitude encoder no independent evidence
    purpose: To encode geographic coordinates at multiple scales for improved spatial representation in the unified embedding.
    New component proposed to address limitations of standard coordinate encoding.

pith-pipeline@v0.9.0 · 5457 in / 1520 out tokens · 52544 ms · 2026-05-10T14:55:46.739785+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1] Mohit Agarwal, Mimi Sun, Chaitanya Kamath, Arbaaz Muslim, Prithul Sarker, Joydeep Paul, Hector Yee, Marcin Sieniek, Kim Jablonski, Swapnil Vispute, et al. General Geospatial Inference with a Population Dynamics Foundation Model. arXiv:2411.07207, 2024.
  2. [2] Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis, Constantin Aronssohn, Nacim Bouia, Stephanie Fu, Romain Loiseau, Van Nguyen Nguyen, Charles Raude, Elliot Vincent, et al. OpenStreetView-5M: The Many Roads to Global Visual Geolocation. In CVPR, 2024.
  3. [3] Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. OmniSat: Self-Supervised Modality Fusion for Earth Observation. In ECCV, 2024.
  4. [4] Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities. In CVPR, 2025.
  5. [5] Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. AlphaEarth Foundations: An Embedding Field Model for Accurate and Efficient Global Mapping from Sparse Label Data. arXiv:2507.22291, 2025.
  6. [6] Aayush Dhakal, Srikumar Sastry, Subash Khanal, Adeel Ahmad, Eric Xing, and Nathan Jacobs. RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings. In CVPR, 2025.
  7. [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
  8. [8] Nicolas Dufour, Vicky Kalogeiton, David Picard, and Loic Landrieu. Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation. In CVPR, 2025.
  9. [9] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y. Zeevi. The Farthest Point Strategy for Progressive Image Sampling. IEEE Transactions on Image Processing, 1997.
  10. [10] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One Embedding Space To Bind Them All. In CVPR, 2023.
  11. [11] Lukas Haas, Silas Alberti, and Michal Skreta. PIGEON: Predicting Image Geolocations. In CVPR, 2024.
  12. [12] James Hays and Alexei A. Efros. IM2GPS: Estimating Geographic Information from a Single Image. In CVPR, 2008.
  13. [13] James Hays and Alexei A. Efros. Large-Scale Image Geolocalization. In Multimodal Location Estimation of Videos and Images.
  14. [14] Jingliang Hu, Rong Liu, Danfeng Hong, Andrés Camero, Jing Yao, Mathias Schneider, Franz Kurz, Karl Segl, and Xiao Xiang Zhu. MDAS: A New Multimodal Benchmark Dataset for Remote Sensing. Earth System Science Data, 15(1):113–131, 2023.
  15. [15] Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. In AAAI, 2025.
  16. [16] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. GEO-Bench: Toward Foundation Models for Earth Monitoring. In NeurIPS.
  17. [17] Dan Larkin-York, Google Inc., Koordinates Limited, Mike Playle, and Tiago Brito. S2 Geometry Library. https://github.com/google/s2geometry, 2015. [Online; accessed 13-May-2025].
  18. [18] Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, and Eduard Trulls. Scaling Image Geo-Localization to Continent Level. In NeurIPS.
  19. [19] Zeping Liu, Fan Zhang, Junfeng Jiao, Ni Lao, and Gengchen Mai. GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations. arXiv:2503.16683, 2025.
  20. [20] Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, and Lin Wang. UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All. In CVPR, 2024.
  21. [21] Qi Ma, Runyi Yang, Bin Ren, Nicu Sebe, Ender Konukoglu, Luc Van Gool, and Danda Pani Paudel. CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation. arXiv:2501.08982, 2025.
  22. [22] Kevin Mayer, Benjamin Rausch, Marie-Louise Arlt, Gunther Gust, Zhecheng Wang, Dirk Neumann, and Ram Rajagopal. 3D-PV-Locator: Large-Scale Detection of Rooftop-Mounted Photovoltaic Systems in 3D. Applied Energy, 310:118469, 2022.
  23. [23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018.
  24. [24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patr...
  25. [25] Jonathan Prexl and Michael Schmitt. SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining. In GCPR, 2024.
  26. [26] Caleb Robinson, Le Hou, Kolya Malkin, Rachel Soobitsky, Jacob Czawlytko, Bistra Dilkina, and Nebojsa Jojic. Large Scale High-Resolution Land Cover Mapping with Multi-Resolution Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12726–12735.
  27. [27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
  28. [28] Marc Rußwurm, Konstantin Klemmer, Esther Rolf, Robin Zbinden, and Devis Tuia. Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks. In ICLR, 2024.
  29. [29] Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, and Nathan Jacobs. TaxaBind: A Unified Embedding Space for Ecological Applications. In WACV, 2025.
  30. [30] Bojan Šavrič, Tom Patterson, and Bernhard Jenny. The Equal Earth Map Projection. International Journal of Geographical Information Science, 33(3):454–465, 2019.
  31. [31] David G. Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, and Mubarak Shah. GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space. In ICCV.
  32. [32] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv:2508.10104, 2025.
  33. [33] Mimi Sun, Chaitanya Kamath, Mohit Agarwal, Arbaaz Muslim, Hector Yee, David Schottlander, Shailesh Bavadekar, Niv Efron, Shravya Shetty, and Gautam Prasad. Community Search Signatures as Foundation Features for Human-Centered Geospatial Modeling. arXiv:2410.22721, 2024.
  34. [34] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS, 2020.
  35. [35] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv:2502.14786, 2025.
  36. [36] Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R. Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning Global & Local Features of Many Remote Sensing Modalities. In ICML, 2025.
  37. [37] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 9(11).
  38. [38] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. In NeurIPS, 2023.
  39. [39] Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam J. Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, and John Chuang. Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation. In CVPR Workshops, 2025.
  40. [40] Yan Xia, Letian Shi, Zifeng Ding, Joao F. Henriques, and Daniel Cremers. Text2Loc: 3D Point Cloud Localization from Natural Language. In CVPR, 2024.
  41. [41] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv:2403.15356, 2024.