pith. machine review for the scientific record.

arxiv: 2512.17492 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: no theorem link

MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal dataset · geo-spatial understanding · cross-view retrieval · geolocalization · aerial imagery · ground imagery · landmark matching · benchmark

The pith

The MMLandmarks dataset supplies instance-level matches across aerial images, ground views, text, and coordinates for 18,557 US landmarks to support multimodal geo-spatial model training and benchmarking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMLandmarks, a benchmark with 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct US landmarks. These entries maintain one-to-one correspondence at the landmark level, which supports training and testing on integrated tasks such as cross-view retrieval and geolocalization. Experiments show that neither specialized geo-spatial models nor off-the-shelf foundation models can be applied directly to solve the range of tasks without such aligned data. The authors include a CLIP-inspired baseline trained on the dataset that demonstrates improved versatility and generalization.
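
The baseline is described only as CLIP-inspired, so here is a minimal sketch of what a symmetric contrastive objective over the four aligned modalities could look like, assuming per-modality encoders projecting into a shared embedding space. The temperature, the pairwise summation, and all names are our assumptions, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style objective over aligned landmark modalities.
# Row i of every batch describes the same landmark, per the dataset's
# one-to-one correspondence; every other row serves as a negative.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multimodal_loss(emb: dict[str, torch.Tensor]) -> torch.Tensor:
    """Sum the contrastive loss over all modality pairs in the batch,
    e.g. emb = {"ground": ..., "aerial": ..., "text": ..., "gps": ...}."""
    keys = list(emb)
    return sum(info_nce(emb[keys[i]], emb[keys[j]])
               for i in range(len(keys)) for j in range(i + 1, len(keys)))
```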

Core claim

MMLandmarks establishes one-to-one landmark correspondences across four modalities, enabling models to address cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image retrieval, and Text-to-GPS retrieval. Current models cannot trivially solve these tasks, revealing a gap that multimodal aligned datasets can fill for broader geo-spatial understanding.
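
To make the retrieval framing concrete, here is a conventional recall@k sketch for Ground-to-Satellite retrieval, assuming precomputed embeddings whose row indices follow the dataset's one-to-one correspondence; the paper's exact splits and metrics may differ.

```python
# Recall@k for cross-view retrieval: query with ground-view embeddings, rank
# all satellite embeddings by cosine similarity, and check whether the true
# match (same row index) lands in the top k.
import numpy as np

def recall_at_k(query: np.ndarray, index: np.ndarray, ks=(1, 5, 10)) -> dict:
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    x = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = q @ x.T                                  # (N, N) cosine similarities
    true_sim = np.diag(sims)                        # similarity to the true match
    ranks = (sims > true_sim[:, None]).sum(axis=1)  # 0 = retrieved first
    return {k: float((ranks < k).mean()) for k in ks}
```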

What carries the argument

The instance-level one-to-one correspondence between the four modalities for each landmark carries the argument: it provides aligned training signals across views.
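
A hypothetical record layout makes this concrete; the field names below are illustrative, not the dataset's released schema.

```python
# Hypothetical shape of one aligned MMLandmarks entry: every modality is
# keyed to the same landmark instance. Field names are our invention.
from dataclasses import dataclass

@dataclass
class LandmarkRecord:
    landmark_id: str         # one physical landmark instance
    aerial_paths: list[str]  # high-resolution aerial crops of this landmark
    ground_paths: list[str]  # ground-view photos of the same landmark
    description: str         # textual information about the instance
    lat: float               # geographic coordinates
    lon: float
```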

If this is right

  • Cross-view Ground-to-Satellite retrieval can be trained directly using the matched image pairs.
  • Ground and satellite geolocalization benefit from joint training on aligned modalities.
  • Text-to-Image and Text-to-GPS retrieval tasks become benchmarkable with the provided correspondences (a Text-to-GPS scoring sketch follows this list).
  • A CLIP-style model trained on MMLandmarks achieves broad generalization across the geo-spatial tasks.
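
For the Text-to-GPS case, here is a sketch of how top-1 retrieval could be scored: embed the text query, retrieve the nearest GPS embedding from the index, and report the geodesic error. The encoders and the top-1 protocol are our assumptions; only the haversine formula is standard.

```python
# Sketch of Text-to-GPS scoring. Embeddings are assumed to come from
# trained text and GPS encoders (placeholders, not the paper's models).
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def text_to_gps_error_km(text_emb, gps_emb, gps_coords, true_coord):
    """Km error of the top-1 retrieved location for a single text query."""
    sims = gps_emb @ text_emb / (
        np.linalg.norm(gps_emb, axis=1) * np.linalg.norm(text_emb))
    lat, lon = gps_coords[int(np.argmax(sims))]
    return haversine_km(lat, lon, true_coord[0], true_coord[1])
```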

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Applications in mapping and navigation could integrate satellite and street-level data more effectively with such alignments.
  • The identified performance gap may encourage creation of similar multimodal datasets for other geographic regions or object types.
  • Future work could test whether scaling the dataset size further closes the gap for foundation models in this domain.
  • Instance-level alignment might prove more critical than model architecture choices for advancing geo-spatial multimodal learning.

Load-bearing premise

The collected images, text, and coordinates correspond to the same physical landmarks at the instance level with sufficient quality and diversity for meaningful cross-modal training.
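
One way to probe this premise is a random manual audit of the correspondences. The sketch below turns hypothetical audit counts into a mismatch-rate estimate with a Wilson 95% interval; the numbers are illustrative, not reported by the paper.

```python
# Turn a manual audit (n sampled landmarks, `errors` found mismatched across
# modalities) into a mismatch-rate estimate with a Wilson score interval.
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative audit: 12 mismatches in 500 sampled landmarks.
lo, hi = wilson_interval(errors=12, n=500)
print(f"mismatch rate in [{lo:.2%}, {hi:.2%}] at ~95% confidence")
```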

What would settle it

The central claim would be falsified by a verification process that finds many mismatched landmarks across modalities, or by evidence that models trained on the dataset perform no better on the defined tasks than models trained on each modality separately.

Figures

Figures reproduced from arXiv: 2512.17492 by Alba Reinders Sánchez, Anders Bjorholm Dahl, Dim P. Papadopoulos, Morten Rieger Hannemose, Oskar Kristoffersen.

Figure 1
Figure 1. MMLANDMARKS. We present four distinct data modalities: ground-view images, aerial imagery, GPS coordinates, and textual descriptions, collected from 18,557 unique landmarks in the United States. Data sources are included alongside each modality. view at source ↗
Figure 2
Figure 2. Pipeline for collecting the landmarks with the required criteria. Tags from OpenStreetMaps are used to collect Wiki-identifiers… view at source ↗
Figure 3
Figure 3. Text-to-GPS (top 1000), Text-to-Ground and Text-to-Satellite retrieval from the index set with the baseline model. The model accurately locates regions and images that are semantically relevant to the prompt, illustrating strong feature alignment across modalities. view at source ↗
Figure 4
Figure 4. Histogram distribution of the number of images per landmark. view at source ↗
Figure 5
Figure 5. Visual and geographical illustrations of the landmark distribution across MMLandmarks. view at source ↗
Figure 6
Figure 6. Visualization of the center GPS (green) and bounding boxes (purple) for the polygons associated with different landmarks. view at source ↗
Figure 8
Figure 8. 15 most popular categories from the “government building”… view at source ↗
Figure 9
Figure 9. VLM Filtering: Examples of wrong categorizations during the VLM indoor/outdoor filtering stage. The model may incorrectly… view at source ↗
Figure 10
Figure 10. Additional examples of landmarks from MMLandmarks. view at source ↗
Figure 11
Figure 11. Visual diversity in the aerial imagery from landmarks in MMLandmarks. view at source ↗
Figure 12
Figure 12. Additional examples from the MMLANDMARKS dataset. We illustrate the diversity in the dataset by randomly sampling landmarks. We show sample ground and satellite views, as well as the exact GPS location and parts of the textual descriptions. view at source ↗
Figure 13
Figure 13. Additional examples from the MMLANDMARKS dataset. We illustrate the diversity in the dataset by randomly sampling landmarks. We show sample ground and satellite views, as well as the exact GPS location and parts of the textual descriptions. view at source ↗
Figure 14
Figure 14. Additional examples from the MMLANDMARKS dataset. We illustrate the diversity in the dataset by randomly sampling landmarks. We show sample ground and satellite views, as well as the exact GPS location and parts of the textual descriptions. view at source ↗
Figure 15
Figure 15. Additional examples from the MMLANDMARKS dataset. We illustrate the diversity in the dataset by randomly sampling landmarks. We show sample ground and satellite views, as well as the exact GPS location and parts of the textual descriptions. view at source ↗
read the original abstract

Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, geographic coordinates, etc.). Current benchmarks have limited coverage across modalities, leading to specialized models that perform well in their respective domains, but do not fully take advantage of other geo-spatial modalities. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18.557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one landmark level correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We show that current specialized and off-the-shelf foundation models cannot be trivially used to solve this variety of geo-spatial tasks, illustrating a gap where multimodal datasets lead to broader geo-spatial understanding. We employ a simple CLIP-inspired baseline that reflects versatility and broad generalization when trained with MMLandmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMLandmarks, a multimodal benchmark with 197k aerial images, 329k ground-view images, text, and GPS coordinates for 18,557 US landmarks, claiming one-to-one instance-level correspondences across modalities. It evaluates cross-view Ground-to-Satellite retrieval, geolocalization, Text-to-Image, and Text-to-GPS tasks, showing that specialized and off-the-shelf foundation models cannot trivially solve them, while a simple CLIP-inspired baseline exhibits better versatility when trained on the dataset.

Significance. If the claimed correspondences hold with high accuracy and diversity, the dataset would fill an important gap by providing a unified multimodal resource for geo-spatial tasks, enabling training of more general models beyond domain-specialized ones and supporting reproducible benchmarking in computer vision and multimodal learning.

major comments (2)
  1. [Dataset construction] Dataset construction section: The pipeline for web-sourced image and text collection with one-to-one landmark matching is described only at a high level and reports no quantitative validation such as precision/recall for correspondences, inter-annotator agreement, or error rates on ambiguous ground-aerial pairings. This directly undermines the central claim that the dataset supports reliable training and benchmarking, as even modest mismatch rates would induce spurious associations.
  2. [Experiments] Experiments section: The reported failure of existing models and success of the CLIP baseline are only weakly supported without details on train/test splits, handling of potential noisy alignments, or ablation studies quantifying the effect of correspondence quality on task performance.
minor comments (2)
  1. [Abstract] Abstract: '18.557' is a typographical error and should read '18,557'.
  2. [Abstract] Abstract: The scale and task list are stated clearly, but a one-sentence summary of collection methodology and verification steps would improve completeness without lengthening the paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the manuscript. We address each major point below and will revise the paper to incorporate additional details and validations as outlined.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: The pipeline for web-sourced image and text collection with one-to-one landmark matching is described only at a high level and reports no quantitative validation such as precision/recall for correspondences, inter-annotator agreement, or error rates on ambiguous ground-aerial pairings. This directly undermines the central claim that the dataset supports reliable training and benchmarking, as even modest mismatch rates would induce spurious associations.

    Authors: We agree that a high-level description alone is insufficient to fully substantiate the one-to-one correspondences. In the revised manuscript, we will expand the Dataset Construction section with quantitative validation results, including precision/recall from manual verification on a sampled subset of landmarks, inter-annotator agreement scores, and error analysis for ambiguous ground-aerial pairings. These additions will directly address concerns about potential mismatch rates and their impact on training reliability. revision: yes

  2. Referee: [Experiments] Experiments section: The reported failure of existing models and success of the CLIP baseline are only weakly supported without details on train/test splits, handling of potential noisy alignments, or ablation studies quantifying the effect of correspondence quality on task performance.

    Authors: We acknowledge the need for greater experimental rigor. The revised Experiments section will include explicit details on the train/test splits (with leakage prevention measures), a description of how noisy alignments were handled during training, and new ablation studies that quantify performance sensitivity to varying levels of correspondence quality. These changes will provide stronger empirical support for the baseline's versatility relative to existing models. revision: yes
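
As an editorial sketch of the ablation promised above: permute a controlled fraction of the ground-to-satellite pairings to simulate correspondence noise, retrain, and track retrieval accuracy. `train_baseline` and `evaluate_recall1` are placeholders for the authors' pipeline, not real functions.

```python
# Correspondence-quality ablation sketch: inject pairing noise at several
# rates and observe the metric. Only `corrupt_pairings` is concrete here.
import numpy as np

def corrupt_pairings(pair_idx: np.ndarray, noise_rate: float, seed: int = 0) -> np.ndarray:
    """Randomly reshuffle `noise_rate` of the satellite match indices."""
    rng = np.random.default_rng(seed)
    out = pair_idx.copy()
    chosen = rng.choice(len(out), size=int(noise_rate * len(out)), replace=False)
    out[chosen] = rng.permutation(out[chosen])  # shuffle matches within subset
    return out

# for rate in (0.0, 0.05, 0.1, 0.2):
#     noisy = corrupt_pairings(pairs, rate)
#     model = train_baseline(noisy)          # placeholder for the pipeline
#     print(rate, evaluate_recall1(model))   # placeholder for the evaluation
```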

Circularity Check

0 steps flagged

No circularity: dataset construction paper with independent empirical claims

full rationale

The paper's core contribution is the MMLandmarks dataset itself, which defines instance-level one-to-one correspondences across aerial images, ground images, text, and GPS for 18,557 landmarks. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims that existing models cannot trivially solve the tasks are supported by direct empirical evaluation on the new benchmark, not by self-referential fitting or self-citation chains. The collection pipeline is described at a high level without reducing any result to prior fitted quantities from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contributes a new curated dataset rather than a derivation from axioms or parameters; the central premise rests on accurate cross-modal landmark matching whose verification is not detailed in the abstract.

pith-pipeline@v0.9.0 · 5538 in / 1041 out tokens · 24108 ms · 2026-05-16T20:46:42.957949+00:00 · methodology


Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · 9 internal anchors

  1. [1]

    Self-supervised material and texture representation learning for remote sensing tasks

    Peri Akiva, Matthew Purri, and Matthew Leotta. Self-supervised material and texture representation learning for remote sensing tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8203–8215, 2022.

  2. [2]

    Self-supervised multimodal versatile networks

    Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems, 33:25–37, 2020.

  3. [3]

    Look, listen and learn

    Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017.

  4. [4]

    NetVLAD: CNN architecture for weakly supervised place recognition

    Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.

  5. [5]

    OpenStreetView-5M: The many roads to global visual geolocation

    Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis, Constantin Aronssohn, Nacim Bouia, Stephanie Fu, Romain Loiseau, Van Nguyen Nguyen, Charles Raude, Elliot Vincent, et al. OpenStreetView-5M: The many roads to global visual geolocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21967–21977, 2024.

  6. [6]

    Geography-aware self-supervised learning

    Kumar Ayush, Burak Uzkent, Chenlin Meng, Kumar Tanmay, Marshall Burke, David Lobell, and Stefano Ermon. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10181–10190, 2021.

  7. [7]

    Data2vec: A general framework for self-supervised learning in speech, vision and language

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pages 1298–1312. PMLR, 2022.

  8. [8]

    SatlasPretrain: A large-scale dataset for remote sensing image understanding

    Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. SatlasPretrain: A large-scale dataset for remote sensing image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16772–16782, 2023.

  9. [9]

    MegaLoc: One retrieval to place them all

    Gabriele Berton and Carlo Masone. MegaLoc: One retrieval to place them all. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2861–2867, 2025.

  10. [10]

    EigenPlaces: Training viewpoint robust models for visual place recognition

    Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. EigenPlaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11080–11090, 2023.

  11. [11]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  12. [12]

    Remote sensing image scene classification: Benchmark and state of the art

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.

  13. [13]

    Functional map of the world

    Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.

  14. [14]

    Where we are and what we're looking at: Query based worldwide image geo-localization using hierarchies and scenes

    Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. Where we are and what we're looking at: Query based worldwide image geo-localization using hierarchies and scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23182–23190, 2023.

  15. [15]

    SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery

    Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.

  16. [16]

    WildSAT: Learning satellite image representations from wildlife observations

    Rangel Daroya, Elijah Cole, Oisin Mac Aodha, Grant Van Horn, and Subhransu Maji. WildSAT: Learning satellite image representations from wildlife observations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6143–6154, 2025.

  17. [17]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.

  18. [18]

    Sample4Geo: Hard negative sampling for cross-view geo-localisation

    Fabian Deuser, Konrad Habel, and Norbert Oswald. Sample4Geo: Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16847–16856, 2023.

  19. [19]

    GeoBind: Binding text, image, and audio through satellite images

    Aayush Dhakal, Subash Khanal, Srikumar Sastry, Adeel Ahmad, and Nathan Jacobs. GeoBind: Binding text, image, and audio through satellite images. In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, pages 2729–2733. IEEE, 2024.

  20. [20]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  21. [21]

    Major TOM: Expandable datasets for earth observation

    Alistair Francis and Mikolaj Czerkawski. Major TOM: Expandable datasets for earth observation. In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, pages 2935–2940. IEEE, 2024.

  22. [22]

    Omnivore: A single model for many visual modalities

    Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.

  23. [23]

    ImageBind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

  24. [24]

    OmniMAE: Single model masked pretraining on images and videos

    Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. OmniMAE: Single model masked pretraining on images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10406–10417, 2023.

  25. [25]

    SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery

    Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  26. [26]

    AudioCLIP: Extending CLIP to image, text and audio

    Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. AudioCLIP: Extending CLIP to image, text and audio. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.

  27. [27]

    Learning generalized zero-shot learners for open-domain image geolocalization

    Lukas Haas, Silas Alberti, and Michal Skreta. Learning generalized zero-shot learners for open-domain image geolocalization. arXiv preprint arXiv:2302.00275, 2023.

  28. [28]

    PIGEON: Predicting image geolocations

    Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn. PIGEON: Predicting image geolocations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12893–12902, 2024.

  29. [29]

    im2gps: Estimating geographic information from a single image

    James Hays and Alexei A. Efros. im2gps: Estimating geographic information from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

  30. [30]

    EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.

  31. [31]

    To use or not to use proprietary street view images in (health and place) research? That is the question

    Marco Helbich, Matthew Danish, SM Labib, and Britta Ricker. To use or not to use proprietary street view images in (health and place) research? That is the question. Health & Place, 87:103244, 2024.

  32. [32]

    CV-Cities: Advancing cross-view geo-localization in global cities

    Gaoshuang Huang, Yang Zhou, Luying Zhao, and Wenjian Gan. CV-Cities: Advancing cross-view geo-localization in global cities. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024.

  33. [33]

    SatCLIP: Global, general-purpose location embeddings with satellite imagery

    Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. SatCLIP: Global, general-purpose location embeddings with satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4347–4355, 2025.

  34. [34]

    xView: Objects in Context in Overhead Imagery

    Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xView: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856, 2018.

  35. [35]

    The benchmarking initiative for multimedia evaluation: MediaEval 2016

    Martha Larson, Mohammad Soleymani, Guillaume Gravier, Bogdan Ionescu, and Gareth JF Jones. The benchmarking initiative for multimedia evaluation: MediaEval 2016. IEEE MultiMedia, 24(1):93–96, 2017.

  36. [36]

    Unleashing unlabeled data: A paradigm for cross-view geo-localization

    Guopeng Li, Ming Qian, and Gui-Song Xia. Unleashing unlabeled data: A paradigm for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16719–16729, 2024.

  37. [37]

    S2MAE: A spatial-spectral pretraining foundation model for spectral remote sensing data

    Xuyang Li, Danfeng Hong, and Jocelyn Chanussot. S2MAE: A spatial-spectral pretraining foundation model for spectral remote sensing data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24088–24097, 2024.

  38. [38]

    Masked angle-aware autoencoder for remote sensing images

    Zhihao Li, Biao Hou, Siteng Ma, Zitong Wu, Xianpeng Guo, Bo Ren, and Licheng Jiao. Masked angle-aware autoencoder for remote sensing images. In European Conference on Computer Vision, pages 260–278. Springer, 2024.

  39. [39]

    PolyViT: Co-training vision transformers on images, videos and audio

    Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, and Mostafa Dehghani. PolyViT: Co-training vision transformers on images, videos and audio. arXiv preprint arXiv:2111.12993, 2021.

  40. [40]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  41. [41]

    RemoteCLIP: A vision language foundation model for remote sensing

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 2024.

  42. [42]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  43. [43]

    Lending orientation to neural networks for cross-view geo-localization

    Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5624–5633, 2019.

  44. [44]

    On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID

    Yang Long, Gui-Song Xia, Shengyang Li, Wen Yang, Michael Ying Yang, Xiao Xiang Zhu, Liangpei Zhang, and Deren Li. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:4205–4230, 2021.

  45. [45]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  46. [46]

    Vision foundation models in remote sensing: A survey

    Siqi Lu, Junlin Guo, James R Zimmer-Dauphinee, Jordan M Nieusma, Xiao Wang, Steven A Wernke, Yuankai Huo, et al. Vision foundation models in remote sensing: A survey. IEEE Geoscience and Remote Sensing Magazine.

  47. [47]

    SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding

    Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv preprint arXiv:2406.10100, 2024.

  48. [48]

    Change-aware sampling and contrastive learning for satellite images

    Utkarsh Mall, Bharath Hariharan, and Kavita Bala. Change-aware sampling and contrastive learning for satellite images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5261–5270, 2023.

  49. [49]

    Remote sensing vision-language foundation models without annotations via ground remote alignment

    Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, and Kavita Bala. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv preprint arXiv:2312.06960, 2023.

  50. [50]

    Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data

    Oscar Manas, Alexandre Lacoste, Xavier Giró-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9414–9423, 2021.

  51. [51]

    Land cover classification and feature extraction from National Agriculture Imagery Program (NAIP) orthoimagery: A review

    Aaron Maxwell, Timothy Warner, Brian Vanderbilt, and Christopher Ramezan. Land cover classification and feature extraction from National Agriculture Imagery Program (NAIP) orthoimagery: A review. Photogrammetric Engineering & Remote Sensing, 83:737–747, 2017.

  52. [52]

    Audio-visual instance discrimination with cross-modal agreement

    Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12486, 2021.

  53. [53]

    Geolocation estimation of photos using a hierarchical model and scene classification

    Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.

  54. [54]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In European Conference on Computer Vision, pages 407–426. Springer, 2022.

  55. [55]

    MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning

    Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, and Nico Lang. MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning. In European Conference on Computer Vision, pages 164–182. Springer, 2024.

  56. [56]

    Large-scale image retrieval with attentive deep local features

    Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017.

  57. [57]

    Rethinking transformers pre-training for multi-spectral satellite imagery

    Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Rethinking transformers pre-training for multi-spectral satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27811–27819, 2024.

  58. [58]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  59. [59]

    OpenStreetMap

    OpenStreetMap contributors. OpenStreetMap. https://www.openstreetmap.org, 2024.

  60. [60]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  61. [61]

    Where in the world is this image? Transformer-based geo-localization in the wild

    Shraman Pramanick, Ewa M Nowara, Joshua Gleason, Carlos D Castillo, and Rama Chellappa. Where in the world is this image? Transformer-based geo-localization in the wild. In European Conference on Computer Vision, pages 196–

  62. [62]

    Springer, 2022

  63. [63]

    Revisiting Oxford and Paris: Large-scale image retrieval benchmarking

    Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.

  64. [64]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  65. [65]

    Optimization of rank losses for image retrieval

    Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, and Nicolas Thome. Optimization of rank losses for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  66. [66]

    Mission critical – satellite data is a distinct modality in machine learning

    Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical – satellite data is a distinct modality in machine learning. arXiv preprint arXiv:2402.01444, 2024.

  67. [67]

    BirdSAT: Cross-view contrastive masked autoencoders for bird species classification and mapping

    Srikumar Sastry, Subash Khanal, Aayush Dhakal, Di Huang, and Nathan Jacobs. BirdSAT: Cross-view contrastive masked autoencoders for bird species classification and mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7136–7145, 2024.

  68. [68]

    TaxaBind: A unified embedding space for ecological applications

    Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, and Nathan Jacobs. TaxaBind: A unified embedding space for ecological applications. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1765–1774. IEEE, 2025.

  69. [69]

    SEN12MS – A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion

    Michael Schmitt, Lloyd Haydn Hughes, Chunping Qiu, and Xiao Xiang Zhu. SEN12MS – A curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. arXiv preprint arXiv:1906.07789, 2019.

  70. [70]

    CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps

    Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536–551, 2018.

  71. [71]

    GeoPixel: Pixel grounding large multimodal model in remote sensing

    Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. GeoPixel: Pixel grounding large multimodal model in remote sensing. arXiv preprint arXiv:2501.13925, 2025.

  72. [72]

    Spatial-aware feature aggregation for cross-view image based geo-localization

    Yujiao Shi, Liu Liu, Xin Yu, and Hongdong Li. Spatial-aware feature aggregation for cross-view image based geo-localization. Advances in Neural Information Processing Systems, 32, 2019.

  73. [73]

    Where am I looking at? Joint location and orientation estimation by cross-view matching

    Yujiao Shi, Xin Yu, Dylan Campbell, and Hongdong Li. Where am I looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4064–4072, 2020.

  74. [74]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  75. [75]

    BioCLIP: A vision foundation model for the tree of life

    Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. BioCLIP: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19412–19424, 2024.

  76. [76]

    BigEarthNet: A large-scale benchmark archive for remote sensing image understanding

    Gencer Sumbul, Marcela Charfuelan, Begüm Demir, and Volker Markl. BigEarthNet: A large-scale benchmark archive for remote sensing image understanding. In IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019.

  77. [77]

    BigEarthNet-MM: A large-scale, multimodal, multilabel benchmark archive for remote sensing image classification and retrieval

    Gencer Sumbul, Arne De Wall, Tristan Kreuziger, Filipe Marcelino, Hugo Costa, Pedro Benevides, Mario Caetano, Begüm Demir, and Volker Markl. BigEarthNet-MM: A large-scale, multimodal, multilabel benchmark archive for remote sensing image classification and retrieval [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 9(3):174–180, 2021.

  78. [78]

    FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery

    Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022.

  79. [79]

    YFCC100M: The new data in multimedia research

    Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

  80. [80]

    Contrastive multiview coding

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.

Showing first 80 references.