DualGeo: A Dual-View Framework for Worldwide Image Geo-localization

Hang He; Hao Tang; Junchao Cui; Shaoyong Du; Wenqi Shi; Xiangyang Luo; Xuanzi Ma

arxiv: 2604.25533 · v1 · submitted 2026-04-28 · 💻 cs.CV

Junchao Cui, Wenqi Shi, Shaoyong Du, Hang He, Xuanzi Ma, Hao Tang, Xiangyang Luo

Pith reviewed 2026-05-07 16:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords image geo-localization · dual-view framework · bidirectional cross-attention · contrastive learning · geographic clustering · large multimodal models · semantic segmentation · worldwide localization

The pith

DualGeo fuses image features with semantic segmentation via cross-attention and aligns them to GPS through contrastive learning, then refines candidates with clustering and large multimodal models to improve worldwide geo-localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DualGeo, a two-stage framework that first builds a retrieval database by fusing standard image features with semantic segmentation outputs through bidirectional cross-attention and aligning the results to GPS coordinates with dual-view contrastive learning. This step aims to create representations less sensitive to lighting, weather, and seasonal changes than visual features alone. The second stage retrieves candidate locations, applies geographic clustering to remove outliers, and passes the filtered set to large multimodal models for precise coordinate prediction. Experiments on three standard benchmarks report gains in accuracy at street-level and city-level scales compared with prior methods. A sympathetic reader would care because accurate worldwide localization from a single photo has direct uses in mapping, navigation, and image search.
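The outlier-removal step in the second stage can be illustrated with a small sketch. The material here does not say which clustering algorithm the paper uses, so the code below assumes a simple radius-based rule: keep the retrieved candidates that fall within a fixed radius of the best-supported candidate and discard the rest. The function names, the 10 km radius in the usage note, and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def filter_by_largest_cluster(candidates, radius_km):
    """Assumed rule, not the paper's: keep the largest radius-neighbourhood.

    Every candidate is tried as a cluster centre; the centre with the most
    neighbours inside radius_km wins, and only its neighbours are returned
    (in their original retrieval order).
    """
    best = []
    for center in candidates:
        members = [c for c in candidates if haversine_km(center, c) <= radius_km]
        if len(members) > len(best):
            best = members
    return best
```

Calling `filter_by_largest_cluster` on three candidates in central Paris plus one in New York and one in Tokyo, with `radius_km=10`, keeps only the Paris candidates, which is the kind of filtered set that would then be handed to the large multimodal model.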

Core claim

DualGeo establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention and aligning them with GPS coordinates through dual-view contrastive learning to build a global retrieval database. It then performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering and feeding them into large multimodal models for final coordinate prediction.

What carries the argument

Bidirectional cross-attention fusion of image and semantic segmentation features, followed by dual-view contrastive alignment to GPS, geographic clustering for outlier removal, and large multimodal model coordinate prediction.
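To make the fusion step concrete, here is a minimal sketch of bidirectional cross-attention on toy token lists. It is not the authors' implementation: it omits the learned query, key, and value projections a real model would have, uses a single head, and assumes the two attended streams are combined by a residual add followed by mean pooling and concatenation. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over context tokens."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def bidirectional_fuse(img_tokens, seg_tokens):
    """Assumed fusion: attend in both directions, add residuals, pool, concatenate."""
    img_ctx = cross_attend(img_tokens, seg_tokens)  # image queries, segmentation keys/values
    seg_ctx = cross_attend(seg_tokens, img_tokens)  # segmentation queries, image keys/values
    img_out = [[x + c for x, c in zip(tok, ctx)] for tok, ctx in zip(img_tokens, img_ctx)]
    seg_out = [[x + c for x, c in zip(tok, ctx)] for tok, ctx in zip(seg_tokens, seg_ctx)]
    return mean_pool(img_out) + mean_pool(seg_out)
```

The point of attending in both directions is that each view is re-expressed in terms of the other before pooling, so the fused vector is not dominated by appearance cues alone.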

If this is right

The method improves street-level localization accuracy by 3.6 to 16.58 percent on IM2GPS, IM2GPS3k, and YFCC4k.
It improves city-level localization accuracy by 1.29 to 8.77 percent on the same benchmarks.
Representations built this way are more robust to environmental variations than those relying on visual features alone.
Geographic clustering followed by large multimodal model prediction provides effective post-processing to reduce outlier errors.
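The street-level and city-level figures above are threshold accuracies: the fraction of test images whose predicted coordinates fall within 1 km or 25 km of the ground truth, as defined in the abstract. A minimal sketch of that metric follows, assuming great-circle (haversine) distance; the function names are hypothetical and the paper's own evaluation code may differ in detail.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def accuracy_at(preds, truths, threshold_km):
    """Fraction of predictions strictly closer than threshold_km to the truth."""
    hits = sum(haversine_km(p, t) < threshold_km for p, t in zip(preds, truths))
    return hits / len(truths)


STREET_KM = 1.0   # street-level threshold from the abstract
CITY_KM = 25.0    # city-level threshold from the abstract
```

Because the metric is a hard threshold, a prediction that is 1.1 km off counts the same as one on another continent at street level, which is one reason a gain at the 1 km scale can be much harder to obtain than one at 25 km.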

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion-plus-refinement pattern could be tested on other retrieval tasks that need spatial context beyond pure appearance, such as visual place recognition in changing urban environments.
Large multimodal models may become a standard refinement step for any retrieval system whose top candidates can be grouped by an external signal like geography.
An ablation that isolates the contribution of semantic segmentation versus the contrastive alignment would clarify which component drives the reported robustness gains.

Load-bearing premise

Fusing image and semantic segmentation features through cross-attention and aligning them to GPS with contrastive learning produces location representations that stay effective despite changes in lighting, season, and weather, and that geographic clustering plus large multimodal models can reliably filter outliers and predict coordinates.
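The alignment half of this premise can be sketched as a contrastive objective between fused image embeddings and GPS embeddings. The exact loss used by DualGeo, and how its two views enter that loss, is not stated in the material here, so the code below assumes a symmetric InfoNCE loss of the kind used in CLIP-style training. The function names and the temperature of 0.07 are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
        loss += log_denom - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(img_emb, gps_emb, temperature=0.07):
    """Assumed CLIP-style form: average the image-to-GPS and GPS-to-image losses."""
    return 0.5 * (info_nce(img_emb, gps_emb, temperature)
                  + info_nce(gps_emb, img_emb, temperature))
```

The loss is small when each image embedding is closest to its own location embedding and large when it is closer to another image's location, which is what makes nearest-neighbour retrieval against a GPS-indexed database meaningful after training.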

What would settle it

Run DualGeo on a fresh test set of images captured under lighting or weather conditions absent from the training distribution; if accuracy at street and city scales falls to or below the level of prior visual-only methods, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2604.25533 by Hang He, Hao Tang, Junchao Cui, Shaoyong Du, Wenqi Shi, Xiangyang Luo, Xuanzi Ma.

**Figure 1.** Scenes under diurnal and seasonal variations. RGB images exhibit …

**Figure 2.** Overview of the proposed framework DualGeo.

**Figure 3.** Schematic diagram of cross-attention. RGB branches.

**Figure 4.** Performance of different n values on LMM on IM2GPS3k dataset.

**Figure 5.** Performance of different clustering radii on the IM2GPS3k dataset.

Original abstract

Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (<1 km) and city-level (<25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available : https://github.com/CJ310177/DualGeo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DualGeo fuses image and segmentation features then refines retrievals with clustering plus LMMs, posting benchmark gains that still need variance numbers to be convincing.

The paper's core idea is a two-stage pipeline for worldwide image geo-localization. Stage one fuses raw image features with semantic segmentation maps through bidirectional cross-attention and aligns the result to GPS coordinates with dual-view contrastive learning to build a retrieval database. Stage two clusters the top candidates geographically and feeds them to large multimodal models for the final coordinate output. This setup directly targets the usual problems of environmental sensitivity and outlier retrievals at global scale.

Referee Report

1 major / 1 minor

Summary. The manuscript presents DualGeo, a two-stage framework for worldwide image geo-localization. The first stage fuses image features with semantic segmentation features using bidirectional cross-attention and aligns them to GPS coordinates via dual-view contrastive learning to create a retrieval database. The second stage applies geographic clustering to re-rank candidates and uses large multimodal models (LMMs) for final coordinate prediction. Experiments on the IM2GPS, IM2GPS3k, and YFCC4k datasets demonstrate improvements over state-of-the-art methods, with reported gains of 3.6%-16.58% in street-level (<1 km) accuracy and 1.29%-8.77% in city-level (<25 km) accuracy.

Significance. If the reported gains hold under statistical scrutiny, DualGeo would represent a meaningful advance by combining visual-semantic fusion with contrastive GPS alignment and LMM-based refinement to mitigate environmental sensitivity and outlier issues in geo-localization. The public availability of code and datasets is a clear strength that supports reproducibility and extension by the community.

major comments (1)

Experiments section (results tables on IM2GPS, IM2GPS3k, YFCC4k): The central claims of 3.6%-16.58% and 1.29%-8.77% absolute improvements are presented as single-run point estimates with no standard deviations, error bars, multiple random seeds, or statistical significance tests against the cited baselines. This leaves open the possibility that the observed deltas arise from training stochasticity or retrieval sampling rather than the bidirectional cross-attention or LMM refinement components.

minor comments (1)

Abstract: The availability statement contains an extraneous space ('available : https'); this should be corrected to 'available at https' for standard formatting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for recognizing the potential significance of DualGeo. We address the concern about statistical robustness of the reported results below and commit to strengthening the experimental section accordingly.

Referee: Experiments section (results tables on IM2GPS, IM2GPS3k, YFCC4k): The central claims of 3.6%-16.58% and 1.29%-8.77% absolute improvements are presented as single-run point estimates with no standard deviations, error bars, multiple random seeds, or statistical significance tests against the cited baselines. This leaves open the possibility that the observed deltas arise from training stochasticity or retrieval sampling rather than the bidirectional cross-attention or LMM refinement components.

Authors: We acknowledge that the current results are reported as single-run point estimates without standard deviations or formal statistical tests. To address this rigorously, in the revised manuscript we will re-run the full training and evaluation pipeline on all three datasets using at least three different random seeds. We will report mean accuracy and standard deviation for each method, add error bars to the tables, and conduct paired statistical significance tests (e.g., t-tests) against the strongest baselines. These additions will allow readers to assess whether the observed improvements are attributable to the bidirectional cross-attention and LMM refinement rather than training variability or sampling effects. We believe this will directly resolve the referee's concern.

revision: yes
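The multi-seed reporting promised in this response can be sketched with the Python standard library. The helper names are hypothetical, and pairing runs by shared random seed is an assumption about how the paired test would be set up. Any numbers passed to these functions in an example are placeholders, not results from the paper.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for seed-matched runs.

    ours[i] and baseline[i] are assumed to come from the same seed.
    Raises ZeroDivisionError if all per-seed differences are identical.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

With only three seeds the test has two degrees of freedom, so the two-sided 5% critical value is about 4.30; modest gains may fail to reach significance even if they are real, which argues for more seeds on the smaller benchmarks. The statistic can be converted to a p-value with a t-distribution table or with `scipy.stats.ttest_rel`.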

Circularity Check

0 steps flagged

No circularity in DualGeo's framework or claims

The paper introduces DualGeo as a novel two-stage architecture: bidirectional cross-attention fusion of image and segmentation features, dual-view contrastive alignment to GPS, followed by geographic clustering and LMM refinement. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The reported accuracy gains are presented as empirical outcomes on IM2GPS, IM2GPS3k, and YFCC4k rather than algebraic identities or re-expressions of prior fitted results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about feature fusion and contrastive alignment rather than new postulates; no free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Bidirectional cross-attention fusion of image and semantic segmentation features yields geo-representations robust to environmental variation.
Invoked in the first stage to establish the retrieval database.

domain assumption: Dual-view contrastive learning successfully aligns the fused features with GPS coordinates.
Central mechanism for building the global retrieval database.
