
arxiv: 2605.05351 · v1 · submitted 2026-05-06 · 💻 cs.CV

egenioussBench: A New Dataset for Geospatial Visual Localisation

Pith reviewed 2026-05-08 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual localisation · benchmark dataset · 3D mesh · CityGML · geospatial data · smartphone images · pose estimation · non-co-visible queries

The pith

A new benchmark pairs airborne city 3D meshes with smartphone images to create realistic tests for large-scale visual localisation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces egenioussBench, a visual localisation benchmark built from a city-scale airborne 3D mesh and a CityGML LoD2 model as reference data. Smartphone query images carry centimetre-accurate, map-independent ground truth obtained via post-processed kinematic GNSS and ground-control-point adjustment. From an initial pool of 2,709 images the authors compute a co-visibility matrix from rendered depth and extract a maximum independent set to isolate 42 non-co-visible test images whose poses remain withheld. They also release a validation split of 412 sequential images with known poses. The resulting public leaderboard applies binning metrics at multiple pose-error thresholds together with global statistics to enable direct, scalable comparisons between mesh-based and model-based localisation approaches.

Core claim

By using deployable geospatial reference assets instead of structure-from-motion reconstructions and by enforcing a non-co-visible test split through maximum-independent-set selection on a rendered-depth co-visibility graph, egenioussBench supplies evaluation data that expose cross-view and cross-domain localisation difficulties while remaining extensible to city-scale problems.

What carries the argument

The maximum independent set extracted from the co-visibility matrix, which is itself estimated from depth renders of the reference models. It isolates a compact set of mutually non-overlapping query images that simulate realistic deployment conditions.
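The selection step can be sketched on a toy graph. This is a minimal illustration, not the authors' solver: it assumes a boolean matrix `covis` (a hypothetical name) in which `covis[i][j]` is true when images `i` and `j` overlap, and it uses a minimum-degree greedy heuristic, which returns a maximal independent set that is not guaranteed to be a maximum one.

```python
def greedy_independent_set(covis):
    """Return indices of mutually non-co-visible images (greedy heuristic).

    covis: square list of lists of bools; covis[i][j] is True when images
    i and j are co-visible. The diagonal is ignored.
    """
    n = len(covis)
    # Build adjacency sets, treating the matrix as symmetric.
    neighbours = {
        i: {j for j in range(n) if j != i and (covis[i][j] or covis[j][i])}
        for i in range(n)
    }
    remaining = set(range(n))
    selected = []
    while remaining:
        # Pick the image with the fewest remaining co-visible neighbours;
        # ties are broken by the lower index so the result is deterministic.
        v = min(remaining, key=lambda i: (len(neighbours[i] & remaining), i))
        selected.append(v)
        # Drop the chosen image and everything co-visible with it.
        remaining -= neighbours[v] | {v}
    return sorted(selected)


# Toy example: a chain 0-1-2-3 plus an isolated image 4.
covis = [[False] * 5 for _ in range(5)]
for a, b in [(0, 1), (1, 2), (2, 3)]:
    covis[a][b] = covis[b][a] = True

test_split = greedy_independent_set(covis)
print(test_split)  # [0, 2, 4]
```

On a pool of a few thousand images an exact or kernelization-based solver would be the natural choice when the largest possible set is required; the greedy version only shows what "mutually non-co-visible" means operationally.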

If this is right

  • Localisation methods can be compared fairly on a public leaderboard using consistent binning metrics at several pose-error thresholds plus median, RMSE and outlier-ratio statistics.
  • The 412-image validation split with known poses supports supervised training of pose regressors and self-validation experiments.
  • Algorithms can be developed and tested against standard city mapping products rather than requiring per-scene structure-from-motion reconstructions.
  • The design isolates the specific difficulties of airborne-to-ground view changes and mesh-to-image domain shifts for targeted method improvement.
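The leaderboard statistics named above can be sketched as follows. The threshold pairs, the outlier cut-off and the error values are hypothetical placeholders, since this summary does not state the benchmark's actual bin definitions; the function names are likewise illustrative.

```python
import math
import statistics


def binning_recall(pos_err_m, rot_err_deg, thresholds):
    """Fraction of queries within each (metres, degrees) threshold pair."""
    n = len(pos_err_m)
    return {
        (t, r): sum(p <= t and q <= r for p, q in zip(pos_err_m, rot_err_deg)) / n
        for t, r in thresholds
    }


def global_stats(errors, outlier_threshold):
    """Median, RMSE and outlier ratio of a list of non-negative errors."""
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    outlier_ratio = sum(e > outlier_threshold for e in errors) / len(errors)
    return {
        "median": statistics.median(errors),
        "rmse": rmse,
        "outlier_ratio": outlier_ratio,
    }


# Hypothetical per-query errors for four images.
pos_err_m = [0.1, 0.4, 2.0, 12.0]
rot_err_deg = [0.5, 3.0, 4.0, 30.0]
# Hypothetical threshold pairs (position in metres, rotation in degrees).
thresholds = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]

recall = binning_recall(pos_err_m, rot_err_deg, thresholds)
stats = global_stats(pos_err_m, outlier_threshold=10.0)
```

A query counts towards a bin only when both its position and rotation errors fall under the pair of thresholds, which is the usual convention for this kind of metric and is assumed here.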

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of the benchmark could encourage the community to evaluate localisation systems against official geospatial assets instead of reconstructed models.
  • The same co-visibility-matrix and maximum-independent-set procedure could be reused to construct test sets for other cities or sensor combinations.
  • Large performance differences between mesh-based and LoD2-based entries on the leaderboard would indicate which reference representation is more suitable for particular operational settings.

Load-bearing premise

The maximum independent set taken from the estimated co-visibility matrix produces a test split whose difficulty and distribution genuinely reflect real-world deployment conditions without hidden selection bias.

What would settle it

Release the withheld ground truth for the 42-image test split, then replace those images with a random sample of equal size drawn from the full 2,709-image collection. If the methods that rank highest on the leaderboard keep their reported accuracy on the random sample, the non-co-visibility criterion does not create the claimed increase in difficulty; a clear gain in accuracy on the random sample would support the claim.

Figures

Figures reproduced from arXiv: 2605.05351 by Alexander Kern, Francesco Vultaggio, Markus Gerke, Phillipp Fanta-Jende, Yasmin Loeper.

Figure 1: Comparison of egenioussBench to other state of the art
Figure 2: Overview of the area of interest in Braunschweig. Map
Figure 3: Geospatial mesh data; Data: Geofly, Processing:
Figure 4: LoD2 model; Data: City of Braunschweig
Original abstract

We present egenioussBench, a visual localisation benchmark built on geospatial reference data: a city-scale airborne 3D mesh and a CityGML LoD2 model. This pairing reflects deployable mapping assets and supports true scalability beyond traditional SfM-based approaches. The query data comprise smartphone images with centimetre-accurate, map-independent ground truth obtained via PPK and GCP/CP-aided adjustment. From 2,709 images, we derive a non-co-visible subset by estimating the full co-visibility matrix from rendered depth and selecting a maximum independent set; the released data include a test split of 42 non-co-visible images with withheld ground truth and a validation split of 412 sequential images with poses, e.g. for training of pose regressors and self-validation. The benchmark features a public leaderboard evaluated with binning metrics at multiple pose-error thresholds alongside global statistics (median, RMSE, outlier ratio), ensuring fair, like-for-like comparison across mesh- and LoD2-based methods. Together, these design choices expose realistic cross-view and cross-domain challenges while providing a rigorous, scalable path for advancing large-scale visual localisation. We make the evaluation code and data availeable at https://github.com/fratopa/egenioussBench and https://www.egeniouss.eu/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents egenioussBench, a benchmark for geospatial visual localisation built on a city-scale airborne 3D mesh and CityGML LoD2 model. Smartphone images with centimetre-accurate ground truth (via PPK and GCP/CP-aided adjustment) are used to derive a non-co-visible test split of 42 images by estimating a full co-visibility matrix from rendered depth and extracting a maximum independent set; a validation split of 412 sequential images is also provided. The benchmark includes a public leaderboard using binning metrics at multiple pose-error thresholds plus global statistics (median, RMSE, outlier ratio) to support fair comparisons of mesh- and LoD2-based methods.

Significance. If the test-split construction is shown to avoid selection bias, this work provides a valuable contribution by releasing a scalable benchmark that leverages deployable geospatial assets rather than SfM reconstructions, exposing realistic cross-view and cross-domain challenges. The withheld test ground truth, evaluation code, and public data availability at the cited GitHub and website are clear strengths that enhance reproducibility and utility for the community.

major comments (1)
  1. [Abstract (and corresponding methods section on co-visibility and split construction)] The derivation of the 42-image test split via maximum independent set from the estimated co-visibility matrix (described in the abstract) may systematically bias the selection toward low-density regions, atypical viewpoints, or unique geometric features, since the solver prioritizes global non-overlap rather than matching clustered or sequential real-world query statistics. No quantitative validation—such as spatial histograms, depth distributions, or comparison against random non-co-visible sampling—is provided to confirm the split reflects realistic deployment conditions, which is load-bearing for the central claim that the benchmark exposes realistic challenges.
minor comments (2)
  1. [Abstract] Typo: 'availeable' should be corrected to 'available'.
  2. [Abstract] The abstract refers to 'binning metrics at multiple pose-error thresholds' without specifying the exact thresholds or bin definitions; adding these details would improve clarity for readers evaluating the leaderboard protocol.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the single major comment below, providing clarification on our design choices while acknowledging the need for additional supporting analysis. We commit to incorporating the suggested validation in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract (and corresponding methods section on co-visibility and split construction)] The derivation of the 42-image test split via maximum independent set from the estimated co-visibility matrix (described in the abstract) may systematically bias the selection toward low-density regions, atypical viewpoints, or unique geometric features, since the solver prioritizes global non-overlap rather than matching clustered or sequential real-world query statistics. No quantitative validation—such as spatial histograms, depth distributions, or comparison against random non-co-visible sampling—is provided to confirm the split reflects realistic deployment conditions, which is load-bearing for the central claim that the benchmark exposes realistic challenges.

    Authors: We thank the referee for this important observation. The maximum independent set (MIS) was deliberately selected to guarantee a mutually non-co-visible test set of 42 images, which is central to the benchmark's goal of evaluating localisation performance under realistic cross-view and cross-domain conditions without trivial overlaps. This construction avoids the common pitfall of clustered queries that could artificially boost metrics through repeated similar viewpoints. While the global optimisation of MIS could in principle favour certain distributions, the underlying image set of 2,709 densely sampled smartphone images across the city-scale area provides a broad base from which the MIS is drawn, mitigating extreme bias. Nevertheless, we agree that the manuscript does not include quantitative checks (spatial histograms, depth distributions, or random-sampling baselines) to empirically confirm the split's representativeness. In the revised version we will add these analyses, including (i) spatial density maps of the selected test points versus the full set, (ii) depth and viewpoint histograms, and (iii) a direct comparison against randomly sampled non-co-visible subsets of the same size. These additions will strengthen the claim that the benchmark reflects realistic deployment conditions. revision: yes
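The random-sampling baseline proposed in the response could be prototyped along these lines. It is a sketch under stated assumptions: `neighbours` is a hypothetical adjacency mapping derived from the co-visibility matrix, and the sampler uses random-order greedy selection, which does not sample uniformly over all non-co-visible subsets.

```python
import random


def random_non_covisible_subset(neighbours, k, rng, max_tries=1000):
    """Draw k mutually non-co-visible images by random-order greedy selection.

    neighbours: dict mapping image index -> set of co-visible image indices.
    Returns a sorted list of k indices, or None if no attempt reached size k.
    """
    nodes = list(neighbours)
    for _ in range(max_tries):
        rng.shuffle(nodes)
        chosen, blocked = [], set()
        for v in nodes:
            if v in blocked:
                continue
            chosen.append(v)
            # Block the chosen image and everything co-visible with it.
            blocked |= neighbours[v] | {v}
            if len(chosen) == k:
                return sorted(chosen)
    return None


# Toy example: six images arranged in a ring, each overlapping its two neighbours.
neighbours = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
rng = random.Random(0)  # fixed seed so the draw is repeatable
subset = random_non_covisible_subset(neighbours, 2, rng)
```

Repeating the draw with different seeds and evaluating each subset with the same metrics would give a distribution against which the selected 42-image split can be compared.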

Circularity Check

0 steps flagged

No circularity in dataset construction or claims

full rationale

The paper is a data release describing procedural construction of a benchmark split (maximum independent set on an estimated co-visibility matrix derived from rendered depth and GT poses). This is an explicit, one-way definition of the test set properties rather than any derivation, prediction, or fitted parameter that reduces back to its own inputs. No equations, self-citations, or ansatzes are invoked to justify a result; the central claims about exposing cross-view challenges rest on the released data, code, and external verifiability of the construction steps. The skeptic concern about possible selection bias in the MIS procedure is a question of external validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a curated dataset and evaluation protocol; it introduces no new free parameters, mathematical axioms, or postulated entities beyond standard computer-vision assumptions about pose estimation and depth rendering.

axioms (1)
  • domain assumption: Depth maps rendered from the 3D mesh and CityGML model accurately reflect real scene geometry for co-visibility computation.
    Invoked when constructing the co-visibility matrix from rendered depth.
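One way to see where this assumption enters is a depth-based co-visibility test between two views. The sketch below is illustrative only: it assumes a shared pinhole camera, poses given as `(R, t)` with `x_cam = R @ x_world + t`, depth maps stored as nested lists, and a relative depth tolerance `tol`; none of these conventions are taken from the paper.

```python
def matvec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def covisibility_fraction(depth_a, pose_a, depth_b, pose_b, intr, tol=0.05):
    """Fraction of valid pixels of view A that are also seen in view B.

    depth_*: row-major lists of z-depths in metres (None = no surface hit).
    pose_*:  (R, t) with x_cam = R @ x_world + t.
    intr:    (fx, fy, cx, cy), shared by both views for simplicity.
    """
    fx, fy, cx, cy = intr
    (ra, ta), (rb, tb) = pose_a, pose_b
    ra_t = transpose(ra)
    h, w = len(depth_b), len(depth_b[0])
    valid = seen = 0
    for v, row in enumerate(depth_a):
        for u, z in enumerate(row):
            if z is None:
                continue
            valid += 1
            # Back-project pixel (u, v) of A into world coordinates.
            p_cam = [(u - cx) / fx * z, (v - cy) / fy * z, z]
            p_world = matvec(ra_t, [p_cam[i] - ta[i] for i in range(3)])
            # Transform the world point into B's camera frame.
            q = matvec(rb, p_world)
            q = [q[i] + tb[i] for i in range(3)]
            if q[2] <= 0:
                continue  # behind camera B
            ub = round(fx * q[0] / q[2] + cx)
            vb = round(fy * q[1] / q[2] + cy)
            if not (0 <= ub < w and 0 <= vb < h):
                continue  # outside B's image
            zb = depth_b[vb][ub]
            # Seen in B only if B's rendered depth agrees (nothing occludes it).
            if zb is not None and abs(zb - q[2]) <= tol * q[2]:
                seen += 1
    return seen / valid if valid else 0.0


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
facing_away = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]  # rotated 180 degrees about y
origin = [0.0, 0.0, 0.0]
wall = [[10.0] * 4 for _ in range(4)]  # flat surface 10 m in front of the camera
intr = (2.0, 2.0, 1.5, 1.5)

same_view = covisibility_fraction(wall, (identity, origin), wall, (identity, origin), intr)
opposite = covisibility_fraction(wall, (identity, origin), wall, (facing_away, origin), intr)
```

Thresholding this fraction for every image pair would yield a boolean co-visibility matrix. If the rendered depth deviates from the real scene, for example where the mesh is incomplete, the agreement check fails or passes for the wrong reason, which is why the assumption matters.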

pith-pipeline@v0.9.0 · 5543 in / 1272 out tokens · 69724 ms · 2026-05-08T16:49:09.416462+00:00 · methodology

