
arxiv: 1906.10855 · v1 · pith:ONQVMKHGnew · submitted 2019-06-26 · 💻 cs.CV

On the Role of Geometry in Geo-Localization

Pith reviewed 2026-05-25 16:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords: geo-localization, camera pose estimation, convolutional neural networks, lean images, geometric learning, 3D scene, pose recovery

The pith

A convolutional neural network can recover camera pose from lean images that contain only geometric cues such as edges and relative depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether CNNs perform geo-localization by learning scene geometry or by other means. It does this by training and testing on lean images projected from a minimal 3D city model that strips away texture and fine details. The network succeeds at estimating pose, and the authors conclude the success reflects geometric learning of the area rather than memorization. A sympathetic reader would care because the result isolates how much of the network's power comes from understanding 3D structure alone.

Core claim

The network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions are providing insight into the role of geometry in the CNN learning process and demonstrating the power of CNNs for recovering camera pose using lean images.

What carries the argument

Lean images: projections from a simple 3D city model that contain solely geometric information (edges, faces, or relative depth). They isolate geometry so the network must rely on it for pose estimation.
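To make the idea of a lean image concrete, the following minimal sketch derives two such channels from a depth map. It is not the paper's rendering pipeline, which is not described on this page. The function names `relative_depth` and `depth_edges`, the choice of normalized inverse depth as the relative-depth encoding, the `threshold` value, and the toy depth map are all assumptions made for illustration.

```python
import numpy as np

def relative_depth(depth):
    """Encode depth as normalized inverse depth in [0, 1].

    Assumption: depths are positive; sky or background is np.inf.
    The nearest surface maps to 1.0 and the background maps to 0.0,
    so absolute scale is discarded and only relative depth remains.
    """
    depth = np.asarray(depth, dtype=float)
    finite = np.isfinite(depth)
    if not finite.any():
        return np.zeros_like(depth)
    return depth[finite].min() / depth

def depth_edges(depth, threshold=0.1):
    """Mark pixels where relative depth jumps by more than `threshold`
    with respect to the pixel above or to the left (hypothetical edge rule)."""
    rel = relative_depth(depth)
    dy = np.abs(np.diff(rel, axis=0, prepend=rel[:1, :]))
    dx = np.abs(np.diff(rel, axis=1, prepend=rel[:, :1]))
    return (np.maximum(dx, dy) > threshold).astype(np.uint8)

# Toy 4x6 depth map: a near building (depth 10) beside a far one (depth 30),
# with sky (infinite depth) above both.
depth = np.full((4, 6), np.inf)
depth[1:, :3] = 10.0
depth[2:, 3:] = 30.0

rel = relative_depth(depth)   # relative-depth lean image
edges = depth_edges(depth)    # edge lean image
```

Neither channel carries texture or color, which is the property the lean-image construction relies on. A faces-style image would additionally need per-surface labels from the 3D model, which this sketch does not attempt.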

If this is right

  • CNNs can estimate camera pose using only geometric information without texture.
  • The network learns the geometry of the geographical area rather than memorizing specific images.
  • This approach supplies a way to measure the contribution of geometry inside CNN-based localization.
  • Pose recovery remains possible when input is restricted to edges, faces, and relative depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lean-image protocol could be used to test geometric learning in other vision tasks such as object recognition.
  • Results suggest that depth sensors or edge detectors alone might support localization in structured environments if the network is trained accordingly.
  • Extending the method to real-world depth maps instead of synthetic lean images would test whether the geometric learning transfers outside the simple 3D model.

Load-bearing premise

The lean-image construction and experimental protocol successfully isolate pure geometric cues so that observed performance reflects geometric learning rather than unintended patterns or memorization.

What would settle it

Train on lean images of one set of viewpoints and test on lean images from the same model but with geometry altered (for example by swapping building heights while keeping edge patterns similar) and check whether accuracy drops sharply.
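The check described above can be sketched as follows. This is a hedged illustration only: the building representation (rows of `x, y, width, depth, height`), the helper names, the 5-unit success threshold, and the error values are hypothetical and do not come from the paper.

```python
import numpy as np

def swap_building_heights(buildings, seed=0):
    """Return a copy of `buildings` with heights permuted across buildings.

    Assumed layout: each row is (x, y, width, depth, height).
    Footprints stay untouched, so the top-view layout is preserved
    while the 3D geometry changes.
    """
    buildings = np.asarray(buildings, dtype=float)
    altered = buildings.copy()
    rng = np.random.default_rng(seed)
    altered[:, 4] = rng.permutation(buildings[:, 4])
    return altered

def success_rate(position_errors, threshold):
    """Fraction of test views localized within `threshold` of the truth."""
    return float(np.mean(np.asarray(position_errors, dtype=float) <= threshold))

def accuracy_drop(errors_original, errors_altered, threshold):
    """Success rate on the original model minus that on the altered one."""
    return (success_rate(errors_original, threshold)
            - success_rate(errors_altered, threshold))

# Toy city with three buildings, and made-up per-view position errors.
buildings = np.array([[0, 0, 10, 10, 15],
                      [20, 0, 10, 10, 40],
                      [0, 20, 10, 10, 25]], dtype=float)
altered = swap_building_heights(buildings)

errors_original = [1.0, 2.0, 3.0, 20.0]
errors_altered = [1.0, 15.0, 30.0, 40.0]
drop = accuracy_drop(errors_original, errors_altered, threshold=5.0)
```

In a real run the two error lists would come from evaluating the trained network on lean images rendered from `buildings` and from `altered`. A large drop would suggest the network depends on the actual 3D geometry rather than on edge patterns alone.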

Figures

Figures reproduced from arXiv: 1906.10855 by Ariel Shamir, Moti Kadosh, Yael Moses.

Figure 1: Top: lean images contain mostly geometric features: edges (left), faces (center), and depth information (right). We train a CNN to solve the localization problem using such images alone. Bottom: a top view of a city area (buildings are marked as white) where color indicates the localization success rate of the network from red (high) to blue (low). For instance, note how open spaces are more distinct than…
Figure 2: Bird’s-eye view of one of the areas we used.
Figure 3: Example of sampling positions on an area of the…
Figure 4: Illustration in 2D of the evaluation measures for…
Figure 5: Transfer learning: learning from scratch vs. start…
Original abstract

Humans can build a mental map of a geographical area to find their way and recognize places. The basic task we consider is geo-localization - finding the pose (position & orientation) of a camera in a large 3D scene from a single image. We aim to experimentally explore the role of geometry in geo-localization in a convolutional neural network (CNN) solution. We do so by ignoring the often available texture of the scene. We therefore deliberately avoid using texture or rich geometric details and use images projected from a simple 3D model of a city, which we term lean images. Lean images contain solely information that relates to the geometry of the area viewed (edges, faces, or relative depth). We find that the network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions of this paper are: (i) providing insight into the role of geometry in the CNN learning process; and (ii) demonstrating the power of CNNs for recovering camera pose using lean images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript explores the role of geometry in geo-localization using convolutional neural networks by training on 'lean images' generated from a simple 3D city model, which contain only geometric features such as edges, faces, or relative depth without texture. The central claim is that the CNN can estimate camera pose from these images via geometric learning of the geographical area rather than memorization.

Significance. Should the experimental results be substantiated with quantitative evidence and controls, the findings would contribute to understanding the extent to which CNNs can rely on pure geometric cues for pose estimation tasks, potentially informing the design of more robust localization systems.

major comments (2)
  1. [Abstract] The claim of successful pose estimation and non-memorization is asserted without any quantitative metrics, error bars, dataset sizes, or explicit controls for memorization, leaving the central empirical claim only partially supported by the provided information.
  2. [Method / Experiments] The experimental protocol for constructing lean images and ruling out memorization (e.g., via held-out test views or ablation on geometric components) is not described in sufficient detail to confirm that observed performance isolates geometric learning rather than unintended patterns.
minor comments (2)
  1. Provide the specific CNN architecture, loss function, and training hyperparameters used for the pose estimation task.
  2. Clarify how the three variants of lean images (edges, faces, relative depth) were generated and whether results are reported separately for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and methods.

  1. Referee: [Abstract] Abstract: the claim of successful pose estimation and non-memorization is asserted without any quantitative metrics, error bars, dataset sizes, or explicit controls for memorization, leaving the central empirical claim only partially supported by the provided information.

    Authors: We agree that the abstract, as a concise summary, does not include the quantitative details present in the experiments section. In the revision we will update the abstract to report key metrics on pose estimation accuracy, dataset sizes, and the use of held-out views that support the claim of geometric learning over memorization.
    Revision: yes

  2. Referee: [Method / Experiments] The experimental protocol for constructing lean images and ruling out memorization (e.g., via held-out test views or ablation on geometric components) is not described in sufficient detail to confirm that observed performance isolates geometric learning rather than unintended patterns.

    Authors: We will expand the methods and experiments sections to include a more precise description of lean-image generation from the 3D model (specifying retained geometric elements such as edges and depth) and the exact train/test protocol using held-out views. This will clarify how the setup isolates geometric cues.
    Revision: yes
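One way a held-out-view protocol could be set up is to split by location rather than by image, so that no test position lies in a region seen during training. The sketch below is an assumption about how such a split might look, not the authors' protocol. The function name `spatial_holdout_split`, the grid `cell_size`, and the `test_fraction` are hypothetical.

```python
import numpy as np

def spatial_holdout_split(positions, cell_size, test_fraction, seed=0):
    """Split sample indices so that whole grid cells are held out for testing.

    positions: array of shape (n, 2) or (n, 3); only x and y are used.
    Returns (train_indices, test_indices).
    """
    positions = np.asarray(positions, dtype=float)
    # Assign every camera position to a square cell of the ground plane.
    cells = np.floor(positions[:, :2] / cell_size).astype(int)
    unique_cells, cell_index = np.unique(cells, axis=0, return_inverse=True)
    cell_index = cell_index.reshape(-1)

    # Hold out a random subset of cells, never individual images.
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique_cells))))
    test_cells = rng.choice(len(unique_cells), size=n_test, replace=False)

    test_mask = np.isin(cell_index, test_cells)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy example: a 10x10 grid of camera positions, cells of side 5.
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
positions = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
train_idx, test_idx = spatial_holdout_split(positions, cell_size=5.0,
                                            test_fraction=0.25)
```

Splitting by cell is stricter than a random per-image split: a network that only memorized training views has no nearby neighbor to fall back on for a held-out cell.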

Circularity Check

0 steps flagged

No significant circularity identified


This is an empirical experimental study reporting CNN performance on lean images (projections from a simple 3D city model containing only edges/faces/relative depth) for camera pose estimation. No mathematical derivation chain, equations, fitted parameters, or self-referential definitions exist in the claims. The central claim rests on experimental results indicating geometric learning rather than memorization, with no load-bearing steps that reduce to inputs by construction, self-citation, or ansatz smuggling. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard assumptions from deep learning for computer vision and the premise that lean images isolate geometry. No free parameters, invented entities, or non-standard axioms are mentioned.

axioms (1)
  • Domain assumption: convolutional neural networks can be trained to regress camera pose from image-derived geometric features.
    Invoked implicitly by the choice to train a CNN on lean images for pose estimation.
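For readers unfamiliar with pose regression, the sketch below shows a commonly used objective of the PoseNet type: a position error plus a weighted orientation error on unit quaternions. The loss actually used in the paper is not stated on this page, so the function names, the quaternion parameterization, and the weight `beta` are assumptions.

```python
import numpy as np

def pose_loss(pred_xyz, pred_quat, true_xyz, true_quat, beta=1.0):
    """Mean of ||x - x_hat|| + beta * ||q - q_hat / ||q_hat|| || over a batch."""
    pred_xyz = np.asarray(pred_xyz, dtype=float)
    true_xyz = np.asarray(true_xyz, dtype=float)
    pred_quat = np.asarray(pred_quat, dtype=float)
    true_quat = np.asarray(true_quat, dtype=float)
    # Network outputs are not unit length in general, so normalize them.
    pred_quat = pred_quat / np.linalg.norm(pred_quat, axis=-1, keepdims=True)
    pos_err = np.linalg.norm(pred_xyz - true_xyz, axis=-1)
    rot_err = np.linalg.norm(pred_quat - true_quat, axis=-1)
    return float(np.mean(pos_err + beta * rot_err))

def angular_error_deg(pred_quat, true_quat):
    """Rotation angle in degrees between two batches of quaternions."""
    p = np.asarray(pred_quat, dtype=float)
    t = np.asarray(true_quat, dtype=float)
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    # abs() treats q and -q as the same rotation.
    dot = np.clip(np.abs(np.sum(p * t, axis=-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))

# Toy check: identity orientation vs. a 90-degree rotation about the z axis.
identity = np.array([[1.0, 0.0, 0.0, 0.0]])
quarter_turn = np.array([[np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]])
origin = np.zeros((1, 3))
offset = np.array([[3.0, 4.0, 0.0]])

loss_same = pose_loss(origin, identity, origin, identity)
loss_shifted = pose_loss(offset, identity, origin, identity)
angle = angular_error_deg(quarter_turn, identity)
```

The weight `beta` trades off position against orientation accuracy and usually has to be tuned per scene, which is one reason a review would want the loss and its hyperparameters reported.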


