
arxiv: 2606.04788 · v1 · pith:Z33EQKERnew · submitted 2026-06-03 · 💻 cs.CV · cs.RO

Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

Pith reviewed 2026-06-28 06:48 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords floorplan localization · zero-shot localization · geometric primitives · minimal solvers · bird's-eye view · monocular 3D reconstruction · visual localization · robust estimation

The pith

Zero-shot floorplan localization matches lines and circles extracted from monocular 3D reconstructions to the map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that camera pose can be estimated inside a floorplan without task-specific training data or environment-specific retraining. It does so by lifting monocular images to 3D, projecting the result into bird's-eye view, detecting lines and circles there, and solving the alignment to the floorplan with minimal solvers inside a robust estimator. A sympathetic reader would care because floorplans are already available for most buildings, yet current learning-based methods need large-scale training data and environment-specific retraining before they can work.
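To make the bird's-eye-view step concrete, here is a minimal sketch of projecting reconstructed 3D points along gravity onto a 2D grid. It assumes the points are already gravity-aligned with z pointing up; the function names, the `cell_size` value, and the `min_points` filter are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Point3D = Tuple[float, float, float]
Cell = Tuple[int, int]


def project_to_bev(points: Iterable[Point3D], cell_size: float = 0.05) -> Dict[Cell, int]:
    """Drop the gravity (z) coordinate and count points per ground-plane cell.

    Assumption: z is the gravity axis, so (x, y) spans the floor plane.
    """
    counts: Counter = Counter()
    for x, y, _z in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return dict(counts)


def occupied_cells(bev: Dict[Cell, int], min_points: int = 2) -> Set[Cell]:
    """Keep only cells hit by enough points (a hypothetical noise filter)."""
    return {cell for cell, n in bev.items() if n >= min_points}
```

Wall points stack vertically, so they pile up in the same cell and survive the count filter, while isolated noise points do not. Line and circle detection would then run on the occupied cells.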

Core claim

Dominant geometric primitives—lines and circles—are extracted from a bird's-eye-view projection of a monocular 3D reconstruction and matched to the corresponding elements in the floorplan using dedicated minimal solvers inside a robust estimation framework, producing accurate localization in environments never seen during development and with one fixed set of hyperparameters.
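The "robust estimation framework" in this claim follows the hypothesize-and-verify pattern. The sketch below is a generic RANSAC-style loop with plain inlier counting, not the paper's hybrid RANSAC or its two cost functions. The toy translation solver and residual are hypothetical stand-ins for the primitive-based solvers.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Match = TypeVar("Match")
Model = TypeVar("Model")


def ransac(
    matches: Sequence[Match],
    solver: Callable[[List[Match]], Optional[Model]],
    residual: Callable[[Model, Match], float],
    sample_size: int,
    threshold: float,
    iterations: int = 200,
    seed: int = 0,
) -> Tuple[Optional[Model], List[Match]]:
    """Sample small subsets, fit a model to each, keep the one with most inliers."""
    rng = random.Random(seed)
    pool = list(matches)
    best_model: Optional[Model] = None
    best_inliers: List[Match] = []
    for _ in range(iterations):
        sample = rng.sample(pool, sample_size)
        model = solver(sample)
        if model is None:  # degenerate sample, try another one
            continue
        inliers = [m for m in pool if residual(model, m) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Toy problem (hypothetical): recover a 2D translation from point matches.
PointMatch = Tuple[Tuple[float, float], Tuple[float, float]]


def translation_from_one_match(sample: List[PointMatch]) -> Tuple[float, float]:
    (sx, sy), (tx, ty) = sample[0]
    return (tx - sx, ty - sy)


def translation_residual(model: Tuple[float, float], match: PointMatch) -> float:
    (sx, sy), (tx, ty) = match
    return math.hypot(sx + model[0] - tx, sy + model[1] - ty)
```

Swapping in a solver that consumes line or circle correspondences changes only the `solver` and `residual` arguments; the loop itself stays the same.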

What carries the argument

Minimal solvers that align lines and circles between the bird's-eye-view primitives and the floorplan geometry inside a robust estimator.
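As an illustration of what a line-based solver can look like, the sketch below recovers a 2D rigid pose (rotation plus translation) from two line correspondences in closed form. It is not the paper's solver. It assumes known metric scale, lines in normal form (theta, d) with cos(theta)·x + sin(theta)·y = d, and consistently oriented normals; all names are hypothetical.

```python
import math
from typing import Optional, Tuple

Line = Tuple[float, float]         # (theta, d): cos(theta)*x + sin(theta)*y = d
Pose = Tuple[float, float, float]  # (phi, tx, ty): x' = R(phi) x + t


def transform_line(line: Line, pose: Pose) -> Line:
    """Apply a rigid pose to a line: the normal rotates, the offset shifts by n'.t."""
    theta, d = line
    phi, tx, ty = pose
    theta2 = theta + phi
    return (theta2, d + math.cos(theta2) * tx + math.sin(theta2) * ty)


def solve_pose_from_two_lines(
    bev1: Line, map1: Line, bev2: Line, map2: Line, eps: float = 1e-9
) -> Optional[Pose]:
    """Rotation from the first pair's normals, translation from both offsets."""
    phi = map1[0] - bev1[0]
    a1, b1 = math.cos(map1[0]), math.sin(map1[0])
    a2, b2 = math.cos(map2[0]), math.sin(map2[0])
    det = a1 * b2 - b1 * a2  # equals sin(theta2' - theta1'); zero for parallel lines
    if abs(det) < eps:
        return None  # parallel lines leave one translation direction unconstrained
    r1 = map1[1] - bev1[1]
    r2 = map2[1] - bev2[1]
    # Cramer's rule on [a1 b1; a2 b2] [tx, ty]^T = [r1, r2]^T
    tx = (r1 * b2 - b1 * r2) / det
    ty = (a1 * r2 - a2 * r1) / det
    phi = math.atan2(math.sin(phi), math.cos(phi))  # wrap to (-pi, pi]
    return (phi, tx, ty)
```

In practice the normal direction of a detected line is ambiguous up to sign, so a real solver would likely test both orientations and let the robust estimator pick the better hypothesis.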

If this is right

  • The method generalizes to novel environments without any retraining.
  • It outperforms state-of-the-art learning-based methods on unseen environments.
  • A single fixed set of hyperparameters works across simulated and real-world datasets.
  • Localization succeeds using only the geometric primitives present in the floorplan and the reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same primitive-matching strategy could be tested on other compact map representations such as building footprints or CAD models if similar lines and circles are present.
  • Environments with strong structural regularity may need far less data than feature-learning approaches, reducing the cost of deploying localization in new buildings.
  • Combining the geometric solver with a learned prior only for ambiguous cases could be a direct next step without changing the core zero-shot claim.

Load-bearing premise

Lines and circles remain the dominant, reliably extractable geometric primitives in the bird's-eye view of monocular 3D reconstructions across most human-made indoor spaces.

What would settle it

A set of real indoor environments whose floorplans contain few straight lines or circles, or whose monocular reconstructions yield inaccurate bird's-eye-view primitives, such that localization success rate falls below 30 percent on average.

Figures

Figures reproduced from arXiv: 2606.04788 by Ayumi Umemura, Daniel Barath, Marc Pollefeys, Toshinori Kuwahara.

Figure 1: Pipeline overview. Given a sequence of images, we reconstruct wall-only 3D points and project them along gravity to obtain a bird’s-eye-view (BEV) map. Line segments and circular primitives are extracted from the BEV and matched to floorplan primitives via dedicated minimal solvers within a hybrid RANSAC loop. Each hypothesis is scored by two complementary cost functions – a consistency score rewarding BE…
Figure 2: Wall mask filtering. (a) The input image is warped into a gravity-aligned frame so that vertical walls produce consistent depth values per column. (b) Column-wise filtering: if any non-wall pixel in a column has depth exceeding the estimated wall depth, the column is discarded (red pixels); when multiple wall structures are present, the deeper wall surface is retained. (c) Filtering suppresses ghost artifa…
Figure 3: Qualitative results. Each block shows a test scene. Left: the input floorplan with the camera trajectory (green). Right: the corresponding BEV reconstruction overlaid on the floorplan after alignment with the ground-truth trajectory (blue) and our estimate (green). Our method accurately localizes across diverse environments – from synthetic rooms (Gibson) to large real-world buildings (LaMAR HGE, CAB) – w…
Figure 4: Parameter sensitivity. Each cell shows SR@1m (%) at sequence length 100 for a given combination of consistency threshold τc (columns) and violation threshold τv (rows). (a) On Gibson(t), performance is stable across all settings. (b) On LaMAR HGE (cropped), sensitivity is higher due to the structural complexity of the floorplan, yet strong accuracy is maintained across a wide operating range. with marginal…
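The SR@1m figure in the Figure 4 caption is presumably a success rate at a 1 m position-error threshold. The sketch below shows one plausible way to compute it, as the percentage of estimates within the threshold of ground truth. The excerpt does not give the paper's exact definition (per frame or per sequence, with or without a rotation criterion), so treat this as an assumption.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float]


def success_rate(
    estimates: Sequence[Position],
    ground_truth: Sequence[Position],
    threshold_m: float = 1.0,
) -> float:
    """Percentage of estimated 2D positions within threshold_m of ground truth."""
    if len(estimates) != len(ground_truth):
        raise ValueError("estimates and ground_truth must have the same length")
    if not estimates:
        return 0.0
    hits = sum(
        1
        for (ex, ey), (gx, gy) in zip(estimates, ground_truth)
        if math.hypot(ex - gx, ey - gy) <= threshold_m
    )
    return 100.0 * hits / len(estimates)
```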
Abstract

Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives -- lines and circles -- are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Z-FLoc, a zero-shot floorplan localization method. It extracts dominant geometric primitives (lines and circles) from a bird's-eye-view projection of monocular 3D reconstructions and matches them to the floorplan using dedicated minimal solvers inside a robust estimation framework. The method is claimed to generalize to novel environments without retraining or environment-specific data and to outperform state-of-the-art learning-based methods on both simulated and real-world datasets while using a single fixed set of hyperparameters across all experiments.

Significance. If the extraction and matching pipeline proves reliable, the result would be significant: it offers a practical, training-free alternative to data-driven methods for a core robotics/CV task, leveraging the ubiquity of geometric primitives and the ready availability of floorplans. The fixed-hyperparameter design and planned public code release are additional strengths for reproducibility.

major comments (1)
  1. [Experiments] The central claim requires reliable extraction of lines and circles from BEV projections of monocular reconstructions (which are scale-ambiguous and noisy in texture-poor scenes). The manuscript provides no quantitative metrics (precision, recall, or error distribution) for this extraction stage on the real-world datasets, leaving the load-bearing assumption unverified.
minor comments (1)
  1. The abstract states that source code will be made publicly available, but the manuscript contains no repository link, license information, or reproducibility checklist.
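The extraction metrics requested in the major comment could be computed along the following lines. This is a hedged sketch: it matches detected lines to reference lines greedily and one-to-one in (theta, d) normal form, with illustrative angle and offset tolerances. It also ignores that (theta, d) and (theta + pi, -d) describe the same line, so normals are assumed to be consistently oriented. None of these choices come from the manuscript.

```python
import math
from typing import List, Sequence, Tuple

Line = Tuple[float, float]  # (theta, d) in normal form


def precision_recall(
    detected: Sequence[Line],
    reference: Sequence[Line],
    max_angle: float = math.radians(5.0),
    max_offset: float = 0.1,
) -> Tuple[float, float]:
    """Greedy one-to-one matching of detected lines to reference lines."""
    unmatched: List[Line] = list(reference)
    true_positives = 0
    for theta, d in detected:
        for ref in unmatched:
            # Wrapped angular difference in [0, pi]
            dtheta = abs(math.atan2(math.sin(theta - ref[0]), math.cos(theta - ref[0])))
            if dtheta <= max_angle and abs(d - ref[1]) <= max_offset:
                unmatched.remove(ref)  # each reference line can be matched once
                true_positives += 1
                break
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

An error distribution would follow from collecting the per-match `dtheta` and offset differences instead of only counting hits.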

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.

Point-by-point responses
  1. Referee: [Experiments] The central claim requires reliable extraction of lines and circles from BEV projections of monocular reconstructions (which are scale-ambiguous and noisy in texture-poor scenes). The manuscript provides no quantitative metrics (precision, recall, or error distribution) for this extraction stage on the real-world datasets, leaving the load-bearing assumption unverified.

    Authors: We agree that the manuscript would benefit from direct quantitative evaluation of the line and circle extraction stage on real-world data. The current evaluation focuses on end-to-end localization accuracy, which serves as an indirect measure of extraction reliability because failures in primitive detection would prevent successful matching to the floorplan. To address the concern directly, the revised manuscript will include precision, recall, and error distribution metrics for the extraction pipeline on the real-world datasets, computed against available ground-truth annotations or a manually verified subset.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: geometric pipeline is self-contained and independent of fitted predictions or self-citations.

Full rationale

The paper presents an algorithmic pipeline that extracts lines and circles from BEV projections of monocular reconstructions and matches them to floorplans using minimal solvers inside a robust estimator. No equations or steps are shown that define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on load-bearing self-citations whose validity is internal to the authors. The zero-shot claim rests on the fixed-hyperparameter geometric matching procedure itself, which is externally verifiable against standard minimal-solver literature and does not reduce to any input data by construction. The provided abstract and description contain no self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that geometric primitives can be extracted and matched reliably in novel environments; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Dominant geometric primitives such as lines and circles are ubiquitous in human-made environments and provide appearance-invariant structural constraints.
    Explicitly stated as the key insight enabling zero-shot generalization.


