arxiv: 2605.22581 · v1 · pith:5O2NYMOQnew · submitted 2026-05-21 · cs.CV · cs.AI · cs.LG

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Pith reviewed 2026-05-22 06:52 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG
keywords floorplan localization · 3D scene reconstruction · cross-modal correspondence · density map projection · similarity transform · foundation model fine-tuning · computer vision

The pith

Reconstructing 3D scenes from images allows floorplan localization in large buildings using density map proxies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to localize images within the floorplans of large public buildings. It first reconstructs a gravity-aligned 3D scene from an unconstrained collection of images, then projects that scene into a 2D density map that acts as a proxy for the floorplan. The proxy is aligned with the provided floorplan through a 2D similarity transform. To bridge the appearance gap between density maps and architectural drawings, a 2D foundation model is fine-tuned to learn cross-modal correspondences while preserving structural consistency. The paper reports substantial improvements over prior methods, including in sparse settings with as few as one input image.

Core claim

The paper claims that floorplan localization in unconstrained environments can be performed by grounding the task in a 3D reconstruction: the scene is reconstructed and projected to a 2D density map proxy, then aligned to the rasterized floorplan using a similarity transform enabled by cross-modal matching from a fine-tuned foundation model. This allows operation without precise vectorized maps or small-scale assumptions.
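
To make the alignment step concrete, here is a minimal sketch of fitting a 2D similarity transform to matched point pairs by least squares. It is an illustration under stated assumptions, not the paper's implementation: the function names `fit_similarity_2d` and `apply_similarity_2d` are hypothetical, the matches are assumed to be given and free of outliers, and reflections are not modeled. Each 2D point is treated as a complex number, so the transform reads q = a·p + b, where |a| is the scale and arg(a) is the rotation.

```python
import cmath
import math


def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst points.

    src and dst are equal-length sequences of (x, y) pairs (assumed matches).
    Returns complex (a, b) such that dst is approximately a * src + b.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form solution of min over (a, b) of sum |q_i - a * p_i - b|^2.
    denom = sum(abs(pi - p_mean) ** 2 for pi in p)
    if denom == 0:
        raise ValueError("source points are all identical")
    a = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q)) / denom
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, point):
    """Map one (x, y) point through the fitted transform."""
    z = a * complex(*point) + b
    return z.real, z.imag


# Hypothetical matches: density-map coordinates -> floorplan pixel coordinates.
density_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
floorplan_pts = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b = fit_similarity_2d(density_pts, floorplan_pts)
scale = abs(a)                                 # approximately 2.0
rotation_deg = math.degrees(cmath.phase(a))    # approximately 90 degrees
```

With noisy or partly wrong correspondences, a plain least-squares fit like this one is sensitive to outliers, so a real pipeline would normally wrap it in some robust estimation scheme; that part is omitted here.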

What carries the argument

The key mechanism is the projection of a gravity-aligned 3D scene reconstruction into a 2D density map that serves as a floorplan proxy, combined with adaptation of a 2D foundation model for cross-modal alignment.
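
The projection step can be sketched as a top-down histogram of the reconstructed points. This is a minimal illustration, not the authors' code: it assumes gravity runs along the z axis (so dropping z gives the overhead view), and the function name `density_map`, the `cell_size` parameter and the log normalization are all choices made for this example.

```python
import math
from collections import Counter


def density_map(points, cell_size=0.25):
    """Project gravity-aligned 3D points onto the ground plane and count per cell.

    points: iterable of (x, y, z) tuples with z assumed to be the gravity axis.
    Returns (grid, origin): grid[row][col] holds log(1 + count) scaled to [0, 1],
    and origin is the (x, y) world position of the grid's minimum corner.
    """
    points = list(points)
    if not points:
        raise ValueError("need at least one 3D point")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    cols = int((max(xs) - x0) / cell_size) + 1
    rows = int((max(ys) - y0) / cell_size) + 1
    # Dropping z means vertical structures such as walls pile up in a few cells.
    counts = Counter(
        (int((y - y0) / cell_size), int((x - x0) / cell_size)) for x, y, _ in points
    )
    peak = math.log1p(max(counts.values()))
    grid = [[0.0] * cols for _ in range(rows)]
    for (r, c), n in counts.items():
        grid[r][c] = math.log1p(n) / peak
    return grid, (x0, y0)


# Three points stacked vertically (a wall-like column) plus one isolated point.
demo_points = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (1.0, 0.0, 0.5)]
demo_grid, demo_origin = density_map(demo_points, cell_size=0.5)
```

In the demo, the cell under the stacked points receives the highest density while the isolated point gives a weaker response, which is the property that lets wall outlines in the density map resemble the lines of a floorplan.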

If this is right

  • Substantial improvements in localization accuracy over previous approaches in large-scale settings.
  • Effective performance even when only a single input image is available.
  • Ability to use rasterized floorplans instead of requiring vectorized ones.
  • The approach scales to real-world public buildings with unconstrained image collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests potential for integration into mobile navigation systems for museums or airports.
  • Future work could test the method on dynamic environments where the floorplan changes over time.
  • The density map proxy might be useful for other tasks like 3D to 2D matching in robotics.

Load-bearing premise

The 3D reconstruction from the image collection must yield a gravity-aligned scene with a 2D density projection accurate and complete enough to proxy the floorplan reliably.

What would settle it

A test case in a large building where the image collection is too sparse to reconstruct a complete density map, leading to poor alignment accuracy with the floorplan.

Figures

Figures reproduced from arXiv: 2605.22581 by Hadar Averbuch-Elor, Junhyeong Cho, Ruojin Cai.

Figure 1: Given a collection of in-the-wild images and a rasterized floorplan, …
Figure 2: SceneAligner. Given in-the-wild images and a floorplan, it reconstructs a gravity-aligned 3D scene, extracts a 2D density map via projection, and solves for a 2D similarity transform M via correspondence estimation between the density map and floorplan using a shared encoder E. Reliable correspondences used to compute M are overlaid (in orange) on the aligned density map.
Figure 3: Adapting a 2D foundation model for floorplan alignment. We provide PCA visualizations of features before and after our fine-tuning scheme. As illustrated above, the pretrained DINOv3 [39] struggles to bridge the appearance gap, e.g., corresponding regions (white circles) map to different RGB colors. By contrast, our fine-tuning significantly refines the semantic cross-modal alignment. For reference, we sho…
Figure 4: Qualitative comparison in the wild. We compare correspondence predictions across baselines and our method, with floorplan localization results shown on the right. Cameras are illustrated in corresponding colors (e.g., for GT, and for Ours).
From Sec. 4.2.1 (Comparison with Correspondence-based Methods): Baselines. We compare against correspondence-based methods [17, 46, 41] on the clean subset of C3. C3Po [17] builds o…
Figure 5: Performance across varying view counts on C3 [17]. We evaluate the proposed method using different numbers of input images for 3D reconstruction (e.g., ≤ 150 denotes a maximum of 150 images per reconstruction). Notably, Ours (= 1) already outperforms C3Po [17] by a large margin, while Ours (≤ 30) is on par with Ours (≤ 150).
Panel labels: ≤ 150 images, ≤ 10 images, 3 images, 1 image
Figure 6: In-the-wild floorplan alignment across varying view counts. We visualize 3D points used for density map extraction and resulting density maps with reliable correspondences overlaid (in orange). Sparse views often suffice to recover geometry informative enough for floorplan alignment.
Figure 7: Alignment of interior and exterior 3D scenes. Using the floorplan as a shared geometric bridge, our approach enables independent alignment of interior and exterior reconstructions into a unified global coordinate system. This is achieved despite minimal visual overlap and large viewpoint differences between indoor and outdoor scenes.
Original abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SceneAligner for floorplan localization in large-scale buildings using rasterized floorplans. Given an unconstrained image collection, the method reconstructs a gravity-aligned 3D scene from the images, projects it to a 2D density map serving as a floorplan proxy, and aligns the proxy to the input floorplan via a 2D similarity transform. A 2D foundation model is fine-tuned to bridge the appearance gap between density maps and architectural drawings, using a scheme that encourages semantically aligned matches while preserving structural consistency. The paper reports extensive experiments demonstrating substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image.

Significance. If the central claims are supported by the full experimental evidence, this work would be significant for extending floorplan localization beyond small-scale controlled settings to practical large-scale public buildings with raster floorplans. The 3D-grounded proxy approach combined with foundation model adaptation offers a plausible path to handling unconstrained inputs, and the planned public release of code and data would support reproducibility and community follow-up.

major comments (2)
  1. [§3.1] §3.1 (3D reconstruction and density projection): The central claim of substantial gains even with single images rests on the assumption that the reconstructed 3D scene yields a sufficiently complete and accurate 2D density projection to serve as a reliable floorplan proxy. Standard SfM/monocular pipelines are known to produce gaps or misalignments in textureless/large interiors; the manuscript should add quantitative proxy-fidelity metrics (e.g., coverage ratio or structural similarity to ground-truth floorplans) specifically for the sparse and single-image regimes to verify this load-bearing step.
  2. [§4] §4 (Experiments, sparse-setting results): The reported improvements in extremely sparse cases are load-bearing for the main contribution. The evaluation should include per-scene error distributions, failure-case analysis, or reconstruction-quality ablations rather than aggregate metrics alone, so readers can assess whether gains persist when the density proxy is incomplete.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'large-scale buildings' would benefit from a brief quantitative characterization (e.g., typical floor area or number of rooms) to help readers gauge the operating regime.
  2. [§3] Notation: The distinction between the input raster floorplan and the projected density map could be made clearer with consistent symbols or a small diagram in the method overview.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate additional quantitative analysis and granular evaluations in sparse regimes.

  1. Referee: [§3.1] §3.1 (3D reconstruction and density projection): The central claim of substantial gains even with single images rests on the assumption that the reconstructed 3D scene yields a sufficiently complete and accurate 2D density projection to serve as a reliable floorplan proxy. Standard SfM/monocular pipelines are known to produce gaps or misalignments in textureless/large interiors; the manuscript should add quantitative proxy-fidelity metrics (e.g., coverage ratio or structural similarity to ground-truth floorplans) specifically for the sparse and single-image regimes to verify this load-bearing step.

    Authors: We agree that explicit quantification of proxy fidelity strengthens the central claims. In the revised manuscript we have added a new analysis subsection reporting coverage ratio (fraction of floorplan area covered by projected 3D points) and SSIM between the density map and ground-truth floorplan, computed separately for single-image, 5-image, and full-set regimes across all test scenes. These metrics confirm that structural similarity remains adequate for alignment even when coverage is low, directly supporting the reported localization gains. revision: yes

  2. Referee: [§4] §4 (Experiments, sparse-setting results): The reported improvements in extremely sparse cases are load-bearing for the main contribution. The evaluation should include per-scene error distributions, failure-case analysis, or reconstruction-quality ablations rather than aggregate metrics alone, so readers can assess whether gains persist when the density proxy is incomplete.

    Authors: We acknowledge that aggregate numbers alone leave open questions about robustness. The revision now includes per-scene localization error box plots (supplementary material), a failure-case analysis subsection in the main text that examines scenes with incomplete reconstructions due to textureless walls, and an ablation that correlates point-cloud density with final alignment error. These additions show that our method continues to outperform baselines even under partial proxy coverage. revision: yes
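
The coverage ratio mentioned in the first response is described as the fraction of floorplan area covered by projected 3D points. One possible reading of that definition is sketched below. The function name `coverage_ratio`, the shared-grid assumption (floorplan mask and density map already aligned and rasterized at the same resolution) and the `threshold` parameter are assumptions made for this example; the source gives no further detail on how the metric is computed.

```python
def coverage_ratio(floor_mask, density, threshold=0.0):
    """Fraction of floorplan cells that contain projected 3D points.

    floor_mask[r][c] is True where the floorplan marks building area.
    density[r][c] is the aligned density map on the same grid (assumption).
    A cell counts as covered when its density exceeds threshold.
    """
    if len(floor_mask) != len(density):
        raise ValueError("mask and density map must have the same number of rows")
    floor_cells = 0
    covered = 0
    for mask_row, dens_row in zip(floor_mask, density):
        if len(mask_row) != len(dens_row):
            raise ValueError("mask and density map rows must have equal length")
        for inside, value in zip(mask_row, dens_row):
            if inside:
                floor_cells += 1
                if value > threshold:
                    covered += 1
    if floor_cells == 0:
        raise ValueError("floorplan mask is empty")
    return covered / floor_cells


# Toy 2x2 grid: three floorplan cells, two of which received projected points.
toy_mask = [[True, True], [True, False]]
toy_density = [[0.5, 0.0], [0.2, 0.9]]
toy_ratio = coverage_ratio(toy_mask, toy_density)
```

Density outside the floorplan mask is ignored here, so the ratio measures only how much of the plan is observed, not how much of the reconstruction falls outside it.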

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external reconstruction and models

The paper's core pipeline reconstructs a gravity-aligned 3D scene from unconstrained images (including single-image cases), projects it to a 2D density map as floorplan proxy, and aligns via 2D similarity transform after fine-tuning a foundation model for cross-modal matching. No quoted equations, definitions, or steps in the abstract or described method reduce a claimed prediction or result to a fitted parameter or self-referential input by construction. The approach invokes standard external 3D reconstruction and foundation models rather than deriving the target alignment from quantities defined using the floorplan itself. Experiments claim improvements on benchmarks, but the derivation chain remains independent of the final localization output.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that standard 3D reconstruction pipelines produce usable gravity-aligned geometry and that a fine-tuned foundation model can reliably bridge density maps to architectural drawings; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Unconstrained image collections yield sufficiently accurate gravity-aligned 3D reconstructions for large-scale indoor scenes.
    Invoked when the method projects the reconstructed scene into a 2D density map to serve as a floorplan proxy.
  • domain assumption: A 2D foundation model can be fine-tuned to produce semantically aligned matches between density maps and rasterized floorplans while preserving structural consistency.
    Central to bridging the appearance gap in the alignment step.

pith-pipeline@v0.9.0 · 5752 in / 1387 out tokens · 25555 ms · 2026-05-22T06:52:19.806981+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 3 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report.ar...

  2. [2]

    SURF: Speeded Up Robust Features

    Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision (ECCV), 2006

  3. [3]

    Robust LiDAR-based localization in architectural floor plans

    Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. Robust LiDAR-based localization in architectural floor plans. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3318–3324. IEEE, 2017

  4. [4]

    A pose graph-based localization system for long-term navigation in CAD floor plans

    Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. A pose graph-based localization system for long-term navigation in CAD floor plans. Robotics and Autonomous Systems, pages 84–97, 2019

  5. [5]

    Robot Localization in Floor Plans Using a Room Layout Edge Extraction Network

    Federico Boniardi, Abhinav Valada, Rohit Mohan, Tim Caselitz, and Wolfram Burgard. Robot Localization in Floor Plans Using a Room Layout Edge Extraction Network. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5291–5297. IEEE, 2019

  6. [6]

    F3Loc: Fusion and Filtering for Floorplan Localization

    Changan Chen, Rui Wang, Christoph Vogel, and Marc Pollefeys. F3Loc: Fusion and Filtering for Floorplan Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18029–18038, 2024

  7. [7]

    Floor-SP: Inverse CAD for Floorplans by Sequential Room-wise Shortest Path

    Jiacheng Chen, Chen Liu, Jiaye Wu, and Yasutaka Furukawa. Floor-SP: Inverse CAD for Floorplans by Sequential Room-wise Shortest Path. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  8. [8]

    You Are Here: Mimicking the Human Thinking Process in Reading Floor-Plans

    Hang Chu, Dong Ki Kim, and Tsuhan Chen. You Are Here: Mimicking the Human Thinking Process in Reading Floor-Plans. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2210–2218, 2015

  9. [9]

    Indoor-Outdoor 3D Reconstruction Alignment

    Andrea Cohen, Johannes L Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-Outdoor 3D Reconstruction Alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300. Springer, 2016

  10. [10]

    Scene Grounding In the Wild

    Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, and Hadar Averbuch-Elor. Scene Grounding In the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  11. [11]

    SuperPoint: Self-Supervised Interest Point Detection and Description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 224–236, 2018

  12. [12]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981

  13. [13]

    Supercharging Floorplan Localization with Semantic Rays

    Yuval Grader and Hadar Averbuch-Elor. Supercharging Floorplan Localization with Semantic Rays. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27116–27125, 2025

  14. [14]

    LaLaLoc++: Global Floor Plan Comprehension for Layout Localisation in Unvisited Environments

    Henry Howard-Jenkins and Victor Adrian Prisacariu. LaLaLoc++: Global Floor Plan Comprehension for Layout Localisation in Unvisited Environments. In Proceedings of the European Conference on Computer Vision (ECCV), pages 693–709, 2022

  15. [15]

    LaLaLoc: Latent Layout Localisation in Dynamic, Unvisited Environments

    Henry Howard-Jenkins, Jose-Raul Ruiz-Sarmiento, and Victor Adrian Prisacariu. LaLaLoc: Latent Layout Localisation in Dynamic, Unvisited Environments. arXiv preprint arXiv:2104.09169, 2021

  16. [16]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022

  17. [17]

    C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

    Kuan Wei Huang, Brandon Li, Bharath Hariharan, and Noah Snavely. C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  18. [18]

    W-RGB-D: Floor-plan-based indoor global localization using a depth camera and WiFi

    Seigo Ito, Felix Endres, Markus Kuderer, Gian Diego Tipaldi, Cyrill Stachniss, and Wolfram Burgard. W-RGB-D: Floor-plan-based indoor global localization using a depth camera and WiFi. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 417–422. IEEE, 2014

  19. [19]

    Fully Geometric Panoramic Localization

    Junho Kim, Jiwon Jeong, and Young Min Kim. Fully Geometric Panoramic Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  20. [20]

    Long-tail Internet photo reconstruction

    Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, and Ruojin Cai. Long-tail Internet photo reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  21. [21]

    Online Localization with Imprecise Floor Space Maps using Stochastic Gradient Descent

    Zhikai Li, Marcelo H Ang, and Daniela Rus. Online Localization with Imprecise Floor Space Maps using Stochastic Gradient Descent. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8571–8578. IEEE, 2020

  22. [22]

    LightGlue: Local Feature Matching at Light Speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  23. [23]

    FloorNet: A Unified Framework for Floorplan Reconstruction from 3D Scans

    Chen Liu, Jiaye Wu, and Yasutaka Furukawa. FloorNet: A Unified Framework for Floorplan Reconstruction from 3D Scans. In Proceedings of the European Conference on Computer Vision (ECCV), 2018

  24. [24]

    WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting

    Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting, 2025

  25. [25]

    PolyRoom: Room-aware Transformer for Floorplan Reconstruction

    Yuzhou Liu, Lingjie Zhu, Xiaodong Ma, Hanqiao Ye, Xiang Gao, Xianwei Zheng, and Shuhan Shen. PolyRoom: Room-aware Transformer for Floorplan Reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

  26. [26]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR), 2019

  27. [27]

    Distinctive Image Features from Scale-Invariant Keypoints

    David G Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision (IJCV), 2004

  28. [28]

    The 3D Jigsaw Puzzle: Mapping Large Indoor Spaces

    Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3D Jigsaw Puzzle: Mapping Large Indoor Spaces. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–16. Springer, 2014

  29. [29]

    SeDAR: Reading Floorplans Like a Human—Using Deep Learning to Enable Human-Inspired Localisation

    Oscar Mendez, Simon Hadfield, Nicolas Pugeault, and Richard Bowden. SeDAR: Reading Floorplans Like a Human—Using Deep Learning to Enable Human-Inspired Localisation. International Journal of Computer Vision (IJCV), 128:1286–1310, 2020

  30. [30]

    ProtoSnap: Prototype Alignment for Cuneiform Signs

    Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, and Hadar Averbuch-Elor. ProtoSnap: Prototype Alignment for Cuneiform Signs. In International Conference on Learning Representations (ICLR), volume 2025, pages 88720–88739, 2025

  31. [31]

    LASER: LAtent SpacE Rendering for 2D Visual Localization

    Zhixiang Min, Naji Khosravan, Zachary Bessinger, Manjunath Narayana, Sing Bing Kang, Enrique Dunn, and Ivaylo Boyadzhiev. LASER: LAtent SpacE Rendering for 2D Visual Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11122–11131, 2022

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018

  33. [33]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

  34. [34]

    ORB: An efficient alternative to SIFT or SURF

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011

  35. [35]

    3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry

    Bryan C Russell, Ricardo Martin-Brualla, Daniel J Butler, Steven M Seitz, and Luke Zettlemoyer. 3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry. ACM Transactions on Graphics (TOG), 32(6):1–10, 2013

  36. [36]

    SuperGlue: Learning Feature Matching With Graph Neural Networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching With Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  37. [37]

    Scene Segmentation Using the Wisdom of Crowds

    Ian Simon and Steven M Seitz. Scene Segmentation Using the Wisdom of Crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 541–553. Springer, 2008

  38. [38]

    Scene Summarization for Online Image Collections

    Ian Simon, Noah Snavely, and Steven M Seitz. Scene Summarization for Online Image Collections. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007

  39. [39]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  40. [40]

    RoFormer: Enhanced transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024

  41. [41]

    LoFTR: Detector-Free Local Feature Matching with Transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  42. [42]

    Emergent Correspondence from Image Diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent Correspondence from Image Diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  43. [43]

    GeoCalib: Learning Single-image Calibration with Geometric Optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Learning Single-image Calibration with Geometric Optimization. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

  44. [44]

    VGGT: Visual Geometry Grounded Transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  45. [45]

    Lost Shopping! Monocular Localization in Large Indoor Spaces

    Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Lost Shopping! Monocular Localization in Large Indoor Spaces. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2695–2703, 2015

  46. [46]

    DUSt3R: Geometric 3D Vision Made Easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  47. [47]

    GLFP: Global Localization from a Floor Plan

    Xipeng Wang, Ryan J Marcotte, and Edwin Olson. GLFP: Global Localization from a Floor Plan. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1627–1632. IEEE, 2019

  48. [48]

    π3: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-Equivariant Visual Geometry Learning. In International Conference on Learning Representations (ICLR), 2026

  49. [49]

    Discovering Details and Scene Structure with Hierarchical Iconoid Shift

    Tobias Weyand and Bastian Leibe. Discovering Details and Scene Structure with Hierarchical Iconoid Shift. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3479–3486, 2013

  50. [50]

    Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

    Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 428–437, 2021

  51. [51]

    UnLoc: Leveraging Depth Uncertainties for Floorplan Localization

    Matthias Wüest, Francis Engelmann, Ondrej Miksik, Marc Pollefeys, and Daniel Barath. UnLoc: Leveraging Depth Uncertainties for Floorplan Localization. In International Conference on Learning Representations (ICLR), 2026

  52. [52]

    Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries

    Yuanwen Yue, Theodora Kontogianni, Konrad Schindler, and Francis Engelmann. Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  53. [53]

    Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling

    Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Proceedings of the European Conference on Computer Vision (ECCV), 2020