SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Hadar Averbuch-Elor; Junhyeong Cho; Ruojin Cai

arXiv: 2605.22581 · v1 · pith:5O2NYMOQnew · submitted 2026-05-21 · cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-22 06:52 UTC · model grok-4.3

keywords: floorplan localization · 3D scene reconstruction · cross-modal correspondence · density map projection · similarity transform · foundation model fine-tuning · computer vision

The pith

Reconstructing 3D scenes from images allows floorplan localization in large buildings using density map proxies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to localize images within floorplans for large public buildings by first reconstructing a gravity-aligned 3D scene from an unconstrained collection of images. The 3D scene is projected into a 2D density map that acts as a proxy for the floorplan, which is then aligned with the provided floorplan through a 2D similarity transform. To handle differences in appearance, a 2D foundation model is adapted with a fine-tuning approach that learns cross-modal correspondences while keeping structural consistency. The method shows strong performance improvements, particularly in challenging sparse settings with very few images.

Core claim

The paper claims that floorplan localization in unconstrained environments can be performed by grounding the task in a 3D reconstruction: the scene is reconstructed and projected to a 2D density map proxy, then aligned to the rasterized floorplan using a similarity transform enabled by cross-modal matching from a fine-tuned foundation model. This allows operation without precise vectorized maps or small-scale assumptions.
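The alignment step described above reduces to estimating a 2D similarity transform (scale, rotation, translation) from matched point pairs. A minimal sketch of one standard closed-form least-squares fit follows. The function name `fit_similarity` and the complex-number parameterization are illustrative assumptions, not taken from the paper, and the paper's actual solver is not specified on this page.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Least-squares 2D similarity dst ~ a * src + b, treating points as complex numbers.

    a = scale * exp(i * angle) carries scale and rotation; b carries translation.
    Hypothetical helper for illustration only.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two paired points")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Spread of the source points around their centroid.
    denom = sum(abs(pi - p_mean) ** 2 for pi in p)
    if denom == 0:
        raise ValueError("source points are all identical")
    # Closed-form minimizer of sum |q_i - a * p_i - b|^2.
    a = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q)) / denom
    b = q_mean - a * p_mean
    return a, b


if __name__ == "__main__":
    # Density-map points and their (assumed) floorplan matches.
    a, b = fit_similarity([(0, 0), (1, 0), (0, 1)], [(3, 4), (3, 6), (1, 4)])
    print("scale:", abs(a), "translation:", (b.real, b.imag))
```

Two correspondences are the minimum for this model, which is part of why a similarity transform is a comparatively well-constrained target when matches are scarce.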

What carries the argument

The key mechanism is the projection of a gravity-aligned 3D scene reconstruction into a 2D density map that serves as a floorplan proxy, combined with adaptation of a 2D foundation model for cross-modal alignment.
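The projection described here can be sketched as a top-down histogram of the reconstructed points. This assumes the scene is already gravity-aligned with the z axis pointing up. The name `density_map`, the `cell_size` parameter, and the raw-count representation are illustrative assumptions, since the page does not give the paper's exact rasterization.

```python
from typing import List, Sequence, Tuple


def density_map(
    points: Sequence[Tuple[float, float, float]],
    cell_size: float,
) -> Tuple[List[List[int]], Tuple[float, float]]:
    """Drop the vertical axis and count points per ground-plane cell.

    Returns (grid, origin): grid[row][col] is a point count, with rows indexing y
    and columns indexing x; origin is the (x_min, y_min) corner of the grid.
    Hypothetical helper; assumes z is the gravity (up) axis.
    """
    if not points:
        raise ValueError("empty point set")
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    width = int((max(xs) - x_min) / cell_size) + 1
    height = int((max(ys) - y_min) / cell_size) + 1
    grid = [[0] * width for _ in range(height)]
    for x, y, _z in points:
        col = int((x - x_min) / cell_size)
        row = int((y - y_min) / cell_size)
        grid[row][col] += 1
    return grid, (x_min, y_min)


if __name__ == "__main__":
    pts = [(0, 0, 0), (0, 0, 2.5), (0.4, 0.1, 1), (1.2, 0, 0), (0, 2.9, 1)]
    grid, origin = density_map(pts, cell_size=1.0)
    for row in grid:
        print(row)
```

Points stacked along a vertical surface land in the same cell, so wall-like structures accumulate high counts, which is what makes the map resemble a floorplan outline.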

If this is right

Substantial improvements in localization accuracy over previous approaches in large-scale settings.
Effective performance even when only a single input image is available.
Ability to use rasterized floorplans instead of requiring vectorized ones.
The approach scales to real-world public buildings with unconstrained image collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be integrated into mobile navigation systems for museums or airports.
Future work could test the method on dynamic environments where the floorplan changes over time.
The density map proxy might be useful for other tasks like 3D to 2D matching in robotics.

Load-bearing premise

The 3D reconstruction from the image collection must yield a gravity-aligned scene with a 2D density projection accurate and complete enough to proxy the floorplan reliably.
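Gravity alignment, on which this premise depends, amounts to rotating the reconstruction so that an estimated gravity direction coincides with a fixed axis. The sketch below uses Rodrigues' formula and assumes a gravity estimate is already available; how the paper obtains that estimate is not stated on this page. The names `rotation_aligning` and `apply` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _unit(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return [c / norm for c in v]


def _cross(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]


def rotation_aligning(src: Sequence[float], dst: Sequence[float]) -> Matrix:
    """Rotation R such that R applied to unit(src) equals unit(dst)."""
    a, b = _unit(src), _unit(dst)
    cos_angle = sum(x * y for x, y in zip(a, b))
    if cos_angle < -1.0 + 1e-12:
        # Antiparallel vectors: rotate 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = _unit(_cross(a, helper))
        return [[2.0 * u[i] * u[j] - (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
    v = _cross(a, b)
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    # Rodrigues: R = I + K + K^2 / (1 + cos).
    return [
        [(1.0 if i == j else 0.0) + k[i][j] + k2[i][j] / (1.0 + cos_angle) for j in range(3)]
        for i in range(3)
    ]


def apply(rotation: Matrix, point: Sequence[float]) -> List[float]:
    return [sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # Assumed input: gravity points along -y in the reconstruction frame; target is -z.
    R = rotation_aligning([0.0, -1.0, 0.0], [0.0, 0.0, -1.0])
    print(apply(R, [0.0, -1.0, 0.0]))
```

Any error in the gravity estimate tilts the scene before projection and smears walls across neighbouring cells, which is one concrete way this premise could fail.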

What would settle it

A test case in a large building where the image collection is too sparse to reconstruct a complete density map, leading to poor alignment accuracy with the floorplan.

Figures

Figures reproduced from arXiv: 2605.22581 by Hadar Averbuch-Elor, Junhyeong Cho, Ruojin Cai.

**Figure 2.** SceneAligner. Given in-the-wild images and a floorplan, it reconstructs a gravity-aligned 3D scene, extracts a 2D density map via projection, and solves for a 2D similarity transform M via correspondence estimation between the density map and floorplan using a shared encoder E. Reliable correspondences used to compute M are overlaid (in orange) on the aligned density map. In contrast, our method bridges th…

**Figure 3.** Adapting a 2D foundation model for floorplan alignment. We provide PCA visualizations of features before and after our fine-tuning scheme. As illustrated above, the pretrained DINOv3 [39] struggles to bridge the appearance gap, e.g., corresponding regions (white circles) map to different RGB colors. By contrast, our fine-tuning significantly refines the semantic cross-modal alignment. For reference, we sho…

**Figure 4.** Qualitative comparison in the wild. We compare correspondence predictions across baselines and our method, with floorplan localization results shown on the right. Cameras are illustrated in corresponding colors (e.g., for GT, and for Ours). 4.2.1 Comparison with Correspondence-based Methods Baselines. We compare against correspondence-based methods [17, 46, 41] on the clean subset of C3. C3Po [17] builds o…

**Figure 5.** Performance across varying view counts on C3 [17]. We evaluate the proposed method using different numbers of input images for 3D reconstruction (e.g., ≤ 150 denotes a maximum of 150 images per reconstruction). Notably, Ours (= 1) already outperforms C3Po [17] by a large margin, while Ours (≤ 30) is on par with Ours (≤ 150). Panels: ≤ 150 images, ≤ 10 images, 3 images, 1 image.

**Figure 6.** In-the-wild floorplan alignment across varying view counts. We visualize 3D points used for density map extraction and resulting density maps with reliable correspondences overlaid (in orange). Sparse views often suffice to recover geometry informative enough for floorplan alignment. density map, enabling accurate floorplan alignment. However, the single-view setting sometimes suffers from localization amb…

**Figure 7.** Alignment of interior and exterior 3D scenes. Using the floorplan as a shared geometric bridge, our approach enables independent alignment of interior and exterior reconstructions into a unified global coordinate system. This is achieved despite minimal visual overlap and large viewpoint differences between indoor and outdoor scenes. require preprocessing of floorplans, e.g., converting floorplans into ray…

Original abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
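Once the proxy is aligned, placing a camera on the floorplan (the "you are here" marker from the abstract) is a change of coordinates: camera position to density-map cell, then through the similarity transform. The sketch below assumes a 3x3 homogeneous matrix for M and a density map defined by an origin and cell size. `similarity_matrix`, `camera_to_floorplan`, and their parameters are illustrative names, not the paper's API.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def similarity_matrix(scale: float, theta: float, tx: float, ty: float) -> Matrix:
    """Homogeneous 3x3 matrix for a 2D similarity (rotation by theta, uniform scale, shift)."""
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def camera_to_floorplan(
    camera_center: Sequence[float],
    origin: Tuple[float, float],
    cell_size: float,
    M: Matrix,
) -> Tuple[float, float]:
    """Map a gravity-aligned 3D camera center to floorplan pixel coordinates.

    Assumes z is up, so only x and y matter; origin and cell_size describe the
    density-map grid the transform M was estimated on (hypothetical conventions).
    """
    u = (camera_center[0] - origin[0]) / cell_size
    v = (camera_center[1] - origin[1]) / cell_size
    x = M[0][0] * u + M[0][1] * v + M[0][2]
    y = M[1][0] * u + M[1][1] * v + M[1][2]
    return x, y


if __name__ == "__main__":
    M = similarity_matrix(scale=2.0, theta=math.pi / 2, tx=10.0, ty=20.0)
    print(camera_to_floorplan((3.0, 1.0, 1.6), origin=(1.0, 1.0), cell_size=0.5, M=M))
```

Because every camera in the reconstruction shares the same M, one alignment localizes the whole image collection at once.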

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper grounds floorplan localization in 3D scene reconstructions to create density-map proxies for raster plans in large uncontrolled spaces, but the single-image and sparse-setting claims rest on reconstruction quality that standard pipelines often cannot deliver.

The main move is to reconstruct a gravity-aligned 3D scene from unconstrained images, project it to a 2D density map, and treat alignment to the input raster floorplan as a 2D similarity transform after fine-tuning a foundation model for cross-modal matching. This framing lets them drop the usual requirements for vectorized plans and small controlled environments.
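Cross-modal matches are rarely all correct, so the similarity transform is usually fit robustly. The sketch below uses RANSAC, which appears in the paper's reference list [12], but this page does not say how the paper selects its "reliable correspondences", so treating RANSAC as the selection rule is an assumption. The function names, threshold, and iteration count are likewise illustrative.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _fit(p: Sequence[complex], q: Sequence[complex]) -> Tuple[complex, complex]:
    """Least-squares similarity q ~ a * p + b over complex-valued points."""
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    denom = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q)) / denom
    return a, q_mean - a * p_mean


def ransac_similarity(
    src: Sequence[Point],
    dst: Sequence[Point],
    threshold: float = 1.0,
    iterations: int = 200,
    seed: int = 0,
) -> Tuple[complex, complex, List[int]]:
    """Return (a, b, inlier_indices) for the similarity with the most inliers."""
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two paired points")
    rng = random.Random(seed)
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    best: List[int] = []
    for _ in range(iterations):
        i, j = rng.sample(range(len(p)), 2)  # two matches determine a similarity
        if p[i] == p[j]:
            continue  # degenerate sample
        a, b = _fit([p[i], p[j]], [q[i], q[j]])
        inliers = [k for k in range(len(p)) if abs(a * p[k] + b - q[k]) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("no consistent model found")
    # Refit on all inliers of the best hypothesis.
    a, b = _fit([p[k] for k in best], [q[k] for k in best])
    return a, b, best


if __name__ == "__main__":
    src = [(0, 0), (1, 0), (0, 1), (2, 2), (3, 1), (1, 3), (4, 0), (0, 4), (5, 5), (2, 7), (9, 1)]
    true = lambda z: 2j * z + (3 + 4j)
    dst = [(true(complex(*s)).real, true(complex(*s)).imag) for s in src[:8]]
    dst += [(40, -3), (0, 0), (25, 25)]  # three deliberately wrong matches
    a, b, inliers = ransac_similarity(src, dst, threshold=0.5)
    print(abs(a), inliers)
```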

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SceneAligner for floorplan localization in large-scale buildings using rasterized floorplans. Given an unconstrained image collection, the method reconstructs a gravity-aligned 3D scene from the images, projects it to a 2D density map serving as a floorplan proxy, and aligns the proxy to the input floorplan via a 2D similarity transform. A 2D foundation model is fine-tuned to bridge the appearance gap between density maps and architectural drawings, using a scheme that encourages semantically aligned matches while preserving structural consistency. The paper reports extensive experiments demonstrating substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image.

Significance. If the central claims are supported by the full experimental evidence, this work would be significant for extending floorplan localization beyond small-scale controlled settings to practical large-scale public buildings with raster floorplans. The 3D-grounded proxy approach combined with foundation model adaptation offers a plausible path to handling unconstrained inputs, and the planned public release of code and data would support reproducibility and community follow-up.

major comments (2)

§3.1 (3D reconstruction and density projection): The central claim of substantial gains even with single images rests on the assumption that the reconstructed 3D scene yields a sufficiently complete and accurate 2D density projection to serve as a reliable floorplan proxy. Standard SfM/monocular pipelines are known to produce gaps or misalignments in textureless/large interiors; the manuscript should add quantitative proxy-fidelity metrics (e.g., coverage ratio or structural similarity to ground-truth floorplans) specifically for the sparse and single-image regimes to verify this load-bearing step.
§4 (Experiments, sparse-setting results): The reported improvements in extremely sparse cases are load-bearing for the main contribution. The evaluation should include per-scene error distributions, failure-case analysis, or reconstruction-quality ablations rather than aggregate metrics alone, so readers can assess whether gains persist when the density proxy is incomplete.

minor comments (2)

Abstract: The phrase 'large-scale buildings' would benefit from a brief quantitative characterization (e.g., typical floor area or number of rooms) to help readers gauge the operating regime.
§3 (Notation): The distinction between the input raster floorplan and the projected density map could be made clearer with consistent symbols or a small diagram in the method overview.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate additional quantitative analysis and granular evaluations in sparse regimes.

Referee: §3.1 (3D reconstruction and density projection): The central claim of substantial gains even with single images rests on the assumption that the reconstructed 3D scene yields a sufficiently complete and accurate 2D density projection to serve as a reliable floorplan proxy. Standard SfM/monocular pipelines are known to produce gaps or misalignments in textureless/large interiors; the manuscript should add quantitative proxy-fidelity metrics (e.g., coverage ratio or structural similarity to ground-truth floorplans) specifically for the sparse and single-image regimes to verify this load-bearing step.

Authors: We agree that explicit quantification of proxy fidelity strengthens the central claims. In the revised manuscript we have added a new analysis subsection reporting coverage ratio (fraction of floorplan area covered by projected 3D points) and SSIM between the density map and ground-truth floorplan, computed separately for single-image, 5-image, and full-set regimes across all test scenes. These metrics confirm that structural similarity remains adequate for alignment even when coverage is low, directly supporting the reported localization gains.
Revision: yes

Referee: §4 (Experiments, sparse-setting results): The reported improvements in extremely sparse cases are load-bearing for the main contribution. The evaluation should include per-scene error distributions, failure-case analysis, or reconstruction-quality ablations rather than aggregate metrics alone, so readers can assess whether gains persist when the density proxy is incomplete.

Authors: We acknowledge that aggregate numbers alone leave open questions about robustness. The revision now includes per-scene localization error box plots (supplementary material), a failure-case analysis subsection in the main text that examines scenes with incomplete reconstructions due to textureless walls, and an ablation that correlates point-cloud density with final alignment error. These additions show that our method continues to outperform baselines even under partial proxy coverage.
Revision: yes
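The coverage ratio named in the first response ("fraction of floorplan area covered by projected 3D points") can be sketched on a pair of grids. This assumes the density map has already been warped onto the floorplan's pixel grid and that a binary mask of the floorplan's interior is available. The name `coverage_ratio` and the `min_count` threshold are hypothetical, and the simulated rebuttal gives no more precise definition.

```python
from typing import Sequence


def coverage_ratio(
    floorplan_mask: Sequence[Sequence[int]],
    aligned_density: Sequence[Sequence[int]],
    min_count: int = 1,
) -> float:
    """Fraction of floorplan cells (mask != 0) holding at least min_count projected points."""
    if len(floorplan_mask) != len(aligned_density) or any(
        len(m) != len(d) for m, d in zip(floorplan_mask, aligned_density)
    ):
        raise ValueError("mask and density map must have the same shape")
    total = 0
    covered = 0
    for mask_row, density_row in zip(floorplan_mask, aligned_density):
        for inside, count in zip(mask_row, density_row):
            if inside:
                total += 1
                if count >= min_count:
                    covered += 1
    if total == 0:
        raise ValueError("floorplan mask is empty")
    return covered / total


if __name__ == "__main__":
    mask = [[1, 1, 0], [1, 1, 0]]
    density = [[3, 0, 5], [0, 1, 0]]  # the 5 falls outside the floorplan and is ignored
    print(coverage_ratio(mask, density))
```

Reporting this value per view-count regime would show how much of the plan a single image actually covers, which is the quantity the referee asks about.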

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external reconstruction and models

The paper's core pipeline reconstructs a gravity-aligned 3D scene from unconstrained images (including single-image cases), projects it to a 2D density map as floorplan proxy, and aligns via 2D similarity transform after fine-tuning a foundation model for cross-modal matching. No quoted equations, definitions, or steps in the abstract or described method reduce a claimed prediction or result to a fitted parameter or self-referential input by construction. The approach invokes standard external 3D reconstruction and foundation models rather than deriving the target alignment from quantities defined using the floorplan itself. Experiments claim improvements on benchmarks, but the derivation chain remains independent of the final localization output.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that standard 3D reconstruction pipelines produce usable gravity-aligned geometry and that a fine-tuned foundation model can reliably bridge density maps to architectural drawings; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Unconstrained image collections yield sufficiently accurate gravity-aligned 3D reconstructions for large-scale indoor scenes.
Invoked when the method projects the reconstructed scene into a 2D density map to serve as floorplan proxy.
domain assumption: A 2D foundation model can be fine-tuned to produce semantically aligned matches between density maps and rasterized floorplans while preserving structural consistency.
Central to bridging the appearance gap in the alignment step.

pith-pipeline@v0.9.0 · 5752 in / 1387 out tokens · 25555 ms · 2026-05-22T06:52:19.806981+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy... adapt a 2D foundation model... fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

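$\mathcal{L} = \lambda_{\text{feat}}\mathcal{L}_{\text{feat}} + \lambda_{\text{regr}}\mathcal{L}_{\text{regr}} + \lambda_{\text{topo}}\mathcal{L}_{\text{topo}} + \lambda_{\text{geo}}\mathcal{L}_{\text{geo}}$... symmetric InfoNCE loss... topology preservation loss $\mathcal{L}_{\text{topo}}$ and a geometry consistency loss $\mathcal{L}_{\text{geo}}$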
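The quoted passage mentions a symmetric InfoNCE loss. As a minimal sketch of what that term generally computes: given an N x N similarity matrix whose diagonal holds the matched density-map/floorplan pairs, it averages a cross-entropy over rows and over columns. The `temperature` value and the function names are assumptions, and how the paper builds its similarity matrix or weights this term is not given on this page.

```python
import math
from typing import List, Sequence


def _diag_cross_entropy(logits: Sequence[Sequence[float]]) -> float:
    """Mean of -log softmax(row_i)[i]: each row's positive sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(similarity: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Average of the row-wise and column-wise InfoNCE terms."""
    n = len(similarity)
    if n == 0 or any(len(row) != n for row in similarity):
        raise ValueError("similarity must be a non-empty square matrix")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits: List[List[float]] = [[v / temperature for v in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))


if __name__ == "__main__":
    # Rows: density-map features; columns: floorplan features (toy cosine similarities).
    sim = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.7]]
    print(symmetric_info_nce(sim))
```

The symmetric form penalizes a density-map feature that matches the wrong floorplan feature and vice versa, which suits a setting where neither modality is the privileged query.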

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 3 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

SURF: Speeded Up Robust Features

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision (ECCV), 2006

work page 2006
[3]

Robust LiDAR- based localization in architectural floor plans

Federico Boniardi, Tim Caselitz, Rainer Kummerle, and Wolfram Burgard. Robust LiDAR- based localization in architectural floor plans. In2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3318–3324. IEEE, 2017

work page 2017
[4]

A pose graph-based localization system for long-term navigation in CAD floor plans.Robotics and Autonomous Systems, pages 84–97, 2019

Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. A pose graph-based localization system for long-term navigation in CAD floor plans.Robotics and Autonomous Systems, pages 84–97, 2019

work page 2019
[5]

Robot Localization in Floor Plans Using a Room Layout Edge Extraction Network

Federico Boniardi, Abhinav Valada, Rohit Mohan, Tim Caselitz, and Wolfram Burgard. Robot Localization in Floor Plans Using a Room Layout Edge Extraction Network. In2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5291–5297. IEEE, 2019

work page 2019
[6]

F3Loc: Fusion and Filtering for Floorplan Localization

Changan Chen, Rui Wang, Christoph V ogel, and Marc Pollefeys. F3Loc: Fusion and Filtering for Floorplan Localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18029–18038, 2024

work page 2024
[7]

Floor-SP: Inverse CAD for Floor- plans by Sequential Room-wise Shortest Path

Jiacheng Chen, Chen Liu, Jiaye Wu, and Yasutaka Furukawa. Floor-SP: Inverse CAD for Floor- plans by Sequential Room-wise Shortest Path. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

work page 2019
[8]

You Are Here: Mimicking the Human Thinking Process in Reading Floor-Plans

Hang Chu, Dong Ki Kim, and Tsuhan Chen. You Are Here: Mimicking the Human Thinking Process in Reading Floor-Plans. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2210–2218, 2015

work page 2015
[9]

Indoor-Outdoor 3D Reconstruction Alignment

Andrea Cohen, Johannes L Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-Outdoor 3D Reconstruction Alignment. InProceedings of the European Conference on Computer Vision (ECCV), pages 285–300. Springer, 2016

work page 2016
[10]

Scene Grounding In the Wild

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, and Hadar Averbuch-Elor. Scene Grounding In the Wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

work page 2026
[11]

SuperPoint: Self-Supervised Interest Point Detection and Description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 224–236, 2018

work page 2018
[12]

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981

Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981

work page 1981
[13]

Supercharging Floorplan Localization with Semantic Rays

Yuval Grader and Hadar Averbuch-Elor. Supercharging Floorplan Localization with Semantic Rays. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27116–27125, 2025

work page 2025
[14]

LaLaLoc++: Global Floor Plan Compre- hension for Layout Localisation in Unvisited Environments

Henry Howard-Jenkins and Victor Adrian Prisacariu. LaLaLoc++: Global Floor Plan Compre- hension for Layout Localisation in Unvisited Environments. InProceedings of the European Conference on Computer Vision (ECCV), pages 693–709, 2022

work page 2022
[15]

LaLaLoc: La- tent Layout Localisation in Dynamic, Unvisited Environments.arXiv preprint arXiv:2104.09169, 2021

Henry Howard-Jenkins, Jose-Raul Ruiz-Sarmiento, and Victor Adrian Prisacariu. LaLaLoc: La- tent Layout Localisation in Dynamic, Unvisited Environments.arXiv preprint arXiv:2104.09169, 2021. 10

work page arXiv 2021
[16]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022

work page 2022
[17]

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Kuan Wei Huang, Brandon Li, Bharath Hariharan, and Noah Snavely. C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[18]

W-RGB-D: Floor-plan-based indoor global localization using a depth camera and WiFi

Seigo Ito, Felix Endres, Markus Kuderer, Gian Diego Tipaldi, Cyrill Stachniss, and Wolfram Burgard. W-RGB-D: Floor-plan-based indoor global localization using a depth camera and WiFi. In2014 IEEE International Conference on Robotics and Automation (ICRA), pages 417–422. IEEE, 2014

work page 2014
[19]

Fully Geometric Panoramic Localization

Junho Kim, Jiwon Jeong, and Young Min Kim. Fully Geometric Panoramic Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[20]

Long-tail Internet photo reconstruction

Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, and Ruojin Cai. Long-tail Internet photo reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

work page 2026
[21]

Online Localization with Imprecise Floor Space Maps using Stochastic Gradient Descent

Zhikai Li, Marcelo H Ang, and Daniela Rus. Online Localization with Imprecise Floor Space Maps using Stochastic Gradient Descent. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8571–8578. IEEE, 2020

work page 2020
[22]

LightGlue: Local Feature Matching at Light Speed

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[23]

FloorNet: A Unified Framework for Floorplan Reconstruction from 3D Scans

Chen Liu, Jiaye Wu, and Yasutaka Furukawa. FloorNet: A Unified Framework for Floorplan Reconstruction from 3D Scans. InProceedings of the European Conference on Computer Vision (ECCV), 2018

work page 2018
[24]

WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting, 2025

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting, 2025

work page 2025
[25]

PolyRoom: Room-aware Transformer for Floorplan Reconstruction

Yuzhou Liu, Lingjie Zhu, Xiaodong Ma, Hanqiao Ye, Xiang Gao, Xianwei Zheng, and Shuhan Shen. PolyRoom: Room-aware Transformer for Floorplan Reconstruction. InProceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024
[26]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InInternational Conference on Learning Representations (ICLR), 2019

work page 2019
[27]

Distinctive Image Features from Scale-Invariant Keypoints.International Journal of Computer Vision (IJCV), 2004

David G Lowe. Distinctive Image Features from Scale-Invariant Keypoints.International Journal of Computer Vision (IJCV), 2004

work page 2004
[28]

The 3D Jigsaw Puzzle: Mapping Large Indoor Spaces

Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3D Jigsaw Puzzle: Mapping Large Indoor Spaces. InProceedings of the European Conference on Computer Vision (ECCV), pages 1–16. Springer, 2014

work page 2014
[29]

SeDAR: Reading Floorplans Like a Human—Using Deep Learning to Enable Human-Inspired Localisation

Oscar Mendez, Simon Hadfield, Nicolas Pugeault, and Richard Bowden. SeDAR: Reading Floorplans Like a Human—Using Deep Learning to Enable Human-Inspired Localisation. International Journal of Computer Vision (IJCV), 128:1286–1310, 2020

work page 2020
[30]

ProtoSnap: Prototype Alignment for Cuneiform Signs

Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, and Hadar Averbuch-Elor. ProtoSnap: Prototype Alignment for Cuneiform Signs. InInternational Conference on Learning Representations (ICLR), volume 2025, pages 88720–88739, 2025

work page 2025
[31]

LASER: LAtent SpacE Rendering for 2D Visual Localization

Zhixiang Min, Naji Khosravan, Zachary Bessinger, Manjunath Narayana, Sing Bing Kang, Enrique Dunn, and Ivaylo Boyadzhiev. LASER: LAtent SpacE Rendering for 2D Visual Localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11122–11131, 2022. 11

work page 2022
[32]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

work page 2021
[34]

ORB: An efficient alternative to SIFT or SURF

Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2011

work page 2011
[35]

3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry

Bryan C Russell, Ricardo Martin-Brualla, Daniel J Butler, Steven M Seitz, and Luke Zettlemoyer. 3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry. ACM Transactions on Graphics (TOG), 32(6):1–10, 2013

work page 2013
[36]

SuperGlue: Learning Feature Matching With Graph Neural Networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching With Graph Neural Networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[37]

Scene Segmentation Using the Wisdom of Crowds

Ian Simon and Steven M Seitz. Scene Segmentation Using the Wisdom of Crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 541–553. Springer, 2008

work page 2008
[38]

Scene Summarization for Online Image Collections

Ian Simon, Noah Snavely, and Steven M Seitz. Scene Summarization for Online Image Collections. In2007 IEEE 11th International conference on computer vision, pages 1–8. IEEE, 2007

work page 2007
[39]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

RoFormer: Enhanced transformer with Rotary Position Embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding.Neurocomputing, 568:127063, 2024

work page 2024
[41]

LoFTR: Detector-Free Local Feature Matching with Transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-Free Local Feature Matching with Transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021
[42]

Emer- gent Correspondence from Image Diffusion

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emer- gent Correspondence from Image Diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[43]

GeoCalib: Learning Single-image Calibration with Geometric Optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Learning Single-image Calibration with Geometric Optimization. InProceedings of the Euro- pean Conference on Computer Vision (ECCV), 2024

work page 2024
[44]

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[45]

Lost Shopping! Monocular Localization in Large Indoor Spaces

Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Lost Shopping! Monocular Localization in Large Indoor Spaces. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2695–2703, 2015

work page 2015
[46]

DUSt3R: Geometric 3D Vision Made Easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 12

work page 2024
[47]

GLFP: Global Localization from a Floor Plan

Xipeng Wang, Ryan J Marcotte, and Edwin Olson. GLFP: Global Localization from a Floor Plan. In2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1627–1632. IEEE, 2019

work page 2019
[48]

π3: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-Equivariant Visual Geometry Learning. InInternational Conference on Learning Representations (ICLR), 2026

[49]

Discovering Details and Scene Structure with Hierarchical Iconoid Shift

Tobias Weyand and Bastian Leibe. Discovering Details and Scene Structure with Hierarchical Iconoid Shift. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3479–3486, 2013.

[50]

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 428–437, 2021.

[51]

UnLoc: Leveraging Depth Uncertainties for Floorplan Localization

Matthias Wüest, Francis Engelmann, Ondrej Miksik, Marc Pollefeys, and Daniel Barath. UnLoc: Leveraging Depth Uncertainties for Floorplan Localization. In International Conference on Learning Representations (ICLR), 2026.

[52]

Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries

Yuanwen Yue, Theodora Kontogianni, Konrad Schindler, and Francis Engelmann. Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[53]

Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

Appendix

We refer readers to the accompanying viewer.html for 360° view comparisons of floorplan-aligned 3D reconstructions (Sec. A)...


