COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization
Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3
The pith
COMPASS builds multi-channel radial descriptors with a shared structure from floor plans and fisheye images, so that the two can be compared structurally for robot localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COMPASS generates a compact multi-channel radial descriptor from floor plans by casting rays in 360 azimuth bins and encoding five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. On the image side the same descriptor structure is used, but only the hit-type channel is currently populated: a window detector clusters vertical line segments in dual fisheye imagery, verifies brightness, and projects the detections to azimuthal bearings. At a single known pose drawn from the Hilti-Trimble SLAM Challenge 2026 dataset, the extracted wall-window pattern closely matches the floor-plan descriptor, which the paper presents as a proof of concept that cross-modal structural matching is feasible.
What carries the argument
The multi-channel radial descriptor that encodes the surrounding geometric layout in 360 azimuth bins, populated from floor-plan ray casting on one side and from projected window detections on the fisheye-image side.
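The floor-plan side of the descriptor can be illustrated with a minimal sketch. It assumes the floor plan is a list of labelled 2D line segments. The hit-type codes, the `max_range` normalization, the central-difference gradient, and the window size for the local variance are illustrative choices, not values taken from the paper.

```python
import math

WALL, WINDOW, OPENING = 1, 2, 3  # hypothetical hit-type codes; 0 means "no hit"

def ray_segment_range(px, py, dx, dy, p1, p2):
    """Distance along the ray (px, py) + t*(dx, dy) to segment p1-p2, or None."""
    ex, ey = p2[0] - p1[0], p2[1] - p1[1]
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:  # ray is parallel to the segment
        return None
    t = ((p1[0] - px) * ey - (p1[1] - py) * ex) / denom  # distance along ray
    u = ((p1[0] - px) * dy - (p1[1] - py) * dx) / denom  # position on segment
    return t if t > 1e-9 and -1e-9 <= u <= 1.0 + 1e-9 else None

def floorplan_descriptor(pos, segments, n_bins=360, max_range=20.0, win=2):
    """Cast one ray per azimuth bin and fill the five channels."""
    px, py = pos
    rng, hit = [], []
    for b in range(n_bins):
        a = 2.0 * math.pi * b / n_bins
        dx, dy = math.cos(a), math.sin(a)
        best_r, best_type = max_range, 0
        for p1, p2, kind in segments:
            r = ray_segment_range(px, py, dx, dy, p1, p2)
            if r is not None and r < best_r:
                best_r, best_type = r, kind
        rng.append(best_r / max_range)  # normalized range
        hit.append(best_type)           # structural hit type
    # Circular central difference of the normalized range (assumed definition).
    grad = [rng[(b + 1) % n_bins] - rng[b - 1] for b in range(n_bins)]
    # Inverse of the normalized range, clamped to avoid division by zero.
    inv = [1.0 / max(r, 1e-3) for r in rng]
    var = []
    for b in range(n_bins):
        w = [rng[(b + k) % n_bins] for k in range(-win, win + 1)]
        m = sum(w) / len(w)
        var.append(sum((x - m) ** 2 for x in w) / len(w))
    return {"range": rng, "hit_type": hit, "gradient": grad,
            "inverse_range": inv, "variance": var}

# Toy 4 m x 4 m room centred on the origin, with a window in the east wall.
room = [
    ((2, -2), (2, -1), WALL), ((2, -1), (2, 1), WINDOW), ((2, 1), (2, 2), WALL),
    ((2, 2), (-2, 2), WALL), ((-2, 2), (-2, -2), WALL), ((-2, -2), (2, -2), WALL),
]
desc = floorplan_descriptor((0.0, 0.0), room)
```

In this toy room the bins facing east carry the window code and all other bins carry the wall code, which is the kind of wall-window pattern the proof of concept compares against the image side.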
If this is right
- Floor plans become usable semantic priors rather than mere geometric outlines for visual localization.
- Detected windows supply a distinctive hit-type channel that complements pure range information.
- The shared descriptor structure permits direct comparison between map and image without intermediate feature matching.
- A single successful match at a known pose motivates, but does not yet validate, extending the method to a search over candidate poses for full pose estimation.
- Semantic elements such as openings add robustness in repetitive indoor geometries.
Where Pith is reading between the lines
- Full localization could be achieved by sliding the descriptor over the floor plan and finding the minimum-difference position.
- The method may reduce reliance on LiDAR or dense visual features in structured indoor environments.
- Adding more structural classes beyond windows could increase descriptor uniqueness without increasing size.
- Integration with existing SLAM pipelines could use the match for initialization or loop-closure detection.
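The sliding search in the first bullet above can be sketched as follows. This is an extrapolation, not something the paper implements: it assumes hit-type channels have been precomputed at a set of candidate positions, and it uses a plain mismatch count with a brute-force circular shift to account for unknown heading. The names `best_shift` and `localize` are hypothetical.

```python
def best_shift(query, reference):
    """Return (cost, shift) for the circular shift that best aligns the query.

    cost is the number of bins whose hit types disagree after rotating the
    query by `shift` bins.
    """
    n = len(reference)
    costs = [
        sum(query[(i - s) % n] != reference[i] for i in range(n))
        for s in range(n)
    ]
    s = min(range(n), key=costs.__getitem__)
    return costs[s], s

def localize(query, candidates):
    """candidates maps (x, y) -> reference hit-type channel at that position."""
    best = None
    for pos, ref in candidates.items():
        cost, shift = best_shift(query, ref)
        if best is None or cost < best[0]:
            best = (cost, pos, shift)
    return best  # (mismatch count, position, heading offset in bins)

# Tiny 8-bin example: 1 = wall, 2 = window.
ref_a = [1, 1, 2, 2, 1, 1, 1, 1]
ref_b = [1, 1, 1, 1, 1, 1, 1, 1]
query = [ref_a[(i + 3) % 8] for i in range(8)]  # ref_a seen with a 3-bin yaw offset
result = localize(query, {(0, 0): ref_b, (1, 0): ref_a})
```

The brute-force shift costs O(n^2) comparisons per candidate, which is small for 360 bins; a real system would also need a way to handle bins where the image side has no detection.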
Load-bearing premise
Window detection via line-segment clustering and brightness verification on fisheye images will produce a hit-type channel that reliably aligns with the floor-plan descriptor under varying lighting, occlusions, and viewpoints.
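To make the premise concrete, here is a rough sketch of the two detection steps named above. The paper's actual thresholds and clustering rule are not given on this page, so everything here is an assumption: segments are taken as already extracted by some line segment detector, "vertical" means within a fixed tilt of the image's vertical axis (which ignores fisheye curvature), clustering is a simple gap rule on column position, and brightness is the mean grey level between adjacent edge clusters.

```python
import math

def vertical_segments(segments, max_tilt_deg=10.0):
    """Return sorted mid-column positions of near-vertical segments."""
    keep = []
    for x1, y1, x2, y2 in segments:
        tilt = math.degrees(math.atan2(abs(x2 - x1), abs(y2 - y1)))
        if tilt <= max_tilt_deg:
            keep.append((x1 + x2) / 2.0)
    return sorted(keep)

def cluster_columns(xs, gap=5.0):
    """Group sorted columns whose neighbours are within `gap` pixels."""
    clusters = []
    for x in xs:
        if clusters and x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]

def detect_windows(segments, gray, min_brightness=180.0):
    """Return (left, right) column spans between edge clusters that are bright."""
    edges = cluster_columns(vertical_segments(segments))
    windows = []
    for left, right in zip(edges, edges[1:]):
        lo, hi = math.floor(left) + 1, math.ceil(right)
        pixels = [row[c] for row in gray
                  for c in range(max(lo, 0), min(hi, len(row)))]
        if pixels and sum(pixels) / len(pixels) >= min_brightness:
            windows.append((left, right))
    return windows

# Synthetic 10 x 40 image: dark wall (50) with a bright pane in columns 11..19.
gray = [[230 if 11 <= c <= 19 else 50 for c in range(40)] for _ in range(10)]
segments = [(10, 0, 10, 9), (11, 1, 11, 8), (20, 0, 20, 9),
            (30, 0, 30, 9), (0, 5, 39, 5)]  # last one is horizontal
windows = detect_windows(segments, gray)
```

The brightness check is exactly where lighting variation would bite: a window at night or a bright wall panel would flip the decision, which is why this premise is load-bearing.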
What would settle it
Generating both descriptors at the known pose on additional frames or a second dataset and observing that the hit-type channel from the images no longer closely matches the floor-plan wall-window pattern.
Original abstract
Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays are cast in 360 azimuth bins and the results are encoded into five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. From the image side, the same descriptor structure is populated by detecting structural elements in the fisheye imagery. As a first step toward full cross-modal matching, we present a window detection algorithm for fisheye images that uses a line segment detector to identify window frames via vertical edge clustering and brightness verification. Detected windows are projected to azimuthal bearings through the fisheye camera model, producing the hit-type channel of the visual descriptor. As a proof of concept, we generate both descriptors at a single known pose from the Hilti-Trimble SLAM Challenge 2026 dataset and demonstrate that the wall-window pattern extracted from the first frame of each camera closely matches the floor plan descriptor, validating the feasibility of cross-modal structural matching.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COMPASS, a multi-channel radial descriptor inspired by scan context for floor-plan-based visual localization with dual fisheye cameras. Rays are cast from the floor plan across 360 azimuth bins to encode five channels (normalized range, structural hit type including walls/windows/openings, range gradient, inverse range, and local range variance). From imagery, a window detection pipeline (line-segment detection, vertical clustering, brightness check) projects detections to azimuthal bearings to populate the hit-type channel. As proof of concept, both descriptors are generated at one known pose from the Hilti-Trimble SLAM Challenge 2026 dataset, with a qualitative visual comparison showing that the extracted wall-window pattern matches the floor-plan descriptor.
Significance. If the cross-modal descriptor can be shown to support reliable matching, the approach would usefully combine widely available semantic floor-plan priors with image-based localization, extending ideas like scan context to indoor structural elements. The current single qualitative match at a known pose provides only weak evidence for feasibility and does not yet demonstrate robustness or a complete pose-estimation pipeline.
major comments (2)
- [Abstract] Abstract / Proof-of-concept demonstration: The claim that the single visual match 'validates the feasibility of cross-modal structural matching' is not supported by any quantitative metric (e.g., azimuthal correlation, edit distance, or descriptor similarity score), ablation on detection thresholds, or tests at additional poses/lighting conditions. A lone qualitative comparison at one known pose supplies no evidence that the window-detection-plus-projection step will produce a reliably alignable hit-type channel under realistic variation.
- [Abstract] Image descriptor construction: Only the hit-type channel is populated from fisheye imagery; the remaining four channels (normalized range, range gradient, inverse range, local range variance) are defined for the floor-plan side but receive no corresponding image-based implementation or projection method. This leaves the advertised multi-channel descriptor incomplete for the intended cross-modal comparison.
minor comments (1)
- [Abstract] The abstract and text should explicitly note that the current implementation is limited to the hit-type channel and a single-pose qualitative check, to avoid overstating the scope of the presented results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our proof-of-concept work. We address each major comment below and will incorporate revisions to better reflect the preliminary nature of the results while preserving the core contribution of the multi-channel descriptor design.
Referee: [Abstract] Abstract / Proof-of-concept demonstration: The claim that the single visual match 'validates the feasibility of cross-modal structural matching' is not supported by any quantitative metric (e.g., azimuthal correlation, edit distance, or descriptor similarity score), ablation on detection thresholds, or tests at additional poses/lighting conditions. A lone qualitative comparison at one known pose supplies no evidence that the window-detection-plus-projection step will produce a reliably alignable hit-type channel under realistic variation.
Authors: We agree that the evidence is limited to a single qualitative visual comparison at one known pose from the Hilti-Trimble dataset, which does not include quantitative metrics or tests under varied conditions. The manuscript positions this as an initial proof-of-concept to illustrate the descriptor structure and the window detection pipeline rather than a full validation. In revision, we will modify the abstract to replace 'validating the feasibility' with phrasing such as 'providing initial evidence for' or 'demonstrating the potential of' cross-modal structural matching. We will also add a simple quantitative comparison (e.g., overlap count or correlation score between the hit-type channels) and, if feasible, results from one or two additional poses to strengthen the demonstration without overclaiming robustness. revision: yes
Referee: [Abstract] Image descriptor construction: Only the hit-type channel is populated from fisheye imagery; the remaining four channels (normalized range, range gradient, inverse range, local range variance) are defined for the floor-plan side but receive no corresponding image-based implementation or projection method. This leaves the advertised multi-channel descriptor incomplete for the intended cross-modal comparison.
Authors: The observation is accurate: the current image pipeline populates only the structural hit-type channel via window detection and azimuthal projection, while the four range-derived channels are generated exclusively from ray-casting on the floor plan. This reflects the paper's focus on semantic structural elements as the first cross-modal link, since the other channels require explicit range data not directly available from monocular fisheye images. We will revise the abstract and method sections to explicitly state that the full five-channel descriptor is realized on the floor-plan side, with imagery currently contributing the hit-type channel as an initial step. Future extensions (e.g., via depth estimation) will be noted as planned work rather than implying a complete multi-channel image descriptor at present. revision: yes
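The "overlap count or correlation score" promised in the first response could take a form like the one below. This is only one possible choice, sketched under assumptions: hit types are integer codes per azimuth bin, code 0 marks bins with no information (for example, image bins outside any detection), and the function names are hypothetical.

```python
def window_iou(plan, image, window=2):
    """Intersection-over-union of the bins labelled as window in each channel."""
    inter = sum(1 for p, q in zip(plan, image) if p == window and q == window)
    union = sum(1 for p, q in zip(plan, image) if p == window or q == window)
    return inter / union if union else 1.0

def agreement(plan, image, unknown=0):
    """Fraction of bins with equal hit types, ignoring bins unknown in either."""
    pairs = [(p, q) for p, q in zip(plan, image) if p != unknown and q != unknown]
    return sum(p == q for p, q in pairs) / len(pairs) if pairs else 0.0

# 8-bin toy channels: 1 = wall, 2 = window, 0 = unknown.
plan_hits = [1, 1, 2, 2, 2, 1, 1, 1]
image_hits = [1, 2, 2, 2, 1, 1, 0, 0]
iou = window_iou(plan_hits, image_hits)
agree = agreement(plan_hits, image_hits)
```

Reporting both numbers separates two failure modes: a low window IoU with high overall agreement means walls match but window extents are off, while low agreement overall suggests a heading or position error.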
Circularity Check
No circularity: descriptor built directly from ray-casting and detection steps
The paper constructs the multi-channel radial descriptor explicitly: the floor-plan side casts rays in 360 azimuth bins to populate the normalized range, hit type, range gradient, inverse range, and local range variance channels; the image side uses line-segment detection, vertical clustering, and a brightness check, followed by fisheye projection, to populate the hit-type channel. The sole empirical claim is a single qualitative visual comparison of wall-window patterns at one known pose. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the derivation chain; each step is an independent geometric or algorithmic operation whose output is not definitionally identical to its input. The result is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- azimuth bin count
- window detection thresholds
axioms (1)
- domain assumption: the fisheye camera model permits accurate azimuthal projection of detected window frames
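The domain assumption can be made concrete with a small projection sketch. It assumes an ideal equidistant fisheye model (image radius proportional to the angle from the optical axis) with no distortion polynomial, a camera frame with x to the right and z forward, azimuth measured positive to the right of the optical axis, and a known mounting yaw per camera. All function and parameter names are hypothetical, and a calibrated model would replace the equidistant step.

```python
import math

def pixel_to_azimuth_bin(u, v, fx, fy, cx, cy, cam_yaw_deg=0.0, n_bins=360):
    """Map a pixel to an azimuth bin under an ideal equidistant fisheye model."""
    mx, my = (u - cx) / fx, (v - cy) / fy
    theta = math.hypot(mx, my)           # angle from optical axis (equidistant)
    phi = math.atan2(my, mx)             # direction around the optical axis
    x = math.sin(theta) * math.cos(phi)  # bearing component to the right
    z = math.cos(theta)                  # bearing component forward
    az = math.degrees(math.atan2(x, z)) + cam_yaw_deg
    return int(math.floor((az % 360.0) * n_bins / 360.0)) % n_bins

def window_bins(u_left, u_right, v, intrinsics, cam_yaw_deg=0.0, n_bins=360):
    """Azimuth bins covered by a window spanning columns u_left..u_right."""
    lo = pixel_to_azimuth_bin(u_left, v, *intrinsics, cam_yaw_deg, n_bins)
    hi = pixel_to_azimuth_bin(u_right, v, *intrinsics, cam_yaw_deg, n_bins)
    span = (hi - lo) % n_bins            # handles wrap-around through bin 0
    return [(lo + k) % n_bins for k in range(span + 1)]

# Hypothetical intrinsics (fx, fy, cx, cy) for illustration only.
intr = (300.0, 300.0, 640.0, 480.0)
half = 300.0 * math.radians(2.5)         # pixel offset for a 2.5 degree bearing
bins = window_bins(640.0 - half, 640.0 + half, 480.0, intr)
```

With a rear-facing second camera, the same call with `cam_yaw_deg=180.0` would place its detections in the opposite half of the descriptor, which is how two fisheye views could jointly cover the 360 bins.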