arxiv: 2604.25388 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.RO

COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization

Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords floor-plan localization · visual localization · fisheye cameras · multi-channel descriptor · structural matching · cross-modal matching · window detection · robot navigation

The pith

COMPASS builds matching multi-channel radial descriptors from floor plans and fisheye images to enable structural matching for robot localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMPASS to use architectural floor plans as priors containing both geometry and semantics for localizing a robot equipped with dual fisheye cameras. It designs a radial descriptor with five channels by casting rays from the floor plan across 360 azimuth bins to record normalized range, hit type, gradient, inverse range, and variance. From images, the same structure is populated by detecting windows through line-segment clustering and brightness checks, then projecting detections azimuthally. A proof-of-concept demonstration at one known pose shows the wall-window hit-type pattern from the images closely matches the floor-plan version, validating cross-modal feasibility. This matters because most localization ignores the semantic details in widely available floor plans.

Core claim

COMPASS generates a compact multi-channel radial descriptor from floor plans via 360-azimuth ray casting that encodes normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. The identical descriptor is filled from dual fisheye imagery by a window detector that clusters vertical line segments and verifies brightness, then projects the results to azimuthal bearings. At a single known pose drawn from the Hilti-Trimble SLAM Challenge 2026 dataset, the extracted wall-window pattern matches the floor-plan descriptor, confirming that cross-modal structural matching is feasible.
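The floor-plan side of this claim can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `floorplan_descriptor`, the integer hit-type codes, the step size, the maximum range, and the variance window are all assumptions, and a ray that leaves the raster without hitting anything is simply treated as an opening.

```python
import numpy as np

# Hypothetical integer codes for the hit-type channel.
OPENING, WALL, WINDOW = 0, 1, 2

def floorplan_descriptor(wall_mask, window_mask, pos, n_bins=360,
                         max_range=50.0, step=0.5, half_win=2):
    """Build a 5 x n_bins radial descriptor at pos = (x, y), in pixels."""
    h, w = wall_mask.shape
    rng = np.full(n_bins, max_range)            # range per azimuth bin
    hit = np.full(n_bins, OPENING, dtype=float)
    for i in range(n_bins):
        a = 2.0 * np.pi * i / n_bins
        dx, dy = np.cos(a), np.sin(a)
        for r in np.arange(step, max_range, step):
            x = int(round(pos[0] + r * dx))
            y = int(round(pos[1] + r * dy))
            if not (0 <= x < w and 0 <= y < h):
                break                            # left the raster: keep OPENING
            if window_mask[y, x]:
                rng[i], hit[i] = r, WINDOW
                break
            if wall_mask[y, x]:
                rng[i], hit[i] = r, WALL
                break
    norm = rng / max_range                       # channel 0: normalized range
    grad = np.roll(norm, -1) - norm              # channel 2: circular gradient
    inv = 1.0 / (norm + 1e-6)                    # channel 3: inverse range
    padded = np.concatenate([norm[-half_win:], norm, norm[:half_win]])
    var = np.array([padded[i:i + 2 * half_win + 1].var()
                    for i in range(n_bins)])     # channel 4: local variance
    return np.stack([norm, hit, grad, inv, var])
```

Called with two boolean masks and a position, the function returns an array of shape (5, 360), the same layout as the descriptor matrix shown in Figure 1.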

What carries the argument

The multi-channel radial descriptor that encodes the surrounding geometric layout in 360 azimuth bins, populated from floor-plan ray casting on one side and from projected window detections on the fisheye-image side.

If this is right

  • Floor plans become usable semantic priors rather than mere geometric outlines for visual localization.
  • Detected windows supply a distinctive hit-type channel that complements pure range information.
  • The shared descriptor structure permits direct comparison between map and image without intermediate feature matching.
  • A single successful match at a known pose supports extension to searching over candidate poses for full estimation.
  • Semantic elements such as openings add robustness in repetitive indoor geometries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Full localization could be achieved by sliding the descriptor over the floor plan and finding the minimum-difference position.
  • The method may reduce reliance on LiDAR or dense visual features in structured indoor environments.
  • Adding more structural classes beyond windows could increase descriptor uniqueness without increasing size.
  • Integration with existing SLAM pipelines could use the match for initialization or loop-closure detection.
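The first extension above, searching over candidate positions, could look roughly like this. It is a sketch under stated assumptions: hit-type rows for K candidate positions are taken as already computed, only a binary window / not-window indicator is compared, and the FFT-based circular correlation and the name `search_candidates` are illustrative choices rather than anything the paper specifies.

```python
import numpy as np

def search_candidates(cand_hits, query_hits):
    """Return (candidate index, heading shift in bins, score in [-1, 1]).

    cand_hits:  K x n array of binary window indicators, one row per candidate.
    query_hits: length-n binary window indicator from the camera side.
    """
    a = 2.0 * np.asarray(cand_hits, dtype=float) - 1.0   # {0,1} -> {-1,+1}
    b = 2.0 * np.asarray(query_hits, dtype=float) - 1.0
    # Circular cross-correlation of every candidate row with the query:
    # corr[k, s] = sum_i a[k, i] * b[(i - s) mod n]
    corr = np.fft.ifft(np.fft.fft(a, axis=1) * np.conj(np.fft.fft(b)),
                       axis=1).real
    k, s = np.unravel_index(np.argmax(corr), corr.shape)
    return int(k), int(s), float(corr[k, s] / a.shape[1])
```

With the ±1 encoding the score equals agreements minus disagreements divided by the bin count, so a candidate with many window bins is not favored merely for having more windows. A returned shift `s` means `np.roll(query_hits, s)` best aligns with the chosen candidate row.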

Load-bearing premise

Window detection via line-segment clustering and brightness verification on fisheye images will produce a hit-type channel that reliably aligns with the floor-plan descriptor under varying lighting, occlusions, and viewpoints.
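To make the premise concrete, here is a rough sketch of the two steps it names: clustering near-vertical line segments and verifying brightness between adjacent clusters. The paper leaves its thresholds unspecified, so `max_tilt_deg`, `gap_px`, and `min_brightness` are invented for illustration. The segments are assumed to come from some external line-segment detector, and the image is assumed to be rectified enough that window frames appear roughly vertical.

```python
import numpy as np

def detect_windows(segments, gray, max_tilt_deg=10.0, gap_px=8.0,
                   min_brightness=180.0):
    """Return (left_x, right_x) column spans that look like windows.

    segments: iterable of (x1, y1, x2, y2) line segments in pixels.
    gray:     2-D grayscale image, brightness on a 0..255 scale.
    """
    segs = np.asarray(segments, dtype=float)
    dx = segs[:, 2] - segs[:, 0]
    dy = segs[:, 3] - segs[:, 1]
    tilt = np.degrees(np.arctan2(np.abs(dx), np.abs(dy)))  # 0 = vertical
    mid_x = (segs[:, 0] + segs[:, 2]) / 2.0
    xs = np.sort(mid_x[tilt <= max_tilt_deg])
    if xs.size == 0:
        return []
    # 1-D clustering: split wherever consecutive x positions are far apart.
    splits = np.where(np.diff(xs) > gap_px)[0] + 1
    edges = [float(c.mean()) for c in np.split(xs, splits)]
    windows = []
    for left, right in zip(edges[:-1], edges[1:]):
        lo, hi = int(np.ceil(left)), int(np.floor(right))
        # Brightness verification: the strip between two frame edges must be
        # bright, as a daylight-facing window would be.
        if hi > lo and gray[:, lo:hi].mean() >= min_brightness:
            windows.append((left, right))
    return windows
```

A fixed brightness threshold like this is exactly the kind of parameter that would be sensitive to the lighting changes the premise worries about.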

What would settle it

Generating both descriptors at the known pose on additional frames or a second dataset and observing that the hit-type channel from the images no longer closely matches the floor-plan wall-window pattern.
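One simple way to turn "closely matches" into a number for such a test is the fraction of azimuth bins on which the two hit-type channels agree, maximized over circular shifts. The sketch below is an assumed metric, not necessarily the score the paper reports, and the name `best_circular_match` is hypothetical.

```python
import numpy as np

def best_circular_match(map_hits, cam_hits):
    """Return (shift in bins, agreement fraction) for two hit-type channels.

    Both inputs are length-n arrays of hit-type labels. The returned shift s
    is the one for which np.roll(cam_hits, s) agrees with map_hits on the
    largest fraction of bins.
    """
    map_hits = np.asarray(map_hits)
    cam_hits = np.asarray(cam_hits)
    n = map_hits.size
    scores = np.array([np.mean(np.roll(cam_hits, s) == map_hits)
                       for s in range(n)])
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

Tracking this fraction across additional frames or a second dataset would show whether agreement at the known pose holds or collapses.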

Figures

Figures reproduced from arXiv: 2604.25388 by Asier Bikandi-Noya, Holger Voos, Jose Luis Sanchez-Lopez, Miguel Fernandez-Cortizas, Muhammad Shaheer.

Figure 1. Floor plan descriptor. Top left: ray cast overlaid on the floor plan (gray rays → walls, red → windows). Top center: polar range profile. Top right: the 5 × 360 descriptor matrix. Bottom: linearized range (blue), gradient (orange), with background shading by hit type.
Excerpt: A. Floor Plan Descriptor Generation. The floor plan is represented as two binary raster masks (see …
Figure 2. Window detection on dual fisheye images from a …
Figure 3. Cross-modal hit-type matching. The camera hit-type (156 window bins) is compared against the floor plan (67 window …
Figure 4. Cross-correlation peaks at 0° (score 0.9486).
Excerpt: … discrepancies, cross-correlation produces a clear peak at the correct heading. Roll and pitch estimation. The vanishing point (VP) based attitude estimation algorithm is applied to the same pair of fisheye frames …
Figure 5. Vanishing point estimation for roll and pitch recovery.
Original abstract

Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays are cast in 360 azimuth bins and the results are encoded into five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. From the image side, the same descriptor structure is populated by detecting structural elements in the fisheye imagery. As a first step toward full cross-modal matching, we present a window detection algorithm for fisheye images that uses a line segment detector to identify window frames via vertical edge clustering and brightness verification. Detected windows are projected to azimuthal bearings through the fisheye camera model, producing the hit-type channel of the visual descriptor. As a proof of concept, we generate both descriptors at a single known pose from the Hilti-Trimble SLAM Challenge 2026 dataset and demonstrate that the wall-window pattern extracted from the first frame of each camera closely matches the floor plan descriptor, validating the feasibility of cross-modal structural matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces COMPASS, a multi-channel radial descriptor inspired by scan context for floor-plan-based visual localization with dual fisheye cameras. Rays are cast from the floor plan across 360 azimuth bins to encode five channels (normalized range, structural hit type including walls/windows/openings, range gradient, inverse range, and local variance). From imagery, a window detection pipeline (line-segment detection, vertical clustering, brightness check) projects detections to azimuthal bearings to populate the hit-type channel. As proof of concept, both descriptors are generated at one known pose from the Hilti-Trimble SLAM Challenge 2026 dataset, with a qualitative visual comparison showing that the extracted wall-window pattern matches the floor-plan descriptor.

Significance. If the cross-modal descriptor can be shown to support reliable matching, the approach would usefully combine widely available semantic floor-plan priors with image-based localization, extending ideas like scan context to indoor structural elements. The current single qualitative match at a known pose provides only weak evidence for feasibility and does not yet demonstrate robustness or a complete pose-estimation pipeline.

major comments (2)
  1. [Abstract] Abstract / Proof-of-concept demonstration: The claim that the single visual match 'validates the feasibility of cross-modal structural matching' is not supported by any quantitative metric (e.g., azimuthal correlation, edit distance, or descriptor similarity score), ablation on detection thresholds, or tests at additional poses/lighting conditions. A lone qualitative comparison at one known pose supplies no evidence that the window-detection-plus-projection step will produce a reliably alignable hit-type channel under realistic variation.
  2. [Abstract] Image descriptor construction: Only the hit-type channel is populated from fisheye imagery; the remaining four channels (normalized range, range gradient, inverse range, local range variance) are defined for the floor-plan side but receive no corresponding image-based implementation or projection method. This leaves the advertised multi-channel descriptor incomplete for the intended cross-modal comparison.
minor comments (1)
  1. [Abstract] The abstract and text should explicitly note that the current implementation is limited to the hit-type channel and a single-pose qualitative check, to avoid overstating the scope of the presented results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our proof-of-concept work. We address each major comment below and will incorporate revisions to better reflect the preliminary nature of the results while preserving the core contribution of the multi-channel descriptor design.

Point-by-point responses
  1. Referee: [Abstract] Abstract / Proof-of-concept demonstration: The claim that the single visual match 'validates the feasibility of cross-modal structural matching' is not supported by any quantitative metric (e.g., azimuthal correlation, edit distance, or descriptor similarity score), ablation on detection thresholds, or tests at additional poses/lighting conditions. A lone qualitative comparison at one known pose supplies no evidence that the window-detection-plus-projection step will produce a reliably alignable hit-type channel under realistic variation.

    Authors: We agree that the evidence is limited to a single qualitative visual comparison at one known pose from the Hilti-Trimble dataset, which does not include quantitative metrics or tests under varied conditions. The manuscript positions this as an initial proof-of-concept to illustrate the descriptor structure and the window detection pipeline rather than a full validation. In revision, we will modify the abstract to replace 'validating the feasibility' with phrasing such as 'providing initial evidence for' or 'demonstrating the potential of' cross-modal structural matching. We will also add a simple quantitative comparison (e.g., overlap count or correlation score between the hit-type channels) and, if feasible, results from one or two additional poses to strengthen the demonstration without overclaiming robustness. revision: yes

  2. Referee: [Abstract] Image descriptor construction: Only the hit-type channel is populated from fisheye imagery; the remaining four channels (normalized range, range gradient, inverse range, local range variance) are defined for the floor-plan side but receive no corresponding image-based implementation or projection method. This leaves the advertised multi-channel descriptor incomplete for the intended cross-modal comparison.

    Authors: The observation is accurate: the current image pipeline populates only the structural hit-type channel via window detection and azimuthal projection, while the four range-derived channels are generated exclusively from ray-casting on the floor plan. This reflects the paper's focus on semantic structural elements as the first cross-modal link, since the other channels require explicit range data not directly available from monocular fisheye images. We will revise the abstract and method sections to explicitly state that the full five-channel descriptor is realized on the floor-plan side, with imagery currently contributing the hit-type channel as an initial step. Future extensions (e.g., via depth estimation) will be noted as planned work rather than implying a complete multi-channel image descriptor at present. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptor built directly from ray-casting and detection steps

full rationale

The paper constructs the multi-channel radial descriptor explicitly: floor-plan side uses 360-azimuth ray casting to populate normalized range, hit type, gradient, inverse range and variance channels; image side uses line-segment detection, vertical clustering and brightness check followed by fisheye projection to populate the hit-type channel. The sole empirical claim is a single qualitative visual comparison of wall-window patterns at one known pose. No equations, fitted parameters, self-citations or uniqueness theorems appear in the derivation chain; each step is an independent geometric or algorithmic operation whose output is not definitionally identical to its input. The result is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard camera projection and line detection assumptions plus a new descriptor construction; no major free parameters or invented physical entities are introduced in the abstract.

free parameters (2)
  • azimuth bin count
    360 bins chosen for radial discretization
  • window detection thresholds
    Clustering and brightness verification parameters left unspecified
axioms (1)
  • domain assumption: Fisheye camera model permits accurate azimuthal projection of detected window frames
    Invoked to populate the visual hit-type channel from image detections
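As an illustration of what this axiom involves, the sketch below maps a pixel to an azimuth bin. It assumes an ideal equidistant fisheye model (image radius proportional to the angle from the optical axis) rather than a full calibrated distortion model, a known camera-to-body rotation `R_body_cam`, and a body frame with x forward, y left, z up. The function name and all of these conventions are assumptions made for the example.

```python
import numpy as np

def pixel_to_azimuth(u, v, fx, fy, cx, cy, R_body_cam, n_bins=360):
    """Return (azimuth in degrees within [0, 360), azimuth bin index)."""
    mx, my = (u - cx) / fx, (v - cy) / fy
    theta = np.hypot(mx, my)          # equidistant model: radius equals angle
    phi = np.arctan2(my, mx)          # direction around the optical axis
    ray_cam = np.array([np.sin(theta) * np.cos(phi),
                        np.sin(theta) * np.sin(phi),
                        np.cos(theta)])
    ray_body = np.asarray(R_body_cam, dtype=float) @ ray_cam
    # Azimuth is the bearing of the ray projected onto the horizontal plane.
    az = float(np.degrees(np.arctan2(ray_body[1], ray_body[0])) % 360.0)
    return az, int(az // (360.0 / n_bins)) % n_bins
```

The left and right edges of each detected window would be passed through such a mapping, and every bin between the two resulting bearings marked as a window in the hit-type channel. Errors in calibration or in the assumed rotation shift those bins directly, which is why the projection is listed as an assumption.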

pith-pipeline@v0.9.0 · 5585 in / 1364 out tokens · 47684 ms · 2026-05-07T16:46:57.491413+00:00 · methodology

