arxiv: 2605.02227 · v1 · submitted 2026-05-04 · 💻 cs.RO

Change-Robust Online Spatial-Semantic Topological Mapping

Pith reviewed 2026-05-08 18:56 UTC · model grok-4.3

classification 💻 cs.RO
keywords topological mapping · change-robust navigation · spatial-semantic reasoning · RGB-D keyframes · hypothesis testing · robot localization · perceptual aliasing · SLAM alternatives

The pith

Robots can navigate reliably amid lighting changes and rearranged furniture by using an online topological graph of RGB-D keyframes instead of a globally consistent metric map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that spatial-semantic reasoning for robot navigation remains reliable when an online pose-aware topological graph of RGB-D keyframes replaces the usual SLAM-built metric map. Existing pipelines attach semantics to those metric maps, yet they break when appearance shifts or scene dynamics interfere with data association and relocalization. The proposed method instead reasons explicitly over perceptual ambiguity through sequential hypothesis testing in continuous three-dimensional pose space and keeps a bounded mixture belief over possible poses. This matters for autonomous robots because real environments change constantly through lighting shifts, moved objects, and other disturbances that would otherwise lead to unsafe decisions or lost localization.

Core claim

The central claim is that an online, pose-aware topological graph of RGB-D keyframes, together with sequential hypothesis testing in continuous SE(3), supplies sufficient spatial-semantic information for navigation without requiring a globally consistent metric substrate. The estimator maintains a bounded Gaussian-mixture belief over poses, which supports principled loop-closure handling and recovery from kidnapped-robot events. Experiments with real-robot object-goal navigation under lighting shifts and furniture rearrangement show improved robustness over SLAM-based and standard topological baselines while preserving safety under perceptual aliasing.

What carries the argument

An online pose-aware topological graph of RGB-D keyframes combined with sequential hypothesis testing in continuous three-dimensional pose space, which supplies the spatial-semantic information and bounded pose beliefs needed for navigation decisions.

If this is right

  • Object-goal navigation stays safe and accurate even when lighting conditions change or furniture is moved.
  • The system handles perceptual aliasing without catastrophic failure where multiple locations appear similar.
  • Loop closures and sudden robot displacements are managed through the bounded mixture belief over poses.
  • Navigation decisions remain more reliable than those from methods that depend on global metric consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bounded pose belief could support incremental updates over very long periods without map rebuilding.
  • Sharing such topological graphs among multiple robots might avoid the alignment problems that metric maps create.
  • Pairing the graph with object-level semantic labels could enable planning that reasons directly about reachable places rather than coordinates.

Load-bearing premise

An online pose-aware topological graph of RGB-D keyframes plus sequential hypothesis testing in continuous three-dimensional pose space can supply enough spatial-semantic information for reliable navigation decisions without a globally consistent metric substrate.

What would settle it

A real-robot trial in which the topological-graph method produces unsafe paths or loses localization accuracy during combined lighting shifts and furniture rearrangement, performing no better than the SLAM or topological baselines under perceptual aliasing, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.02227 by Atharva Ghotavadekar, Diwen Liu, Harold Soh, Jiaming Wang, Jiaxuan Da, Jizhuo Chen, Linh Kästner.

Figure 1: Our Change-Robust Online Spatial–Semantic (CROSS) representation enables robust language-goal navigation (A) under substantial appearance and scene changes, including lighting variation (C), object rearrangement, dynamic pedestrians (B,E), and unexpected sensor failures (D). CROSS constructs a pose-aware topological graph and explicitly reasons over ambiguity via sequential hypothesis testing in continuous…

Figure 3: Change-Robust Online Spatial-Semantic (CROSS) Topological System Overview. Given an RGB-D frame and odometry, the online tracking module (orange) performs sequential hypothesis testing in continuous SE(3). Motion updates are propagated via SE(3) push-forward, while measurement updates are constructed through VPR-based keyframe retrieval. Competing hypotheses are efficiently managed using Gaussian-mixture c…

Figure 4: Appearance change at the same physical locations for the two benchmarks. The top row shows Rover (Campus) across different months/times, while the bottom row shows OpenLORIS (Corridor) across different times of day.

Figure 5: Multi-session relocalization results on the Rover [32] Campus scene. Left: Relocalization outcomes across different locations. Each row corresponds to a mapping trajectory (indicated by different colors), while columns show relocalization attempts at the same physical locations captured at different times or months, as illustrated in the top image. Empty space indicates relocalization failed at that specif…

Figure 6: Example illustrations of the three evaluation settings for the real quadruped-robot experiment. Each image shows the environment before (top) and after (bottom) the change. Left: Lighting Change (LC). Middle: Object Rearrangement (OR). Right: Combined Change (LC+OR).

Figure 7: Fast-motion sequence across seven timestamps. Top row shows the online mapping trajectory of our system: yellow…

Figure 8: Occlusion sequence across six timestamps. Top row shows the online mapping and belief evolution of our system:…

Figure 9: Noisy-odometry experiment under different signal-to-noise ratio (SNR) settings. Each column shows the full trajectory…

Figure 10: Runtime breakdown of the mapping pipeline per step, comparing relative pose estimation via PnP-RANSAC versus VGGT [39]. Bars report the total average step time and its main components: relative pose estimation (Rel. pose), visual place recognition retrieval (VPR), and sequential hypothesis testing. The log-scale y-axis highlights the large disparity in pose-estimation cost.

Figure 11: Factor-graph representation of our Gaussian mixture filtering model. ϕ_t^mot and ϕ_t^meas are the motion and measurement factors. x_t is the current pose, u_t the odometry input, and z_t the RGB-D observation. Y_t is a latent association variable identifying which keyframe explains z_t, and G is the set of stored keyframes.

Figure 12: Multi-session relocalization results on the Rover [32] Campus scene.

Figure 13: Multi-session relocalization results on the OpenLORIS [33] Corridor scene.

Original abstract

Autonomous robots require change-robust spatial-semantic reasoning: using spatial and semantic knowledge to decide where to go, how to get there, and where the robot is despite environmental change. Existing approaches typically attach semantics to SLAM-built metric maps, but these pipelines are brittle under appearance shifts and scene dynamics, where data association and relocalization degrade. We propose a Change-Robust Online Spatial-Semantic (CROSS) representation that replaces a globally consistent metric substrate with an online, pose-aware topological graph of RGB-D keyframes. The system explicitly reasons over perceptual ambiguity using sequential hypothesis testing in continuous SE(3). Our estimator maintains a bounded Gaussian-mixture belief over poses, enabling principled handling of loop closures and kidnapped-robot events. Experiments under severe appearance change, including real-robot object-goal navigation with lighting shifts and furniture rearrangement, demonstrate improved robustness over SLAM-based and topological baselines while remaining safe under perceptual aliasing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Change-Robust Online Spatial-Semantic (CROSS) representation for autonomous robots, which replaces a globally consistent metric substrate with an online pose-aware topological graph of RGB-D keyframes. The system uses sequential hypothesis testing in continuous SE(3) and maintains a bounded Gaussian-mixture belief over poses to handle perceptual ambiguities, loop closures, and kidnapped-robot events. Experiments with real-robot object-goal navigation under severe appearance changes (lighting shifts and furniture rearrangement) are reported to show improved robustness over SLAM-based and topological baselines while remaining safe under perceptual aliasing.

Significance. If the central claims hold, the work would be significant for robot mapping and navigation in dynamic environments, as it provides a principled topological alternative to brittle metric SLAM pipelines under appearance and structural change. The explicit handling of ambiguity via SE(3) hypothesis testing and bounded mixture beliefs, combined with real-robot trials, offers a concrete advance over purely metric or purely topological baselines. Credit is due for focusing on safety under aliasing and for grounding the evaluation in object-goal navigation tasks.

major comments (2)
  1. [the proposed CROSS representation and estimator] The load-bearing claim that relative pose estimates between keyframes plus the multi-hypothesis SE(3) belief suffice for reliable navigation decisions without global metric consistency is not accompanied by an explicit analysis of residual pose uncertainty (particularly when furniture rearrangement alters keyframe visibility). This assumption underpins the safety and robustness assertions but lacks a concrete bound or failure-mode characterization in the method description.
  2. [Experiments] The abstract states that experiments demonstrate improved robustness, yet provides no quantitative metrics, error bars, or statistical comparison details. Without these, it is impossible to assess whether the topological approach actually outperforms baselines by a margin that justifies replacing metric substrates.
minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., success rate or navigation time under change) to support the robustness claim.
  2. Notation for the bounded Gaussian-mixture belief and the sequential hypothesis test should be introduced with explicit equations or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where they strengthen the work.

Point-by-point responses
  1. Referee: [the proposed CROSS representation and estimator] The load-bearing claim that relative pose estimates between keyframes plus the multi-hypothesis SE(3) belief suffice for reliable navigation decisions without global metric consistency is not accompanied by an explicit analysis of residual pose uncertainty (particularly when furniture rearrangement alters keyframe visibility). This assumption underpins the safety and robustness assertions but lacks a concrete bound or failure-mode characterization in the method description.

    Authors: We thank the referee for identifying this point. The CROSS estimator maintains a bounded Gaussian-mixture belief over SE(3) poses via sequential hypothesis testing precisely to represent residual uncertainty and perceptual ambiguity without relying on global metric consistency; navigation decisions are conditioned on the full belief support to preserve safety. We agree, however, that an explicit characterization of how this uncertainty evolves under furniture rearrangement (which can reduce keyframe visibility) would make the safety claims more concrete. We have added a dedicated paragraph in the method section providing a bound on residual pose uncertainty and discussing associated failure modes. revision: yes

  2. Referee: [Experiments] The abstract states that experiments demonstrate improved robustness, yet provides no quantitative metrics, error bars, or statistical comparison details. Without these, it is impossible to assess whether the topological approach actually outperforms baselines by a margin that justifies replacing metric substrates.

    Authors: We appreciate the referee's emphasis on quantitative rigor. The manuscript's experiments section already reports success rates, navigation times, and direct comparisons against SLAM-based and topological baselines under lighting shifts and furniture rearrangement. To address the concern about the abstract and presentation, we have revised the abstract to include key quantitative metrics with error bars and have added explicit statistical significance tests in the experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper introduces a CROSS representation based on an online pose-aware topological graph of RGB-D keyframes with sequential SE(3) hypothesis testing and a bounded Gaussian-mixture pose belief. No equations, predictions, or first-principles results are shown that reduce by construction to the inputs (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations). The central claims rest on experimental validation under appearance change rather than tautological redefinitions or imported uniqueness theorems. This is a standard non-circular proposal of a new mapping method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard robotics assumptions about keyframe-based representation and probabilistic pose estimation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: A topological graph of RGB-D keyframes can sufficiently capture spatial-semantic knowledge for navigation despite environmental changes.
    This underpins the replacement of metric maps and is invoked in the description of the CROSS representation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Theorem: reality_from_one_distinction (spacetime emergence). Tag: unclear.
    Paper passage: Σ⁻ = Ad_{ΔT⁻¹} Σ Ad^T_{ΔT⁻¹} + Q_t (first-order pushforward via the adjoint)
    Relation between the paper passage and the cited Recognition theorem: Classical SE(3) covariance transport; RS's emergent Lorentzian (1,3) signature is conceptually unrelated to estimator covariance propagation.
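
The covariance pushforward quoted above can be illustrated with a short NumPy sketch. This is not the authors' code: the function names, the 4x4 homogeneous-matrix representation of ΔT, and the (translation, rotation) ordering of the 6-vector tangent space are all assumptions made for illustration.

```python
import numpy as np

def skew(v):
    """3x3 skew-symmetric matrix, so that skew(v) @ w equals np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def adjoint_se3(T):
    """6x6 adjoint of a 4x4 homogeneous transform.

    Assumed tangent ordering: (translation, rotation).
    """
    R, p = T[:3, :3], T[:3, 3]
    Ad = np.zeros((6, 6))
    Ad[:3, :3] = R
    Ad[:3, 3:] = skew(p) @ R
    Ad[3:, 3:] = R
    return Ad

def propagate_covariance(Sigma, delta_T, Q):
    """First-order pushforward: Ad_{dT^-1} Sigma Ad_{dT^-1}^T + Q."""
    Ad = adjoint_se3(np.linalg.inv(delta_T))
    return Ad @ Sigma @ Ad.T + Q

if __name__ == "__main__":
    Sigma = np.diag([1.0, 4.0, 9.0, 0.1, 0.1, 0.1])  # prior pose covariance
    Q = 0.01 * np.eye(6)                             # hypothetical motion noise
    delta_T = np.eye(4)
    delta_T[:3, 3] = [0.5, 0.0, 0.0]                 # 0.5 m forward step
    print(np.round(propagate_covariance(Sigma, delta_T, Q), 3))
```

With an identity ΔT the adjoint is the identity, so the result reduces to Σ + Q_t; a pure rotation simply re-expresses the covariance axes in the new body frame.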

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. [2] Amar Ali-Bey, Brahim Chaib-draa, and Philippe Giguere. Boq: A place is worth a bag of learnable queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17794–17803, 2024.
  3. [3] D. L. Alspach and H. W. Sorenson. Nonlinear bayesian estimation using gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972. doi: 10.1109/TAC.1972.1100034
  4. [4] Adrien Angeli, David Filliat, Stéphane Doncieux, and Jean-Arcady Meyer. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, 24(5):1027–1037, 2008.
  5. [5] Gabriele Berton and Carlo Masone. Megaloc: One retrieval to place them all. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2861–2867, 2025.
  6. [6] Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
  7. [7] Matthew Chang, Theophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, and Devendra Singh Chaplot. GOAT: GO to any thing. In Proceedings of Robotics: Science and Systems (RSS), 2024. doi: 10.15607/RSS.2024.XX.073
  8. [8] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33:4247–4258, 2020.
  9. [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  10. [10] Mark Cummins and Paul Newman. Appearance-only slam at large scale with fab-map 2.0. The International Journal of Robotics Research, 30(9):1100–1123, 2011. doi: 10.1177/0278364910385483

  11. [11] Frank Dellaert and GTSAM Contributors. borglab/gtsam, May 2022. URL https://github.com/borglab/gtsam
  12. [12] M.A.T. Figueiredo and A.K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002. doi: 10.1109/34.990138
  13. [13] Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Aditya Sen, Aditya Agarwal, Corban Rivera, William Knudson, Erik Sudderth, Oscar Beijbom, et al. ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024.
  14. [14] N. Hughes, Y. Chang, and L. Carlone. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. 2022.
  15. [15] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, et al. ConceptFusion: Open-set multimodal 3D mapping. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  16. [16] Mathieu Labbe and Francois Michaud. Appearance-based loop closure detection for online large-scale and long-term operation. IEEE Transactions on Robotics, 29(3):734–745, 2013.
  17. [17] Mathieu Labbé and François Michaud. Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics, 36(2):416–446, 2019.
  18. [18] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, 2009.
  19. [19] Mingrui Li, Shuhong Liu, Heng Zhou, Guohao Zhu, Na Cheng, Tianchen Deng, and Hongyu Wang. Sgs-slam: Semantic gaussian splatting for neural dense slam. In European Conference on Computer Vision, pages 163–
  20. [20] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.

  21. [21] Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. OK-Robot: What really matters in integrating open-knowledge models for robotics. In Proceedings of Robotics: Science and Systems (RSS), 2024. doi: 10.15607/RSS.2024.XX.091
  22. [22] Andréa Macario Barros, Maugan Michel, Yoann Moline, Gwenolé Corre, and Frédérick Carrel. A comprehensive survey of visual slam algorithms. Robotics, 11(1):24, 2022.
  23. [23] Will Maddern, Michael Milford, and Gordon Wyeth. CAT-SLAM: Probabilistic localisation and mapping using a continuous appearance-based trajectory. The International Journal of Robotics Research (IJRR), 31(4):429–451, 2012. doi: 10.1177/0278364912438273
  24. [24] Xiangyun Meng, Nathan Ratliff, Yu Xiang, and Dieter Fox. Scaling local control to large-scale topological navigation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 672–678. IEEE, 2020.
  25. [25] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera using a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008.
  26. [26] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  27. [27] Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025.
  28. [28] Guilherme Potje, Felipe Cadar, André Araujo, Renato Martins, and Erickson R Nascimento. Xfeat: Accelerated features for lightweight image matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2682–2691, 2024.
  29. [29] Long Quan and Zhongdan Lan. Linear n-point camera pose determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):774–780, 1999.
  30. [30] Branko Ristic, Sanjeev Arulampalam, and Neil Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Radar Library. Artech House, Boston, London, 2004. ISBN 9781580536318

  31. [31] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018.
  32. [32] Fabian Schmidt, Julian Daubermann, Marcel Mitschke, Constantin Blessing, Stephan Meyer, Markus Enzweiler, and Abhinav Valada. Rover: A multi-season dataset for visual slam. IEEE Transactions on Robotics, 2025.
  33. [33] Xuesong Shi, Dongjiang Li, Pengpeng Zhao, Qinbin Tian, Yuxin Tian, Qiwei Long, Chunhao Zhu, Jingwei Song, Fei Qiao, Le Song, Yangquan Guo, Zhigang Wang, Yimin Zhang, Baoxing Qin, Wei Yang, Fangshi Wang, Rosa H. M. Chan, and Qi She. Are we ready for service robots? the OpenLORIS-Scene datasets for lifelong SLAM. In 2020 International Conference on Robotic...
  34. [34] Lauri Suomela, Jussi Kalliola, Harry Edelman, and Joni-Kristian Kämäräinen. Placenav: Topological navigation through place recognition. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5205–5213. IEEE, 2024.
  35. [35] S Urban, J Leitloff, and S Hinz. Mlpnp–a real-time maximum likelihood solution to the perspective-n-point problem. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3:131–138, 2016.
  36. [36] Jiaming Wang and Harold Soh. Probable object location (polo) score estimation for efficient object goal navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5221–5227. IEEE, 2024.
  37. [37] Jiaming Wang, Diwen Liu, Jizhuo Chen, Jiaxuan Da, Nuowen Qian, Minh Man Tram, and Harold Soh. Genie: A generalizable navigation system for in-the-wild environments. IEEE Robotics and Automation Letters, 2025.
  38. [38] Jiaming Wang, Diwen Liu, Jizhuo Chen, and Harold Soh. Topo-bench: An open-source topological mapping evaluation framework with quantifiable perceptual aliasing. arXiv preprint arXiv:2510.04100, 2025.
  39. [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  40. [40] Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. doi: 10.15607/RSS.2024.XX.077
  41. [41] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.
  42. [42] Siting Zhu, Guangming Wang, Hermann Blum, Jiuming Liu, Liang Song, Marc Pollefeys, and Hesheng Wang. Sni-slam: Semantic neural implicit slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21167–21177, 2024.

APPENDIX A: EXPERIMENT DETAILS

A. Topological Localization Baselines

This appendix describes the topological...

Greedy Matching (GM): The greedy matching baseline localizes by selecting the node with the highest similarity score to the current observation. If the maximum similarity exceeds a fixed threshold τ, the corresponding node is selected as the localization result; otherwise, the localization estimate remains unchanged. This baseline reflects a common retrieva...
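
The greedy matching rule described above can be sketched in a few lines. The cosine similarity on global descriptors, the function name `greedy_match`, and its arguments are assumptions for illustration; the source does not state which similarity function the baseline uses.

```python
import numpy as np

def greedy_match(query, node_descriptors, tau, previous_node):
    """Return the best-matching node index, or keep the previous estimate.

    Similarity is cosine similarity (an assumption, not taken from the paper).
    """
    q = query / np.linalg.norm(query)
    D = node_descriptors / np.linalg.norm(node_descriptors, axis=1, keepdims=True)
    sims = D @ q                      # one similarity score per graph node
    best = int(np.argmax(sims))
    # Accept only when the top score clears the fixed threshold tau.
    return best if sims[best] > tau else previous_node

if __name__ == "__main__":
    nodes = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(greedy_match(np.array([0.9, 0.1]), nodes, tau=0.8, previous_node=1))
    print(greedy_match(np.array([1.0, 1.0]), nodes, tau=0.8, previous_node=1))
```

Because the decision depends on a single frame, two visually similar places can both clear τ, which is the perceptual-aliasing weakness the other baselines try to reduce.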

Sequence Matching (SM): Instead of matching a single observation, sequence matching aggregates similarity scores over a short temporal window to improve robustness against perceptual aliasing and viewpoint variation. A candidate match between nodes (v_i, v_j) is accepted if the aggregated similarity over a window of size 2h+1 satisfies f_sim(z_{v_{i−h}}, z_{v_{j−h}}), . ....
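
A minimal sketch of windowed matching follows. The acceptance formula is truncated in the source, so the aggregation used here (mean of the 2h+1 aligned similarities compared against a threshold `tau`) is an assumption, as are the function name and the precomputed similarity matrix `sim`.

```python
import numpy as np

def sequence_match(sim, i, j, h, tau):
    """Accept the candidate pair (i, j) if the mean similarity along the
    aligned window sim[i+k, j+k], k = -h..h, exceeds tau.

    The mean aggregation and the threshold test are assumptions.
    """
    n, m = sim.shape
    # Reject candidates whose window would run off either sequence.
    if i - h < 0 or j - h < 0 or i + h >= n or j + h >= m:
        return False
    offsets = np.arange(-h, h + 1)
    score = sim[i + offsets, j + offsets].mean()
    return bool(score > tau)

if __name__ == "__main__":
    sim = np.full((5, 5), 0.1)
    np.fill_diagonal(sim, 0.9)        # only aligned frames look alike
    print(sequence_match(sim, 2, 2, h=1, tau=0.5))
    print(sequence_match(sim, 2, 3, h=1, tau=0.5))
```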

Probabilistic Belief Update (PBU): The probabilistic belief update baseline maintains a discrete posterior belief b_t(v) = P(v_t = v | z_{1:t}) over the topological nodes v ∈ V at time t. Given the belief at the previous timestep, the state is first propagated via a motion model P(v_t | v_{t−1}) that constrains allowable transitions based on the graph topology: P(v_t | v_{t−...
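
The predict-then-correct structure described above is a discrete Bayes filter over graph nodes, sketched below. The transition model (uniform over a node and its neighbours) and the externally supplied per-node likelihood vector are assumptions, since the source text is truncated before the exact model is given.

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-stochastic P[v_prev, v]: uniform over v_prev and its neighbours
    (an assumed motion model that respects graph topology)."""
    A = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))
    A = (A > 0).astype(float)
    return A / A.sum(axis=1, keepdims=True)

def belief_update(belief, P, likelihood):
    """One filter step: propagate b_{t-1} through P, then reweight by the
    observation likelihood of each node and renormalize."""
    predicted = belief @ P            # sum over v' of b_{t-1}(v') P(v | v')
    posterior = predicted * likelihood
    return posterior / posterior.sum()

if __name__ == "__main__":
    chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # three nodes in a line
    P = transition_matrix(chain)
    b = belief_update(np.array([1.0, 0.0, 0.0]), P, np.array([0.1, 0.9, 0.5]))
    print(np.round(b, 3))
```

In the example, node 2 keeps zero probability after one step even though its likelihood is non-zero, because it is not reachable from node 0 in a single transition.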

Preliminaries and Notation: Let $x_t \in \mathbb{R}^n$ be the (locally Euclidean) state with motion and measurement models

$$x_t = f_t(x_{t-1}, u_{t-1}) + q_t, \quad q_t \sim \mathcal{N}(0, Q_t), \qquad z_t = h_t(x_t) + r_t, \quad r_t \sim \mathcal{N}(0, R_t), \tag{10}$$

where $Q_t, R_t \succ 0$. The GSF represents the filtering density as a finite mixture

$$p(x_{t-1} \mid z_{1:t-1}, u_{1:t-2}) = \sum_{k=1}^{K_{t-1}} w^{(k)}_{t-1}\, \mathcal{N}\big(x_{t-1}; \mu^{(k)}_{t-1}, \Sigma^{(k)}_{t-1}\big),$$

with $w^{(k)}_{t-1} \ge 0$ and $\sum_k w^{(k)}_{t-1} = 1$.

Exact Linear–Gaussian GSF: Assume linear–Gaussian models:

$$x_t = F_t x_{t-1} + B_t u_{t-1} + q_t, \qquad z_t = H_t x_t + r_t.$$

a) Prediction (per component): For each $k = 1, \ldots, K_{t-1}$,

$$\mu^{(k)}_{t|t-1} = F_t \mu^{(k)}_{t-1} + B_t u_{t-1}, \qquad \Sigma^{(k)}_{t|t-1} = F_t \Sigma^{(k)}_{t-1} F_t^\top + Q_t, \qquad w^{(k)}_{t|t-1} = w^{(k)}_{t-1}. \tag{11}$$

Thus $p(x_t \mid z_{1:t-1}, u_{1:t-1}) = \sum_k w^{(k)}_{t|t-1}\, \mathcal{N}\big(x_t; \mu^{(k)}_{t|t-1}, \Sigma^{(k)}_{t|t-1}\big)$.

b) Update (per component ...
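The per-component prediction in (11) maps directly onto array operations. The sketch below is illustrative only; the function name `gsf_predict` and the list-of-arrays layout for the mixture are assumptions, not part of the paper.

```python
import numpy as np

def gsf_predict(weights, means, covs, F, B, u, Q):
    """Apply the linear-Gaussian prediction (11) to every mixture component.

    weights: length-K sequence; means: list of (n,) arrays;
    covs: list of (n, n) arrays. The layout is a hypothetical choice.
    """
    # Means follow the deterministic part of the motion model.
    pred_means = [F @ m + B @ u for m in means]
    # Covariances are transported by F and inflated by the process noise Q.
    pred_covs = [F @ P @ F.T + Q for P in covs]
    # Prediction leaves the mixture weights unchanged.
    pred_weights = np.array(weights, dtype=float)
    return pred_weights, pred_means, pred_covs
```

The number of components does not change during prediction; growth only arises in the update when the measurement factor is itself a mixture.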

Nonlinear GSF via Local Gaussianization: For nonlinear (10), GSF applies a local Gaussian filter to each component.

a) EKF-style (per component): Linearize around the current component mean:

$$f_t(x, u) \approx f_t(\mu^{(k)}_{t-1}, u) + F^{(k)}_t \big(x - \mu^{(k)}_{t-1}\big), \qquad h_t(x) \approx h_t(\mu^{(k)}_{t|t-1}) + H^{(k)}_t \big(x - \mu^{(k)}_{t|t-1}\big),$$

where $F^{(k)}_t, H^{(k)}_t$ are Jacobians. Then apply (11)–(14) with (F...

Mixture Identities: Let $\mathcal{N}_i(x) = \mathcal{N}(x; m_i, S_i)$ for $i \in \{1, 2\}$.

a) Product of Gaussians:

$$\mathcal{N}_1(x)\, \mathcal{N}_2(x) = \mathcal{N}(m_1; m_2, S_1 + S_2)\, \mathcal{N}(x; m, S), \tag{15}$$

where $S = (S_1^{-1} + S_2^{-1})^{-1}$ and $m = S\,(S_1^{-1} m_1 + S_2^{-1} m_2)$.

b) Innovation evidence: For predicted $(\mu^-, \Sigma^-)$ and measurement $z = Hx + r$, $r \sim \mathcal{N}(0, R)$, the innovation $y = z - H\mu^-$ satisfies $y \sim \mathcal{N}(0, S)$ with $S = H \Sigma^- H^\top + R$, yielding the evidence term in (14).
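The product identity (15) is easy to check numerically. The sketch below evaluates both sides at an arbitrary point; the helper names `gaussian_pdf` and `gaussian_product` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of N(x; mean, cov) for vector x."""
    x = np.atleast_1d(x).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    d = x - mean
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

def gaussian_product(m1, S1, m2, S2):
    """Return (scale, m, S) such that N1(x) N2(x) = scale * N(x; m, S)."""
    m1 = np.atleast_1d(m1).astype(float)
    m2 = np.atleast_1d(m2).astype(float)
    S1 = np.atleast_2d(S1).astype(float)
    S2 = np.atleast_2d(S2).astype(float)
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    # Fused covariance and mean from the information-form sum.
    S = np.linalg.inv(S1_inv + S2_inv)
    m = S @ (S1_inv @ m1 + S2_inv @ m2)
    # The scalar factor is itself a Gaussian in the two means.
    scale = gaussian_pdf(m1, m2, S1 + S2)
    return scale, m, S
```

The scalar factor `scale` is what reweights a mixture component when a Gaussian measurement factor is multiplied in, which is why the same expression appears as the innovation evidence.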

Mixture Growth Control: To prevent unbounded mixture growth, GSF typically uses:

a) Pruning: Remove components with $w^{(k)}_t < \varepsilon$.

b) Reduction / merging: Iteratively merge nearby components (e.g., using a KL-based criterion) until $K_t \le K_{\max}$. Merging two components with weights $a, b$ by moment matching gives

$$\mu = \frac{a\,\mu_a + b\,\mu_b}{a + b}, \tag{16}$$

Σ = a Σa + (µa −µ)(µ a −µ) ...
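Pruning and pairwise merging can be sketched as below. The merged mean follows (16); the merged covariance uses the standard moment-matching form, which is an assumption here because the corresponding expression is cut off in the text. Renormalizing the weights after pruning is likewise an assumed (though common) choice, and the function names are hypothetical.

```python
import numpy as np

def merge_components(a, mu_a, cov_a, b, mu_b, cov_b):
    """Moment-matched merge of two weighted Gaussian components."""
    w = a + b
    mu = (a * mu_a + b * mu_b) / w
    da, db = mu_a - mu, mu_b - mu
    # Standard moment matching (assumed): within-component covariance
    # plus the spread of the component means around the merged mean.
    cov = (a * (cov_a + np.outer(da, da)) + b * (cov_b + np.outer(db, db))) / w
    return w, mu, cov

def prune(weights, means, covs, eps):
    """Drop components with weight below eps and renormalize the rest.

    Assumes at least one component survives the threshold.
    """
    keep = [k for k, w in enumerate(weights) if w >= eps]
    kept_w = np.array([weights[k] for k in keep], dtype=float)
    kept_w /= kept_w.sum()
    return kept_w, [means[k] for k in keep], [covs[k] for k in keep]
```

Moment matching preserves the first two moments of the pair, so a merge of two well-separated components yields a broad Gaussian; this is why merging is usually restricted to nearby components.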

Mixture–Mixture Update (Optional): If the measurement factor is approximated by a mixture $Q_t(x) = \sum_{c=1}^{C_t} \pi^{(c)}_t\, \mathcal{N}\big(x; \nu^{(c)}_t, \Lambda^{(c)}_t\big)$, then the update is a mixture–mixture product:

$$p(x_t \mid z_{1:t}) \propto p^-(x_t)\, Q_t(x_t), \qquad p^-(x_t) = \sum_k w^{(k)}_{t|t-1}\, \mathcal{N}\big(x_t; \mu^{(k)}_{t|t-1}, \Sigma^{(k)}_{t|t-1}\big),$$

and

$$p(x_t \mid z_{1:t}) = \sum_{k=1}^{K_{t-1}} \sum_{c=1}^{C_t} \tilde{w}_{k,c}\, \mathcal{N}(x_t; m_{k,c}, S_{k,c}). \tag{17}$$

Here $(m_{k,c}, S_{k,c})$ fol... In practice, gating and sparsification reduce the $K \times C_t$ expansion.

Manifold Adaptation (Lie Groups): On a Lie group $\mathcal{X}$ (e.g., $SE(3)$), represent each mixture component as a Gaussian in a consistent tangent chart $\phi(\cdot)$, apply the Euclidean GSF updates to $\xi_t = \phi(x_t)$, and reconstruct means via $\exp(\cdot)$. During prediction, covariances are transported through group composition using the appropriate adjoint (first-order), yielding the manifold GSF expressions used in the main text.

One-Step GSF Summary: Given $\{w^{(k)}_{t-1}, \mu^{(k)}_{t-1}, \Sigma^{(k)}_{t-1}\}_{k=1}^{K_{t-1}}$:

1) Predict: propagate each component (linear (11), or EKF/UKF/CKF per component).
2) Update: apply (12)–(14) (or (17) for mixture likelihoods).
3) Control: prune/reduce (and optionally split) to enforce $K_t \le K_{\max}$.

On manifolds, perform all steps in the chosen chart...
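The three steps can be strung together for the linear-Gaussian case as in the sketch below. Equations (12)–(14) are not reproduced in the text above, so the update here uses the textbook Kalman form with weights scaled by the innovation evidence; treat that, the function name `gsf_step`, and pruning as the only control step as assumptions.

```python
import numpy as np

def gsf_step(weights, means, covs, u, z, F, B, H, Q, R, eps=1e-3):
    """One predict/update/control cycle of a linear Gaussian-sum filter."""
    new_w, new_m, new_P = [], [], []
    for w, m, P in zip(weights, means, covs):
        # 1) Predict: per-component linear propagation, as in (11).
        m_pred = F @ m + B @ u
        P_pred = F @ P @ F.T + Q
        # 2) Update: assumed standard Kalman update per component.
        y = z - H @ m_pred                      # innovation
        S = H @ P_pred @ H.T + R                # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
        evidence = np.exp(-0.5 * y @ np.linalg.solve(S, y)) / np.sqrt(
            np.linalg.det(2.0 * np.pi * S))
        new_w.append(w * evidence)
        new_m.append(m_pred + K @ y)
        new_P.append((np.eye(len(m)) - K @ H) @ P_pred)
    new_w = np.array(new_w) / np.sum(new_w)
    # 3) Control: prune low-weight components and renormalize.
    keep = new_w >= eps
    kept_w = new_w[keep] / new_w[keep].sum()
    kept_m = [m for m, k in zip(new_m, keep) if k]
    kept_P = [P for P, k in zip(new_P, keep) if k]
    return kept_w, kept_m, kept_P
```

With two well-separated hypotheses and a measurement near one of them, the evidence term drives the other component's weight toward zero and pruning removes it. The sketch does not guard against all evidences underflowing to zero, which a practical filter would need to handle.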

Setup: Let $X_{t-1} \in SE(3)$ be distributed as $X_{t-1} = \mu \exp(\varepsilon)$ with $\varepsilon \sim \mathcal{N}(0, \Sigma) \subset \mathfrak{se}(3)$. The (right-invariant) stochastic motion model is

$$X_t = X_{t-1}\, \Delta T_t \exp(\nu_t), \qquad \nu_t \sim \mathcal{N}(0, Q_t) \subset \mathfrak{se}(3),$$

with $\nu_t$ independent of $\varepsilon$.

Prediction Kernel: For a single mixand, the predicted density is

$$\bar{p}(x_t) = \int p(x_t \mid x_{t-1})\, \mathcal{N}_{\mathfrak{se}(3)}\big(\log(\mu^{-1} x_{t-1}); 0, \Sigma\big)\, dx_{t-1}.$$

First-Order Pushforward: Write $X_{t-1} = \mu \exp(\varepsilon)$. Using the group adjoint and BCH,

$$\mu \exp(\varepsilon)\, \Delta T_t = \mu\, \Delta T_t \exp\!\big(\mathrm{Ad}_{\Delta T_t^{-1}}\, \varepsilon + O(\|\varepsilon\|^2)\big).$$

Post-multiplying by $\exp(\nu_t)$ and applying BCH again yields

$$\mu\, \Delta T_t \exp\!\big(\mathrm{Ad}_{\Delta T_t^{-1}}\, \varepsilon\big) \exp(\nu_t) = \mu\, \Delta T_t \exp\!\big(\mathrm{Ad}_{\Delta T_t^{-1}}\, \varepsilon + \nu_t + O(\|\varepsilon\|^2 + \|\nu_t\|^2)\big).$$

Neglecting higher-order terms, the updated error in the right-invariant chart at the predicted mea...
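The adjoint step can be checked numerically, and the first-order error $\mathrm{Ad}_{\Delta T_t^{-1}}\varepsilon + \nu_t$ with independent $\varepsilon$ and $\nu_t$ suggests a predicted covariance of the form $\mathrm{Ad}\,\Sigma\,\mathrm{Ad}^\top + Q_t$. The sketch below assumes a translation-first twist ordering $(\rho, \phi)$ and a truncated-series matrix exponential; neither convention is stated in the text, and all function names are hypothetical.

```python
import numpy as np

def skew(v):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def hat(xi):
    """Map a twist xi = (rho, phi) to a 4x4 se(3) matrix (assumed ordering)."""
    M = np.zeros((4, 4))
    M[:3, :3] = skew(xi[3:])
    M[:3, 3] = xi[:3]
    return M

def exp_se3(xi, terms=30):
    """Matrix exponential of hat(xi) by truncated power series."""
    A = hat(xi)
    out, term = np.eye(4), np.eye(4)
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

def adjoint(T):
    """6x6 adjoint of T in SE(3) for the (rho, phi) twist ordering."""
    R, t = T[:3, :3], T[:3, 3]
    Ad = np.zeros((6, 6))
    Ad[:3, :3] = R
    Ad[:3, 3:] = skew(t) @ R
    Ad[3:, 3:] = R
    return Ad

def predict_covariance(Sigma, delta_T, Q):
    """First-order transported covariance: Ad Sigma Ad^T + Q (assumed form)."""
    Ad = adjoint(np.linalg.inv(delta_T))
    return Ad @ Sigma @ Ad.T + Q
```

A quick check is to compare `exp_se3(eps) @ delta_T` against `delta_T @ exp_se3(adjoint(np.linalg.inv(delta_T)) @ eps)` for a small `eps`; the two agree to numerical precision, since moving a perturbation across a fixed transform is exactly an adjoint action.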

Mixtures and Weights: Since $\int p(x_t \mid x_{t-1})\, dx_t = 1$, prediction preserves mixture weights: if $p(x) = \sum_k w_k\, p_k(x)$ then $\bar{p}(x) = \sum_k w_k\, \bar{p}_k(x)$.

Small-Increment Approximation: If $\Delta T_t = \exp(\xi_t)$ with $\|\xi_t\| \ll 1$, then $\mathrm{Ad}_{\Delta T_t^{-1}} = I - \mathrm{ad}(\xi_t) + O(\|\xi_t\|^2)$, and the transported covariance expands as

$$\mathrm{Ad}_{\Delta T_t^{-1}}\, \Sigma\, \mathrm{Ad}_{\Delta T_t^{-1}}^\top = \Sigma - \mathrm{ad}(\xi_t)\, \Sigma - \Sigma\, \mathrm{ad}(\xi_t)^\top + O(\|\xi_t\|^2 \|\Sigma\|).$$

At high update rates (small $\|\xi_t\|$) and when covariances are maintained in the updated right-invariant chart, a common conservative approximation is Σ− ≈Σ...