arxiv: 2605.21935 · v1 · pith:622R72ZLnew · submitted 2026-05-21 · 💻 cs.RO

Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid navigation · dynamic environments · 3D Gaussian Splatting · scene memory · discrepancy detection · semantic mapping · manipulation safety

The pith

A multi-modal interactive field updates humanoid scene memory only for real changes, raising relocation success from 12% to 94% on a Unitree-G1 in a real dynamic office.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the Multi-modal Interactive Field to give humanoid robots reliable scene memory even when the environment changes and their own walking distorts what they see. It does this by combining three specialized fields for appearance, space, and geometry, then using a discrepancy score to decide which parts of the memory need updating. The approach is important because existing mapping tools assume nothing moves and cameras stay steady, which rarely holds for a walking robot trying to pick up objects in a real office. If successful, it means robots can keep working without their internal map becoming outdated or too large to handle.

Core claim

By coupling an uncertainty-aware 3D Gaussian Splatting Appearance Field that reduces gait-induced blur, a Spatial Field for topological memory, and a Geometry Field for safe interaction poses, the system uses a discrepancy detection score to perform local memory updates only on persistent environmental changes rather than false positives from locomotion.

What carries the argument

The Multi-modal Interactive Field (MIF) that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop pipeline.
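
The confidence-aware part of this pipeline can be pictured as a simple filter over primitives. The following is a minimal sketch, assuming each Gaussian carries a scalar reliability in [0, 1]; the `Gaussian` class, its fields, and the `min_reliability` threshold are hypothetical names chosen for illustration and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    gid: int            # hypothetical primitive id
    opacity: float      # rendering opacity
    reliability: float  # assumed per-primitive confidence in [0, 1]


def reliability_gate(gaussians: List[Gaussian], min_reliability: float = 0.6) -> List[Gaussian]:
    """Keep only primitives whose reliability meets the (assumed) threshold.

    Primitives below the threshold are treated as possibly gait-corrupted and
    are excluded from rendering and scene-graph construction.
    """
    return [g for g in gaussians if g.reliability >= min_reliability]


def kept_ids(gaussians: List[Gaussian], min_reliability: float = 0.6) -> List[int]:
    """Convenience helper returning the ids that pass the gate."""
    return [g.gid for g in reliability_gate(gaussians, min_reliability)]
```

A hard threshold is the simplest possible gate; a soft weighting of each primitive by its reliability would be an equally plausible reading of the description.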

If this is right

  • Improves relocation success in non-static environments from 12% to 94% on a Unitree-G1 humanoid.
  • Reduces semantic memory footprint by 91.4% through feature distillation for practical online operation.
  • Supports Interaction Pose Safety using the Geometry Field before manipulation tasks.
  • Maintains reliable memory under locomotion-induced perceptual distortion without assuming stable camera trajectories or static scenes.
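
The two headline numbers above can be restated with simple arithmetic. The sketch below uses hypothetical helper names and only the percentages quoted on this page: a 91.4% footprint reduction leaves 8.6% of the original size, and the success rate rises by 82 percentage points.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the original footprint left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def compression_ratio(reduction_pct: float) -> float:
    """How many times smaller the reduced footprint is than the original."""
    return 1.0 / remaining_fraction(reduction_pct)


def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return after_pct - before_pct


ratio = round(compression_ratio(91.4), 1)   # 11.6, i.e. roughly 11.6x smaller
gain = percentage_point_gain(12.0, 94.0)    # 82.0 percentage points
```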

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Discrepancy-based update logic could apply to other mobile robots that must separate self-motion artifacts from true scene changes.
  • Over longer deployments the evolving memory might support autonomous operation across multiple days in shared spaces.
  • Pairing the geometry field with task planners could improve path choices that respect manipulation safety margins.

Load-bearing premise

The discrepancy detection score reliably separates locomotion-induced false-positive changes from persistent environmental changes without missing critical updates or introducing new errors in the memory model.
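
One way to picture this premise is a persistence gate: a region is updated only when its discrepancy score stays high over several consecutive observations, so brief spikes are ignored. This is an illustrative sketch, not the paper's score. The class name, the 0.5 threshold, and the three-frame persistence window are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


class PersistenceGate:
    """Flags a region for a local update only after sustained discrepancy."""

    def __init__(self, threshold: float = 0.5, min_consecutive: int = 3):
        self.threshold = threshold              # assumed score cutoff
        self.min_consecutive = min_consecutive  # assumed persistence window
        self._streak = defaultdict(int)         # consecutive high-score count per region

    def observe(self, region_scores: Dict[str, float]) -> List[str]:
        """Consume one frame of per-region scores and return regions to update."""
        to_update = []
        for region, score in region_scores.items():
            if score > self.threshold:
                self._streak[region] += 1
            else:
                self._streak[region] = 0        # a transient spike resets the streak
            if self._streak[region] >= self.min_consecutive:
                to_update.append(region)
                self._streak[region] = 0        # start over after triggering an update
        return sorted(to_update)


def run_gate(frames: List[Dict[str, float]]) -> List[List[str]]:
    """Feed a sequence of frames through a fresh gate with default settings."""
    gate = PersistenceGate()
    return [gate.observe(frame) for frame in frames]
```

With frames where a hypothetical "sofa" region scores high three times in a row and a "desk" region spikes only intermittently, only the sofa region is flagged, and only on the third frame. Whether a real score behaves this cleanly under gait-induced distortion is exactly what the premise leaves open.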

What would settle it

A test in a static office where the robot walks extensively and memory updates remain near zero, versus a test where objects are moved and updates occur only in the affected local regions, would show whether the score separates the two cases correctly.

Figures

Figures reproduced from arXiv: 2605.21935 by Hong Liu, Jin Jin, Peifeng Jiang, WenShuai Wang, Xia Li.

Figure 1. Multi-modal Interactive Fields (MIF) for Robust Humanoid Navigation. We propose a hierarchical framework composed of three coupled fields: (1) Appearance Field: Provides dense semantic grounding for robust view synthesis. (2) Spatial Field: Maintains a dynamic topological Scene Graph that updates upon object relocation, enabling seamless re-planning from obsolete paths to correct trajectories. (3) Geometry…
Figure 1. For P1, the Appearance Field builds a confidence-aware semantic 3D Gaussian Splatting representation in which each Gaussian carries a reliability estimate for identifying gait-corrupted primitives. This reliability gate suppresses gait-corrupted primitives during rendering and scene-graph construction, reducing the propagation of locomotion artifacts into semantic grounding. For interaction-level safety, …
Figure 2. Overview of the Multi-modal Interactive Fields (MIF) framework. (Red) Incremental Appearance Field: Constructs a dense semantic base by fusing 3DGS SLAM [50] with multi-modal features [33, 38] to generate confidence-aware maps. (Blue) Spatial Field: Abstracts the dense map into a topological Scene Graph, utilizing VLMs [20] to reason over object relationships from synthesized views. (Green) Geometry Field:…
Figure 3. Humanoid navigation control via Pure Pursuit. The schematic illustrates the geometric tracking logic where the Unitree-G1 robot targets a lookahead point at distance L along the planned trajectory. The heading error Δθ is used to compute the required curvature κ = 2 sin(Δθ)/L. The overlaid velocity curve demonstrates the adaptive scaling mechanism, where L is dynamically adjusted to suppress oscillations …
Figure 4. Dynamic adaptation of the hierarchical scene graph to environmental changes. (a) Detection: Upon arriving at the goal, the G1 robot detects a structural discrepancy (e.g., a moved sofa) that contradicts its prior memory, triggering an increase in the discrepancy score D. (b) Resolution: MIF initiates a local update of the Spatial Field with fresh observations. The updated graph better aligns with the physi…
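
The Pure Pursuit relation quoted in the Figure 3 caption, κ = 2 sin(Δθ)/L, can be sketched in a few lines. The curvature formula comes from the caption; the adaptive lookahead rule (linear in speed, clamped between `l_min` and `l_max`) and all default values are assumptions for illustration, since the caption is truncated before it says how L is adjusted.

```python
import math
from typing import Tuple


def pure_pursuit_curvature(heading_error_rad: float, lookahead_m: float) -> float:
    """Curvature kappa = 2 * sin(delta_theta) / L from the Figure 3 caption."""
    if lookahead_m <= 0.0:
        raise ValueError("lookahead distance must be positive")
    return 2.0 * math.sin(heading_error_rad) / lookahead_m


def adaptive_lookahead(speed_mps: float, l_min: float = 0.4,
                       l_max: float = 1.5, gain_s: float = 1.0) -> float:
    """Hypothetical scaling rule: lookahead grows with speed, clamped to a range."""
    return max(l_min, min(l_max, gain_s * speed_mps))


def velocity_command(heading_error_rad: float, speed_mps: float) -> Tuple[float, float]:
    """Return (linear velocity, angular velocity), with omega = v * kappa."""
    lookahead = adaptive_lookahead(speed_mps)
    kappa = pure_pursuit_curvature(heading_error_rad, lookahead)
    return speed_mps, speed_mps * kappa
```

A longer lookahead yields a smaller curvature for the same heading error, which is why lengthening L tends to damp oscillation at the cost of cutting corners.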
Abstract

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the Multi-modal Interactive Field (MIF) for robust humanoid navigation and manipulation-oriented tasks in dynamic environments. It couples an uncertainty-aware 3D Gaussian Splatting Appearance Field to suppress gait-induced blur, a Spatial Field that maintains topological memory via discrepancy-triggered local updates, and a Geometry Field supporting Interaction Pose Safety (IPS). A discrepancy detection score is proposed to separate locomotion-induced false-positive changes from persistent environmental changes. Real-robot experiments on a Unitree-G1 humanoid in a dynamic office report relocation success rising from 12% to 94% versus static scene-graph memory, alongside a 91.4% reduction in semantic memory footprint through feature distillation.

Significance. If the performance gains are substantiated by rigorous validation, the work could meaningfully advance practical deployment of humanoids in non-static settings by addressing perceptual distortions from locomotion and enabling efficient online memory management. The closed-loop pipeline and multi-modal field integration offer a concrete approach to combining semantic mapping with geometric safety constraints.

major comments (1)
  1. The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.
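
The metrics requested in this comment are straightforward to compute once each observation is labeled. A minimal sketch follows, assuming one boolean per observation for "update triggered" and one for "real change present"; the function name and input layout are hypothetical and are not the paper's evaluation protocol.

```python
from typing import Dict, Sequence


def detection_metrics(predicted: Sequence[bool], actual: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, and false-positive rate for a binary change detector.

    predicted[i] is True when the score triggered an update on observation i;
    actual[i] is True when a real, persistent change was present.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)

    def ratio(num: int, den: int) -> float:
        # An undefined ratio (empty denominator) is reported as NaN, not 0.
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }
```

On a locomotion-only sequence every `actual` label is False, so precision and recall are undefined and the false-positive rate is the informative number. This is why the comment asks for the two kinds of sequence to be reported separately.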

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which identifies a key area for strengthening the validation of our discrepancy detection score. We address the concern point by point below and have revised the manuscript to incorporate additional quantitative analysis.

point-by-point responses
  1. Referee: The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.

    Authors: We agree that direct metrics on the discrepancy detection score would provide stronger, more targeted validation of its ability to distinguish locomotion-induced artifacts from persistent changes. The original manuscript presents the 12% to 94% relocation success and 91.4% memory reduction as end-to-end outcomes of the full closed-loop pipeline, which implicitly relies on the score to prevent error accumulation in the Spatial Field; the real-robot results in dynamic offices serve as the primary evidence that misclassifications did not materially degrade performance. To address the referee's point explicitly, we have added a new evaluation subsection reporting precision, recall, and false-positive rates on separate locomotion-only sequences (no object motion) versus sequences with controlled object displacements. These results show low false-positive rates under gait-induced conditions while maintaining high detection accuracy for actual changes, directly supporting the load-bearing role of the score. We believe this addition substantiates the claims without altering the core contributions.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical system validated on hardware

The paper presents MIF as an integrated perception-adaptation pipeline for humanoid navigation, coupling Appearance, Spatial, and Geometry Fields with a discrepancy detection score for local updates. All reported outcomes (12% to 94% relocation success, 91.4% memory reduction) are measured results from closed-loop experiments on a Unitree-G1 in a real dynamic office, compared against static scene-graph baselines. No equations, fitted parameters, or predictions are shown to reduce to their own inputs by construction. The discrepancy score is introduced as a design element whose separation of locomotion artifacts from persistent changes is assessed via experimental performance rather than self-definition or prior self-citation. The derivation chain consists of engineering choices and empirical validation with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about perceptual distortion in locomotion and the separability of change types, plus one new integrated system entity; no explicit free parameters are named but the discrepancy score implies tuning.

free parameters (1)
  • discrepancy detection score threshold
    Used to decide when to trigger local memory updates by separating locomotion artifacts from real changes; value not stated but required for the pipeline to function as described.
axioms (2)
  • domain assumption: Locomotion-induced perceptual distortion can be isolated from persistent environmental changes using a discrepancy score.
    Invoked to justify selective memory updates in the closed-loop pipeline.
  • domain assumption: Uncertainty-aware 3D Gaussian Splatting can suppress gait-induced blur sufficiently for reliable semantic mapping.
    Basis for the Appearance Field component.
invented entities (1)
  • Multi-modal Interactive Field (MIF) · no independent evidence
    purpose: To couple appearance, spatial, and geometry fields in a perception-adaptation loop for dynamic humanoid navigation.
    New system construct introduced to organize the three fields and update logic.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 2 internal anchors

  1. [1]

    Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5664–5673, 2019

  2. [2]

    Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988

  3. [3]

    Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, and Holger Voos. Situational graphs for robot navigation in structured indoor environments. IEEE Robotics and Automation Letters, 7(4):9107–9114, 2022

  4. [4]

    Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav S Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33(6):1273–1291, 2017

  5. [5]

    Georgia Chalvatzaki, Ali Younes, Daljeet Nandha, An Thai Le, Leonardo FR Ribeiro, and Iryna Gurevych. Learning to reason over scene graphs: a case study of finetuning gpt-2 into a robot language model for grounded task planning. Frontiers in Robotics and AI, 10:1221739, 2023

  6. [6]

    Matthew Chang, Théophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, et al. Goat: Go to any thing. In Robotics: Science and Systems (RSS), 2024

  7. [7]

    Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In International Conference on Learning Representations (ICLR), 2020

  8. [8]

    Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems (NeurIPS), 33:4247–4258, 2020

  9. [9]

    Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, and Fisher Yu. How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. 2024

  10. [10]

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025

  11. [11]

    Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. Bsp-net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 45–54, 2020

  12. [12]

    Angela Dai and Matthias Nießner. Scan2mesh: From unstructured range scans to 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5574–5583, 2019

  13. [13]

    Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, and Federico Tombari. Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. In International Conference on Learning Representations (ICLR), volume 2024, pages 8396–8407, 2024

  14. [14]

    Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23171–23181, 2023

  15. [15]

    Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 216–224, 2018

  16. [16]

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028, 2024

  17. [17]

    Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, and Tom Goldstein. Pup 3d-gs: Principled uncertainty pruning for 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5949–5958, 2025

  18. [18]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems (NeurIPS), 33:6840–6851, 2020

  19. [19]

    Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception engine for 3d scene graph construction and optimization. Robotics: Science and Systems (RSS), 2022

  20. [20]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  21. [21]

    Shuuji Kajita, Fumio Kanehiro, Kenji Kaneko, Kazuhito Yokoi, and Hirohisa Hirukawa. The 3d linear inverted pendulum mode: A simple modeling for a biped walking pattern generation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 239–246, 2001

  22. [22]

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  23. [23]

    Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19729–19739, 2023

  24. [24]

    Ayoung Kim and Ryan M Eustice. Perception-driven navigation: Active visual slam for robotic area coverage. In IEEE International Conference on Robotics and Automation (ICRA), pages 3196–3203, 2013

  25. [25]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023

  26. [26]

    Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems (NeurIPS), 35:23311–23330, 2022

  27. [27]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023

  28. [28]

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  29. [29]

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics, 41(4):1–15, 2022

  30. [30]

    Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning (ICML), pages 7220–7229. PMLR, 2020

  31. [31]

    Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, and Long Zeng. Grid: Scene-graph-based instruction-driven robotic task planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13765–13772, 2024

  32. [32]

    Jeongtaek Oh, Jaeyoung Chung, Dongwoo Lee, and Kyoung Mu Lee. Deblurgs: Gaussian splatting for camera motion blur. arXiv preprint arXiv:2404.11358, 2024

  33. [33]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, pages 1–31, 2024

  34. [34]

    Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–824, 2023

  35. [35]

    Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, 2024

  36. [36]

    Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Language-driven physics-based scene synthesis and editing via feature splatting. In European conference on computer vision (ECCV), pages 368–383, 2024

  37. [37]

    Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, et al. Ros: an open-source robot operating system. In IEEE International Conference on Robotics and Automation Workshop on Open Source Software, volume 3, page 5, 2009

  38. [38]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  39. [39]

    Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In Conference on Robot Learning (CoRL), pages 23–72. PMLR, 2023

  40. [40]

    Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, and Luca Carlone. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. In Robotics: Science and Systems (RSS), 2020

  41. [41]

    Lukas Schmid, Marcus Abate, Yun Chang, and Luca Carlone. Khronos: A unified approach for spatio-temporal metric-semantic slam in dynamic environments. 2024

  42. [42]

    Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, 2024

  43. [43]

    Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19615–19625, 2024

  44. [44]

    Olivier Stasse, Andrew J Davison, Ramzi Sellaouti, and Kazuhito Yokoi. Real-time 3d slam for humanoid robot considering pattern generator information. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 348–355, 2006

  45. [45]

    Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3961–3970, 2020

  46. [46]

    Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Robotics: Science and Systems (RSS), 2024

  47. [47]

    Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, and Maani Ghaffari. Latent-bki: Open-dictionary continuous mapping in visual-language latent spaces with quantifiable uncertainty. IEEE Robotics and Automation Letters, 2025

  48. [48]

    Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7515–7525, 2021

  49. [49]

    Mengya Xu, Yang Song, Yongbo Chen, Shoudong Huang, and Qi Hao. Invariant ekf based 2d active slam with exploration task. In IEEE International Conference on Robotics and Automation (ICRA), pages 5350–5356, 2021

  50. [50]

    Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19595–19604, 2024

  51. [51]

    Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In European conference on computer vision (ECCV), pages 162–179. Springer, 2024

  52. [52]

    Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. Advances in neural information processing systems (NeurIPS), 37:5285–5307, 2024

  53. [53]

    Vlfm: Vision-language frontier maps for zero-shot semantic navigation

    Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 42–48, 2024

  54. [54]

    Perception-aware planning for active slam in dynamic environments

    Yao Zhao, Zhi Xiong, Shuailin Zhou, Jingqi Wang, Ling Zhang, and Pascual Campoy. Perception-aware planning for active slam in dynamic environments. Remote Sensing, 14(11):2584, 2022

  55. [55]

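    Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation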

    Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8120–8132, 2025. Le...

  56. [56]


    Robot Platform and Sensors: We deploy our system on the Unitree G1 humanoid robot. The robot is equipped with a head-mounted Intel RealSense D435i RGB-D camera, configured to provide synchronized color and depth streams at a resolution of 640×480 (30 Hz). Camera intrinsics are calibrated offline, and the camera-to-base transform is obtained from the calibrat...

  57. [57]


    Computation and Communication: We split computation between onboard control and offboard perception/reasoning:
    • Onboard Compute: The robot’s built-in NVIDIA Jetson Orin NX handles robot-provided low-level control interfaces, state feedback (IMU/Odometry), and hardware bridging.
    • Offboard Compute: The high-load MIF modules, including 3DGS mapping, VLM-a...

  58. [58]


    Server Specifications: The offboard workstation uses commercial desktop hardware. It is equipped with an Intel Core i9-13900K CPU, 64 GB DDR5 RAM, and an NVIDIA RTX 4090 GPU with 24 GB VRAM. This setup was sufficient for the reported 22 FPS Appearance Field updates and approximately 6.2 s Geometry Field generation latency in our experiments.

    B. Software Stack

  59. [59]


    3DGS Backend: The Appearance Field is implemented on an incremental Gaussian Splatting backend. We use differentiable Gaussian rasterization libraries and CUDA kernels for primitive updates and rendering. The SLAM frontend is implemented in Python/PyTorch, leveraging diff-gaussian-rasterization-w-depth for differentiable rasterization with depth supervi...

  60. [60]


    VLM & Prompt Engineering: For Spatial Field (F_spat) construction, we use GPT-4o (OpenAI) to parse stabilized rendered views into structured scene-graph evidence. The parsed outputs are post-processed into scene graph nodes and relations. To encourage structured output for topological parsing, we use the JSON-style prompt template shown in Fig. S.1. This p...

  61. [61]


    Flow Matching Model Architecture: The Geometry Field (F_geom) uses a pre-trained conditional Flow Matching model [10]. Using a pre-trained model avoids training a geometry prior from scratch and provides object-level mesh hypotheses for IPS verification.
    • Backbone Architecture: We utilize a pre-trained Flow-Matching Transformer backbone that processes lat...

  62. [62]


    System Architecture Implementation: Fig. S.2 summarizes the onboard/offboard dataflow used by MIF. The system comprises distinct computational modules distributed as follows:
    Onboard Computation (Unitree G1 with Jetson Orin): The onboard layer provides sensor streams, robot state feedback, and low-level command execution.
    • Sensors: Acquires raw data ...

  63. [63]


    Robustness of Appearance Field: Jitter Suppression & Dynamic Adaptation: We visualize rendered RGB/depth quality from the Appearance Field F_app under humanoid walking. Rows 1-3 of Fig. S.3 illustrate the effect of confidence gating during high-speed walking (0.5 m/s). The ungated variant shows ghosting artifacts in RGB and noisier depth estimates (e.g., D...

  64. [64]


    Generated Geometry for IPS Checking: To illustrate why sparse point clouds are insufficient for IPS collision checking, we present a comparative analysis in Fig. S.4. The raw 3D Gaussian point cloud (Left panels), while useful for visual navigation, remains sparse and lacks surface connectivity. Gaps between centroids may make collision checks overly opt...
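The concern about gaps between Gaussian centroids can be illustrated with a small, self-contained sketch. It is not the paper's IPS checker. The wall patch, the spherical "hand", the 0.5 m centroid spacing, and all function names are hypothetical choices made for illustration, and the filled rectangular patch merely stands in for a connected surface such as a generated mesh.

```python
import math

def centroid_grid(x, size, spacing):
    """Sparse centroids sampled on a square wall patch in the plane x = const (hypothetical scene)."""
    steps = int(round(size / spacing)) + 1
    return [(x, i * spacing, j * spacing) for i in range(steps) for j in range(steps)]

def collides_with_points(center, radius, points):
    """Point-only check: reports a hit only if some centroid lies inside the sphere."""
    return any(math.dist(center, p) <= radius for p in points)

def collides_with_surface(center, radius, x, size):
    """Surface check: uses the closest point on the filled patch, so gaps cannot hide contact."""
    _, cy, cz = center
    closest = (x, min(max(cy, 0.0), size), min(max(cz, 0.0), size))
    return math.dist(center, closest) <= radius

# Assumed values: a 1 m x 1 m wall at x = 1.0 sampled every 0.5 m, and a 0.1 m sphere
# placed directly on the wall but midway between neighbouring centroids.
wall_x, wall_size = 1.0, 1.0
centroids = centroid_grid(wall_x, wall_size, spacing=0.5)
hand, hand_radius = (1.0, 0.25, 0.25), 0.1

point_hit = collides_with_points(hand, hand_radius, centroids)
surface_hit = collides_with_surface(hand, hand_radius, wall_x, wall_size)
print(point_hit, surface_hit)
```

In this toy setup the sphere sits on the wall, yet the nearest centroid is about 0.35 m away, so the point-only test misses the contact while the surface test catches it. That is the kind of overly optimistic result that a connected surface representation is meant to avoid.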