Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments
Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3
The pith
A multi-modal interactive field updates humanoid scene memory only for real changes, raising relocation success from 12 to 94 percent in dynamic offices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By coupling an uncertainty-aware 3D Gaussian Splatting Appearance Field that reduces gait-induced blur, a Spatial Field for topological memory, and a Geometry Field for safe interaction poses, the system uses a discrepancy detection score to perform local memory updates only on persistent environmental changes rather than false positives from locomotion.
What carries the argument
The Multi-modal Interactive Field (MIF) that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop pipeline.
If this is right
- Improves relocation success in non-static environments from 12% to 94% on a Unitree-G1 humanoid.
- Reduces semantic memory footprint by 91.4% through feature distillation for practical online operation.
- Supports Interaction Pose Safety using the Geometry Field before manipulation tasks.
- Maintains reliable memory under locomotion-induced perceptual distortion without assuming stable camera trajectories or static scenes.
Where Pith is reading between the lines
- Discrepancy-based update logic could apply to other mobile robots that must separate self-motion artifacts from true scene changes.
- Over longer deployments the evolving memory might support autonomous operation across multiple days in shared spaces.
- Pairing the geometry field with task planners could improve path choices that respect manipulation safety margins.
Load-bearing premise
The discrepancy detection score reliably separates locomotion-induced false-positive changes from persistent environmental changes without missing critical updates or introducing new errors in the memory model.
What would settle it
A test in a static office where the robot walks extensively and memory updates remain near zero, versus a test where objects are moved and updates occur only in the affected local regions, would show whether the score separates the two cases correctly.
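The test described above can be mimicked offline with a toy update rule. The sketch below is an illustration only and is not the paper's score: it assumes a hypothetical per-region discrepancy value in [0, 1] for each frame, and it triggers a local update only when that value stays above a threshold for several consecutive frames, so that brief gait-induced spikes are ignored. The names `PersistenceGate`, `threshold`, and `min_frames`, and all numeric values, are assumptions made for this example.

```python
from collections import defaultdict


class PersistenceGate:
    """Toy gate: flag a region only after sustained discrepancy.

    Hypothetical stand-in for the paper's discrepancy detection score.
    """

    def __init__(self, threshold=0.5, min_frames=5):
        self.threshold = threshold        # assumed score cutoff
        self.min_frames = min_frames      # assumed persistence window
        self._streak = defaultdict(int)   # consecutive high-score frames per region

    def observe(self, scores):
        """scores: dict mapping region id -> discrepancy in [0, 1] for one frame.

        Returns the set of regions whose memory should be updated on this frame.
        """
        to_update = set()
        for region, score in scores.items():
            if score > self.threshold:
                self._streak[region] += 1
            else:
                self._streak[region] = 0  # one consistent frame resets the streak
            if self._streak[region] == self.min_frames:
                to_update.add(region)     # fire once when the window is first filled
        return to_update


def count_updates(gate, frames):
    """Feed a sequence of per-frame score dicts and list the triggered regions."""
    triggered = []
    for scores in frames:
        triggered.extend(sorted(gate.observe(scores)))
    return triggered


# Static scene while walking: short spikes on alternating frames (synthetic data).
walking = [{"desk": 0.9 if t % 2 == 0 else 0.1, "door": 0.2} for t in range(20)]
# Object moved: the desk region stays inconsistent from frame 5 onward (synthetic data).
moved = [{"desk": 0.8 if t >= 5 else 0.1, "door": 0.2} for t in range(20)]

static_updates = count_updates(PersistenceGate(), walking)
moved_updates = count_updates(PersistenceGate(), moved)
print(static_updates)
print(moved_updates)
```

With these synthetic inputs the walking-only sequence triggers no update, while the moved-object sequence triggers exactly one update, confined to the affected region. A real evaluation would replace the synthetic scores with the system's own per-region output.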
Original abstract
Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Multi-modal Interactive Field (MIF) for robust humanoid navigation and manipulation-oriented tasks in dynamic environments. It couples an uncertainty-aware 3D Gaussian Splatting Appearance Field to suppress gait-induced blur, a Spatial Field that maintains topological memory via discrepancy-triggered local updates, and a Geometry Field supporting Interaction Pose Safety (IPS). A discrepancy detection score is proposed to separate locomotion-induced false-positive changes from persistent environmental changes. Real-robot experiments on a Unitree-G1 humanoid in a dynamic office report relocation success rising from 12% to 94% versus static scene-graph memory, alongside a 91.4% reduction in semantic memory footprint through feature distillation.
Significance. If the performance gains are substantiated by rigorous validation, the work could meaningfully advance practical deployment of humanoids in non-static settings by addressing perceptual distortions from locomotion and enabling efficient online memory management. The closed-loop pipeline and multi-modal field integration offer a concrete approach to combining semantic mapping with geometric safety constraints.
major comments (1)
- The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies a key area for strengthening the validation of our discrepancy detection score. We address the concern point by point below and have revised the manuscript to incorporate additional quantitative analysis.
Point-by-point responses
Referee: The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.
Authors: We agree that direct metrics on the discrepancy detection score would provide stronger, more targeted validation of its ability to distinguish locomotion-induced artifacts from persistent changes. The original manuscript presents the 12% to 94% relocation success and 91.4% memory reduction as end-to-end outcomes of the full closed-loop pipeline, which implicitly relies on the score to prevent error accumulation in the Spatial Field; the real-robot results in dynamic offices serve as the primary evidence that misclassifications did not materially degrade performance. To address the referee's point explicitly, we have added a new evaluation subsection reporting precision, recall, and false-positive rates on separate locomotion-only sequences (no object motion) versus sequences with controlled object displacements. These results show low false-positive rates under gait-induced conditions while maintaining high detection accuracy for actual changes, directly supporting the load-bearing role of the score. We believe this addition substantiates the claims without altering the core contributions.
Revision: yes
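For readers who want to see how the three requested metrics relate, here is a minimal sketch of the standard definitions computed from binary decisions. The labels in the example are synthetic placeholders, not results from the paper, and the function name `detection_metrics` is an assumption made for this illustration.

```python
def detection_metrics(predicted, actual):
    """Compute precision, recall, and false-positive rate.

    predicted, actual: equal-length lists of booleans, one per evaluated
    sample (for example, one region in one frame). True means "changed".
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)

    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else nan,
    }


# Synthetic example only: 10 locomotion-only samples (nothing changed, one
# false alarm) followed by 5 samples with a real change (one missed).
actual = [False] * 10 + [True] * 5
predicted = [False] * 9 + [True] + [True] * 4 + [False]

metrics = detection_metrics(predicted, actual)
print(metrics)
```

Reporting the false-positive rate on locomotion-only sequences separately from recall on object-motion sequences keeps the two failure modes apart: spurious updates while walking, and missed updates after a real change.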
Circularity Check
No significant circularity: empirical system validated on hardware
The paper presents MIF as an integrated perception-adaptation pipeline for humanoid navigation, coupling Appearance, Spatial, and Geometry Fields with a discrepancy detection score for local updates. All reported outcomes (12% to 94% relocation success, 91.4% memory reduction) are measured results from closed-loop experiments on a Unitree-G1 in a real dynamic office, compared against static scene-graph baselines. No equations, fitted parameters, or predictions are shown to reduce to their own inputs by construction. The discrepancy score is introduced as a design element whose separation of locomotion artifacts from persistent changes is assessed via experimental performance rather than self-definition or prior self-citation. The derivation chain consists of engineering choices and empirical validation with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- discrepancy detection score threshold
axioms (2)
- domain assumption: Locomotion-induced perceptual distortion can be isolated from persistent environmental changes using a discrepancy score.
- domain assumption: Uncertainty-aware 3D Gaussian Splatting can suppress gait-induced blur sufficiently for reliable semantic mapping.
invented entities (1)
- Multi-modal Interactive Field (MIF): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5664–5673, 2019.
- [2] Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.
- [3] Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, and Holger Voos. Situational graphs for robot navigation in structured indoor environments. IEEE Robotics and Automation Letters, 7(4):9107–9114, 2022.
- [4] Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav S Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33(6):1273–1291, 2017.
- [5] Georgia Chalvatzaki, Ali Younes, Daljeet Nandha, An Thai Le, Leonardo FR Ribeiro, and Iryna Gurevych. Learning to reason over scene graphs: a case study of finetuning gpt-2 into a robot language model for grounded task planning. Frontiers in Robotics and AI, 10:1221739, 2023.
- [6] Matthew Chang, Théophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, et al. Goat: Go to any thing. In Robotics: Science and Systems (RSS), 2024.
- [7] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In International Conference on Learning Representations (ICLR), 2020.
- [8] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems (NeurIPS), 33:4247–4258, 2020.
- [9] Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, and Fisher Yu. How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. 2024.
- [10] Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025.
- [11] Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. Bsp-net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 45–54, 2020.
- [12] Angela Dai and Matthias Nießner. Scan2mesh: From unstructured range scans to 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5574–5583, 2019.
- [13] Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, and Federico Tombari. Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. In International Conference on Learning Representations (ICLR), volume 2024, pages 8396–8407, 2024.
- [14] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23171–23181, 2023.
- [15] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 216–224, 2018.
- [16] Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028, 2024.
- [17] Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, and Tom Goldstein. Pup 3d-gs: Principled uncertainty pruning for 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5949–5958, 2025.
- [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems (NeurIPS), 33:6840–6851, 2020.
- [19] Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception engine for 3d scene graph construction and optimization. Robotics: Science and Systems (RSS), 2022.
- [20] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [21] Shuuji Kajita, Fumio Kanehiro, Kenji Kaneko, Kazuhito Yokoi, and Hirohisa Hirukawa. The 3d linear inverted pendulum mode: A simple modeling for a biped walking pattern generation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 239–246, 2001.
- [22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
- [23] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19729–19739, 2023.
- [24] Ayoung Kim and Ryan M Eustice. Perception-driven navigation: Active visual slam for robotic area coverage. In IEEE International Conference on Robotics and Automation (ICRA), pages 3196–3203, 2013.
- [25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
- [26] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems (NeurIPS), 35:23311–23330, 2022.
- [27] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
- [28] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [29] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics, 41(4):1–15, 2022.
- [30] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning (ICML), pages 7220–7229. PMLR, 2020.
- [31] Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, and Long Zeng. Grid: Scene-graph-based instruction-driven robotic task planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13765–13772, 2024.
- [32] Jeongtaek Oh, Jaeyoung Chung, Dongwoo Lee, and Kyoung Mu Lee. Deblurgs: Gaussian splatting for camera motion blur. arXiv preprint arXiv:2404.11358, 2024.
- [33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, pages 1–31, 2024.
- [34] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–824, 2023.
- [35] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, 2024.
- [36] Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Language-driven physics-based scene synthesis and editing via feature splatting. In European conference on computer vision (ECCV), pages 368–383, 2024.
- [37] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, et al. Ros: an open-source robot operating system. In IEEE International Conference on Robotics and Automation Workshop on Open Source Software, volume 3, page 5, 2009.
- [38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.
- [39] Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In Conference on Robot Learning (CoRL), pages 23–72. PMLR, 2023.
- [40] Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, and Luca Carlone. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. In Robotics: Science and Systems (RSS), 2020.
- [41] Lukas Schmid, Marcus Abate, Yun Chang, and Luca Carlone. Khronos: A unified approach for spatio-temporal metric-semantic slam in dynamic environments. 2024.
- [42] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, 2024.
- [43] Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19615–19625, 2024.
- [44] Olivier Stasse, Andrew J Davison, Ramzi Sellaouti, and Kazuhito Yokoi. Real-time 3d slam for humanoid robot considering pattern generator information. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 348–355, 2006.
- [45] Johanna Wald, Helisa Dhamo, Nassir Navab, and Federico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3961–3970, 2020.
- [46] Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Robotics: Science and Systems (RSS), 2024.
- [47] Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, and Maani Ghaffari. Latent-bki: Open-dictionary continuous mapping in visual-language latent spaces with quantifiable uncertainty. IEEE Robotics and Automation Letters, 2025.
- [48] Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7515–7525, 2021.
- [49] Mengya Xu, Yang Song, Yongbo Chen, Shoudong Huang, and Qi Hao. Invariant ekf based 2d active slam with exploration task. In IEEE International Conference on Robotics and Automation (ICRA), pages 5350–5356, 2021.
- [50] Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19595–19604, 2024.
- [51] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In European conference on computer vision (ECCV), pages 162–179. Springer, 2024.
- [52] Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. Advances in neural information processing systems (NeurIPS), 37:5285–5307, 2024.
- [53] Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 42–48, 2024.
- [54] Yao Zhao, Zhi Xiong, Shuailin Zhou, Jingqi Wang, Ling Zhang, and Pascual Campoy. Perception-aware planning for active slam in dynamic environments. Remote Sensing, 14(11):2584, 2022.
- [55] Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8120–8132, 2025. Le...
- [56] Robot Platform and Sensors: We deploy our system on the Unitree G1 humanoid robot. The robot is equipped with a head-mounted Intel RealSense D435i RGB-D camera, configured to provide synchronized color and depth streams at a resolution of 640×480 (30 Hz). Camera intrinsics are calibrated offline, and the camera-to-base transform is obtained from the calibrat...
- [57] Computation and Communication: We split computation between onboard control and offboard perception/reasoning:
  - Onboard Compute: The robot's built-in NVIDIA Jetson Orin NX handles robot-provided low-level control interfaces, state feedback (IMU/Odometry), and hardware bridging.
  - Offboard Compute: The high-load MIF modules, including 3DGS mapping, VLM-a...
- [58] Server Specifications: The offboard workstation uses commercial desktop hardware. It is equipped with an Intel Core i9-13900K CPU, 64 GB DDR5 RAM, and an NVIDIA RTX 4090 GPU with 24 GB VRAM. This setup was sufficient for the reported 22 FPS Appearance Field updates and approximately 6.2 s Geometry Field generation latency in our experiments.
  B. Software Stack
- [59] 3DGS Backend: The Appearance Field is implemented on an incremental Gaussian Splatting backend. We use differentiable Gaussian rasterization libraries and CUDA kernels for primitive updates and rendering. The SLAM frontend is implemented in Python/PyTorch, leveraging diff-gaussian-rasterization-w-depth for differentiable rasterization with depth supervi...
- [60] VLM & Prompt Engineering: For Spatial Field (F_spat) construction, we use GPT-4o (OpenAI) to parse stabilized rendered views into structured scene-graph evidence. The parsed outputs are post-processed into scene graph nodes and relations. To encourage structured output for topological parsing, we use the JSON-style prompt template shown in Fig. S.1. This p...
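The post-processing step described in this excerpt (turning parsed VLM output into scene graph nodes and relations) can be sketched as below. The Fig. S.1 template is not reproduced on this page, so the JSON schema used here (`objects` with `id` and `label`, `relations` with `subject`, `predicate`, `object`) is purely an assumption for illustration, as is the function name `parse_scene_graph`. No model call is made; the input string stands in for a model response.

```python
import json


def parse_scene_graph(raw):
    """Turn a JSON string into (nodes, edges).

    Assumed, hypothetical schema:
      {"objects":   [{"id": str, "label": str}, ...],
       "relations": [{"subject": str, "predicate": str, "object": str}, ...]}
    Malformed objects and relations that point at unknown ids are dropped.
    """
    data = json.loads(raw)

    nodes = {}
    for obj in data.get("objects", []):
        if "id" in obj and "label" in obj:
            nodes[obj["id"]] = obj["label"]

    edges = []
    for rel in data.get("relations", []):
        subj = rel.get("subject")
        pred = rel.get("predicate")
        target = rel.get("object")
        # Keep only relations whose endpoints are both known nodes.
        if pred and subj in nodes and target in nodes:
            edges.append((subj, pred, target))

    return nodes, edges


# Stand-in for a model response; the second relation references a missing node.
response = """
{
  "objects": [
    {"id": "desk_1", "label": "desk"},
    {"id": "chair_1", "label": "chair"}
  ],
  "relations": [
    {"subject": "chair_1", "predicate": "next_to", "object": "desk_1"},
    {"subject": "mug_1", "predicate": "on", "object": "desk_1"}
  ]
}
"""

nodes, edges = parse_scene_graph(response)
print(nodes)
print(edges)
```

Dropping relations with unknown endpoints is one simple way to keep the graph consistent when a model mentions an object it did not list; a deployed system might instead add the missing node or re-query the model.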
- [61] Flow Matching Model Architecture: The Geometry Field (F_geom) uses a pre-trained conditional Flow Matching model [10]. Using a pre-trained model avoids training a geometry prior from scratch and provides object-level mesh hypotheses for IPS verification.
  - Backbone Architecture: We utilize a pre-trained Flow-Matching Transformer backbone that processes lat...
- [62] System Architecture Implementation: Fig. S.2 summarizes the onboard/offboard dataflow used by MIF. The system comprises distinct computational modules distributed as follows:
  Onboard Computation (Unitree G1 with Jetson Orin): The onboard layer provides sensor streams, robot state feedback, and low-level command execution.
  - Sensors: Acquires raw data ...
- [63] Robustness of Appearance Field: Jitter Suppression & Dynamic Adaptation: We visualize rendered RGB/depth quality from the Appearance Field F_app under humanoid walking. Rows 1-3 of Fig. S.3 illustrate the effect of confidence gating during high-speed walking (0.5 m/s). The ungated variant shows ghosting artifacts in RGB and noisier depth estimates (e.g., D...
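The excerpt does not say how the confidence used for gating is computed. As a generic illustration of the idea, and not the paper's method, the sketch below derives a per-frame confidence from camera angular speed (for example from an IMU) and keeps only frames above a cutoff. The Gaussian-shaped mapping, the reference speed `omega_ref`, the cutoff `min_confidence`, and the sample speeds are all assumptions.

```python
import math


def frame_confidence(angular_speed, omega_ref=1.0):
    """Map camera angular speed (rad/s) to a confidence in (0, 1].

    Assumed mapping: faster rotation means more motion blur and lower confidence.
    """
    return math.exp(-((angular_speed / omega_ref) ** 2))


def gate_frames(angular_speeds, min_confidence=0.5):
    """Return (frame_index, confidence) for frames that pass the gate."""
    kept = []
    for idx, speed in enumerate(angular_speeds):
        conf = frame_confidence(speed)
        if conf >= min_confidence:
            kept.append((idx, conf))
    return kept


# Synthetic angular speeds over five frames; frames 2 and 4 mimic foot-strike jolts.
speeds = [0.1, 0.2, 1.5, 0.3, 2.0]
kept = gate_frames(speeds)
kept_indices = [idx for idx, _ in kept]
print(kept_indices)
```

The retained confidences could also serve as per-frame loss weights rather than a hard keep/drop decision, which would soften the gate instead of discarding frames outright.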
- [64] Generated Geometry for IPS Checking: To illustrate why sparse point clouds are insufficient for IPS collision checking, we present a comparative analysis in Fig. S.4. The raw 3D Gaussian point cloud (Left panels), while useful for visual navigation, remains sparse and lacks surface connectivity. Gaps between centroids may make collision checks overly opt...
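The point about gaps between centroids can be made concrete with a toy geometry check. This is a sketch under stated assumptions rather than the paper's IPS procedure: a flat tabletop is represented once as sparse centroids on a 10 cm grid and once as a continuous plane (a stand-in for a generated mesh), and a 5 cm sphere stands in for an end effector. All dimensions and function names are hypothetical.

```python
import math


def point_cloud_collides(points, center, radius):
    """Collision test against isolated points: hit only if a point lies inside the sphere."""
    return any(math.dist(p, center) <= radius for p in points)


def plane_collides(plane_z, center, radius):
    """Collision test against a continuous horizontal surface z = plane_z."""
    return abs(center[2] - plane_z) <= radius


# Assumed sparse centroids: a 5 x 5 grid with 10 cm spacing on the plane z = 0.
spacing = 0.10
points = [(i * spacing, j * spacing, 0.0) for i in range(5) for j in range(5)]

# Sphere placed between four centroids, 2 cm above the surface, radius 5 cm.
center = (0.15, 0.15, 0.02)
radius = 0.05

sparse_hit = point_cloud_collides(points, center, radius)
surface_hit = plane_collides(0.0, center, radius)
print(sparse_hit, surface_hit)
```

The nearest centroid is about 7.3 cm from the sphere center, so the point-based test reports free space even though the sphere overlaps the surface by 3 cm. A check against connected geometry catches the overlap, which is the failure mode the excerpt attributes to sparse point clouds.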