Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Hong Liu; Jin Jin; Peifeng Jiang; WenShuai Wang; Xia Li

arxiv: 2605.21935 · v1 · pith:622R72ZLnew · submitted 2026-05-21 · 💻 cs.RO

Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3

classification 💻 cs.RO

keywords: humanoid navigation, dynamic environments, 3D Gaussian Splatting, scene memory, discrepancy detection, semantic mapping, manipulation safety

The pith

A multi-modal interactive field updates humanoid scene memory only for real changes, raising relocation success from 12 to 94 percent in dynamic offices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the Multi-modal Interactive Field to give humanoid robots reliable scene memory even when the environment changes and their own walking distorts what they see. It does this by combining three specialized fields for appearance, space, and geometry, then using a discrepancy score to decide which parts of the memory need updating. The approach is important because existing mapping tools assume nothing moves and cameras stay steady, which rarely holds for a walking robot trying to pick up objects in a real office. If successful, it means robots can keep working without their internal map becoming outdated or too large to handle.

Core claim

By coupling an uncertainty-aware 3D Gaussian Splatting Appearance Field that reduces gait-induced blur, a Spatial Field for topological memory, and a Geometry Field for safe interaction poses, the system uses a discrepancy detection score to perform local memory updates only on persistent environmental changes rather than false positives from locomotion.

What carries the argument

The Multi-modal Interactive Field (MIF) that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop pipeline.

If this is right

Improves relocation success in non-static environments from 12% to 94% on a Unitree-G1 humanoid.
Reduces semantic memory footprint by 91.4% through feature distillation for practical online operation.
Supports Interaction Pose Safety using the Geometry Field before manipulation tasks.
Maintains reliable memory under locomotion-induced perceptual distortion without assuming stable camera trajectories or static scenes.
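As a reading aid, the two headline numbers in the list above can be restated as a compression ratio and a percentage-point gain. This is plain arithmetic on the reported figures (91.4%, 12%, 94%); the helper names are illustrative and do not come from the paper.

```python
def compression_ratio(reduction_fraction: float) -> float:
    """Convert a fractional size reduction into an 'N times smaller' ratio."""
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reduction_fraction)


def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two success rates given in percent."""
    return after_pct - before_pct


# Figures reported in the paper's abstract.
ratio = compression_ratio(0.914)          # roughly 11.6x smaller footprint
gain = percentage_point_gain(12.0, 94.0)  # 82 percentage points
print(f"footprint ratio: {ratio:.1f}x, success gain: {gain:.0f} points")
```

Under this reading, a 91.4% reduction leaves about 8.6% of the original semantic memory, which may be easier to compare against other compression results.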

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Discrepancy-based update logic could apply to other mobile robots that must separate self-motion artifacts from true scene changes.
Over longer deployments the evolving memory might support autonomous operation across multiple days in shared spaces.
Pairing the geometry field with task planners could improve path choices that respect manipulation safety margins.

Load-bearing premise

The discrepancy detection score reliably separates locomotion-induced false-positive changes from persistent environmental changes without missing critical updates or introducing new errors in the memory model.

What would settle it

A test in a static office where the robot walks extensively and memory updates remain near zero, versus a test where objects are moved and updates occur only in the affected local regions, would show whether the score separates the two cases correctly.
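The test described above can be mocked up on synthetic data. The sketch below is not the paper's discrepancy score, whose formula is not given here; it assumes a hypothetical per-region score in [0, 1] and a simple persistence gate (`threshold` and `min_frames` are invented parameters) to show what "near zero updates in a static walk, local updates after a move" would look like as a pass/fail check.

```python
from typing import List, Set


def persistent_regions(
    scores: List[List[float]], threshold: float = 0.5, min_frames: int = 5
) -> Set[int]:
    """Return regions whose score stays above `threshold` for at least
    `min_frames` consecutive frames. scores[t][r] is the (hypothetical)
    discrepancy of region r at frame t."""
    if not scores:
        return set()
    run_lengths = [0] * len(scores[0])
    flagged: Set[int] = set()
    for frame in scores:
        for region, value in enumerate(frame):
            # Extend the run while the score is high, otherwise reset it.
            run_lengths[region] = run_lengths[region] + 1 if value > threshold else 0
            if run_lengths[region] >= min_frames:
                flagged.add(region)
    return flagged


NUM_FRAMES, NUM_REGIONS = 40, 6

# Static office: low baseline with a one-frame spike every 8th frame,
# standing in for gait-induced blur.
static_walk = [
    [0.9 if t % 8 == 0 else 0.1 for _ in range(NUM_REGIONS)]
    for t in range(NUM_FRAMES)
]

# Same walk, but region 3 changes for good at frame 20 (an object was moved).
moved_object = [row[:] for row in static_walk]
for t in range(20, NUM_FRAMES):
    moved_object[t][3] = 0.8

print(persistent_regions(static_walk))   # no region should be flagged
print(persistent_regions(moved_object))  # only the moved region should be flagged
```

A real evaluation would replace the synthetic arrays with logged scores from locomotion-only and object-displacement runs; the persistence gate is only one plausible way to reject short-lived spikes.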

Figures

Figures reproduced from arXiv: 2605.21935 by Hong Liu, Jin Jin, Peifeng Jiang, WenShuai Wang, Xia Li.

**Figure 1.** Multi-modal Interactive Fields (MIF) for Robust Humanoid Navigation. We propose a hierarchical framework composed of three coupled fields: (1) Appearance Field: Provides dense semantic grounding for robust view synthesis. (2) Spatial Field: Maintains a dynamic topological Scene Graph that updates upon object relocation, enabling seamless re-planning from obsolete paths to correct trajectories. (3) Geometry…

**Figure 1.** For P1, the Appearance Field builds a confidence-aware semantic 3D Gaussian Splatting representation in which each Gaussian carries a reliability estimate for identifying gait-corrupted primitives. This reliability gate suppresses gait-corrupted primitives during rendering and scene-graph construction, reducing the propagation of locomotion artifacts into semantic grounding. For interaction-level safety, …

**Figure 2.** Overview of the Multi-modal Interactive Fields (MIF) framework. (Red) Incremental Appearance Field: Constructs a dense semantic base by fusing 3DGS SLAM [50] with multi-modal features [33, 38] to generate confidence-aware maps. (Blue) Spatial Field: Abstracts the dense map into a topological Scene Graph, utilizing VLMs [20] to reason over object relationships from synthesized views. (Green) Geometry Field:…

**Figure 3.** Humanoid navigation control via Pure Pursuit. The schematic illustrates the geometric tracking logic where the Unitree-G1 robot targets a lookahead point at distance L along the planned trajectory. The heading error ∆θ is used to compute the required curvature κ = 2 sin(∆θ)/L. The overlaid velocity curve demonstrates the adaptive scaling mechanism, where L is dynamically adjusted to suppress oscillations …

**Figure 4.** Dynamic adaptation of the hierarchical scene graph to environmental changes. (a) Detection: Upon arriving at the goal, the G1 robot detects a structural discrepancy (e.g., a moved sofa) that contradicts its prior memory, triggering an increase in the discrepancy score D. (b) Resolution: MIF initiates a local update of the Spatial Field with fresh observations. The updated graph better aligns with the physi…
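The Pure Pursuit relation quoted in the Figure 3 caption, κ = 2 sin(∆θ)/L, can be written out as a short sketch. The curvature function follows the caption directly. The lookahead schedule is an assumption: the caption only says L is adjusted dynamically, so `adaptive_lookahead` and its constants are hypothetical placeholders, and the yaw-rate step uses the standard kinematic relation ω = v·κ.

```python
import math


def pure_pursuit_curvature(heading_error_rad: float, lookahead_m: float) -> float:
    """Curvature from the caption's formula: kappa = 2 * sin(delta_theta) / L."""
    if lookahead_m <= 0.0:
        raise ValueError("lookahead distance must be positive")
    return 2.0 * math.sin(heading_error_rad) / lookahead_m


def adaptive_lookahead(
    speed_mps: float, l_min: float = 0.4, l_max: float = 1.5, gain: float = 1.0
) -> float:
    """Hypothetical schedule: lookahead grows with speed, clamped to [l_min, l_max].
    The paper's actual rule is not given in the caption."""
    return min(l_max, max(l_min, gain * speed_mps))


def yaw_rate_command(speed_mps: float, heading_error_rad: float) -> float:
    """Yaw rate omega = v * kappa for a robot moving at speed v."""
    lookahead = adaptive_lookahead(speed_mps)
    return speed_mps * pure_pursuit_curvature(heading_error_rad, lookahead)


# Small heading error at walking speed gives a gentle turn.
print(yaw_rate_command(0.5, math.radians(10.0)))
```

A longer lookahead lowers the curvature for the same heading error, which is the usual reason for lengthening L when the tracker starts to oscillate.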

Abstract

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MIF gets real-robot relocation success from 12% to 94% on a Unitree G1 in changing offices by using triggered local updates on a 3DGS-based memory, but the discrepancy score that is supposed to ignore locomotion artifacts has no reported validation metrics.

The main takeaway is that this system combines an uncertainty-aware 3D Gaussian Splatting appearance field, a spatial field that updates only on detected discrepancies, and a geometry field for interaction pose safety. On hardware it produces a clear jump in success rate over static scene-graph baselines plus a large cut in memory size through distillation. Those outcomes are the concrete part worth paying attention to for anyone running legged robots in offices or homes where things move and the robot itself shakes the camera.

Referee Report

1 major / 0 minor

Summary. The paper introduces the Multi-modal Interactive Field (MIF) for robust humanoid navigation and manipulation-oriented tasks in dynamic environments. It couples an uncertainty-aware 3D Gaussian Splatting Appearance Field to suppress gait-induced blur, a Spatial Field that maintains topological memory via discrepancy-triggered local updates, and a Geometry Field supporting Interaction Pose Safety (IPS). A discrepancy detection score is proposed to separate locomotion-induced false-positive changes from persistent environmental changes. Real-robot experiments on a Unitree-G1 humanoid in a dynamic office report relocation success rising from 12% to 94% versus static scene-graph memory, alongside a 91.4% reduction in semantic memory footprint through feature distillation.

Significance. If the performance gains are substantiated by rigorous validation, the work could meaningfully advance practical deployment of humanoids in non-static settings by addressing perceptual distortions from locomotion and enabling efficient online memory management. The closed-loop pipeline and multi-modal field integration offer a concrete approach to combining semantic mapping with geometric safety constraints.

major comments (1)

The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.
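The metrics the referee asks for are straightforward to compute once each test sequence is labelled. The sketch below assumes one boolean label per sequence (did a real change occur?) and one boolean outcome (did the score trigger an update?); the function name and the toy data are illustrative and are not results from the paper.

```python
import math
from typing import Dict, Sequence


def detection_metrics(
    real_change: Sequence[bool], triggered: Sequence[bool]
) -> Dict[str, float]:
    """Precision, recall, and false-positive rate of update triggers."""
    if len(real_change) != len(triggered):
        raise ValueError("label and prediction lists must have equal length")
    pairs = list(zip(real_change, triggered))
    tp = sum(1 for real, trig in pairs if real and trig)
    fp = sum(1 for real, trig in pairs if not real and trig)
    fn = sum(1 for real, trig in pairs if real and not trig)
    tn = sum(1 for real, trig in pairs if not real and not trig)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios are reported as NaN rather than guessed.
        return num / den if den else math.nan

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Toy data: 3 sequences with a moved object, 5 locomotion-only sequences.
labels = [True, True, True, False, False, False, False, False]
outcomes = [True, True, False, True, False, False, False, False]
metrics = detection_metrics(labels, outcomes)
print(metrics)
```

The false-positive rate on locomotion-only sequences is the number that speaks most directly to the gait-artifact claim, since it counts updates triggered when nothing in the scene moved.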

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which identifies a key area for strengthening the validation of our discrepancy detection score. We address the concern point by point below and have revised the manuscript to incorporate additional quantitative analysis.

Referee: The 12% to 94% relocation success improvement and 91.4% memory reduction are attributed to the discrepancy detection score triggering updates only for persistent changes while ignoring gait-induced artifacts (see abstract and the closed-loop pipeline description). No precision, recall, or false-positive rate is reported for the score on locomotion-only sequences versus sequences with actual object motion. This validation is load-bearing because misclassifications would allow error accumulation in the Spatial Field, directly undermining both the success-rate claim and the footprint reduction.

Authors: We agree that direct metrics on the discrepancy detection score would provide stronger, more targeted validation of its ability to distinguish locomotion-induced artifacts from persistent changes. The original manuscript presents the 12% to 94% relocation success and 91.4% memory reduction as end-to-end outcomes of the full closed-loop pipeline, which implicitly relies on the score to prevent error accumulation in the Spatial Field; the real-robot results in dynamic offices serve as the primary evidence that misclassifications did not materially degrade performance. To address the referee's point explicitly, we have added a new evaluation subsection reporting precision, recall, and false-positive rates on separate locomotion-only sequences (no object motion) versus sequences with controlled object displacements. These results show low false-positive rates under gait-induced conditions while maintaining high detection accuracy for actual changes, directly supporting the load-bearing role of the score. We believe this addition substantiates the claims without altering the core contributions.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical system validated on hardware

The paper presents MIF as an integrated perception-adaptation pipeline for humanoid navigation, coupling Appearance, Spatial, and Geometry Fields with a discrepancy detection score for local updates. All reported outcomes (12% to 94% relocation success, 91.4% memory reduction) are measured results from closed-loop experiments on a Unitree-G1 in a real dynamic office, compared against static scene-graph baselines. No equations, fitted parameters, or predictions are shown to reduce to their own inputs by construction. The discrepancy score is introduced as a design element whose separation of locomotion artifacts from persistent changes is assessed via experimental performance rather than self-definition or prior self-citation. The derivation chain consists of engineering choices and empirical validation with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about perceptual distortion in locomotion and the separability of change types, plus one new integrated system entity; no explicit free parameters are named but the discrepancy score implies tuning.

free parameters (1)

discrepancy detection score threshold
Used to decide when to trigger local memory updates by separating locomotion artifacts from real changes; value not stated but required for the pipeline to function as described.
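Because the threshold value is not stated, its effect can only be illustrated. The sketch below sweeps a hypothetical threshold over made-up scores and reports recall and false-positive rate at each setting; the scores, labels, and function name are assumptions for illustration, not data from the paper.

```python
import math
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scores: Sequence[float],
    is_real_change: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, recall, false_positive_rate) for each threshold.
    An update is considered triggered when score > threshold."""
    positives = sum(1 for real in is_real_change if real)
    negatives = len(is_real_change) - positives
    rows = []
    for threshold in thresholds:
        tp = sum(1 for s, real in zip(scores, is_real_change) if s > threshold and real)
        fp = sum(1 for s, real in zip(scores, is_real_change) if s > threshold and not real)
        recall = tp / positives if positives else math.nan
        fpr = fp / negatives if negatives else math.nan
        rows.append((threshold, recall, fpr))
    return rows


# Made-up scores: three real changes, four locomotion-only observations.
example_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1]
example_labels = [True, True, True, False, False, False, False]
table = sweep_thresholds(example_scores, example_labels, [0.25, 0.5, 0.75])
for threshold, recall, fpr in table:
    print(f"threshold={threshold:.2f} recall={recall:.2f} fpr={fpr:.2f}")
```

In this toy data, raising the threshold removes false triggers before it starts to miss real changes, but where that trade-off sits for the actual system depends on score distributions the page does not report.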

axioms (2)

domain assumption: Locomotion-induced perceptual distortion can be isolated from persistent environmental changes using a discrepancy score.
Invoked to justify selective memory updates in the closed-loop pipeline.
domain assumption: Uncertainty-aware 3D Gaussian Splatting can suppress gait-induced blur sufficiently for reliable semantic mapping.
Basis for the Appearance Field component.

invented entities (1)

Multi-modal Interactive Field (MIF) · no independent evidence
purpose: To couple appearance, spatial, and geometry fields in a perception-adaptation loop for dynamic humanoid navigation.
New system construct introduced to organize the three fields and update logic.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 2 internal anchors

[1] Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5664–5673, 2019.
[2] Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.
[3] Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, and Holger Voos. Situational graphs for robot navigation in structured indoor environments. IEEE Robotics and Automation Letters, 7(4):9107–9114, 2022.
[4] Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, and Gaurav S Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33(6):1273–1291, 2017.
[5] Georgia Chalvatzaki, Ali Younes, Daljeet Nandha, An Thai Le, Leonardo FR Ribeiro, and Iryna Gurevych. Learning to reason over scene graphs: a case study of finetuning gpt-2 into a robot language model for grounded task planning. Frontiers in Robotics and AI, 10:1221739, 2023.
[6] Matthew Chang, Théophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, et al. Goat: Go to any thing. In Robotics: Science and Systems (RSS), 2024.
[7] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. In International Conference on Learning Representations (ICLR), 2020.
[8] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems (NeurIPS), 33:4247–4258, 2020.
[9] Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, and Fisher Yu. How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. 2024.
[10] Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025.
[11] Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. Bsp-net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 45–54, 2020.
[12] Angela Dai and Matthias Nießner. Scan2mesh: From unstructured range scans to 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5574–5583, 2019.
[13] Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, and Federico Tombari. Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. In International Conference on Learning Representations (ICLR), volume 2024, pages 8396–8407, 2024.
[14] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23171–23181, 2023.
[15] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 216–224, 2018.
[16] Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028, 2024.
[17] Alex Hanson, Allen Tu, Vasu Singla, Mayuka Jayawardhana, Matthias Zwicker, and Tom Goldstein. Pup 3d-gs: Principled uncertainty pruning for 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 5949–5958, 2025.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems (NeurIPS), 33:6840–6851, 2020.
[19] Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception engine for 3d scene graph construction and optimization. Robotics: Science and Systems (RSS), 2022.
[20] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[21] Shuuji Kajita, Fumio Kanehiro, Kenji Kaneko, Kazuhito Yokoi, and Hirohisa Hirukawa. The 3d linear inverted pendulum mode: A simple modeling for a biped walking pattern generation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 239–246, 2001.
[22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
[23] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19729–19739, 2023.
[24] Ayoung Kim and Ryan M Eustice. Perception-driven navigation: Active visual slam for robotic area coverage. In IEEE International Conference on Robotics and Automation (ICRA), pages 3196–3203, 2013.
[25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
[26] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems (NeurIPS), 35:23311–23330, 2022.
[27] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
[28] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[29] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics, 41(4):1–15, 2022.
[30] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning (ICML), pages 7220–7229. PMLR, 2020.
[31] Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, and Long Zeng. Grid: Scene-graph-based instruction-driven robotic task planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13765–13772, 2024.
[32] Jeongtaek Oh, Jaeyoung Chung, Dongwoo Lee, and Kyoung Mu Lee. Deblurgs: Gaussian splatting for camera motion blur. arXiv preprint arXiv:2404.11358, 2024.
[33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, pages 1–31, 2024.
[34] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–824, 2023.
[35] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, 2024.
[36] Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Language-driven physics-based scene synthesis and editing via feature splatting. In European conference on computer vision (ECCV), pages 368–383, 2024.
[37] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, et al. Ros: an open-source robot operating system. In IEEE International Conference on Robotics and Automation Workshop on Open Source Software, volume 3, page 5, 2009.
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.
[39] Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In Conference on Robot Learning (CoRL), pages 23–72. PMLR, 2023.
[40] Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, and Luca Carlone. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. In Robotics: Science and Systems (RSS), 2020.
[41] Lukas Schmid, Marcus Abate, Yun Chang, and Luca Carlone. Khronos: A unified approach for spatio-temporal metric-semantic slam in dynamic environments. 2024.
[42] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, 2024.
[43]

Meshgpt: Generating triangle meshes with decoder-only transformers

Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Ta- tiana Tommasi, Daniele Sirigatti, Vladislav Rosov, An- gela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19615– 19625, 2024

work page 2024
[44]

Real-time 3d slam for humanoid robot considering pattern generator information

Olivier Stasse, Andrew J Davison, Ramzi Sellaouti, and Kazuhito Yokoi. Real-time 3d slam for humanoid robot considering pattern generator information. InIEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), pages 348–355, 2006

work page 2006
[45]

Learning 3d semantic scene graphs from 3d indoor reconstructions

Johanna Wald, Helisa Dhamo, Nassir Navab, and Fed- erico Tombari. Learning 3d semantic scene graphs from 3d indoor reconstructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3961–3970, 2020

work page 2020
[46]

Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Robotics: Science and Systems (RSS), 2024
[47]

Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, and Maani Ghaffari. Latent-bki: Open-dictionary continuous mapping in visual-language latent spaces with quantifiable uncertainty. IEEE Robotics and Automation Letters, 2025
[48]

Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7515–7525, 2021
[49]

Mengya Xu, Yang Song, Yongbo Chen, Shoudong Huang, and Qi Hao. Invariant ekf based 2d active slam with exploration task. In IEEE International Conference on Robotics and Automation (ICRA), pages 5350–5356, 2021
[50]

Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19595–19604, 2024
[51]

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In European Conference on Computer Vision (ECCV), pages 162–179. Springer, 2024
[52]

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. Advances in Neural Information Processing Systems (NeurIPS), 37:5285–5307, 2024
[53]

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 42–48, 2024
[54]

Yao Zhao, Zhi Xiong, Shuailin Zhou, Jingqi Wang, Ling Zhang, and Pascual Campoy. Perception-aware planning for active slam in dynamic environments. Remote Sensing, 14(11):2584, 2022
[55]

Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, et al. Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8120–8132, 2025
[56]

Robot Platform and Sensors: We deploy our system on the Unitree G1 humanoid robot. The robot is equipped with a head-mounted Intel RealSense D435i RGB-D camera, configured to provide synchronized color and depth streams at a resolution of 640×480 (30 Hz). Camera intrinsics are calibrated offline, and the camera-to-base transform is obtained from the calibrat...
[57]

Computation and Communication: We split computation between onboard control and offboard perception/reasoning:
• Onboard Compute: The robot's built-in NVIDIA Jetson Orin NX handles robot-provided low-level control interfaces, state feedback (IMU/odometry), and hardware bridging.
• Offboard Compute: The high-load MIF modules, including 3DGS mapping, VLM-assisted graph construction, and Flow Matching-based geometry generation, run on the workstation...
[58]

Server Specifications: The offboard workstation uses commercial desktop hardware. It is equipped with an Intel Core i9-13900K CPU, 64 GB DDR5 RAM, and an NVIDIA RTX 4090 GPU with 24 GB VRAM. This setup was sufficient for the reported 22 FPS Appearance Field updates and approximately 6.2 s Geometry Field generation latency in our experiments.

B. Software Stack
[59]

3DGS Backend: The Appearance Field is implemented on an incremental Gaussian Splatting backend. We use differentiable Gaussian rasterization libraries and CUDA kernels for primitive updates and rendering. The SLAM frontend is implemented in Python/PyTorch, leveraging diff-gaussian-rasterization-w-depth for differentiable rasterization with depth supervi...
[60]

VLM & Prompt Engineering: For Spatial Field (F_spat) construction, we use GPT-4o (OpenAI) to parse stabilized rendered views into structured scene-graph evidence. The parsed outputs are post-processed into scene graph nodes and relations. To encourage structured output for topological parsing, we use the JSON-style prompt template shown in Fig. S.1. This p...
[61]

Flow Matching Model Architecture: The Geometry Field (F_geom) uses a pre-trained conditional Flow Matching model [10]. Using a pre-trained model avoids training a geometry prior from scratch and provides object-level mesh hypotheses for IPS verification.
• Backbone Architecture: We utilize a pre-trained Flow-Matching Transformer backbone that processes lat...
[62]

System Architecture Implementation: Fig. S.2 summarizes the onboard/offboard dataflow used by MIF. The system comprises distinct computational modules distributed as follows:
Onboard Computation (Unitree G1 with Jetson Orin): The onboard layer provides sensor streams, robot state feedback, and low-level command execution.
• Sensors: Acquires raw data ...
[63]

Robustness of Appearance Field: Jitter Suppression & Dynamic Adaptation: We visualize rendered RGB/depth quality from the Appearance Field F_app under humanoid walking. Rows 1-3 of Fig. S.3 illustrate the effect of confidence gating during high-speed walking (0.5 m/s). The ungated variant shows ghosting artifacts in RGB and noisier depth estimates (e.g., D...
[64]

Generated Geometry for IPS Checking: To illustrate why sparse point clouds are insufficient for IPS collision checking, we present a comparative analysis in Fig. S.4. The raw 3D Gaussian point cloud (left panels), while useful for visual navigation, remains sparse and lacks surface connectivity. Gaps between centroids may make collision checks overly opt...
