arxiv: 2410.06239 · v3 · submitted 2024-10-08 · 💻 cs.RO

Open-Architecture End-to-End System for Real-World Autonomous Robot Navigation

Pith reviewed 2026-05-23 19:22 UTC · model grok-4.3

classification 💻 cs.RO
keywords autonomous navigation · scene graphs · LLM planner · quadruped robot · real-world deployment · semantic mapping · zero-shot navigation · ROS2

The pith

An open system lets a quadruped robot navigate unknown indoor spaces zero-shot by building evolving scene graphs and feeding them to an LLM planner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a complete onboard system that fuses localization, mapping, and open-vocabulary object detection into hierarchical scene graphs that update continuously. An LLM then reads these graphs to produce and revise multi-step navigation plans in response to natural-language commands. Experiments on a Unitree Go2 quadruped across several real indoor environments report over 88 percent task success with no environment-specific training. The work targets the gap between simulated navigation methods and the partial observability, sensor noise, and dynamic changes that appear on physical robots.

Core claim

The central claim is that a lightweight ROS2-based architecture, which continuously builds hierarchical scene graphs from a semantic object map and supplies them to an LLM planner, enables reliable zero-shot autonomous navigation in unknown, dynamic real-world indoor settings, demonstrated by greater than 88 percent task success on a quadruped platform.

What carries the argument

Hierarchical scene graphs constructed from a continuously updated semantic object map, which supply structured spatial and semantic information to the LLM planner for real-time plan generation and adaptation.
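
To make the idea concrete, here is a minimal sketch of a two-level scene graph (rooms containing objects) that can be flattened into text for a language-model planner. The class names, fields, and text layout are illustrative assumptions and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One detected object in the semantic object map (hypothetical schema)."""
    label: str   # open-vocabulary class name, e.g. "monitor"
    x: float     # position in the map frame, metres
    y: float


@dataclass
class RoomNode:
    """A cluster of objects that has been assigned a room label."""
    label: str
    objects: list[ObjectNode] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Two-level hierarchy: rooms, each holding its objects."""
    rooms: dict[str, RoomNode] = field(default_factory=dict)

    def add_object(self, room_id: str, room_label: str, obj: ObjectNode) -> None:
        # Create the room on first sight, then attach the object to it.
        room = self.rooms.setdefault(room_id, RoomNode(label=room_label))
        room.objects.append(obj)

    def find(self, label: str) -> list[tuple[str, ObjectNode]]:
        # Return (room_id, object) pairs whose label matches the query.
        return [(rid, o) for rid, room in self.rooms.items()
                for o in room.objects if o.label == label]

    def to_text(self) -> str:
        # Flatten the hierarchy into indented lines a text-only planner can read.
        lines = []
        for rid, room in self.rooms.items():
            lines.append(f"{rid} ({room.label})")
            for o in room.objects:
                lines.append(f"  - {o.label} at ({o.x:.1f}, {o.y:.1f})")
        return "\n".join(lines)


graph = SceneGraph()
graph.add_object("room_0", "break room", ObjectNode("table", 1.0, 2.0))
graph.add_object("room_1", "office", ObjectNode("monitor", 6.5, 3.2))
print(graph.to_text())
```

Keeping the graph as explicit nodes means new detections can be appended as the map updates, and the planner only ever sees the compact text view rather than raw sensor data.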

If this is right

  • Natural-language task specification becomes usable for navigation without environment-specific engineering.
  • The robot can adjust its plan when objects move or new obstacles appear because the map and graphs update in real time.
  • The same software stack runs on physical hardware without separate simulation training or domain randomization.
  • System behavior logs from deployment reveal which perception or planning steps cause most failures.
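
The plan-adjustment behavior in the second bullet can be sketched as a plan, act, observe, replan loop. In the sketch below a simple rule-based function stands in for the LLM planner, perception is simulated by a list of object sets revealed per exploration step, and the action names `explore`, `goto`, and `done` are hypothetical rather than the paper's actual primitives.

```python
from typing import Callable


def rule_based_planner(known_objects: set[str], target: str, feedback: str) -> str:
    """Stand-in for the LLM planner: choose the next action primitive."""
    if feedback == "arrived":
        return "done"
    if target in known_objects:
        return f"goto {target}"
    return "explore"


def run_task(target: str,
             planner: Callable[[set[str], str, str], str],
             discoveries: list[set[str]],
             max_steps: int = 10) -> list[str]:
    """Plan, act, observe, and replan until the planner reports completion."""
    known: set[str] = set()
    feedback = ""
    history: list[str] = []
    pending = list(discoveries)  # objects revealed by each exploration step
    for _ in range(max_steps):
        command = planner(known, target, feedback)
        history.append(command)
        if command == "done":
            break
        if command == "explore":
            # Simulated perception: each exploration step reveals new objects.
            known |= pending.pop(0) if pending else set()
            feedback = "explored"
        else:
            # Simulated navigation that always succeeds.
            feedback = "arrived"
    return history


history = run_task("bag", rule_based_planner, [{"chair"}, {"bag"}])
print(history)
```

Because the planner is queried again after every action, a change in the known object set between steps changes the next command without any special-case handling.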

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The open ROS2 architecture could be extended by swapping in newer open-vocabulary detectors to raise the success rate further.
  • The failure cases (under 12 percent of tasks, given the reported success rate of over 88 percent) likely cluster around incomplete scene graphs; targeted improvements to object mapping would be a direct next step.
  • Because the planner operates on explicit graphs rather than raw images, the same system could support multi-robot coordination by sharing graph fragments.

Load-bearing premise

The scene graphs contain enough reliable structure for the LLM to produce plans that remain workable when perception is noisy and the environment changes.

What would settle it

Running the same system in additional indoor environments of comparable size and clutter and recording task success below 50 percent would show that the claimed generalization does not hold.

Figures

Figures reproduced from arXiv: 2410.06239 by Ali Umut Kaypak, Farshad Khorrami, Naman Patel, Prashanth Krishnamurthy, Raktim Gautam Goswami, Rooholla Khorrambakht, Venkata Naren Devarakonda.

Figure 1: Overview of our OrionNav framework. The OrionNav system fuses data from onboard LiDAR and odometry sensors for robust localization and mapping, while integrating open-world semantics to produce a semantic object map of the environment. This map is then clustered into distinct rooms, and room labels are assigned using the Llama3 LLM, generating a hierarchical scene graph. An LLM-based planner utilizes this …
Figure 2: Our method makes use of the semantic constructs of indoor environments to generate a hierarchical scene graph from the semantic object map. The …
Figure 3: Prompt templates used to generate text embeddings for each category.
Figure 4: Overview of the system and user prompts for the LLM planner. The system prompt explains the agent’s role, action primitives, and map format. The user prompt provides the map, command history, feedback, and task details. In the initial call to the LLM, command history and feedback are absent, as there is no prior interaction. Feedback includes task status and error messages from previous command executions.…
Figure 5: Robot Setup: OrionNav’s capabilities are demonstrated on a Unitree Go2 quadrupedal robot equipped with onboard LiDAR sensor, stereo camera, and embedded computers equipped with a graphics processing unit (GPU). The experiments were conducted in four distinct office environments, each featuring multiple rooms and corridors, as depicted in …
Figure 6: Experiment Environments: The experiments were performed in four distinct environments. Environment A is a large, multi-room space with interconnected corridors. Environment B is a subset of Environment A, while Environment C is a smaller, single-room setting with diverse objects. Environment D comprises two perpendicular corridors, forming an L-shaped layout. The 2D LiDAR maps created using SLAM are shown …
Figure 7: Success and failure cases of OrionNav across all experiments. Visualization of our task execution framework with a breakdown of the observed failures that occur due to perception and navigation. A detailed breakdown of failures is provided on the right. … predictions are handled by a quantized LLaMA 3 model, with system prompts containing 21 tokens and user prompts averaging 175 tokens, depending…
Figure 8: Short range object navigation task: The robot observes the queried objects during the first 360° rotation and then navigates toward them.
Figure 9: Long range object navigation task: The system is tasked with locating a bag in a corridor environment. Initially, the bag is not visible, prompting the LLM planner to issue an exploration command. The robot explores until it detects the bag, then successfully navigates to it.
Figure 10: Long range object navigation task: The system is tasked with locating a bag on a floor of a building. It begins at the starting position (a) and initially searches the office-like space (b). After failing to find the bag, the system initiates global exploration (c, d, e). Upon detecting another office space (f), the system enters and searches it (g), successfully locating the bag. … command, allowing the ro…
Figure 11: Room navigation task: The system is tasked with locating an office room on a building floor. The LLM planner initiates the task by issuing an exploration command, during which the robot first encounters a break room. Continuing its search, the robot eventually reaches an office room which is correctly detected and classified during scene graph generation. The LLM planner then halts exploration, successful…
Figure 13: Autonomous object navigation in changing environment: The system is provided with a partial map of the environment containing an office room among other locations. A monitor is later placed in the office room and the system is tasked with finding it. In (a), using only the semantic object map without room labels, the system searched near the table in the break room but did not find the monitor. It then ex…
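
The Figure 4 caption describes a user prompt made of the map, command history, feedback, and task details, with history and feedback absent on the first call. A minimal sketch of that assembly step follows; the section headings and the function name `build_user_prompt` are assumptions for illustration, not the paper's actual prompt text.

```python
from typing import Optional


def build_user_prompt(map_text: str,
                      task: str,
                      history: Optional[list[str]] = None,
                      feedback: Optional[str] = None) -> str:
    """Assemble the user prompt; omit sections that have no content yet."""
    sections = [f"MAP:\n{map_text}"]
    if history:
        # Earlier commands, oldest first, so the planner can see what was tried.
        sections.append("COMMAND HISTORY:\n" + "\n".join(history))
    if feedback:
        # Task status or error message from the previous command execution.
        sections.append(f"FEEDBACK:\n{feedback}")
    sections.append(f"TASK:\n{task}")
    return "\n\n".join(sections)


first_call = build_user_prompt("room_0 (office)\n  - table", "Find the bag.")
later_call = build_user_prompt(
    "room_0 (office)\n  - table",
    "Find the bag.",
    history=["explore"],
    feedback="explore finished: no bag detected",
)
print(first_call)
print("---")
print(later_call)
```

Dropping empty sections keeps the first prompt short and avoids showing the planner placeholder text it might treat as real feedback.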
Original abstract

Enabling robots to autonomously navigate unknown, complex, and dynamic real-world environments presents several challenges, including imperfect perception, partial observability, localization uncertainty, and safety constraints. Current approaches are typically limited to simulations, where such challenges are not present. In this work, we present a lightweight, open-architecture, end-to-end system for real-world robot autonomous navigation. Specifically, we deploy a real-time navigation system on a quadruped robot by integrating multiple onboard components that communicate via ROS2. Given navigation tasks specified in natural language, the system fuses onboard sensory data for localization and mapping with open-vocabulary semantics to build hierarchical scene graphs from a continuously updated semantic object map. An LLM-based planner leverages these graphs to generate and adapt multi-step plans in real time as the scene evolves. Through experiments across multiple indoor environments using a Unitree Go2 quadruped, we demonstrate zero-shot real-world autonomous navigation, achieving over 88% task success, and provide analysis of system behavior during deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a lightweight, open-architecture end-to-end system for real-world autonomous robot navigation on a Unitree Go2 quadruped. It integrates ROS2 components for onboard sensing, localization, and mapping; open-vocabulary semantics to construct a continuously updated semantic object map; hierarchical scene graphs; and an LLM-based planner that generates and adapts multi-step plans in real time. The central empirical claim is zero-shot navigation achieving over 88% task success across multiple indoor environments, with accompanying analysis of system behavior.

Significance. If the reported performance is supported by adequate trial counts, clear success definitions, and controls, the work would offer a practical, reproducible integration of perception, mapping, and LLM planning modules for dynamic real-world navigation. This could serve as a useful baseline for embodied systems research, particularly given the emphasis on an open architecture and deployment on physical hardware.

major comments (2)
  1. Abstract: the central claim of 'over 88% task success' is presented without any accompanying information on the number of trials performed, the precise definition of task success, failure mode categorization, statistical measures (e.g., confidence intervals), or controls for confounding factors such as environment variability. This information is load-bearing for evaluating whether the empirical demonstration supports the zero-shot navigation claim.
  2. The description of the LLM-based planner (implicit in the system architecture): the paper relies on the assumption that hierarchical scene graphs built from the semantic object map supply sufficient structured information for reliable real-time multi-step planning and adaptation, yet no ablation studies, quantitative metrics on plan success rates, or analysis of cases where graph incompleteness leads to planner failure are provided to substantiate this link.
minor comments (1)
  1. The abstract and introduction would benefit from explicit forward references to the experiments section where trial counts and success criteria are detailed.
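
The first major comment asks for statistical measures alongside the success rate. As a sketch of what that could look like, the code below computes a Wilson score interval for a binomial success proportion. The trial counts used in the example (44 successes in 50 trials) are hypothetical and chosen only to give 88 percent; they are not the paper's figures.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts: 44 successful runs out of 50.
low, high = wilson_interval(44, 50)
print(f"success rate 0.88, interval {low:.3f} to {high:.3f}")
```

With these made-up counts the interval spans roughly 0.76 to 0.94, which illustrates why the number of trials matters when reading a headline success rate.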

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing the open-architecture end-to-end navigation system. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: the central claim of 'over 88% task success' is presented without any accompanying information on the number of trials performed, the precise definition of task success, failure mode categorization, statistical measures (e.g., confidence intervals), or controls for confounding factors such as environment variability. This information is load-bearing for evaluating whether the empirical demonstration supports the zero-shot navigation claim.

    Authors: We agree that the abstract would be strengthened by including supporting details for the empirical claim. The Experiments section of the manuscript reports the trial counts across multiple indoor environments, defines task success as reaching the goal location without collisions or timeouts, categorizes failure modes, and discusses performance variability. In the revision we will update the abstract to briefly state the number of trials, success definition, and reference to the statistical and failure analysis provided in the body. revision: yes

  2. Referee: The description of the LLM-based planner (implicit in the system architecture): the paper relies on the assumption that hierarchical scene graphs built from the semantic object map supply sufficient structured information for reliable real-time multi-step planning and adaptation, yet no ablation studies, quantitative metrics on plan success rates, or analysis of cases where graph incompleteness leads to planner failure are provided to substantiate this link.

    Authors: The manuscript prioritizes evaluation of the integrated system on physical hardware and includes qualitative analysis of planner adaptation during real-world runs. We acknowledge that dedicated ablation studies isolating the planner, quantitative plan success rates, and explicit analysis of graph incompleteness cases are not present. We will add a limitations paragraph in the revised manuscript discussing these aspects and the role of scene graph structure, while noting that full ablations fall outside the current scope focused on end-to-end deployment. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical system integration with no derivations or fitted predictions

full rationale

The paper presents an integrated ROS2-based navigation system on a quadruped robot, relying on open-vocabulary semantics, hierarchical scene graphs, and an LLM planner to achieve >88% zero-shot task success in real indoor environments. No equations, parameter fits, or mathematical derivations are described that would reduce the reported success metric to quantities defined inside the paper. The central claim is an empirical demonstration of a working architecture rather than a derived prediction; the information-sufficiency assumption between scene graphs and LLM planning is an engineering hypothesis tested by experiment, not a self-referential definition or self-citation chain. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an applied engineering systems description that relies on standard robotics middleware, off-the-shelf perception models, and existing LLM capabilities without introducing new mathematical free parameters, domain axioms, or postulated entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

    cs.CV 2026-05 conditional novelty 7.0

    OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    A survey of visual navigation: From geometry to embodied AI,

    T. Zhang, X. Hu, J. Xiao, and G. Zhang, “A survey of visual navigation: From geometry to embodied AI,” Engineering Applications of Artificial Intelligence, vol. 114, p. 105036, 2022

  2. [2]

    A survey of embodied ai: From simulators to research tasks,

    J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied ai: From simulators to research tasks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022

  3. [3]

    Emerging frontiers in human-robot interaction,

    F. Safavi, P. Olikkal, D. Pei, S. Kamal, H. Meyerson, V . Penumalee, and R. Vinjamuri, “Emerging frontiers in human-robot interaction,”Journal of Intelligent and Robotic Systems , vol. 110, no. 2, p. 45, 2024

  4. [4]

    Unlocking robotic autonomy: A survey on the applications of foundation models,

    D.-S. Jang, D.-H. Cho, W.-C. Lee, S.-K. Ryu, B. Jeong, M. Hong, M. Jung, M. Kim, M. Lee, S. Lee, et al., “Unlocking robotic autonomy: A survey on the applications of foundation models,” International Journal of Control, Automation and Systems , vol. 22, no. 8, pp. 2341– 2384, 2024

  5. [5]

    Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,

    C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics , vol. 32, no. 6, pp. 1309–1332, 2016

  6. [6]

    A new era of indoor scene reconstruction: A survey,

    H. Wang and M. Li, “A new era of indoor scene reconstruction: A survey,” IEEE Access, vol. 12, pp. 110 160–110 192, 2024

  7. [7]

    A survey on global lidar localization: Challenges, advances and open problems,

    H. Yin, X. Xu, S. Lu, X. Chen, R. Xiong, S. Shen, C. Stachniss, and Y . Wang, “A survey on global lidar localization: Challenges, advances and open problems,” International Journal of Computer Vision , vol. 132, no. 8, pp. 3139–3171, 2024

  8. [8]

    Kimera: From SLAM to spatial perception with 3d dynamic scene graphs,

    A. Rosinol, A. Violette, M. Abate, N. Hughes, Y . Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From SLAM to spatial perception with 3d dynamic scene graphs,” International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1510–1546, 2021

  9. [9]

    3d ac- tive metric-semantic SLAM,

    Y . Tao, X. Liu, I. Spasojevic, S. Agarwal, and V . Kumar, “3d ac- tive metric-semantic SLAM,” IEEE Robotics and Automation Letters , vol. 9, no. 3, pp. 2989–2996, 2024

  10. [10]

    A survey of visual SLAM in dynamic environment: The evolution from geometric to semantic approaches,

    Y . Wang, Y . Tian, J. Chen, K. Xu, and X. Ding, “A survey of visual SLAM in dynamic environment: The evolution from geometric to semantic approaches,” IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–21, 2024

  11. [11]

    A survey on open-vocabulary detection and segmentation: Past, present, and future,

    C. Zhu and L. Chen, “A survey on open-vocabulary detection and segmentation: Past, present, and future,” CoRR, vol. abs/2307.09220, 2023

  12. [12]

    Towards open vocabulary learning: A survey,

    J. Wu, X. Li, S. Xu, H. Yuan, H. Ding, Y . Yang, X. Li, J. Zhang, Y . Tong, X. Jiang, B. Ghanem, and D. Tao, “Towards open vocabulary learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 5092–5113, 2024

  13. [13]

    Open3DSG: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,

    S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski, “Open3DSG: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, W A, USA, June 2024, pp. 14 183–14 193

  14. [14]

    Openmask3d: Open-vocabulary 3d instance segmenta- tion,

    A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, “Openmask3d: Open-vocabulary 3d instance segmenta- tion,” arXiv preprint arXiv:2306.13631 , 2023

  15. [15]

    Clio: Real-time task-driven open-set 3d scene graphs,

    D. Maggio, Y . Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone, “Clio: Real-time task-driven open-set 3d scene graphs,” IEEE Robotics and Automation Letters , vol. 9, no. 10, pp. 8921–8928, 2024

  16. [16]

    Hier- archical Open-V ocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,

    A. Werby, C. Huang, M. B ¨uchner, A. Valada, and W. Burgard, “Hier- archical Open-V ocabulary 3D Scene Graphs for Language-Grounded Robot Navigation,” in Proceedings of the Robotics: Science and Systems, Delft, Netherlands, July 2024

  17. [17]

    Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning,

    K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suen- derhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning,” in Proceedings of the Con- ference on Robot Learning , Atlanta, GA, USA, November 2023, pp. 23–72

  18. [18]

    Saynav: Grounding large language models for dynamic planning to navigation in new environments,

    A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez, “Saynav: Grounding large language models for dynamic planning to navigation in new environments,” in Proceedings of the Interna- tional Conference on Automated Planning and Scheduling , Banff, AB, Canada, June 2024, pp. 464–474

  19. [19]

    V oronav: V oronoi-based zero-shot object navigation with large language model,

    P. Wu, Y . Mu, B. Wu, Y . Hou, J. Ma, S. Zhang, and C. Liu, “V oronav: V oronoi-based zero-shot object navigation with large language model,” arXiv preprint arXiv:2401.02695 , 2024

  20. [20]

    Language-grounded dynamic scene graphs for interactive object search with mobile manipulation,

    D. Honerkamp, M. B ¨uchner, F. Despinoy, T. Welschehold, and A. Val- ada, “Language-grounded dynamic scene graphs for interactive object search with mobile manipulation,” IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8298–8305, 2024

  21. [21]

    ProgPrompt: Generating situated robot task plans using large language models,

    I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “ProgPrompt: Generating situated robot task plans using large language models,” in Proceedings of the International Conference on Robotics and Automation, London, United Kingdom, May 2023, pp. 11 523–11 530

  22. [22]

    Code as policies: Language model programs for embodied control,

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Flo- rence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in Proceedings of the International Conference on Robotics and Automation , London, United Kingdom, May 2023, pp. 9493–9500

  23. [23]

    Inner monologue: Embodied reasoning through planning with language models,

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, et al. , “Inner monologue: Embodied reasoning through planning with language models,” in Proceedings of the Conference on Robot Learning , Atlanta, GA, USA, December 2023, pp. 1769–1782

  24. [24]

    Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,

    Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, et al. , “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” in Proceedings of the International Conference on Robotics and Automation, Yokohama, Japan, May 2024, pp. 5021–5028

  25. [25]

    Do as I can, not as I say: Grounding language in robotic affordances,

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, et al. , “Do as I can, not as I say: Grounding language in robotic affordances,” in Proceedings of the Conference on Robot Learning , Auckland, New Zealand, December 2022, pp. 287–318

  26. [26]

    Visual language maps for robot navigation,

    C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in Proceedings of the International Conference on Robotics and Automation , London, United Kingdom, May 2023, pp. 10 608–10 615

  27. [27]

    SceneGraph- Fusion: Incremental 3d scene graph prediction from rgb-d sequences,

    S.-C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari, “SceneGraph- Fusion: Incremental 3d scene graph prediction from rgb-d sequences,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2021, pp. 7511–7521

  28. [28]

    OpenGraph: Open-vocabulary hierarchical 3d graph representation in large-scale outdoor environments,

    Y . Deng, J. Wang, J. Zhao, X. Tian, G. Chen, Y . Yang, and Y . Yue, “OpenGraph: Open-vocabulary hierarchical 3d graph representation in large-scale outdoor environments,” IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8402–8409, 2024

  29. [29]

    LERF: language embedded radiance fields,

    J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: language embedded radiance fields,” in Proceedings of the International Conference on Computer Vision , Paris, France, October 2023, pp. 19 672–19 682

  30. [30]

    Open-nerf: Towards open vocabulary nerf decomposition,

    H. Zhang, F. Li, and N. Ahuja, “Open-nerf: Towards open vocabulary nerf decomposition,” in Proceedings of the Winter Conference on Applications of Computer Vision , Waikoloa, HI, USA, January 2024, pp. 3444–3453

  31. [31]

    Gov-nesf: Generalizable open- vocabulary neural semantic fields,

    Y . Wang, H. Chen, and G. H. Lee, “Gov-nesf: Generalizable open- vocabulary neural semantic fields,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, W A, USA, June 2024, pp. 20 443–20 453

  32. [32]

    Semantically-aware neural radiance fields for visual scene understand- ing: A comprehensive review,

    T. Nguyen, A. Bourki, M. Macudzinski, A. Brunel, and M. Bennamoun, “Semantically-aware neural radiance fields for visual scene understand- ing: A comprehensive review,” CoRR, vol. abs/2402.11141, 2024

  33. [33]

    Visual genome: Connect- ing language and vision using crowdsourced dense image annotations,

    R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma,et al., “Visual genome: Connect- ing language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , vol. 123, pp. 32–73, 2017

  34. [34]

    Structured query-based image retrieval using scene graphs,

    B. Schroeder and S. Tripathi, “Structured query-based image retrieval using scene graphs,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops , Seattle, W A, USA, June 2020, pp. 178–179

  35. [35]

    Image retrieval using scene graphs,

    J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in Proceedings of the Conference on Computer Vision and Pattern Recognition , Boston, MA, USA, June 2015, pp. 3668–3678

  36. [36]

    Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,

    Y . Zhu, J. Tremblay, S. Birchfield, and Y . Zhu, “Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,” in Proceedings of the International Conference on Robotics and Automation, Xi’an, China, May 2021, pp. 6541–6548

  37. [37]

    Joint modeling of visual objects and relations for scene graph generation,

    M. Xu, M. Qu, B. Ni, and J. Tang, “Joint modeling of visual objects and relations for scene graph generation,” Proceedings of the Advances in Neural Information Processing Systems , pp. 7689–7702, December 2021

  38. [38]

    Attentive relational networks for mapping images to scene graphs,

    M. Qi, W. Li, Z. Yang, Y . Wang, and J. Luo, “Attentive relational networks for mapping images to scene graphs,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, June 2019, pp. 3957–3966

  39. [39]

    Graph r-cnn for scene graph generation,

    J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in Proceedings of the European Conference on Computer Vision, Munich, Germany, September 2018, pp. 670–685

  40. [40]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proceedings of the International Conference on Machine Learning, vol. 139, Vienna, Austria, July 2021, pp. 8748–8763

  41. [41]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

  42. [42]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” in Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, December 2023

  43. [43]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition , Seattle, W A, USA, June 2024, pp. 26 296–26 306

  44. [44]

    Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” in Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, July 2023, pp. 19 730–19 742

  45. [45]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, P. Doll´ar, and R. B. Girshick, “Segment anything,” inProceedings of the International Conference on Computer Vision, Paris, France, October 2023, pp. 3992–4003

  46. [46]

    3D scene graph: A structure for unified semantics, 3d space, and camera,

    I. Armeni, Z.-Y . He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3D scene graph: A structure for unified semantics, 3d space, and camera,” in Proceedings of the International Conference on Computer Vision, Seoul, Korea, October 2019, pp. 5664–5673

  47. [47]

    3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans,

    A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans,” in Proceedings of the Robotics: Science and Systems , Corvalis, Oregon, USA, July 2020

  48. [48]

    Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimiza- tion,

    N. Hughes, Y . Chang, and L. Carlone, “Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimiza- tion,” in Proceedings of the Robotics: Science and Systems , New York City, NY , USA, June 2022

  49. [49]

    An embodied generalist agent in 3d world,

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,” in Proceedings of the International Conference on Machine Learning, Vienna, Austria, July 2024

  50. [50]

    A survey of image semantics-based visual simultaneous localization and mapping: Application-oriented solutions to autonomous navigation of mobile robots,

    L. Xia, J. Cui, R. Shen, X. Xu, Y. Gao, and X. Li, “A survey of image semantics-based visual simultaneous localization and mapping: Application-oriented solutions to autonomous navigation of mobile robots,” International Journal of Advanced Robotic Systems, vol. 17, no. 3, 2020

  51. [51]

    Llm-enabled cyber-physical systems: Survey, research opportunities, and challenges,

    W. Xu, M. Liu, O. Sokolsky, I. Lee, and F. Kong, “Llm-enabled cyber-physical systems: Survey, research opportunities, and challenges,” in IEEE International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys), Hong Kong, China, May 2024, pp. 50–55

  52. [52]

    Robocat: A self-improving foundation agent for robotic manipulation,

    K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju, et al., “Robocat: A self-improving foundation agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023

  53. [53]

    Chat-scene: Bridging 3d scene and large language models with object identifiers,

    H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y. Zhao, T. Jin, and Z. Zhao, “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, December 2024

  54. [54]

    Object detection in 20 years: A survey,

    Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023

  55. [55]

    Image segmentation using deep learning: A survey,

    S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3523–3542, 2021

  56. [56]

    Clipscope: Enhancing zero-shot ood detection with bayesian scoring,

    H. Fu, N. Patel, P. Krishnamurthy, and F. Khorrami, “Clipscope: Enhancing zero-shot ood detection with bayesian scoring,” arXiv e-prints, pp. arXiv–2405, 2024

  57. [57]

    A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,

    M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,” in Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, October 2022, pp. 736–753

  58. [58]

    Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively,

    H. Yuan, X. Li, C. Zhou, Y. Li, K. Chen, and C. C. Loy, “Open-vocabulary SAM: segment and recognize twenty-thousand classes interactively,” in Proceedings of the European Conference on Computer Vision, Milan, Italy, September 2024

  59. [59]

    ViTamin: Designing scalable vision models in the vision-language era,

    J. Chen, Q. Yu, X. Shen, A. L. Yuille, and L. Chen, “ViTamin: Designing scalable vision models in the vision-language era,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 12954–12966

  60. [60]

    Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,

    B. Xie, J. Cao, J. Xie, F. S. Khan, and Y. Pang, “Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 3426–3436

  61. [61]

    CAT-seg: Cost aggregation for open-vocabulary semantic segmentation,

    S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “CAT-seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 4113–4123

  62. [62]

    Segvg: Transferring object bounding box to segmentation for visual grounding,

    W. Kang, G. Liu, M. Shah, and Y. Yan, “Segvg: Transferring object bounding box to segmentation for visual grounding,” in Proceedings of the European Conference on Computer Vision, Milan, Italy, September 2024

  63. [63]

    Image-to-image matching via foundation models: A new perspective for open-vocabulary semantic segmentation,

    Y. Wang, R. Sun, N. Luo, Y. Pan, and T. Zhang, “Image-to-image matching via foundation models: A new perspective for open-vocabulary semantic segmentation,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 3952–3963

  64. [64]

    Llm-seg: Bridging image segmentation and large language model reasoning,

    J. Wang and L. Ke, “Llm-seg: Bridging image segmentation and large language model reasoning,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 1765–1774

  65. [65]

    Llmformer: Large language model for open-vocabulary semantic segmentation,

    H. Shi, S. D. Dao, and J. Cai, “Llmformer: Large language model for open-vocabulary semantic segmentation,” International Journal of Computer Vision, August 2024

  66. [66]

    SQA3D: situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S. Zhu, and S. Huang, “SQA3D: situated question answering in 3d scenes,” in Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, May 2023

  67. [67]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 2022, pp. 19107–19117

  68. [68]

    Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,

    P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,” in Proceedings of the European Conference on Computer Vision, Glasgow, UK, August 2020, pp. 422–440

  69. [69]

    Grounded 3d-llm with referent tokens,

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Lyu, R. Xu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  70. [70]

    3d-llm: Injecting the 3d world into large language models,

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” in Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, December 2023

  71. [71]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2024, pp. 26428–26438

  72. [72]

    LLM-Grounder: Open-vocabulary 3d visual grounding with large language model as an agent,

    J. Yang, X. Chen, S. Qian, N. Madaan, M. Iyengar, D. F. Fouhey, and J. Chai, “LLM-Grounder: Open-vocabulary 3d visual grounding with large language model as an agent,” in Proceedings of the International Conference on Robotics and Automation, Yokohama, Japan, May 2024, pp. 7694–7701

  73. [73]

    Mid-fusion: Octree-based object-level multi-instance dynamic SLAM,

    B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. J. Davison, and S. Leutenegger, “Mid-fusion: Octree-based object-level multi-instance dynamic SLAM,” in Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, May 2019, pp. 5231–5237

  74. [74]

    Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects,

    M. Rünz, M. Buffier, and L. Agapito, “Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects,” in Proceedings of the International Symposium on Mixed and Augmented Reality, D. Chu, J. L. Gabbard, J. Grubert, and H. Regenbrecht, Eds., Munich, Germany, October 2018, pp. 10–20

  75. [75]

    Quadricslam: Dual quadrics from object detections as landmarks in object-oriented SLAM,

    L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019

  76. [76]

    Semantic segmentation guided slam using vision and lidar,

    N. Patel, P. Krishnamurthy, and F. Khorrami, “Semantic segmentation guided slam using vision and lidar,” in Proceedings of the International Symposium on Robotics, Munich, Germany, June 2018, pp. 1–7

  77. [77]

    SLAM++: simultaneous localisation and mapping at the level of objects,

    R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, “SLAM++: simultaneous localisation and mapping at the level of objects,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 2013, pp. 1352–1359

  78. [78]

    vmap: Vectorised object mapping for neural field SLAM,

    X. Kong, S. Liu, M. Taher, and A. J. Davison, “vmap: Vectorised object mapping for neural field SLAM,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, June 2023, pp. 952–961

  79. [79]

    Tightly coupled semantic RGB-D inertial odometry for accurate long-term localization and mapping,

    N. Patel, F. Khorrami, P. Krishnamurthy, and A. Tzes, “Tightly coupled semantic RGB-D inertial odometry for accurate long-term localization and mapping,” in Proceedings of the International Conference on Advanced Robotics, Belo Horizonte, Brazil, December 2019, pp. 523–528

  80. [80]

    RO-MAP: real-time multi-object mapping with neural radiance fields,

    X. Han, H. Liu, Y. Ding, and L. Yang, “RO-MAP: real-time multi-object mapping with neural radiance fields,” IEEE Robotics and Automation Letters, vol. 8, no. 9, pp. 5950–5957, 2023

Showing first 80 references.