
arxiv: 2506.01418 · v2 · pith:FAPMW3VGnew · submitted 2025-06-02 · 💻 cs.RO · cs.CV

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

Pith reviewed 2026-05-22 01:29 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords visual semantic navigation · semantic segmentation · sim-to-real transfer · robot navigation · object search · Habitat simulator · policy learning · real-world robotics

The pith

SEMNAV improves visual semantic navigation by using semantic segmentation maps instead of raw RGB images as input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that training navigation agents on semantic segmentation outputs rather than raw pixel images produces policies that reach target objects more reliably in environments the agent has never seen before. This matters because most current visual navigation systems are trained only inside simulators and often degrade when placed on real robots, due to differences in lighting, textures, and rendering. Because the model is fed explicit labels for objects and surfaces, the agent learns to focus on stable spatial relationships instead of brittle visual patterns. The authors support this by releasing a new dataset built for segmentation-aware navigation training. Experiments show higher success rates inside the Habitat simulator and a smaller sim-to-real performance drop when the same policies are run on physical robots.

Core claim

SEMNAV demonstrates that replacing raw RGB observations with semantic segmentation labels as the primary visual representation allows a navigation policy to achieve higher success rates when locating target objects in unseen environments. The model is trained in simulation using the HM3D dataset inside Habitat 2.0 and is then deployed on real robotic platforms, where the semantic input reduces the performance drop caused by visual domain differences between rendered scenes and actual camera footage.

What carries the argument

The SEMNAV model carries the argument: it takes semantic segmentation maps as its main visual input when learning policies that navigate toward target objects.
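To make "semantic segmentation maps as the main visual input" concrete, here is a minimal sketch of one possible encoding: an integer label map turned into a channels-first one-hot tensor that a policy network could consume. The function name `one_hot_semantic`, the class count, and the one-hot layout are illustrative assumptions; this page does not state how the SEMNAV authors actually encode the segmentation input.

```python
import numpy as np


def one_hot_semantic(label_map: np.ndarray, num_classes: int) -> np.ndarray:
    """Convert an (H, W) integer label map into a (num_classes, H, W) float32 array.

    Hypothetical helper for illustration; not taken from the SEMNAV code base.
    """
    if label_map.ndim != 2:
        raise ValueError("label_map must be 2-D (H, W)")
    if label_map.min() < 0 or label_map.max() >= num_classes:
        raise ValueError("label ids must lie in [0, num_classes)")
    # Row i of the identity matrix is the one-hot vector for class i.
    eye = np.eye(num_classes, dtype=np.float32)
    # Indexing yields (H, W, C); move the class axis to the front.
    return np.transpose(eye[label_map], (2, 0, 1))


# Toy 2x2 label map with three assumed classes (e.g. floor, chair, sofa).
labels = np.array([[0, 1], [2, 1]])
obs = one_hot_semantic(labels, num_classes=3)
print(obs.shape)
```

Unlike RGB values, each channel here answers only "is this pixel class k?", which is the property the page credits for reducing sensitivity to lighting and texture.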

If this is right

  • Higher success rates when locating objects in previously unseen simulated rooms using the HM3D dataset inside Habitat 2.0.
  • Narrowed performance gap between simulation training and real-robot execution because semantic labels are less affected by rendering differences than raw pixels.
  • Improved ability to navigate toward specific objects in practical settings after training only in simulation.
  • A new curated dataset that supports further work on navigation models that rely on semantic rather than pixel input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic input strategy could be tested on other robot tasks such as opening doors or placing objects where consistent object identity matters more than exact appearance.
  • Training with semantic maps might let teams collect less real-world data because policies transfer more readily from simulation.
  • The approach could be extended to environments with moving people or changing furniture to check whether segmentation still provides stable guidance.

Load-bearing premise

Semantic segmentation labels produced by an external model stay accurate enough in real-world scenes whose lighting, textures, and layouts differ from the simulator used for training.

What would settle it

Deploy the trained SEMNAV policy on a real robot in a new room where the segmentation network mislabels doors, furniture, or floors at high rates and measure whether success rates fall to the level of standard RGB-based models.
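A cheap simulated proxy for this test, before going to a real robot, is to corrupt the segmentation input at a controlled rate and watch how success rate responds. The sketch below flips a random fraction of pixels to a different class. The function name `corrupt_labels` and the per-pixel uniform noise model are assumptions made for illustration; real segmentation errors tend to affect whole regions, so this is a rough stand-in and not the experiment described above.

```python
import numpy as np


def corrupt_labels(label_map, num_classes, error_rate, rng):
    """Return a copy of label_map with about error_rate of pixels set to a wrong class.

    Hypothetical helper; requires num_classes >= 2.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("need at least two classes to mislabel a pixel")
    noisy = label_map.copy()
    # Choose which pixels to corrupt.
    mask = rng.random(label_map.shape) < error_rate
    # An offset in [1, num_classes - 1], taken modulo num_classes,
    # always lands on a class different from the original one.
    offsets = rng.integers(1, num_classes, size=label_map.shape)
    noisy[mask] = (label_map[mask] + offsets[mask]) % num_classes
    return noisy


rng = np.random.default_rng(0)
clean = np.zeros((4, 4), dtype=np.int64)
noisy = corrupt_labels(clean, num_classes=5, error_rate=0.5, rng=rng)
```

Sweeping `error_rate` and re-evaluating a trained policy at each level would show how quickly navigation degrades as labels get worse.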

Figures

Figures reproduced from arXiv: 2506.01418 by Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Rafael Flor-Rodríguez, Roberto J. López-Sastre, Sergio Lafuente-Arroyo.

Figure 1. Traditional VSN models are trained in simulation environments
Figure 2. Comparison between the HM3D dataset and the S…
Figure 3. Proposed architecture for the SEMNAV model. … episode is to enable an agent to navigate in a scene S_i from a set of available scenes S = {S_1, ..., S_n}, towards an object of a specific category c_i belonging to the category set C = {c_1, ..., c_m}, starting from an initial position p_0 in the navigation environment. For our SEMNAV model, we define the navigation task as follows. Given a target object class …
Figure 4. Top-down view of the house where OBJECTNAV experiments were conducted for five object categories. … adapted for the OBJECTNAV problem. This adaptation replicates the agent's characteristics during training. Among the modifications, we added a mast to the TurtleBot 2, raising the camera to 1.25 m to match the simulation setup. The camera used is an Orbbec Astra with depth perception. Since our SEMNAV model r…
Figure 5. Qualitative results in simulated environments. From top to bottom, …
Figure 6. Comparison of the SR reported in the real world and the simulation
Figure 7. Qualitative results of the robot successfully navigating in the real world toward a sofa, a television, and a chair.

Original abstract

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SEMNAV, a visual semantic navigation model that replaces raw RGB inputs with semantic segmentation maps to improve policy robustness and sim-to-real transfer. It introduces a curated SEMNAV dataset for training such models and reports superior success rates over prior VSN methods in Habitat 2.0 using HM3D, together with real-robot trials that attribute gains to the semantic representation.

Significance. If the performance gains prove robust and the segmentation assumption holds on real imagery, the work would offer a practical route to reducing domain shift in robotic navigation without heavy reliance on image-level adaptation techniques. The release of code, datasets, and a segmentation-aware benchmark would be a useful community resource for VSN research.

major comments (2)
  1. [Real-world Experiments] Real-world Experiments section: The claim that semantic segmentation mitigates the sim-to-real gap is load-bearing yet rests on an untested assumption. No mIoU, per-class accuracy, or other quantitative segmentation metrics are supplied for the external model's output on the actual real-robot camera images; without these, it is unclear whether the observed success-rate improvements stem from the semantic input or are confounded by segmentation errors or other factors.
  2. [Experimental Results] Experimental Results (tables reporting success rate, SPL, etc.): The abstract states higher success rates than SOTA VSN models, but the manuscript supplies no error bars, ablation studies isolating the segmentation component, or statistical tests across random seeds. This weakens the central empirical claim that the approach outperforms existing methods under standard controls.
minor comments (2)
  1. [§3.1] The notation for the navigation policy input (segmentation map versus RGB) should be defined explicitly in §3.1 to avoid ambiguity when comparing to prior RGB-only baselines.
  2. [Figures] Figure captions for the real-robot setup could clarify the exact camera intrinsics and lighting conditions used, aiding reproducibility.
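For readers unfamiliar with the two metrics named in the major comments, the sketch below computes success rate (SR) and SPL (Success weighted by Path Length) using the standard embodied-navigation definition: SPL averages, over all episodes, the success indicator times the ratio of shortest-path length to the longer of the agent's path and the shortest path. The `Episode` container and its field names are assumptions for illustration, and the episode values are toy numbers, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative."""
    success: bool
    shortest_path: float  # geodesic start-to-goal distance, assumed > 0
    agent_path: float     # distance the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by (normalized inverse) Path Length."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute 0; successful ones are scaled by path efficiency.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy episodes for illustration only.
episodes = [
    Episode(True, 5.0, 10.0),
    Episode(True, 4.0, 4.0),
    Episode(False, 3.0, 9.0),
    Episode(True, 2.0, 8.0),
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.4375
```

SPL can never exceed SR, so reporting both shows whether an agent that succeeds often also takes efficient routes.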

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the work.

point-by-point responses
  1. Referee: [Real-world Experiments] Real-world Experiments section: The claim that semantic segmentation mitigates the sim-to-real gap is load-bearing yet rests on an untested assumption. No mIoU, per-class accuracy, or other quantitative segmentation metrics are supplied for the external model's output on the actual real-robot camera images; without these, it is unclear whether the observed success-rate improvements stem from the semantic input or are confounded by segmentation errors or other factors.

    Authors: We agree that quantitative segmentation metrics on real-robot images would strengthen attribution of the performance gains. However, ground-truth semantic annotations are not available for the real-world camera images used in our experiments, precluding computation of mIoU or per-class accuracy. In the revised manuscript we will add qualitative visualizations of segmentation outputs on representative real-robot images together with a discussion of observed segmentation quality and potential error sources. We believe the combination of these visualizations and the reported real-world success-rate improvements still supports the value of semantic inputs, while acknowledging the limitation noted by the referee. revision: partial

  2. Referee: [Experimental Results] Experimental Results (tables reporting success rate, SPL, etc.): The abstract states higher success rates than SOTA VSN models, but the manuscript supplies no error bars, ablation studies isolating the segmentation component, or statistical tests across random seeds. This weakens the central empirical claim that the approach outperforms existing methods under standard controls.

    Authors: We acknowledge that error bars, ablations, and statistical tests would provide stronger empirical support. In the revision we will re-run the simulation experiments across multiple random seeds, add error bars to the success-rate and SPL tables, include an ablation comparing semantic-segmentation inputs against RGB inputs, and report statistical significance tests (e.g., paired t-tests) on the performance differences. revision: yes
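The multi-seed comparison promised in the second response could be sketched as follows, using only the Python standard library. The function computes the paired t statistic for per-seed success rates of two models evaluated with the same seeds; a p-value would additionally need the t distribution, for example via `scipy.stats.ttest_rel`. The function names are hypothetical and the per-seed values are placeholder numbers for illustration, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, suitable for table error bars."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample std of the per-seed differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder success rates over five seeds (NOT from the paper).
semantic_sr = [0.62, 0.60, 0.65, 0.61, 0.63]
rgb_sr = [0.55, 0.56, 0.58, 0.54, 0.57]

t_stat, dof = paired_t_statistic(semantic_sr, rgb_sr)
print(mean_and_std(semantic_sr), mean_and_std(rgb_sr), t_stat, dof)
```

Pairing by seed removes seed-to-seed variance that both models share, which is why a paired test suits this comparison better than an unpaired one.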

standing simulated objections not resolved
  • Quantitative mIoU and per-class accuracy for the external segmentation model on real-robot images remain unreported, because no ground-truth annotations exist for those images.
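If ground-truth masks were annotated for even a small sample of real-robot frames, the missing metric could be computed along the lines of the sketch below. It builds a confusion matrix from ground-truth and predicted label maps and derives per-class IoU and mean IoU. The function names and the toy arrays are assumptions for illustration; predictions are assumed to lie in `[0, num_classes)`, and ground-truth pixels outside that range are ignored.

```python
import numpy as np


def confusion_matrix(gt: np.ndarray, pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index ground-truth class, columns index predicted class."""
    gt = gt.ravel()
    pred = pred.ravel()
    # Skip pixels whose ground-truth id is outside the class range (e.g. "void").
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid] + pred[valid]
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def per_class_iou(cm: np.ndarray) -> np.ndarray:
    """IoU per class; NaN for classes absent from both ground truth and prediction."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm)
    iou = np.full(cm.shape[0], np.nan)
    np.divide(tp, union, out=iou, where=union > 0)
    return iou


def mean_iou(cm: np.ndarray) -> float:
    """Mean IoU over the classes that actually occur."""
    return float(np.nanmean(per_class_iou(cm)))


# Toy 2x2 example with two classes.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(gt, pred, num_classes=2)
print(per_class_iou(cm), mean_iou(cm))
```

Accumulating one confusion matrix over all annotated frames, then calling `mean_iou` once, avoids the bias of averaging per-image scores.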

Circularity Check

0 steps flagged

No circularity: empirical performance claims on held-out tests


The paper describes an empirical ML system (SEMNAV) that replaces RGB input with semantic segmentation labels, trains a navigation policy on a curated dataset, and reports success rates on held-out Habitat 2.0 / HM3D episodes plus real-robot trials. These are measured outcomes from standard train/test splits and physical experiments, not quantities obtained by fitting a parameter to a subset and then relabeling the same quantity as a prediction, nor by self-defining a metric in terms of itself. No equations or uniqueness theorems are invoked that reduce the central claim to a self-citation chain or an ansatz smuggled from prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the external availability of a semantic segmentation model whose outputs are treated as reliable ground truth for policy training; no new physical axioms or free parameters are introduced beyond standard deep-RL hyperparameters.

axioms (1)
  • domain assumption: Semantic segmentation labels produced by an off-the-shelf model are sufficiently accurate and domain-invariant for policy learning in both simulation and real environments.
    The method description treats segmentation maps as the primary visual input without quantifying label noise or domain shift in the real-robot experiments.
invented entities (2)
  • SEMNAV model (no independent evidence)
    purpose: Navigation policy that consumes semantic segmentation instead of RGB
    The model is a learned neural network; no independent physical evidence is supplied beyond the reported success rates.
  • SEMNAV dataset (no independent evidence)
    purpose: Curated collection of simulator scenes paired with semantic labels for training
    The dataset is newly introduced by the authors; its value depends on the quality of the underlying simulator and labeler.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Designing Privacy-Preserving Visual Perception for Robot Navigation Based on User Privacy Preferences

    cs.RO · 2026-04 · unverdicted · novelty 5.0

    User studies reveal preferences for visual abstractions and distance-dependent low-resolution capture, leading to a configurable privacy policy for robot navigation.
