SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

Carlos Gutiérrez-Álvarez; Francisco Javier Acevedo-Rodríguez; Rafael Flor-Rodríguez; Roberto J. López-Sastre; Sergio Lafuente-Arroyo

arxiv: 2506.01418 · v2 · pith:FAPMW3VGnew · submitted 2025-06-02 · 💻 cs.RO · cs.CV


Pith reviewed 2026-05-22 01:29 UTC · model grok-4.3


keywords: visual semantic navigation, semantic segmentation, sim-to-real transfer, robot navigation, object search, Habitat simulator, policy learning, real-world robotics


The pith

SEMNAV improves visual semantic navigation by using semantic segmentation maps instead of raw RGB images as input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that training navigation agents on semantic segmentation outputs rather than raw pixel images produces policies that reach target objects more reliably in environments the agent has never seen before. This matters because most current visual navigation systems are trained only inside simulators and then fail when placed on real robots due to differences in lighting, textures, and rendering. By feeding the model explicit labels for objects and surfaces, the agent learns to focus on stable spatial relationships instead of brittle visual patterns. The authors support this by releasing a new dataset built for segmentation-aware navigation training. Experiments show higher success rates inside the Habitat simulator and clearer transfer when the same policies are run on physical robots.

Core claim

SEMNAV demonstrates that replacing raw RGB observations with semantic segmentation labels as the primary visual representation allows a navigation policy to achieve higher success rates when locating target objects in unseen environments. The model is trained in simulation using the HM3D dataset inside Habitat 2.0 and is then deployed on real robotic platforms, where the semantic input reduces the performance drop caused by visual domain differences between rendered scenes and actual camera footage.

What carries the argument

The argument rests on the SEMNAV model, which takes semantic segmentation maps as its main visual input when learning to navigate toward target objects.
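As a rough illustration of what "semantic segmentation maps as visual input" can mean in practice, the sketch below turns a per-pixel map of class ids into a one-hot tensor that a policy network could consume. This is one plausible encoding, not necessarily the one the authors use; the class list `CLASSES` and the function name `one_hot_segmentation` are hypothetical.

```python
from typing import List

# Hypothetical label set for illustration; the paper's category list may differ.
CLASSES = ["floor", "wall", "chair", "sofa", "tv"]


def one_hot_segmentation(
    label_map: List[List[int]], num_classes: int
) -> List[List[List[float]]]:
    """Convert an H x W map of class ids into a C x H x W one-hot tensor."""
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    # One binary plane per class, initialised to zero.
    planes = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]
    for y, row in enumerate(label_map):
        for x, class_id in enumerate(row):
            if not 0 <= class_id < num_classes:
                raise ValueError(f"class id {class_id} outside [0, {num_classes})")
            planes[class_id][y][x] = 1.0
    return planes


# A tiny 2 x 3 "image": a wall on the top row, a chair standing on the floor.
label_map = [
    [1, 1, 1],
    [0, 2, 0],
]
obs = one_hot_segmentation(label_map, len(CLASSES))
```

The point of such an encoding is that texture and lighting never reach the policy: two scenes with the same layout of labels produce the same observation, whether rendered or photographed.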

If this is right

Higher success rates when locating objects in previously unseen simulated rooms using the HM3D dataset inside Habitat 2.0.
Narrowed performance gap between simulation training and real-robot execution because semantic labels are less affected by rendering differences than raw pixels.
Improved ability to navigate toward specific objects in practical settings after training only in simulation.
A new curated dataset that supports further work on navigation models that rely on semantic rather than pixel input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same semantic input strategy could be tested on other robot tasks such as opening doors or placing objects where consistent object identity matters more than exact appearance.
Training with semantic maps might let teams collect less real-world data because policies transfer more readily from simulation.
The approach could be extended to environments with moving people or changing furniture to check whether segmentation still provides stable guidance.

Load-bearing premise

Semantic segmentation labels produced by an external model stay accurate enough in real-world scenes whose lighting, textures, and layouts differ from the simulator used for training.

What would settle it

Deploy the trained SEMNAV policy on a real robot in a new room where the segmentation network mislabels doors, furniture, or floors at high rates and measure whether success rates fall to the level of standard RGB-based models.
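A cheap proxy for this experiment can be run in simulation before touching a robot: corrupt the segmentation labels at a controlled rate and watch how the success rate degrades. The sketch below shows only the corruption step, under the simplifying assumption that errors are independent per pixel (real segmentation errors are spatially correlated, so this understates the difficulty). The function name `corrupt_labels` and its parameters are hypothetical.

```python
import random
from typing import List


def corrupt_labels(
    label_map: List[List[int]],
    num_classes: int,
    error_rate: float,
    rng: random.Random,
) -> List[List[int]]:
    """Return a copy of label_map in which each pixel is replaced, with
    probability error_rate, by a uniformly chosen *different* class id.

    Assumes num_classes >= 2 and every id lies in [0, num_classes).
    """
    corrupted = []
    for row in label_map:
        new_row = []
        for class_id in row:
            if rng.random() < error_rate:
                # Draw from the num_classes - 1 other labels, skipping class_id.
                wrong = rng.randrange(num_classes - 1)
                if wrong >= class_id:
                    wrong += 1
                new_row.append(wrong)
            else:
                new_row.append(class_id)
        corrupted.append(new_row)
    return corrupted


rng = random.Random(0)  # fixed seed so a sweep is repeatable
clean = [[0, 1, 2], [2, 1, 0]]
noisy = corrupt_labels(clean, num_classes=3, error_rate=0.3, rng=rng)
```

Sweeping `error_rate` and plotting success rate against it would show at what label-noise level the semantic policy loses its advantage over an RGB baseline.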

Figures

Figures reproduced from arXiv: 2506.01418 by Carlos Guti\'errez-\'Alvarez, Francisco Javier Acevedo-Rodr\'iguez, Rafael Flor-Rodr\'iguez, Roberto J. L\'opez-Sastre, Sergio Lafuente-Arroyo.

**Figure 2.** Comparison between the HM3D dataset and the S…

**Figure 3.** Proposed architecture for the SEMNAV model.
Excerpt from the surrounding text: … episode is to enable an agent to navigate in a scene S_i from a set of available scenes S = {S_1, …, S_n}, towards an object of a specific category c_i belonging to the category set C = {c_1, …, c_m}, starting from an initial position p_0 in the navigation environment. For our SEMNAV model, we define the navigation task as follows. Given a target object class …

**Figure 4.** Top-down view of the house where OBJECTNAV experiments were conducted for five object categories.
Excerpt from the surrounding text: … adapted for the OBJECTNAV problem. This adaptation replicates the agent's characteristics during training. Among the modifications, we added a mast to the TurtleBot 2, raising the camera to 1.25 m to match the simulation setup. The camera used is an Orbbec Astra with depth perception. Since our SEMNAV model r…

**Figure 5.** Qualitative results in simulated environments. From top to bottom, …

**Figure 6.** Comparison of the SR reported in the real world and the simulation

**Figure 7.** Qualitative results of the robot successfully navigating in the real world toward a sofa, a television, and a chair.

Original abstract

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SEMNAV swaps RGB for semantic segmentation in object-goal navigation and reports better sim success plus some real-robot transfer, but skips any check on how accurate those segmentations actually are on the real camera images.


The main thing to know is that SEMNAV reports better results in visual semantic navigation by feeding the policy semantic segmentation masks rather than raw RGB images. The change appears to help both in Habitat simulation and when the policy is transferred to a real robot platform. The authors put together a new dataset, also called SEMNAV, for this kind of training and evaluate against prior VSN models. They make the code and data available on GitHub, which is straightforward and helpful for others.

The real strength is the inclusion of physical robot experiments. Many papers stay in simulation only, so seeing some transfer to hardware adds value. The core idea of stripping away low-level visual details such as textures to focus on semantics is a sensible way to reduce the domain gap.

On the downside, the abstract talks about higher success rates but offers no tables, specific percentages, or ablation studies. We also lack any numbers on the accuracy of the semantic segmentation itself when run on the real-world camera feeds. That is the load-bearing part: if the segmentation model struggles in unfamiliar real environments, the navigation policy receives unreliable features and the reported improvements could be overstated. The stress-test note on this point looks on target.

This work would interest researchers in robotic navigation who are trying to close the sim-to-real gap. A reader already familiar with Habitat and object-goal tasks could pick up the dataset or try the input swap in their own setup. I would send this to peer review. The combination of simulation benchmarks, real-robot trials, and public resources is enough to justify referee time, though the authors should add quantitative checks on segmentation performance in the real setting.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SEMNAV, a visual semantic navigation model that replaces raw RGB inputs with semantic segmentation maps to improve policy robustness and sim-to-real transfer. It introduces a curated SEMNAV dataset for training such models and reports superior success rates over prior VSN methods in Habitat 2.0 using HM3D, together with real-robot trials that attribute gains to the semantic representation.

Significance. If the performance gains prove robust and the segmentation assumption holds on real imagery, the work would offer a practical route to reducing domain shift in robotic navigation without heavy reliance on image-level adaptation techniques. The release of code, datasets, and a segmentation-aware benchmark would be a useful community resource for VSN research.

major comments (2)

[Real-world Experiments] Real-world Experiments section: The claim that semantic segmentation mitigates the sim-to-real gap is load-bearing yet rests on an untested assumption. No mIoU, per-class accuracy, or other quantitative segmentation metrics are supplied for the external model’s output on the actual real-robot camera images; without these, observed success-rate improvements cannot be confidently attributed to the semantic input rather than to segmentation errors or other factors.
[Experimental Results] Experimental Results (tables reporting success rate, SPL, etc.): The abstract states higher success rates than SOTA VSN models, but the manuscript supplies no error bars, ablation studies isolating the segmentation component, or statistical tests across random seeds. This weakens the central empirical claim that the approach outperforms existing methods under standard controls.
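The segmentation metrics requested in the first major comment are standard and cheap to compute once a small set of real-robot frames has been hand-labelled. A minimal sketch of per-class IoU and mIoU follows, assuming predictions and ground truth are given as equally sized maps of integer class ids; the function names are hypothetical and the toy arrays are made up for illustration, not taken from the paper.

```python
from typing import List, Optional


def per_class_iou(
    pred: List[List[int]], truth: List[List[int]], num_classes: int
) -> List[Optional[float]]:
    """Intersection-over-union for each class; None if the class appears in
    neither the prediction nor the ground truth."""
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    truth_count = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_count[p] += 1
            truth_count[t] += 1
            if p == t:
                intersection[p] += 1
    ious: List[Optional[float]] = []
    for c in range(num_classes):
        union = pred_count[c] + truth_count[c] - intersection[c]
        ious.append(intersection[c] / union if union else None)
    return ious


def mean_iou(ious: List[Optional[float]]) -> float:
    """Average IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    if not present:
        raise ValueError("no class present in prediction or ground truth")
    return sum(present) / len(present)


# Toy 2 x 2 example with three classes; class 2 never occurs.
truth = [[0, 0], [1, 1]]
pred = [[0, 1], [1, 1]]
ious = per_class_iou(pred, truth, num_classes=3)
miou = mean_iou(ious)
```

Here class 0 has IoU 1/2, class 1 has IoU 2/3, and class 2 is skipped, so the mean is 7/12. Reporting the per-class values matters for this paper in particular, because a low IoU on the target categories would undermine the attribution of real-world gains to the semantic input.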

minor comments (2)

[§3.1] The notation for the navigation policy input (segmentation map versus RGB) should be defined explicitly in §3.1 to avoid ambiguity when comparing to prior RGB-only baselines.
[Figures] Figure captions for the real-robot setup could clarify the exact camera intrinsics and lighting conditions used, aiding reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the work.


Referee: [Real-world Experiments] Real-world Experiments section: The claim that semantic segmentation mitigates the sim-to-real gap is load-bearing yet rests on an untested assumption. No mIoU, per-class accuracy, or other quantitative segmentation metrics are supplied for the external model’s output on the actual real-robot camera images; without these, observed success-rate improvements cannot be confidently attributed to the semantic input rather than to segmentation errors or other factors.

Authors: We agree that quantitative segmentation metrics on real-robot images would strengthen attribution of the performance gains. However, ground-truth semantic annotations are not available for the real-world camera images used in our experiments, precluding computation of mIoU or per-class accuracy. In the revised manuscript we will add qualitative visualizations of segmentation outputs on representative real-robot images together with a discussion of observed segmentation quality and potential error sources. We believe the combination of these visualizations and the reported real-world success-rate improvements still supports the value of semantic inputs, while acknowledging the limitation noted by the referee. Revision: partial.
Referee: [Experimental Results] Experimental Results (tables reporting success rate, SPL, etc.): The abstract states higher success rates than SOTA VSN models, but the manuscript supplies no error bars, ablation studies isolating the segmentation component, or statistical tests across random seeds. This weakens the central empirical claim that the approach outperforms existing methods under standard controls.

Authors: We acknowledge that error bars, ablations, and statistical tests would provide stronger empirical support. In the revision we will re-run the simulation experiments across multiple random seeds, add error bars to the success-rate and SPL tables, include an ablation comparing semantic-segmentation inputs against RGB inputs, and report statistical significance tests (e.g., paired t-tests) on the performance differences. Revision: yes.
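For readers unfamiliar with the quantities in this exchange, the sketch below computes success rate (SR), SPL (success weighted by path length, using its usual definition: the mean over episodes of success times shortest-path length divided by the larger of the agent's path length and the shortest-path length), and a paired t-statistic over per-seed scores. It is an illustration of the promised analysis, not the authors' evaluation code; the `Episode` type and function names are hypothetical, and all numbers are made up.

```python
import math
import statistics
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Episode:
    success: bool
    shortest_path: float  # geodesic start-to-goal distance in metres, assumed > 0
    agent_path: float     # distance the agent actually travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length; failed episodes contribute zero."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b.

    Needs at least two pairs with non-identical differences. The p-value
    (n - 1 degrees of freedom) is left to a statistics package.
    """
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Made-up episodes and per-seed success rates, for illustration only.
episodes: List[Episode] = [
    Episode(True, 5.0, 10.0),
    Episode(True, 4.0, 4.0),
    Episode(False, 6.0, 3.0),
]
sr = success_rate(episodes)
spl_value = spl(episodes)
t_stat = paired_t_statistic([0.6, 0.7, 0.8], [0.5, 0.5, 0.7])
```

Pairing by seed (same seed, segmentation input versus RGB input) removes seed-to-seed variance from the comparison, which is why a paired test is the natural choice for the ablation the authors describe.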

standing simulated objections not resolved

Quantitative mIoU and per-class accuracy for the external segmentation model on real-robot images, due to the absence of ground-truth annotations for those images.

Circularity Check

0 steps flagged

No circularity: empirical performance claims on held-out tests


The paper describes an empirical ML system (SEMNAV) that replaces RGB input with semantic segmentation labels, trains a navigation policy on a curated dataset, and reports success rates on held-out Habitat 2.0 / HM3D episodes plus real-robot trials. These are measured outcomes from standard train/test splits and physical experiments, not quantities obtained by fitting a parameter to a subset and then relabeling the same quantity as a prediction, nor by self-defining a metric in terms of itself. No equations or uniqueness theorems are invoked that reduce the central claim to a self-citation chain or an ansatz smuggled from prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the external availability of a semantic segmentation model whose outputs are treated as reliable ground truth for policy training; no new physical axioms or free parameters are introduced beyond standard deep-RL hyperparameters.

axioms (1)

Domain assumption: Semantic segmentation labels produced by an off-the-shelf model are sufficiently accurate and domain-invariant for policy learning in both simulation and real environments.
The method description treats segmentation maps as the primary visual input without quantifying label noise or domain shift in the real-robot experiments.

invented entities (2)

SEMNAV model (no independent evidence)
purpose: Navigation policy that consumes semantic segmentation instead of RGB
The model is a learned neural network; no independent physical evidence is supplied beyond the reported success rates.

SEMNAV dataset (no independent evidence)
purpose: Curated collection of simulator scenes paired with semantic labels for training
The dataset is newly introduced by the authors; its value depends on the quality of the underlying simulator and labeler.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

our model learns robust navigation policies that improve generalization across unseen environments... by explicitly incorporating this type of high-level semantic information

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Designing Privacy-Preserving Visual Perception for Robot Navigation Based on User Privacy Preferences
cs.RO · 2026-04 · unverdicted · novelty 5.0

User studies reveal preferences for visual abstractions and distance-dependent low-resolution capture, leading to a configurable privacy policy for robot navigation.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Simultaneous localization and mapping: part i,

H. Durrant-Whyte and T. Bailey, “Simultaneous localization and mapping: part i,” IEEE Robotics and Automation Magazine , vol. 13, no. 2, pp. 99–110, 2006

work page 2006
[2]

Obvi-slam: Long-term object- visual slam,

A. Adkins, T. Chen, and J. Biswas, “Obvi-slam: Long-term object- visual slam,” IEEE Robotics and Automation Letters , vol. 9, no. 3, pp. 2909–2916, 2024

work page 2024
[3]

Kimera: an open- source library for real-time metric-semantic localization and mapping,

A. Rosinol, M. Abate, Y . Chang, and L. Carlone, “Kimera: an open- source library for real-time metric-semantic localization and mapping,” ICRA, 2020

work page 2020
[4]

Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,

C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics , vol. 32, no. 6, pp. 1309–1332, 2016

work page 2016
[5]

Habitat-Web: Learning Embodied Object-Search Strategies from Human Demon- strations at Scale,

R. Ramrakhya, E. Undersander, D. Batra, and A. Das, “Habitat-Web: Learning Embodied Object-Search Strategies from Human Demon- strations at Scale,” in CVPR, 2022

work page 2022
[6]

Offline visual representation learning for embodied navigation,

K. Yadav, R. Ramrakhya, A. Majumdar, V .-P. Berges, S. Kuhar, D. Ba- tra, A. Baevski, and O. Maksymets, “Offline visual representation learning for embodied navigation,” in ICLR, 2023

work page 2023
[7]

Ob- ject Goal Navigation using Goal-Oriented Semantic Exploration,

D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov, “Ob- ject Goal Navigation using Goal-Oriented Semantic Exploration,” in NeurIPS, 2020

work page 2020
[8]

Semantic Visual Navigation by Watching Youtube Videos,

M. Chang, A. Gupta, and S. Gupta, “Semantic Visual Navigation by Watching Youtube Videos,” in NeurIPS, 2020

work page 2020
[9]

DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames,

E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, “DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames,” in ICLR, 2020

work page 2020
[10]

Multi-agent embodied visual semantic navigation with scene prior knowledge,

X. Liu, D. Guo, H. Liu, and F. Sun, “Multi-agent embodied visual semantic navigation with scene prior knowledge,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3154–3161, 2022

work page 2022
[11]

HM3D- OVON: A dataset and benchmark for open-vocabulary object goal navigation,

N. Yokoyama, R. Ramrakhya, A. Das, D. Batra, and S. Ha, “HM3D- OVON: A dataset and benchmark for open-vocabulary object goal navigation,” IROS, 2024

work page 2024
[12]

Auxiliary tasks and exploration enable ObjectGoal navigation,

J. Ye, D. Batra, A. Das, and E. Wijmans, “Auxiliary tasks and exploration enable ObjectGoal navigation,” in ICCV, 2021

work page 2021
[13]

Visual semantic navigation using scene priors,

W. Yang, X. Wang, A. Farhadi, A. K. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” ICLR, 2018

work page 2018
[14]

An object-driven navigation strategy based on active perception and semantic association,

Y . Guo, J. Sun, R. Zhang, Z. Jiang, Z. Mi, C. Yao, X. Ban, and M. S. Obaidat, “An object-driven navigation strategy based on active perception and semantic association,” IEEE Robotics and Automation Letters, vol. 9, no. 8, pp. 7110–7117, 2024

work page 2024
[15]

Semantic policy network for zero-shot object goal visual navigation,

Q. Zhao, L. Zhang, B. He, and Z. Liu, “Semantic policy network for zero-shot object goal visual navigation,” IEEE Robotics and Automation Letters, vol. 8, no. 11, pp. 7655–7662, 2023

work page 2023
[16]

Habitat challenge 2023,

K. Yadav, J. Krantz, R. Ramrakhya, S. K. Ramakrishnan, J. Yang, A. Wang, J. Turner, A. Gokaslan, V .-P. Berges, R. Mootaghi, O. Maksymets, A. X. Chang, M. Savva, A. Clegg, D. S. Chaplot, and D. Batra, “Habitat challenge 2023,” https://aihabitat.org/challenge/ 2023/, 2023

work page 2023
[17]

Habitat 2.0: Training home assistants to rearrange their habitat,

A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y . Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets, A. Gokaslan, V . V ondruˇs, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, Fig. 7. Qualitative results of the robot successfully navigating in the real world toward a sofa, a television, and a chair. V . Koltun, J. Malik, M. Savva, ...

work page 2021
[18]

AI2-THOR: An Interactive 3D Environment for Visual AI

E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, M. Deitke, K. Ehsani, D. Gordon, Y . Zhu, A. Kembhavi, A. K. Gupta, and A. Farhadi, “Ai2-thor: An interactive 3d environment for visual ai,” ArXiv, vol. abs/1712.05474, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation,

M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, J. Salvador, K. Ehsani, W. Han, E. Kolve, A. Farhadi, A. Kembhavi, and R. Mottaghi, “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation,” in NeurIPS, 2022, outstanding Paper Award

work page 2022
[20]

Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI,

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y . Zhao, and D. Batra, “Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI,” in NeurIPS, 2021

work page 2021
[21]

Habitat-matterport 3d semantics dataset,

K. Yadav, R. Ramrakhya, S. K. Ramakrishnan, T. Gervet, J. Turner, A. Gokaslan, N. Maestre, A. X. Chang, D. Batra, M. Savva et al. , “Habitat-matterport 3d semantics dataset,” arXiv preprint arXiv:2210.05633, 2022

work page arXiv 2022
[22]

Navigating to Objects in the Real World,

T. Gervet, S. Chintala, D. Batra, J. Malik, and D. S. Chaplot, “Navigating to Objects in the Real World,” Science Robotics , 2022

work page 2022
[23]

Exploitation- guided exploration for semantic embodied navigation,

J. Wasserman, G. Chowdhary, A. Gupta, and U. Jain, “Exploitation- guided exploration for semantic embodied navigation,” ICRA, 2024

work page 2024
[24]

Visual semantic navi- gation with real robots,

C. Guti ´errez- ´Alvarez, P. R ´ıos-Navarro, R. Flor-Rodr ´ıguez, F. J. Acevedo-Rodr´ıguez, and R. J. L ´opez-Sastre, “Visual semantic navi- gation with real robots,” Applied Intelligence , vol. 55, 2025

work page 2025
[25]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

C. Campos, R. Elvira, J. J. G. Rodr ´ıguez, J. M. M. Montiel, and J. D. Tard ´os, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics , 2021

work page 2021
[26]

Slam++: Simultaneous localisation and mapping at the level of objects,

R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359

work page 2013
[27]

Deepfactors: Real-time probabilistic dense monocular slam,

J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, “Deepfactors: Real-time probabilistic dense monocular slam,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 721–728, 2020

work page 2020
[28]

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning,

Y . Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning,” in ICLR, 2017

work page 2017
[29]

Learning semantic-agnostic and spatial-aware representation for gen- eralizable visual-audio navigation,

H. Wang, Y . Wang, F. Zhong, M. Wu, J. Zhang, Y . Wang, and H. Dong, “Learning semantic-agnostic and spatial-aware representation for gen- eralizable visual-audio navigation,” IEEE Robotics and Automation Letters, 2023

work page 2023
[30]

Multi-goal audio-visual navigation us- ing sound direction map,

H. Kondoh and A. Kanezaki, “Multi-goal audio-visual navigation us- ing sound direction map,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 2023, pp. 5219–5226

work page 2023
[31]

HSP- Nav: Hierarchical scene prior learning for visual semantic navigation towards real settings,

J. Kang, B. Chen, P. Zhong, H. Yang, Y . Sheng, and J. Wang, “HSP- Nav: Hierarchical scene prior learning for visual semantic navigation towards real settings,” ICRA, 2024

work page 2024
[32]

Enhancing scene under- standing for vision-and-language navigation by knowledge awareness,

F. Gao, J. Tang, J. Wang, S. Li, and J. Yu, “Enhancing scene under- standing for vision-and-language navigation by knowledge awareness,” IEEE Robotics and Automation Letters , vol. 9, no. 12, pp. 10 874– 10 881, 2024

work page 2024
[33]

Safe-vln: Collision avoidance for vision-and-language navigation of autonomous robots operating in continuous environments,

L. Yue, D. Zhou, L. Xie, F. Zhang, Y . Yan, and E. Yin, “Safe-vln: Collision avoidance for vision-and-language navigation of autonomous robots operating in continuous environments,” IEEE Robotics and Automation Letters, vol. 9, no. 6, pp. 4918–4925, 2024

work page 2024
[34]

Boosting efficient reinforcement learning for vision-and-language navigation with open- sourced llm,

J. Wang, T. Wang, W. Cai, L. Xu, and C. Sun, “Boosting efficient reinforcement learning for vision-and-language navigation with open- sourced llm,” IEEE Robotics and Automation Letters , vol. 10, no. 1, pp. 612–619, 2025

work page 2025
[35]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems , 2020

work page 2020
[36]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) , 2022

work page 2022
[37]

ViNT: A foundation model for visual navigation,

D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A foundation model for visual navigation,” in 7th Conference on Robot Learning (CoRL) , 2023, pp. 1–23

work page 2023
[38]

Flownav: Combining flow matching and depth priors for efficient navigation,

S. Gode, A. Nayak, D. N. P. Oliveira, M. Krawez, C. Schmid, and W. Burgard, “Flownav: Combining flow matching and depth priors for efficient navigation,” 2025. [Online]. Available: https: //arxiv.org/abs/2411.09524

work page arXiv 2025
[39]

Visual navigation using a webcam based on semantic segmentation for indoor robots,

M. Adachi, S. Shatari, and R. Miyamoto, “Visual navigation using a webcam based on semantic segmentation for indoor robots,” in 2019 15th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS) , 2019, pp. 15–21

work page 2019
[40]

Practical implementation of visual navigation based on semantic segmentation for human-centric environments,

M. Adachi, K. Honda, J. Xue, H. Sudo, Y . Ueda, Y . Yuda, M. Wada, and R. Miyamoto, “Practical implementation of visual navigation based on semantic segmentation for human-centric environments,” Journal of Robotics and Mechatronics , vol. 35, no. 6, pp. 1419–1434, 2023

work page 2023
[41]

Visual representations for semantic target driven naviga- tion,

A. Mousavian, A. Toshev, M. Fi ˇser, J. Ko ˇseck´a, A. Wahid, and J. Davidson, “Visual representations for semantic target driven naviga- tion,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8846–8852

work page 2019
[42]

Indoor segmenta- tion and support inference from rgbd images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmenta- tion and support inference from rgbd images,” inEuropean Conference on Computer Vision , 2012

work page 2012
[43]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016, pp. 770–778

work page 2016
[44]

Emerging properties in self-supervised vision trans- formers,

M. Caron, H. Touvron, I. Misra, H. J’egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision trans- formers,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640, 2021

work page 2021
[45]

Learning phrase representations using RNN encoder–decoder for statistical machine translation,

K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Pro- ceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) . Association for Computational Linguistics, Oct. 2014, pp. 1724–1734

work page 2014
[46]

ROS wrapper for Kobuki base Turtlebot 2,

K. Ltd., “ROS wrapper for Kobuki base Turtlebot 2,” 2023. [Online]. Available: https://github.com/yujinrobot/kobuki.git

work page 2023
[47]

Efficient RGB-D semantic segmentation for indoor scene analysis,

D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, and H.-M. Groß, “Efficient RGB-D semantic segmentation for indoor scene analysis,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13525–13531, 2021

work page 2021
[48]

PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav,

R. Ramrakhya, D. Batra, E. Wijmans, and A. Das, “PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav,” in CVPR, 2023

work page 2023
[49]

DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames,

E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, “DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames,” in ICLR, 2019

work page 2019
[50]

OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav,

K. Yadav, A. Majumdar, R. Ramrakhya, N. Yokoyama, A. Baevski, Z. Kira, O. Maksymets, and D. Batra, “OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav,” arXiv preprint arXiv:2303.07798, 2023

work page arXiv 2023
[51]

MOPA: Modular object navigation with pointgoal agents,

S. Raychaudhuri, T. Campari, U. Jain, M. Savva, and A. X. Chang, “MOPA: Modular object navigation with pointgoal agents,” in WACV, 2024

work page 2024
[52]

HomeRobot: Open-vocabulary mobile manipulation,

S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet, T.-Y. Yang, V. Jain, A. W. Clegg, J. Turner, Z. Kira, M. Savva, A. Chang, D. S. Chaplot, D. Batra, R. Mottaghi, Y. Bisk, and C. Paxton, “HomeRobot: Open-vocabulary mobile manipulation,” 2024

work page 2024

