ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

Defeng Gu; Haosen Wang; Kai Li; Liaoyuan Fan; Lutao Jiang; Tingbang Liang; Wenjian Hou; Yibin Wen; Yinqiang Zhang; Yizhou Zhao; Zhenyang Li; Zongqi He

arxiv: 2605.09869 · v2 · pith:BQXXPF2Ynew · submitted 2026-05-11 · 💻 cs.RO · cs.CV

Haosen Wang, Zhenyang Li, Yinqiang Zhang, Zongqi He, Lutao Jiang, Kai Li, Yizhou Zhao, Liaoyuan Fan, Wenjian Hou, Tingbang Liang, Yibin Wen, Defeng Gu

Pith reviewed 2026-05-19 18:02 UTC · model grok-4.3

keywords: zero-shot object navigation · action consistency gap · semantic executive · persistent memory · robot navigation · embodied AI · finite-state control · training-free navigation

The pith

A semantic executive with three coordinated modules closes the action consistency gap by enforcing persistent commitment to target pursuit in zero-shot object navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper points out that even when zero-shot methods spot a plausible target, agents often switch back and forth between exploring and pursuing or quit near the goal because semantic evidence gets reinterpreted fresh at every step. ConsistNav adds a training-free semantic executive layer on top of existing detectors and planners to stage pursuit in guarded phases, keep stable object hypotheses across frames, and block wasteful actions such as spinning in place. A reader would care because this keeps the agent from abandoning a found object and raises success without any retraining or changes to the underlying perception and planning code. Experiments on HM3D and MP3D show the approach reaches state-of-the-art numbers among compared zero-shot methods and lifts success rate by 11.4 percent and SPL by 7.9 percent over a controlled baseline on MP3D. Ablations and real-robot tests confirm the executive modules are what drive the gains.

Core claim

The paper claims that the action consistency gap (repeated reinterpretation of semantic evidence without persistent commitment across the episode) explains why agents oscillate or abandon targets near success. It further claims that this gap can be closed by a semantic executive composed of three modules: a Finite-State Executive Controller that stages guarded pursuit phases, a Persistent Candidate Memory that accumulates cross-frame target evidence into stable hypotheses, and a Stability-Aware Action Control that suppresses rotational stagnation and unverified stopping. None of this requires modifying the detector or the low-level planner.

What carries the argument

The semantic executive: a training-free coordinator that, through its three modules, decides when semantic evidence should drive navigation and when it should be suppressed or revisited.
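To make the idea of a guarded coordinator concrete, here is a minimal sketch of a finite-state controller that commits to a target and only abandons it after sustained weak evidence. The state names (`SEARCH`, `APPROACH`, `VERIFY`, `DONE`), the thresholds, and the `step` signature are all hypothetical choices made for this illustration. The paper's own controller has seven states (Figure 2), and its guards are not reproduced here.

```python
from enum import Enum, auto


class Mode(Enum):
    SEARCH = auto()    # exploring, no committed target
    APPROACH = auto()  # committed to a candidate and moving toward it
    VERIFY = auto()    # close to the candidate, checking evidence before stopping
    DONE = auto()      # evidence verified; the agent may emit STOP


class GuardedExecutive:
    """Toy guarded state machine (hypothetical states and thresholds)."""

    def __init__(self, commit_score=0.6, drop_score=0.3, verify_dist=1.0, patience=3):
        self.commit_score = commit_score  # score needed to leave SEARCH or pass VERIFY
        self.drop_score = drop_score      # score below which a step counts as weak
        self.verify_dist = verify_dist    # distance (metres) at which VERIFY starts
        self.patience = patience          # consecutive weak steps tolerated
        self.mode = Mode.SEARCH
        self.weak_steps = 0

    def step(self, score, distance):
        """Advance one step given the candidate's current score and distance."""
        if self.mode is Mode.DONE:
            return self.mode
        if self.mode is Mode.SEARCH:
            if score >= self.commit_score:
                self.mode, self.weak_steps = Mode.APPROACH, 0
            return self.mode
        # APPROACH or VERIFY: abandon only after sustained weak evidence.
        self.weak_steps = self.weak_steps + 1 if score < self.drop_score else 0
        if self.weak_steps >= self.patience:
            self.mode, self.weak_steps = Mode.SEARCH, 0
        elif self.mode is Mode.APPROACH and distance <= self.verify_dist:
            self.mode = Mode.VERIFY
        elif self.mode is Mode.VERIFY and score >= self.commit_score:
            self.mode = Mode.DONE
        return self.mode
```

In this sketch a single low-score frame during `APPROACH` does not send the agent back to `SEARCH`; only `patience` consecutive weak frames do, which is one simple way to express persistent commitment.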

If this is right

Agents maintain stable object hypotheses across multiple frames instead of reinterpreting evidence at each step.
Pursuit is staged through guarded semantic phases that prevent premature abandonment of detected targets.
Rotational stagnation and ineffective pursuit actions are suppressed while still allowing verified stopping.
The same detector and planner can be used with higher reliability simply by adding the executive layer.
The method transfers to real-world robot deployments without additional training.
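The first consequence above (stable hypotheses across frames) can be pictured with a toy candidate memory. This is a sketch under stated assumptions, not the paper's data structure: the merge radius, the exponential-moving-average update, the `min_hits` rule, and all class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    x: float
    y: float
    score: float   # running evidence, updated with an exponential moving average
    hits: int = 1  # number of frames that supported this candidate


class CandidateMemory:
    """Toy cross-frame memory; merge radius and smoothing are assumed values."""

    def __init__(self, merge_radius=0.5, alpha=0.3, min_hits=3):
        self.merge_radius = merge_radius
        self.alpha = alpha
        self.min_hits = min_hits
        self.candidates = []

    def observe(self, x, y, confidence):
        """Fold one detection into the first candidate within the merge radius,
        or start a new candidate if none is close enough."""
        for c in self.candidates:
            if (c.x - x) ** 2 + (c.y - y) ** 2 <= self.merge_radius ** 2:
                c.score = (1 - self.alpha) * c.score + self.alpha * confidence
                c.hits += 1
                return c
        c = Candidate(x, y, confidence)
        self.candidates.append(c)
        return c

    def stable(self):
        """Candidates observed often enough to be allowed to influence control."""
        return [c for c in self.candidates if c.hits >= self.min_hits]
```

Because the score is smoothed rather than overwritten, one poor frame lowers a candidate's evidence without erasing it, and a one-off detection does not count as stable until it has been seen `min_hits` times.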

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same executive structure could be tested on other embodied tasks where agents must commit to a detected goal over time, such as object manipulation sequences.
Because the modules act after detection, they might combine with newer open-vocabulary detectors without retraining the consistency logic.
If the gap is truly central, similar executive controls could be added to language-guided exploration methods to reduce backtracking.
The approach leaves open whether the same gains appear when the underlying planner itself is also improved.

Load-bearing premise

The action consistency gap is the dominant failure mode in current zero-shot object navigation and the three executive modules can close it without creating new exploration failures or requiring detector or planner changes.

What would settle it

Compare oscillation frequency and abandonment rate between the baseline and ConsistNav in identical MP3D episodes and check whether the executive modules produce a clear drop in switches between exploration and pursuit while success rate rises.
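A diagnostic of this kind is cheap to compute if each episode logs a per-step mode label and the distance to the goal. The sketch below assumes such a log exists; the label strings `"explore"` and `"pursue"`, the `near` radius, and both function names are hypothetical.

```python
def count_switches(modes, a="explore", b="pursue"):
    """Count transitions in either direction between modes a and b."""
    return sum(
        1
        for prev, cur in zip(modes, modes[1:])
        if {prev, cur} == {a, b}
    )


def abandonment_count(modes, distances, near=1.0, a="explore", b="pursue"):
    """Count pursue -> explore switches made within `near` metres of the goal."""
    return sum(
        1
        for i in range(1, len(modes))
        if modes[i - 1] == b and modes[i] == a and distances[i - 1] <= near
    )


# Made-up trace for illustration only.
trace = ["explore", "explore", "pursue", "explore", "pursue", "pursue"]
dists = [5.0, 4.0, 0.8, 1.5, 1.2, 0.5]
print(count_switches(trace), abandonment_count(trace, dists))
```

Averaging these two counts over identical episodes for the baseline and for ConsistNav would give the before/after comparison described above.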

Figures

Figures reproduced from arXiv: 2605.09869 by Defeng Gu, Haosen Wang, Kai Li, Liaoyuan Fan, Lutao Jiang, Tingbang Liang, Wenjian Hou, Yibin Wen, Yinqiang Zhang, Yizhou Zhao, Zhenyang Li, Zongqi He.

**Figure 1.** ConsistNav pipeline. ① Perception converts RGB-D and target cues through VLM scoring into value maps; ②A/②B planning maintains candidates and selects frontier/candidate subgoals; ③ execution outputs LEFT, FORWARD, RIGHT, and STOP actions through the FSE controller. Thus, C_t stores accumulated evidence, q_t gates planning, and a_t remains in the standard ObjectNav action space.

**Figure 2.** Candidate Memory and FSE Controller. Left: Candidate Memory builds/stores the semantic candidate map. Right: seven-state FSE transitions, with black/green for commitment/success, gray/yellow for invalidation/recovery, and blue for returning to search.

**Figure 3.** Simulation results on HM3Dv2. Qualitative comparison of ConsistNav, VLFM, and ApexNav. Each column shows one episode; green/blue paths denote reference/agent trajectories, and green/black frames denote success/failure.

**Figure 4.** Failure-cause comparison. Outcome statistics for the Non-executive method and ConsistNav on HM3Dv1, HM3Dv2, and MP3D, covering verified success and five residual failure modes.

**Figure 5.** Real-world deployment comparison. Visual comparison of the Non-executive baseline and ConsistNav on four target tasks using the AgileX LIMO platform. The results illustrate that ConsistNav maintains target hypotheses, verifies close-range evidence, and stops reliably under real sensor and timing conditions.

Original abstract

Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
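For readers unfamiliar with the two metrics quoted in the abstract: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) scales each success by the ratio of the shortest-path length to the length of the path the agent actually took. The sketch below computes both under the standard definitions. The function names and episode values are made up for illustration, and every shortest-path length is assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (each entry is 0 or 1)."""
    return sum(successes) / len(successes)


def spl(successes, shortest, taken):
    """Success weighted by Path Length, averaged over all episodes.

    successes: 0/1 success indicator per episode
    shortest:  shortest-path length from start to goal per episode (assumed > 0)
    taken:     length of the path the agent actually travelled per episode
    """
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        total += s * l / max(p, l)
    return total / len(successes)


# Made-up episodes: one efficient success, one failure, one inefficient success.
succ = [1, 0, 1]
short = [10.0, 4.0, 5.0]
took = [10.0, 8.0, 10.0]
print(success_rate(succ), spl(succ, short, took))
```

An agent that reaches the target but wanders on the way scores full credit on SR and only partial credit on SPL, which is why oscillation between exploration and pursuit can hurt SPL even when the episode succeeds.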

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConsistNav adds a practical executive layer that boosts zero-shot navigation performance, though the link to closing the specific action consistency gap could use more direct diagnostics.

ConsistNav layers a semantic executive on top of standard zero-shot object navigation stacks. The core idea is to stop the agent from constantly reinterpreting what it sees and instead commit to a target once evidence builds up. The authors do this with a finite-state controller that moves through guarded phases, a memory that keeps candidate objects alive across frames, and action rules that cut rotational waste and bad stops.

The novelty lies in the coordinated packaging. Most prior work tweaks either the detector or the exploration policy. Here the base pieces stay fixed and the executive just decides when semantics should drive the planner, which keeps the method training-free and easy to drop in. The reported numbers on MP3D show an 11.4% success rate gain and a 7.9% SPL gain over the controlled baseline, with similar patterns on HM3D. Real-world tests add some credibility that the approach survives outside simulation.

The main question is whether the gains truly come from closing the action consistency gap. The memory and stability controls might simply let the agent gather more evidence and move more efficiently, regardless of whether it sticks to one hypothesis. Without counts of how often the agent switches from explore to pursue before and after the modules are added, or how many times it abandons a near-miss target, the attribution stays indirect. The paper does not appear to introduce new failure modes, which is good, but the central claim would be stronger with those episode-level diagnostics.

This paper is aimed at robotics researchers who want practical improvements to zero-shot navigation without retraining models. Anyone working on home robots or warehouse pickers could get value from the executive design and the ablation results. It is solid enough to deserve a serious referee, especially since it includes both simulation benchmarks and real deployment. I would recommend sending it to peer review. The experiments are extensive enough to warrant detailed feedback, though the reviewers will likely push for clearer evidence on the mechanism.

Referee Report

2 major / 2 minor

Summary. The paper identifies an 'action consistency gap' in zero-shot object navigation, where agents repeatedly reinterpret semantic evidence without persistent commitment, leading to oscillation between exploration and pursuit or premature abandonment of targets. It introduces ConsistNav, a training-free framework with a semantic executive consisting of three modules: a Finite-State Executive Controller that stages target pursuit through guarded phases, Persistent Candidate Memory that accumulates cross-frame evidence into stable hypotheses, and Stability-Aware Action Control that suppresses rotational stagnation and unverified stopping. The approach leaves the detector and low-level planner unchanged. Experiments on HM3D and MP3D report state-of-the-art results among zero-shot methods, with 11.4% higher success rate (SR) and 7.9% higher SPL over a controlled baseline on MP3D, plus supporting ablations and real-world deployment.

Significance. If the central claim holds, the work would offer a modular, training-free method to improve consistency in open-vocabulary navigation without retraining core perception or planning components. The explicit separation of executive control from the detector/planner, combined with real-world validation, strengthens potential for broader adoption in embodied AI. The identification of a specific failure mode and the provision of ablations are positive elements.

major comments (2)

§4 (Experiments) and associated tables: The reported 11.4% SR and 7.9% SPL gains on MP3D are presented as evidence that the modules close the action consistency gap, but the manuscript provides no episode-level diagnostics such as counts of explore/pursue switches, abandoned hypotheses, or rotational stagnation events before versus after adding the executive. Without these, it remains possible that gains arise from auxiliary effects of memory accumulation and stability filtering rather than enforced cross-step commitment, weakening attribution to the identified gap.
§3 (Method), description of the three modules: The Finite-State Executive Controller and Persistent Candidate Memory are presented as directly addressing reinterpretation without commitment, yet the design inserts a higher-level policy layer. A direct comparison to simpler non-executive heuristics (e.g., fixed hysteresis thresholds on detection confidence) would be needed to establish that the full three-module coordination is necessary for the observed gains rather than replicable by lighter mechanisms.
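For reference, the lighter mechanism named in the second comment can be written in a few lines. This is a generic two-threshold hysteresis gate on detection confidence, offered as a sketch of what such a comparison baseline might look like; the class name and threshold values are hypothetical and are not taken from the paper.

```python
class HysteresisGate:
    """Two-threshold switch: start pursuing at `high`, stop only below `low`."""

    def __init__(self, high=0.6, low=0.4):
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.high = high
        self.low = low
        self.pursuing = False

    def update(self, confidence):
        """Return True while the agent should pursue the detected target."""
        if self.pursuing:
            if confidence < self.low:
                self.pursuing = False
        elif confidence >= self.high:
            self.pursuing = True
        return self.pursuing


gate = HysteresisGate()
confidences = [0.5, 0.65, 0.5, 0.45, 0.39, 0.5, 0.61]  # made-up values
print([gate.update(c) for c in confidences])
```

The gap between `low` and `high` absorbs small confidence fluctuations, so the gate damps explore/pursue flicker, but it keeps no per-object memory and does nothing about rotation or stopping.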

minor comments (2)

Abstract and §1: The phrase 'guarded semantic phases' is introduced without a concise definition or diagram reference at first mention; a brief inline clarification or pointer to Figure 2 would improve readability.
§4.3 (Ablations): The ablation table would benefit from explicit reporting of standard deviations or confidence intervals across the N runs, consistent with the main result tables.
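The reporting requested in the last comment amounts to a mean and a sample standard deviation per table cell. A minimal sketch using Python's standard `statistics` module follows; the function names and the per-run success rates are made up for illustration.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(runs), statistics.stdev(runs)


def format_cell(runs):
    """Format one table cell as 'mean ± std' with one decimal place."""
    mean, std = summarize(runs)
    return f"{mean:.1f} ± {std:.1f}"


# Made-up success rates (%) from three hypothetical runs of one ablation row.
print(format_cell([50.0, 52.0, 51.0]))
```

With only a handful of runs, a confidence interval would need a t-distribution rather than a normal approximation, so the sample standard deviation is the safer quantity to report.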

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the attribution of our results to the action consistency gap. We respond to each major comment below and indicate where revisions will be made.

Referee: §4 (Experiments) and associated tables: The reported 11.4% SR and 7.9% SPL gains on MP3D are presented as evidence that the modules close the action consistency gap, but the manuscript provides no episode-level diagnostics such as counts of explore/pursue switches, abandoned hypotheses, or rotational stagnation events before versus after adding the executive. Without these, it remains possible that gains arise from auxiliary effects of memory accumulation and stability filtering rather than enforced cross-step commitment, weakening attribution to the identified gap.

Authors: We agree that explicit episode-level diagnostics would strengthen direct attribution to reduced oscillation and premature abandonment. The current ablations isolate module contributions and the overall SR/SPL gains align with fewer consistency failures, but without per-episode switch counts the link remains indirect. In the revised version we will add these diagnostics, reporting average explore/pursue transitions, abandoned hypotheses, and rotational stagnation events for the baseline versus ConsistNav on MP3D. revision: yes

Referee: §3 (Method), description of the three modules: The Finite-State Executive Controller and Persistent Candidate Memory are presented as directly addressing reinterpretation without commitment, yet the design inserts a higher-level policy layer. A direct comparison to simpler non-executive heuristics (e.g., fixed hysteresis thresholds on detection confidence) would be needed to establish that the full three-module coordination is necessary for the observed gains rather than replicable by lighter mechanisms.

Authors: The three modules are coordinated: the finite-state controller stages commitment, memory accumulates evidence across frames, and stability control suppresses ineffective actions. A simple hysteresis threshold on confidence would address only part of the reinterpretation problem and would not stage pursuit phases or suppress rotational stagnation. Our module ablations already show that removing any component degrades performance. Nevertheless, to address the request we will add a controlled comparison against a hysteresis-only variant in the revised experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: additive executive modules on unchanged base components

The paper identifies an action consistency gap as an observed failure mode in existing zero-shot ObjectNav pipelines and introduces three new modules (Finite-State Executive Controller, Persistent Candidate Memory, Stability-Aware Action Control) that act as a training-free semantic executive layer. No equations, fitted parameters, or predictions are defined in terms of themselves; the modules are explicitly additive and leave the detector and low-level planner unchanged. Results (11.4% SR / 7.9% SPL gains on MP3D) are reported as empirical measurements against a controlled baseline rather than derived quantities. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the architecture. The derivation chain is therefore self-contained: problem observation plus modular design plus benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim depends on the premise that the newly introduced executive modules can enforce cross-step consistency using only existing semantic evidence; no numerical free parameters are stated, but three new control entities are postulated without external falsifiable handles beyond the reported experiments.

axioms (1)

Domain assumption: Semantic evidence from open-vocabulary detectors can be staged into guarded phases and accumulated across frames without losing necessary exploration coverage.
Invoked by the Finite-State Executive Controller and Persistent Candidate Memory descriptions.

invented entities (3)

Finite-State Executive Controller (no independent evidence)
purpose: Stages target pursuit through guarded semantic phases
New component introduced to enforce consistency.

Persistent Candidate Memory (no independent evidence)
purpose: Accumulates cross-frame target evidence into stable object hypotheses
New memory structure for stable hypotheses.

Stability-Aware Action Control (no independent evidence)
purpose: Suppresses rotational stagnation, ineffective pursuit, and unverified stopping
New action filter to prevent oscillation and premature stops.
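As an illustration of what an action filter against rotational stagnation might do, the sketch below overrides a turn when the last `window` executed actions were all turns. The action names come from the Figure 1 caption. The window length, the choice of `FORWARD` as the override, and the class itself are hypothetical, and the paper's actual rule may differ.

```python
from collections import deque

TURNS = {"LEFT", "RIGHT"}


class StagnationFilter:
    """Toy filter that breaks up long runs of in-place turning (assumed rule)."""

    def __init__(self, window=6):
        self.window = window
        self.recent = deque(maxlen=window)  # most recent executed actions

    def filter(self, proposed):
        """Return the action to execute, overriding a turn after a full window of turns."""
        stagnating = (
            proposed in TURNS
            and len(self.recent) == self.window
            and all(a in TURNS for a in self.recent)
        )
        action = "FORWARD" if stagnating else proposed
        self.recent.append(action)
        return action


f = StagnationFilter(window=3)
print([f.filter(a) for a in ["LEFT", "RIGHT", "LEFT", "RIGHT", "LEFT"]])
```

A real controller would also need to check that the forced `FORWARD` step is collision-free and to gate `STOP` on verified evidence; both are omitted here to keep the example short.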

pith-pipeline@v0.9.0 · 5805 in / 1494 out tokens · 55667 ms · 2026-05-19T18:02:15.884466+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 2 internal anchors

[1]

Batra, Dhruv and Gokaslan, Aaron and Kembhavi, Aniruddha and Maksymets, Oleksandr and Mottaghi, Roozbeh and Savva, Manolis and Toshev, Alexander and Wijmans, Erik , journal =

work page
[2]

Savva, Manolis and Kadian, Abhishek and Maksymets, Oleksandr and Zhao, Yili and Wijmans, Erik and Jain, Bhavana and Straub, Julian and Liu, Jia and Koltun, Vladlen and Malik, Jitendra and Parikh, Devi and Batra, Dhruv , booktitle =

work page
[3]

and Undersander, Eric and Galuba, Wojciech and Westbury, Andrew and Chang, Angel X

Ramakrishnan, Santhosh Kumar and Gokaslan, Aaron and Wijmans, Erik and Maksymets, Oleksandr and Clegg, Alexander and Turner, John M. and Undersander, Eric and Galuba, Wojciech and Westbury, Andrew and Chang, Angel X. and Savva, Manolis and Zhao, Yili and Batra, Dhruv , booktitle =

work page
[4]

and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda , booktitle =

Chang, Angel X. and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda , booktitle =. Matterport3D: Learning from

work page
[5]

Wijmans, Erik and Kadian, Abhishek and Morcos, Ari and Lee, Stefan and Essa, Irfan and Parikh, Devi and Savva, Manolis and Batra, Dhruv , booktitle =

work page
[6]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Object Goal Navigation using Goal-Oriented Semantic Exploration , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[7]

and Chaplot, Devendra Singh and Al-Halah, Ziad and Malik, Jitendra and Grauman, Kristen , booktitle =

Ramakrishnan, Santhosh K. and Chaplot, Devendra Singh and Al-Halah, Ziad and Malik, Jitendra and Grauman, Kristen , booktitle =

work page
[8]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

work page
[9]

Yadav, Karmesh and Ramrakhya, Ram and Majumdar, Arjun and Yokoyama, Naoki and Baevski, Alexei and Kira, Zsolt and Maksymets, Oleksandr and Batra, Dhruv , journal =

work page
[10]

Simple but Effective:

Khandelwal, Apoorv and Weihs, Luca and Mottaghi, Roozbeh and Kembhavi, Aniruddha , booktitle =. Simple but Effective:

work page
[11]

Majumdar, Arjun and Aggarwal, Gunjan and Devnani, Bhavika and Hoffman, Judy and Batra, Dhruv , booktitle =

work page
[12]

Gadre, Samir Yitzhak and Wortsman, Mitchell and Ilharco, Gabriel and Schmidt, Ludwig and Song, Shuran , booktitle =

work page
[13]

Yokoyama, Naoki and Ha, Sehoon and Batra, Dhruv and Wang, Jiuguang and Bucher, Bernadette , booktitle =

work page
[14]

Yu, Bangguo and Kasaei, Hamidreza and Cao, Ming , booktitle =

work page
[15]

Proceedings of the Conference on Robot Learning (CoRL) , year =

Shah, Dhruv and Osi. Proceedings of the Conference on Robot Learning (CoRL) , year =

work page
[16]

2025 , eprint =

Zhang, Mingjie and Du, Yuheng and Wu, Chengkai and Zhou, Jinni and Qi, Zhenchao and Ma, Jun and Zhou, Boyu , journal =. 2025 , eprint =

work page 2025
[17]

Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA) , year =

A Frontier-Based Approach for Autonomous Exploration , author =. Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA) , year =

work page
[18]

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven , booktitle =

work page
[19]

Grounding

Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Jiang, Qing and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei , booktitle =. Grounding

work page
[20]

Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark , booktitle =

work page
[21]

Faster Segment Anything: Towards Lightweight

Zhang, Chaoning and Han, Dongshen and Qiao, Yu and Kim, Jung Uk and Bae, Sung-Ho and Lee, Seungkyu and Hong, Choong Seon , journal =. Faster Segment Anything: Towards Lightweight

work page
[22]

Automated Planning: Theory and Practice , author =

work page
[23]

and Precup, Doina and Singh, Satinder , journal =

Sutton, Richard S. and Precup, Doina and Singh, Satinder , journal =. Between

work page
[24]

Artificial Intelligence , volume =

Planning and Acting in Partially Observable Stochastic Domains , author =. Artificial Intelligence , volume =

work page
[25]

Proceedings of the International Conference on Machine Learning (ICML) , year =

Learning Transferable Visual Models from Natural Language Supervision , author =. Proceedings of the International Conference on Machine Learning (ICML) , year =

work page
[26]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

Segment Anything , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

work page
[27]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Visual Instruction Tuning , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[28]

arXiv preprint arXiv:2303.08774 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

Emerging Properties in Self-Supervised Vision Transformers , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

work page
[30]

International Conference on Learning Representations (ICLR) , year =

Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation , author =. International Conference on Learning Representations (ICLR) , year =

work page
[31]

Proceedings of the European Conference on Computer Vision (ECCV) , year =

Simple Open-Vocabulary Object Detection with Vision Transformers , author =. Proceedings of the European Conference on Computer Vision (ECCV) , year =

work page
[32]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Grounded Language-Image Pre-Training , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

work page
[33]

Zhou, Kaiwen and Zheng, Kaizhi and Pryor, Connor and Shen, Yilin and Jin, Hongxia and Getoor, Lise and Wang, Xin Eric , booktitle =

work page
[34]

Rajvanshi, Abhinav and Sikka, Karan and Lin, Xiao and Lee, Bhoram and Chiu, Han-Pang and Velasquez, Alvaro , booktitle =

work page
[35]

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill , author =. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

work page
[36]

Kuang, Yuxuan and Lin, Hai and Jiang, Meng , booktitle =

work page
[37]

Long, Yuxing and Cai, Wenzhe and Wang, Hongcheng and Zhan, Guanqi and Dong, Hao , journal =

work page
[38]

Zhang, Lingfeng and Zhang, Qiang and Wang, Hao and Xiao, Erjia and Jiang, Zixuan and Chen, Honglei and Xu, Renjing , booktitle =

work page
[39]

Yin, Hang and Xu, Xiuwei and Wu, Zhenyu and Zhou, Jie and Lu, Jiwen , booktitle =

work page
[40]

Zhang, Jiazhao and Wang, Kunyu and Xu, Rongtao and Zhou, Gengze and Hong, Yicong and Fang, Xiaomeng and Wu, Qi and Zhang, Zhizheng and He, Wang , booktitle =

work page
[41]

Learning to Explore Using Active Neural

Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Saurabh and Gupta, Abhinav and Salakhutdinov, Ruslan , booktitle =. Learning to Explore Using Active Neural

work page
[42]

Ramrakhya, Ram and Batra, Dhruv and Wijmans, Erik and Das, Abhishek , booktitle =

work page
[43]

Deitke, Matt and VanderBilt, Eli and Herrasti, Alvaro and Weihs, Luca and Ehsani, Kiana and Salvador, Jordi and Han, Winson and Kolve, Eric and Kembhavi, Aniruddha and Mottaghi, Roozbeh , booktitle =

work page
[44]

Maksymets, Oleksandr and Cartillier, Vincent and Gokaslan, Aaron and Wijmans, Erik and Galuba, Wojciech and Lee, Stefan and Batra, Dhruv , booktitle =

work page
[45]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

Hierarchical Object-to-Zone Graph for Object Navigation , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

work page
[46]

Hong, Yicong and Wu, Qi and Qi, Yuankai and Rodriguez-Opazo, Cristian and Gould, Stephen , booktitle =

work page
[47]

An, Dong and Wang, Hanqing and Wang, Wenguan and Wang, Zun and Huang, Yan and He, Keji and Wang, Liang , journal =

work page
[48]

On Evaluation of Embodied Navigation Agents

On Evaluation of Embodied Navigation Agents , author =. arXiv preprint arXiv:1807.06757 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[49]

A Survey of Embodied

Duan, Jiafei and Yu, Samson and Tan, Hui Li and Zhu, Hongyuan and Tan, Cheston , journal =. A Survey of Embodied

work page
[50]

Rosinol, Antoni and Abate, Marcus and Chang, Yun and Carlone, Luca , booktitle =

work page
[51]

and Leutenegger, Stefan , booktitle =

McCormac, John and Handa, Ankur and Davison, Andrew J. and Leutenegger, Stefan , booktitle =

work page
[52]

Planning Algorithms , author =

work page
[53]

Behavior Trees in Robotics and

Colledanchise, Michele and. Behavior Trees in Robotics and

work page
[54]

IEEE Robotics & Automation Magazine , volume =

The Dynamic Window Approach to Collision Avoidance , author =. IEEE Robotics & Automation Magazine , volume =

work page
[55]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Flamingo: A Visual Language Model for Few-Shot Learning , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[56]

Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and Lynch, Corey and Chowdhery, Aakanksha and Ichter, Brian and Wahid, Ayzaan and Tompson, Jonathan and Vuong, Quan and Yu, Tianhe and Huang, Wenlong and Chebotar, Yevgen and Sermanet, Pierre and Duckworth, Daniel and Levine, Sergey and Vanhoucke, Vincent and Hausman, Karol and Tober, Marc and Zeng, Andy...

work page
[57]

Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed , journal =

work page
[58]

Science Robotics , year =

Navigating to Objects in the Real World , author =. Science Robotics , year =

work page
[59]

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

Visual Language Maps for Robot Navigation , author =. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

work page
[60]

Shah, Dhruv and Eysenbach, Benjamin and Kahn, Gregory and Levine, Sergey , booktitle =

work page
[61]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Think Before You Act: Decision Transformers with Working Memory , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[62]

Brohan, Anthony and Brown, Noah and Carbajal, Justice and Chebotar, Yevgen and Chen, Xi and Choromanski, Krzysztof and Ding, Tianli and Driess, Danny and Dubey, Avinava and Finn, Chelsea and others , journal =

work page
[63]

Ahn, Michael and Brohan, Anthony and Brown, Noah and Chebotar, Yevgen and Cortes, Omar and David, Byron and Finn, Chelsea and Fu, Chuyuan and Gober, Keerthana and Gopalakrishnan, Karol and others , booktitle =. Do As

work page
[64]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[65]

International Conference on Learning Representations (ICLR) , year =

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author =. International Conference on Learning Representations (ICLR) , year =

work page
[66]

Mur-Artal, Raul and Montiel, J. M. M. and Tard. IEEE Transactions on Robotics , volume =

work page
[67]

IEEE Transactions on Robotics , volume =

Campos, Carlos and Elvira, Richard and Rodr. IEEE Transactions on Robotics , volume =

work page
[68]

, booktitle =

Quigley, Morgan and Conley, Ken and Gerkey, Brian and Faust, Josh and Foote, Tully and Leibs, Jeremy and Wheeler, Rob and Ng, Andrew Y. , booktitle =

work page
[69]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Habitat 2.0: Training Home Assistants to Rearrange their Habitat , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
