ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control
Pith reviewed 2026-05-19 18:02 UTC · model grok-4.3
The pith
A semantic executive with three coordinated modules closes the action consistency gap by enforcing persistent commitment to target pursuit in zero-shot object navigation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the action consistency gap (repeated reinterpretation of semantic evidence without persistent commitment across the episode) explains why agents oscillate or abandon targets near success. It further claims that this gap can be closed by a semantic executive composed of three modules: a Finite-State Executive Controller that stages guarded pursuit phases, a Persistent Candidate Memory that accumulates cross-frame target evidence into stable hypotheses, and Stability-Aware Action Control that suppresses rotational stagnation and unverified stopping. None of this requires modifying the detector or the low-level planner.
What carries the argument
The semantic executive: a training-free coordinator whose three modules decide when semantic evidence should drive navigation and when it should be suppressed or revisited.
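To make the idea of a guarded, staged executive concrete, here is a minimal Python sketch of a finite-state controller. The phase names (EXPLORE, APPROACH, VERIFY, STOP), the thresholds, and the `step` signature are illustrative assumptions; the summary above does not specify the paper's actual states or parameters.

```python
from enum import Enum, auto


class Phase(Enum):
    EXPLORE = auto()   # no committed target; semantic evidence is only monitored
    APPROACH = auto()  # committed to a candidate; evidence steers the planner
    VERIFY = auto()    # close to the candidate; confirm before stopping
    STOP = auto()      # terminal phase


class ExecutiveController:
    """Hypothetical guarded finite-state controller (all thresholds are assumed)."""

    def __init__(self, commit_score=0.6, drop_score=0.2, verify_dist=1.0, patience=5):
        self.phase = Phase.EXPLORE
        self.commit_score = commit_score
        self.drop_score = drop_score
        self.verify_dist = verify_dist
        self.patience = patience
        self._weak_steps = 0

    def step(self, score, distance):
        """Advance one step given a candidate score and its distance in metres."""
        if self.phase is Phase.EXPLORE:
            if score >= self.commit_score:
                self.phase = Phase.APPROACH
                self._weak_steps = 0
        elif self.phase is Phase.APPROACH:
            # Guard: abandon pursuit only after `patience` consecutive weak steps.
            self._weak_steps = self._weak_steps + 1 if score < self.drop_score else 0
            if self._weak_steps >= self.patience:
                self.phase = Phase.EXPLORE
            elif distance <= self.verify_dist:
                self.phase = Phase.VERIFY
        elif self.phase is Phase.VERIFY:
            if score >= self.commit_score:
                self.phase = Phase.STOP
            elif score < self.drop_score:
                self.phase = Phase.EXPLORE
        return self.phase
```

The point of the sketch is the guard in APPROACH: a single weak frame does not end pursuit, which is one plausible reading of "persistent commitment".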
If this is right
- Agents maintain stable object hypotheses across multiple frames instead of reinterpreting evidence at each step.
- Pursuit is staged through guarded semantic phases that prevent premature abandonment of detected targets.
- Rotational stagnation and ineffective pursuit actions are suppressed while still allowing verified stopping.
- The same detector and planner can be used with higher reliability simply by adding the executive layer.
- The method transfers to real-world robot deployments without additional training.
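The first consequence above, stable hypotheses across frames, can be illustrated with a small sketch of a cross-frame candidate memory. Everything here is an assumption made for illustration: the class names, the merge radius, the exponential smoothing factor, and the decay for unseen candidates are not taken from the paper.

```python
from dataclasses import dataclass
import math


@dataclass
class Candidate:
    x: float
    y: float
    score: float  # smoothed detection confidence
    hits: int = 1  # number of frames that supported this candidate


class CandidateMemory:
    """Hypothetical cross-frame memory; radius, alpha, and decay are assumed values."""

    def __init__(self, merge_radius=0.5, alpha=0.3, decay=0.95):
        self.merge_radius = merge_radius
        self.alpha = alpha
        self.decay = decay
        self.candidates = []

    def _nearest(self, x, y):
        """Return the index of the closest candidate within merge_radius, else None."""
        best, best_d = None, self.merge_radius
        for i, c in enumerate(self.candidates):
            d = math.hypot(c.x - x, c.y - y)
            if d <= best_d:
                best, best_d = i, d
        return best

    def update(self, detections):
        """detections: list of (x, y, confidence) in world coordinates for one frame."""
        matched = set()
        for x, y, conf in detections:
            idx = self._nearest(x, y)
            if idx is None:
                self.candidates.append(Candidate(x, y, conf))
                matched.add(len(self.candidates) - 1)
            else:
                c = self.candidates[idx]
                c.x += self.alpha * (x - c.x)
                c.y += self.alpha * (y - c.y)
                c.score += self.alpha * (conf - c.score)
                c.hits += 1
                matched.add(idx)
        for i, c in enumerate(self.candidates):
            if i not in matched:
                c.score *= self.decay  # unseen candidates fade instead of vanishing

    def best(self, min_hits=2):
        """Return the highest-scoring candidate seen in at least min_hits frames."""
        stable = [c for c in self.candidates if c.hits >= min_hits]
        return max(stable, key=lambda c: c.score, default=None)
```

Because a candidate survives frames in which it is not detected, a brief occlusion lowers its score slightly rather than erasing the hypothesis.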
Where Pith is reading between the lines
- The same executive structure could be tested on other embodied tasks where agents must commit to a detected goal over time, such as object manipulation sequences.
- Because the modules act after detection, they might combine with newer open-vocabulary detectors without retraining the consistency logic.
- If the gap is truly central, similar executive controls could be added to language-guided exploration methods to reduce backtracking.
- The approach leaves open whether the same gains appear when the underlying planner itself is also improved.
Load-bearing premise
The action consistency gap is the dominant failure mode in current zero-shot object navigation and the three executive modules can close it without creating new exploration failures or requiring detector or planner changes.
What would settle it
Compare oscillation frequency and abandonment rate between the baseline and ConsistNav in identical MP3D episodes and check whether the executive modules produce a clear drop in switches between exploration and pursuit while success rate rises.
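A diagnostic of this kind could be computed from per-step mode logs. The sketch below assumes a hypothetical log format in which each episode records a list of mode labels ("explore" or "pursue") and a success flag; the paper's actual logging format is not described here.

```python
def count_mode_switches(modes):
    """Count transitions between consecutive steps whose mode labels differ."""
    return sum(1 for prev, cur in zip(modes, modes[1:]) if prev != cur)


def count_abandonments(modes):
    """Count pursue -> explore transitions, i.e. a pursued hypothesis was dropped."""
    return sum(
        1 for prev, cur in zip(modes, modes[1:])
        if prev == "pursue" and cur == "explore"
    )


def summarize(episodes):
    """episodes: list of dicts with a per-step 'modes' list and a boolean 'success'."""
    n = len(episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "mean_switches": sum(count_mode_switches(e["modes"]) for e in episodes) / n,
        "mean_abandonments": sum(count_abandonments(e["modes"]) for e in episodes) / n,
    }
```

Running `summarize` on the same episode set for the baseline and for ConsistNav would show whether a rise in success rate coincides with a drop in switches and abandonments.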
Original abstract
Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
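For readers unfamiliar with the two metrics quoted in the abstract, the sketch below computes SR (success rate) and SPL (Success weighted by Path Length) using the standard definition SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the path length the agent actually travelled. The assumption is that the paper follows this standard definition; the function and argument names are illustrative.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length; assumes every shortest length is positive."""
    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        if success:
            # A successful episode earns at most 1, less if the path was longer
            # than the shortest path.
            total += shortest / max(travelled, shortest)
    return total / len(successes)
```

For example, with three toy episodes where the agent succeeds on a path twice the optimal length, fails once, and succeeds optimally once, SPL is (0.5 + 0 + 1) / 3 = 0.5 while SR is 2/3.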
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies an 'action consistency gap' in zero-shot object navigation, where agents repeatedly reinterpret semantic evidence without persistent commitment, leading to oscillation between exploration and pursuit or premature abandonment of targets. It introduces ConsistNav, a training-free framework with a semantic executive consisting of three modules: a Finite-State Executive Controller that stages target pursuit through guarded phases, Persistent Candidate Memory that accumulates cross-frame evidence into stable hypotheses, and Stability-Aware Action Control that suppresses rotational stagnation and unverified stopping. The approach leaves the detector and low-level planner unchanged. Experiments on HM3D and MP3D report state-of-the-art results among zero-shot methods, with 11.4% higher success rate (SR) and 7.9% higher SPL over a controlled baseline on MP3D, plus supporting ablations and real-world deployment.
Significance. If the central claim holds, the work would offer a modular, training-free method to improve consistency in open-vocabulary navigation without retraining core perception or planning components. The explicit separation of executive control from the detector/planner, combined with real-world validation, strengthens potential for broader adoption in embodied AI. The identification of a specific failure mode and the provision of ablations are positive elements.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and associated tables: The reported 11.4% SR and 7.9% SPL gains on MP3D are presented as evidence that the modules close the action consistency gap, but the manuscript provides no episode-level diagnostics such as counts of explore/pursue switches, abandoned hypotheses, or rotational stagnation events before versus after adding the executive. Without these, it remains possible that gains arise from auxiliary effects of memory accumulation and stability filtering rather than enforced cross-step commitment, weakening attribution to the identified gap.
- [§3 (Method)] §3 (Method), description of the three modules: The Finite-State Executive Controller and Persistent Candidate Memory are presented as directly addressing reinterpretation without commitment, yet the design inserts a higher-level policy layer. A direct comparison to simpler non-executive heuristics (e.g., fixed hysteresis thresholds on detection confidence) would be needed to establish that the full three-module coordination is necessary for the observed gains rather than replicable by lighter mechanisms.
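The "fixed hysteresis thresholds" baseline mentioned in the second comment could look roughly like the following. This is a sketch of the generic two-threshold technique, not of anything in the paper; the class name and the threshold values are assumptions.

```python
class HysteresisSwitch:
    """Two-threshold switch on raw detection confidence (threshold values are assumed)."""

    def __init__(self, enter=0.6, leave=0.3):
        if leave >= enter:
            raise ValueError("leave threshold must be below enter threshold")
        self.enter = enter
        self.leave = leave
        self.pursuing = False

    def update(self, confidence):
        """Return True while the agent should pursue, False while it should explore."""
        if self.pursuing:
            # Stay in pursuit until confidence falls below the lower threshold.
            if confidence < self.leave:
                self.pursuing = False
        elif confidence >= self.enter:
            self.pursuing = True
        return self.pursuing
```

Such a switch damps flicker around a single threshold, but it keeps no spatial memory and has no notion of staged phases, which is the distinction the comparison would need to test.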
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'guarded semantic phases' is introduced without a concise definition or diagram reference at first mention; a brief inline clarification or pointer to Figure 2 would improve readability.
- [§4.3 (Ablations)] §4.3 (Ablations): The ablation table would benefit from explicit reporting of standard deviations or confidence intervals across the N runs, consistent with the main result tables.
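As an illustration of the reporting requested in the ablation comment, the sketch below summarizes a metric over repeated runs using only the Python standard library. It assumes a normal approximation for the confidence interval; for a small number of runs a Student-t interval would be more appropriate. The function name is hypothetical.

```python
import math
import statistics


def mean_std_ci(values, confidence=0.95):
    """Return (mean, sample std, (ci_low, ci_high)) under a normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)
```

A table row could then report each metric as mean ± std, with the interval given in a footnote or an extra column.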
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the attribution of our results to the action consistency gap. We respond to each major comment below and indicate where revisions will be made.
Point-by-point responses
- Referee: [§4 (Experiments)] §4 (Experiments) and associated tables: The reported 11.4% SR and 7.9% SPL gains on MP3D are presented as evidence that the modules close the action consistency gap, but the manuscript provides no episode-level diagnostics such as counts of explore/pursue switches, abandoned hypotheses, or rotational stagnation events before versus after adding the executive. Without these, it remains possible that gains arise from auxiliary effects of memory accumulation and stability filtering rather than enforced cross-step commitment, weakening attribution to the identified gap.
  Authors: We agree that explicit episode-level diagnostics would strengthen direct attribution to reduced oscillation and premature abandonment. The current ablations isolate module contributions and the overall SR/SPL gains align with fewer consistency failures, but without per-episode switch counts the link remains indirect. In the revised version we will add these diagnostics, reporting average explore/pursue transitions, abandoned hypotheses, and rotational stagnation events for the baseline versus ConsistNav on MP3D.
  Revision: yes
- Referee: [§3 (Method)] §3 (Method), description of the three modules: The Finite-State Executive Controller and Persistent Candidate Memory are presented as directly addressing reinterpretation without commitment, yet the design inserts a higher-level policy layer. A direct comparison to simpler non-executive heuristics (e.g., fixed hysteresis thresholds on detection confidence) would be needed to establish that the full three-module coordination is necessary for the observed gains rather than replicable by lighter mechanisms.
  Authors: The three modules are coordinated: the finite-state controller stages commitment, memory accumulates evidence across frames, and stability control suppresses ineffective actions. A simple hysteresis threshold on confidence would address only part of the reinterpretation problem and would not stage pursuit phases or suppress rotational stagnation. Our module ablations already show that removing any component degrades performance. Nevertheless, to address the request we will add a controlled comparison against a hysteresis-only variant in the revised experiments.
  Revision: yes
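The part of the response concerning rotational stagnation and unverified stopping can be made concrete with a small action filter. This is a sketch under stated assumptions: the action names, the window size, and the fallback action are hypothetical, and a real system would more likely hand control back to the planner than force a forward step.

```python
from collections import deque

TURN_ACTIONS = {"turn_left", "turn_right"}


class StagnationGuard:
    """Hypothetical action filter; window size and fallback action are assumptions."""

    def __init__(self, window=6, fallback="move_forward"):
        self.history = deque(maxlen=window)
        self.fallback = fallback

    def filter(self, action, stop_verified=False):
        """Return the action to execute after applying both suppression rules."""
        if action == "stop" and not stop_verified:
            action = self.fallback  # suppress unverified stopping
        window_full = len(self.history) == self.history.maxlen
        if (
            action in TURN_ACTIONS
            and window_full
            and all(a in TURN_ACTIONS for a in self.history)
        ):
            action = self.fallback  # break a full window of pure rotations
        self.history.append(action)
        return action
```

A plain confidence threshold has no access to the action history, so it cannot express either rule, which is the gap the authors point to.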
Circularity Check
No circularity: additive executive modules on unchanged base components
The paper identifies an action consistency gap as an observed failure mode in existing zero-shot ObjectNav pipelines and introduces three new modules (Finite-State Executive Controller, Persistent Candidate Memory, Stability-Aware Action Control) that act as a training-free semantic executive layer. No equations, fitted parameters, or predictions are defined in terms of themselves; the modules are explicitly additive and leave the detector and low-level planner unchanged. Results (11.4% SR / 7.9% SPL gains on MP3D) are reported as empirical measurements against a controlled baseline rather than derived quantities. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the architecture. The derivation chain is therefore self-contained: problem observation plus modular design plus benchmark evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Semantic evidence from open-vocabulary detectors can be staged into guarded phases and accumulated across frames without losing necessary exploration coverage.
invented entities (3)
- Finite-State Executive Controller: no independent evidence
- Persistent Candidate Memory: no independent evidence
- Stability-Aware Action Control: no independent evidence