ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

2026 · cs.RO · arXiv 2603.26788

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advancements in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. To address these issues, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. The proposed mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action for mapping into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute performance improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
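
The abstract reports SR and SPL without defining them. As a minimal sketch, assuming the definitions commonly used in embodied-navigation benchmarks (SR as the fraction of successful episodes, SPL as success weighted by the ratio of shortest-path length to the path actually travelled), the two metrics could be computed as follows. The `Episode` record and the function names are hypothetical and are not taken from the paper or its evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical record, for illustration only)."""
    success: bool         # whether the agent stopped close enough to the target
    shortest_path: float  # geodesic shortest-path length from start to goal
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of success * shortest / max(travelled, shortest)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        # Failed episodes contribute 0; skip degenerate zero-length episodes.
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(success=True, shortest_path=5.0, agent_path=10.0),
        Episode(success=False, shortest_path=4.0, agent_path=8.0),
        Episode(success=True, shortest_path=3.0, agent_path=3.0),
    ]
    print(f"SR  = {success_rate(demo):.3f}")
    print(f"SPL = {spl(demo):.3f}")
```

Under these definitions SPL can never exceed SR, since each successful episode contributes at most 1, so an SPL gain reflects shorter paths on successful episodes as well as more successes.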

representative citing papers

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

A zero-shot unified agent for VLN-CE, ObjectNav, EQA and Aerial-VLN on wheeled, quadruped, humanoid and UAV platforms that translates language and vision inputs into actions via MLLMs plus TDM and SCB mechanisms, matching trained foundation models on multiple benchmarks.

citing papers explorer

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

cs.RO · 2026-05-26 · unverdicted · none · ref 77 · internal anchor
