{"total":20,"items":[{"citing_arxiv_id":"2605.23187","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction","primary_cat":"cs.CV","submitted_at":"2026-05-22T03:09:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IntentionNav is a new benchmark showing that VLMs infer intended targets from implicit instructions in 48% of cases but achieve only 25% terminal success and 5.5% grounded success in active navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19958","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning","primary_cat":"cs.RO","submitted_at":"2026-05-19T15:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TravExplorer couples zero-shot semantic guidance with traversability-aware 3-D planning to enable cross-floor object navigation in unseen indoor environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19206","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-19T00:15:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CLUE adaptively weights room-type and object-co-location cues from an LLM to construct a unified semantic value map that improves success rate and efficiency in zero-shot object-goal 
navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14504","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution","primary_cat":"cs.AI","submitted_at":"2026-05-14T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09423","ref_index":9,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732. [8] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. CoRR, abs/2006.13171, 2020. URL https://arxiv.org/abs/2006.13171. [9] Aleksey Bokhovkin, Quan Meng, Shubham Tulsiani, and Angela Dai. 
Scenefactor: Factored latent 3d diffusion for controllable 3d scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 628-639, 2025. [10] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al."},{"citing_arxiv_id":"2605.06223","ref_index":14,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:19:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00397","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-01T04:36:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23327","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Efficient Beam Search Algorithm for Active Perception in Mobile 
Robotics","primary_cat":"cs.RO","submitted_at":"2026-04-25T14:35:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Node-wise beam search with expected gain and RRAG graph construction outperforms prior active perception methods by at least 20% on representative tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13633","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation","primary_cat":"cs.CV","submitted_at":"2026-04-15T09:01:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12626","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting","primary_cat":"cs.RO","submitted_at":"2026-04-14T11:52:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"for human-aware navigation scenarios, thus helping train more robust agents. 
embodied agents directly in the physical world is slow, expensive, potentially dangerous, and difficult to reproduce across experiments [23]. The prevailing paradigm in Embodied AI is therefore to train agents at scale in simulation and transfer the learned policies to reality [1,3,30]. This paradigm places two critical demands on the simulator: the visual realism of rendered sensor observations directly governs the effectiveness of Sim-to-Real policy transfer, and the ability to populate scenes with realistic dynamic humans determines whether agents can learn to navigate safely among people. The leading open-source Embodied AI simulators, including Habitat-Sim [21,"},{"citing_arxiv_id":"2604.08509","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visually-grounded Humanoid Agents","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"three state-of-the-art VLN approaches: NaVILA [11], NaVid [114], and Uni-NaVid [113]. To ensure a fair comparison of high-level planning, we intercept their mid-level language commands and convert them into waypoints for our motion generation model, with VLM query frequency kept consistent across all methods. We report the success rate (SR) [3], success weighted by path length (SPL) [6], and collision rate (CR), averaged over three runs. For the multi-goal benchmark, we additionally adopt the progress rate (PR) and progress weighted by path length (PPL) from MultiON [97]. 
(ii) World layer: We compare against three representative semantic GS methods: Feature-3DGS [121], Gradient-Weighted 3DGS [30], and OpenGaussian [103]. Since no ground-truth semantic annotations exist for Small-"},{"citing_arxiv_id":"2604.08232","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation","primary_cat":"cs.AI","submitted_at":"2026-04-09T13:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• How to train the agent capable of hybrid reasoning ability? We curate a training pipeline (Fig. 5) comprising hybrid supervised fine-tuning as a cold start (Sec. 3.3), followed by a two-stage online reinforcement learning with the proposed hybrid reasoning strategy (Sec. 3.4). 3.1. Problem Formulation In this work, we focus on the Object Goal Navigation (ObjectNav) task [2], which requires agents to locate the predefined target object category in novel environments. 
Each task can be formulated as a Partially Observable Markov Decision Process (POMDP), denoted as (S, A, O, T, R), where: S denotes the state space; A is the action space in textual form, O represents the observation space, T denotes the state transition function s_{t+1} ∼ T(s_{t+1} | s_t, a_t),"},{"citing_arxiv_id":"2603.26788","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation","primary_cat":"cs.RO","submitted_at":"2026-03-25T09:07:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05377","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenFrontier: General Navigation with Visual-Language Grounded Frontiers","primary_cat":"cs.RO","submitted_at":"2026-03-05T17:02:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OpenFrontier formulates robot navigation as sparse subgoal reaching via visual-language-grounded frontiers, achieving zero-shot performance without fine-tuning or dense semantic maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.05467","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal 
Navigation","primary_cat":"cs.CV","submitted_at":"2026-02-05T09:15:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.23230","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action-guided generation of 3D functionality segmentation data","primary_cat":"cs.CV","submitted_at":"2025-11-28T14:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SynthFun3D generates synthetic 3D functionality segmentation data from action descriptions via object retrieval and scene arrangement, yielding consistent gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU when augmenting real data for VLM training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.20685","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"C-NAV: Towards Self-Evolving Continual Object Navigation in Open World","primary_cat":"cs.RO","submitted_at":"2025-10-23T15:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark while using less 
memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.09905","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Personalized Embodied Navigation for Portable Object Finding","primary_cat":"cs.RO","submitted_at":"2024-03-14T22:33:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transit-Aware Planning (TAP) enriches navigation policies with object transit data on Dynamic Object Maps, raising success rates by 21.1% in MP3D simulation and 18.3% in real-world tests for finding non-stationary targets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.03568","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent AI: Surveying the Horizons of Multimodal Interaction","primary_cat":"cs.AI","submitted_at":"2024-01-07T19:11:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2109.08238","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI","primary_cat":"cs.CV","submitted_at":"2021-09-16T22:01:24+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal 
navigation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Occupancy anticipation for efficient exploration and navigation. In European Conference on Computer Vision, pages 400-418. Springer, 2020. [33] Peter Karkus, Shaojun Cai, and David Hsu. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815-2825, 2021. 7 [34] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020. 7 [35] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room:"}],"limit":50,"offset":0}