IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Bangya Liu; Hujun Yin; Lin Qian; Shijie Li; Sihao Lin; Xuan Zhang; Yanran Li

arxiv: 2605.23187 · v1 · pith:GPIJLP6Knew · submitted 2026-05-22 · 💻 cs.CV · cs.RO

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3

keywords: intent-driven navigation · object navigation · embodied AI · visual-language models · benchmark · implicit instructions · Isaac Sim scenes · terminal success

The pith

Implicit human instructions create a bottleneck for VLMs in embodied navigation, with only 24.9% terminal success and 5.5% grounded success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IntentionNav as a benchmark that gives embodied agents free-text intents such as needing to warm food instead of naming an object category. It supplies 500 intents across 176 scenes and 64 categories, each rewritten in four instruction styles and labeled with one of four intent modes to isolate phrasing effects from semantic cue type. Fixed active-navigation agents powered by three VLMs identify the intended target in 48.3% of episodes and enter its 2 m neighborhood in 68.7%, yet terminate successfully in only 24.9% and reach grounded 1 m success in 5.5%. Performance is highest for event-script intents and lower for physical-state and affordance intents. The paired design lets the benchmark measure target inference, language robustness, neighborhood reachability, and terminal success as separate stages rather than reporting only end-to-end success.
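The paired design described above can be pictured as one record per (intent, style) pair that shares a scene, a mode label, and a hidden target. The following is a minimal sketch, not the benchmark's actual schema: the `Episode` fields, the `expand_styles` helper, the scene name, and every rewrite except the quoted "warm this food" request are assumptions made for illustration. The style and mode names follow the labels that appear in the figure captions.

```python
from dataclasses import dataclass

# Label names follow the review text; the schema itself is hypothetical.
STYLES = ("formal", "natural", "casual", "emotional")
MODES = ("event-script", "inner-state", "physical-state", "affordance")


@dataclass(frozen=True)
class Episode:
    intent_id: int
    scene_id: str
    style: str            # how the intent is phrased
    mode: str             # what kind of cue grounds the target
    instruction: str      # free-text intent shown to the agent
    target_category: str  # withheld from the agent, used only for scoring


def expand_styles(intent_id, scene_id, mode, target_category, rewrites):
    """Build one episode per instruction style for a single intent."""
    if set(rewrites) != set(STYLES):
        raise ValueError("expected exactly one rewrite per style")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return [
        Episode(intent_id, scene_id, style, mode, rewrites[style], target_category)
        for style in STYLES
    ]


# Illustrative values only; the mode label and three of the rewrites are invented.
episodes = expand_styles(
    intent_id=0,
    scene_id="scene_000",
    mode="affordance",
    target_category="microwave",
    rewrites={
        "formal": "Please locate an appliance suitable for heating this meal.",
        "natural": "I need something to warm this food.",
        "casual": "Need to heat this up.",
        "emotional": "I'm starving and this food is cold, help!",
    },
)
```

Because the four episodes differ only in `instruction`, any difference in outcome between them can be attributed to phrasing rather than to geometry or target choice.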

Core claim

IntentionNav demonstrates that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search. The evaluated VLMs achieve 48.3% target identification, 68.7% neighborhood entry, 24.9% terminal success, and 5.5% grounded 1 m success, and performance varies by intent mode, peaking on event-script intents.

What carries the argument

The IntentionNav benchmark of 500 intents over 176 Isaac Sim scenes, each paired with four controlled instruction styles and four intent-mode annotations that separate surface phrasing from semantic cue type under matched geometry.

If this is right

Success rates reach 28.7% for event-script intents but drop to 19.2% for physical-state intents and 18.5% for affordance intents.
Agents reach the 2 m neighborhood of the target in 68.7% of episodes yet terminate successfully in only 24.9%.
The benchmark isolates target inference from language robustness and from terminal localization rather than reporting only aggregate success.
Withholding the target object name forces models to perform inference from implicit instructions alone.
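The stage-wise reporting listed above amounts to computing several rates over the same episodes instead of a single success number. A minimal sketch follows, assuming each episode is logged as a dict of boolean outcomes plus a mode label; the key names and the toy records are hypothetical, and the paper's exact metric definitions are not reproduced here.

```python
from collections import defaultdict

STAGES = ("target_identified", "entered_2m", "terminal_success", "grounded_1m")


def stage_rates(records):
    """Fraction of episodes that pass each stage, computed independently."""
    records = list(records)
    if not records:
        raise ValueError("no episodes to score")
    n = len(records)
    return {stage: sum(bool(r[stage]) for r in records) / n for stage in STAGES}


def success_by_mode(records):
    """Terminal success rate grouped by intent mode."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["mode"]].append(bool(r["terminal_success"]))
    return {mode: sum(flags) / len(flags) for mode, flags in buckets.items()}


# Toy records, invented for illustration (not benchmark data).
toy = [
    {"mode": "event-script", "target_identified": True, "entered_2m": True,
     "terminal_success": True, "grounded_1m": False},
    {"mode": "event-script", "target_identified": True, "entered_2m": True,
     "terminal_success": False, "grounded_1m": False},
    {"mode": "affordance", "target_identified": False, "entered_2m": True,
     "terminal_success": False, "grounded_1m": False},
    {"mode": "affordance", "target_identified": False, "entered_2m": False,
     "terminal_success": False, "grounded_1m": False},
]
rates = stage_rates(toy)
per_mode = success_by_mode(toy)
```

Each rate is computed on its own, so the sketch makes no assumption about whether the stages are nested (for example, whether every terminal success also counts as a neighborhood entry).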

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap between neighborhood entry and successful termination suggests that future work could add explicit visual-verification modules before stopping.
Extending the benchmark to include multi-turn clarification dialogs could test whether agents can recover from initial mis-inference of intent.
The controlled style and mode pairings make it possible to measure whether gains come from better language robustness or from better commonsense mapping of needs to objects.
Low grounded success indicates that current termination heuristics may need to incorporate explicit checks that the observed object satisfies the stated intent.
The results imply that progress on intent-driven navigation may require advances in both language-to-object mapping and in-scene verification rather than navigation alone.
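The point about separating language robustness from commonsense mapping can be made concrete with a per-intent consistency measure. The sketch below assumes each logged episode carries an `intent_id`, a `style`, and a boolean `terminal_success`; these field names and the toy log are hypothetical, and this measure is one possible choice rather than a metric taken from the paper.

```python
from collections import defaultdict


def style_consistency(records):
    """Fraction of intents whose terminal outcome is identical across all logged styles."""
    by_intent = defaultdict(dict)
    for r in records:
        by_intent[r["intent_id"]][r["style"]] = bool(r["terminal_success"])
    if not by_intent:
        raise ValueError("no episodes to score")
    consistent = sum(len(set(outcomes.values())) == 1 for outcomes in by_intent.values())
    return consistent / len(by_intent)


def success_by_style(records):
    """Terminal success rate for each instruction style."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["style"]].append(bool(r["terminal_success"]))
    return {style: sum(flags) / len(flags) for style, flags in buckets.items()}


# Toy log, invented for illustration: intent 0 is stable, intent 1 flips with phrasing.
toy_log = [
    {"intent_id": 0, "style": "formal", "terminal_success": True},
    {"intent_id": 0, "style": "casual", "terminal_success": True},
    {"intent_id": 1, "style": "formal", "terminal_success": True},
    {"intent_id": 1, "style": "casual", "terminal_success": False},
]
consistency = style_consistency(toy_log)
by_style = success_by_style(toy_log)
```

A low consistency score with similar per-style averages would point to phrasing sensitivity at the level of individual intents, which aggregate success rates alone would hide.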

Load-bearing premise

The 500 intents, four controlled instruction styles, and four intent-mode annotations accurately represent the distribution and difficulty of real-world implicit human instructions that embodied agents would encounter.

What would settle it

An agent or model that achieves more than 30% grounded 1 m success across the full set of 500 IntentionNav episodes while keeping the same fixed navigation policy would falsify the bottleneck claim.

Figures

Figures reproduced from arXiv: 2605.23187 by Bangya Liu, Hujun Yin, Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Yanran Li.

**Figure 1.** 3.2 Diagnostic Language Axes. IntentionNav annotates each intent along two orthogonal axes: instruction style and intent mode. Instruction style captures how the same intent is expressed across four register variants: formal instructions are explicit and structured; natural instructions reflect everyday conversational phrasing; casual instructions are brief and may omit contextual detail; emotional instruct…

**Figure 1.** Task overview. IntentionNav evaluates intent-driven object navigation: the agent receives …

**Figure 2.** Candidate acquisition before benchmark filtering. Left: a rendered scene with 25 scene …

**Figure 3.** The 4×4 diagnostic language design separates how an intent is phrased from what kind of cue grounds the target. The figure summarizes the 500-intent split composition, representative language cues, and target-group distribution: event-script 202, inner-state 167, physical-state 72, affordance 59; appliances 133, small objects 115, large furniture 113, lighting/decor 81, and bathroom objects 58 …

**Figure 4.** Reference navigation pipeline used for all evaluated VLMs. The agent repeatedly combines …

**Figure 5.** Expanded conceptual illustration of IntentionNav. An implicit human request induces …

**Figure 6.** Radar diagnostic profile of the three evaluated models. Axes are normalized to …
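The split counts quoted in the Figure 3 caption can be checked with simple arithmetic: both the intent-mode counts and the target-group counts should sum to the 500 intents. The sketch below only restates numbers from that caption; the variable and function names are illustrative.

```python
# Counts copied from the Figure 3 caption.
mode_counts = {
    "event-script": 202,
    "inner-state": 167,
    "physical-state": 72,
    "affordance": 59,
}
group_counts = {
    "appliances": 133,
    "small objects": 115,
    "large furniture": 113,
    "lighting/decor": 81,
    "bathroom objects": 58,
}


def shares(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


mode_total = sum(mode_counts.values())    # 500
group_total = sum(group_counts.values())  # 500
mode_shares = shares(mode_counts)
# mode_shares: event-script 40.4, inner-state 33.4, physical-state 14.4, affordance 11.8
```

The modes are unevenly represented: event-script and inner-state intents together make up about 74% of the split, which is worth keeping in mind when comparing per-mode success rates.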

Original abstract

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IntentionNav gives a clean benchmark for implicit intent in object nav with useful mode and style breakdowns, but the low VLM numbers rest on intents whose real-world match is not yet shown.

The paper's core contribution is IntentionNav: 500 intents across 176 scenes and 64 categories, each rewritten in four controlled styles and labeled with one of four intent modes. This setup lets them measure target inference, neighborhood entry, and termination separately instead of just reporting success. The numbers from three VLMs on a fixed agent are concrete: 48.3% target identification, 68.7% neighborhood entry, 24.9% terminal success, and 5.5% grounded 1 m success, with event-script intents doing better than physical-state or affordance ones. That separation is the part that actually adds diagnostic value over standard category-based object nav benchmarks. The paired design is a straightforward way to test language robustness under matched geometry.

The main soft spot is exactly the one flagged in the stress-test note. The abstract describes controlled construction but gives no information on how the intents were elicited, whether from real humans or templates, inter-annotator agreement on the mode labels, or any check against naturalistic instruction data. Without that, the claim that indirect intent is a general bottleneck could be tied to this particular sample rather than the broader problem. The thresholds and episode sampling also need the methods section to verify they were not tuned after seeing results.

This is for embodied AI groups working on human-facing robotics and VLM agents. Readers who want a structured way to diagnose where intent handling fails will get something usable from the breakdown. It deserves peer review because the benchmark structure and the empirical splits are new and falsifiable, even if the generalization step needs more grounding on the intent data itself.

Referee Report

2 major / 2 minor

Summary. The paper introduces IntentionNav, a diagnostic benchmark for intent-driven object navigation from implicit human instructions. It consists of 500 free-text intents spanning 176 Isaac Sim scenes and 64 target categories; each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes. Using a fixed active-navigation agent, three VLMs are evaluated on target identification (48.3%), 2 m neighborhood entry (68.7%), terminal success (24.9%), and grounded 1 m success (5.5%), with higher performance on event-script intents than physical-state or affordance intents. The central claim is that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization.

Significance. If the benchmark construction is shown to be representative, the controlled multi-style and multi-mode design would provide a useful diagnostic tool for isolating failures in embodied search under indirect instructions. The explicit separation of phrasing from semantic cue type and the reporting of intermediate metrics (target ID, neighborhood entry, terminal success) are strengths that go beyond aggregate success rates.

major comments (2)

[Abstract / §3 (Benchmark Construction)] Abstract and benchmark-construction section: the generalization that indirect human intent is a general bottleneck for embodied agents rests on the assumption that the 500 intents and four-mode annotations reflect the distribution and difficulty of real-world implicit instructions, yet no details are supplied on intent sourcing, human elicitation protocol, inter-annotator reliability for mode labels, or calibration against any naturalistic instruction corpus.
[Abstract / §4 (Experiments)] Abstract and evaluation section: the headline metrics use 2 m neighborhood and 1 m grounded-success thresholds (yielding 68.7% and 5.5%), but the manuscript gives no indication whether these distances were pre-specified in the evaluation protocol or selected after inspecting the data; post-hoc threshold choice would weaken the claim that terminal localization is the dominant failure mode.
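One way to address the threshold concern in the second major comment is to report success over a sweep of distances, so readers can see how sensitive the conclusion is to the 1 m and 2 m cut-offs. The sketch below is an illustration under assumptions: it treats success as "the agent issued a stop within the threshold distance of the target", which may differ from the paper's definitions, and the function name, threshold grid, and toy inputs are invented.

```python
def success_at_thresholds(final_distances_m, stopped, thresholds_m=(0.5, 1.0, 1.5, 2.0)):
    """Fraction of episodes where the agent stopped within each distance of the target.

    final_distances_m: distance from the agent to the target at episode end, in metres.
    stopped: whether the agent issued a stop action (as opposed to timing out).
    """
    if len(final_distances_m) != len(stopped):
        raise ValueError("inputs must have the same length")
    if not final_distances_m:
        raise ValueError("no episodes to score")
    n = len(final_distances_m)
    return {
        t: sum(1 for d, s in zip(final_distances_m, stopped) if s and d <= t) / n
        for t in thresholds_m
    }


# Toy inputs, invented for illustration (not benchmark data).
curve = success_at_thresholds(
    final_distances_m=[0.4, 0.9, 1.8, 3.0],
    stopped=[True, True, True, False],
)
```

If the ranking of models or intent modes stays the same across the sweep, the choice of threshold matters less for the paper's argument, whether or not it was pre-specified.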

minor comments (2)

[Abstract] The abstract refers to “three VLMs” without naming the models or providing implementation details (prompt templates, action space, termination criteria), which limits immediate reproducibility of the reported percentages.
[§4 (Experiments)] The paired design (four styles × four modes under matched geometry) is described as supporting fine-grained analysis, but the results section does not report per-mode breakdowns beyond the three aggregate intent-type success rates; adding these tables would strengthen the diagnostic value.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We provide detailed responses to each major comment below and indicate the revisions we will make.

Referee: [Abstract / §3 (Benchmark Construction)] Abstract and benchmark-construction section: the generalization that indirect human intent is a general bottleneck for embodied agents rests on the assumption that the 500 intents and four-mode annotations reflect the distribution and difficulty of real-world implicit instructions, yet no details are supplied on intent sourcing, human elicitation protocol, inter-annotator reliability for mode labels, or calibration against any naturalistic instruction corpus.

Authors: We acknowledge that additional details on the construction of the benchmark would be beneficial. In the revised manuscript, we will expand the benchmark construction section to describe the intent sourcing process and the human elicitation protocol used to generate the 500 intents. We will also include inter-annotator reliability statistics for the mode annotations. Calibration against a naturalistic instruction corpus was not conducted in this work; we will explicitly note this as a limitation in the revised version.
Revision: partial

Referee: [Abstract / §4 (Experiments)] Abstract and evaluation section: the headline metrics use 2 m neighborhood and 1 m grounded-success thresholds (yielding 68.7% and 5.5%), but the manuscript gives no indication whether these distances were pre-specified in the evaluation protocol or selected after inspecting the data; post-hoc threshold choice would weaken the claim that terminal localization is the dominant failure mode.

Authors: The 2 m and 1 m thresholds were pre-specified in the evaluation protocol, drawing from common practices in the object navigation literature for defining neighborhood reachability and grounded success. We will revise §4 to explicitly state that these thresholds were determined prior to conducting the experiments, thereby clarifying that the choice was not post-hoc.
Revision: yes
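The first response promises inter-annotator reliability statistics for the mode labels without naming a statistic. Cohen's kappa for two annotators is one common choice, sketched below as an assumption rather than as what the authors will report; the function name and the toy label lists are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration (not benchmark data).
annotator_a = ["event-script", "event-script", "affordance", "affordance"]
annotator_b = ["event-script", "affordance", "affordance", "affordance"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In the toy case the annotators agree on 3 of 4 items, chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.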

Circularity Check

0 steps flagged

No circularity; empirical benchmark with externally measured results

The paper introduces IntentionNav as an empirical benchmark consisting of 500 human-annotated intents across 176 scenes and 64 categories, with controlled rewrites and mode labels. VLMs are evaluated on externally defined metrics (target identification rate, neighborhood entry, terminal success, grounded 1 m success) computed from scene geometry, RGB-D observations, and the provided annotations. No equations, fitted parameters, predictions, or derivations appear; the central claim that indirect intent is a bottleneck follows directly from the measured performance numbers rather than reducing to a self-definition, fitted-input prediction, or self-citation chain. The construction and evaluation are self-contained against the stated scenes and annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard embodied-AI evaluation practices (RGB-D + pose input, fixed navigation controller) rather than new axioms or invented entities; no free parameters or postulated objects are introduced.

axioms (1)

Domain assumption: The provided RGB-D observations and pose are sufficient for an agent to perform active object search in the simulated scenes.
Implicit in the benchmark design that supplies only these inputs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes... Models identify the intended target in 48.3% of episodes and enter its 2 m neighborhood in 68.7%, but terminate successfully in only 24.9%...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Intent mode captures the semantic anchor... event-script, inner-state, physical-state, affordance cues...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

[1]

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018

work page 2018
[3]

Personalized instance-based navigation toward user-specific objects in realistic environments

Luca Barsellotti, Roberto Bigazzi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Personalized instance-based navigation toward user-specific objects in realistic environments. InAdvances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

work page 2024
[4]

Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

work page arXiv 2006
[5]

Matterport3d: Learning from rgb-d data in indoor environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. InInternational Conference on 3D Vision, 2017

work page 2017
[6]

Object goal navigation using goal-oriented semantic exploration

Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Ruslan Salakhutdi- nov. Object goal navigation using goal-oriented semantic exploration. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[7]

Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K. Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[8]

Think global, act local: Dual-scale graph transformer for vision-and-language navigation

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[9]

Embodiedeval: Evaluate multimodal llms as embodied agents.arXiv preprint arXiv:2501.11858, 2025

Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, and Maosong Sun. Embodiedeval: Evaluate multimodal llms as embodied agents.arXiv preprint arXiv:2501.11858, 2025

work page arXiv 2025
[10]

Uln: Towards underspecified vision-and-language navigation

Weixi Feng, Tsu-Jui Fu, Yujie Lu, and William Yang Wang. Uln: Towards underspecified vision-and-language navigation. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

work page 2022
[11]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[12]

Cogddn: A cognitive demand-driven navigation with decision optimization and dual-process thinking

Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, and Jiajun Lv. Cogddn: A cognitive demand-driven navigation with decision optimization and dual-process thinking. InProceedings of the 33rd ACM International Conference on Multimedia, pages 5237–5246, 2025

work page 2025
[13]

Goat-bench: A benchmark for multi-modal lifelong navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi-modal lifelong navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[14]

Openfmnav: Towards open-set zero-shot object navi- gation via vision-language foundation models

Yuxuan Kuang, Hai Lin, and Meng Jiang. Openfmnav: Towards open-set zero-shot object navi- gation via vision-language foundation models. InFindings of the Association for Computational Linguistics: NAACL, 2024

work page 2024
[15]

Adversarial reinforced instruction attacker for robust vision-language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

Bingqian Lin, Yi Zhu, Yanxin Long, Xiaodan Liang, Qixiang Ye, and Liang Lin. Adversarial reinforced instruction attacker for robust vision-language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 10

work page 2021
[16]

Vlnverse: A benchmark for vision-language navigation with versatile, embodied, realistic simulation and evaluation.arXiv preprint arXiv:2512.19021, 2025

Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, et al. Vlnverse: A benchmark for vision-language navigation with versatile, embodied, realistic simulation and evaluation.arXiv preprint arXiv:2512.19021, 2025

work page arXiv 2025
[17]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InProceedings of the European Conference on Computer Vision, pages 38–55, 2024

work page 2024
[18]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. InProceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 2049–2060. PMLR, 2025

work page 2049
[19]

Zson: Zero-shot object-goal navigation using multimodal goal embeddings

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[20]

Reverie: Remote embodied visual referring expression in real indoor environments

Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

work page 2020
[21]

Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman

Santhosh K. Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[22]

Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M

Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M. Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. InAdvances in Neural Information Processing Syst...

work page 2021
[23]

Habitat-web: Learning embodied object-search strategies from human demonstrations at scale

Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-web: Learning embodied object-search strategies from human demonstrations at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[24]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InProceedings of the IEEE International Conference on Computer Vision, 2019

work page 2019
[25]

Capnav: Benchmarking vision language models on capability-conditioned indoor navigation

Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, and Jon Froehlich. Capnav: Benchmarking vision language models on capability-conditioned indoor navigation. arXiv preprint arXiv:2602.18424, 2026

work page arXiv 2026
[26]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training ...

work page 2021
[27]

Find what you want: Learning demand-conditioned object attribute space for demand-driven navigation

Hongcheng Wang, Andy Guan Hong Chen, Xiaoqi Li, Mingdong Wu, and Hao Dong. Find what you want: Learning demand-conditioned object attribute space for demand-driven navigation. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[28]

Mo-ddn: A coarse-to-fine attribute-based exploration agent for multi-object demand-driven navigation

Hongcheng Wang, Peiqi Liu, Wenzhe Cai, Mingdong Wu, Zhengyu Qian, and Hao Dong. Mo-ddn: A coarse-to-fine attribute-based exploration agent for multi-object demand-driven navigation. InAdvances in Neural Information Processing Systems, 2024. 11

work page 2024
[29]

User-centric object navigation: A benchmark with integrated user habits for personalized embodied object search.arXiv preprint arXiv:2602.06459, 2026

Hongcheng Wang, Jinyu Zhu, and Hao Dong. User-centric object navigation: A benchmark with integrated user habits for personalized embodied object search.arXiv preprint arXiv:2602.06459, 2026

work page arXiv 2026
[30]

Beyond literal descriptions: Understanding and locating open-world objects aligned with human intentions

Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, and Jing Liu. Beyond literal descriptions: Understanding and locating open-world objects aligned with human intentions. InFindings of the Association for Computational Linguistics, 2024

work page 2024
[31]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[32]

Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018

work page 2018
[33]

Behavioral analysis of vision-and-language navigation agents

Zijiao Yang, Arjun Majumdar, and Stefan Lee. Behavioral analysis of vision-and-language navigation agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[34]

Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, and Jiwen Lu. Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[35]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. InIEEE International Confer- ence on Robotics and Automation, 2024

work page 2024
[36]

Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

[37]

Yue Zhang, Tianyi Ma, Zun Wang, Yanyuan Qiao, and Parisa Kordjamshidi. Vision-and-language navigation with analogical textual descriptions in llms. In Proceedings of EMNLP, 2025

[38]

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[39]

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, 2024

[40]

Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

[41]

Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. Esc: Exploration with soft commonsense constraints for zero-shot object navigation. In Proceedings of the International Conference on Machine Learning, 2023

[42]

Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[43]

Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Eric Xin Wang, Qi Wu, Miguel Eckstein, and William Yang Wang. Diagnosing vision-and-language navigation: What really matters. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, 2022

[44]

Filippo Ziliotto, Jelin Raphael Akkara, Alessandro Daniele, Lamberto Ballan, Luciano Serafini, and Tommaso Campari. Personal: Towards a comprehensive benchmark for personalized embodied agents. arXiv preprint arXiv:2509.19843, 2025

A Additional benchmark statistics

Table 6 lists the five target groups and representative categories in the 500-intent spl...

target_grounding (0-10)
Given ONLY the attached photo + the target category word + these 4 English variants, could a reader uniquely pick out the target object?
10 = clearly and uniquely points to target; >=2 variants carry an affordance hint distinguishing this target from competitors.
7 = target can be picked out, but one or two variants are generic.
4 ...

style_distinguishability (0-10)
Are the four variants clearly different in register/tone/length?
formal_en: polished, friendly-formal.
natural_en: everyday spoken English, contractions.
casual_en: brief (4-9 words), contractions.
emotional_en: emotional/atmospheric, still conversational.
10 = all four clearly distinct.
7 = three distinct, one close to ano...
