IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction
Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3
The pith
Implicit human instructions create a bottleneck for VLMs in embodied navigation, with only 24.9% terminal success and 5.5% grounded success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IntentionNav demonstrates that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search. VLMs achieve 48.3% target identification, 68.7% neighborhood entry, 24.9% terminal success, and 5.5% grounded 1 m success, and performance varies by intent mode, peaking at 28.7% for event-script intents.
What carries the argument
The IntentionNav benchmark: 500 intents over 176 Isaac Sim scenes, each rewritten in four controlled instruction styles and annotated with one of four intent modes, which separates surface phrasing from semantic cue type under matched geometry.
If this is right
- Success rates reach 28.7% for event-script intents but drop to 19.2% for physical-state intents and 18.5% for affordance intents.
- Agents reach the 2 m neighborhood of the target in 68.7% of episodes yet terminate successfully in only 24.9%.
- The benchmark isolates target inference from language robustness and from terminal localization rather than reporting only aggregate success.
- Withholding the target object name forces models to perform inference from implicit instructions alone.
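The gap between neighborhood entry and terminal success can be made concrete with a few lines of arithmetic. The sketch below uses only the aggregate rates quoted in the abstract; the function names are illustrative, and the conversion figure rests on an assumption the source does not state, namely that every successfully terminated episode also entered the 2 m neighborhood.

```python
# Aggregate rates quoted in the abstract, as percentages of episodes.
REPORTED = {
    "neighborhood_2m": 68.7,
    "terminal_success": 24.9,
    "grounded_1m": 5.5,
}


def stage_gap(earlier: str, later: str, rates: dict = REPORTED) -> float:
    """Percentage points lost between two reported stages."""
    return round(rates[earlier] - rates[later], 1)


def stage_conversion(earlier: str, later: str, rates: dict = REPORTED) -> float:
    """Fraction of `earlier` episodes that also reach `later`.

    ASSUMPTION (not stated in the source): every episode counted in
    `later` is also counted in `earlier`. Without that nesting the
    ratio is only a rough indicator, not a conditional probability.
    """
    return rates[later] / rates[earlier]


if __name__ == "__main__":
    print(stage_gap("neighborhood_2m", "terminal_success"))
    print(f"{stage_conversion('neighborhood_2m', 'terminal_success'):.1%}")
```

Under the nesting assumption, roughly 36% of episodes that reach the neighborhood end in a successful stop, which is the drop the second bullet points at.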
Where Pith is reading between the lines
- The gap between neighborhood entry and successful termination suggests that future work could add explicit visual-verification modules before stopping.
- Extending the benchmark to include multi-turn clarification dialogs could test whether agents can recover from initial mis-inference of intent.
- The controlled style and mode pairings make it possible to measure whether gains come from better language robustness or from better commonsense mapping of needs to objects.
- Low grounded success indicates that current termination heuristics may need to incorporate explicit checks that the observed object satisfies the stated intent.
- The results imply that progress on intent-driven navigation may require advances in both language-to-object mapping and in-scene verification rather than navigation alone.
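The verification idea in the first and fourth points can be sketched as a stop gate that the agent consults before terminating. Everything here is hypothetical: the `Candidate` fields, the `should_stop` function, and the `min_score` default are invented for illustration and are not the paper's agent. Only the 1 m distance mirrors the grounded-success threshold quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A detected object the agent is considering stopping at (hypothetical)."""
    category: str           # category predicted by some detector
    distance_m: float       # current distance from agent to the object
    detector_score: float   # detector confidence in [0, 1]
    satisfies_intent: bool  # result of a separate intent check, e.g. a VLM query


def should_stop(candidate: Candidate,
                inferred_category: str,
                max_distance_m: float = 1.0,
                min_score: float = 0.5) -> bool:
    """Stop only if the object matches the inferred target, is close,
    is confidently detected, and passes the explicit intent check."""
    return (
        candidate.category == inferred_category
        and candidate.distance_m <= max_distance_m
        and candidate.detector_score >= min_score
        and candidate.satisfies_intent
    )
```

Making each condition a separate term keeps the failure attributable: a rejected stop can be logged against target inference, localization, detection, or intent verification.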
Load-bearing premise
The 500 intents, four controlled instruction styles, and four intent modes accurately represent the distribution and difficulty of real-world implicit human instructions that embodied agents would encounter.
What would settle it
An agent or model that achieves more than 30% grounded 1 m success across the full set of 500 IntentionNav episodes while keeping the same fixed navigation policy would falsify the bottleneck claim.
Original abstract
Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
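The paired design described in the abstract can be pictured as one record per intent. The sketch below is an assumed layout, not the benchmark's released schema: the class and field names are hypothetical, the style keys are borrowed from the paper's annotation rubric, the mode names from its description of intent modes, and all variants except the abstract's own "warm this food" example are invented for illustration, as is the mode assigned to it.

```python
from dataclasses import dataclass
from typing import Dict

STYLES = ("formal_en", "natural_en", "casual_en", "emotional_en")
MODES = ("event-script", "inner-state", "physical-state", "affordance")


@dataclass(frozen=True)
class IntentRecord:
    """One intent: four style variants, one mode, one hidden target (hypothetical)."""
    intent_id: str
    scene_id: str
    mode: str
    variants: Dict[str, str]
    target_category: str  # used by the evaluator only, never shown to the agent

    def __post_init__(self) -> None:
        if self.mode not in MODES:
            raise ValueError(f"unknown intent mode: {self.mode}")
        if set(self.variants) != set(STYLES):
            raise ValueError("expected exactly one variant per style")

    def agent_view(self, style: str) -> dict:
        """What the agent receives: the instruction, without the target name."""
        return {"scene_id": self.scene_id, "instruction": self.variants[style]}


# Illustrative record; only the natural_en text comes from the abstract.
EXAMPLE = IntentRecord(
    intent_id="demo-0001",
    scene_id="scene-demo",
    mode="affordance",  # assumed label for illustration
    variants={
        "formal_en": "Could you help me find something to heat this meal?",
        "natural_en": "I need something to warm this food",
        "casual_en": "Gotta heat this up",
        "emotional_en": "Ugh, this food's gone cold and I'm starving.",
    },
    target_category="microwave",
)
```

Keeping the target inside the record but outside `agent_view` reflects the withholding described in the abstract while still letting an evaluator score target identification.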
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IntentionNav, a diagnostic benchmark for intent-driven object navigation from implicit human instructions. It consists of 500 free-text intents spanning 176 Isaac Sim scenes and 64 target categories; each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes. Using a fixed active-navigation agent, three VLMs are evaluated on target identification (48.3%), 2 m neighborhood entry (68.7%), terminal success (24.9%), and grounded 1 m success (5.5%), with higher performance on event-script intents than physical-state or affordance intents. The central claim is that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization.
Significance. If the benchmark construction is shown to be representative, the controlled multi-style and multi-mode design would provide a useful diagnostic tool for isolating failures in embodied search under indirect instructions. The explicit separation of phrasing from semantic cue type and the reporting of intermediate metrics (target ID, neighborhood entry, terminal success) are strengths that go beyond aggregate success rates.
major comments (2)
- [Abstract / §3 (Benchmark Construction)] Abstract and benchmark-construction section: the generalization that indirect human intent is a general bottleneck for embodied agents rests on the assumption that the 500 intents and four-mode annotations reflect the distribution and difficulty of real-world implicit instructions, yet no details are supplied on intent sourcing, human elicitation protocol, inter-annotator reliability for mode labels, or calibration against any naturalistic instruction corpus.
- [Abstract / §4 (Experiments)] Abstract and evaluation section: the headline metrics use 2 m neighborhood and 1 m grounded-success thresholds (yielding 68.7 % and 5.5 %), but the manuscript gives no indication whether these distances were pre-specified in the evaluation protocol or selected after inspecting the data; post-hoc threshold choice would weaken the claim that terminal localization is the dominant failure mode.
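One way to address the threshold concern is to report success as a function of the distance cutoff rather than at a single value. The following is a minimal sketch, assuming a per-episode distance to the target is logged (for example, the distance at termination); the function names and the threshold grid are hypothetical and not taken from the paper.

```python
from typing import Iterable, List, Sequence, Tuple


def rate_within(distances_m: Iterable[float], threshold_m: float) -> float:
    """Fraction of episodes whose logged distance is at or below the threshold."""
    distances = list(distances_m)
    if not distances:
        raise ValueError("no episodes supplied")
    return sum(d <= threshold_m for d in distances) / len(distances)


def threshold_sweep(distances_m: Iterable[float],
                    thresholds_m: Sequence[float] = (0.5, 1.0, 1.5, 2.0, 3.0),
                    ) -> List[Tuple[float, float]]:
    """Success rate at each candidate threshold, for a sensitivity table."""
    distances = list(distances_m)
    return [(t, rate_within(distances, t)) for t in thresholds_m]
```

If the qualitative conclusion holds across the sweep, the exact choice of 1 m or 2 m matters less; if it flips near the chosen values, the referee's worry about post-hoc selection gains weight.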
minor comments (2)
- [Abstract] The abstract refers to “three VLMs” without naming the models or providing implementation details (prompt templates, action space, termination criteria), which limits immediate reproducibility of the reported percentages.
- [§4 (Experiments)] The paired design (four styles × four modes under matched geometry) is described as supporting fine-grained analysis, but the results section does not report per-mode breakdowns beyond the three aggregate intent-type success rates; adding these tables would strengthen the diagnostic value.
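The breakdown requested in the last comment amounts to grouping episode outcomes by annotation. A small sketch under stated assumptions follows: the episode dictionaries, their keys, and the toy outcomes are hypothetical, and a real analysis would read them from the benchmark's own logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def success_table(episodes: Iterable[Mapping],
                  keys: Sequence[str]) -> Dict[Tuple, float]:
    """Success rate per cell, where a cell is the tuple of values for `keys`."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for ep in episodes:
        cell = tuple(ep[k] for k in keys)
        totals[cell] += 1
        wins[cell] += int(bool(ep["success"]))
    return {cell: wins[cell] / totals[cell] for cell in totals}


# Toy outcomes, invented purely to exercise the function.
TOY_EPISODES = [
    {"mode": "event-script", "style": "formal_en", "success": True},
    {"mode": "event-script", "style": "casual_en", "success": False},
    {"mode": "affordance", "style": "formal_en", "success": False},
    {"mode": "affordance", "style": "casual_en", "success": True},
    {"mode": "inner-state", "style": "formal_en", "success": False},
]
```

Calling `success_table(TOY_EPISODES, ["mode"])` gives a per-mode table, and passing `["mode", "style"]` gives the cross-tabulation that the paired design makes possible.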
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We provide detailed responses to each major comment below and indicate the revisions we will make.
-
Referee: [Abstract / §3 (Benchmark Construction)] Abstract and benchmark-construction section: the generalization that indirect human intent is a general bottleneck for embodied agents rests on the assumption that the 500 intents and four-mode annotations reflect the distribution and difficulty of real-world implicit instructions, yet no details are supplied on intent sourcing, human elicitation protocol, inter-annotator reliability for mode labels, or calibration against any naturalistic instruction corpus.
Authors: We acknowledge that additional details on the construction of the benchmark would be beneficial. In the revised manuscript, we will expand the benchmark construction section to describe the intent sourcing process and the human elicitation protocol used to generate the 500 intents. We will also include inter-annotator reliability statistics for the mode annotations. Calibration against a naturalistic instruction corpus was not conducted in this work; we will explicitly note this as a limitation in the revised version.
Revision: partial
-
Referee: [Abstract / §4 (Experiments)] Abstract and evaluation section: the headline metrics use 2 m neighborhood and 1 m grounded-success thresholds (yielding 68.7 % and 5.5 %), but the manuscript gives no indication whether these distances were pre-specified in the evaluation protocol or selected after inspecting the data; post-hoc threshold choice would weaken the claim that terminal localization is the dominant failure mode.
Authors: The 2 m and 1 m thresholds were pre-specified in the evaluation protocol, drawing from common practices in the object navigation literature for defining neighborhood reachability and grounded success. We will revise §4 to explicitly state that these thresholds were determined prior to conducting the experiments, thereby clarifying that the choice was not post-hoc. revision: yes
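The inter-annotator reliability statistic promised in the first response is commonly reported as Cohen's kappa when two annotators label the same items. The sketch below implements the standard two-rater formula; the rebuttal does not say which statistic or how many annotators the authors will use, so this is an assumed choice, and the label lists in any usage are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With four intent modes, chance agreement is far from negligible, so raw percent agreement alone would overstate how reliable the mode labels are.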
Circularity Check
No circularity; empirical benchmark with externally measured results
The paper introduces IntentionNav as an empirical benchmark consisting of 500 human-annotated intents across 176 scenes and 64 categories, with controlled rewrites and mode labels. VLMs are evaluated on externally defined metrics (target identification rate, neighborhood entry, terminal success, grounded 1 m success) computed from scene geometry, RGB-D observations, and the provided annotations. No equations, fitted parameters, predictions, or derivations appear; the central claim that indirect intent is a bottleneck follows directly from the measured performance numbers rather than reducing to a self-definition, fitted-input prediction, or self-citation chain. The construction and evaluation are self-contained against the stated scenes and annotations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The provided RGB-D observations and pose are sufficient for an agent to perform active object search in the simulated scenes.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes... Models identify the intended target in 48.3% of episodes and enter its 2 m neighborhood in 68.7%, but terminate successfully in only 24.9%...
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Intent mode captures the semantic anchor... event-script, inner-state, physical-state, affordance cues...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Additional benchmark statistics
Table 6 lists the five target groups and representative categories in the 500-intent spl...
target_grounding (0-10): Given ONLY the attached photo + the target category word + these 4 English variants, could a reader uniquely pick out the target object?
- 10 = clearly and uniquely points to target; >=2 variants carry an affordance hint distinguishing this target from competitors.
- 7 = target can be picked out, but one or two variants are generic.
- 4 ...
style_distinguishability (0-10): Are the four variants clearly different in register/tone/length?
- formal_en: polished, friendly-formal.
- natural_en: everyday spoken English, contractions.
- casual_en: brief (4-9 words), contractions.
- emotional_en: emotional/atmospheric, still conversational.
- 10 = all four clearly distinct.
- 7 = three distinct, one close to ano...