Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

Kaijie Yun; Yue Chen

arxiv: 2605.19420 · v1 · pith:WMJYDJXKnew · submitted 2026-05-19 · 💻 cs.RO

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.RO

keywords: semantic navigation · dual heatmap · cross-embodiment transfer · vision language model · affordance prediction · robot navigation · synthetic data

The pith

Predicting dual heatmaps for reachable regions and orientations allows robots to ground semantic instructions into safe, executable navigation goals instead of brittle single points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of turning open-ended semantic instructions into physical actions for robots by moving away from regressing single waypoints. Single points often land on non-traversable spots like object centers, causing failures. By using a vision-language model to output a navigation affordance heatmap showing continuous reachable areas and a facing heatmap for orientation, the targets stay in free space. This dense output serves as a potential field for planners and works across different robot bodies when trained on synthetic data. A sympathetic reader would care because it offers a more robust way to make robot navigation reliable in real settings.

Core claim

The framework abandons single-point regression in favor of a Dual-Heatmap representation that predicts a navigation affordance heatmap capturing continuous reachable regions coupled with a facing heatmap for orientation constraints. These outputs function as a differentiable semantic potential field that integrates with local planners. Supported by a synthetic data pipeline, experiments show state-of-the-art performance among 8B baselines and improved Affordance Rate across embodiments including Jetbot, H1, and Aliengo by reliably placing targets in executable free space.

What carries the argument

Dual-Heatmap representation with a navigation affordance heatmap for reachable regions and a facing heatmap for orientation constraints, acting as a semantic potential field.
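As a rough illustration of how the two maps might be decoded into a goal pose, the sketch below takes the peak of the affordance map as the stand position and the peak of the facing map as a look-at point, then derives a yaw from the two. The function names, the grid layout, and the look-at reading of the facing heatmap are assumptions made for illustration; the summary above does not specify the decoding step.

```python
import math

def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value > grid[best[0]][best[1]]:
                best = (r, c)
    return best

def decode_goal_pose(nav_heatmap, facing_heatmap):
    """Hypothetical decoder: position from the nav peak, yaw toward the facing peak."""
    goal_r, goal_c = argmax_2d(nav_heatmap)
    face_r, face_c = argmax_2d(facing_heatmap)
    # Assumed grid convention: x runs along columns, y along rows.
    yaw = math.atan2(face_r - goal_r, face_c - goal_c)
    return (goal_r, goal_c), yaw

# Toy 3x3 maps, not data from the paper.
nav = [[0.0, 0.1, 0.0],
       [0.2, 0.9, 0.1],
       [0.0, 0.1, 0.0]]
facing = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.8],
          [0.0, 0.0, 0.0]]
goal, yaw = decode_goal_pose(nav, facing)
```

A real decoder would more likely sample or smooth over the high-probability region than take a hard argmax; the argmax is used here only to keep the example short.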

If this is right

Drastically improves the Affordance Rate by ensuring targets are in executable free space.
Achieves state-of-the-art performance among comparable 8B vision-language baselines.
Transfers effectively to diverse robot embodiments such as Jetbot, H1, and Aliengo.
Integrates seamlessly with downstream local planners as a differentiable potential field.
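The summary does not give a formula for Affordance Rate. A minimal sketch, assuming it is the fraction of predicted goals that land on traversable cells, could look like the following. The function name, the binary free-space mask, and the toy inputs are all hypothetical and are not results from the paper.

```python
def affordance_rate(predicted_goals, free_space_mask):
    """Fraction of predicted (row, col) goals that fall on traversable cells.

    Assumed definition: a goal counts as a hit when it is inside the grid
    and the mask marks that cell as free (truthy).
    """
    if not predicted_goals:
        return 0.0
    rows, cols = len(free_space_mask), len(free_space_mask[0])
    hits = 0
    for r, c in predicted_goals:
        if 0 <= r < rows and 0 <= c < cols and free_space_mask[r][c]:
            hits += 1
    return hits / len(predicted_goals)

# Toy scene: the centre cell is an obstacle, e.g. an object centre.
mask = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
point_goals = [(1, 1), (1, 1), (0, 2), (1, 1)]    # mostly on the object centre
heatmap_goals = [(1, 0), (0, 1), (0, 2), (2, 1)]  # all in free space
ar_point = affordance_rate(point_goals, mask)
ar_heatmap = affordance_rate(heatmap_goals, mask)
```

The toy inputs only illustrate the failure mode the paper describes, where a regressed point sits on the object itself rather than on the floor next to it.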

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend to other tasks where semantic understanding must map to physical actions beyond navigation.
Using synthetic data pipelines might reduce reliance on costly real-world robot data collection for training.
Testing the approach in cluttered or dynamic real-world environments would reveal its robustness limits.

Load-bearing premise

The fully automated foundation-model-assisted synthetic data pipeline produces training distributions that transfer to real robot embodiments and diverse environments without significant domain gap.

What would settle it

A large performance drop or high failure rate when the model is deployed on physical robots in environments different from the simulation would falsify the transferability of the approach.

Figures

Figures reproduced from arXiv: 2605.19420 by Kaijie Yun, Yue Chen.

**Figure 2.** Overview of the proposed heatmap prediction model.

**Figure 3.** Overview of the automated synthetic data generation pipeline.

**Figure 4.** Qualitative comparison of spatial grounding. Given a multimodal instruction, the baseline ...

**Figure 5.** Execution trajectories in the simulation environment. While deterministic waypoints easily ...

Original abstract

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Dual heatmaps for affordance and facing look like a sensible step past single-point regression, but the sim-only results and missing real-robot checks keep the cross-embodiment claims provisional.

The core move is replacing waypoint regression with two dense heatmaps: one marking reachable affordance regions and one encoding orientation constraints. This gives a differentiable potential field that local planners can use directly, which should cut down on cases where a robot heads straight for an object center that it cannot actually reach or stand on. The paper shows this through a feature-fusion study and reports higher Affordance Rate numbers across three simulated platforms: Jetbot, H1, and Aliengo. That cross-embodiment angle is the clearest addition beyond standard waypoint methods cited in the abstract. The automated synthetic data pipeline built on foundation models is also a practical piece of infrastructure that lets them scale training without manual labeling. Those elements are concrete and worth noting.

The main limitation is that all reported gains stay inside simulation. There are no real-robot deployments, no measurements of how sensor noise or lighting shifts move the predicted heatmaps, and no explicit sim-to-real gap numbers. If the heatmaps drift outside traversable space on hardware, the claimed reduction in brittleness does not hold. The abstract mentions state-of-the-art results among 8B baselines but gives few baseline details or ablation tables, so the performance edge is hard to judge from the text alone.

This work is aimed at researchers building vision-language navigation stacks who already use dense prediction or potential fields. A reader looking for a new output representation and a multi-embodiment sim benchmark will find usable ideas here. The paper is coherent on its own terms and shows honest engagement with the execution-failure problem, so it deserves a serious referee who can ask for real-world validation and fuller experimental reporting.

Referee Report

2 major / 2 minor

Summary. The paper proposes replacing deterministic waypoint regression with a Dual-Heatmap representation (navigation affordance heatmap plus facing heatmap) for in-FOV semantic navigation. The dense outputs act as a differentiable semantic potential field integrable with local planners. A foundation-model-assisted synthetic data pipeline and simulation benchmark are introduced, with experiments across Jetbot, H1, and Aliengo embodiments claiming SOTA results among 8B baselines and improved Affordance Rate by reliably placing targets in executable free space.

Significance. If the central claims hold, the work provides a transferable approach to mitigating point-regression brittleness in cross-embodiment settings through continuous reachable-region predictions. Credit is given for the automated synthetic data pipeline, the feature-fusion study demonstrating heatmap benefits, and the multi-embodiment simulation benchmark that directly tests the affordance improvements.

major comments (2)

[Experiments] Experiments section: The state-of-the-art performance claims and Affordance Rate gains across embodiments are reported without baseline details, ablation numbers, error bars, or statistical analysis, leaving the quantitative support for the central performance improvements only partially verifiable.
[Synthetic Data Pipeline] Synthetic data pipeline and simulation studies: No real-robot deployment or quantitative measurement of sim-to-real domain gap (sensor noise, lighting, calibration) for the predicted heatmaps is described, which is load-bearing for the claim that targets are placed in executable free space under real conditions.

minor comments (2)

[Abstract] The abstract introduces Affordance Rate (AR) and SOTA results but omits any numerical values or specific baseline names.
[Method] Notation for the dual heatmaps and their integration as a potential field would benefit from explicit equations or pseudocode in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We appreciate the positive remarks on the dual-heatmap approach, the synthetic data pipeline, and the multi-embodiment benchmark. We address the major comments below and will incorporate revisions to improve the clarity and verifiability of our results.

Referee: [Experiments] Experiments section: The state-of-the-art performance claims and Affordance Rate gains across embodiments are reported without baseline details, ablation numbers, error bars, or statistical analysis, leaving the quantitative support for the central performance improvements only partially verifiable.

Authors: We agree with the referee that more detailed reporting is necessary to fully support our claims. In the revised version of the manuscript, we will provide comprehensive baseline details, including descriptions of the 8B models used for comparison, full numerical results from ablation studies on the feature-fusion and heatmap components, error bars (standard deviations) for all reported metrics across multiple evaluation runs, and statistical analysis (e.g., paired t-tests) to demonstrate the significance of the Affordance Rate improvements across embodiments. These additions will make the quantitative evidence more robust and verifiable.

revision: yes

Referee: [Synthetic Data Pipeline] Synthetic data pipeline and simulation studies: No real-robot deployment or quantitative measurement of sim-to-real domain gap (sensor noise, lighting, calibration) for the predicted heatmaps is described, which is load-bearing for the claim that targets are placed in executable free space under real conditions.

Authors: The manuscript is centered on a simulation-based evaluation to validate the dual-heatmap grounding method using our automated synthetic data pipeline and a new benchmark. All claims about placing targets in executable free space are made and verified within the simulated environments, where precise ground truth for traversability is available. We do not have real-robot deployment results in this work. We will revise the paper to clarify the simulation scope of the claims and include an expanded discussion on limitations, explicitly addressing potential sim-to-real challenges such as sensor noise, lighting variations, and calibration issues, along with plans for future real-world experiments.

revision: partial
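The error bars and paired t-tests mentioned in the first response could be computed along the following lines. This is a sketch using only the Python standard library; the function name and the per-run scores are placeholders invented for illustration, not numbers from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two methods evaluated on the same runs.

    Returns (t, degrees_of_freedom). Requires at least two pairs and
    non-identical differences, otherwise the statistic is undefined.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder Affordance Rate values over four evaluation seeds.
heatmap_runs = [0.80, 0.82, 0.78, 0.84]
waypoint_runs = [0.70, 0.74, 0.69, 0.73]

mean_ar = statistics.mean(heatmap_runs)
std_ar = statistics.stdev(heatmap_runs)  # the error bar for this method
t_stat, dof = paired_t_statistic(heatmap_runs, waypoint_runs)
```

Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide; a statistics package would normally be used for that step.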

Circularity Check

0 steps flagged

No circularity detected in empirical framework or derivations

The paper introduces a dual-heatmap representation (affordance + facing) for semantic navigation and supports it with a synthetic data generation pipeline plus simulation benchmarks across embodiments. No equations, parameter fits, or derivations are described that reduce claimed outputs or improvements to quantities defined by the inputs themselves. Central performance claims rest on reported Affordance Rate gains and SOTA comparisons in simulation, which are independent measurements rather than tautological. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the effectiveness of heatmap outputs over point regression and on the realism of the synthetic data pipeline; no explicit free parameters are fitted to target results in the abstract.

axioms (1)

domain assumption: Foundation models can generate sufficiently realistic synthetic navigation scenes and prompts for training.
Invoked to justify the automated data pipeline that supports the dual-heatmap training.

invented entities (1)

Dual-Heatmap representation (no independent evidence)
purpose: To capture continuous reachable regions and orientation constraints as a differentiable semantic potential field.
New output format introduced to replace single-point regression.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: J(x) = α H_nav(x) + β H_fac(x) − Cost_collision(x) ... dense outputs inherently function as a differentiable semantic potential field

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Dual-Heatmap representation ... navigation affordance heatmap ... facing heatmap
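The scoring function quoted in the first passage, J(x) = α H_nav(x) + β H_fac(x) − Cost_collision(x), can be evaluated cell by cell on a grid. The sketch below assumes all three terms are given as equally sized 2D lists and picks the highest-scoring cell as the local goal. The weights `alpha` and `beta`, the function names, and the toy values are illustrative assumptions, not settings from the paper.

```python
def potential_field(h_nav, h_fac, collision_cost, alpha=1.0, beta=0.5):
    """Evaluate J = alpha * H_nav + beta * H_fac - Cost_collision per cell."""
    rows, cols = len(h_nav), len(h_nav[0])
    return [
        [alpha * h_nav[r][c] + beta * h_fac[r][c] - collision_cost[r][c]
         for c in range(cols)]
        for r in range(rows)
    ]

def best_cell(field):
    """Return the (row, col) index of the highest-scoring cell."""
    cells = ((r, c) for r in range(len(field)) for c in range(len(field[0])))
    return max(cells, key=lambda rc: field[rc[0]][rc[1]])

# Toy 2x2 example: the cell with the strongest affordance is penalised
# by a collision cost, so a neighbouring free cell wins.
h_nav = [[0.2, 0.6],
         [0.9, 0.3]]
h_fac = [[0.1, 0.4],
         [0.2, 0.1]]
cost = [[0.0, 0.0],
        [1.0, 0.0]]
J = potential_field(h_nav, h_fac, cost)
goal = best_cell(J)
```

A planner that needs gradients rather than a single maximiser could run the same computation on a differentiable array type; plain lists are used here to keep the example dependency-free.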

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[2] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

[3] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

[4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017

[5] J. Chen, B. Lin, X. Liu, L. Ma, X. Liang, and K.-Y. K. Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23568–23576, 2025

[6] P. Chen, D. Ji, K. Lin, R. Zeng, T. Li, M. Tan, and C. Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Advances in Neural Information Processing Systems, 35:38149–38161, 2022

[7] S. Chen, P. He, J. Hu, Z. Liu, Y. Wang, T. Xu, C. Zhang, C. Zhang, C. An, S. Cai, et al. Astra: Toward general-purpose mobile robots via hierarchical multimodal learning. arXiv preprint arXiv:2506.06205, 2025

[8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

[9] M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. arXiv preprint arXiv:2409.17146, 2024

[10] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

[11] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song. CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023

[12] J. J. Gibson. The ecological approach to the visual perception of pictures. Leonardo, 11(3):227–235, 1978

[13] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

[14] C. Huang, O. Mees, A. Zeng, and W. Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023

[15] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

[16] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer, 2020

[17] J. Krantz, A. Gokaslan, D. Batra, S. Lee, and O. Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021

[18] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023

[19] S. K. Ramakrishnan, D. S. Chaplot, Z. Al-Halah, J. Malik, and K. Grauman. PONI: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022

[20] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf. SayPlan: Grounding large language models using 3D scene graphs for scalable robot task planning. arXiv preprint arXiv:2307.06135, 2023

[21] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10912–10922, 2021

[22] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam. CLIP-Fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022

[23] X. Shao, Y. Tang, P. Xie, K. Zhou, Y. Zhuang, X. Quan, J. Hao, L. Zeng, and X. Li. More than a point: Capturing uncertainty with adaptive affordance heatmaps for spatial grounding in robotic tasks. arXiv preprint arXiv:2510.10912, 2025

[24] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015

[25] M. T. I. SpatialVerse Research Team. InteriorAgent: Interactive USD interior scenes for Isaac Sim-based simulation. https://huggingface.co/datasets/spatialverse/InteriorAgent, 2025

[26] I. Team. InternVLA-N1: An open dual-system navigation foundation model with learned latent plans, 2025

[27] H. Wang, J. Chen, W. Huang, Q. Ben, T. Wang, B. Mi, T. Huang, S. Zhao, Y. Chen, S. Yang, P. Cao, W. Yu, Z. Ye, J. Li, J. Long, Z. Wang, H. Wang, Y. Zhao, Z. Tu, Y. Qiao, D. Lin, and P. Jiangmiao. GRUtopia: Dream general robots in a city at scale. In arXiv, 2024

[28] L. Wang, X. Xia, H. Zhao, H. Wang, T. Wang, Y. Chen, C. Liu, Q. Chen, and J. Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025

[29] M. Wei, C. Wan, J. Peng, X. Yu, Y. Yang, D. Feng, W. Cai, C. Zhu, T. Wang, J. Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

[30] A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard. Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024

[31] B. Yu, H. Kasaei, and M. Cao. L3MVN: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560. IEEE, 2023

[32] X. Zhou, D. Wang, and P. Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019

[33] X. Zhou, T. Xiao, L. Liu, Y. Wang, M. Chen, X. Meng, X. Wang, W. Feng, W. Sui, and Z. Su. FSR-VLN: Fast and slow reasoning for vision-language navigation with hierarchical multi-modal scene graph. arXiv preprint arXiv:2509.13733, 2025

[34] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
