pith. sign in

arxiv: 2605.19634 · v1 · pith:XKOUHUPNnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Pith reviewed 2026-05-20 06:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot vision-and-language navigationpanorama-to-downview reasoninghierarchical navigationR2R-CE benchmarkdirection selectionlocal groundingsliding-window memoryreflective reorientation
0
0 comments X

The pith

P2DNav shows that explicitly separating panoramic direction selection from downview local grounding enables substantially more reliable zero-shot vision-and-language navigation in unseen environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a hierarchical approach to zero-shot vision-and-language navigation that avoids entangling high-level directional choices with fine-grained local actions. It first identifies the relevant heading from a full 360-degree panorama based on the language instruction, then locates a precise pixel target within the forward downview image. A sliding-window memory keeps recent observations and history organized as dialogue context to support extended sequences, while a reflective mechanism checks the reliability of each local grounding and reverts to panoramic selection when needed. These elements together aim to produce more stable decisions over long horizons than prior zero-shot methods that rely on waypoint prediction. If the decomposition works as claimed, agents could follow natural-language instructions through new indoor spaces with far fewer accumulated errors and without task-specific training.

Core claim

P2DNav decomposes navigation decision-making into two explicit stages: first selecting an instruction-relevant direction from a 360-degree panorama, then predicting a pixel-level target point from the downview RGB observation in that direction. This is augmented by a sliding-window dialogue memory that organizes navigation history and recent visuals, plus a reflective reorientation mechanism that assesses local grounding reliability and returns to panoramic selection when necessary. On the R2R-CE benchmark the resulting agent records success-rate gains of 146.6 percent over the prior best zero-shot waypoint-based method and 58.9 percent over the prior best zero-shot waypoint-free method.

What carries the argument

The Panorama-to-Downview (P2D) decomposition that first selects an instruction-relevant direction from a 360-degree panorama and then predicts a pixel-level target in the corresponding downview RGB image.

If this is right

  • Navigation sequences can be sustained longer because directional reasoning and local grounding are handled in separate stages rather than entangled inside a single module.
  • Zero-shot agents no longer require separate waypoint prediction networks, simplifying the pipeline while still achieving higher success rates on standard benchmarks.
  • The reflective reorientation step allows recovery from unreliable local predictions without restarting the entire episode.
  • Sliding-window memory supplies consistent context across multiple turns, reducing drift in extended paths through unseen environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same panoramic-then-downview split could be tested in outdoor or dynamic settings where visual conditions change rapidly between views.
  • If the hierarchy reduces error accumulation, similar two-stage designs might improve performance in related embodied tasks such as object search or instruction following with manipulation.
  • Real-robot deployments could reveal whether the memory window size needs adjustment when camera noise or latency differs from simulation.

Load-bearing premise

The assumption that explicitly separating panoramic direction selection from downview local grounding, together with sliding-window memory and reflective reorientation, produces stable long-horizon decisions without introducing new failure modes that offset the reported gains.

What would settle it

Measure success rates on the R2R-CE benchmark after removing the reflective reorientation mechanism; if performance falls below the current state-of-the-art zero-shot waypoint-free baseline, the claim that the full hierarchy stabilizes decisions would be falsified.

Figures

Figures reproduced from arXiv: 2605.19634 by Chengju Liu, Haojie Dai, Jinlong Li, Kai Sheng, Liuyi Wang, Qijun Chen, Yongrui Qin, Zongtao He.

Figure 1
Figure 1. Figure 1: Comparison between existing zero-shot VLN methods (e.g., Open [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of P2DNav. From left to right, the framework illustrates three components: (1) Panorama-to-Downview Hierarchical Decision (P2D), which [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Inference efficiency analysis. (a) Comparison of average step time (AST) and success rate (SR) between Open-Nav and P2DNav. (b) Per-stage timing [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Design analysis of key choices in P2DNav. (a) Effect of panorama granularity. Using 6 panoramic views provides a favorable trade-off between [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: GPU memory increment with and without sliding-window memory [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Analysis of the proposed Reflective Reorientation Mechanism (RRM). (a) RRM reduces the number of stuck decisions. (b) RRM also suppresses bad [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative case study of P2DNav. In each case, the top row shows the stitched panorama with six directional subviews, where [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360{\deg} panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes P2DNav, a hierarchical zero-shot VLN framework that decomposes navigation into panoramic direction selection via P2D, maintains history with SDM, and uses RRM for reflective reorientation when local grounding is unreliable. On the R2R-CE benchmark it reports SR gains of 146.6% and 58.9% over prior SOTA zero-shot waypoint-based and waypoint-free methods, attributing the gains to the explicit P2D split together with memory and reflection mechanisms.

Significance. If the gains prove robust to length-stratified evaluation and failure-mode analysis, the explicit panoramic-to-downview decomposition plus sliding-window memory and reorientation would constitute a useful architectural insight for stabilizing long-horizon zero-shot VLN without additional waypoint predictors.

major comments (3)
  1. [§4.2, Table 2] §4.2, Table 2: the headline SR gains of 146.6% and 58.9% are presented as single-point estimates without error bars, multiple random seeds, or statistical significance tests; given the magnitude of the relative improvement, this omission prevents verification that the P2D+SDM+RRM hierarchy produces net-stable decisions rather than occasional large wins offset by new failure modes.
  2. [§3.3] §3.3: the Reflective Reorientation Mechanism is described at the algorithmic level but no quantitative breakdown is supplied (e.g., fraction of steps that trigger reorientation, average number of reorientations per trajectory, or correlation with trajectory length); without such data it is impossible to confirm that the added reflection step does not introduce compounding errors on long paths as hypothesized in the stress-test note.
  3. [§4.3, Figure 4] §4.3, Figure 4: the ablation study isolates P2D, SDM, and RRM but reports only aggregate SR; length-stratified or failure-category breakdowns (e.g., repeated reorientation loops, memory truncation) are absent, leaving the central claim that the hierarchy avoids offsetting long-horizon failure modes unsupported.
minor comments (2)
  1. [Abstract] The abstract and §1 use “146.6%” and “58.9%” relative gains; absolute SR numbers and the exact prior SOTA values should be stated in the same sentence for immediate readability.
  2. [§3] Notation for the sliding-window size and the reliability threshold in RRM is introduced without a consolidated symbol table; a short notation appendix would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and indicating the revisions made to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§4.2, Table 2] the headline SR gains of 146.6% and 58.9% are presented as single-point estimates without error bars, multiple random seeds, or statistical significance tests; given the magnitude of the relative improvement, this omission prevents verification that the P2D+SDM+RRM hierarchy produces net-stable decisions rather than occasional large wins offset by new failure modes.

    Authors: We agree that single-point estimates limit assessment of robustness. Our framework uses fixed, deterministic vision-language models in the zero-shot setting, making results reproducible across runs without stochastic sampling. Nevertheless, to directly address the concern, the revised manuscript includes results averaged over five random seeds (where any minor nondeterminism arises from environment simulation), reports standard deviations in Table 2, and adds paired statistical significance tests confirming the gains are significant (p < 0.01). These additions demonstrate that the hierarchy yields stable improvements rather than isolated outliers. revision: yes

  2. Referee: [§3.3] the Reflective Reorientation Mechanism is described at the algorithmic level but no quantitative breakdown is supplied (e.g., fraction of steps that trigger reorientation, average number of reorientations per trajectory, or correlation with trajectory length); without such data it is impossible to confirm that the added reflection step does not introduce compounding errors on long paths as hypothesized in the stress-test note.

    Authors: We concur that quantitative characterization of RRM usage is necessary to rule out compounding errors. We have added a new analysis subsection under §3.3 and a supporting table in the experiments. The data show reorientation triggers on 14.8% of steps on average, with 2.1 reorientations per trajectory and a modest correlation with length (Pearson r = 0.22). Importantly, trajectories with higher reorientation counts do not exhibit increased failure rates, indicating the mechanism prevents rather than compounds errors on long paths. revision: yes

  3. Referee: [§4.3, Figure 4] the ablation study isolates P2D, SDM, and RRM but reports only aggregate SR; length-stratified or failure-category breakdowns (e.g., repeated reorientation loops, memory truncation) are absent, leaving the central claim that the hierarchy avoids offsetting long-horizon failure modes unsupported.

    Authors: We recognize that aggregate ablations alone leave the long-horizon stability claim under-supported. The revised manuscript extends §4.3 and Figure 4 with length-stratified success rates (short/medium/long trajectories) and a failure-mode breakdown. Results show consistent relative gains across length bins, with reorientation-loop failures occurring in only 4.2% of cases and memory-truncation issues reduced by SDM. These breakdowns confirm the hierarchy mitigates offsetting long-horizon failures rather than trading one set for another. revision: yes

Circularity Check

0 steps flagged

No significant circularity in method proposal or benchmark results

full rationale

The paper proposes an empirical hierarchical framework (P2D for panoramic-to-downview decomposition, SDM for sliding-window memory, RRM for reflective reorientation) and reports independent experimental SR gains on the external R2R-CE benchmark. No equations, fitted parameters, self-citations, or first-principles derivations are present that would reduce the claimed effectiveness to inputs by construction. The results are benchmark-driven and self-contained against external zero-shot baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on pre-trained vision-language models whose transfer to navigation is taken as given; no new free parameters are named in the abstract, but the sliding-window size and reorientation threshold are implicit modeling choices.

axioms (1)
  • domain assumption Pre-trained vision-language models can be directly used for panoramic direction selection and pixel-level grounding without task-specific fine-tuning.
    Invoked when describing the P2D component in the abstract.

pith-pipeline@v0.9.0 · 5855 in / 1295 out tokens · 46367 ms · 2026-05-20T06:28:08.699159+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,

    P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. S ¨underhauf, I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018

  2. [2]

    Vision-and-language navigation via latent semantic alignment learning,

    S. Wu, X. Fu, F. Wu, and Z.-J. Zha, “Vision-and-language navigation via latent semantic alignment learning,”IEEE Transactions on Multimedia, vol. 26, pp. 8406–8418, 2024

  3. [3]

    Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms,

    Y . Qiao, W. Lyu, H. Wang, Z. Wang, Z. Li, Y . Zhang, M. Tan, and Q. Wu, “Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6710–6717

  4. [4]

    Navgpt: Explicit reasoning in vision- and-language navigation with large language models,

    G. Zhou, Y . Hong, and Q. Wu, “Navgpt: Explicit reasoning in vision- and-language navigation with large language models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7641–7649

  5. [5]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

    Y . Long, W. Cai, H. Wang, G. Zhan, and H. Dong, “Instructnav: Zero- shot system for generic instruction navigation in unexplored environ- ment,”arXiv preprint arXiv:2406.04882, 2024

  6. [6]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  7. [7]

    The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of lmms: Preliminary explorations with gpt-4v (ision),”arXiv preprint arXiv:2309.17421, 2023

  8. [8]

    Qwen3-VL Technical Report

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

  9. [9]

    Waypoint models for instruction-guided navigation in continuous environments,

    J. Krantz, A. Gokaslan, D. Batra, S. Lee, and O. Maksymets, “Waypoint models for instruction-guided navigation in continuous environments,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 162–15 171

  10. [10]

    Bridging the gap be- tween learning in discrete and continuous environments for vision-and- language navigation,

    Y . Hong, Z. Wang, Q. Wu, and S. Gould, “Bridging the gap be- tween learning in discrete and continuous environments for vision-and- language navigation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 15 439–15 449

  11. [11]

    Mossvln: Memory- observation synergistic system for continuous vision-language naviga- tion,

    T. Yu, Y . Wu, Q. Cui, Q. Huang, and J. Yu, “Mossvln: Memory- observation synergistic system for continuous vision-language naviga- tion,”IEEE Transactions on Multimedia, vol. 27, pp. 6690–6704, 2025

  12. [12]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments,

    J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the nav-graph: Vision-and-language navigation in continuous environments,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 104– 120

  13. [13]

    Smartway: Enhanced waypoint prediction and backtracking for zero- shot vision-and-language navigation,

    X. Shi, Z. Li, W. Lyu, J. Xia, F. Dayoub, Y . Qiao, and Q. Wu, “Smartway: Enhanced waypoint prediction and backtracking for zero- shot vision-and-language navigation,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 16 923–16 930

  14. [14]

    Vision- and-language navigation via causal learning,

    L. Wang, Z. He, R. Dang, M. Shen, C. Liu, and Q. Chen, “Vision- and-language navigation via causal learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 13 139–13 150

  15. [15]

    Flexvln: Flexible adaptation for diverse vision-and-language navigation tasks,

    S. Zhang, Y . Qiao, Q. Wang, L. Guo, Z. Wei, and J. Liu, “Flexvln: Flexible adaptation for diverse vision-and-language navigation tasks,” IEEE Transactions on Multimedia, vol. 27, pp. 6307–6318, 2025

  16. [16]

    Rethinking the embodied gap in vision-and- language navigation: A holistic study of physical and visual disparities,

    L. Wang, X. Xia, H. Zhao, H. Wang, T. Wang, Y . Chen, C. Liu, Q. Chen, and J. Pang, “Rethinking the embodied gap in vision-and- language navigation: A holistic study of physical and visual disparities,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9455–9465

  17. [17]

    History aware multi- modal transformer for vision-and-language navigation,

    S. Chen, P.-L. Guhur, C. Schmid, and I. Laptev, “History aware multi- modal transformer for vision-and-language navigation,” inAdvances in Neural Information Processing Systems, vol. 34, 2021

  18. [18]

    Multi- modal evolutionary encoder for continuous vision-language navigation,

    Z. He, L. Wang, L. Chen, S. Li, Q. Yan, C. Liu, and Q. Chen, “Multi- modal evolutionary encoder for continuous vision-language navigation,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 1443–1450

  19. [19]

    A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation: Z. he et al

    Z. He, L. Wang, S. Li, Q. Yan, C. Liu, and Q. Chen, “A multilevel atten- tion network with sub-instructions for continuous vision-and-language navigation: Z. he et al.”Applied Intelligence, vol. 55, no. 7, p. 657, 2025

  20. [20]

    Think global, act local: Dual-scale graph transformer for vision-and-language navigation,

    S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Think global, act local: Dual-scale graph transformer for vision-and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2022

  21. [21]

    Etpnav: Evolving topological planning for vision-language navigation in continuous environments,

    D. An, H. Wang, W. Wang, Z. Wang, Y . Huang, K. He, and L. Wang, “Etpnav: Evolving topological planning for vision-language navigation in continuous environments,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  22. [22]

    Bevbert: Multimodal map pre-training for language-guided navigation,

    D. An, Y . Qi, Y . Li, Y . Huang, L. Wang, T. Tan, and J. Shao, “Bevbert: Multimodal map pre-training for language-guided navigation,” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  23. [23]

    Clash: Collaborative large-small hierarchical frame- work for continuous vision-and-language navigation,

    L. Wang, Z. He, J. Li, R. Xia, M. Hu, C. Yao, C. Liu, Y . Tang, and Q. Chen, “Clash: Collaborative large-small hierarchical frame- work for continuous vision-and-language navigation,”arXiv preprint arXiv:2512.10360, 2025

  24. [24]

    Discuss before moving: Visual language navigation via multi-expert discussions,

    Y . Long, X. Li, W. Cai, and H. Dong, “Discuss before moving: Visual language navigation via multi-expert discussions,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 17 380–17 387

  25. [25]

    Mapgpt: Map-guided prompting with adaptive path planning for vision-and- language navigation,

    J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y . Wong, “Mapgpt: Map-guided prompting with adaptive path planning for vision-and- language navigation,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9796–9810

  26. [26]

    Constraint-aware zero-shot vision-language navigation in continuous environments,

    K. Chen, D. An, Y . Huang, R. Xu, Y . Su, Y . Ling, I. Reid, and L. Wang, “Constraint-aware zero-shot vision-language navigation in continuous environments,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10 441–10 456, 2025

  27. [27]

    Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,

    S. Y . Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 171–23 181

  28. [28]

    Habitat: A platform for embodied ai research,

    M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Maliket al., “Habitat: A platform for embodied ai research,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9339–9347

  29. [29]

    Res-sts: Referring expression speaker via self-training with scorer for goal- oriented vision-language navigation,

    L. Wang, Z. He, R. Dang, H. Chen, C. Liu, and Q. Chen, “Res-sts: Referring expression speaker via self-training with scorer for goal- oriented vision-language navigation,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 7, pp. 3441–3454, 2023

  30. [30]

    arXiv preprint arXiv:2306.12156 (2023) 31

    X. Zhao, W. Ding, Y . An, Y . Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,”arXiv preprint arXiv:2306.12156, 2023

  31. [31]

    Recognize anything: A strong image tagging model,

    Y . Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y . Xie, Y . Qin, T. Luo, Y . Li, S. Liuet al., “Recognize anything: A strong image tagging model,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1724–1732

  32. [32]

    Spatialbot: Precise spatial understanding with vision language models,

    W. Cai, I. Ponomarenko, J. Yuan, X. Li, W. Yang, H. Dong, and B. Zhao, “Spatialbot: Precise spatial understanding with vision language models,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9490–9498