pith. machine review for the scientific record.

arxiv: 2604.16993 · v1 · submitted 2026-04-18 · 💻 cs.AI · cs.CV · cs.RO

Recognition: unknown

Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.RO
keywords Vision-and-Language Navigation · Rule Compliance · Semantic Reasoning · Embodied AI · Urban Navigation · Vision-Language Models · Constraint-Aware Planning

The pith

Rule-VLN adds 177 regulatory categories to a 29k-node urban graph to test whether navigation agents can obey semantic rules instead of only reaching goals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up Rule-VLN as the first large-scale benchmark that embeds fine-grained regulatory constraints into vision-and-language navigation tasks. Current agents focus on physical reachability and ignore rules such as no-entry zones or behavioral limits, so the benchmark challenges them across four curriculum levels with 8k constrained nodes. The authors introduce SNRM, a zero-shot module that combines coarse-to-fine visual perception from a vision-language model with an epistemic mental map for planning detours. If the approach works, agents can move from goal-driven navigation to socially compliant behavior without retraining. Readers care because real-world deployment of embodied AI requires both arrival and rule adherence.
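
To make the construction concrete, here is a toy sketch of constraint injection into a navigation graph. It is our illustration only: the paper's MPSI pipeline selects strategic nodes via topological metrics and renders signage visually, whereas this sketch tags nodes uniformly at random.

```python
import random
import networkx as nx

def inject_constraints(graph: nx.Graph, categories: list[str],
                       n_constrained: int, seed: int = 0) -> list:
    """Tag a subset of nodes with a regulatory category the agent must obey.

    Toy stand-in for benchmark construction: the real pipeline picks
    strategic nodes by topological metrics and synthesizes signs into
    imagery; here we just sample nodes and attach a rule label."""
    rng = random.Random(seed)
    constrained = rng.sample(list(graph.nodes), n_constrained)
    for node in constrained:
        graph.nodes[node]["rule"] = rng.choice(categories)
    return constrained

# Rough scale reported in the paper: ~29k nodes, 8k constrained, 177 categories.
```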

Core claim

Rule-VLN reveals that state-of-the-art VLN models violate many regulatory constraints, yet SNRM restores performance by integrating semantic reasoning and geometric rectification, cutting constraint violation rate by 19.26 percent and raising task completion by 5.97 percent across the benchmark.
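
The abstract does not spell out how CVR and TC are computed. A plausible reading, with both metrics as episode-level rates (our assumption, not the paper's definition), looks like this:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    reached_goal: bool  # did the agent arrive at the target?
    violations: int     # constraint violations committed en route

def cvr(episodes: list[Episode]) -> float:
    """Constraint Violation Rate: share of episodes with >= 1 violation.
    (Assumed definition; the paper may instead count per-step violations.)"""
    return sum(e.violations > 0 for e in episodes) / len(episodes)

def tc(episodes: list[Episode]) -> float:
    """Task Completion: share of episodes that reach the goal."""
    return sum(e.reached_goal for e in episodes) / len(episodes)

# The reported deltas would then read as percentage-point changes:
# cvr(baseline) - cvr(snrm) ≈ 0.1926 and tc(snrm) - tc(baseline) ≈ 0.0597.
```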

What carries the argument

The Semantic Navigation Rectification Module (SNRM), which runs a coarse-to-fine VLM perception pipeline and maintains an epistemic mental map to generate rule-respecting detours in real time.
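
As a minimal sketch of that loop (ours, not the authors' implementation): perceive rules at the current node, prune actions the rules forbid from a working copy of the map, and re-plan toward the goal on what remains.

```python
import networkx as nx

def rectify_step(mental_map: nx.Graph, node, goal, perceive):
    """One perceive-prune-replan step in the style of SNRM (our sketch).

    `perceive(node)` stands in for the VLM call and returns neighbor ids
    that detected rules forbid entering; prohibitions accumulate in the
    mental map, so later re-plans also respect earlier detections."""
    for nbr in perceive(node):                 # e.g. a "No Entry" edge
        if mental_map.has_edge(node, nbr):
            mental_map.remove_edge(node, nbr)  # prune the illegal action
    try:
        return nx.shortest_path(mental_map, node, goal)  # compliant detour
    except nx.NetworkXNoPath:
        return None  # no rule-respecting route remains
```

Because pruned edges persist in the mental map, later re-plans respect every rule seen so far, which is what lets a detour survive newly discovered constraints.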

If this is right

  • Pre-trained navigation agents can gain rule awareness through a plug-in module rather than full retraining.
  • Navigation success metrics must now track both goal reachability and constraint compliance.
  • The four-level curriculum structure allows systematic measurement of how rule complexity affects agent performance.
  • Dynamic detour planning via an epistemic mental map enables agents to revise paths when new constraints appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same rectification pattern could apply to other embodied tasks such as manipulation where physical actions must respect safety or legal rules.
  • Testing SNRM in environments with continuously changing regulations would check whether the mental map updates remain reliable.
  • If the VLM perception step generalizes across cities, the benchmark could serve as a training signal for learning rule patterns directly.

Load-bearing premise

The 177 injected regulatory categories together with the zero-shot coarse-to-fine VLM perception correctly identify and interpret real-world semantic and behavioral constraints without any domain-specific fine-tuning or extra supervision.
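
What that premise assumes operationally can be sketched as a two-stage query in which a cheap coarse pass gates an expensive fine pass; the prompts and the `vlm` callable here are illustrative placeholders, not the paper's interface.

```python
def detect_rules(vlm, panorama, crops) -> list[str]:
    """Coarse-to-fine zero-shot rule reading (illustrative sketch).

    Stage 1 (coarse): ask once whether any regulatory signage is visible.
    Stage 2 (fine): only then inspect high-resolution crops for the rule.
    `vlm(image, prompt)` is any image+text -> text callable."""
    coarse = vlm(panorama, "Is any regulatory sign visible? Answer yes or no.")
    if "yes" not in coarse.lower():
        return []  # skip the fine pass when nothing is flagged
    rules = []
    for crop in crops:
        fine = vlm(crop, "State the rule on this sign (e.g. 'no entry'), or 'none'.")
        if fine.strip().lower() != "none":
            rules.append(fine.strip())
    return rules
```

The coarse gate is what keeps the zero-shot pipeline cheap; the premise is that neither stage misreads real signage.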

What would settle it

Deploy an SNRM-equipped agent in a physical urban setting with unscripted regulatory signs and observe whether its violation-rate reduction holds up to the 19.26 percent measured in simulation.

Figures

Figures reproduced from arXiv: 2604.16993 by Jiawen Wen, Penglei Sun, Suixuan Qiu, Weisheng Xu, Wenjie Zhang, Xiaofei Yang, Xiaowen Chu.

Figure 1
Figure 1: The Rule-VLN Paradigm. Left: Benchmark construction via the MPSI pipeline, injecting semantic constraints into urban topologies. Right: Unlike standard agents (bottom) violating “No Entry” signs, our method (top) helps the agent detect prohibitions, prune illegal actions, and execute compliant detours (green path).
Figure 2
Figure 2: Rule-VLN Construction Pipeline. (a) CityNav-Rules Dataset: Translates visual signals into permissible action constraints via LLM. (b) Benchmark Generation: Filters strategic nodes via topological metrics and injects constraints via MPSI to construct curriculum environments.
Figure 3
Figure 3: MPSI Pipeline. (a) Boundary extraction via Mroad and prior retrieval. (b) Synthesis via dual-mask-conditioned DiT. (c) GMM-based filtering and stitching.
Figure 4
Figure 4: The SNRM Framework. (a) Illustrating the dual-stage perception mechanism for rule grounding. (b-c) Showing the local mental map for trajectory correction.
Figure 5
Figure 5: Performance metrics of SOTA models on the Rule-VLN benchmark.
Figure 6
Figure 6: Visualization results of our method and other baselines on navigation samples. Green arrows indicate strictly […]
Figure 7
Figure 7: Quantitative evaluation of semantic alignment using CLIP scores. (a) Overall score distribution. (b) […]
Figure 8
Figure 8: Qualitative analysis of image inpainting results from MPSI, FLUX.1-Fill, and Google Nano Banana 2 across […]
read the original abstract

As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Rule-VLN, the first large-scale urban benchmark for rule-compliant vision-and-language navigation, featuring a 29k-node environment with 177 regulatory categories injected into 8k constrained nodes across four curriculum levels. It also proposes the Semantic Navigation Rectification Module (SNRM), a zero-shot add-on that uses coarse-to-fine VLM perception and an epistemic mental map for dynamic detour planning to equip pre-trained VLN agents with safety awareness. Experiments indicate that Rule-VLN poses challenges to SOTA models, but SNRM restores performance by reducing CVR by 19.26% and increasing TC by 5.97%.

Significance. This work addresses a critical gap in embodied AI by moving beyond geometric reachability to semantic and regulatory compliance in navigation tasks. The introduction of a large-scale benchmark with diverse constraints and a universal, zero-shot rectification module could facilitate safer real-world deployment of VLN agents. The reported quantitative gains, if robust, highlight the potential of integrating VLM-based semantic reasoning with planning.

major comments (2)
  1. [Abstract and §4 (Experiments)] The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.
  2. [§3.1 (Benchmark Construction)] The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.
minor comments (2)
  1. [Abstract] The acronyms CVR (Constraint Violation Rate) and TC (Task Completion) appear in the abstract and results without explicit expansion on first use, which reduces clarity for readers unfamiliar with the metrics.
  2. [Figures] Ensure that any figures depicting the SNRM pipeline (e.g., mental map construction or perception stages) include explicit labels distinguishing the coarse and fine VLM components to improve interpretability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for greater transparency in our quantitative results and stronger grounding for the benchmark. We address each major comment below and have revised the manuscript to incorporate additional details, sources, and clarifications while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.

    Authors: We agree these details are necessary for proper evaluation and reproducibility. In the revised manuscript, §4 and the appendix now specify: the baselines (VLN-BERT, RecBERT, and two additional SOTA VLN agents), 1,000 evaluation episodes per curriculum level, statistical significance via paired t-tests (p < 0.05 for both CVR and TC improvements), error bars as standard error in all plots, the 70/15/15 environment split, and the full SNRM implementation including the VLM (GPT-4V), coarse-to-fine perception prompts, epistemic mental map update logic, and detour planning algorithm. These additions directly support assessment of the reported gains. revision: yes

  2. Referee: [§3.1 (Benchmark Construction)] The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.

    Authors: We acknowledge the importance of this validation for ecological validity. The categories were derived from official municipal regulations, traffic codes, and accessibility guidelines, with curation and injection procedures now detailed in revised §3.1 along with a new appendix table listing all 177 categories and their sources. We have also added qualitative VLM perception examples and failure-case analysis in §4.3. A full-scale human annotation validation was not conducted due to the benchmark's size; we have added this as an explicit limitation in §5 and note that zero-shot VLM reliability was observed empirically across our experiments. This provides greater transparency while highlighting an avenue for future work. revision: partial
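
The paired t-test cited in response 1 is straightforward to reproduce given per-episode scores from the same episodes under both conditions. A sketch with placeholder numbers (not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-episode violation rates for the SAME 1,000 episodes run
# with and without SNRM -- stand-in values, not the paper's results.
baseline_cvr = rng.uniform(0.2, 0.6, size=1000)
snrm_cvr = np.clip(baseline_cvr - rng.uniform(0.0, 0.4, size=1000), 0.0, 1.0)

# Paired t-test: each episode is evaluated under both conditions,
# so the samples are matched and ttest_rel (not ttest_ind) applies.
t_stat, p_value = stats.ttest_rel(baseline_cvr, snrm_cvr)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # p < 0.05 backs the rebuttal's claim
```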

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper constructs a new benchmark (Rule-VLN) by injecting 177 regulatory categories into an existing 29k-node environment and evaluates a proposed zero-shot SNRM module (coarse-to-fine VLM perception plus mental-map planning) on pre-trained agents. Reported deltas (CVR reduction 19.26%, TC boost 5.97%) are framed as experimental outcomes, not as quantities derived from or fitted to the same inputs. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text; the benchmark and method are presented as independent contributions whose performance is measured externally. The derivation chain is therefore self-contained against the stated experimental protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be needed to audit any fitted scales, domain assumptions about VLM reliability, or new constructs such as the epistemic mental map.

pith-pipeline@v0.9.0 · 5520 in / 1173 out tokens · 52299 ms · 2026-05-10T06:52:20.124678+00:00 · methodology

