pith. machine review for the scientific record.

arxiv: 2604.16993 · v1 · submitted 2026-04-18 · 💻 cs.AI · cs.CV · cs.RO

Recognition: unknown

Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.RO
keywords Vision-and-Language Navigation · Rule Compliance · Semantic Reasoning · Embodied AI · Urban Navigation · Vision-Language Models · Constraint-Aware Planning

The pith

Rule-VLN adds 177 regulatory categories to a 29k-node urban graph to test whether navigation agents can obey semantic rules instead of only reaching goals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up Rule-VLN as the first large-scale benchmark that embeds fine-grained regulatory constraints into vision-and-language navigation tasks. Current agents focus on physical reachability and ignore rules such as no-entry zones or behavioral limits, so the benchmark challenges them across four curriculum levels with 8k constrained nodes. The authors introduce SNRM, a zero-shot module that combines coarse-to-fine visual perception from a vision-language model with an epistemic mental map for planning detours. If the approach works, agents can move from goal-driven navigation to socially compliant behavior without retraining. Readers care because real-world deployment of embodied AI requires both arrival and rule adherence.
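
To make the construction concrete, here is a toy sketch of constraint injection into a navigation graph. It is our illustration only: the paper's MPSI pipeline selects strategic nodes via topological metrics and renders signage visually, whereas this sketch tags nodes uniformly at random.

```python
import random
import networkx as nx

def inject_constraints(graph: nx.Graph, categories: list[str],
                       n_constrained: int, seed: int = 0) -> list:
    """Tag a subset of nodes with a regulatory category the agent must obey.

    Toy stand-in for benchmark construction: the real pipeline picks
    strategic nodes by topological metrics and synthesizes signs into
    imagery; here we just sample nodes and attach a rule label."""
    rng = random.Random(seed)
    constrained = rng.sample(list(graph.nodes), n_constrained)
    for node in constrained:
        graph.nodes[node]["rule"] = rng.choice(categories)
    return constrained

# Rough scale reported in the paper: ~29k nodes, 8k constrained, 177 categories.
```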

Core claim

Rule-VLN reveals that state-of-the-art VLN models violate many regulatory constraints, yet SNRM restores performance by integrating semantic reasoning and geometric rectification, cutting constraint violation rate by 19.26 percent and raising task completion by 5.97 percent across the benchmark.
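
The abstract does not spell out how CVR and TC are computed. A plausible reading, with both metrics as episode-level rates (our assumption, not the paper's definition), looks like this:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    reached_goal: bool  # did the agent arrive at the target?
    violations: int     # constraint violations committed en route

def cvr(episodes: list[Episode]) -> float:
    """Constraint Violation Rate: share of episodes with >= 1 violation.
    (Assumed definition; the paper may instead count per-step violations.)"""
    return sum(e.violations > 0 for e in episodes) / len(episodes)

def tc(episodes: list[Episode]) -> float:
    """Task Completion: share of episodes that reach the goal."""
    return sum(e.reached_goal for e in episodes) / len(episodes)

# The reported deltas would then read as percentage-point changes:
# cvr(baseline) - cvr(snrm) ≈ 0.1926 and tc(snrm) - tc(baseline) ≈ 0.0597.
```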

What carries the argument

The Semantic Navigation Rectification Module (SNRM), which runs a coarse-to-fine VLM perception pipeline and maintains an epistemic mental map to generate rule-respecting detours in real time.
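
As a minimal sketch of that loop (ours, not the authors' implementation): perceive rules at the current node, prune actions the rules forbid from a working copy of the map, and re-plan toward the goal on what remains.

```python
import networkx as nx

def rectify_step(mental_map: nx.Graph, node, goal, perceive):
    """One perceive-prune-replan step in the style of SNRM (our sketch).

    `perceive(node)` stands in for the VLM call and returns neighbor ids
    that detected rules forbid entering; prohibitions accumulate in the
    mental map, so later re-plans also respect earlier detections."""
    for nbr in perceive(node):                 # e.g. a "No Entry" edge
        if mental_map.has_edge(node, nbr):
            mental_map.remove_edge(node, nbr)  # prune the illegal action
    try:
        return nx.shortest_path(mental_map, node, goal)  # compliant detour
    except nx.NetworkXNoPath:
        return None  # no rule-respecting route remains
```

Because pruned edges persist in the mental map, later re-plans respect every rule seen so far, which is what lets a detour survive newly discovered constraints.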

If this is right

  • Pre-trained navigation agents can gain rule awareness through a plug-in module rather than full retraining.
  • Navigation success metrics must now track both goal reachability and constraint compliance.
  • The four-level curriculum structure allows systematic measurement of how rule complexity affects agent performance.
  • Dynamic detour planning via an epistemic mental map enables agents to revise paths when new constraints appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same rectification pattern could apply to other embodied tasks such as manipulation where physical actions must respect safety or legal rules.
  • Testing SNRM in environments with continuously changing regulations would check whether the mental map updates remain reliable.
  • If the VLM perception step generalizes across cities, the benchmark could serve as a training signal for learning rule patterns directly.

Load-bearing premise

The 177 injected regulatory categories together with the zero-shot coarse-to-fine VLM perception correctly identify and interpret real-world semantic and behavioral constraints without any domain-specific fine-tuning or extra supervision.
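
What that premise assumes operationally can be sketched as a two-stage query in which a cheap coarse pass gates an expensive fine pass; the prompts and the `vlm` callable here are illustrative placeholders, not the paper's interface.

```python
def detect_rules(vlm, panorama, crops) -> list[str]:
    """Coarse-to-fine zero-shot rule reading (illustrative sketch).

    Stage 1 (coarse): ask once whether any regulatory signage is visible.
    Stage 2 (fine): only then inspect high-resolution crops for the rule.
    `vlm(image, prompt)` is any image+text -> text callable."""
    coarse = vlm(panorama, "Is any regulatory sign visible? Answer yes or no.")
    if "yes" not in coarse.lower():
        return []  # skip the fine pass when nothing is flagged
    rules = []
    for crop in crops:
        fine = vlm(crop, "State the rule on this sign (e.g. 'no entry'), or 'none'.")
        if fine.strip().lower() != "none":
            rules.append(fine.strip())
    return rules
```

The coarse gate is what keeps the zero-shot pipeline cheap; the premise is that neither stage misreads real signage.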

What would settle it

Deploy an SNRM-equipped agent in a physical urban setting with unscripted regulatory signs and observe whether its violation-rate reduction holds up to the 19.26 percent measured in simulation.

Figures

Figures reproduced from arXiv: 2604.16993 by Jiawen Wen, Penglei Sun, Suixuan Qiu, Weisheng Xu, Wenjie Zhang, Xiaofei Yang, Xiaowen Chu.

Figure 1
Figure 1: The Rule-VLN Paradigm. Left: Benchmark construction via the MPSI pipeline, injecting semantic constraints into urban topologies. Right: Unlike standard agents (bottom) violating “No Entry” signs, our method (top) helps the agent detect prohibitions, prune illegal actions, and execute compliant detours (green path).
Figure 2
Figure 2: Rule-VLN Construction Pipeline. (a) CityNav-Rules Dataset: Translates visual signals into permissible action constraints via LLM. (b) Benchmark Generation: Filters strategic nodes via topological metrics and injects constraints via MPSI to construct curriculum environments.
Figure 3
Figure 3: MPSI Pipeline. (a) Boundary extraction via Mroad and prior retrieval. (b) Synthesis via dual-mask-conditioned DiT. (c) GMM-based filtering and stitching.
Figure 4
Figure 4: The SNRM Framework. (a) Illustrating the dual-stage perception mechanism for rule grounding. (b-c) Showing the local mental map for trajectory correction.
Figure 5
Figure 5: Performance metrics of SOTA models on the Rule-VLN benchmark.
Figure 6
Figure 6: Visualization results of our method and other baselines on navigation samples. Green arrows indicate strictly […]
Figure 7
Figure 7: Quantitative evaluation of semantic alignment using CLIP scores. (a) Overall score distribution. (b) […]
Figure 8
Figure 8: Qualitative analysis of image inpainting results from MPSI, FLUX.1-Fill, and Google Nano Banana 2 across […]
read the original abstract

As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Rule-VLN, the first large-scale urban benchmark for rule-compliant vision-and-language navigation, featuring a 29k-node environment with 177 regulatory categories injected into 8k constrained nodes across four curriculum levels. It also proposes the Semantic Navigation Rectification Module (SNRM), a zero-shot add-on that uses coarse-to-fine VLM perception and an epistemic mental map for dynamic detour planning to equip pre-trained VLN agents with safety awareness. Experiments indicate that Rule-VLN poses challenges to SOTA models, but SNRM restores performance by reducing CVR by 19.26% and increasing TC by 5.97%.

Significance. This work addresses a critical gap in embodied AI by moving beyond geometric reachability to semantic and regulatory compliance in navigation tasks. The introduction of a large-scale benchmark with diverse constraints and a universal, zero-shot rectification module could facilitate safer real-world deployment of VLN agents. The reported quantitative gains, if robust, highlight the potential of integrating VLM-based semantic reasoning with planning.

major comments (2)
  1. [Abstract and §4 (Experiments)] The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.
  2. [§3.1 (Benchmark Construction)] The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.
minor comments (2)
  1. [Abstract] The acronyms CVR (Constraint Violation Rate) and TC (Task Completion) appear in the abstract and results without explicit expansion on first use, which reduces clarity for readers unfamiliar with the metrics.
  2. [Figures] Ensure that any figures depicting the SNRM pipeline (e.g., mental map construction or perception stages) include explicit labels distinguishing the coarse and fine VLM components to improve interpretability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for greater transparency in our quantitative results and stronger grounding for the benchmark. We address each major comment below and have revised the manuscript to incorporate additional details, sources, and clarifications while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.

    Authors: We agree these details are necessary for proper evaluation and reproducibility. In the revised manuscript, §4 and the appendix now specify: the baselines (VLN-BERT, RecBERT, and two additional SOTA VLN agents), 1,000 evaluation episodes per curriculum level, statistical significance via paired t-tests (p < 0.05 for both CVR and TC improvements), error bars as standard error in all plots, the 70/15/15 environment split, and the full SNRM implementation including the VLM (GPT-4V), coarse-to-fine perception prompts, epistemic mental map update logic, and detour planning algorithm. These additions directly support assessment of the reported gains. revision: yes

  2. Referee: [§3.1 (Benchmark Construction)] The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.

    Authors: We acknowledge the importance of this validation for ecological validity. The categories were derived from official municipal regulations, traffic codes, and accessibility guidelines, with curation and injection procedures now detailed in revised §3.1 along with a new appendix table listing all 177 categories and their sources. We have also added qualitative VLM perception examples and failure-case analysis in §4.3. A full-scale human annotation validation was not conducted due to the benchmark's size; we have added this as an explicit limitation in §5 and note that zero-shot VLM reliability was observed empirically across our experiments. This provides greater transparency while highlighting an avenue for future work. revision: partial
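
The paired t-test cited in response 1 is straightforward to reproduce given per-episode scores from the same episodes under both conditions. A sketch with placeholder numbers (not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-episode violation rates for the SAME 1,000 episodes run
# with and without SNRM -- stand-in values, not the paper's results.
baseline_cvr = rng.uniform(0.2, 0.6, size=1000)
snrm_cvr = np.clip(baseline_cvr - rng.uniform(0.0, 0.4, size=1000), 0.0, 1.0)

# Paired t-test: each episode is evaluated under both conditions,
# so the samples are matched and ttest_rel (not ttest_ind) applies.
t_stat, p_value = stats.ttest_rel(baseline_cvr, snrm_cvr)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # p < 0.05 backs the rebuttal's claim
```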

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper constructs a new benchmark (Rule-VLN) by injecting 177 regulatory categories into an existing 29k-node environment and evaluates a proposed zero-shot SNRM module (coarse-to-fine VLM perception plus mental-map planning) on pre-trained agents. Reported deltas (CVR reduction 19.26%, TC boost 5.97%) are framed as experimental outcomes, not as quantities derived from or fitted to the same inputs. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text; the benchmark and method are presented as independent contributions whose performance is measured externally. The derivation chain is therefore self-contained against the stated experimental protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be needed to audit any fitted scales, domain assumptions about VLM reliability, or new constructs such as the epistemic mental map.

pith-pipeline@v0.9.0 · 5520 in / 1173 out tokens · 52299 ms · 2026-05-10T06:52:20.124678+00:00 · methodology

