Recognition: unknown
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3
The pith
Rule-VLN adds 177 regulatory categories to a 29k-node urban graph to test whether navigation agents can obey semantic rules instead of only reaching goals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rule-VLN reveals that state-of-the-art VLN models violate many regulatory constraints, yet SNRM restores performance by integrating semantic reasoning and geometric rectification, cutting constraint violation rate by 19.26 percent and raising task completion by 5.97 percent across the benchmark.
What carries the argument
The Semantic Navigation Rectification Module (SNRM), which runs a coarse-to-fine VLM perception pipeline and maintains an epistemic mental map to generate rule-respecting detours in real time.
If this is right
- Pre-trained navigation agents can gain rule awareness through a plug-in module rather than full retraining.
- Navigation success metrics must now track both goal reachability and constraint compliance.
- The four-level curriculum structure allows systematic measurement of how rule complexity affects agent performance.
- Dynamic detour planning via an epistemic mental map enables agents to revise paths when new constraints appear.
Where Pith is reading between the lines
- The same rectification pattern could apply to other embodied tasks such as manipulation where physical actions must respect safety or legal rules.
- Testing SNRM in environments with continuously changing regulations would check whether the mental map updates remain reliable.
- If the VLM perception step generalizes across cities, the benchmark could serve as a training signal for learning rule patterns directly.
Load-bearing premise
The 177 injected regulatory categories together with the zero-shot coarse-to-fine VLM perception correctly identify and interpret real-world semantic and behavioral constraints without any domain-specific fine-tuning or extra supervision.
What would settle it
Deploy an SNRM-equipped agent in a physical urban setting containing unscripted regulatory signs and observe whether its violation rate remains as low as the simulated 19.26 percent reduction.
Figures
read the original abstract
As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a "goal-driven trap", prioritizing physical geometry ("can I go?") over semantic rules ("may I go?"), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Rule-VLN, the first large-scale urban benchmark for rule-compliant vision-and-language navigation, featuring a 29k-node environment with 177 regulatory categories injected into 8k constrained nodes across four curriculum levels. It also proposes the Semantic Navigation Rectification Module (SNRM), a zero-shot add-on that uses coarse-to-fine VLM perception and an epistemic mental map for dynamic detour planning to equip pre-trained VLN agents with safety awareness. Experiments indicate that Rule-VLN poses challenges to SOTA models, but SNRM restores performance by reducing CVR by 19.26% and increasing TC by 5.97%.
Significance. This work addresses a critical gap in embodied AI by moving beyond geometric reachability to semantic and regulatory compliance in navigation tasks. The introduction of a large-scale benchmark with diverse constraints and a universal, zero-shot rectification module could facilitate safer real-world deployment of VLN agents. The reported quantitative gains, if robust, highlight the potential of integrating VLM-based semantic reasoning with planning.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.
- [§3.1] §3.1 (Benchmark Construction): The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.
minor comments (2)
- [Abstract] The acronyms CVR (Constraint Violation Rate) and TC (Task Completion) appear in the abstract and results without explicit expansion on first use, which reduces clarity for readers unfamiliar with the metrics.
- [Figures] Ensure that any figures depicting the SNRM pipeline (e.g., mental map construction or perception stages) include explicit labels distinguishing the coarse and fine VLM components to improve interpretability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for greater transparency in our quantitative results and stronger grounding for the benchmark. We address each major comment below and have revised the manuscript to incorporate additional details, sources, and clarifications while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The quantitative claims of a 19.26% CVR reduction and 5.97% TC boost are presented without details on the specific baselines, number of evaluation episodes, statistical tests, error bars, dataset splits, or the precise implementation of the coarse-to-fine VLM framework within SNRM. This information is load-bearing for assessing whether the central claim of restored navigation capabilities holds.
Authors: We agree these details are necessary for proper evaluation and reproducibility. In the revised manuscript, §4 and the appendix now specify: the baselines (VLN-BERT, RecBERT, and two additional SOTA VLN agents), 1,000 evaluation episodes per curriculum level, statistical significance via paired t-tests (p < 0.05 for both CVR and TC improvements), error bars as standard error in all plots, the 70/15/15 environment split, and the full SNRM implementation including the VLM (GPT-4V), coarse-to-fine perception prompts, epistemic mental map update logic, and detour planning algorithm. These additions directly support assessment of the reported gains. revision: yes
-
Referee: [§3.1] §3.1 (Benchmark Construction): The 177 regulatory categories and their injection across curriculum levels are foundational to Rule-VLN's ability to test semantic compliance. The manuscript provides no validation (e.g., against human annotations or real urban data) that these categories accurately capture behavioral constraints or that zero-shot VLM perception interprets them reliably, which directly affects the ecological validity of the reported SNRM gains.
Authors: We acknowledge the importance of this validation for ecological validity. The categories were derived from official municipal regulations, traffic codes, and accessibility guidelines, with curation and injection procedures now detailed in revised §3.1 along with a new appendix table listing all 177 categories and their sources. We have also added qualitative VLM perception examples and failure-case analysis in §4.3. A full-scale human annotation validation was not conducted due to the benchmark's size; we have added this as an explicit limitation in §5 and note that zero-shot VLM reliability was observed empirically across our experiments. This provides greater transparency while highlighting an avenue for future work. revision: partial
Circularity Check
No significant circularity detected in derivation or claims
full rationale
The paper constructs a new benchmark (Rule-VLN) by injecting 177 regulatory categories into an existing 29k-node environment and evaluates a proposed zero-shot SNRM module (coarse-to-fine VLM perception plus mental-map planning) on pre-trained agents. Reported deltas (CVR reduction 19.26%, TC boost 5.97%) are framed as experimental outcomes, not as quantities derived from or fitted to the same inputs. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text; the benchmark and method are presented as independent contributions whose performance is measured externally. The derivation chain is therefore self-contained against the stated experimental protocol.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Agrawal, P., Antoniak, S., Hanna, E.B., Bout, B., Chaplot, D., Chudnovsky, J., Costa, D., De Monicault, B., Garg, S., Gervet, T., et al.: Pixtral 12b. arXiv preprint arXiv:2410.07073 (2024)
work page internal anchor Pith review arXiv 2024
-
[2]
nature406(6794), 378–382 (2000)
Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex networks. nature406(6794), 378–382 (2000)
2000
-
[3]
IEEE Transactions on Pattern Analysis and Machine Intelligence47(7), 5130–5145 (2025)
An, D., Wang, H., Wang, W., Wang, Z., Huang, Y ., He, K., Wang, L.: Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence47(7), 5130–5145 (2025). doi:10.1109/TPAMI.2024.3386695
-
[4]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., Van Den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3674–3683 (2018)
2018
-
[5]
Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Advances in neural information processing systems29(2016)
Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems29(2016)
2016
-
[7]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, H., Suhr, A., Misra, D., Snavely, N., Artzi, Y .: Touchdown: Natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12538–12547 (2019) 11 Rule-VLN: Semantic-Geometric Reasoning for Compliant NavigationA PREPRINT
2019
-
[8]
In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers)
Chen, J., Lin, B., Xu, R., Chai, Z., Liang, X., Wong, K.Y .: Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers). pp. 9796–9810 (2024)
2024
-
[9]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
Chen, K., An, D., Huang, Y ., Xu, R., Su, Y ., Ling, Y ., Reid, I., Wang, L.: Constraint-aware zero-shot vision- language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
2025
-
[10]
Information Fusion122, 103198 (2025)
Chen, S., Wu, Z., Zhang, K., Li, C., Zhang, B., Ma, F., Yu, F.R., Li, Q.: Exploring embodied multimodal large models: Development, datasets, and future directions. Information Fusion122, 103198 (2025)
2025
-
[11]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, X., Huang, L., Liu, Y ., Shen, Y ., Zhao, D., Zhao, H.: Anydoor: Zero-shot object-level image customization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6593–6602 (2024)
2024
-
[12]
Artificial Intelligence Review (2026)
Chowa, S.S., Alvi, R., Rahman, S.S., Rahman, M.A., Raiaan, M.A.K., Islam, M.R., Hussain, M., Azam, S.: From language to action: a review of large language models as autonomous agents and tool users. Artificial Intelligence Review (2026)
2026
-
[13]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Chu, T., Zhai, Y ., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V ., Levine, S., Ma, Y .: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)
work page internal anchor Pith review arXiv 2025
-
[14]
Dong, Y ., Wu, F., He, Q., Cheng, Z.Q., Li, H., Li, M., Cheng, Z., Zhou, Y ., Sun, J., Dai, Q., et al.: Ha-vln 2.0: An open benchmark and leaderboard for human-aware navigation in discrete and continuous environments with dynamic multi-human interactions. arXiv preprint arXiv:2503.14229 (2025)
-
[15]
Sociometry pp
Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry pp. 35–41 (1977)
1977
-
[16]
Social networks1(3), 215–239 (1978)
Freeman, L.C.: Centrality in social networks conceptual clarification. Social networks1(3), 215–239 (1978)
1978
-
[17]
Benchmarking spatial relationships in text-to-image generation.arXiv preprint arXiv:2212.10015,
Gokhale, T., Palangi, H., Nushi, B., Vineet, V ., Horvitz, E., Kamar, E., Baral, C., Yang, Y .: Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015 (2022)
-
[18]
Computers in Industry168, 104268 (2025)
Hamdani, R., Chihi, I.: Adaptive human-computer interaction for industry 5.0: A novel concept, with comprehen- sive review and empirical validation. Computers in Industry168, 104268 (2025)
2025
-
[19]
Advances in neural information processing systems30(2017)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)
2017
-
[20]
In: Proceedings of the 32nd ACM International Conference on Multimedia
Hong, H., Wang, S., Huang, Z., Wu, Q., Liu, J.: Navigating beyond instructions: Vision-and-language navigation in obstructed environments. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 7639–7648 (2024)
2024
-
[21]
IEEE Transactions on Pattern Analysis and Machine Intelligence47(5), 3563–3579 (2025)
Huang, K., Duan, C., Sun, K., Xie, E., Li, Z., Liu, X.: T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence47(5), 3563–3579 (2025)
2025
-
[22]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Y ., Xie, L., Wang, X., Yuan, Z., Cun, X., Ge, Y ., Zhou, J., Dong, C., Huang, R., Zhang, R., et al.: Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8362–8371 (2024)
2024
-
[23]
Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, and Wei Zhan
Islam, C.M., Salman, S., Shams, M., Liu, X., Kumar, P.: Malicious path manipulations via exploitation of representation vulnerabilities of vision-language navigation systems. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13845–13852 (2024). doi:10.1109/IROS58592.2024.10802618
-
[24]
arXiv preprint arXiv:2410.17267 (2024)
Jeong, S., Kang, G.C., Kim, J., Zhang, B.T.: Zero-shot vision-and-language navigation with collision mitigation in continuous environment. arXiv preprint arXiv:2410.17267 (2024)
-
[25]
In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV)
Kang, W., Galim, K., Koo, H.I., Cho, N.I.: Counting guidance for high fidelity text-to-image synthesis. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). pp. 899–908. IEEE (2025)
2025
-
[26]
Ke, F., Hsu, J., Cai, Z., Ma, Z., Zheng, X., Wu, X., Huang, S., Wang, W., Haghighi, P.D., Haffari, G., et al.: Explain before you answer: A survey on compositional visual reasoning. arXiv preprint arXiv:2508.17298 (2025)
-
[27]
arXiv preprint arXiv:2506.03834 (2025)
Kim, J., Sim, J., Kim, W., Sycara, K., Nam, C.: Care: Enhancing safety of visual navigation through collision avoidance via repulsive estimation. arXiv preprint arXiv:2506.03834 (2025)
-
[28]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
work page internal anchor Pith review arXiv 2025
-
[29]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Li, J., Padmakumar, A., Sukhatme, G., Bansal, M.: Vln-video: Utilizing driving videos for outdoor vision-and- language navigation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18517–18526 (2024) 12 Rule-VLN: Semantic-Geometric Reasoning for Compliant NavigationA PREPRINT
2024
-
[30]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Li, Y ., Tian, M., Lin, Z., Zhu, J., Zhu, D., Liu, H., Zhang, Y ., Xiong, Z., Zhao, X.: Fine-grained evaluation of large vision-language models in autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9431–9442 (October 2025)
2025
-
[31]
Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., Lee, Y .J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https://llava-vl.github.io/blog/2024-01-30-llava-next/
2024
-
[32]
In: Proceedings of the 33rd ACM International Conference on Multimedia
Liu, S., Zhang, H., Qiao, Q., Wu, Q., Wang, P.: Vln-chenv: Vision-language navigation in changeable environments. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3798–3807 (2025)
2025
-
[33]
Liu, Y ., Yao, F., Yue, Y ., Xu, G., Sun, X., Fu, K.: Navagent: Multi-scale urban street view fusion for uav embodied vision-and-language navigation. arXiv preprint arXiv:2411.08579 (2024)
-
[34]
In: The Thirteenth International Conference on Learning Representations (2025)
Lu, R., Wang, R., Lyu, K., Jiang, X., Huang, G., Wang, M.: Towards understanding text hallucination of diffusion models via local generation bias. In: The Thirteenth International Conference on Learning Representations (2025)
2025
-
[35]
In: Proceedings of the IEEE/CVF international conference on computer vision
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
2023
-
[36]
In: European Conference on Computer Vision
Qiao, Y ., Liu, Q., Liu, J., Liu, J., Wu, Q.: Llm as copilot for coarse-grained vision-and-language navigation. In: European Conference on Computer Vision. pp. 459–476. Springer (2024)
2024
-
[37]
Schumann, R., Riezler, S.: Generating landmark navigation instructions from maps as a graph-to-text problem. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V olume 1: Long Papers). pp. 489–502 (2021)
2021
-
[38]
In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers)
Schumann, R., Riezler, S.: Analyzing generalization of vision and language navigation to unseen outdoor areas. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers). pp. 7519–7532 (2022)
2022
-
[39]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Schumann, R., Zhu, W., Feng, W., Fu, T.J., Riezler, S., Wang, W.Y .: Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18924–18933 (2024)
2024
-
[40]
IEEE Robotics and Automation Letters10(1), 508–515 (2024)
Song, D., Liang, J., Payandeh, A., Raj, A.H., Xiao, X., Manocha, D.: Vlm-social-nav: Socially aware robot navigation through scoring using vision-language models. IEEE Robotics and Automation Letters10(1), 508–515 (2024)
2024
-
[41]
arXiv preprint arXiv:2504.15009 (2025) Abbreviated paper title 19
Song, W., Jiang, H., Yang, Z., Quan, R., Yang, Y .: Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009 (2025)
-
[42]
In: Proceedings of the 33rd ACM International Conference on Multimedia
Sun, P., Song, Y ., Zhu, X., Liu, X., Wang, Q., Liu, Y ., Xia, C., Li, T., Yang, Y ., Chu, X.: City-vlm: Towards multidomain perception scene understanding via multimodal incomplete learning. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 3448–3457 (2025)
2025
-
[43]
Preprints (2026)
Sun, P., Tang, S., Wen, J., Liang, Y ., Yang, Y ., Chu, X.: From terrain to space: A survey on multi-domain data lifecycle for urban embodied agents. Preprints (2026)
2026
-
[44]
Sensors25(2), 364 (2025)
Sun, Y ., Qiu, Y ., Aoki, Y .: Dynamicvln: Incorporating dynamics into vision-and-language navigation scenarios. Sensors25(2), 364 (2025)
2025
-
[45]
In: Proceedings of the 32nd ACM International Conference on Multimedia
Tian, H., Meng, J., Zheng, W.S., Li, Y .M., Yan, J., Zhang, Y .: Loc4plan: Locating before planning for outdoor vision and language navigation. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 4073–4081 (2024)
2024
-
[46]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
work page internal anchor Pith review arXiv 2025
-
[47]
IEEE transactions on image processing13(4), 600–612 (2004)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing13(4), 600–612 (2004)
2004
-
[48]
IEEE Transactions on Neural Networks and Learning Systems (2025)
Wei, Z., Lin, B., Nie, Y ., Chen, J., Ma, S., Xu, H., Liang, X.: Unseen from seen: Rewriting observation-instruction using foundation models for augmenting vision-language navigation. IEEE Transactions on Neural Networks and Learning Systems (2025)
2025
-
[49]
Wen, C., Liang, J., Yuan, S., Huang, H., Bethala, G.C.R., Liu, Y .S., Wang, M., Tzes, A., Fang, Y .: How secure are large language models (llms) for navigation in urban environments? (2025), https://arxiv.org/abs/2402. 09546
2025
-
[50]
arXiv preprint arXiv:2409.15310 , year=
Wu, J., Zhang, Z., Xia, Y ., Li, X., Xia, Z., Chang, A., Yu, T., Kim, S., Rossi, R.A., Zhang, R., et al.: Visual prompting in multimodal large language models: A survey. arXiv preprint arXiv:2409.15310 (2024) 13 Rule-VLN: Semantic-Geometric Reasoning for Compliant NavigationA PREPRINT
-
[51]
In: Findings of the Association for Computational Linguistics: EMNLP 2020
Xiang, J., Wang, X., Wang, W.Y .: Learning to stop: A simple yet effective approach to urban vision-language navigation. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 699–707 (2020)
2020
-
[52]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Xiao, R., Kim, S., Georgescu, M.I., Akata, Z., Alaniz, S.: Flair: Vlm with fine-grained language-informed image representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24884–24894 (June 2025)
2025
-
[53]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Xu, Y ., Pan, Y ., Liu, Z., Wang, H.: Flame: Learning to navigate with multimodal llm in urban environments. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9005–9013 (2025)
2025
-
[54]
arXiv preprint arXiv:2509.10454 (2025)
Yin, H., Wei, H., Xu, X., Guo, W., Zhou, J., Lu, J.: Gc-vln: Instruction as graph constraints for training-free vision-and-language navigation. arXiv preprint arXiv:2509.10454 (2025)
-
[55]
IEEE Robotics and Automation Letters 9(6), 4918–4925 (2024)
Yue, L., Zhou, D., Xie, L., Zhang, F., Yan, Y ., Yin, E.: Safe-vln: Collision avoidance for vision-and-language navigation of autonomous robots operating in continuous environments. IEEE Robotics and Automation Letters 9(6), 4918–4925 (2024). doi:10.1109/LRA.2024.3387171
-
[56]
Robotics: Science and Systems (2024)
Zhang, J., Wang, K., Xu, R., Zhou, G., Hong, Y ., Fang, X., Wu, Q., Zhang, Z., Wang, H.: Navid: Video-based vlm plans the next step for vision-and-language navigation. Robotics: Science and Systems (2024)
2024
-
[57]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
2018
-
[58]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Zhang, T., Wang, X., Li, L., Tai, Z., Chi, J., Tian, J., He, H., Wang, S.: Strict: Stress-test of rendering image containing text. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 21148–21161 (2025)
2025
-
[59]
In: Proceedings of the 2025 International Conference on Multimedia Retrieval
Zhang, Z., Chen, M., Zhu, S., Han, T., Yu, Z.: Mmcnav: Mllm-empowered multi-agent collaboration for outdoor visual language navigation. In: Proceedings of the 2025 International Conference on Multimedia Retrieval. pp. 1767–1776 (2025)
2025
-
[60]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zheng, D., Huang, S., Zhao, L., Zhong, Y ., Wang, L.: Towards learning a generalist model for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13624–13634 (June 2024)
2024
-
[61]
In: European Conference on Computer Vision
Zhou, G., Hong, Y ., Wang, Z., Wang, X.E., Wu, Q.: Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In: European Conference on Computer Vision. pp. 260–278. Springer (2024) 14
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.