Vision-Language Models for Deployable Social Robot Navigation: Bridging Semantic Reasoning and Low-Level Control
Pith reviewed 2026-06-30 09:50 UTC · model grok-4.3
The pith
Integrating vision-language models into social robot navigation requires hybrid architectures with intermediate mechanisms to translate high-level reasoning into safe low-level actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Social robot navigation demands both metric safety and semantic compliance with human intentions and norms. Vision-language models supply the latter through high-level understanding and natural language interaction, yet they cannot be plugged directly into real-time navigation stacks. A unified perspective therefore separates VLM reasoning, low-level planning and control, and intermediate mechanisms that perform the translation. The survey maps existing methods onto this structure and proposes a roadmap covering evaluators, spatial grounding, intermediate representations, and control modules to produce deployable hybrid systems.
What carries the argument
The three interconnected components (high-level VLM reasoning, low-level planning and control, and intermediate mechanisms that bridge reasoning and action) that organize all surveyed approaches and form the basis of the proposed coupling roadmap.
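One way to picture the three-component split is a path selector in which a classical planner proposes candidates, a hard metric filter removes unsafe ones, and a VLM-derived score ranks what remains. The sketch below is illustrative only and is not taken from the survey: `Candidate`, `select_path`, `social_score`, and the 0.5 m margin are hypothetical names and values, and `social_score` stands in for whatever evaluator a real system would query.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Candidate:
    path: List[Point]      # waypoints proposed by a classical planner (low-level component)
    min_clearance: float   # closest approach to any person or obstacle, in metres


def select_path(
    candidates: List[Candidate],
    social_score: Callable[[Candidate], float],  # hypothetical stand-in for a VLM evaluator
    safety_margin: float = 0.5,                  # assumed hard clearance limit, in metres
) -> Optional[Candidate]:
    """Intermediate mechanism: metric safety filter first, semantic ranking second."""
    safe = [c for c in candidates if c.min_clearance >= safety_margin]
    if not safe:
        # No candidate meets the metric constraint; the caller should stop
        # or hand over to a reactive controller.
        return None
    # Semantic reasoning only chooses among candidates that are already safe.
    return max(safe, key=social_score)
```

Because the filter runs before the ranking, a high social score can never promote a candidate that violates the clearance limit, which is the property the hybrid argument depends on.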
If this is right
- Hybrid systems that keep classical collision avoidance while adding VLM-derived social reasoning become the practical deployment path.
- Intermediate representations such as spatial grounding and evaluators become necessary design elements rather than optional extras.
- Datasets and platforms that jointly measure semantic compliance and metric safety become the required evaluation standard.
- Open challenges in real-time translation and safety verification must be solved before VLMs move from research to public spaces.
Where Pith is reading between the lines
- Without explicit intermediate layers, attempts to scale VLM navigation to unstructured environments will repeatedly hit latency or safety walls even if model accuracy improves.
- The roadmap implies that control modules may need to remain classical while only the reasoning layer is swapped for VLMs, limiting how much end-to-end learning can be applied.
- Evaluation platforms that separate semantic and metric scores could reveal whether current VLM gains are mostly in perception or actually in downstream action quality.
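The last point, reporting semantic and metric performance separately, can be sketched as two independent scores computed over the same episodes. This is a minimal illustration under assumed definitions, not a metric defined by the survey: the `Episode` fields, `metric_score`, and `semantic_score` are hypothetical, and a real benchmark would define norm violations far more carefully.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    collided: bool
    reached_goal: bool
    norm_violations: int  # hypothetical count, e.g. cutting through a conversing group
    norm_checks: int      # number of situations in which a social norm applied


def metric_score(episodes: List[Episode]) -> float:
    """Fraction of episodes that reached the goal without any collision."""
    if not episodes:
        raise ValueError("no episodes to score")
    ok = sum(1 for e in episodes if e.reached_goal and not e.collided)
    return ok / len(episodes)


def semantic_score(episodes: List[Episode]) -> float:
    """Fraction of norm-relevant situations handled without a violation."""
    checks = sum(e.norm_checks for e in episodes)
    if checks == 0:
        raise ValueError("no norm-relevant situations recorded")
    violations = sum(e.norm_violations for e in episodes)
    return 1.0 - violations / checks
```

Keeping the two numbers apart makes it visible when a system improves on one axis while regressing on the other, which a single blended score would hide.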
Load-bearing premise
High-level outputs from vision-language models can be translated into low-level navigation commands fast enough to preserve real-time performance and hard safety guarantees amid moving people.
What would settle it
A deployed robot that uses direct VLM output for path commands without any intermediate translation layer and still meets both collision-free metrics and social compliance scores in crowded dynamic environments.
Original abstract
Social robot navigation (SRN) requires more than geometric path planning; it demands understanding human intentions, social norms, and contextual cues to generate socially compliant behaviors. Although classical navigation methods provide reliable metric planning and collision avoidance, they often lack the semantic reasoning capabilities necessary for operation in complex human-centered environments. Recent advances in Vision-Language Models (VLMs) have opened new opportunities for SRN by enabling high-level VLM understanding, commonsense reasoning, and natural language interaction. However, a fundamental challenge remains: how to integrate VLMs into real-time, safety-critical navigation systems and reliably translate their high-level reasoning into grounded navigation actions. In this survey, we present a unified perspective of VLM-based SRN and organize existing approaches into three interconnected components: high-level VLM reasoning, low-level planning and control, and intermediate mechanisms that bridge reasoning and action. Based on this perspective, we propose a structured roadmap for coupling VLMs with navigation systems, covering semantic reasoning, evaluators, spatial grounding, intermediate representations, and control modules. The roadmap highlights both the strengths of VLMs and the necessity of hybrid architectures for practical deployment. We further review representative datasets and evaluation platforms developed for SRN. Finally, we discuss key open challenges. This survey aims to provide a foundation for building reliable, socially compliant, and deployable VLM-enabled navigation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a literature survey on the use of Vision-Language Models (VLMs) for social robot navigation (SRN). It argues that while VLMs enable high-level semantic reasoning and natural language interaction, integrating them into real-time, safety-critical navigation systems remains a fundamental challenge. The authors organize existing approaches into three interconnected components—high-level VLM reasoning, low-level planning and control, and intermediate mechanisms—and propose a structured roadmap for coupling VLMs with navigation systems, including semantic reasoning, evaluators, spatial grounding, intermediate representations, and control modules. They also review representative datasets and evaluation platforms and discuss key open challenges.
Significance. This survey synthesizes recent advances in VLM-based SRN and provides a unified framework that could guide future research toward deployable systems. The emphasis on hybrid architectures and the review of datasets and platforms are valuable contributions that could help standardize evaluation in the field. If the proposed organization accurately reflects the literature, it offers a clear structure for addressing the integration of high-level reasoning with low-level control.
major comments (2)
- [Organization of approaches] The central claim rests on the three-component categorization of existing approaches. However, the manuscript does not specify the criteria used to select and classify the reviewed works, which could affect the completeness and balance of the survey (see the section presenting the unified perspective).
- [Roadmap] The roadmap is presented as covering semantic reasoning, evaluators, spatial grounding, intermediate representations, and control modules, but lacks discussion of potential trade-offs in real-time performance, which is critical for the claim of deployability in safety-critical systems.
minor comments (2)
- [Abstract] The abstract is clear but could include one or two concrete examples of VLM applications in SRN to better illustrate the claims.
- [Datasets review] Ensure that the review of datasets includes recent publications up to the submission date to maintain currency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and completeness.
Point-by-point responses
- Referee: [Organization of approaches] The central claim rests on the three-component categorization of existing approaches. However, the manuscript does not specify the criteria used to select and classify the reviewed works, which could affect the completeness and balance of the survey (see the section presenting the unified perspective).
  Authors: We agree that explicit selection criteria would strengthen the survey. In the revised manuscript, we will add a new subsection (likely in Section 2 or the unified perspective) outlining the inclusion criteria: (i) peer-reviewed or arXiv works from 2022 onward that integrate VLMs with navigation; (ii) explicit focus on social compliance, human intent, or contextual reasoning; and (iii) coverage across the three components to ensure balance. We will also note limitations of the search (e.g., English-language sources) to allow readers to evaluate scope.
  revision: yes
- Referee: [Roadmap] The roadmap is presented as covering semantic reasoning, evaluators, spatial grounding, intermediate representations, and control modules, but lacks discussion of potential trade-offs in real-time performance, which is critical for the claim of deployability in safety-critical systems.
  Authors: We concur that real-time trade-offs are central to deployability. The revised roadmap section will include an explicit discussion of these issues, covering VLM inference latency versus semantic gains, hardware constraints (e.g., onboard vs. cloud), mitigation strategies such as lightweight VLMs or asynchronous pipelines, and how hybrid designs can preserve safety-critical guarantees (e.g., fallback to classical controllers). This will be supported by references to existing latency benchmarks where available.
  revision: yes
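The asynchronous-pipeline and fallback ideas mentioned in the second response can be sketched as a buffer that holds the most recent VLM advice together with its timestamp, read by a control loop that never waits on the model. Everything here is a hypothetical illustration rather than the authors' design: `Advice`, `AdviceBuffer`, `command_speed`, the `speed_scale` field, and the one-second staleness limit are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Advice:
    speed_scale: float  # hypothetical VLM output, e.g. 0.5 to slow down near a queue
    stamp: float        # time at which the advice was produced, in seconds


class AdviceBuffer:
    """Holds the latest advice; a slow VLM worker publishes, the fast loop reads."""

    def __init__(self, max_age: float = 1.0):
        self.max_age = max_age  # assumed staleness limit, in seconds
        self._latest: Optional[Advice] = None

    def publish(self, advice: Advice) -> None:
        self._latest = advice

    def fresh(self, now: float) -> Optional[Advice]:
        advice = self._latest
        if advice is None or now - advice.stamp > self.max_age:
            return None
        return advice


def command_speed(nominal: float, buffer: AdviceBuffer, now: float) -> float:
    """Called every control tick with the classical controller's speed."""
    advice = buffer.fresh(now)
    if advice is None:
        # Fallback: missing or stale advice leaves the classical command untouched.
        return nominal
    # Clamp to [0, 1] so advice can only slow the robot, never speed it up.
    scale = min(max(advice.speed_scale, 0.0), 1.0)
    return nominal * scale
```

The clamp is the design choice worth noting: the semantic layer is allowed to make the robot more cautious but cannot override the speed the classical controller has already judged safe.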
Circularity Check
No significant circularity in survey/roadmap paper
This is a literature survey that synthesizes prior work on VLM-based SRN, organizes approaches into high-level reasoning, low-level control, and intermediate mechanisms, and proposes a roadmap plus open challenges. It contains no equations, derivations, fitted parameters, predictions, or quantitative claims. No load-bearing step reduces by construction to inputs, self-citations, or ansatzes; the central integration difficulty is explicitly stated as an unresolved challenge rather than an asserted result. The paper is self-contained as a review.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-level VLM reasoning can be decoupled from low-level metric planning and control in navigation systems.