AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning

Daeun Song; Dinesh Manocha; Jing Liang; Xuesu Xiao; Yangzhe Kong; Ziyu Yao

arxiv: 2503.07557 · v2 · submitted 2025-03-10 · 💻 cs.RO

AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning

Yangzhe Kong , Daeun Song , Jing Liang , Dinesh Manocha , Ziyu Yao , Xuesu Xiao This is my paper

Pith reviewed 2026-05-23 00:21 UTC · model grok-4.3

classification 💻 cs.RO

keywords spatial reasoningvisual language modelssocial robot navigationVQA auto labelinghierarchical VQArobot navigation

0 comments

The pith

AutoSpatial improves VLMs for social robot navigation by training with auto-generated spatial VQA pairs and a hierarchical two-round strategy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AutoSpatial to address limited spatial understanding in visual-language models used for social robot navigation. It combines minimal manual supervision with large-scale auto-labeled VQA pairs. A hierarchical two-round VQA strategy during training enables both global and detailed scenario understanding. This results in better performance across spatial perception, movement prediction, Chain of Thought reasoning, final action, and explanation compared to state-of-the-art approaches. A reader would care because these improvements could lead to robots that navigate human environments more effectively and safely.

Core claim

By combining minimal manual supervision with large-scale auto-labeling of VQA pairs and applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, leading to more accurate spatial perception, movement prediction, Chain of Thought reasoning, final action, and explanation in social navigation tasks.

What carries the argument

Hierarchical two-round VQA strategy that first builds global understanding then detailed spatial grounding, powered by auto-generated VQA pairs from minimal manual supervision.

If this is right

Models show higher accuracy in perception and prediction of movements in social scenarios.
Improved Chain of Thought reasoning supports better final action selection.
More accurate explanations accompany the navigation decisions.
Performance gains of up to 20.50% in action and 18.73% in explanation over baseline models trained only on manual data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the auto-labeling works reliably, it could scale spatial reasoning training to much larger datasets with less human effort.
The method might apply to other domains where VLMs need precise spatial understanding, such as object manipulation.
Real-world robot tests would be needed to confirm if the reasoning improvements translate to physical navigation success.

Load-bearing premise

The auto-generated VQA pairs from minimal manual supervision are sufficiently accurate and unbiased to improve spatial reasoning without introducing systematic errors that affect downstream navigation performance.

What would settle it

If a model trained with the auto-generated pairs scores lower than the manual-only baseline on expert or human evaluations of spatial reasoning in navigation scenarios, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2503.07557 by Daeun Song, Dinesh Manocha, Jing Liang, Xuesu Xiao, Yangzhe Kong, Ziyu Yao.

**Figure 2.** Figure 2: An example of the two-round VQA structure, where training data of each round follows the same format of [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: While LLaVA-M suffers from faulty spatial reasoning, leading to ambiguous or ineffective navigation decisions, AutoSpatial exhibits improved pedestrian identification and reasoning, when augmented with auto-labeled VQA pairs. understanding. V. DISCUSSIONS A. Human Behavior Recognition Our experimental findings reveal both the strengths of AutoSpatial and areas for further development in VLM-based social na… view at source ↗

read the original abstract

We present a novel method, AutoSpatial, an efficient approach with structured spatial grounding to enhance VLMs' spatial reasoning. By combining minimal manual supervision with large-scale Visual Question-Answering (VQA) pairs auto-labeling, our approach tackles the challenge of VLMs' limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) that provided cross-validation scores and human evaluators who assigned relative rankings to compare model performances across four key aspects. Augmented by the enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements by averaged cross-validation score from expert systems in: perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoSpatial's training recipe with minimal manual labels plus hierarchical auto VQA is a practical extension for scaling spatial data in robot navigation VLMs, but the abstract supplies no validation of those auto labels so the reported gains stay hard to trust.

read the letter

The main thing to know is that AutoSpatial fine-tunes VLMs for social robot navigation by starting with a small set of manual annotations, then generating large-scale VQA pairs automatically and training with a two-round hierarchical strategy that first captures global scene layout and then drills into local spatial details. This produces reported lifts of up to 10.71% in perception and prediction, 16.26% in reasoning, 20.50% in action, and 18.73% in explanation over baselines that used only the manual data. The five targeted components line up with what navigation actually needs. The approach is a straightforward, task-specific application of existing VQA and VLM fine-tuning ideas rather than a fundamental shift in how spatial reasoning is modeled. It earns credit for focusing on annotation efficiency, which is a real bottleneck in robotics data collection. The concrete training steps and the use of both LLM cross-validation and human ranking give practitioners something they can try to replicate. The soft spots sit right at the center of the claim. The abstract never describes any accuracy check, inter-annotator agreement, or ablation that removes the auto-generated pairs, so we cannot tell whether the gains reflect better spatial understanding or simply the model absorbing whatever systematic patterns the auto-labeler introduced. Evaluation through GPT-4o, Gemini, and Claude also creates the circularity risk the stress-test note flags; similar model families scoring one another can inflate numbers without independent ground truth. No statistical tests or error breakdowns appear in the provided text. If the full manuscript contains held-out human validation of the VQA pairs and clear ablations, those gaps close; otherwise the central assumption remains untested. This work is aimed at researchers who fine-tune VLMs for navigation and want lower labeling costs. A reader already running similar experiments would get the most out of the data-generation recipe. It is coherent enough on its own terms to merit referee time, even though the evidence presented so far is thin. I would send it to peer review so the methods and any additional validation can be examined directly.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces AutoSpatial, a method for improving VLMs' spatial reasoning in social robot navigation tasks. It combines minimal manual supervision with large-scale auto-generated VQA pairs produced via a hierarchical two-round VQA strategy during training. The approach is claimed to yield better global and detailed scene understanding, leading to gains in spatial perception, movement prediction, CoT reasoning, final action selection, and explanation. Evaluation uses cross-validation scores from expert VLMs (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) plus human relative rankings, reporting averaged improvements over baselines trained only on manual data: up to 10.71% in perception & prediction, 16.26% in reasoning, 20.50% in action, and 18.73% in explanation.

Significance. If the auto-labeled VQA pairs prove accurate and the reported gains are reproducible with independent benchmarks, the method would offer an efficient, low-supervision route to strengthen spatial grounding in VLMs for robotics. This could meaningfully lower annotation costs for navigation datasets while addressing a known weakness in current VLMs. The dual use of LLM cross-validation and human rankings is a reasonable starting point for evaluation in this domain.

major comments (3)

[Abstract] Abstract: The central performance claims (gains of up to 20.50% in action and 18.73% in explanation) are stated without any description of the experimental protocol, baseline model architectures or training details, number of test scenarios, statistical tests, variance across runs, or error analysis. These omissions make the quantitative results impossible to interpret or reproduce from the provided text.
[Abstract] Abstract: The evaluation protocol relies on cross-validation scores from other VLMs (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) for the target VLM; this introduces unquantified circularity risk because the evaluators belong to the same model family as the system being assessed. No independent external benchmarks, human-only ground-truth labels, or ablation removing the auto-labeled data are described to isolate the contribution of the hierarchical strategy.
[Abstract] Abstract: The core claim that the hierarchical two-round VQA strategy produces accurate spatial perception and CoT reasoning rests on the assumption that auto-generated VQA pairs (from minimal manual supervision plus large-scale auto-labeling) are sufficiently accurate and unbiased. No validation metrics—such as inter-annotator agreement on held-out samples, error rates on spatial relations or movement predictions, or an ablation study—are reported, leaving open the possibility that observed gains reflect labeling artifacts rather than improved model capability.

minor comments (1)

[Abstract] The abstract refers to 'expert systems' providing cross-validation scores; this terminology is imprecise because GPT-4o, Gemini, and Claude are themselves VLMs rather than expert systems in the conventional sense.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the abstract's clarity, add missing details, and strengthen the evaluation discussion while preserving the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claims (gains of up to 20.50% in action and 18.73% in explanation) are stated without any description of the experimental protocol, baseline model architectures or training details, number of test scenarios, statistical tests, variance across runs, or error analysis. These omissions make the quantitative results impossible to interpret or reproduce from the provided text.

Authors: We agree the abstract is too concise and omits key experimental context. In the revision we will expand the abstract to briefly state the evaluation protocol (cross-validation with three VLMs plus human rankings), baseline architectures (standard VLM fine-tuning on manual data only), number of test scenarios, and reference the statistical analysis and variance reported in Section 4. Full error analysis will remain in the main text due to length limits, but we will add a pointer from the abstract. revision: yes
Referee: [Abstract] Abstract: The evaluation protocol relies on cross-validation scores from other VLMs (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) for the target VLM; this introduces unquantified circularity risk because the evaluators belong to the same model family as the system being assessed. No independent external benchmarks, human-only ground-truth labels, or ablation removing the auto-labeled data are described to isolate the contribution of the hierarchical strategy.

Authors: We acknowledge the circularity concern with VLM evaluators. The manuscript already includes human relative rankings as an independent signal; we will clarify this distinction and add an explicit ablation removing the auto-labeled VQA pairs to isolate the hierarchical strategy's contribution. While we do not have fully human-only ground-truth labels for all scenarios, the human ranking protocol provides a complementary check. We will also discuss limitations of VLM-based evaluation in the revised text. revision: partial
Referee: [Abstract] Abstract: The core claim that the hierarchical two-round VQA strategy produces accurate spatial perception and CoT reasoning rests on the assumption that auto-generated VQA pairs (from minimal manual supervision plus large-scale auto-labeling) are sufficiently accurate and unbiased. No validation metrics—such as inter-annotator agreement on held-out samples, error rates on spatial relations or movement predictions, or an ablation study—are reported, leaving open the possibility that observed gains reflect labeling artifacts rather than improved model capability.

Authors: We agree that explicit validation of the auto-generated pairs is necessary to support the claims. The revised manuscript will report inter-annotator agreement on a held-out sample set, error rates for spatial relations and movement predictions, and the requested ablation study isolating the auto-labeled data. These additions will directly address the concern that gains may stem from labeling artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper describes a method combining minimal manual supervision with auto-generated VQA pairs for training a VLM on spatial reasoning, evaluated via expert VLMs (GPT-4o etc.) plus human rankings, with gains reported over baselines using only manual data. No equations, self-citations, or self-definitional steps are quoted that reduce any claimed prediction or result to its inputs by construction. The auto-labeling process and hierarchical VQA strategy are presented as external to the evaluation metrics, and human evaluators provide an independent benchmark. This meets the default expectation of a self-contained approach without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method appears to rest on standard VLM fine-tuning and VQA generation pipelines whose assumptions are not enumerated here.

pith-pipeline@v0.9.0 · 5771 in / 1154 out tokens · 54127 ms · 2026-05-23T00:21:09.284212+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

structured spatial grounding … five distinct zones … five-level classification … eight cardinal … directions (N, NE, E, SE, S, SW, W, NW)
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

hierarchical two-round VQA strategy … auto-labeled VQA pairs … minimal manual supervision

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1]

Conflict avoidance in social navigation—a survey,

R. Mirsky, X. Xiao, J. Hart, and P. Stone, “Conflict avoidance in social navigation—a survey,” ACM Transactions on Human-Robot Interaction, vol. 13, no. 1, pp. 1–36, 2024

work page 2024
[2]

Principles and guidelines for evaluating social robot navigation algorithms,

A. Francis, C. P ´erez-d’Arpino, C. Li, F. Xia, A. Alahi, R. Alami, A. Bera, A. Biswas, J. Biswas, R. Chandra, et al. , “Principles and guidelines for evaluating social robot navigation algorithms,” arXiv preprint arXiv:2306.16740, 2023

work page arXiv 2023
[3]

Core challenges of social robot navigation: A survey,

C. Mavrogiannis, F. Baldini, A. Wang, D. Zhao, P. Trautman, A. Stein- feld, and J. Oh, “Core challenges of social robot navigation: A survey,” ACM Transactions on Human-Robot Interaction , vol. 12, no. 3, pp. 1– 39, 2023

work page 2023
[4]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems , vol. 36, 2024

work page 2024
[5]

Tgs: Trajectory generation and selection using vision language models in mapless outdoor environments,

D. Song, J. Liang, X. Xiao, and D. Manocha, “Tgs: Trajectory generation and selection using vision language models in mapless outdoor environments,” arXiv preprint arXiv:2408.02454 , 2024

work page arXiv 2024
[6]

Social-llava: Enhancing robot navigation through human-language reasoning in social spaces,

A. Payandeh, D. Song, M. Nazeri, J. Liang, P. Mukherjee, A. H. Raj, Y . Kong, D. Manocha, and X. Xiao, “Social-llava: Enhancing robot navigation through human-language reasoning in social spaces,” arXiv preprint arXiv:2501.09024, 2024

work page arXiv 2024
[7]

Do vision-language models represent space and how? eval- uating spatial frame of reference under ambiguities,

Z. Zhang, F. Hu, J. Lee, F. Shi, P. Kordjamshidi, J. Chai, and Z. Ma, “Do vision-language models represent space and how? eval- uating spatial frame of reference under ambiguities,” arXiv preprint arXiv:2410.17385, 2024

work page arXiv 2024
[8]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 455–14 465

work page 2024
[9]

Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning,

M. Aghzal, E. Plaku, and Z. Yao, “Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents

work page 2024
[10]

Look further ahead: Testing the limits of gpt-4 in path planning,

M. Aghzal, E. Plaku, and Z. Yao, “Look further ahead: Testing the limits of gpt-4 in path planning,” in 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE) , 2024, pp. 1020–1027

work page 2024
[11]

Drivelm: Driving with graph visual question answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274

work page 2024
[12]

Structured spatial reasoning with open vocabulary object detectors,

N. Nejatishahidin, M. R. V ongala, and J. Kosecka, “Structured spatial reasoning with open vocabulary object detectors,” arXiv preprint arXiv:2410.07394, 2024

work page arXiv 2024
[13]

Using human-inspired signals to disam- biguate navigational intentions,

J. Hart, R. Mirsky, X. Xiao, S. Tejeda, B. Mahajan, J. Goo, K. Baldauf, S. Owen, and P. Stone, “Using human-inspired signals to disam- biguate navigational intentions,” in International Conference on Social Robotics. Springer, 2020, pp. 320–331

work page 2020
[14]

A protocol for validating social navigation policies,

S. Pirk, E. Lee, X. Xiao, L. Takayama, A. Francis, and A. Toshev, “A protocol for validating social navigation policies,” arXiv preprint arXiv:2204.05443, 2022

work page arXiv 2022
[15]

Social force model for pedestrian dynam- ics,

D. Helbing and P. Molnar, “Social force model for pedestrian dynam- ics,” Physical review E , vol. 51, no. 5, p. 4282, 1995

work page 1995
[16]

An approach of social navigation based on proxemics for crowded environments of humans and robots,

M. Daza, D. Barrios-Aranibar, J. Diaz-Amado, Y . Cardinale, and J. Vilasboas, “An approach of social navigation based on proxemics for crowded environments of humans and robots,”Micromachines, vol. 12, no. 2, p. 193, 2021

work page 2021
[17]

Socially-aware robot navigation: A learning approach,

M. Luber, L. Spinello, J. Silva, and K. O. Arras, “Socially-aware robot navigation: A learning approach,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems . IEEE, 2012, pp. 902– 907

work page 2012
[18]

Sacson: Scalable autonomous control for social navigation,

N. Hirose, D. Shah, A. Sridhar, and S. Levine, “Sacson: Scalable autonomous control for social navigation,” IEEE Robotics and Au- tomation Letters , 2023

work page 2023
[19]

Appld: Adaptive planner parameter learning from demonstration,

X. Xiao, B. Liu, G. Warnell, J. Fink, and P. Stone, “Appld: Adaptive planner parameter learning from demonstration,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4541–4547, 2020

work page 2020
[20]

Learning model pre- dictive controllers with real-time attention for real-world navigation,

X. Xiao, T. Zhang, K. M. Choromanski, T.-W. E. Lee, A. Francis, J. Varley, S. Tu, S. Singh, P. Xu, F. Xia, S. M. Persson, L. Takayama, R. Frostig, J. Tan, C. Parada, and V . Sindhwani, “Learning model pre- dictive controllers with real-time attention for real-world navigation,” in Conference on robot learning . PMLR, 2022

work page 2022
[21]

Vlm-social-nav: Socially aware robot navigation through scoring us- ing vision-language models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “Vlm-social-nav: Socially aware robot navigation through scoring us- ing vision-language models,” IEEE Robotics and Automation Letters , 2024

work page 2024
[22]

Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,

H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone, “Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,” IEEE Robotics and Automation Letters , vol. 7, no. 4, pp. 11 807–11 814, 2022

work page 2022
[23]

Rethinking social robot navigation: Leveraging the best of two worlds,

A. H. Raj, Z. Hu, H. Karnan, R. Chandra, A. Payandeh, L. Mao, P. Stone, J. Biswas, and X. Xiao, “Rethinking social robot navigation: Leveraging the best of two worlds,” in 2024 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 16 330–16 337

work page 2024
[24]

To- ward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset,

D. M. Nguyen, M. Nazeri, A. Payandeh, A. Datar, and X. Xiao, “To- ward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2023, pp. 7442–7447

work page 2023
[25]

A study on learning social robot navigation with multimodal perception,

B. Panigrahi, A. H. Raj, M. Nazeri, and X. Xiao, “A study on learning social robot navigation with multimodal perception,” arXiv preprint arXiv:2309.12568, 2023

work page arXiv 2023
[26]

Do as i can, not as i say: Grounding language in robotic affordances,

A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in Conference on robot learning. PMLR, 2023, pp. 287–318

work page 2023
[27]

Behav: Behavioral rule guided autonomy using vlms for robot navigation in outdoor scenes,

K. Weerakoon, M. Elnoor, G. Seneviratne, V . Rajagopal, S. H. Arul, J. Liang, M. K. M. Jaffar, and D. Manocha, “Behav: Behavioral rule guided autonomy using vlms for robot navigation in outdoor scenes,” arXiv preprint arXiv:2409.16484 , 2024

work page arXiv 2024
[28]

Convoi: Context-aware navigation using vision language models in outdoor and indoor envi- ronments,

A. J. Sathyamoorthy, K. Weerakoon, M. Elnoor, A. Zore, B. Ichter, F. Xia, J. Tan, W. Yu, and D. Manocha, “Convoi: Context-aware navigation using vision language models in outdoor and indoor envi- ronments,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2024, pp. 13 837–13 844

work page 2024
[29]

Pivot: Iterative visual prompting elicits actionable knowledge for vlms,

S. Nasiriany, F. Xia, W. Yu, T. Xiao, J. Liang, I. Dasgupta, A. Xie, D. Driess, A. Wahid, Z. Xu, et al. , “Pivot: Iterative visual prompting elicits actionable knowledge for vlms,” arXiv preprint arXiv:2402.07872, 2024

work page arXiv 2024
[30]

A survey on large language models for automated planning,

M. Aghzal, E. Plaku, G. J. Stein, and Z. Yao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025

work page arXiv 2025
[31]

Olivia-nav: An online lifelong vision language approach for mobile robot social navigation,

S. Narasimhan, A. H. Tan, D. Choi, and G. Nejat, “Olivia-nav: An online lifelong vision language approach for mobile robot social navigation,” arXiv preprint arXiv:2409.13675 , 2024

work page arXiv 2024
[32]

What’s" up" with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

A. Kamath, J. Hessel, and K.-W. Chang, “What’s” up” with vision- language models? investigating their struggle with spatial reasoning,” arXiv preprint arXiv:2310.19785 , 2023

work page arXiv 2023
[33]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models,

J. Wang, Y . Ming, Z. Shi, V . Vineet, X. Wang, Y . Li, and N. Joshi, “Is a picture worth a thousand words? delving into spatial reasoning for vision language models,” arXiv preprint arXiv:2406.14852 , 2024

work page arXiv 2024
[34]

Instruction mining: High-quality instruction data selection for large language models

Y . Cao, Y . Kang, C. Wang, and L. Sun, “Instruction mining: Instruction data selection for tuning large language models,” arXiv preprint arXiv:2307.06290, 2023

work page arXiv 2023
[35]

Towards robust robot 3d perception in urban environments: The ut campus object dataset,

A. Zhang, C. Eranki, C. Zhang, J.-H. Park, R. Hong, P. Kalyani, L. Kalyanaraman, A. Gamare, A. Bagad, M. Esteva, and J. Biswas, “Towards robust robot 3d perception in urban environments: The ut campus object dataset,” 2023

work page 2023
[36]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/ 2024-01-30-llava-next/

work page 2024

[1] [1]

Conflict avoidance in social navigation—a survey,

R. Mirsky, X. Xiao, J. Hart, and P. Stone, “Conflict avoidance in social navigation—a survey,” ACM Transactions on Human-Robot Interaction, vol. 13, no. 1, pp. 1–36, 2024

work page 2024

[2] [2]

Principles and guidelines for evaluating social robot navigation algorithms,

A. Francis, C. P ´erez-d’Arpino, C. Li, F. Xia, A. Alahi, R. Alami, A. Bera, A. Biswas, J. Biswas, R. Chandra, et al. , “Principles and guidelines for evaluating social robot navigation algorithms,” arXiv preprint arXiv:2306.16740, 2023

work page arXiv 2023

[3] [3]

Core challenges of social robot navigation: A survey,

C. Mavrogiannis, F. Baldini, A. Wang, D. Zhao, P. Trautman, A. Stein- feld, and J. Oh, “Core challenges of social robot navigation: A survey,” ACM Transactions on Human-Robot Interaction , vol. 12, no. 3, pp. 1– 39, 2023

work page 2023

[4] [4]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems , vol. 36, 2024

work page 2024

[5] [5]

Tgs: Trajectory generation and selection using vision language models in mapless outdoor environments,

D. Song, J. Liang, X. Xiao, and D. Manocha, “Tgs: Trajectory generation and selection using vision language models in mapless outdoor environments,” arXiv preprint arXiv:2408.02454 , 2024

work page arXiv 2024

[6] [6]

Social-llava: Enhancing robot navigation through human-language reasoning in social spaces,

A. Payandeh, D. Song, M. Nazeri, J. Liang, P. Mukherjee, A. H. Raj, Y . Kong, D. Manocha, and X. Xiao, “Social-llava: Enhancing robot navigation through human-language reasoning in social spaces,” arXiv preprint arXiv:2501.09024, 2024

work page arXiv 2024

[7] [7]

Do vision-language models represent space and how? eval- uating spatial frame of reference under ambiguities,

Z. Zhang, F. Hu, J. Lee, F. Shi, P. Kordjamshidi, J. Chai, and Z. Ma, “Do vision-language models represent space and how? eval- uating spatial frame of reference under ambiguities,” arXiv preprint arXiv:2410.17385, 2024

work page arXiv 2024

[8] [8]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 455–14 465

work page 2024

[9] [9]

Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning,

M. Aghzal, E. Plaku, and Z. Yao, “Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents

work page 2024

[10] [10]

Look further ahead: Testing the limits of gpt-4 in path planning,

M. Aghzal, E. Plaku, and Z. Yao, “Look further ahead: Testing the limits of gpt-4 in path planning,” in 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE) , 2024, pp. 1020–1027

work page 2024

[11] [11]

Drivelm: Driving with graph visual question answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024, pp. 256–274

work page 2024

[12] [12]

Structured spatial reasoning with open vocabulary object detectors,

N. Nejatishahidin, M. R. V ongala, and J. Kosecka, “Structured spatial reasoning with open vocabulary object detectors,” arXiv preprint arXiv:2410.07394, 2024

work page arXiv 2024

[13] [13]

Using human-inspired signals to disam- biguate navigational intentions,

J. Hart, R. Mirsky, X. Xiao, S. Tejeda, B. Mahajan, J. Goo, K. Baldauf, S. Owen, and P. Stone, “Using human-inspired signals to disam- biguate navigational intentions,” in International Conference on Social Robotics. Springer, 2020, pp. 320–331

work page 2020

[14] [14]

A protocol for validating social navigation policies,

S. Pirk, E. Lee, X. Xiao, L. Takayama, A. Francis, and A. Toshev, “A protocol for validating social navigation policies,” arXiv preprint arXiv:2204.05443, 2022

work page arXiv 2022

[15] [15]

Social force model for pedestrian dynam- ics,

D. Helbing and P. Molnar, “Social force model for pedestrian dynam- ics,” Physical review E , vol. 51, no. 5, p. 4282, 1995

work page 1995

[16] [16]

An approach of social navigation based on proxemics for crowded environments of humans and robots,

M. Daza, D. Barrios-Aranibar, J. Diaz-Amado, Y . Cardinale, and J. Vilasboas, “An approach of social navigation based on proxemics for crowded environments of humans and robots,”Micromachines, vol. 12, no. 2, p. 193, 2021

work page 2021

[17] [17]

Socially-aware robot navigation: A learning approach,

M. Luber, L. Spinello, J. Silva, and K. O. Arras, “Socially-aware robot navigation: A learning approach,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems . IEEE, 2012, pp. 902– 907

work page 2012

[18] [18]

Sacson: Scalable autonomous control for social navigation,

N. Hirose, D. Shah, A. Sridhar, and S. Levine, “Sacson: Scalable autonomous control for social navigation,” IEEE Robotics and Au- tomation Letters , 2023

work page 2023

[19] [19]

Appld: Adaptive planner parameter learning from demonstration,

X. Xiao, B. Liu, G. Warnell, J. Fink, and P. Stone, “Appld: Adaptive planner parameter learning from demonstration,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4541–4547, 2020

work page 2020

[20] [20]

Learning model pre- dictive controllers with real-time attention for real-world navigation,

X. Xiao, T. Zhang, K. M. Choromanski, T.-W. E. Lee, A. Francis, J. Varley, S. Tu, S. Singh, P. Xu, F. Xia, S. M. Persson, L. Takayama, R. Frostig, J. Tan, C. Parada, and V . Sindhwani, “Learning model pre- dictive controllers with real-time attention for real-world navigation,” in Conference on robot learning . PMLR, 2022

work page 2022

[21] [21]

Vlm-social-nav: Socially aware robot navigation through scoring us- ing vision-language models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “Vlm-social-nav: Socially aware robot navigation through scoring us- ing vision-language models,” IEEE Robotics and Automation Letters , 2024

work page 2024

[22] [22]

Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,

H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone, “Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,” IEEE Robotics and Automation Letters , vol. 7, no. 4, pp. 11 807–11 814, 2022

work page 2022

[23] [23]

Rethinking social robot navigation: Leveraging the best of two worlds,

A. H. Raj, Z. Hu, H. Karnan, R. Chandra, A. Payandeh, L. Mao, P. Stone, J. Biswas, and X. Xiao, “Rethinking social robot navigation: Leveraging the best of two worlds,” in 2024 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2024, pp. 16 330–16 337

work page 2024

[24] [24]

To- ward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset,

D. M. Nguyen, M. Nazeri, A. Payandeh, A. Datar, and X. Xiao, “To- ward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2023, pp. 7442–7447

work page 2023

[25] [25]

A study on learning social robot navigation with multimodal perception,

B. Panigrahi, A. H. Raj, M. Nazeri, and X. Xiao, “A study on learning social robot navigation with multimodal perception,” arXiv preprint arXiv:2309.12568, 2023

work page arXiv 2023

[26] [26]

Do as i can, not as i say: Grounding language in robotic affordances,

A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in Conference on robot learning. PMLR, 2023, pp. 287–318

work page 2023

[27] [27]

Behav: Behavioral rule guided autonomy using vlms for robot navigation in outdoor scenes,

K. Weerakoon, M. Elnoor, G. Seneviratne, V . Rajagopal, S. H. Arul, J. Liang, M. K. M. Jaffar, and D. Manocha, “Behav: Behavioral rule guided autonomy using vlms for robot navigation in outdoor scenes,” arXiv preprint arXiv:2409.16484 , 2024

work page arXiv 2024

[28] [28]

Convoi: Context-aware navigation using vision language models in outdoor and indoor envi- ronments,

A. J. Sathyamoorthy, K. Weerakoon, M. Elnoor, A. Zore, B. Ichter, F. Xia, J. Tan, W. Yu, and D. Manocha, “Convoi: Context-aware navigation using vision language models in outdoor and indoor envi- ronments,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2024, pp. 13 837–13 844

work page 2024

[29] [29]

Pivot: Iterative visual prompting elicits actionable knowledge for vlms,

S. Nasiriany, F. Xia, W. Yu, T. Xiao, J. Liang, I. Dasgupta, A. Xie, D. Driess, A. Wahid, Z. Xu, et al. , “Pivot: Iterative visual prompting elicits actionable knowledge for vlms,” arXiv preprint arXiv:2402.07872, 2024

work page arXiv 2024

[30] [30]

A survey on large language models for automated planning,

M. Aghzal, E. Plaku, G. J. Stein, and Z. Yao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025

work page arXiv 2025

[31] [31]

Olivia-nav: An online lifelong vision language approach for mobile robot social navigation,

S. Narasimhan, A. H. Tan, D. Choi, and G. Nejat, “Olivia-nav: An online lifelong vision language approach for mobile robot social navigation,” arXiv preprint arXiv:2409.13675 , 2024

work page arXiv 2024

[32] [32]

What’s" up" with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

A. Kamath, J. Hessel, and K.-W. Chang, “What’s” up” with vision- language models? investigating their struggle with spatial reasoning,” arXiv preprint arXiv:2310.19785 , 2023

work page arXiv 2023

[33] [33]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models,

J. Wang, Y . Ming, Z. Shi, V . Vineet, X. Wang, Y . Li, and N. Joshi, “Is a picture worth a thousand words? delving into spatial reasoning for vision language models,” arXiv preprint arXiv:2406.14852 , 2024

work page arXiv 2024

[34] [34]

Instruction mining: High-quality instruction data selection for large language models

Y . Cao, Y . Kang, C. Wang, and L. Sun, “Instruction mining: Instruction data selection for tuning large language models,” arXiv preprint arXiv:2307.06290, 2023

work page arXiv 2023

[35] [35]

Towards robust robot 3d perception in urban environments: The ut campus object dataset,

A. Zhang, C. Eranki, C. Zhang, J.-H. Park, R. Hong, P. Kalyani, L. Kalyanaraman, A. Gamare, A. Bagad, M. Esteva, and J. Biswas, “Towards robust robot 3d perception in urban environments: The ut campus object dataset,” 2023

work page 2023

[36] [36]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/ 2024-01-30-llava-next/

work page 2024