Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision
Pith reviewed 2026-05-13 05:53 UTC · model grok-4.3
The pith
Vision-language models enable risk-aware urban navigation maps for people with low vision through visual question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative Multimodal Large Language Models substantially outperform classification-based approaches in VQA for hazard identification, with Qwen-VL achieving the best balance of precision and recall. This supports the creation of navigable risk-aware event maps from aggregated model responses using a hierarchical query structure, demonstrating viability as a foundation for assistive navigation systems.
What carries the argument
VQA-based event map framework with three-level hierarchical queries on VLMs aggregated via weighted risk scoring into four safety categories.
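The aggregation step can be pictured as a small scoring function. The hazard labels, weights, and category thresholds below are illustrative assumptions for the sketch; the paper's actual values are not reported in this review.

```python
# Hedged sketch of weighted risk scoring over VQA hazard answers.
# Hazard names, weights, and thresholds are illustrative, not the paper's.

HAZARD_WEIGHTS = {
    "construction": 3.0,
    "missing_curb_cut": 2.0,
    "uneven_pavement": 1.5,
    "crowding": 1.0,
}

CATEGORIES = ["safe", "low_risk", "moderate_risk", "hazardous"]
THRESHOLDS = [1.0, 3.0, 5.0]  # assumed cut points between the four categories

def risk_score(detected_hazards):
    """Sum the weights of hazards the VQA stage answered 'yes' to."""
    return sum(HAZARD_WEIGHTS.get(h, 0.0) for h in detected_hazards)

def categorize(score):
    """Map a continuous risk score to one of four safety categories."""
    for cutoff, label in zip(THRESHOLDS, CATEGORIES):
        if score < cutoff:
            return label
    return CATEGORIES[-1]

# Example: a street segment where the VQA answers flag construction and crowding.
segment = ["construction", "crowding"]
print(categorize(risk_score(segment)))  # prints moderate_risk (score 4.0)
```

Each street segment's category would then be painted onto the event map for route planning; the review's objections below concern exactly how these weights and thresholds were chosen and validated.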
Load-bearing premise
Responses from off-the-shelf VLMs can be reliably aggregated into accurate four-category safety labels that generalize to new urban settings without additional training or validation.
What would settle it
A field test in cities outside the 20-city dataset: if the generated risk maps frequently misclassify hazards there and produce unsafe navigation suggestions, the generalization premise fails; if the maps hold up, the claim stands.
Original abstract
Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an event map framework for risk-aware navigation assistance for individuals with low vision. It utilizes a three-level hierarchical VQA pipeline with off-the-shelf VLMs to analyze urban scenes and identify hazards. A weighted risk scoring system aggregates model responses to classify locations into four safety categories. The authors introduce a new dataset of 800+ images from 20 cities with 18k VQA annotations and benchmark ViLT, LLaVA, InstructBLIP, and Qwen-VL, concluding that generative MLLMs, particularly Qwen-VL, offer superior precision-recall performance for this application.
Significance. If the end-to-end system is shown to produce reliable safety maps, the work could advance flexible assistive technologies by demonstrating the use of general-purpose VLMs without task-specific fine-tuning. The geographically diverse dataset spanning six continents is a notable contribution that could facilitate further research in computer vision for accessibility. The benchmarking results highlight the advantages of generative models over classification-based ones in this context.
major comments (3)
- [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.
- [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.
- [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.
minor comments (2)
- [Abstract] The abstract states that Qwen-VL achieves the 'best overall balance of precision and recall' but supplies no numerical values; these should be included for transparency.
- [Figures] Event map visualizations would benefit from explicit legends and scale bars to clarify the four safety category color codings and geographic coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional evaluation and documentation would strengthen the claims regarding the reliability of the event maps. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Evaluation] The reported precision and recall metrics are limited to individual VQA questions (18k answers); no quantitative evaluation is provided for the accuracy of the downstream weighted risk scoring system in producing the four safety categories, such as comparison to expert-labeled ground truth or agreement metrics on the final maps. This is load-bearing for the central claim that the framework produces reliable navigable risk maps.
Authors: We agree that direct evaluation of the aggregated risk maps is important to support the central claim. The current experiments focus on the VQA stage because it is the core technical contribution and the source of all variability. In the revision we will add a new evaluation subsection that obtains expert safety-category labels (four categories) for a random subset of 150 images spanning multiple cities. We will report precision/recall and Cohen's kappa between the system's weighted risk output and these expert labels, plus qualitative examples of the resulting maps. This provides the missing quantitative link without requiring a new dataset. revision: yes
Referee: [Method (risk aggregation)] The weighting scheme for converting VLM responses into risk scores, including the specific weights, category thresholds, and handling of conflicting or hallucinated answers, lacks any reported calibration, cross-validation, or ablation against real-world risk data.
Authors: The weights were derived from accessibility literature and consultation with two low-vision navigation specialists; we will add the exact numerical weights, the four category thresholds, and the rule for handling conflicting answers (majority vote with tie-breaking by highest-risk category) to the revised Methods section. We will also include a sensitivity ablation that varies the weights by ±20 % and shows the resulting change in category distribution across the 20-city dataset. A full calibration against real-world incident data is not feasible with currently available public sources, so we will explicitly note this as a limitation and future work. revision: partial
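The conflict-resolution rule the rebuttal describes (majority vote over per-model category votes, ties broken by the highest-risk tied category) can be sketched in a few lines. The function name and category ordering are ours, not the authors'.

```python
from collections import Counter

# Four safety categories ordered from least to most risky (ordering assumed).
CATEGORY_ORDER = ["safe", "low_risk", "moderate_risk", "hazardous"]

def resolve(votes):
    """Majority vote over category votes; per the rebuttal's stated rule,
    ties are broken in favor of the highest-risk tied category."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    # Among tied categories, pick the riskiest one.
    return max(tied, key=CATEGORY_ORDER.index)

print(resolve(["safe", "hazardous", "safe"]))   # prints safe (clear majority)
print(resolve(["low_risk", "moderate_risk"]))   # prints moderate_risk (tie -> riskier)
```

Breaking ties toward the riskier category is the conservative choice for a safety application: an uncertain segment is flagged rather than cleared.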
Referee: [Dataset] No inter-rater agreement metrics (e.g., Cohen's kappa) are reported for the annotations of the 800 images and 18k questions, which is necessary to establish the reliability of the ground truth used for all benchmarking claims.
Authors: We will compute and report inter-rater agreement in the revised dataset section. The 800 images were labeled for safety category by three independent annotators following a written guideline; we will report Fleiss' kappa on the four-category labels. For the 18k VQA answers, a 10 % random sample was double-annotated and we will report Cohen's kappa on that sample. These statistics will be added to Table 1 and the text. revision: yes
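The Cohen's kappa the authors promise for the double-annotated 10% sample is straightforward to compute; this is a generic textbook implementation for illustration, not the authors' evaluation code, and the example labels are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's marginal distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    cats = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in cats)
    return (observed - expected) / (1 - expected)

# Invented example: two annotators disagree on one of four items.
a = ["safe", "safe", "hazardous", "low_risk"]
b = ["safe", "low_risk", "hazardous", "low_risk"]
print(round(cohens_kappa(a, b), 3))  # prints 0.636
```

Fleiss' kappa for the three-annotator image labels follows the same observed-versus-chance logic but pools agreement across all annotator pairs per item.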
Circularity Check
No circularity: empirical benchmarks on new dataset with off-the-shelf models
full rationale
The paper introduces a hierarchical VQA pipeline and weighted aggregation for risk mapping, then evaluates four standard VLMs on a newly collected 800-image/18k-question dataset using precision and recall. No equations, fitted parameters, or predictions are defined in terms of the target outputs; the aggregation step is presented as a fixed heuristic without reported calibration on the evaluation data. Benchmarks compare public models directly against author annotations, with no self-citation load-bearing on uniqueness or ansatz. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- risk weights
axioms (1)
- domain assumption: Off-the-shelf VLMs can produce accurate pedestrian-scene descriptions and hazard identifications across varied urban environments
invented entities (1)
- event maps (no independent evidence)
Reference graph
Works this paper leans on
- [1] P. Ackland, S. Resnikoff, and R. Bourne, "World blindness and visual impairment: despite many successes, the problem is growing," Community Eye Health, vol. 30, no. 100, pp. 71–73, 2018.
- [2] S. Teng, X. Hu, P. Deng, B. Li, Y. Li, Y. Ai, D. Yang, L. Li, Z. Xuanyuan, F. Zhu, and L. Chen, "Motion planning for autonomous driving: The state of the art and future perspectives," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3692–3711, 2023.
- [3] C. Wang, Z. Pei, S. Qiu, and Z. Tang, "Deep leaning-based ultra-fast stair detection," Scientific Reports, vol. 12, no. 1, p. 16124, 2022.
- [4] H. Hwang, S. Kwon, Y. Kim, and D. Kim, "Is it safe to cross? Interpretable risk assessment with GPT-4V for safety-aware street crossing," in 2024 21st International Conference on Ubiquitous Robots (UR), 2024, pp. 281–288.
- [5] S.-M. Park and Y.-G. Kim, "Visual language integration: A survey and open challenges," Computer Science Review, vol. 48, p. 100548, 2023.
- [6] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "VQA: Visual question answering," in ICCV, 2015, pp. 2425–2433.
- [7] A. Werby, C. Huang, M. Büchner, A. Valada, and W. Burgard, "Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation," in Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024.
- [8] C. Agia, K. M. Jatavallabhula, M. Khodeir, O. Miksik, V. Vineet, M. Mukadam, L. Paull, and F. Shkurti, "Taskography: Evaluating robot task planning over large 3D scene graphs," in Proceedings of the 5th Conference on Robot Learning, vol. 164, 2022, pp. 46–58.
- [9] W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in ICML, vol. 139, 2021, pp. 5583–5594.
- [10] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," in NeurIPS, vol. 36, 2023, pp. 34892–34916.
- [11] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, "InstructBLIP: Towards general-purpose vision-language models with instruction tuning," in NeurIPS, 2023.
- [12] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," arXiv preprint arXiv:2409.12191, 2024.
- [13] S. Buckeridge, P. Carreno-Medrano, A. Cosgun, E. Croft, and W. P. Chan, "Mapless urban robot navigation by following pedestrians," in IROS, 2023, pp. 6787–6792.
- [14] C. Cao, P. Trautman, and S. Iba, "Dynamic channel: A planning framework for crowd navigation," in ICRA, 2019, pp. 5551–5557.
- [15] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, "Computer vision and deep learning techniques for pedestrian detection and tracking: A survey," Neurocomputing, vol. 300, pp. 17–33, 2018.
- [16] Y. Du, N. J. Hetherington, C. L. Oon, W. P. Chan, C. P. Quintero, E. Croft, and H. Machiel Van der Loos, "Group surfing: A pedestrian-based approach to sidewalk robot navigation," in ICRA, 2019, pp. 6518–6524.
- [17] H. Fu, V. Renaudin, Y. Kone, and N. Zhu, "Analysis of the recent AI for pedestrian navigation with wearable inertial sensors," IEEE Journal of Indoor and Seamless Positioning and Navigation, vol. 1, pp. 26–38, 2023.
- [18] J. Zhang, X. Yu, S. Ha, P. T. Morón, S. Salimpour, F. Keramat, H. Zhang, and T. Westerlund, "Seamless outdoor-indoor pedestrian positioning system with GNSS/UWB/IMU fusion: A comparison of EKF, FGO, and PF," arXiv preprint arXiv:2512.10480, 2025.
- [19] T. Moisan, H. Fu, V. Renaudin, and M. I. Sayyaf, "From research to app: Personalized inertial navigation for the visually impaired," in Proceedings of the 2025 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Tampere, Finland, 2025.
- [20] D. Kumar, S. Iyer, E. Raja, R. Kumar, and V. P. Kafle, "Improving pedestrian navigation in urban environment using augmented reality and landmark recognition," IEEE Communications Standards Magazine, vol. 8, no. 1, pp. 20–26, 2024.
- [21] H. Hile, R. Vedantham, G. Cuellar, A. Liu, N. Gelfand, R. Grzeszczuk, and G. Borriello, "Landmark-based pedestrian navigation from collections of geotagged photos," in Proceedings of the 7th International Conference on Mobile and Ubiquitous Multimedia, 2008, pp. 145–152.
- [22] L. Zhu, J. Shen, J. Zhou, Z. Stachoň, S. Hong, and X. Wang, "Personalized landmark adaptive visualization method for pedestrian navigation maps: Considering user familiarity," Transactions in GIS, vol. 26, no. 2, pp. 669–690, 2022.
- [23] U. Shah and J. Wang, "A personalised pedestrian navigation system," in 12th International Conference on Geographic Information Science, ser. Leibniz International Proceedings in Informatics (LIPIcs), vol. 277, 2023, pp. 67:1–67:6.
- [24] Z. Fang, Q. Li, and S.-L. Shaw, "What about people in pedestrian navigation?" Geo-spatial Information Science, vol. 18, no. 4, pp. 135–150, 2015.
- [25] T. Novack, Z. Wang, and A. Zipf, "A system for generating customized pleasant pedestrian routes based on OpenStreetMap data," Sensors, vol. 18, p. 3794, 2018.
- [26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in ICML, vol. 139, 2021, pp. 8748–8763.
- [27] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in ICML, vol. 162, 2022, pp. 12888–12900.
- [28] W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, "Open-vocabulary object detection upon frozen vision and language models," in ICLR, 2023.
- [29] P. Du, Y. Wang, Y. Sun, L. Wang, Y. Liao, G. Zhang, E. Ding, Y. Wang, J. Wang, and S. Liu, "LaMI-DETR: Open-vocabulary detection with language model instruction," in ECCV, 2024.
- [30] C. Zhu and L. Chen, "A survey on open-vocabulary detection and segmentation: Past, present, and future," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8954–8975, 2024.
- [31] L. Xiao, X. Yang, F. Peng, M. Yan, Y. Wang, and C. Xu, "CLIP-VG: Self-paced curriculum adapting of CLIP for visual grounding," IEEE Transactions on Multimedia, vol. 26, pp. 4334–4347, 2024.
- [32] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, "Unleashing text-to-image diffusion models for visual perception," in ICCV, October 2023, pp. 5729–5739.
- [33] Z. Merchant, A. Anwar, E. H. Wang, S. Chattopadhyay, and J. Thomason, "Generating contextually-relevant navigation instructions for blind and low vision people," in The 1st InterAI Workshop: Interactive AI for Human-centered Robotics, 2024.
- [34] Y. Zhao, Y. Zhang, R. Xiang, J. Li, and H. Li, "VIALM: A survey and benchmark of visually impaired assistance with large models," arXiv preprint arXiv:2402.01735, 2024.
- [35]
- [36] Z. Yuan, T. Zhang, Y. Zhu, J. Zhang, Y. Deng, Z. Jia, P. Luo, X. Duan, J. Zhou, and J. Zhang, "WalkVLM: Aid visually impaired people walking by vision language model," in ICCV, October 2025, pp. 9845–9854.
- [37] J. Morales, B. Gebregziabher, A. Cabañeros, and J. Sanchez-Riera, "VQA-driven event maps for assistive navigation for people with low vision in urban environments," in ICRA, 2025, pp. 12458–12464.
- [38] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, "VizWiz grand challenge: Answering visual questions from blind people," in CVPR, 2018.
- [39] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham, "VizWiz-Priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people," in CVPR, 2019, pp. 939–948.
- [40] J. Kim, J. Park, J. Park, S. Lee, J. Chung, J. Kim, J. H. Joung, and Y. Yu, "GuideDog: A real-world egocentric multimodal dataset for blind and low-vision accessibility-aware guidance," 2025.
- [41] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, "3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans," in Robotics: Science and Systems (RSS), 2020.
- [42] Z. Dai, A. Asgharivaskasi, T. Duong, S. Lin, M.-E. Tzes, G. Pappas, and N. Atanasov, "Optimal scene graph planning with large language model guidance," in ICRA, 2024, pp. 14062–14069.
- [43] R. Liu, X. Wang, W. Wang, and Y. Yang, "Bird's-eye-view scene graph for vision-language navigation," in ICCV, October 2023, pp. 10968–10980.
- [44] F. Zhou, H. Liu, H. Zhao, and L. Liang, "Long-term object search using incremental scene graph updating," Robotica, vol. 41, no. 3, pp. 962–975, 2023.