
arxiv: 2503.12844 · v2 · submitted 2025-03-17 · 💻 cs.CV

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Pith reviewed 2026-05-22 23:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric dataset · blind low vision · accessibility guidance · multimodal models · depth perception · human-AI annotation · pedestrian scenes · navigation assistance

The pith

A new dataset of real-world scenes shows current multimodal models struggle with depth perception for blind navigation guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GuideDog, a collection of about 22,000 egocentric image-description pairs from pedestrian scenes across 46 countries. It is built with a human-AI pipeline in which humans verify model-generated descriptions against established blind and low-vision (BLV) guidance standards rather than writing them from scratch. This tackles the prior bottleneck of labor-intensive expert annotation that limited progress on assistive navigation tools. The accompanying GuideDogQA benchmark, with 818 samples, tests object recognition and depth perception specifically. Experiments on the benchmark indicate that existing multimodal large language models still fall short on depth cues and on following the accessibility standards. A reader would care because reliable AI guidance could support safer independent movement for the more than 2.2 billion people the abstract counts as affected by vision loss.

Core claim

The central claim is that GuideDog supplies a scalable, standards-grounded dataset of 22K image-description pairs plus an 818-sample QA benchmark, and that testing current multimodal large language models on this benchmark reveals persistent difficulties with depth perception and with producing descriptions that adhere to blind and low-vision guidance standards.

What carries the argument

The human-AI annotation pipeline that moves the task from full generation to verification against established blind and low-vision guidance standards.

If this is right

  • Models can now be benchmarked on realistic pedestrian scenes for object recognition and depth perception in accessibility contexts.
  • The verification-based pipeline reduces the cost of creating further accessibility-aware datasets while preserving standard compliance.
  • Persistent model failures on depth perception point to concrete targets for improvement in multimodal systems used for navigation.
  • The 2K human-verified subset provides a high-quality seed for training or fine-tuning assistive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification approach could extend to other domains where expert standards must be followed at scale, such as medical imaging or legal document review.
  • If models improve on this benchmark, real-time mobile apps could translate the descriptions into audio cues that help users avoid obstacles more effectively.
  • Dataset creators in related fields might adopt similar hybrid pipelines to incorporate domain standards without full expert labor.

Load-bearing premise

The human-AI pipeline yields descriptions that match the quality and fidelity of full expert annotation when judged against blind and low-vision guidance standards.

What would settle it

Independent expert raters score a random sample of the dataset descriptions for adherence to the guidance standards and find large systematic deviations from expert-level quality.
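
The audit described above can be sketched in a few lines. This is a minimal illustration, not a procedure from the paper: the sample size, the pass threshold, the score scale, and the function names `draw_audit_sample`, `wilson_interval`, and `deviation_rate` are all assumptions, and the Wilson score interval is just one common choice for bounding a proportion.

```python
import math
import random


def draw_audit_sample(num_items, sample_size, seed=0):
    """Pick distinct item indices uniformly at random for expert review."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return sorted(rng.sample(range(num_items), sample_size))


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def deviation_rate(scores, pass_threshold):
    """Count descriptions scored below the threshold and bound their share."""
    failures = sum(1 for score in scores if score < pass_threshold)
    return failures, wilson_interval(failures, len(scores))
```

Under these assumptions, raters would score the items returned by `draw_audit_sample`, and an interval from `deviation_rate` whose lower bound sits well above an agreed tolerance would count as the "large systematic deviations" the test looks for.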

Figures

Figures reproduced from arXiv: 2503.12844 by Jaewoo Park, Ji Hoon Joung, JiSung Kim, Jiwan Chung, Junhee Park, Junhyeok Kim, Sangeyl Lee, Youngjae Yu.

Figure 1: An overview of the accessibility-aware guidance gen…
Figure 2: Side-by-side comparison of examples from (a) G…
Figure 3: An overview of the GUIDEDOG generation pipeline, ensuring all stages adhere to GUIDEDOG standards (S1, S2, S3). For the collected scene image displayed at the center top, (1) in accordance with S1, the MLLM first extracts a comprehensive scene description; (2) following S2, both off-the-shelf models and the MLLM identify obstacles; (3) next, the extracted information is incorporated into a BLV-specific ins…
Figure 5: Country distribution of samples in the GUIDEDOG dataset. Dataset statistics shown with the figure:
  • Source videos: 269
  • Total frames in source videos: 59.8M
  • Total source video duration: 291 hours
  • GUIDEDOG samples: 22,084
  • GUIDEDOG gold labels: 2,106
  • GUIDEDOGQA samples: 818
  • Cities in GUIDEDOG: 183
  • Countries in GUIDEDOG: 46
Figure 6: Examples of object recognition and relative depth com…
Figure 7: The global information extraction prompt filters out inappropriate images and extracts scene information and location details in…
Figure 8: The local information extraction prompt, which evaluates detected objects and classifies each as either dangerous or not, in order…
Figure 9: The silver label generation prompt leverages classified dangerous objects and extracted scene information to generate silver…
Figure 10: The GPT-Eval prompt, which evaluates how well it adheres to G…
Figure 11: The zero-shot prompt designed for MLLM to generate accessibility-aware guidance generation in G…
Figure 12: The 3-shot prompt designed for MLLM to generate accessibility-aware guidance generation in G…
Figure 13: The zero-shot prompt designed for SM to generate accessibility-aware guidance generation in G…
Figure 14: The 3-shot prompt designed for SM to generate accessibility-aware guidance generation in G…
Figure 15: An example of a gold label in GUIDEDOG.
Figure 16: An example of an object recognition task on G…
Figure 17: An example of a depth recognition task on G…
Figure 18: An interface for evaluating and refining silver labels into gold labels.
Figure 19: An interface for validating object detection and depth estimation.
Figure 20: An interface for human evaluation of model-generated descriptions.
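
The counts reported with Figure 5 support a few back-of-the-envelope ratios. The sketch below only divides those reported numbers. The dictionary keys and the function name `derived_ratios` are illustrative, and `gold_share` is a plain ratio of the two counts that does not settle whether the gold labels are a subset of the 22,084 samples.

```python
# Counts copied from the Figure 5 statistics; key names are illustrative.
STATS = {
    "source_videos": 269,
    "source_frames": 59.8e6,
    "source_hours": 291,
    "samples": 22084,
    "gold_labels": 2106,
    "qa_samples": 818,
    "cities": 183,
    "countries": 46,
}


def derived_ratios(stats):
    """Derive simple ratios from the reported dataset counts."""
    return {
        # Gold-label count relative to the total sample count.
        "gold_share": stats["gold_labels"] / stats["samples"],
        # Average length of a source video, in hours.
        "mean_hours_per_video": stats["source_hours"] / stats["source_videos"],
        # Average frame rate implied by total frames and total duration.
        "mean_source_fps": stats["source_frames"] / (stats["source_hours"] * 3600),
        # Source frames available per retained sample.
        "frames_per_sample": stats["source_frames"] / stats["samples"],
        # Average number of samples per country.
        "samples_per_country": stats["samples"] / stats["countries"],
    }


if __name__ == "__main__":
    for name, value in derived_ratios(STATS).items():
        print(f"{name}: {value:.4g}")
```

The gold-label count comes to roughly 9.5% of the sample count, and only about one source frame in a few thousand ends up as a dataset sample.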
Original abstract

For people affected by blindness and low vision (BLV), safe and independent navigation remains a major challenge, impacting over 2.2 billion individuals worldwide. Although multimodal large language models (MLLMs) offer new opportunities for assistive navigation, progress has been limited by the scarcity of accessibility-aware datasets, because creating them requires labor-intensive expert annotation. To this end, we introduce GuideDog, a novel dataset containing 22K image-description pairs (2K human-verified) capturing real-world pedestrian scenes across 46 countries. Our human-AI pipeline shifts annotation from generation to verification, grounded in established BLV guidance standards from experts and research, improving scalability while maintaining quality. We also present GuideDogQA, an 818-sample benchmark evaluating object recognition and depth perception. Experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces GuideDog, a dataset of 22K egocentric image-description pairs (2K human-verified) from real-world pedestrian scenes across 46 countries, created via a human-AI annotation pipeline grounded in established BLV guidance standards. It also presents the GuideDogQA benchmark consisting of 818 samples to evaluate MLLMs on object recognition and depth perception, with the central claim that experiments show depth perception and adherence to these standards remain challenging for current MLLMs.

Significance. If the human-AI pipeline produces annotations of quality comparable to full expert annotation and the 818-sample benchmark is constructed rigorously, the dataset would provide a valuable, scalable resource for developing accessibility-aware navigation systems for BLV users, addressing the scarcity of real-world egocentric data in this domain.

major comments (1)
  1. [Abstract] Abstract: The statement that 'experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs' is not accompanied by any quantitative results, performance metrics, error analysis, or details on the construction, sampling method, or composition of the 818-sample GuideDogQA benchmark, leaving the central empirical claim only moderately supported.
minor comments (1)
  1. [Abstract] Abstract: Clarify whether the 2K human-verified pairs are a subset of the 22K or represent an additional verification step applied to the full set.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and will incorporate changes in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The statement that 'experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs' is not accompanied by any quantitative results, performance metrics, error analysis, or details on the construction, sampling method, or composition of the 818-sample GuideDogQA benchmark, leaving the central empirical claim only moderately supported.

    Authors: We agree that the abstract would be strengthened by including brief quantitative support for the central claim. The full details on GuideDogQA construction (including sampling method and composition), performance metrics, and error analysis appear in Sections 3.3 and 4 of the manuscript. In revision we will add concise quantitative highlights to the abstract (e.g., key accuracy figures on depth-perception and standard-adherence tasks across evaluated MLLMs) while preserving length constraints. This directly addresses the concern without altering the underlying experimental results.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical dataset release introducing GuideDog (22K image-description pairs via human-AI pipeline grounded in external BLV expert standards) and GuideDogQA benchmark (818 samples). No mathematical derivations, equations, parameter fitting, or predictions exist. Claims about MLLM performance on depth perception and standards adherence are direct empirical observations on the benchmark, not reductions to fitted inputs or self-citations. The annotation process is presented as shifting to verification with 2K human-verified pairs, without self-definitional loops or ansatzes smuggled via citation. The work is self-contained against external benchmarks and standards.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that established BLV guidance standards exist and can be reliably applied by the described pipeline; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Established BLV guidance standards from experts and research provide a valid and complete basis for annotation.
    The human-AI pipeline is explicitly grounded in these standards.

pith-pipeline@v0.9.0 · 5714 in / 1133 out tokens · 26831 ms · 2026-05-22T23:48:05.242896+00:00 · methodology


A. Automatic Pipeline Details

A.1. Gathering Videos

To construct a high-quality dataset of outdoor walking scenes, we implemented a systematic approach to video selection. We manually identified YouTube channels specializing in walking tours across urban and natural environments. All selected channels (listed in Table 6) satisfy three criter...

  59. [59]

    Review the image and decide if it represents a street scene ("Yes") or not ("No"). Filter out the following non-street cases from the label {is street}:
    • Certain objects block most of the screen, making it difficult to recognize the street.
    • The viewer is looking at something other than the street (e.g., looking at the display glass of a store on the st...

  60. [60]

    If "Yes":
    • scene description: Provide an overview of the street, including pedestrians, buildings, vehicles, or any key elements that make it a street.
    • scene location: Describe the location and surroundings using only 10, 11, 12, 1, and 2 o’clock positions. The leftmost part of the image is 10 o’clock, the center is 12 o’clock, and the rightmost is 2 o’clock.

  61. [61]

    If "No":
    • scene description: Briefly explain why this is not a street scene.
    • scene location: Must be "None".
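The clock-position convention in this prompt (leftmost part of the image is 10 o'clock, center is 12 o'clock, rightmost is 2 o'clock) can be illustrated with a small sketch. The prompt fixes only those three anchor points, so the five equal-width bins below are an assumption, and `clock_position` is a hypothetical helper name, not part of the paper's pipeline.

```python
CLOCK_POSITIONS = [10, 11, 12, 1, 2]  # left-to-right order used in the prompt


def clock_position(x: float, image_width: float) -> int:
    """Map a horizontal pixel coordinate to one of the five clock positions."""
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    # Clamp so coordinates on or past the image border still land in a valid bin.
    fraction = min(max(x / image_width, 0.0), 1.0)
    # Assumption: five equal-width bins across the image width.
    index = min(int(fraction * len(CLOCK_POSITIONS)), len(CLOCK_POSITIONS) - 1)
    return CLOCK_POSITIONS[index]
```

With this binning, an object centered horizontally in the frame falls in the 12 o'clock bin, matching the convention stated in the prompt.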

  62. [62]

    Use only the JSON format below.
    5. Do not include any text outside of this JSON format.

    Output JSON example:
    {
      "is_street": "Yes",
      "scene_description": "A detailed description of the street, including key elements such as pedestrians, shops, and vehicles.",
      "scene_location": "A positional overview using 10, 11, 12, 1, and 2 o’clock references."
    }
    Or:
    { "i...
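A downstream consumer of this prompt would likely want to check the model's reply against the stated format. The following is a minimal sketch of such a check, assuming the three keys shown in the example and the rule that `scene_location` must be "None" for non-street images; `parse_street_label` is a hypothetical helper, not a function from the paper.

```python
import json

REQUIRED_KEYS = {"is_street", "scene_description", "scene_location"}


def parse_street_label(raw: str) -> dict:
    """Parse and validate a street-filter reply; raise ValueError if malformed."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) when the
    # reply contains any text outside the JSON object.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["is_street"] not in ("Yes", "No"):
        raise ValueError('is_street must be "Yes" or "No"')
    # The prompt requires scene_location to be "None" for non-street images.
    if data["is_street"] == "No" and data["scene_location"] != "None":
        raise ValueError('scene_location must be "None" when is_street is "No"')
    return data
```

Replies that fail validation could be retried or discarded; which of the two the authors do is not stated in this section.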

  63. [63]

    Complete Danger Zone (within 5 meters):
    • Evaluate whether each object is directly in the user’s walking path and could lead to a collision.
    • Mark an object as dangerous ("Yes") only if it poses a collision risk (e.g., curbs, potholes, poles, stairs, construction barriers, parked vehicles, etc.).
    • Otherwise, mark it as not dangerous ("No") and briefly...

  64. [64]

    object": the object’s identifier or name. •

    Ordinary Zone (beyond 5 meters):
    • Focus on moving objects in this zone (e.g., approaching motorcycles, cars, bicycles, pedestrians).
    • Mark an object as dangerous ("Yes") if it is moving toward the user and could pose a threat.
    • Otherwise, mark it as not dangerous ("No") and provide a brief explanation.

    For each object, provide a JSON entry with the foll...
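The two-zone rule described in this prompt can be summarized as a small decision function. This is only a sketch of the rule as worded: in the paper the judgment is made by a model from the image, whereas here the inputs (`distance_m`, `in_walking_path`, `moving_toward_user`) are hypothetical pre-computed fields, and treating exactly 5 meters as inside the danger zone is an assumption.

```python
from dataclasses import dataclass

DANGER_ZONE_METERS = 5.0  # zone boundary stated in the prompt


@dataclass
class DetectedObject:
    name: str
    distance_m: float         # hypothetical estimated distance to the object
    in_walking_path: bool     # hypothetical flag: object lies in the user's path
    moving_toward_user: bool  # hypothetical flag: object is approaching the user


def is_dangerous(obj: DetectedObject) -> str:
    """Return "Yes" or "No" following the two-zone rule."""
    if obj.distance_m <= DANGER_ZONE_METERS:
        # Complete Danger Zone: only objects in the walking path pose a collision risk.
        return "Yes" if obj.in_walking_path else "No"
    # Ordinary Zone: only objects moving toward the user are flagged.
    return "Yes" if obj.moving_toward_user else "No"
```

Note the asymmetry: a static object in the walking path is flagged only when it is close, while a distant object is flagged only when it is approaching.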

  65. [66]

    Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

    Potential Hazards: {object info}
    Scene Information: {scene info}

    Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

    Figure 9. The silver label gen...

  66. [68]

    Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

    Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

    Figure 11. The zero-shot prompt designed for MLLM to generate accessibility-aware guidance gene...

  67. [70]

    Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

    Examples:
    • You’re on a bustling city street with buildings on your left, and the sidewalk and storefronts on your right. At 10 o’clock, about five steps away, there’s a moving car which is potentially dangerous if you stray off th...

  68. [72]

    Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

    Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

    Scene Description: {llava output}
    Object Info: {object info}
    Guidance:

    Figure 13. The zero-shot...

  69. [73]

    Surroundings and Position: Summarize where the person is, the general environment, their current position, and any nearby landmarks in 1-2 sentences

  70. [74]

    Hazards:
    • For each direction (10, 11, 12, 1, and 2 o’clock), combine all hazards in that direction into exactly one sentence, mentioning approximate distance(s) and reason(s) they are dangerous.
    • Follow the order of 10, 11, 12, 1, and 2 o’clock.

  71. [75]

    Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

    Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

    Scene Description:
    Object Info:
    Guidance: You’re on a bustling city street with buildings on yo...
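The hazard-ordering rule in this prompt (one sentence per direction, in the order 10, 11, 12, 1, and 2 o'clock) is enforced through the model's instructions, but the intended structure can be sketched programmatically. The tuple layout and the sentence template below are assumptions chosen for illustration; `hazard_sentences` is a hypothetical helper.

```python
CLOCK_ORDER = [10, 11, 12, 1, 2]  # order required by the prompt


def hazard_sentences(hazards):
    """Group hazards by clock direction and emit exactly one sentence per direction.

    `hazards` is a list of (clock, name, distance_text, reason) tuples,
    a hypothetical structure used only for this sketch.
    """
    sentences = []
    for clock in CLOCK_ORDER:
        items = [h for h in hazards if h[0] == clock]
        if not items:
            continue  # directions without hazards produce no sentence
        parts = [
            f"{name} about {distance} away, {reason}"
            for _, name, distance, reason in items
        ]
        sentences.append(f"At {clock} o'clock, there is " + " and ".join(parts) + ".")
    return sentences
```

For example, a car and a curb both at 10 o'clock are merged into a single sentence that precedes any sentence about 1 o'clock, regardless of the order in which the hazards were detected.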