GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Jaewoo Park; Ji Hoon Joung; JiSung Kim; Jiwan Chung; Junhee Park; Junhyeok Kim; Sangeyl Lee; Youngjae Yu

arXiv: 2503.12844 · v2 · submitted 2025-03-17 · cs.CV

Pith reviewed 2026-05-22 23:48 UTC · model grok-4.3

keywords: egocentric dataset, blind low vision, accessibility guidance, multimodal models, depth perception, human-AI annotation, pedestrian scenes, navigation assistance

The pith

A new dataset of real-world scenes shows current multimodal models struggle with depth perception for blind navigation guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GuideDog, a collection of about 22,000 egocentric image-description pairs from pedestrian scenes across 46 countries. It is built with a human-AI pipeline in which model-generated descriptions are verified against established blind and low-vision (BLV) guidance standards rather than written from scratch. This tackles the prior bottleneck of labor-intensive expert annotation that limited progress on assistive navigation tools. The accompanying GuideDogQA benchmark, with 818 samples, tests object recognition and depth perception specifically. Experiments on the benchmark indicate that existing multimodal large language models still fall short on depth cues and on following the accessibility standards. A reader would care because reliable AI guidance could support safer independent movement for the more than 2.2 billion people the paper cites as affected by blindness and low vision.

Core claim

The central claim is that GuideDog supplies a scalable, standards-grounded dataset of 22K image-description pairs plus an 818-sample QA benchmark, and that testing current multimodal large language models on this benchmark reveals persistent difficulties with depth perception and with producing descriptions that adhere to blind and low-vision guidance standards.

What carries the argument

The human-AI annotation pipeline that moves the task from full generation to verification against established blind and low-vision guidance standards.
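The shift from generation to verification can be illustrated with a minimal sketch. It assumes a model has already drafted a "silver" label and that a reviewer either approves (possibly after editing) or rejects it, producing a "gold" label. The `Label` class, the `Reviewer` signature, the toy reviewer, and the example text are all hypothetical; this page does not describe the authors' actual tooling at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Label:
    image_id: str
    text: str
    tier: str  # "silver" = model draft, "gold" = human-verified


# A reviewer returns the approved (possibly edited) text, or None to reject.
Reviewer = Callable[[Label], Optional[str]]


def verify(silver: Label, review: Reviewer) -> Optional[Label]:
    """Promote one silver draft to gold, or return None if it is rejected."""
    approved = review(silver)
    if approved is None:
        return None
    return Label(silver.image_id, approved, "gold")


def verify_batch(drafts: Iterable[Label], review: Reviewer) -> List[Label]:
    """Keep only the drafts that the reviewer approved."""
    results = (verify(draft, review) for draft in drafts)
    return [gold for gold in results if gold is not None]


def toy_reviewer(label: Label) -> Optional[str]:
    # Stand-in for a human: reject empty drafts, otherwise tidy whitespace.
    text = label.text.strip()
    return text or None


# Invented example drafts, for illustration only.
drafts = [
    Label("img_001", "  Bollard ahead at 12 o'clock, two steps away. ", "silver"),
    Label("img_002", "   ", "silver"),
]
gold = verify_batch(drafts, toy_reviewer)
print(gold)
```

The reviewer only reads and corrects a draft instead of writing a description from nothing, which is where the claimed saving in annotation effort would come from.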

If this is right

Models can now be benchmarked on realistic pedestrian scenes for object recognition and depth perception in accessibility contexts.
The verification-based pipeline reduces the cost of creating further accessibility-aware datasets while preserving standard compliance.
Persistent model failures on depth perception point to concrete targets for improvement in multimodal systems used for navigation.
The 2K human-verified subset provides a high-quality seed for training or fine-tuning assistive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification approach could extend to other domains where expert standards must be followed at scale, such as medical imaging or legal document review.
If models improve on this benchmark, real-time mobile apps could translate the descriptions into audio cues that help users avoid obstacles more effectively.
Dataset creators in related fields might adopt similar hybrid pipelines to incorporate domain standards without full expert labor.

Load-bearing premise

The human-AI pipeline yields descriptions that match the quality and fidelity of full expert annotation when judged against blind and low-vision guidance standards.

What would settle it

Independent expert raters score a random sample of the dataset descriptions for adherence to the guidance standards and find large systematic deviations from expert-level quality.
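One way such an audit could be set up is sketched below: draw a reproducible random sample of dataset items, collect expert ratings, and summarize them with a mean and an approximate 95% interval. The function names, the sample size of 200, and the normal-approximation interval are assumptions made for illustration, not a procedure taken from the paper.

```python
import math
import random
import statistics


def draw_audit_sample(sample_ids, k, seed=0):
    """Draw k distinct ids with a fixed seed so the audit can be repeated."""
    rng = random.Random(seed)
    return rng.sample(list(sample_ids), k)


def mean_with_ci(scores, z=1.96):
    """Mean rating with a normal-approximation confidence interval."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, (mean, mean)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - z * sem, mean + z * sem)


# 22,084 is the dataset size reported on this page; 200 is an arbitrary choice.
audit_ids = draw_audit_sample(range(22_084), k=200, seed=1)

# Hypothetical expert ratings for a handful of audited items.
mean, (low, high) = mean_with_ci([4, 5, 3, 4, 4])
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

An interval that sits well below the score expected of expert-written descriptions would count as the kind of systematic deviation described above.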

Figures

Figures reproduced from arXiv: 2503.12844 by Jaewoo Park, Ji Hoon Joung, JiSung Kim, Jiwan Chung, Junhee Park, Junhyeok Kim, Sangeyl Lee, Youngjae Yu.

**Figure 2.** Side-by-side comparison of examples from (a) G…

**Figure 3.** An overview of the GUIDEDOG generation pipeline, ensuring all stages adhere to GUIDEDOG standards (S1, S2, S3). For the collected scene image displayed at the center top, (1) in accordance with S1, the MLLM first extracts a comprehensive scene description; (2) following S2, both off-the-shelf models and the MLLM identify obstacles; (3) next, the extracted information is incorporated into a BLV-specific ins…

**Figure 5.** Country distribution of samples in the GUIDEDOG dataset.

Dataset statistics extracted with the Figure 5 caption:

| Statistic | Value |
| --- | --- |
| # Source Videos | 269 |
| # Total Frames in Source Videos | 59.8M |
| Total Source Videos Duration | 291 hours |
| # GUIDEDOG Samples | 22,084 |
| # GUIDEDOG Gold Label | 2,106 |
| # GUIDEDOGQA Samples | 818 |
| # Cities in GUIDEDOG | 183 |
| # Countries in GUIDEDOG | 46 |

**Figure 6.** Examples of object recognition and relative depth com…

**Figure 7.** The global information extraction prompt filters out inappropriate images and extracts scene information and location details in…

**Figure 8.** The local information extraction prompt, which evaluates detected objects and classifies each as either dangerous or not, in order…

**Figure 9.** The silver label generation prompt leverages classified dangerous objects and extracted scene information to generate silver…

**Figure 10.** The GPT-Eval prompt, which evaluates how well it adheres to G…

**Figure 11.** The zero-shot prompt designed for MLLM to generate accessibility-aware guidance generation in G…

**Figure 12.** The 3-shot prompt designed for MLLM to generate accessibility-aware guidance generation in G…

**Figure 13.** The zero-shot prompt designed for SM to generate accessibility-aware guidance generation in G…

**Figure 14.** The 3-shot prompt designed for SM to generate accessibility-aware guidance generation in G…

**Figure 15.** An example of a gold label in GUIDEDOG.

**Figure 16.** An example of an object recognition task on G…

**Figure 17.** An example of a depth recognition task on G…

**Figure 18.** An interface for evaluating and refining silver labels into gold labels.

**Figure 19.** An interface for validating object detection and depth estimation.

**Figure 20.** An interface for human evaluation of model-generated descriptions.
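The dataset statistics listed with Figure 5 support a few quick sanity ratios, sketched below. The input numbers are copied from that listing; the dictionary keys and the choice of ratios are illustrative, and the frame count is a rounded figure (59.8M), so frame-based ratios are approximate.

```python
# Values copied from the statistics listed with Figure 5.
STATS = {
    "source_videos": 269,
    "source_frames": 59_800_000,  # reported as 59.8M, so this is rounded
    "source_hours": 291,
    "samples": 22_084,
    "gold_labels": 2_106,
    "qa_samples": 818,
    "cities": 183,
    "countries": 46,
}


def derived_ratios(stats):
    """Compute simple ratios that put the raw counts in proportion."""
    return {
        "gold_fraction": stats["gold_labels"] / stats["samples"],
        "samples_per_video": stats["samples"] / stats["source_videos"],
        "frames_per_sample": stats["source_frames"] / stats["samples"],
        "mean_video_hours": stats["source_hours"] / stats["source_videos"],
    }


ratios = derived_ratios(STATS)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Under these figures the gold subset is roughly 9.5% of all samples, and each source video contributes about 82 samples on average.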

Original abstract

For people affected by blindness and low vision (BLV), safe and independent navigation remains a major challenge, impacting over 2.2 billion individuals worldwide. Although multimodal large language models (MLLMs) offer new opportunities for assistive navigation, progress has been limited by the scarcity of accessibility-aware datasets, because creating them requires labor-intensive expert annotation. To this end, we introduce GuideDog, a novel dataset containing 22K image-description pairs (2K human-verified) capturing real-world pedestrian scenes across 46 countries. Our human-AI pipeline shifts annotation from generation to verification, grounded in established BLV guidance standards from experts and research, improving scalability while maintaining quality. We also present GuideDogQA, an 818-sample benchmark evaluating object recognition and depth perception. Experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs.
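Because GuideDogQA covers two task types, results are most informative when reported per task rather than pooled. The sketch below computes per-task accuracy from a list of records. The record fields (`task`, `answer`, `prediction`), the task names, and the case-insensitive exact-match rule are assumptions for illustration; the benchmark's real file format and scoring protocol are not given on this page.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records, not taken from the benchmark.
toy_records = [
    {"task": "object_recognition", "answer": "yes", "prediction": "Yes"},
    {"task": "object_recognition", "answer": "no", "prediction": "no "},
    {"task": "depth_perception", "answer": "A", "prediction": "A"},
    {"task": "depth_perception", "answer": "B", "prediction": "A"},
]
scores = accuracy_by_task(toy_records)
print(scores)
```

Keeping the two tasks separate makes a gap between object recognition and depth perception visible instead of averaging it away.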

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GuideDog delivers a large, multi-country egocentric dataset grounded in BLV standards, but the model performance claims rest on thin evidence.

The core contribution is a new dataset of 22K real-world egocentric image-description pairs captured across 46 countries, with 2K human-verified, plus an 818-sample QA benchmark focused on object recognition and depth for blind and low-vision navigation. The human-AI pipeline, built on established guidance standards, is the practical step that makes scaling feasible without full expert annotation for every sample. Geographic spread and the shift from generation to verification are the parts that actually move the needle compared with prior general navigation datasets. That combination is worth having if you work on assistive multimodal systems.

The experiments section states that current MLLMs struggle with depth perception and standard adherence, yet the abstract supplies no accuracy numbers, confusion matrices, or breakdown of the 818-sample construction. Without those, the central empirical observation stays at the level of a qualitative note rather than a quantified result. The pipeline's fidelity to expert-level output is asserted but not measured against a pure-expert baseline in the provided description. Minor gaps like that are common in dataset papers, but they limit how far the benchmark can be used to drive model improvements right now.

This paper is aimed at researchers building or evaluating navigation aids for the BLV population. Anyone already collecting accessibility data or testing MLLMs on real pedestrian scenes will find the raw pairs and the standards grounding directly usable. The dataset release itself is solid enough to justify sending it out for peer review; the analysis can be strengthened in revision without changing the main value.

Referee Report

1 major / 1 minor

Summary. The paper introduces GuideDog, a dataset of 22K egocentric image-description pairs (2K human-verified) from real-world pedestrian scenes across 46 countries, created via a human-AI annotation pipeline grounded in established BLV guidance standards. It also presents the GuideDogQA benchmark consisting of 818 samples to evaluate MLLMs on object recognition and depth perception, with the central claim that experiments show depth perception and adherence to these standards remain challenging for current MLLMs.

Significance. If the human-AI pipeline produces annotations of quality comparable to full expert annotation and the 818-sample benchmark is constructed rigorously, the dataset would provide a valuable, scalable resource for developing accessibility-aware navigation systems for BLV users, addressing the scarcity of real-world egocentric data in this domain.

major comments (1)

[Abstract] The statement that 'experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs' is not accompanied by any quantitative results, performance metrics, error analysis, or details on the construction, sampling method, or composition of the 818-sample GuideDogQA benchmark, leaving the central empirical claim only moderately supported.

minor comments (1)

[Abstract] Clarify whether the 2K human-verified pairs are a subset of the 22K or represent an additional verification step applied to the full set.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and will incorporate changes in the revised manuscript.

Referee: [Abstract] The statement that 'experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs' is not accompanied by any quantitative results, performance metrics, error analysis, or details on the construction, sampling method, or composition of the 818-sample GuideDogQA benchmark, leaving the central empirical claim only moderately supported.

Authors: We agree that the abstract would be strengthened by including brief quantitative support for the central claim. The full details on GuideDogQA construction (including sampling method and composition), performance metrics, and error analysis appear in Sections 3.3 and 4 of the manuscript. In revision we will add concise quantitative highlights to the abstract (e.g., key accuracy figures on depth-perception and standard-adherence tasks across evaluated MLLMs) while preserving length constraints. This directly addresses the concern without altering the underlying experimental results.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical dataset release introducing GuideDog (22K image-description pairs via human-AI pipeline grounded in external BLV expert standards) and GuideDogQA benchmark (818 samples). No mathematical derivations, equations, parameter fitting, or predictions exist. Claims about MLLM performance on depth perception and standards adherence are direct empirical observations on the benchmark, not reductions to fitted inputs or self-citations. The annotation process is presented as shifting to verification with 2K human-verified pairs, without self-definitional loops or ansatzes smuggled via citation. The work is self-contained against external benchmarks and standards.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that established BLV guidance standards exist and can be reliably applied by the described pipeline; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Established BLV guidance standards from experts and research provide a valid and complete basis for annotation. The human-AI pipeline is explicitly grounded in these standards.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

METEOR: An auto- matic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An auto- matic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, 2005. Association for Computational Linguistics. 6

work page 2005
[4]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Aleksei Bochkovskii, Ama ¨el Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Mag- nitude, temporal trends, and projections of the global preva- lence of blindness and distance and near vision impairment: a systematic review and meta-analysis

Rupert RA Bourne, Seth R Flaxman, Tasanee Braithwaite, Maria V Cicinelli, Aditi Das, Jost B Jonas, Jill Keeffe, John H Kempen, Janet Leasher, Hans Limburg, et al. Mag- nitude, temporal trends, and projections of the global preva- lence of blindness and distance and near vision impairment: a systematic review and meta-analysis. The Lancet Global Health, 5(...

work page 2017
[6]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 4

work page 2021
[7]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 4

work page 2024
[8]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

V-eye: A vision-based navigation system for the visually impaired

Ping-Jung Duh, Yu-Cheng Sung, Liang-Yu Fan Chiang, Yung-Ju Chang, and Kuan-Wen Chen. V-eye: A vision-based navigation system for the visually impaired. IEEE Transac- tions on Multimedia, 23:1567–1580, 2020. 3

work page 2020
[10]

How to guide someone who is blind or partially sighted

Emma Turner, Sense. How to guide someone who is blind or partially sighted. https://www.sense.org.uk/ blog/how-to-guide-someone-who-is-blind- or-partially-sighted, 2023. 3

work page 2023
[11]

A review of assistive spatial orientation and navigation technologies for the visually impaired

Hugo Fernandes, Paulo Costa, Vitor Filipe, Hugo Paredes, and Jo ˜ao Barroso. A review of assistive spatial orientation and navigation technologies for the visually impaired. Uni- versal Access in the Information Society, 18:155–168, 2019. 3

work page 2019
[12]

Introducing gemini 2.0: our new ai model for the agentic era

Google DeepMind. Introducing gemini 2.0: our new ai model for the agentic era. https://blog.google/ technology / google - deepmind / google - gemini - ai - update - december - 2024/ , 2024. 6, 8

work page 2024
[13]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

work page
[14]

Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people

Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P Bigham. Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 939–948, 2019. 8

work page 2019
[15]

Space-aware instruction tuning: Dataset and bench- mark for guide dog robots assisting the visually impaired

ByungOk Han, Woo-han Yun, Beom-Su Seo, and Jaehong Kim. Space-aware instruction tuning: Dataset and bench- mark for guide dog robots assisting the visually impaired. arXiv preprint arXiv:2502.07183, 2025. 2

work page arXiv 2025
[16]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 6

work page 2022
[17]

Long-form answers to visual questions from blind and low vision people

Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, and Amy Pavel. Long-form answers to visual questions from blind and low vision people. arXiv preprint arXiv:2408.06303, 2024. 8

work page arXiv 2024
[18]

System configuration and navigation of a guide dog robot: Toward animal guide dog-level guiding work

Hochul Hwang, Tim Xia, Ibrahima Keita, Ken Suzuki, Joy- deep Biswas, Sunghoon I Lee, and Donghyun Kim. System configuration and navigation of a guide dog robot: Toward animal guide dog-level guiding work. In 2023 IEEE Inter- national Conference on Robotics and Automation (ICRA) , pages 9778–9784. IEEE, 2023. 2

work page 2023
[19]

Identifying crucial ob- jects in blind and low-vision individuals’ navigation

Md Touhidul Islam, Imran Kabir, Elena Ariel Pearce, Md Al- imoor Reza, and Syed Masum Billah. Identifying crucial ob- jects in blind and low-vision individuals’ navigation. InPro- ceedings of the 26th International ACM SIGACCESS Con- ference on Computers and Accessibility, pages 1–8, 2024. 2, 3, 4

work page 2024
[20]

Enhancing multimodal large language models with vision detection models: An empirical study

Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, and Ying Shen. Enhancing multimodal large language models with vision detection models: An empirical study. arXiv preprint arXiv:2401.17981, 2024. 4

work page arXiv 2024
[21]

Techniques for constructing indoor navigation systems for the visually impaired: A review

Roya Norouzi Kandalan and Kamesh Namuduri. Techniques for constructing indoor navigation systems for the visually impaired: A review. IEEE Transactions on Human-Machine Systems, 50(6):492–506, 2020. 1

work page 2020
[22]

Meganno+: A human-llm collabora- tive annotation system

Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rah- man, and Dan Zhang. Meganno+: A human-llm collabora- tive annotation system. arXiv preprint arXiv:2402.18050 ,

work page arXiv
[23]

Understanding expec- tations for a robotic guide dog for visually impaired people

J Taery Kim, Morgan Byrd, Jack L Crandell, Bruce N Walker, Greg Turk, and Sehoon Ha. Understanding expec- tations for a robotic guide dog for visually impaired people. arXiv preprint arXiv:2501.04594, 2025. 2 9

work page arXiv 2025
[24]

Deepnavi: A deep learning based smartphone navigation as- sistant for people with visual impairments

Bineeth Kuriakose, Raju Shrestha, and Frode Eika Sandnes. Deepnavi: A deep learning based smartphone navigation as- sistant for people with visual impairments. Expert Systems with Applications, 212:118720, 2023. 3

work page 2023
[25]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

A survey of multimodel large language models

Zijing Liang, Yanjie Xu, Yifan Hong, Penghui Shang, Qi Wang, Qiang Fu, and Ke Liu. A survey of multimodel large language models. In Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, pages 405–409, 2024. 4

work page 2024
[27]

A technique for the measurement of attitudes

Rensis Likert. A technique for the measurement of attitudes. Archives of psychology, 1932. 7

work page 1932
[28]

ROUGE: A package for automatic evalua- tion of summaries

Chin-Yew Lin. ROUGE: A package for automatic evalua- tion of summaries. In Text Summarization Branches Out , pages 74–81, Barcelona, Spain, 2004. Association for Com- putational Linguistics. 6

work page 2004
[29]

Deep learning based wearable assistive system for visually im- paired people

Yimin Lin, Kai Wang, Wanxin Yi, and Shiguo Lian. Deep learning based wearable assistive system for visually im- paired people. In Proceedings of the IEEE/CVF interna- tional conference on computer vision workshops, pages 0–0,

work page
[30]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1, 8

work page 2023
[31]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 6

work page 2024
[32]

Open scene understanding: Grounded situation recognition meets segment anything for helping people with visual im- pairments

Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan Chen, Kailun Yang, and Rainer Stiefelhagen. Open scene understanding: Grounded situation recognition meets segment anything for helping people with visual im- pairments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1857–1867, 2023. 1

work page 2023
[33]

Objectfinder: Open-vocabulary assistive sys- tem for interactive object search by blind people

Ruiping Liu, Jiaming Zhang, Angela Sch ¨on, Karin M ¨uller, Junwei Zheng, Kailun Yang, Kathrin Gerling, and Rainer Stiefelhagen. Objectfinder: Open-vocabulary assistive sys- tem for interactive object search by blind people. arXiv preprint arXiv:2412.03118, 2024. 8

work page arXiv 2024
[34]

G-eval: NLG evaluation using gpt- 4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt- 4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics. 7

work page 2023
[35]

A qualitative and quantitative analysis of research in mobility technologies for visually impaired peo- ple

Jyoti Madake, Shripad Bhatlawande, Anjali Solanke, and Swati Shilaskar. A qualitative and quantitative analysis of research in mobility technologies for visually impaired peo- ple. IEEE Access, 11:82496–82520, 2023. 3

work page 2023
[36]

Mobility-related ac- cidents experienced by people with visual impairment

Roberto Manduchi and Sri Kurniawan. Mobility-related ac- cidents experienced by people with visual impairment. AER Journal: Research and Practice in Visual Impairment and Blindness, 4(2):44–54, 2011. 1, 3

work page 2011
[37]

Generating contextually- relevant navigation instructions for blind and low vision peo- ple

Zain Merchant, Abrar Anwar, Emily Wang, Souti Chat- topadhyay, and Jesse Thomason. Generating contextually- relevant navigation instructions for blind and low vision peo- ple. arXiv preprint arXiv:2407.08219, 2024. 2, 3, 8

work page arXiv 2024
[38]

Ai and accessibility

Meredith Ringel Morris. Ai and accessibility. Communica- tions of the ACM, 63(6):35–37, 2020. 2

work page 2020
[39]

We don’t need no bounding-boxes: Train- ing object class detectors using only human verification

Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. We don’t need no bounding-boxes: Train- ing object class detectors using only human verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 854–863, 2016. 2

work page 2016
[40]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages 311– 318, Philadelphia, Pennsylvania, USA, 2002. Association for Computational Linguistics. 6

work page 2002
[41]

Egoblur: Responsible innovation in aria, 2023

Nikhil Raina, Guruprasad Somasundaram, Kang Zheng, Sagar Miglani, Steve Saarinen, Jeff Meissner, Mark Schwesinger, Luis Pesqueira, Ishita Prasad, Edward Miller, Prince Gupta, Mingfei Yan, Richard Newcombe, Carl Ren, and Omkar M Parkhi. Egoblur: Responsible innovation in aria, 2023. 4, 8, 12

work page 2023
[42]

Optimal walking in terms of variability in step length

Noboru Sekiya, Hiroshi Nagasaki, Hajime Ito, and Taketo Furuna. Optimal walking in terms of variability in step length. Journal of Orthopaedic & Sports Physical Therapy, 26(5):266–272, 1997. 4

work page 1997
[43]

Touch your way: hap- tic sight for visually impaired people to walk with indepen- dence

Ji-Won Song and Sung-Ho Yang. Touch your way: hap- tic sight for visually impaired people to walk with indepen- dence. In CHI’10 Extended Abstracts on Human Factors in Computing Systems, pages 3343–3348. 2010. 3

work page 2010
[44]

Assisting the blind and visually impaired: guidelines for eye health workers and other helpers

Sue Stevens. Assisting the blind and visually impaired: guidelines for eye health workers and other helpers. Com- munity Eye Health, 16(45):7, 2003. 3

work page 2003
[45]

Nuanced perspectives toward disabil- ity simulations from digital designers, blind, low vision, and color blind people

Garreth W Tigwell. Nuanced perspectives toward disabil- ity simulations from digital designers, blind, low vision, and color blind people. In Proceedings of the 2021 CHI confer- ence on human factors in computing systems , pages 1–15,

work page 2021
[46]

Label Studio: Data labeling soft- ware, 2020-2025

Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling soft- ware, 2020-2025. Open source software available from https://github.com/HumanSignal/label-studio. 12

work page 2020
[47]

Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2025. 6

work page 2025
[48]

Helpful resources for business con- sulting

Vision Australia. Helpful resources for business con- sulting. https://www.visionaustralia.org/ business - consulting / helpful - resources ,

work page
[49]

How to be a sighted guide

Vision Loss Resources. How to be a sighted guide. https: //visionlossresources.org/resources/how- to-be-a-guide/

work page
[50]

Etiquette for interacting with people who are blind or have low vision

Washington State University. Etiquette for interacting with people who are blind or have low vision. https:// 10 studentaffairs.vancouver.wsu.edu/access- center / etiquette - interacting - people - who-are-blind-who-have-low-vision . 3

work page
[51]

Dos and don’ts for people with vision loss

Wisconsin Department of Health Services. Dos and don’ts for people with vision loss. https : / / www . dhs . wisconsin.gov/obvi/adjustment/dos-donts. htm, 2023. 3

work page 2023
[52]

Sighted guide techniques

Wisconsin Department of Health Services. Sighted guide techniques. https://www.dhs.wisconsin.gov/ obvi / adjustment / sightedguidetech . htm ,

[53]

Emerging practices for large multimodal model (lmm) assistance for people with visual impairments: Implications for design

Jingyi Xie, Rui Yu, He Zhang, Sooyeon Lee, Syed Masum Billah, and John M Carroll. Emerging practices for large multimodal model (lmm) assistance for people with visual impairments: Implications for design. arXiv preprint arXiv:2407.08882, 2024. 1, 2, 8

[54]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 13

[55]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Maria Attarian, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations. 6

[56]

A design of interface for visual-impaired people to access visual information from images featuring large language models and visual language models

Zhe-Xin Zhang and Yoichi Ochiai. A design of interface for visual-impaired people to access visual information from images featuring large language models and visual language models. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–4, 2024. 8

[57]

Vialm: A survey and benchmark of visually impaired assistance with large models

Yi Zhao, Yilin Zhang, Rong Xiang, Jing Li, and Hillming Li. Vialm: A survey and benchmark of visually impaired assistance with large models. arXiv preprint arXiv:2402.01735. 2, 8

A. Automatic Pipeline Details

A.1. Gathering Videos

To construct a high-quality dataset of outdoor walking scenes, we implemented a systematic approach to video selection. We manually identified YouTube channels specializing in walking tours across urban and natural environments. All selected channels (listed in Table 6) satisfy three criter...

Review the image and decide if it represents a street scene ("Yes") or not ("No"). Filter out the following non-street cases from the label {is_street}:
• Certain objects block most of the screen, making it difficult to recognize the street.
• The viewer is looking at something other than the street (e.g., looking at the display glass of a store on the st...

If "Yes":
• scene_description: Provide an overview of the street, including pedestrians, buildings, vehicles, or any key elements that make it a street.
• scene_location: Describe the location and surroundings using only 10, 11, 12, 1, and 2 o’clock positions. The leftmost part of the image is 10 o’clock, the center is 12 o’clock, and the rightmost is 2 o’clock.
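
The clock-position convention in this prompt can be illustrated with a short Python sketch. The prompt fixes only three anchors (leftmost is 10 o'clock, center is 12 o'clock, rightmost is 2 o'clock); the split into five equal-width bands and the function name `clock_position` are assumptions made for illustration, not something the paper specifies.

```python
CLOCK_POSITIONS = ("10", "11", "12", "1", "2")


def clock_position(x_center: float, image_width: float) -> str:
    """Map a horizontal pixel coordinate to one of the five clock labels.

    Assumption: the image is divided into five equal-width vertical bands,
    read left to right as 10, 11, 12, 1, and 2 o'clock.
    """
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    # Clamp so coordinates slightly outside the frame still get a label.
    ratio = min(max(x_center / image_width, 0.0), 1.0)
    # ratio == 1.0 would index past the end, so cap at the last band.
    index = min(int(ratio * len(CLOCK_POSITIONS)), len(CLOCK_POSITIONS) - 1)
    return CLOCK_POSITIONS[index] + " o'clock"


if __name__ == "__main__":
    for x in (0, 250, 500, 750, 1000):
        print(x, clock_position(x, 1000))
```

Equal-width bands are the simplest reading of the three anchors; a pipeline that knows the camera's field of view could use angular bands instead.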

If "No":
• scene_description: Briefly explain why this is not a street scene.
• scene_location: Must be "None".

Use only the JSON format below.
5. Do not include any text outside of this JSON format.
Output JSON example: { "is_street": "Yes", "scene_description": "A detailed description of the street, including key elements such as pedestrians, shops, and vehicles.", "scene_location": "A positional overview using 10, 11, 12, 1, and 2 o’clock references." } Or: { "i...
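
A model's reply to this prompt could be checked with a small validator like the sketch below. The three keys and the rule that `scene_location` must be "None" for non-street scenes come from the prompt text; the function name `parse_scene_output` and the choice to raise `ValueError` are hypothetical and not part of the paper's pipeline.

```python
import json

REQUIRED_KEYS = {"is_street", "scene_description", "scene_location"}


def parse_scene_output(raw: str) -> dict:
    """Parse and validate a street-filter reply (illustrative helper)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["is_street"] not in ("Yes", "No"):
        raise ValueError('is_street must be "Yes" or "No"')
    # Per the prompt, non-street scenes carry no positional overview.
    if data["is_street"] == "No" and data["scene_location"] != "None":
        raise ValueError('scene_location must be "None" when is_street is "No"')
    return data


if __name__ == "__main__":
    reply = '{"is_street": "No", "scene_description": "A shop window.", "scene_location": "None"}'
    print(parse_scene_output(reply)["is_street"])
```

Rejecting malformed replies early keeps non-street frames and free-text answers out of the later hazard-labelling steps.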

Complete Danger Zone (within 5 meters):
• Evaluate whether each object is directly in the user’s walking path and could lead to a collision.
• Mark an object as dangerous ("Yes") only if it poses a collision risk (e.g., curbs, potholes, poles, stairs, construction barriers, parked vehicles, etc.).
• Otherwise, mark it as not dangerous ("No") and briefly...

Ordinary Zone (beyond 5 meters):
• Focus on moving objects in this zone (e.g., approaching motorcycles, cars, bicycles, pedestrians).
• Mark an object as dangerous ("Yes") if it is moving toward the user and could pose a threat.
• Otherwise, mark it as not dangerous ("No") and provide a brief explanation.

For each object, provide a JSON entry with the foll...
• "object": the object’s identifier or name.

Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

Potential Hazards: {object info}
Scene Information: {scene info}

Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

Figure 9. The silver label gen...

Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall. Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

Figure 11. The zero-shot prompt designed for MLLM to generate accessibility-aware guidance gene...

Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall.

Examples:
• You’re on a bustling city street with buildings on your left, and the sidewalk and storefronts on your right. At 10 o’clock, about five steps away, there’s a moving car which is potentially dangerous if you stray off th...

Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall. Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

Scene Description: {llava output}
Object Info: {object info}
Guidance:

Figure 13. The zero-shot...

Surroundings and Position: Summarize where the person is, the general environment, their current position, and any nearby landmarks in 1-2 sentences.

Hazards:
• For each direction (10, 11, 12, 1, and 2 o’clock), combine all hazards in that direction into exactly one sentence, mentioning approximate distance(s) and reason(s) they are dangerous.
• Follow the order of 10, 11, 12, 1, and 2 o’clock.

Navigation: After describing all hazards, provide a single, concise sentence on how to safely navigate or avoid them overall. Remember to provide a single, flowing explanation without labeled sections, as if talking directly to the visually impaired individual.

Scene Description:
Object Info:
Guidance: You’re on a bustling city street with buildings on yo...

