pith. machine review for the scientific record.

arxiv: 2605.05348 · v1 · submitted 2026-05-06 · 💻 cs.HC · cs.AI

Recognition: unknown

Making AI Drafts Count: A Quality Threshold in Audio Description Workflows


Pith reviewed 2026-05-08 16:03 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords: audio description · AI assistance · human-AI collaboration · quality threshold · accessibility · cognitive load · video narration · editing workflows

The pith

AI drafts for audio description cut work time and effort only when they exceed a minimum quality threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how the quality of an AI starting draft influences the speed and ease of producing audio descriptions that narrate visual content for blind and low-vision audiences. In a within-subjects comparison, participants edited high-quality drafts generated by a pipeline that follows accessibility guidelines and uses video context; those drafts cut completion time by more than half and lowered cognitive load compared with writing from scratch. Drafts produced by simple unguided prompts delivered only small gains, showing that AI help is not automatically useful. The required quality level rises with the visual complexity of the video, leading the authors to argue that effective AI assistance must clear a threshold suited to the content rather than merely be present.

Core claim

Editing GenAD drafts, which incorporate accessibility guidelines and contextual video information, cut completion time by more than half and significantly reduced cognitive load relative to authoring from scratch. Baseline drafts from simple unguided prompts produced only modest benefits. The results indicate that a minimum quality threshold must be met for AI drafts to be effective, and that this threshold is content-dependent, increasing with visual complexity.

What carries the argument

The quality threshold for AI drafts, measured through direct comparison of GenAD (a guided pipeline using accessibility guidelines and video context) against a baseline (simple unguided prompts) in the RefineAD editing interface, which tracks changes in text, timing, and delivery.
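RefineAD's change tracking suggests a concrete way to quantify the human share of the final text. A minimal sketch in Python, assuming character-level Levenshtein edit distance normalized by the longer string; the paper does not specify its metric in the material shown here, so treat this as one plausible operationalization, not the authors' implementation.

# Quantify human contribution to the final AD text as the normalized
# Levenshtein distance between the AI draft and the edited result.
# Illustrative operationalization only; not the paper's published metric.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def human_contribution(draft: str, final: str) -> float:
    # Fraction of the final text attributable to human edits, in [0, 1].
    if not draft and not final:
        return 0.0
    return levenshtein(draft, final) / max(len(draft), len(final))

# Hypothetical example: compare an AI draft with its human-edited final text.
print(human_contribution("A dog runs across the field.",
                         "A golden retriever runs across the field."))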

Load-bearing premise

The controlled difference in draft quality and the within-subjects study design isolate the effect of quality from differences in individual editor skill, specific video content, or how quality was defined and measured.

What would settle it

A replication study in which baseline drafts are adjusted to match GenAD on guideline compliance and context accuracy yet still produce no more than 20 percent time savings, or in which GenAD drafts on high-complexity videos show no reduction in completion time.

Figures

Figures reproduced from arXiv: 2605.05348 by Charity M. Pitcher-Cooper, Gio Jung, Hyunjoo Shim, Ilmi Yoon, Kien T. Nguyen, Lana Do, Sanjay Mirani, Shasta Ihorn, Vassilis Athitsos, Zhenzhen Qin.

Figure 1: Overview of the GenAD–RefineAD pipeline. GenAD processes a YouTube video through four stages: (A) video …
Figure 2: Qualitative differences between Baseline and …
Figure 3: The RefineAD editing interface. Describers preview the video with audio ducking controls (A), take notes (B), and …
Figure 4: Participant preferences (left) and perceived speed …
Figure 7: Human contribution tracks task completion time …
Figure 8: Three themes from participants' qualitative responses: (1) AI drafts as scaffolding for description authoring, (2) a …
Figure 9: System prompt: the guidelines string is injected into the system message via the {guidelines} placeholder. (The caption continues into appendix A.2, the scene-level generation prompt: issued once per scene, it receives the scene duration and narrative context, then requests structured JSON output for both on-screen text events and visual description events.)
Figure 10: Prompt for generating scene-level descriptions …
Figure 12: Prompt for retrying a failed inline optimization.
Figure 13: Prompt for filtering extended audio description …
Figure 14: Video preview with AI-generated AD. Users watch …
Figure 15: From-scratch authoring workflow. (a) Video land…
Figure 16: Condition labels presented to participants. The …
Figure 17: Interactive onboarding tutorial. A spotlight overlay guides users step by step through the interface, requiring specific …
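The Figure 9 caption and its appendix A.2 overflow show enough of the prompt structure to sketch the assembly step: a guidelines string injected into the system message via the {guidelines} placeholder, plus a per-scene user message carrying the scene duration and narrative context. A minimal sketch; the guideline text, message wording, and OpenAI-style role/content schema are illustrative assumptions, not the paper's full templates.

# Assemble chat messages for one scene, mirroring the placeholder
# structure described in Figure 9. Template wording is a stand-in
# for the paper's appendix templates.

SYSTEM_TEMPLATE = (
    "You are an audio description writer. Follow these guidelines:\n"
    "{guidelines}"
)

SCENE_TEMPLATE = (
    "SCENE DURATION: {scene_duration:.2f} seconds\n"
    "CONTEXT: {context}\n"
    "Return structured JSON with two arrays: on-screen text events "
    "and visual description events."
)

def build_messages(guidelines: str, scene_duration: float, context: str) -> list[dict]:
    # OpenAI-style role/content dicts; swap in whatever chat API is in use.
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(guidelines=guidelines)},
        {"role": "user", "content": SCENE_TEMPLATE.format(
            scene_duration=scene_duration, context=context)},
    ]

# Hypothetical usage for a 12.5-second scene.
messages = build_messages(
    guidelines="- Describe only what is visible.\n- Fit within the scene duration.",
    scene_duration=12.5,
    context="A cooking tutorial; the host has just finished chopping onions.",
)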
Original abstract

Audio description (AD) narrates visual elements in video for blind and low-vision audiences. Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality AD and lowers the barrier to entry. What remains an open question is how draft quality shapes the editing process. We investigate this through GenAD, an AD generation pipeline that incorporates accessibility guidelines and contextual video information, and RefineAD, an editing interface for human revisions. Human-AI contributions are measured across text, timing, and delivery. In a within-subjects study, we compared authoring from scratch against editing AI drafts of varying quality. GenAD drafts cut completion time by more than half and significantly reduced cognitive load. In contrast, baseline drafts generated from simple, unguided prompts offered only modest benefits, pointing to a minimum quality threshold for effectiveness. Qualitative findings suggest this threshold is content-dependent; as visual complexity increases, so does the quality needed from AI drafts. We propose this as a design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GenAD, an AI pipeline for generating audio description (AD) drafts that incorporates accessibility guidelines and contextual video information, along with the RefineAD editing interface. It reports a within-subjects user study comparing AD authoring from scratch against editing AI drafts of varying quality (GenAD vs. baseline drafts from simple prompts). Key results indicate that GenAD drafts reduce completion time by more than half and significantly lower cognitive load, while baseline drafts yield only modest benefits; this leads to the proposal of a content-dependent minimum quality threshold for effective AI assistance in AD workflows.

Significance. If the empirical results hold after addressing methodological reporting, the work makes a useful contribution to HCI research on human-AI collaboration in accessibility and creative tasks. It provides evidence that AI draft quality, rather than mere presence of assistance, drives efficiency gains, and the proposed design principle offers a practical guideline for tool development in similar domains. The direct comparison of draft qualities via within-subjects design is a positive aspect of the empirical approach.

major comments (2)
  1. [Methods / within-subjects study description] The description of the within-subjects study (abstract and Methods) provides no participant count, details on counterbalancing of condition order, video selection/assignment method, measurement of editor skill, or how draft quality was operationalized and varied between GenAD and baseline conditions. These omissions are load-bearing for the central claim, as they leave open whether time and cognitive-load differences arise from draft quality or from confounds such as learning effects, video complexity, or individual differences.
  2. [Abstract / Results] The abstract states that GenAD drafts 'cut completion time by more than half and significantly reduced cognitive load' and that baseline drafts offered 'only modest benefits,' but reports no statistical tests, effect sizes, confidence intervals, or raw summaries. Without these, the evidence strength for the quality-threshold conclusion cannot be fully assessed.
minor comments (1)
  1. [Abstract] The abstract would benefit from briefly noting the number of participants and videos to give readers immediate context for the study scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our reporting on the within-subjects study and its results. We address each major point below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Methods / within-subjects study description] The description of the within-subjects study (abstract and Methods) provides no participant count, details on counterbalancing of condition order, video selection/assignment method, measurement of editor skill, or how draft quality was operationalized and varied between GenAD and baseline conditions. These omissions are load-bearing for the central claim, as they leave open whether time and cognitive-load differences arise from draft quality or from confounds such as learning effects, video complexity, or individual differences.

    Authors: We agree that the current Methods section would benefit from expanded, explicit reporting on these elements to allow full evaluation of the design and to address potential confounds. In the revised manuscript, we will add: the exact participant count; the counterbalancing method for condition order (e.g., Latin square; a sketch follows these responses); the video selection criteria and assignment procedure; how editor skill was measured or controlled (e.g., via pre-study screening or self-reported experience); and a precise operationalization of draft quality, including the specific differences in content, guideline adherence, and contextual integration between GenAD and baseline drafts. These additions will directly support the quality-threshold interpretation. Revision: yes.

  2. Referee: [Abstract / Results] The abstract states that GenAD drafts 'cut completion time by more than half and significantly reduced cognitive load' and that baseline drafts offered 'only modest benefits,' but reports no statistical tests, effect sizes, confidence intervals, or raw summaries. Without these, the evidence strength for the quality-threshold conclusion cannot be fully assessed.

    Authors: The abstract is intentionally concise, with full statistical reporting (tests, effect sizes, confidence intervals, and descriptive summaries) provided in the Results section. To address the concern, we will revise the abstract to include key statistical details supporting the main claims (e.g., time reduction with p-value and effect size) while remaining within length limits. This will make the evidence for the quality threshold more transparent at the abstract level. Revision: partial.
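The counterbalancing promised in response 1 is easy to make concrete. A minimal sketch, assuming a cyclic Latin square over the study's three conditions; the condition names follow the paper, but the assignment scheme is illustrative, not the authors' procedure.

# Cyclic Latin square: each condition appears in each ordinal position
# equally often across participants, controlling order effects.

CONDITIONS = ["from-scratch", "baseline draft", "GenAD draft"]

def latin_square(conditions: list[str]) -> list[list[str]]:
    # Row i is the condition list rotated by i positions.
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

def assign_orders(n_participants: int) -> list[list[str]]:
    # Cycle participants through the square's rows.
    square = latin_square(CONDITIONS)
    return [square[p % len(square)] for p in range(n_participants)]

for p, order in enumerate(assign_orders(6), start=1):
    print(f"P{p}: {' -> '.join(order)}")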

Circularity Check

0 steps flagged

No circularity: purely empirical user study with measured outcomes

Full rationale

The paper reports results from a within-subjects user study comparing AD authoring from scratch versus editing AI drafts of varying quality (GenAD vs. baseline). All central claims—>50% time reduction, reduced cognitive load, and a content-dependent quality threshold—are grounded in direct experimental measurements of time, load, and qualitative feedback rather than any equations, fitted parameters, derivations, or self-referential definitions. No load-bearing steps reduce to their inputs by construction; the claims are checked against observed human performance rather than against the system's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on experimental measurements from a within-subjects user study comparing draft conditions. No free parameters or invented entities are introduced; the work relies on standard HCI evaluation practices and accessibility guidelines.

axioms (1)
  • standard math — Standard statistical assumptions for analyzing within-subjects experimental data (e.g., appropriate use of paired tests or ANOVA with corrections; see the sketch below).
    Required to interpret reported significant differences in time and cognitive load.
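To make the presupposed analysis concrete: a minimal sketch of a paired t-test with a within-subjects effect size, on hypothetical completion times. The numbers are invented for illustration and are not the paper's data; a Wilcoxon signed-rank test would substitute if normality of the differences fails.

import numpy as np
from scipy import stats

# Hypothetical per-participant completion times in minutes (invented data).
scratch = np.array([38.0, 45.0, 41.0, 52.0, 36.0, 48.0])  # authoring from scratch
genad = np.array([17.0, 22.0, 19.0, 24.0, 16.0, 21.0])    # editing GenAD drafts

diff = scratch - genad
t, p = stats.ttest_rel(scratch, genad)      # paired t-test
d_z = diff.mean() / diff.std(ddof=1)        # Cohen's d_z for paired designs

print(f"t({len(diff) - 1}) = {t:.2f}, p = {p:.4f}, d_z = {d_z:.2f}")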

pith-pipeline@v0.9.0 · 5529 in / 1352 out tokens · 60279 ms · 2026-05-08T16:03:22.502381+00:00 · methodology

