Making AI Drafts Count: A Quality Threshold in Audio Description Workflows
Pith reviewed 2026-05-08 16:03 UTC · model grok-4.3
The pith
AI drafts for audio description cut work time and effort only when they exceed a minimum quality threshold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Editing GenAD drafts, which incorporate accessibility guidelines and contextual video information, cut completion time by more than half and significantly reduced cognitive load relative to authoring from scratch. Baseline drafts from simple unguided prompts produced only modest benefits. The results indicate that a minimum quality threshold must be met for AI drafts to be effective, and that this threshold is content-dependent, increasing with visual complexity.
What carries the argument
The quality threshold for AI drafts, measured through direct comparison of GenAD (a guided pipeline using accessibility guidelines and video context) against a baseline (simple unguided prompts) in the RefineAD editing interface, which tracks changes in text, timing, and delivery.
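How the interface scores the text dimension of human-AI contribution is not spelled out in this review; one common choice is a normalized Levenshtein edit distance between the AI draft and the final description. A minimal sketch under that assumption (the function names and the normalization choice are illustrative, not the paper's method):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def human_edit_share(draft: str, final: str) -> float:
    """Fraction of the final AD text attributable to human edits,
    as edit distance normalized by the longer string."""
    if not draft and not final:
        return 0.0
    return levenshtein(draft, final) / max(len(draft), len(final))

draft = "A man walks into the room."
final = "A tall man strides into the dim room."
share = human_edit_share(draft, final)  # between 0 (untouched) and 1
```

A share near zero would mean the draft survived largely intact; the timing and delivery dimensions would need analogous distance measures (for example, over cue onsets and speech parameters).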
Load-bearing premise
The controlled difference in draft quality and the within-subjects study design isolate the effect of quality from differences in individual editor skill, specific video content, or how quality was defined and measured.
What would settle it
A replication study in which baseline drafts are adjusted to match GenAD on guideline compliance and context accuracy yet still produce no more than 20 percent time savings, or in which GenAD drafts on high-complexity videos show no reduction in completion time.
Figures
Original abstract
Audio description (AD) narrates visual elements in video for blind and low-vision audiences. Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality AD and lowers the barrier to entry. What remains an open question is how draft quality shapes the editing process. We investigate this through GenAD, an AD generation pipeline that incorporates accessibility guidelines and contextual video information, and RefineAD, an editing interface for human revisions. Human-AI contributions are measured across text, timing, and delivery. In a within-subjects study, we compared authoring from scratch against editing AI drafts of varying quality. GenAD drafts cut completion time by more than half and significantly reduced cognitive load. In contrast, baseline drafts generated from simple, unguided prompts offered only modest benefits, pointing to a minimum quality threshold for effectiveness. Qualitative findings suggest this threshold is content-dependent; as visual complexity increases, so does the quality needed from AI drafts. We propose this as a design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GenAD, an AI pipeline for generating audio description (AD) drafts that incorporates accessibility guidelines and contextual video information, along with the RefineAD editing interface. It reports a within-subjects user study comparing AD authoring from scratch against editing AI drafts of varying quality (GenAD vs. baseline drafts from simple prompts). Key results indicate that GenAD drafts reduce completion time by more than half and significantly lower cognitive load, while baseline drafts yield only modest benefits; this leads to the proposal of a content-dependent minimum quality threshold for effective AI assistance in AD workflows.
Significance. If the empirical results hold after addressing methodological reporting, the work makes a useful contribution to HCI research on human-AI collaboration in accessibility and creative tasks. It provides evidence that AI draft quality, rather than the mere presence of assistance, drives efficiency gains, and the proposed design principle offers a practical guideline for tool development in similar domains. The direct comparison of draft qualities via a within-subjects design is a strength of the empirical approach.
Major comments (2)
- [Methods / within-subjects study description] The description of the within-subjects study (abstract and Methods) provides no participant count, details on counterbalancing of condition order, video selection/assignment method, measurement of editor skill, or how draft quality was operationalized and varied between GenAD and baseline conditions. These omissions are load-bearing for the central claim, as they leave open whether time and cognitive-load differences arise from draft quality or from confounds such as learning effects, video complexity, or individual differences.
- [Abstract / Results] The abstract states that GenAD drafts 'cut completion time by more than half and significantly reduced cognitive load' and that baseline drafts offered 'only modest benefits,' but reports no statistical tests, effect sizes, confidence intervals, or raw summaries. Without these, the evidence strength for the quality-threshold conclusion cannot be fully assessed.
Minor comments (1)
- [Abstract] The abstract would benefit from briefly noting the number of participants and videos to give readers immediate context for the study scale.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our reporting on the within-subjects study and its results. We address each major point below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
Referee: [Methods / within-subjects study description] The description of the within-subjects study (abstract and Methods) provides no participant count, details on counterbalancing of condition order, video selection/assignment method, measurement of editor skill, or how draft quality was operationalized and varied between GenAD and baseline conditions. These omissions are load-bearing for the central claim, as they leave open whether time and cognitive-load differences arise from draft quality or from confounds such as learning effects, video complexity, or individual differences.
Authors: We agree that the current Methods section would benefit from expanded, explicit reporting on these elements to allow full evaluation of the design and to address potential confounds. In the revised manuscript, we will add: the exact participant count; the counterbalancing method for condition order (e.g., Latin square); the video selection criteria and assignment procedure; how editor skill was measured or controlled (e.g., via pre-study screening or self-reported experience); and a precise operationalization of draft quality, including the specific differences in content, guideline adherence, and contextual integration between GenAD and baseline drafts. These additions will directly support the quality-threshold interpretation.
Revision: yes
Referee: [Abstract / Results] The abstract states that GenAD drafts 'cut completion time by more than half and significantly reduced cognitive load' and that baseline drafts offered 'only modest benefits,' but reports no statistical tests, effect sizes, confidence intervals, or raw summaries. Without these, the evidence strength for the quality-threshold conclusion cannot be fully assessed.
Authors: The abstract is intentionally concise, with full statistical reporting (tests, effect sizes, confidence intervals, and descriptive summaries) provided in the Results section. To address the concern, we will revise the abstract to include key statistical details supporting the main claims (e.g., time reduction with p-value and effect size) while remaining within length limits. This will make the evidence for the quality threshold more transparent at the abstract level.
Revision: partial
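The promised effect size would most naturally be Cohen's dz for within-subjects designs: the mean paired difference divided by the standard deviation of the differences. A sketch with hypothetical numbers (not the study's data):

```python
import math

def cohens_dz(cond_a, cond_b):
    """Within-subjects effect size: mean of the paired differences
    over the standard deviation of those differences."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean / sd

# Hypothetical per-participant completion times in minutes:
scratch = [42.0, 38.5, 51.0, 45.2, 39.8, 48.1]
genad = [19.5, 17.0, 24.8, 21.3, 18.2, 23.0]
dz = cohens_dz(scratch, genad)  # positive: scratch takes longer
```

Reporting dz alongside the p-value would let readers judge whether "more than half" reflects a large, consistent per-participant saving or an average driven by a few outliers.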
Circularity Check
No circularity: purely empirical user study with measured outcomes
Full rationale
The paper reports results from a within-subjects user study comparing AD authoring from scratch versus editing AI drafts of varying quality (GenAD vs. baseline). All central claims (a greater-than-50% time reduction, reduced cognitive load, and a content-dependent quality threshold) are grounded in direct experimental measurements of time, load, and qualitative feedback rather than in equations, fitted parameters, derivations, or self-referential definitions. No load-bearing step reduces to its inputs by construction; the claims rest on observed human performance rather than on quantities the system itself defines.
Axiom & Free-Parameter Ledger
Axioms (1)
- Standard math: standard statistical assumptions for analyzing within-subjects experimental data (e.g., appropriate use of paired tests or ANOVA with corrections).
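Under that axiom, the paired analysis can be sketched in pure Python (the cognitive-load scores below are hypothetical values on an assumed 0-100 scale, not the study's data):

```python
import math

def paired_t(cond_a, cond_b):
    """Paired t statistic and degrees of freedom for
    within-subjects measurements on the same participants."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical cognitive-load scores (0-100) per participant:
scratch_load = [72, 68, 81, 75, 70, 78]
genad_load = [41, 38, 55, 47, 40, 50]
t, df = paired_t(scratch_load, genad_load)  # t > 0: scratch is heavier
```

The statistic is then compared against a t distribution with df degrees of freedom; with more than two conditions (scratch, baseline, GenAD), a repeated-measures ANOVA with sphericity corrections is the analogous test.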