arxiv: 2605.17184 · v1 · pith:YAX7HQDSnew · submitted 2026-05-16 · 💻 cs.HC

Substantial, Decomposable, and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks

Pith reviewed 2026-05-20 14:03 UTC · model grok-4.3

classification 💻 cs.HC
keywords instructional videos · visual context misalignment · physical tasks · user study · task performance · first aid · cooking

The pith

Visual context misalignment in instructional videos for physical tasks is substantial, decomposable into four attributes, and invisible to users despite harming performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how well instructional videos match the visual context users experience when performing physical tasks such as first aid or cooking. By recording specially aligned videos under controlled Wizard-of-Oz conditions and comparing them with standard internet videos, the authors find that full visual alignment yields 11.1% higher completion quality and 15.5% faster task completion. They decompose the mismatch into four attributes: the task objects themselves, the objects' states, the surrounding environment, and how the task is observed. Misaligning just one attribute at a time degrades performance consistently, yet users do not notice or report the problem. The misalignment is therefore real and costly, but not something learners can readily detect on their own.

Core claim

Using Wizard-of-Oz methods to produce In-Context instructional videos fully aligned with the learner's visual perception, the authors demonstrate that such videos yield 11.1% higher completion quality and 15.5% faster completion compared to typical online videos. Systematic ablation of four visual context attributes—Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context—confirms each independently degrades performance when misaligned, yet users fail to perceive these single-attribute misalignments despite the objective performance costs.

What carries the argument

The In-Context (ICON) video preparation and the four visual context attributes that are ablated independently to measure their effects on task performance.

If this is right

  • Aligned visual context in videos improves both the quality and speed of physical task performance.
  • Each of the four attributes contributes independently to the performance degradation when misaligned.
  • Objective measures show clear performance drops from misalignment even when users' subjective ratings do not.
  • Instructional video evaluation should incorporate objective performance metrics rather than relying solely on user perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Instructional content creators could benefit from tools that automatically adjust videos to match user environments.
  • Augmented reality systems for task guidance might address this by overlaying context-specific visuals.
  • Similar misalignment issues could affect other learning media like diagrams or simulations.

Load-bearing premise

That the four visual context attributes can be ablated independently in the Wizard-of-Oz setup without introducing other confounding visual changes.
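
The premise can be made concrete with a small bookkeeping sketch. It assumes each video condition is described by four boolean flags (aligned or not); the class name `VideoCondition` and the helper functions are hypothetical illustrations, not part of the paper's materials. A "clean" ablation is one that differs from the fully aligned ICON condition in exactly one attribute.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VideoCondition:
    # Hypothetical encoding: True means the attribute matches the learner's context.
    task_object_intrinsics: bool = True
    task_object_state: bool = True
    environmental_context: bool = True
    observational_context: bool = True


def misaligned_attributes(condition):
    """Return the names of the attributes that are misaligned in this condition."""
    return [f.name for f in fields(condition) if not getattr(condition, f.name)]


def single_attribute_ablations():
    """Build one condition per attribute, misaligning only that attribute."""
    return {
        f.name: VideoCondition(**{f.name: False})
        for f in fields(VideoCondition)
    }


def is_clean_ablation(condition):
    """A clean ablation differs from the fully aligned video in exactly one attribute."""
    return len(misaligned_attributes(condition)) == 1


ICON = VideoCondition()                  # fully aligned baseline
ABLATIONS = single_attribute_ablations()  # four single-attribute conditions
```

A recording in which changing the object state also forced a camera change would be coded with two flags set to `False` and would fail `is_clean_ablation`. Whether real recordings can be coded this cleanly is exactly what the premise takes for granted.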

What would settle it

Finding no statistically significant difference in task completion quality or time when using videos with one visual attribute misaligned versus fully aligned videos.
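
One way such a comparison could be run is an exact sign-flip permutation test on paired scores. This is a minimal sketch under stated assumptions: it presumes each participant is measured under both the aligned and the misaligned video (the summary does not state the design or the test the authors used), and the function name and any scores passed to it are hypothetical.

```python
from itertools import product


def paired_sign_flip_p_value(aligned, misaligned):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no effect, each participant's difference is
    equally likely to be positive or negative, so every sign pattern is
    enumerated and compared with the observed total difference.
    """
    if len(aligned) != len(misaligned):
        raise ValueError("paired samples must have equal length")
    diffs = [a - m for a, m in zip(aligned, misaligned)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total
```

The enumeration grows as 2^n, so this exact form suits only small samples; a p-value that stays large across all four single-attribute comparisons would be the kind of result that undercuts the paper's claim.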

Figures

Figures reproduced from arXiv: 2605.17184 by Anhong Guo, Chenglin Li, Filippos Bellos, Jason J. Corso, Jingying Wang, Yayuan Li.

Figure 1: We study how visual context misalignment in instructional videos affects physical task completion. Through two ...
Figure 2: Physical infrastructure for Studies 1 and 2. Left: ...
Figure 3: Illustrate two video types, Business-as-Usual (BAU) ...
Figure 4: Comparison of objective task performance, subjective ratings, and cognitive load between BAU and ICON conditions.
Figure 5: Examples of ablated ICON videos in which we de...
Figure 6: Ablation study results comparing ICON with four ...
Figure 7: Likert-scale responses from the ablation study com...
Figure 8: Comparative visual examples of state of the art TI2V ...
Figure 9: The exact textual instructions given to study participants for each of the four physical tasks.
Original abstract

Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) -- fully aligned with the user's visual perception -- to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation. However, we find users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings help understand the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that visual context misalignment in instructional videos for physical tasks is substantial, decomposable, and invisible to users. It supports this via two studies using Wizard-of-Oz recordings: Study 1 (N=16) shows that fully aligned In-Context (ICON) videos yield 11.1% higher completion quality and 15.5% faster completion than standard internet videos across first-aid and culinary tasks; qualitative analysis identifies four attributes (Task Object Intrinsics, Task Object State, Environmental Context, Observational Context); Study 2 (N=40) ablates each attribute individually from otherwise aligned videos and reports consistent performance degradation, while users fail to perceive the objective drops.

Significance. If the results hold, this work provides empirical grounding for how visual context affects motor task learning in instructional videos, with implications for HCI design of guidance systems. Strengths include the complementary studies with statistical significance, the controlled Wizard-of-Oz manipulation enabling precise alignment variations, and the counterintuitive finding that misalignment remains invisible despite measurable costs. The decomposability claim, however, depends on successful isolation of attributes.

major comments (2)
  1. [Study 2] Study 2 ablation design: The central claim that misalignment is decomposable rests on single-attribute ablations producing isolated performance drops. Yet the manuscript provides insufficient detail on protocols for ensuring that altering one attribute (e.g., Task Object State) does not inadvertently alter another (e.g., Observational Context via compensatory camera or setup changes) in the physical Wizard-of-Oz recordings. Explicit checks for cross-effects or interdependencies would be required to attribute degradations cleanly to individual attributes rather than entangled factors.
  2. [Results] Results reporting: While statistically significant differences are reported for both studies, the manuscript does not include effect sizes, power analysis, or full details on participant instructions and exclusion criteria. These omissions make it difficult to assess the robustness and generalizability of the 11.1% quality and 15.5% speed improvements or the ablation outcomes.
minor comments (2)
  1. [Abstract] The abstract states '86+ hours' of data collection; providing the exact total would improve precision and replicability.
  2. Terminology for the four visual context attributes should be used consistently in all figures, tables, and discussion sections to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and transparency.

  1. Referee: [Study 2] Study 2 ablation design: The central claim that misalignment is decomposable rests on single-attribute ablations producing isolated performance drops. Yet the manuscript provides insufficient detail on protocols for ensuring that altering one attribute (e.g., Task Object State) does not inadvertently alter another (e.g., Observational Context via compensatory camera or setup changes) in the physical Wizard-of-Oz recordings. Explicit checks for cross-effects or interdependencies would be required to attribute degradations cleanly to individual attributes rather than entangled factors.

    Authors: We agree that greater detail on the isolation protocols would strengthen the support for decomposability. The Study 2 videos were created by preparing separate Wizard-of-Oz recordings for each single-attribute misalignment while holding all other elements constant through fixed camera positions, staging, and environmental setup. We will add an expanded methods subsection describing these recording procedures, including how each attribute was targeted independently and any verification steps used to confirm no unintended changes to other attributes.

    revision: yes

  2. Referee: [Results] Results reporting: While statistically significant differences are reported for both studies, the manuscript does not include effect sizes, power analysis, or full details on participant instructions and exclusion criteria. These omissions make it difficult to assess the robustness and generalizability of the 11.1% quality and 15.5% speed improvements or the ablation outcomes.

    Authors: We concur that these elements would aid assessment of the results. In the revised manuscript we will report effect sizes (e.g., Cohen's d) for the key comparisons in both studies, include a post-hoc power analysis, and expand the participant section with complete instructions and explicit exclusion criteria.

    revision: yes
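
For readers unfamiliar with the effect size named in the rebuttal, the following is a minimal sketch of Cohen's d using the pooled standard deviation of two independent groups. The function name and any inputs are hypothetical; the paper's data are not reproduced here, and a within-subject design might call for a different standardizer than the pooled form shown.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

Reporting d alongside the percentage differences would let readers judge the 11.1% and 15.5% gains relative to between-participant variability rather than in raw units alone.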

Circularity Check

0 steps flagged

No significant circularity: empirical user study grounded in performance data

The paper reports two controlled user studies (N=16 and N=40) that measure task completion quality and time under Wizard-of-Oz manipulated visual alignment conditions. All central claims (11.1% quality gain, 15.5% speed gain, four-attribute decomposition, and invisibility to users) rest on direct participant outcome data and systematic single-attribute ablations rather than on equations, fitted parameters, or first-principles derivations. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing. The skeptic's concern about possible physical interdependencies in the ablations is a validity or confound issue, not a reduction of the reported results to their own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the controlled Wizard-of-Oz recordings, the independence of the four ablated attributes, and standard assumptions of statistical testing in human-subject experiments.

axioms (2)
  • [standard math] Standard assumptions of statistical significance testing for between-condition comparisons in user studies.
    Invoked when reporting statistically significant improvements in completion quality and speed.
  • [domain assumption] The four named visual context attributes can be isolated and manipulated independently in video recordings.
    Central to the ablation design in Study 2.

pith-pipeline@v0.9.0 · 5820 in / 1312 out tokens · 45795 ms · 2026-05-20T14:03:39.751656+00:00 · methodology

A Task Description

Fig. 9 shows the exact textual description for each of the four physical tasks used in our study...