Substantial, Decomposable, and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks

Anhong Guo; Chenglin Li; Filippos Bellos; Jason J. Corso; Jingying Wang; Yayuan Li

arxiv: 2605.17184 · v1 · pith:YAX7HQDSnew · submitted 2026-05-16 · 💻 cs.HC

Yayuan Li, Chenglin Li, Jingying Wang, Filippos Bellos, Anhong Guo, Jason J. Corso

Pith reviewed 2026-05-20 14:03 UTC · model grok-4.3

keywords: instructional videos, visual context misalignment, physical tasks, user study, task performance, first aid, cooking

The pith

Visual context misalignment in instructional videos for physical tasks is substantial, decomposable into four attributes, and invisible to users despite harming performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how well instructional videos match the visual context users experience when performing physical tasks such as first aid or cooking. By recording specially aligned videos under controlled conditions and comparing them with standard internet videos, the authors find that better visual alignment yields 11.1% higher completion quality and 15.5% faster task completion. They decompose the mismatch into four attributes covering the task objects, the objects' states, the environment, and how the scene is observed. Misaligning just one attribute at a time consistently lowers performance, yet users do not notice or report the problem. The misalignment is therefore real and consequential, but not something learners can readily detect on their own.

Core claim

Using Wizard-of-Oz methods to produce In-Context instructional videos fully aligned with the learner's visual perception, the authors demonstrate that such videos yield 11.1% higher completion quality and 15.5% faster completion compared to typical online videos. Systematic ablation of four visual context attributes—Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context—confirms each independently degrades performance when misaligned, yet users fail to perceive these single-attribute misalignments despite the objective performance costs.

What carries the argument

The In-Context (ICON) video preparation and the four visual context attributes that are ablated independently to measure their effects on task performance.

If this is right

Aligned visual context in videos improves both the quality and speed of physical task performance.
Each of the four attributes contributes independently to the performance degradation when misaligned.
Objective measures show clear drops from misalignment even when subjective user experience does not.
Instructional video evaluation should incorporate objective performance metrics rather than relying solely on user perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Instructional content creators could benefit from tools that automatically adjust videos to match user environments.
Augmented reality systems for task guidance might address this by overlaying context-specific visuals.
Similar misalignment issues could affect other learning media like diagrams or simulations.

Load-bearing premise

That the four visual context attributes can be ablated independently in the Wizard-of-Oz setup without introducing other confounding visual changes.
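The premise above can be made concrete with a small sketch. It models each video condition as four alignment flags and checks that every single-attribute ablation differs from the fully aligned ICON condition in exactly one attribute. The class and function names (`VisualContext`, `single_attribute_ablations`, `changed_attributes`) are hypothetical illustrations, not part of the paper; a real check would rely on annotating the recordings themselves rather than boolean flags.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class VisualContext:
    """Alignment flags for the four attributes named in the paper (True = aligned).

    Hypothetical representation for illustration only.
    """
    task_object_intrinsics: bool = True
    task_object_state: bool = True
    environmental_context: bool = True
    observational_context: bool = True


def single_attribute_ablations(base):
    """Return {attribute_name: condition} with exactly one attribute misaligned."""
    return {f.name: replace(base, **{f.name: False}) for f in fields(base)}


def changed_attributes(a, b):
    """List the attribute names on which two conditions differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


icon = VisualContext()  # fully aligned condition
ablations = single_attribute_ablations(icon)

for name, condition in ablations.items():
    # A clean ablation changes the targeted attribute and nothing else.
    print(name, changed_attributes(icon, condition))
```

If an audit of the actual recordings found that an ablated video differed from ICON on more than one attribute, `changed_attributes` would return more than one name, which is the kind of confound the premise rules out.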

What would settle it

Finding no statistically significant difference in task completion quality or time when using videos with one visual attribute misaligned versus fully aligned videos.
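One way such a comparison could be run is an exact sign-flip permutation test on paired scores. This is a minimal sketch under stated assumptions: it assumes each participant contributes one score per condition (the paper's actual design and test are not specified here), the function name `paired_sign_flip_p` is hypothetical, and the scores are made up for illustration.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_p(aligned, misaligned):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no condition effect, each paired
    difference is equally likely to be positive or negative.
    """
    diffs = [a - m for a, m in zip(aligned, misaligned)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float comparison
            extreme += 1
    return extreme / total


# Made-up quality scores for six hypothetical participants.
aligned_scores = [8, 7, 9, 8, 7, 9]
misaligned_scores = [6, 6, 7, 7, 5, 8]
p_value = paired_sign_flip_p(aligned_scores, misaligned_scores)
print(p_value)
```

The enumeration visits all 2^n sign patterns, so it is only practical for small samples; a p-value above the chosen threshold for every single-attribute condition would be the outcome that counts against the paper's claim.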

Figures

Figures reproduced from arXiv: 2605.17184 by Anhong Guo, Chenglin Li, Filippos Bellos, Jason J. Corso, Jingying Wang, Yayuan Li.

**Figure 1.** We study how visual context misalignment in instructional videos affects physical task completion. Through two [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Physical infrastructure for Studies 1 and 2. Left: [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Illustrate two video types, Business-as-Usual (BAU) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Comparison of objective task performance, subjective ratings, and cognitive load between BAU and ICON conditions. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Examples of ablated ICON videos in which we de [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Ablation study results comparing ICON with four [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 7.** Likert-scale responses from the ablation study com [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

**Figure 8.** Comparative visual examples of state of the art TI2V [PITH_FULL_IMAGE:figures/full_fig_p010_8.png]

**Figure 9.** The exact textual instructions given to study participants for each of the four physical tasks. [PITH_FULL_IMAGE:figures/full_fig_p013_9.png]

Original abstract

Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) -- fully aligned with the user's visual perception -- to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation. However, we find users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings help understand the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Two user studies show that aligning the visual context of instructional videos improves physical task performance (about 11% higher completion quality and 15% faster completion). Each of four attributes degrades performance when misaligned in isolation, yet users do not notice the misalignment.

The main thing to know is that this paper gives concrete numbers on how much visual context matters for learning physical tasks from videos. Aligning the video to the user's actual setup boosts completion quality by 11.1% and cuts time by 15.5%, and the effect breaks into four separate visual attributes that each hurt performance when misaligned on their own. Users do not notice any of it despite the objective drops. That combination of magnitude, decomposition, and invisibility is the core new finding from the two studies with 56 total participants across first-aid and cooking tasks.

They used Wizard-of-Oz recordings to create fully matched ICON videos for the baseline comparison in Study 1, then ran targeted single-attribute misalignments in Study 2. The consistent degradation across all four attributes (task object intrinsics, object state, environmental context, observational context) adds empirical support to existing motor simulation and cognitive load ideas. The work is straightforward about using objective performance measures rather than relying on what participants say they see.

The potential soft spot is whether the single-attribute ablations stayed truly independent in the physical recordings. Changing object state or intrinsics could easily force small adjustments to camera angle or surrounding setup that bleed into observational or environmental context, which would make the drops harder to attribute cleanly to one factor. The abstract reports clean results, but the exact protocols and any checks for those interactions would need a close look in the full text.

This is for HCI and education technology researchers who design or evaluate video-based training systems. Anyone working on personalized instructional tools or visual perception in learning would find the quantitative splits and the objective-vs-subjective gap useful.

The empirical core is solid enough to deserve a serious referee even if the ablation independence needs extra verification in review.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that visual context misalignment in instructional videos for physical tasks is substantial, decomposable, and invisible to users. It supports this via two studies using Wizard-of-Oz recordings: Study 1 (N=16) shows that fully aligned In-Context (ICON) videos yield 11.1% higher completion quality and 15.5% faster completion than standard internet videos across first-aid and culinary tasks; qualitative analysis identifies four attributes (Task Object Intrinsics, Task Object State, Environmental Context, Observational Context); Study 2 (N=40) ablates each attribute individually from otherwise aligned videos and reports consistent performance degradation, while users fail to perceive the objective drops.

Significance. If the results hold, this work provides empirical grounding for how visual context affects motor task learning in instructional videos, with implications for HCI design of guidance systems. Strengths include the complementary studies with statistical significance, the controlled Wizard-of-Oz manipulation enabling precise alignment variations, and the counterintuitive finding that misalignment remains invisible despite measurable costs. The decomposability claim, however, depends on successful isolation of attributes.

major comments (2)

[Study 2] Study 2 ablation design: The central claim that misalignment is decomposable rests on single-attribute ablations producing isolated performance drops. Yet the manuscript provides insufficient detail on protocols for ensuring that altering one attribute (e.g., Task Object State) does not inadvertently alter another (e.g., Observational Context via compensatory camera or setup changes) in the physical Wizard-of-Oz recordings. Explicit checks for cross-effects or interdependencies would be required to attribute degradations cleanly to individual attributes rather than entangled factors.
[Results] Results reporting: While statistically significant differences are reported for both studies, the manuscript does not include effect sizes, power analysis, or full details on participant instructions and exclusion criteria. These omissions make it difficult to assess the robustness and generalizability of the 11.1% quality and 15.5% speed improvements or the ablation outcomes.

minor comments (2)

[Abstract] The abstract states '86+ hours' of data collection; providing the exact total would improve precision and replicability.
Terminology for the four visual context attributes should be used consistently in all figures, tables, and discussion sections to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and transparency.

Referee: [Study 2] Study 2 ablation design: The central claim that misalignment is decomposable rests on single-attribute ablations producing isolated performance drops. Yet the manuscript provides insufficient detail on protocols for ensuring that altering one attribute (e.g., Task Object State) does not inadvertently alter another (e.g., Observational Context via compensatory camera or setup changes) in the physical Wizard-of-Oz recordings. Explicit checks for cross-effects or interdependencies would be required to attribute degradations cleanly to individual attributes rather than entangled factors.

Authors: We agree that greater detail on the isolation protocols would strengthen the support for decomposability. The Study 2 videos were created by preparing separate Wizard-of-Oz recordings for each single-attribute misalignment while holding all other elements constant through fixed camera positions, staging, and environmental setup. We will add an expanded methods subsection describing these recording procedures, including how each attribute was targeted independently and any verification steps used to confirm no unintended changes to other attributes. (Revision: yes.)
Referee: [Results] Results reporting: While statistically significant differences are reported for both studies, the manuscript does not include effect sizes, power analysis, or full details on participant instructions and exclusion criteria. These omissions make it difficult to assess the robustness and generalizability of the 11.1% quality and 15.5% speed improvements or the ablation outcomes.

Authors: We concur that these elements would aid assessment of the results. In the revised manuscript we will report effect sizes (e.g., Cohen's d) for the key comparisons in both studies, include a post-hoc power analysis, and expand the participant section with complete instructions and explicit exclusion criteria. (Revision: yes.)
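For readers unfamiliar with the effect size mentioned in the response, the following is a minimal sketch of Cohen's d with a pooled standard deviation for two independent groups. The function name `cohens_d` and the sample values are illustrative assumptions; the source does not say which variant of d the authors would report, and a within-subject design would call for a paired variant instead.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores, not data from the paper.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)
```

Reporting d alongside the percentage gains would let readers judge the size of the difference relative to between-participant variability, which percentages alone do not convey.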

Circularity Check

0 steps flagged

No significant circularity: empirical user study grounded in performance data

The paper reports two controlled user studies (N=16 and N=40) that measure task completion quality and time under Wizard-of-Oz manipulated visual alignment conditions. All central claims (11.1% quality gain, 15.5% speed gain, four-attribute decomposition, and invisibility to users) rest on direct participant outcome data and systematic single-attribute ablations rather than on equations, fitted parameters, or first-principles derivations. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing. The skeptic's concern about possible physical interdependencies in the ablations is a validity or confound issue, not a reduction of the reported results to their own inputs by construction. The chain of evidence is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the controlled Wizard-of-Oz recordings, the independence of the four ablated attributes, and standard assumptions of statistical testing in human-subject experiments.

axioms (2)

standard math: Standard assumptions of statistical significance testing for between-condition comparisons in user studies. Invoked when reporting statistically significant improvements in completion quality and speed.
domain assumption: The four named visual context attributes can be isolated and manipulated independently in video recordings. Central to the ablation design in Study 2.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: "Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "Visual context misalignment is substantial, decomposable, and invisible to the user."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Amber Aftab, Ruipu Hu, and Sang Won Lee. 2020. Remo: Generating Interactive Tutorials by Demonstration for Online Tasks. InAdjunct Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. 87–89

work page 2020

[2] Elizabeth L. Bjork and Robert A. Bjork. 2011. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In Psychology and the real world: Essays illustrating fundamental contributions to society, M. A. Gernsbacher, R. W. Pew, L. M. Hough, and J. R. Pomerantz (Eds.). Worth Publishers, 56–64

[3] Paul Chandler and John Sweller. 1991. Cognitive load theory and the format of instruction. Cognition and instruction 8, 4 (1991), 293–332

[4] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. 2024. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2401.09047 [cs.CV]

[5] Pei-Yu Chi, Sally Ahn, Amanda Ren, Mira Dontcheva, Wilmot Li, and Björn Hartmann. 2012. MixT: automatic generation of step-by-step mixed media tutorials. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. 93–102

[6] J. Costley, Mik Fanguy, C. Lange, and Matthew Baldwin. 2020. The effects of video lecture viewing strategies on cognitive load. Journal of Computing in Higher Education 33 (2020), 19–38. doi:10.1007/s12528-020-09254-y

[7] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. 2023. Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886 (2023)

[8] Google DeepMind. 2025. Veo. Google DeepMind. https://deepmind.google/models/veo/

[9] Enqi Fan, Matt Bower, and Jens Siemon. 2024. Video Tutorials in the Traditional Classroom: The Effects on Different Types of Cognitive Load. Technol. Knowl. Learn. 29 (2024), 2017–2036. doi:10.1007/s10758-024-09754-1

[10] Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76, 5 (1971), 378

[11] C Ailie Fraser, Tricia J Ngoon, Mira Dontcheva, and Scott Klemmer. 2019. RePlay: contextually presenting learning videos across software applications. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13

[12] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183

[13] Gaoping Huang, Xun Qian, Tianyi Wang, Fagun Patel, Maitreya Sreeram, Yuanzhi Cao, Karthik Ramani, and Alexander J Quinn. 2021. Adaptutar: An adaptive tutoring system for machine tasks in augmented reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15

[14] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. 2024. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21807–21818

[15] Marc Jeannerod. 1994. The representing brain: Neural correlates of motor intention and imagery. Behavioral and Brain Sciences 17, 2 (1994), 187–202. doi:10.1017/S0140525X00034026

[16] Marc Jeannerod. 2001. Neural simulation of action: A unifying mechanism for motor cognition. NeuroImage 14, 1 (2001), S103–S109. doi:10.1006/nimg.2001.0832

[17] J. M. Juliano, N. Schweighofer, and S. Liew. 2022. Increased cognitive load in immersive virtual reality during visuomotor adaptation is associated with decreased long-term retention and context transfer. Journal of NeuroEngineering and Rehabilitation 19 (2022). doi:10.1186/s12984-022-01084-6

[18] Juho Kim. 2013. Toolscape: enhancing the learning experience of how-to videos. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems. 2707–2712

[19] Jeongyeon Kim, Daeun Choi, Nicole Lee, Matt Beane, and Juho Kim. 2023. Surch: Enabling structural search and comparison for surgical videos. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17

[20] Anita Komlodi and Gary Marchionini. 1998. Key frame preview techniques for video browsing. In Proceedings of the third ACM Conference on Digital libraries. 118–125

[21] Asher Koriat and Robert A. Bjork. 2005. Illusions of competence in monitoring one’s knowledge during study. Journal of Experimental Psychology: Learning, Memory, and Cognition 31, 2 (2005), 187–194. doi:10.1037/0278-7393.31.2.187

[22] John F. Kragh Jr., Thomas J. Walters, David G. Baer, Charles J. Fox, Charles E. Wade, Jose Salinas, and John B. Holcomb. 2009. Survival With Emergency Tourniquet Use to Stop Bleeding in Major Limb Trauma. Annals of Surgery 249, 1 (January 2009), 1–7. doi:10.1097/SLA.0b013e31818842ba

[23] Kuaishou. 2025. Kling AI. Kuaishou. https://app.klingai.com/global/

[24] Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M Rehg, and Miao Liu. 2023. LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. arXiv preprint arXiv:2312.03849 (2023)

[25] Ziyi Liu, Zhengzhe Zhu, Enze Jiang, Feichi Huang, Ana M Villanueva, Xun Qian, Tianyi Wang, and Karthik Ramani. 2023. Instrumentar: Auto-generation of augmented reality tutorials for operating digital instruments through recording embodied demonstration. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17

[26] Richard E. Mayer. 2020. Multimedia Learning (3 ed.). Cambridge University Press

[27] Mariana Morgado, João Botelho, Vanessa Machado, José João Mendes, Olusola Adesope, and Luís Proença. 2024. Video-based approaches in health education: a systematic review and meta-analysis. Scientific Reports 14, 23651 (2024)

[28] Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X Huang, and Tim K Marks. 2024. TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9015–9025

[29] Daniel M. Oppenheimer. 2008. The secret life of fluency. Trends in Cognitive Sciences 12, 6 (2008), 237–241. doi:10.1016/j.tics.2008.02.014

[30] Raphaël Perraud, Aurélien Tabard, and Sylvain Malacria. 2024. Tutorial mismatches: investigating the frictions due to interface differences when following software video tutorials. In Proceedings of the 2024 ACM Designing Interactive Systems Conference. 1942–1955

[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

[32] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35 (2022), 36479–36494

[33] Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. 2022. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21096–21106

[34] Daniel J. Simons and Christopher F. Chabris. 1999. Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception 28, 9 (1999), 1059–1074. doi:10.1068/p281059

[35] Aaron Smith, Skye Toor, and Patrick Van Kessel. 2018. Many turn to YouTube for children’s content, news, how-to lessons. Pew Research Center 7 (2018)

[36] Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. 2024. Genhowto: Learning to generate actions and state transformations from instructional videos. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 6561–6571

[37] Andreja Istenic Starcic, Ziga Turk, and Matej Zajc. 2015. Transforming Pedagogical Approaches Using Tangible User Interface Enabled Computer Assisted Learning. International Journal of Emerging Technologies in Learning (iJET) 10, 6 (2015), 42–52. doi:10.3991/ijet.v10i6.4865

[38] John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285. doi:10.1207/s15516709cog1202_4

[39] John Sweller, Paul Ayres, and Slava Kalyuga. 2011. Cognitive Load Theory. Springer. doi:10.1007/978-1-4419-8126-4

[40] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

[41] Cheng-Yao Wang, Wei-Chen Chu, Hou-Ren Chen, Chun-Yen Hsu, and Mike Y Chen. 2014. Evertutor: Automatically creating interactive guided tutorials on smartphones by user demonstration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 4027–4036

[42] Jiahui Wang, Pavlo D. Antonenko, A. Keil, and K. Dawson. 2020. Converging Subjective and Psychophysiological Measures of Cognitive Load to Study the Effects of Instructor-Present Video. Mind, Brain, and Education 14 (2020), 279–291. doi:10.1111/mbe.12239

[43] Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, et al. 2023. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 20270–20281

[44] Saelyne Yang, Sangkyung Kwak, Juhoon Lee, and Juho Kim. 2023. Beyond Instructions: a taxonomy of information types in how-to videos. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21

[45] Saelyne Yang, Anh Truong, Juho Kim, and Dingzeyu Li. 2025. VideoMix: Aggregating How-To Videos for Task-Oriented Learning. In Proceedings of the 30th International Conference on Intelligent User Interfaces. 1564–1580

[46] Saelyne Yang, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 2024. AQuA: Automated question-answering in software tutorial videos with visual anchors. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19

[47] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. 2024. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[48] Meehyun Yoon, Hua Zheng, Eulho Jung, and Tong Li. 2022. Effects of Segmentation and Self-Explanation Designs on Cognitive Load in Instructional Videos. Contemporary Educational Technology (2022). doi:10.30935/cedtech/11522

[49] Samy C. Youssef, Abdullatif Aydin, Alexander Canning, Nawal Khan, Kamran Ahmed, and Prokar Dasgupta. 2022. Learning Surgical Skills Through Video-Based Education: A Systematic Review. Surgical Innovation (2022). doi:10.1177/15533506221120146

[50] Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, and Kai Chen. 2024. Pia: Your personalized image animator via plug-and-play modules in text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7747–7756

[51] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024. Open-Sora: Democratizing Efficient Video Production for All. https://github.com/hpcaitech/Open-Sora

[52] Mingyuan Zhong, Gang Li, Peggy Chi, and Yang Li. 2021. Helpviz: Automatic generation of contextual visual mobile tutorials from text-based instructions. In The 34th Annual ACM Symposium on User Interface Software and Technology. 1144–1153

A Task Description

Fig. 9 shows the exact textual description for each of the four physical tasks used in our study...