How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

Frank Keller; Olga Loginova

arxiv: 2604.15134 · v1 · submitted 2026-04-16 · 💻 cs.CV

Pith reviewed 2026-05-10 11:49 UTC · model grok-4.3

keywords: egocentric video · procedural tasks · mistake injection · error recovery · video synthesis · benchmarking rubric · psychology-informed errors · LLM video editing

The pith

PIE-V builds egocentric videos of procedural tasks by injecting controlled, human-plausible mistakes and recoveries into clean recordings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PIE-V to address the scarcity of natural errors in existing egocentric procedural video datasets, which limits training for reliable mistake detection and recovery monitoring. It augments clean keystep videos with deviations planned according to psychological principles about task phases and cognitive load, then uses language models to maintain consistency and video synthesis to insert replacement segments seamlessly. A new taxonomy and nine-metric human evaluation rubric assess the results for plausibility, logical coherence, state changes, and text-video alignment. This matters for building systems that can observe and assist with real-world procedures where people make and correct mistakes.

Core claim

PIE-V augments clean keystep procedures from egocentric videos with controlled, human-plausible deviations. It combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer for cascade-consistent rewrites, an LLM judge that validates procedural coherence, and text-guided video generation that synthesizes and stitches replacement clips. Applied to 17 tasks and 50 Ego-Exo4D scenarios, it injects 102 mistakes and generates 27 recovery corrections, and it introduces a unified taxonomy and a nine-metric rubric covering step-level and procedure-level quality.

What carries the argument

The psychology-informed error planner conditioned on procedure phase and semantic step load, which generates human-plausible mistakes, paired with LLM cascade rewrites and text-guided video synthesis for seamless clip insertion and coherence validation.
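
As a rough illustration of what conditioning on procedure phase and semantic step load could look like, the sketch below samples steps to corrupt with probability proportional to a phase weight times a load term. The `Step` record, the phase labels, the values in `PHASE_WEIGHT`, the scoring formula, and the example steps are all assumptions made for illustration; this summary does not specify how the paper's planner actually scores steps.

```python
import random
from dataclasses import dataclass

# Hypothetical phase weights: illustrative values only, not taken from the paper.
PHASE_WEIGHT = {"early": 0.5, "middle": 1.0, "late": 1.5}


@dataclass(frozen=True)
class Step:
    index: int
    description: str
    phase: str          # one of PHASE_WEIGHT's keys (assumed phase labels)
    semantic_load: int  # assumed proxy, e.g. number of arguments in the step


def error_propensity(step: Step) -> float:
    """Toy score: phase weight scaled by semantic load (assumed formula)."""
    return PHASE_WEIGHT[step.phase] * (1 + step.semantic_load)


def plan_error_steps(steps: list[Step], k: int, seed: int = 0) -> list[Step]:
    """Sample up to k distinct steps, favouring those with higher propensity."""
    rng = random.Random(seed)
    pool = list(steps)
    chosen = []
    for _ in range(min(k, len(pool))):
        weights = [error_propensity(s) for s in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)  # sample without replacement
    return sorted(chosen, key=lambda s: s.index)


# Hypothetical procedure used only to exercise the sampler.
procedure = [
    Step(0, "Fill the kettle with water", "early", 2),
    Step(1, "Add coffee grounds to the filter", "middle", 3),
    Step(2, "Pour hot water over the grounds", "middle", 3),
    Step(3, "Switch off the kettle", "late", 1),
]
planned = plan_error_steps(procedure, k=2)
```

Sampling without replacement keeps each injected mistake on a distinct step, which is one simple way to keep the edited procedure easy to audit; whether PIE-V enforces this is not stated here.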

If this is right

Enables creation of large-scale mistake and recovery traces in egocentric procedural videos for model training.
Provides a consistent protocol to audit and compare existing datasets on mistake awareness.
Supports development of models that detect both step-level errors and full procedure failures.
Establishes a baseline showing advantages of structured psychology-informed injection over unstructured LLM generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could generalize to generate synthetic errors in non-egocentric instructional videos for domains like cooking tutorials or assembly tasks.
Adoption of the nine-metric rubric might encourage standardized benchmarks across future work on video-based error simulation.
Training on these augmented videos could improve robustness of AI assistants that intervene during user mistakes in real time.

Load-bearing premise

That the psychology-informed error planner together with LLM rewrites and text-guided video generation will reliably produce mistakes that are both human-plausible and procedurally coherent without introducing visual or logical artifacts.

What would settle it

A controlled human study in which raters compare PIE-V generated mistake segments against real human error recordings from the same tasks and find either lower plausibility scores or detectable editing artifacts and inconsistent object states.
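
One way such a study could be analyzed is a two-sided permutation test on the difference in mean plausibility ratings between generated and real segments. The sketch below is a minimal standard-library version; the function name, the 1 to 5 rating scale, and the placeholder ratings are assumptions invented to exercise the code and are not results from the paper.

```python
import random
from statistics import mean


def permutation_test(a: list[float], b: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value under the null hypothesis that the
    two groups of ratings are exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Placeholder ratings on an assumed 1-5 scale, invented purely as test input.
generated = [4, 3, 4, 5, 3, 4, 4, 3]
real = [4, 4, 5, 4, 3, 5, 4, 4]
p_value = permutation_test(generated, real, n_resamples=2000)
```

A small p-value would indicate that raters score the two sources differently; a large one would not by itself show that generated and real errors are equivalent, which would call for an equivalence test with a pre-registered margin.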

Figures

Figures reproduced from arXiv: 2604.15134 by Frank Keller, Olga Loginova.

**Figure 1.** PIE-V example on an Ego-Exo4D step from “Making Coffee Latte”: (A) reference step; (B) wrong execution with an observable …

**Figure 2.** PIE-V pipeline overview. Clean keystep procedures are enriched by (1) an error planner (psychology-informed, constrained by …

**Figure 3.** Example PIE-V simulation log for the Ego-Exo4D “In …

**Figure 1.** Annotation interface used for the rubric (example from the electronics task in EgoOops). Annotators are shown a …

Original abstract

Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.
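
The validate-and-repair role that the abstract assigns to the LLM judge can be pictured as a bounded loop. The sketch below shows only that control flow under stated assumptions: the `Judge` and `Repairer` interfaces, the retry budget, and the toy rule-based stand-ins for the LLM calls are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, Optional

Procedure = list[str]
# Hypothetical interfaces: a judge returns None when the procedure is coherent,
# or a textual complaint; a repairer rewrites the procedure given the complaint.
Judge = Callable[[Procedure], Optional[str]]
Repairer = Callable[[Procedure, str], Procedure]


def validate_and_repair(draft: Procedure, judge: Judge, repair: Repairer,
                        max_rounds: int = 3) -> tuple[Procedure, bool]:
    """Run judge/repair rounds until the judge accepts or the budget runs out."""
    current = list(draft)
    for _ in range(max_rounds):
        complaint = judge(current)
        if complaint is None:
            return current, True
        current = repair(current, complaint)
    # One last check after the final repair.
    return current, judge(current) is None


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_judge(steps: Procedure) -> Optional[str]:
    # Flag a procedure that pours water before the kettle has been filled.
    if "pour water" in steps:
        earlier = steps[:steps.index("pour water")]
        if "fill kettle" not in earlier:
            return "pour water occurs before fill kettle"
    return None


def toy_repair(steps: Procedure, complaint: str) -> Procedure:
    # Naive repair: move the missing prerequisite to the start.
    return ["fill kettle"] + [s for s in steps if s != "fill kettle"]


fixed, ok = validate_and_repair(
    ["pour water", "fill kettle", "drink"], toy_judge, toy_repair
)
```

Returning the acceptance flag alongside the procedure makes it easy to log how often repair was needed and how often the budget ran out, which is the kind of intervention count the referee report asks for.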

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PIE-V gives a workable pipeline for injecting psychology-based mistakes into egocentric procedure videos and supplies a nine-metric rubric to check them, but the reported results stay at the level of counts rather than measured quality.

The main takeaway is that this paper builds a concrete system for turning clean keystep videos into ones that contain controlled mistakes and recoveries. It starts with an error planner that draws on psychology to pick deviation types based on procedure phase and step complexity, adds a correction planner, runs LLM rewrites for consistency, uses an LLM judge to catch problems, and finishes with text-guided video synthesis to insert the edited segments. The authors applied it to 50 Ego-Exo4D scenarios across 17 tasks and produced 102 mistakes plus 27 recoveries, then compared the output against a freeform LLM baseline using their own rubric.

That rubric, which scores step-level and procedure-level traits such as plausibility, state-change coherence, and text-video grounding, is the clearest addition here and could be reused by others working on procedural monitoring. The taxonomy of mistakes also organizes the space in a way that matches how errors actually appear in egocentric views, where hands often hide the action. Together these pieces address a practical shortage in existing datasets, which mostly lack recovery traces.

The soft spots sit in the validation. The abstract and description give application numbers but no tabulated human scores, no breakdown of how often the LLM judge had to intervene, and no ablation that isolates the psychology conditioning or the video stitching step. Without those numbers it is difficult to know whether the generated clips avoid the usual artifacts in hand trajectories or object states. The fact that the same LLM family is used for planning, rewriting, judging, and generation also leaves open the possibility that coherence is partly self-reinforcing rather than independently verified.

This work is aimed at groups building models for mistake detection in robotics or human-AI collaboration, and at anyone who needs large volumes of labeled error data without filming new mistakes from scratch. A reader who wants a ready-to-adapt framework and an evaluation protocol will find usable material even if the current numbers are preliminary. It deserves peer review because the gap it targets is real and the components are described in enough detail to be tested by others, though any referee should ask for the missing quantitative results and failure cases.

Referee Report

3 major / 2 minor

Summary. The paper introduces PIE-V, a framework for constructing mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. It combines a psychology-informed error planner (conditioned on procedure phase and semantic step load), a correction planner, LLM cascade rewrites, an LLM judge for coherence validation, and text-guided video synthesis for segment edits. Applied to 17 tasks across 50 Ego-Exo4D scenarios, it generates 102 mistakes and 27 recovery corrections. The work also proposes a unified taxonomy and a nine-metric human rubric covering plausibility, procedure logic, state-change coherence, and text-video grounding, which is used to audit existing resources and compare PIE-V to a freeform LLM baseline.

Significance. If the generated videos prove to be artifact-free and human-plausible, the framework could provide a scalable method for creating synthetic training data for mistake detection and correction in procedural monitoring systems, addressing the scarcity of error traces in egocentric datasets. The introduction of a standardized nine-metric rubric offers a methodological contribution that could enable more consistent benchmarking across future work in this area.

major comments (3)

[Benchmarking and Evaluation] The abstract and framework description report the injection of 102 mistakes and 27 corrections along with the use of the nine-metric human rubric for auditing and baseline comparison, but supply no quantitative rubric scores, inter-annotator agreement, or statistical results from these evaluations. This is load-bearing for the claims of superiority over the freeform LLM baseline and overall utility for mistake-aware training.
[Generation Pipeline (§3)] The LLM judge is described as both validating procedural coherence during generation and repairing failures, yet no independent human validation or comparison against real human error distributions is reported. This risks circularity in the quality claims, as the same LLM-based mechanism certifies the outputs it helps produce.
[Framework Components] No ablation studies or component-wise analysis are described to test the contribution of the psychology-informed planner (conditioned on phase and semantic load) versus simpler freeform prompting, leaving the necessity of the full cascade unverified despite its central role in the framework.

minor comments (2)

[Abstract] The abstract states application to '17 tasks and 50 Ego-Exo4D scenarios' without detailing task selection criteria or diversity coverage, which would help readers assess generalizability.
[Video Synthesis] Clarify early in the text how the text-guided video synthesis step ensures consistency with egocentric viewpoint and hand occlusions, as this is critical for visual plausibility but only briefly mentioned.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that will strengthen the empirical support and transparency of the work.

Referee: [Benchmarking and Evaluation] The abstract and framework description report the injection of 102 mistakes and 27 corrections along with the use of the nine-metric human rubric for auditing and baseline comparison, but supply no quantitative rubric scores, inter-annotator agreement, or statistical results from these evaluations. This is load-bearing for the claims of superiority over the freeform LLM baseline and overall utility for mistake-aware training.

Authors: We agree that the absence of quantitative results weakens the presentation. The manuscript describes the rubric application and baseline comparison but does not report aggregated scores, agreement metrics, or statistical tests in the main text. In the revision we will add these results, including per-metric average scores for PIE-V versus the freeform baseline, inter-annotator agreement (e.g., Krippendorff’s alpha), and significance tests, presented in a dedicated evaluation table to directly substantiate the superiority claims.

revision: yes

Referee: [Generation Pipeline (§3)] The LLM judge is described as both validating procedural coherence during generation and repairing failures, yet no independent human validation or comparison against real human error distributions is reported. This risks circularity in the quality claims, as the same LLM-based mechanism certifies the outputs it helps produce.

Authors: We clarify the separation of concerns: the LLM judge is an internal generation tool for coherence checking and repair, while all quality claims rest on a separate human evaluation protocol using the nine-metric rubric applied by independent annotators. This human rubric directly assesses plausibility, procedure logic, state-change coherence, and text-video grounding. We acknowledge that a direct quantitative comparison to real human error distributions is not reported; such annotated traces remain scarce in existing egocentric datasets, which motivates the framework. In revision we will explicitly distinguish the LLM judge from the human rubric and add a limitations paragraph on the lack of real-distribution benchmarks.

revision: partial

Referee: [Framework Components] No ablation studies or component-wise analysis are described to test the contribution of the psychology-informed planner (conditioned on phase and semantic load) versus simpler freeform prompting, leaving the necessity of the full cascade unverified despite its central role in the framework.

Authors: The existing comparison to the freeform LLM baseline already functions as a test of the psychology-informed planner’s contribution, since the baseline uses simpler prompting without phase or semantic-load conditioning. In the revision we will reframe this comparison as an explicit ablation study, adding a subsection that reports rubric-score differences attributable to the planner and discusses the incremental value of the correction planner and cascade rewrites.

revision: yes
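
For readers unfamiliar with the agreement statistic named in the first response, the sketch below computes Krippendorff's alpha with the interval (squared-difference) metric using only the standard library. Treating rubric ratings as interval data is an assumption; an ordinal distance may suit Likert-style scores better, and the function name and input layout are illustrative rather than taken from the paper.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` holds one list of ratings per rated item; items may have
    different numbers of raters. Items with fewer than two ratings are
    ignored because they contribute no pairable values.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences within each unit,
    # each unit weighted by 1 / (number of its ratings - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences across all pairable values.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e
```

Alpha equals 1 when raters agree perfectly on every item and falls to 0 or below when within-item disagreement matches or exceeds what chance pairing of all ratings would produce.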

standing simulated objections not resolved

A direct quantitative comparison of generated mistakes to real human error distributions remains outstanding, because sufficiently annotated real error traces are not available in the source egocentric datasets.

Circularity Check

0 steps flagged

No circularity: constructive framework with external human rubric validation

The paper presents PIE-V as a methodological framework for augmenting procedural videos with injected mistakes using a psychology-informed planner, LLM cascade rewrites, correction planner, LLM judge for internal validation, and text-guided video synthesis. Final claims rest on application to 17 tasks yielding 102 mistakes and 27 recoveries, audited via a separate nine-metric human rubric covering plausibility, coherence, and grounding, plus comparison to a freeform LLM baseline. No equations, fitted parameters, or derivations exist that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The LLM judge operates internally during generation but does not substitute for the independent human evaluation protocol. This is a self-contained constructive contribution without self-definitional loops or renamed empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on domain assumptions about human error patterns and LLM reliability for coherence checking; no free parameters or new physical entities are introduced.

axioms (2)

Domain assumption: Human errors in procedural tasks follow predictable patterns based on procedure phase and semantic step load.
Used to condition the error planner in the PIE-V pipeline.

Domain assumption: LLMs can perform cascade-consistent rewrites and validate procedural coherence.
Central to the LLM writer and judge components.

invented entities (1)

PIE-V framework (no independent evidence)
purpose: To systematically inject and benchmark mistakes in egocentric procedural videos
The proposed end-to-end system combining planners, LLM modules, and video synthesis.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

[1]

The use of rating and Likert scales in natural language generation hu- man evaluation tasks: A review and some recommendations

Jacopo Amidei, Paul Piwek, and Alistair Willis. The use of rating and Likert scales in natural language generation hu- man evaluation tasks: A review and some recommendations. InProceedings of the 12th International Conference on Nat- ural Language Generation, pages 397–402, Tokyo, Japan,

work page
[2]

Association for Computational Linguistics. 3

work page
[3]

Konstantinos Bacharidis and Antonis A. Argyros. Vision- based mistake analysis in procedural activities: A review of advances and challenges, 2025. 1

work page 2025
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Siddhant Bansal, Chetan Arora, and C. V . Jawahar. My view is the best view: Procedure learning from egocentric videos,

work page
[6]

Predicting implicit arguments in procedural video in- structions

Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, and Frank Keller. Predicting implicit arguments in procedural video in- structions. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30399–30419, Vienna, Austria, 2025. Asso- ciation for Computational Linguistics. 3

work page 2025
[7]

Byrne and Susan Bovair

Michael D. Byrne and Susan Bovair. A working memory model of a common procedural error.Cogn. Sci., 21:31–61,

work page
[8]

Byrne and Elizabeth M

Michael D. Byrne and Elizabeth M. Davis. Task structure and postcompletion error in the execution of a routine proce- dure.Human Factors, 48(4):627–638, 2006. 5, 8

work page 2006
[9]

The epic-kitchens dataset: Collection, chal- lenges and baselines, 2020

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The epic-kitchens dataset: Collection, chal- lenges and baselines, 2020. 3

work page 2020
[10]

Fred J. Damerau. A technique for computer detection and correction of spelling errors.Communications of the ACM, 7:171 – 176, 1964. 2

work page 1964
[11]

Veo 3.1 technical report

Google DeepMind. Veo 3.1 technical report. Technical re- port, Google, 2026. Accessed: 2026-02-24. 7

work page 2026
[12]

Every mistake counts in assembly.ArXiv, abs/2307.16453,

Guodong Ding, Fadime Sener, Shugao Ma, and Angela Yao. Every mistake counts in assembly.ArXiv, abs/2307.16453,

work page arXiv
[13]

Farinella, and Fabio Galasso

Alessandro Flaborea, Guido Maria D’Amely di Melen- dugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, G. Farinella, and Fabio Galasso. Prego: Online mistake detection in procedural egocentric videos. 2024 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 18483–18492, 2024. 1

work page 2024
[14]

Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos

Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, and Be- hzad Dariush. Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. 2023 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 10094–10104, 2023. 2, 3

work page 2023
[15]

Automatic labeling of semantic roles.Computational Linguistics, 28(3):245–288,

Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles.Computational Linguistics, 28(3):245–288,

work page
[16]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

work page 2024
[17]

Procedural mis- take detection via action effect modeling, 2025

Wenliang Guo, Yujiang Pu, and Yu Kong. Procedural mis- take detection via action effect modeling, 2025. 1

work page 2025
[18]

Egooops: A dataset for mistake action detection from ego- centric videos referring to procedural texts, 2025

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, and Shinsuke Mori. Egooops: A dataset for mistake action detection from ego- centric videos referring to procedural texts, 2025. 1, 2, 7, 3, 4

work page 2025
[19]

Sullivan, Casimir J

Youngkyoon Jang, Brian T. Sullivan, Casimir J. H. Ludwig, Iain D. Gilchrist, Dima Damen, and W. Mayol-Cuevas. Epic- tent: An egocentric video dataset for camping tent assembly. 2019 IEEE/CVF International Conference on Computer Vi- sion Workshop (ICCVW), pages 4461–4469, 2019. 3

work page 2019
[20]

Carpenter

Marcel Adam Just and Patricia A. Carpenter. A capacity theory of comprehension: individual differences in working memory.Psychological review, 99 1:122–49, 1992. 8, 5

work page 1992
[21]

Anyv2v: A tuning-free framework for any video-to- video editing tasks, 2024

Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to- video editing tasks, 2024. 12

work page 2024
[22]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, and Ehsan Elhamifar. Error detection in egocentric procedural task videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18655–18666, 2024. 7, 2, 3, 4

work page 2024
[23]

Levenshtein

Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals.Soviet physics. Doklady, 10:707–710, 1965. 2

work page 1965
[24]

Egoedit: Dataset, real-time streaming model, and benchmark for egocentric video editing, 2025

Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siaro- hin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, and Willi Menapace. Egoedit: Dataset, real-time streaming model, and benchmark for egocentric video editing, 2025. 12

work page 2025
[25]

Yayuan Li, Aadit Jain, Filippos Bellos, and Jason J. Corso. Mistake attribution: Fine-grained mistake understanding in egocentric videos, 2025. 1, 3, 8

work page 2025
[26]

Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jian- feng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. 7

work page 2024
[27]

Easyv2v: A high-quality instruction-based video editing framework, 2025

Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Pe- ter Wonka, and Ashkan Mirzaei. Easyv2v: A high-quality instruction-based video editing framework, 2025. 12

work page 2025
[28]

50 salads.https: //discovery.dundee.ac.uk/en/datasets/50- salads/, 2012

Stephen McKenna and Sebastian Stein. 50 salads.https: //discovery.dundee.ac.uk/en/datasets/50- salads/, 2012. 3

work page 2012
[29]

Learning and verification of task struc- ture in instructional videos, 2023

Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, and Trevor Darrell. Learning and verification of task struc- ture in instructional videos, 2023. 1

work page 2023
[30]

Norman.The Psychology of Everyday Things

Donald A. Norman.The Psychology of Everyday Things. Basic Books, New York, 1988. 5, 8, 6

work page 1988
[31]

Cognitive load theory and instructional design: Recent developments

Fred Paas, Alexander Renkl, and John Sweller. Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38:1 – 4, 2003. 8, 5

work page 2003
[32]

The Proposition Bank: An annotated corpus of semantic roles

Martha Palmer, Daniel Gildea, and Paul Kingsbury. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, 2005. 2, 8

work page 2005
[33]

Captaincook4d: A dataset for understanding errors in procedural activities,

Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pal- lapothula, Akshay Vyas, Bhavya Gouripeddi, Jikai Wang, Qifan Zhang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for understanding errors in procedural activities,

work page
[34]

Svip: Sequence verification for procedures in videos, 2022

Yicheng Qian, Weixin Luo, Dongze Lian, Xu Tang, Peilin Zhao, and Shenghua Gao. Svip: Sequence verification for procedures in videos, 2022. 2, 3

work page 2022
[35]

The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020

Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020. 2, 3

work page 2020
[36]

Cambridge University Press,

James Reason.Human Error. Cambridge University Press,

work page
[37]

Introducing Runway Gen-4.https: //runwayml.com/research/introducing- runway-gen-4, 2025

Runway Research. Introducing Runway Gen-4.https: //runwayml.com/research/introducing- runway-gen-4, 2025. Accessed: 2025-03-01. 7

work page 2025
[38]

Schoonbeek, Tim Houben, Hans Onvlee, Peter H

Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H. N. de With, and Fons van der Sommen. Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting, 2023. 2, 3

work page 2023
[39]

Seedance 1.5 pro: A native audio-visual joint generation foundation model, 2025

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yan- fei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qin- peng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing H...

work page 2025
[40]

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. As- sembly101: A large-scale multi-view video dataset for un- derstanding procedural activities.2022 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 21064–21074, 2022. 7, 2, 3

work page 2022
[41]

Openai gpt-5 system card, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Alek- sandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexan- dra Barr, Alexandre Kirchmeyer,...

work page 2025
[42]

Corso, and Joyce Chai

Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, and Joyce Chai. Transparent and co- herent procedural mistake detection, 2025. 1

work page 2025
[43]

Tamborello and J

Franklin P. Tamborello and J. Gregory Trafton. A long-term memory competitive process model of a common procedural error.Cognitive Science, 35, 2013. 2, 6, 8

work page 2013
[44]

Kling-omni technical report, 2025

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Ji- ajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao,...

work page 2025
[45]


[1] Jacopo Amidei, Paul Piwek, and Alistair Willis. The use of rating and Likert scales in natural language generation human evaluation tasks: A review and some recommendations. In Proceedings of the 12th International Conference on Natural Language Generation, pages 397–402, Tokyo, Japan, 2019. Association for Computational Linguistics. 3

[3] Konstantinos Bacharidis and Antonis A. Argyros. Vision-based mistake analysis in procedural activities: A review of advances and challenges, 2025. 1

[4] Qwen Technical Report. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...

[5] Siddhant Bansal, Chetan Arora, and C. V. Jawahar. My view is the best view: Procedure learning from egocentric videos,

[6] Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, and Frank Keller. Predicting implicit arguments in procedural video instructions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30399–30419, Vienna, Austria, 2025. Association for Computational Linguistics. 3

[7] Michael D. Byrne and Susan Bovair. A working memory model of a common procedural error. Cogn. Sci., 21:31–61,

[8] Michael D. Byrne and Elizabeth M. Davis. Task structure and postcompletion error in the execution of a routine procedure. Human Factors, 48(4):627–638, 2006. 5, 8

[9] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The EPIC-KITCHENS dataset: Collection, challenges and baselines, 2020. 3

[10] Fred J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7:171–176, 1964. 2

[11] Google DeepMind. Veo 3.1 technical report. Technical report, Google, 2026. Accessed: 2026-02-24. 7

[12] Guodong Ding, Fadime Sener, Shugao Ma, and Angela Yao. Every mistake counts in assembly. ArXiv, abs/2307.16453,

[13] Alessandro Flaborea, Guido Maria D’Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, G. Farinella, and Fabio Galasso. PREGO: Online mistake detection in procedural egocentric videos. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18483–18492, 2024. 1

[14] Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, and Behzad Dariush. Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10094–10104, 2023. 2, 3

[15] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288,

[16] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

[17] Wenliang Guo, Yujiang Pu, and Yu Kong. Procedural mistake detection via action effect modeling, 2025. 1

[18] Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, and Shinsuke Mori. EgoOops: A dataset for mistake action detection from egocentric videos referring to procedural texts, 2025. 1, 2, 7, 3, 4

[19] Youngkyoon Jang, Brian T. Sullivan, Casimir J. H. Ludwig, Iain D. Gilchrist, Dima Damen, and W. Mayol-Cuevas. EPIC-Tent: An egocentric video dataset for camping tent assembly. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 4461–4469, 2019. 3

[20] Marcel Adam Just and Patricia A. Carpenter. A capacity theory of comprehension: individual differences in working memory. Psychological Review, 99(1):122–49, 1992. 8, 5

[21] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. AnyV2V: A tuning-free framework for any video-to-video editing tasks, 2024. 12

[22] Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, and Ehsan Elhamifar. Error detection in egocentric procedural task videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18655–18666, 2024. 7, 2, 3, 4

[23] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1965. 2

[24] Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, and Willi Menapace. EgoEdit: Dataset, real-time streaming model, and benchmark for egocentric video editing, 2025. 12

[25] Yayuan Li, Aadit Jain, Filippos Bellos, and Jason J. Corso. Mistake attribution: Fine-grained mistake understanding in egocentric videos, 2025. 1, 3, 8

[26] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. 7

[27] Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, and Ashkan Mirzaei. EasyV2V: A high-quality instruction-based video editing framework, 2025. 12

[28] Stephen McKenna and Sebastian Stein. 50 salads. https://discovery.dundee.ac.uk/en/datasets/50-salads/, 2012. 3

[29] Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, and Trevor Darrell. Learning and verification of task structure in instructional videos, 2023. 1

[30] Donald A. Norman. The Psychology of Everyday Things. Basic Books, New York, 1988. 5, 8, 6

[31] Fred Paas, Alexander Renkl, and John Sweller. Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38:1–4, 2003. 8, 5

[32] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, 2005. 2, 8

[33] Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Jikai Wang, Qifan Zhang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. CaptainCook4D: A dataset for understanding errors in procedural activities,

[34] Yicheng Qian, Weixin Luo, Dongze Lian, Xu Tang, Peilin Zhao, and Shenghua Gao. SVIP: Sequence verification for procedures in videos, 2022. 2, 3

[35] Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The MECCANO dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain, 2020. 2, 3

[36] James Reason. Human Error. Cambridge University Press,

[37] Runway Research. Introducing Runway Gen-4. https://runwayml.com/research/introducing-runway-gen-4, 2025. Accessed: 2025-03-01. 7

[38] Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H. N. de With, and Fons van der Sommen. IndustReal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting, 2023. 2, 3

[39] Seedance 1.5 pro: A native audio-visual joint generation foundation model, 2025. Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing H...

[40] Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21064–21074, 2022. 7, 2, 3

[41] OpenAI GPT-5 system card, 2025. Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer,...

[42] Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, and Joyce Chai. Transparent and coherent procedural mistake detection, 2025. 1

[43] Franklin P. Tamborello and J. Gregory Trafton. A long-term memory competitive process model of a common procedural error. Cognitive Science, 35, 2013. 2, 6, 8

[44] Kling-omni technical report, 2025. Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao,...

[45] Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan, 2019. Association for Computational Linguistics. 3

[46] T. W. van der Schaaf and Lisette Kanse. Error recovery in socio-technical systems. In 7th European Conference on Cognitive Science Approaches to Process Control (CSAPC ’99), Villeneuve d’Ascq, France, pages 151–156. Presses Universitaires de Valenciennes, 1999. 8

[47] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 7

[48] Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, D. Bohus, Ashley Feniello, Bugra Tekin, F. Frujeri, Neel Joshi, and Marc Pollefeys. HoloAssist: an egocentric human interaction dataset for interactive AI assistants in the real world. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20213–20224, 2023. 2, 3

[49] Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations, 2023. 1

How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

Supplementary Material

A. Egocentric Procedural Video Datase...

[50] attach a binary mistake flag to specific action segments and further distinguish structural errors such as misordering or redundant steps, along with incorrect attachment of parts. Notably, this benchmark also marks accumulating mistakes and corrective steps (detaching incorrectly attached parts) with a special label. A smaller number of datasets introduc...

[51] that full recipe understanding is multimodal rather than purely visual. Even so, some dataset steps have very similar textual descriptions but different step IDs. For example, “Take 1 tomato” (step_id: 149) versus “Take a tomato” (step_id: 247), or “Peel 1 garlic cloves” (step_id: 200) versus “Peel 1 garlic clove” (step_id: 14). Such steps are visually in...

[52] Insert the test swab into her nostril INSERT(Agent: you, Object: test_swab, Destination: into(nostril(of(her))))

[53] Add coffee grounds from a bowl to the filter in the French press ADD(Agent: you, Object: coffee_grounds, Origin: bowl, Destination: filter(Location: in(french_press)))

[54] Add cut onions to the egg in the mixing bowl ADD(Agent: you, Object: cut(onions), Coobject: egg(Location: in(mixing_bowl)))

Return only JSON that matches the schema.

SemRep format and parsing. The semantic representation format is designed for controllable procedural editing rather than full semantic parsing. Predicates are uppercase action labels, and r...
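The SemRep strings quoted in the entries above follow a visible pattern: an uppercase predicate, then role-labelled arguments whose values may themselves be nested terms. As an illustration only, the Python sketch below reads such strings with a small recursive-descent parser. The names `Term`, `tokenize`, and `parse_semrep` are hypothetical, and the grammar is inferred from the three quoted examples rather than taken from the authors' implementation, whose description is truncated here.

```python
import re
from dataclasses import dataclass, field

# Assumed token set: names (letters, digits, underscores) and the four
# punctuation marks that occur in the quoted SemRep strings.
_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[():,])")
_PUNCT = {"(", ")", ":", ","}


@dataclass
class Term:
    """One term: a head plus role-labelled and unlabelled children."""
    head: str
    roles: dict = field(default_factory=dict)  # e.g. {"Object": Term("test_swab")}
    args: list = field(default_factory=list)   # unlabelled nesting, e.g. cut(onions)


def tokenize(text):
    """Split a SemRep string into names and punctuation tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _peek(tokens, i):
    return tokens[i] if i < len(tokens) else None


def _parse_term(tokens, i):
    """Parse one term starting at index i; return (term, next index)."""
    head = _peek(tokens, i)
    if head is None or head in _PUNCT:
        raise ValueError(f"expected a name at token {i}")
    term, i = Term(head), i + 1
    if _peek(tokens, i) != "(":
        return term, i                      # bare atom such as `you` or `bowl`
    i += 1                                  # consume "("
    while _peek(tokens, i) != ")":
        if _peek(tokens, i) is None:
            raise ValueError("unbalanced parentheses")
        if _peek(tokens, i + 1) == ":":     # role-labelled argument
            role = tokens[i]
            child, i = _parse_term(tokens, i + 2)
            term.roles[role] = child
        else:                               # unlabelled nested term
            child, i = _parse_term(tokens, i)
            term.args.append(child)
        if _peek(tokens, i) == ",":
            i += 1
    return term, i + 1                      # consume ")"


def parse_semrep(text):
    """Parse a full SemRep string and check the uppercase-predicate convention."""
    tokens = tokenize(text)
    term, end = _parse_term(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the top-level predicate")
    if not term.head.isupper():
        raise ValueError("top-level predicate is expected to be uppercase")
    return term


step = parse_semrep(
    "ADD(Agent: you, Object: cut(onions), Coobject: egg(Location: in(mixing_bowl)))"
)
print(step.head, sorted(step.roles))
```

Keeping role-labelled children in a dictionary and unlabelled children in a list is one possible design; it makes a role such as `Object` directly addressable, which is the kind of targeted edit the format is described as supporting.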