
arxiv: 2503.13821 · v3 · submitted 2025-03-18 · 💻 cs.CV

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Pith reviewed 2026-05-23 00:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video retrieval · multistep text · clip stitching · instructional videos · weakly supervised · hard negatives · video demonstration

The pith

Stitch-a-Demo assembles coherent video demonstrations by stitching clips that match each step in a multistep description.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Stitch-a-Demo is a retrieval-based approach designed to generate video demonstrations from multistep text descriptions such as recipes or instruction manuals. Unlike prior work limited to single-step captions, it retrieves clips that correspond to every step and combines them into one video. A special training pipeline builds large weakly supervised datasets of procedures and adds hard negative examples to encourage both step accuracy and visual coherence between clips. This matters to a reader because it provides a way to automatically create visual how-to videos from written multistep guides without recording new content.

Core claim

We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence.

What carries the argument

Retrieval-based stitching trained with weakly supervised multistep data and hard negatives to ensure step matching and visual coherence.
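
As a rough illustration of why hard negatives matter, here is a minimal sketch of a contrastive objective over candidate demonstrations. It assumes a scalar score per candidate and a softmax cross-entropy (InfoNCE-style) loss; the summary above does not state the paper's actual loss, so `contrastive_loss`, the temperature, and all scores are hypothetical.

```python
import math

def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """Softmax cross-entropy with the correct demonstration as the target.

    Hypothetical stand-in for the training objective: the positive
    candidate competes against negatives in a single softmax.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# Easy negatives score far below the positive, so the loss is small.
easy = contrastive_loss(2.0, [-1.0, -1.5, -2.0])
# Hard negatives (e.g. a plausible but wrong clip) score close to the
# positive and produce a much larger loss, hence a stronger training signal.
hard = contrastive_loss(2.0, [1.8, 1.9, -2.0])
```

Under these assumptions, negatives that nearly match the positive dominate the loss, which is the usual motivation for injecting them.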

If this is right

  • Multistep descriptions receive visual illustrations in the form of a single assembled video.
  • The method maintains accuracy to each step description individually.
  • Visual coherence holds across clips sourced from separate videos.
  • State-of-the-art performance is reached on in-the-wild instructional videos, with gains of up to 29%.
  • Human preference studies show strong preference for the generated demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coherence training could transfer to sequencing other types of media like images or text segments.
  • Hard negative sampling may address consistency issues in related retrieval problems involving ordered data.
  • Real-world applications might include generating demos for DIY projects or software tutorials from user text.

Load-bearing premise

Clips retrieved from different sources can be assembled into a single video that remains visually coherent while accurately reflecting every step description in the multistep input.

What would settle it

The method's effectiveness would be disproved if, on a test set of multistep instructions, many stitched videos showed abrupt visual changes between clips or omitted key elements of a step description.
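
One way a check for abrupt visual changes could be sketched: compare feature vectors of adjacent clips and flag transitions whose cosine similarity falls below a threshold. This is an assumption-laden illustration, not the paper's evaluation protocol; the per-clip features, the `abrupt_transitions` helper, and the 0.5 threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def abrupt_transitions(clip_features, threshold=0.5):
    """Return indices i where the cut from clip i to clip i+1 looks abrupt.

    clip_features: one feature vector per clip, in stitched order
    (hypothetical; any visual encoder output could play this role).
    """
    return [
        i
        for i in range(len(clip_features) - 1)
        if cosine(clip_features[i], clip_features[i + 1]) < threshold
    ]

# Toy features: clips 0 and 1 look alike, clip 2 looks very different.
demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
flagged = abrupt_transitions(demo)  # flags the cut between clips 1 and 2
```

The fraction of flagged cuts over a test set would give one crude coherence statistic; step omissions would need a separate text-to-clip check.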

Figures

Figures reproduced from arXiv: 2503.13821 by Chi Hsuan Wu, Kristen Grauman, Kumar Ashutosh.

Figure 1
Figure 1: Video demonstration from multistep descriptions. Given multistep descriptions (left) aiming to achieve a procedural task, e.g. making vegan taco, our method obtains clips from thousands of instructional videos to visually demonstrate the procedure (right). The goal is for every clip to correctly describe a step, while maintaining visual consistency. Our proposed method goes beyond current retrieval and gen…
Figure 2
Figure 2: Overview of the method. The videos and the step descriptions in C are used to create a procedure mapping M, using step localization F_T. The procedure query R and M give video candidates V'_R. The procedure evaluator F_R outputs the likelihood of each candidate.
…trieving visually and logically coherent video demonstrations from sequential step descriptions, as we tackle in this work. Furthermore, unlike […
Figure 3
Figure 3: Examples of hard negatives and procedure combination. We design negative samples that violate step correctness, visual continuity, and object state continuity (left). We show an example of combining step descriptions from n (here n = 2) video demonstrations into a novel procedure, using an LLM [19] (right). The novel procedure mixes steps from both descriptions.
…clips in the video collection C. Next, we pro…
Figure 4
Figure 4: Qualitative results. Our method correctly visualizes the step descriptions (top), compared to prior work. The second to fourth rows show representative outputs in cooking, woodworking, and gardening. Our method correctly shows video clips from two video sources. Each video source alone cannot correctly demonstrate all the step descriptions. The last row contains some failure cases, showing the d…
Figure 5
Figure 5: Search space reduction. Using the effective set cover algorithm, the ground truth (GT) is captured in the candidate set with high probability, even with small sample set sizes. See text.
We evaluate all the methods on four axes: step faithfulness, goal faithfulness, visual quality, and overall preference. Every sample is annotated by three subjects unrelated to this project. We compare two methods at a ti…
Figure 6
Figure 6: Result on distractor set splits. Our model performs competitively on all splits, particularly the more challenging RS, Other-pos, and Sim-match.

Method                          MedR↓   R@1↑   R@5↑
w/o augmentation                70      0.03   0.14
Temporally-sampled procedures   10      0.18   0.42
Weakly supervised D_w (ours)    3.5     0.23   0.56

Cor Con OSC MedR↓ R@1↑ R@5↑ ✓ 9 0.17 0.41 ✓ 171 0 0.03 ✓ 11 0.11 0.39 ✓ ✓ 9 0.18 0.45 ✓ ✓ 15 0.07 0.21 ✓ ✓ 5 0.15 0.54 ✓ ✓ ✓ 5 0…
Figure 7
Figure 7: Human preference study interface instructions. We provide examples of all axes for human preference study: step faithfulness, goal faithfulness, visual quality and the overall preference.
Sec. C, we introduce the distractor set components. We evaluate the performance with each component of the distractor set. For example, we evaluate the retrieval performance with 99 negative samples from ‘Random mix-n-m…
Figure 8
Figure 8: Human preference study submission form. The video in the interface shows both the candidate procedures side by side, and the step description is shown below them. The video is followed by four questions, asking about each axis, and the result is saved as a CSV file.
…tional complexity, given a procedure query with M steps and a pool of N videos with K clips each, our method selects the top S clips per step…
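
The MedR, R@1, and R@5 columns quoted in the Figure 6 excerpt are standard retrieval metrics. The sketch below shows how they are conventionally computed from the rank of the ground-truth item per query; the `retrieval_metrics` helper and the example ranks are hypothetical and are not taken from the paper.

```python
from statistics import median

def retrieval_metrics(ranks, ks=(1, 5)):
    """Compute median rank and recall@k.

    ranks: 1-based rank of the ground-truth demonstration for each query
    (rank 1 means it was retrieved first).
    """
    metrics = {"MedR": median(ranks)}  # lower is better
    for k in ks:
        # Fraction of queries whose ground truth lands in the top k.
        metrics[f"R@{k}"] = sum(r <= k for r in ranks) / len(ranks)
    return metrics

# Six hypothetical queries.
ranks = [1, 3, 4, 7, 12, 2]
metrics = retrieval_metrics(ranks)
# metrics["MedR"] is 3.5, the mean of the two middle ranks (3 and 4)
```

With an even number of queries the median falls between two ranks, which is one way a fractional MedR can arise.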
Original abstract

When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context - a caption, or an action description - and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Stitch-a-Demo, a retrieval-based pipeline that assembles coherent video demonstrations from multistep text inputs (e.g., recipes) by retrieving and stitching clips from diverse sources. It introduces a weakly-supervised training procedure that injects hard negatives to enforce both per-step accuracy and cross-clip visual coherence, and reports state-of-the-art results on in-the-wild instructional videos together with large gains in a human preference study.

Significance. If the quantitative claims hold, the work would advance retrieval-based video synthesis for complex procedural content, offering a practical alternative to generative models when source footage already exists. The emphasis on coherence across independently sourced clips addresses a clear gap in current single-caption retrieval methods.

major comments (1)
  1. [Abstract] The central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. The omission leaves the primary performance assertion, which is load-bearing for the paper's contribution, unverifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting this important point about the abstract. We address the comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. The omission leaves the primary performance assertion, which is load-bearing for the paper's contribution, unverifiable.

    Authors: We acknowledge that the abstract, constrained by length, omits specifics on the evaluation protocol, baselines, datasets, metrics, and tables. The full manuscript details these in the Experiments section (including in-the-wild instructional video datasets, comparison baselines, quantitative metrics yielding up to 29% gains, and the human preference study protocol with results). To improve self-containment of the abstract while preserving its brevity, we will revise it to include a concise reference to the evaluation setting, key metrics, and human study.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper describes a retrieval-based method and training pipeline for assembling video demonstrations from multistep text descriptions, using weakly supervised data creation and hard negatives to promote accuracy and coherence. No equations, fitted parameters, self-citations, or derivation steps are present that reduce any claimed result to its own inputs by construction. The approach is presented as an independent engineering contribution validated on external in-the-wild videos, with no load-bearing self-referential definitions or renamings of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)
  • domain assumption: Clips from heterogeneous sources can be stitched while preserving visual coherence and step accuracy.
    Implicit premise required for the retrieval-and-assembly approach to succeed.

pith-pipeline@v0.9.0 · 5696 in / 1089 out tokens · 57183 ms · 2026-05-23T00:25:25.107305+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 6 internal anchors

  1. [1]

    Gepsan: Generative procedure step anticipation in cooking videos

    Mohamed A Abdelsalam, Samrudhdhi B Rangrej, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, and Afsaneh Fazly. Gepsan: Generative procedure step anticipation in cooking videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2988–2997,

  2. [2]

    Ht-step: Aligning instructional articles with how-to videos

    Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. In NeurIPS, 2023.

  3. [3]

    Hiervl: Learning hierarchical video-language embeddings

    Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23066–23078, 2023.

  4. [4]

    Video-mined task graphs for keystep recognition in instructional videos

    Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. In Advances in Neural Information Processing Systems, pages 67833–67846. Curran Associates, Inc., 2023.

  5. [5]

    Video-mined task graphs for keystep recognition in instructional videos

    Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36, 2024.

  6. [6]

    Detours for navigating instructional videos

    Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, and Kristen Grauman. Detours for navigating instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18804–18815,

  7. [7]

    United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos

    Siddhant Bansal, Chetan Arora, and CV Jawahar. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024.

  8. [8]

    Procedure planning in instructional videos via contextual modeling and model-based policy learning

    Jing Bi, Jiebo Luo, and Chenliang Xu. Procedure planning in instructional videos via contextual modeling and model-based policy learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15611–15620, 2021.

  9. [9]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.

  10. [10]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  11. [11]

    Procedure planning in instructional videos

    Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020.

  12. [12]

    Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss

    Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, and Dong Shen. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint arXiv:2109.04290, 2021.

  13. [13]

    Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022.

  14. [14]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  15. [15]

    Drop-DTW: Aligning common signal between sequences while dropping outliers

    Mikita Dvornik, Isma Hadji, Konstantinos G Derpanis, Animesh Garg, and Allan Jepson. Drop-DTW: Aligning common signal between sequences while dropping outliers. Advances in Neural Information Processing Systems, 34:13782–13793, 2021.

  16. [16]

    Flow graph to video grounding for weakly-supervised multi-step localization

    Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 319–335. Springer, 2022.

  17. [17]

    Clip2video: Mastering video-text retrieval via image clip

    Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.

  18. [18]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.

  19. [19]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

  20. [20]

    Anticipative video transformer

    Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13505–13515, 2021.

  21. [21]

    Omnivore: A single model for many visual modalities

    Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.

  22. [22]

    Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. In European Conference on Computer Vision, pages 205–224. Springer, 2024.

  23. [23]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

  24. [24]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024.

  25. [25]

    Temporal alignment networks for long-term video

    Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022.

  26. [26]

    Instruct-imagen: Image generation with multi-modal instruction

    Hexiang Hu, Kelvin CK Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, et al. Instruct-imagen: Image generation with multi-modal instruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4754–4763, 2024.

  27. [27]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17980–17989, 2022.

  28. [28]

    Epic-sounds: A large-scale dataset of actions that sound

    Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. Epic-sounds: A large-scale dataset of actions that sound. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  29. [29]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

  30. [30]

    Lego: Learning egocentric action frame generation via visual instruction tuning

    Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M Rehg, and Miao Liu. Lego: Learning egocentric action frame generation via visual instruction tuning. In European Conference on Computer Vision, pages 135–155. Springer, 2024.

  31. [31]

    Uniformer: Unifying convolution and self-attention for visual recognition

    Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022.

  32. [32]

    Hero: Hierarchical encoder for video+ language omni-representation pre-training

    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+ language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.

  33. [33]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.

  34. [34]

    Mvitv2: Improved multiscale vision transformers for classification and detection

    Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.

  35. [35]

    Set covering problem

    Sherry Liang, Khalid Alanazi, and Kumail Al Hamoud. Set covering problem. Cornell University Computational Optimization Open Textbook. Cornell University, [online document], 2020.

  36. [36]

    Egocentric video-language pretraining

    Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. In NeurIPS, 2022.

  37. [37]

    Learning to recognize procedural activities with distant supervision

    Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, and Lorenzo Torresani. Learning to recognize procedural activities with distant supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13853–13863, 2022.

  38. [38]

    Text-driven image editing via learnable regions

    Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, and Ming-Hsuan Yang. Text-driven image editing via learnable regions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7059–7068,

  39. [39]

    Univl: A unified video and language pre-training model for multimodal understanding and generation

    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.

  40. [40]

    Learning to ground instructional articles in videos through narrations

    Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15201–15213,

  41. [41]

    Generating illustrated instructions

    Sachit Menon, Ishan Misra, and Rohit Girdhar. Generating illustrated instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6274–6284, 2024.

  42. [42]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.

  43. [43]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020.

  44. [44]

    Step differences in instructional video

    Tushar Nagarajan and Lorenzo Torresani. Step differences in instructional video. In CVPR, 2024.

  45. [45]

    Grit: Faster and better image captioning transformer using dual visual features

    Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. Grit: Faster and better image captioning transformer using dual visual features. arXiv preprint arXiv:2207.09666, 2022.

  46. [46]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

  47. [47]

    Egovlpv2: Egocentric video-language pre-training with fusion in the backbone

    Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023.

  48. [48]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  49. [49]

    Action scene graphs for long-form understanding of egocentric videos

    Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, and Giovanni Maria Farinella. Action scene graphs for long-form understanding of egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18622–18632, 2024.

  50. [50]

    Inverse cooking: Recipe generation from food images

    Amaia Salvador, Michal Drozdzal, Xavier Giro i Nieto, and Adriana Romero. Inverse cooking: Recipe generation from food images. In CVPR, 2019.

  51. [51]

    Transferring knowledge from text to video: Zero-shot anticipation for procedural actions

    Fadime Sener, Rishabh Saraf, and Angela Yao. Transferring knowledge from text to video: Zero-shot anticipation for procedural actions. IEEE transactions on pattern analysis and machine intelligence, 45(6):7836–7852, 2022.

  52. [52]

    Mpnet: Masked and permuted pre-training for language understanding

    Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.

  53. [53]

    Showhowto: Generating scene-conditioned step-by-step visual instructions

    Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. Showhowto: Generating scene-conditioned step-by-step visual instructions. arXiv preprint arXiv:2412.01987, 2024.

  54. [54]

    Genhowto: Learning to generate actions and state transformations from instructional videos

    Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. Genhowto: Learning to generate actions and state transformations from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  55. [55]

    The impact of video technology on learning: A cooking skills experiment

    Dawn Surgenor, Lynsey Hollywood, Sinéad Furey, Fiona Lavelle, Laura McGowan, Michelle Spence, Monique Raats, Amanda McCloat, Elaine Mooney, Martin Caraher, et al. The impact of video technology on learning: A cooking skills experiment. Appetite, 114:306–312, 2017.

  56. [56]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019.

  57. [57]

    EPIC Fields: Marrying 3D Geometry and Video Understanding

    Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Larina, Diane Larlus, Dima Damen, and Andrea Vedaldi. EPIC Fields: Marrying 3D Geometry and Video Understanding. In Proceedings of the Neural Information Processing Systems (NeurIPS), 2023.

  58. [58]

    Motioneditor: Editing video motion via content-aware diffusion

    Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, and Yu-Gang Jiang. Motioneditor: Editing video motion via content-aware diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7882–7891, 2024.

  59. [59]

    Recipe2video: Synthesizing personalized videos from recipe texts

    Prateksha Udhayanan, Suryateja Bv, Parth Laturia, Dev Chauhan, Darshan Khandelwal, Stefano Petrangeli, and Balaji Vasan Srinivasan. Recipe2video: Synthesizing personalized videos from recipe texts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2268–2277, 2023.

  60. [60]

    Covr: Learning composed video retrieval from web video captions

    Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. Covr: Learning composed video retrieval from web video captions. arXiv preprint arXiv:2308.14746, 2023. 2, 6, 7, 3

  61. [61]

    Vlm see, robot do: Human demo video to robot action plan via vision language model

    Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, and Chen Feng. Vlm see, robot do: Human demo video to robot action plan via vision language model. arXiv preprint arXiv:2410.08792, 2024. 1

  62. [62]

    Pdpp: Projected diffusion for procedure planning in instructional videos

    Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. Pdpp: Projected diffusion for procedure planning in instructional videos. arXiv preprint arXiv:2303.14676, 2023. 2

  63. [63]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 2

  64. [64]

    Internvideo: General video foundation models via generative and discriminative learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning, 2022. 2, 3, 6, 8

  65. [65]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024. 2, 3, 6, 7

  66. [66]

    Wikihow

    WikiHow. Wikihow. https://www.wikihow.com,

  67. [67]

    Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

    Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022. 2

  68. [68]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021. 2, 4

  69. [69]

    Two-stream 2d/3d residual networks for learning robot manipulations from human demonstration videos

    Xin Xu, Kun Qian, Bo Zhou, Shenghao Chen, and Yitong Li. Two-stream 2d/3d residual networks for learning robot manipulations from human demonstration videos. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 3353–3358, 2021. 1

  70. [70]

    Learning object state changes in videos: An open-world perspective

    Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18493–18503,

  71. [71]

    Learning object state changes in videos: An open-world perspective

    Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18493–18503, 2024. 3

  72. [72]

    RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes

    Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. Recipeqa: A challenge dataset for multimodal comprehension of cooking recipes. arXiv preprint arXiv:1809.00812, 2018. 2, 3

  73. [73]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2

  74. [74]

    Multi-grained vision language pre-training: Aligning texts with visual concepts

    Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021. 2

  75. [75]

    P3iv: Probabilistic procedure planning from instructional videos with weak supervision

    He Zhao, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, Richard P Wildes, and Allan D Jepson. P3iv: Probabilistic procedure planning from instructional videos with weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2938–2948, 2022. 2

  76. [76]

    Centerclip: Token clustering for efficient text-video retrieval

    Shuai Zhao, Linchao Zhu, Xiaohan Wang, and Yi Yang. Centerclip: Token clustering for efficient text-video retrieval. arXiv preprint arXiv:2205.00823, 2022. 2

  77. [77]

    Learning procedure-aware video representation from instructional videos and their narrations

    Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. arXiv preprint arXiv:2303.17839, 2023. 2

  78. [78]

    Procedure-aware pretraining for instructional video understanding

    Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. In CVPR, pages 10727–10738, 2023. 2

  79. [79]

    Procedure-aware pretraining for instructional video understanding

    Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

  80. [80]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598, 2018. 3

Showing first 80 references.