Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Chi Hsuan Wu; Kristen Grauman; Kumar Ashutosh

arxiv: 2503.13821 · v3 · submitted 2025-03-18 · 💻 cs.CV

Pith reviewed 2026-05-23 00:25 UTC · model grok-4.3

keywords: video retrieval · multistep text · clip stitching · instructional videos · weakly supervised · hard negatives · video demonstration

The pith

Stitch-a-Demo assembles coherent video demonstrations by stitching clips that match each step in a multistep description.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Stitch-a-Demo is a retrieval-based approach designed to generate video demonstrations from multistep text descriptions such as recipes or instruction manuals. Unlike prior work limited to single-step captions, it retrieves clips that correspond to every step and combines them into one video. A special training pipeline builds large weakly supervised datasets of procedures and adds hard negative examples to encourage both step accuracy and visual coherence between clips. This matters to a reader because it provides a way to automatically create visual how-to videos from written multistep guides without recording new content.

Core claim

We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence.

What carries the argument

Retrieval-based stitching trained with weakly supervised multistep data and hard negatives to ensure step matching and visual coherence.
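One way to picture the trade-off between per-step accuracy and cross-clip coherence is a small dynamic program over candidate clips. The sketch below is illustrative only and is not the paper's procedure evaluator: the `step_scores` and `coherence` inputs, the `weight` parameter, and the `stitch` function name are all assumptions introduced here, standing in for whatever text-clip and clip-clip scores a trained model would produce.

```python
from typing import List, Sequence


def stitch(step_scores: Sequence[Sequence[float]],
           coherence: Sequence[Sequence[Sequence[float]]],
           weight: float = 1.0) -> List[int]:
    """Pick one candidate clip per step, maximizing match plus coherence.

    step_scores[t][i]  : hypothetical text-clip match score of candidate i for step t
    coherence[t][i][j] : hypothetical coherence score between candidate i of step t
                         and candidate j of step t + 1
    weight             : assumed trade-off between the two terms
    """
    best = list(step_scores[0])          # best total score ending at each candidate
    back: List[List[int]] = []           # back-pointers, one list per transition
    for t in range(1, len(step_scores)):
        new_best, pointers = [], []
        for j, score in enumerate(step_scores[t]):
            # Best predecessor for candidate j of step t.
            i_star = max(range(len(best)),
                         key=lambda i: best[i] + weight * coherence[t - 1][i][j])
            new_best.append(best[i_star] + weight * coherence[t - 1][i_star][j] + score)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Trace the best path back from the final step.
    idx = max(range(len(best)), key=lambda i: best[i])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return path[::-1]
```

With `weight=0.0` the selection reduces to handling each step in isolation, the baseline the abstract warns would produce an incoherent demonstration; a positive weight lets a slightly weaker per-step match win when it pairs better with its neighbor.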

If this is right

Multistep descriptions receive visual illustrations in the form of a single assembled video.
The method maintains accuracy to each step description individually.
Visual coherence holds across clips sourced from separate videos.
State-of-the-art performance is reached on in-the-wild instructional video datasets, with gains of up to 29%.
Human preference studies show strong preference for the generated demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The coherence training could transfer to sequencing other types of media like images or text segments.
Hard negative sampling may address consistency issues in related retrieval problems involving ordered data.
Real-world applications might include generating demos for DIY projects or software tutorials from user text.

Load-bearing premise

Clips retrieved from different sources can be assembled into a single video that remains visually coherent while accurately reflecting every step description in the multistep input.

What would settle it

Finding that many stitched videos show abrupt visual changes between clips or omit key elements from a step description on a test set of multistep instructions would disprove the method's effectiveness.
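A crude version of the visual half of this check can be automated. The sketch below assumes each clip has already been reduced to a nonzero embedding vector by some visual encoder (not specified here) and flags boundaries where consecutive clips are dissimilar. The function names and the 0.5 threshold are hypothetical, and omitted step elements would still need human or text-based verification.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def abrupt_transitions(clip_embeddings: Sequence[Sequence[float]],
                       threshold: float = 0.5) -> List[int]:
    """Return each index t where clips t and t + 1 fall below the similarity threshold."""
    return [t for t in range(len(clip_embeddings) - 1)
            if cosine(clip_embeddings[t], clip_embeddings[t + 1]) < threshold]
```

The fraction of flagged boundaries over a test set of stitched videos would give one rough, assumption-laden proxy for the "abrupt visual changes" described above.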

Figures

Figures reproduced from arXiv: 2503.13821 by Chi Hsuan Wu, Kristen Grauman, Kumar Ashutosh.

**Figure 1.** Video demonstration from multistep descriptions. Given multistep descriptions (left) aiming to achieve a procedural task, e.g., making a vegan taco, our method obtains clips from thousands of instructional videos to visually demonstrate the procedure (right). The goal is for every clip to correctly describe a step, while maintaining visual consistency. Our proposed method goes beyond current retrieval and gen…

**Figure 2.** Overview of the method. The videos and the step descriptions in C are used to create a procedure mapping M, using step localization F_T. The procedure query R and M give video candidates V'_R. The procedure evaluator F_R outputs the likelihood of each candidate. …retrieving visually and logically coherent video demonstrations from sequential step descriptions, as we tackle in this work. Furthermore, unlike […

**Figure 3.** Examples of hard negatives and procedure combination. We design negative samples that violate step correctness, visual continuity, and object state continuity (left). We show an example of combining step descriptions from n (here n = 2) video demonstrations into a novel procedure, using an LLM [19] (right). The novel procedure mixes steps from both descriptions. …clips in the video collection C. Next, we pro…

**Figure 4.** Qualitative results. Our method correctly visualizes the step descriptions (top), compared to prior work. The second to fourth rows show representative outputs in cooking, woodworking, and gardening. Our method correctly shows video clips from two video sources. Neither video source alone can correctly demonstrate all the step descriptions. The last row contains some failure cases, showing the d…

**Figure 5.** Search space reduction. Using the effective set cover algorithm, the ground truth (GT) is captured in the candidate set with high probability, even with small sample set sizes. See text. We evaluate all the methods on four axes: step faithfulness, goal faithfulness, visual quality, and overall preference. Every sample is annotated by three subjects unrelated to this project. We compare two methods at a ti…

**Figure 6.** Result on distractor set splits. Our model performs competitively on all splits, particularly the more challenging RS, Other-pos, and Sim-match. Method (MedR↓, R@1↑, R@5↑): w/o augmentation (70, 0.03, 0.14); temporally-sampled procedures (10, 0.18, 0.42); weakly supervised D_w (ours) (3.5, 0.23, 0.56). Cor Con OSC MedR↓ R@1↑ R@5↑ ✓ 9 0.17 0.41 ✓ 171 0 0.03 ✓ 11 0.11 0.39 ✓ ✓ 9 0.18 0.45 ✓ ✓ 15 0.07 0.21 ✓ ✓ 5 0.15 0.54 ✓ ✓ ✓ 5 0…

**Figure 7.** Human preference study interface instructions. We provide examples of all axes for the human preference study: step faithfulness, goal faithfulness, visual quality, and overall preference. …Sec. C, we introduce the distractor set components. We evaluate the performance with each component of the distractor set. For example, we evaluate the retrieval performance with 99 negative samples from ‘Random mix-nm…

**Figure 8.** Human preference study submission form. The video in the interface shows both the candidate procedures side by side, and the step description is shown below them. The video is followed by four questions, asking about each axis, and the result is saved as a CSV file. …computational complexity, given a procedure query with M steps and a pool of N videos with K clips each, our method selects the top S clips per step…
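The MedR and R@k columns that appear alongside Figure 6 are standard retrieval metrics. As a minimal sketch, assuming each query yields the 1-based rank of its ground-truth procedure among the candidates (the function names are illustrative, not taken from the paper), they can be computed as follows:

```python
from statistics import median
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth item is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median of the 1-based ground-truth ranks over all queries."""
    return median(ranks)
```

Lower MedR and higher R@k are better, matching the arrows in the column headers. A median over an even number of queries can be fractional, which is consistent with values such as 3.5.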

Abstract

When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context - a caption, or an action description - and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Stitch-a-Demo offers a practical way to retrieve and stitch clips for multistep descriptions but lacks visible evaluation details in the abstract.

The punchline here is that the paper introduces Stitch-a-Demo, which assembles video clips from different sources into a demonstration that matches every step in a multistep text description while aiming for visual coherence. This addresses the issue with prior methods that only handled single-step inputs and would produce incoherent results if applied separately to each step.

What the paper does well is identify the problem of multistep procedural descriptions, such as recipes or instructions, and propose a retrieval-based solution trained on large-scale weakly supervised data. The injection of hard negatives to promote both correctness and coherence is a sensible addition to the training pipeline. The abstract reports state-of-the-art performance with gains up to 29 percent and strong results in a human preference study on in-the-wild instructional videos. This seems like a practical contribution for applications in education and training where visual illustrations of procedures are needed.

On the soft spots, the abstract provides no information on the evaluation protocol, the specific baselines compared against, dataset details, or any ablation studies. This makes it difficult to fully assess the strength of the performance claims or to understand exactly how the coherence is measured and achieved. The central assumption that clips retrieved from varied sources can be combined into a single visually coherent video that accurately covers all steps is reasonable but would benefit from more detailed validation in the full paper. No major internal contradictions or unsupported assumptions stand out from the description, however.

This work is aimed at researchers in computer vision who focus on video retrieval, text-to-video tasks, or instructional content generation. Readers interested in extending retrieval methods to sequential inputs or in building systems for procedural guidance would find the training recipe and overall approach useful to consider or build upon. The paper shows clear thinking about the limitations of existing single-context methods and engages honestly with the need for coherence in multistep settings.

It deserves a serious referee because it targets a genuine gap with a defined method that can be evaluated and extended. I recommend sending this to peer review to allow the community to examine the full experimental details and results.

Referee Report

1 major / 0 minor

Summary. The paper proposes Stitch-a-Demo, a retrieval-based pipeline that assembles coherent video demonstrations from multistep text inputs (e.g., recipes) by retrieving and stitching clips from diverse sources. It introduces a weakly-supervised training procedure that injects hard negatives to enforce both per-step accuracy and cross-clip visual coherence, and reports state-of-the-art results on in-the-wild instructional videos together with large gains in a human preference study.

Significance. If the quantitative claims hold, the work would advance retrieval-based video synthesis for complex procedural content, offering a practical alternative to generative models when source footage already exists. The emphasis on coherence across independently sourced clips addresses a clear gap in current single-caption retrieval methods.

major comments (1)

[Abstract] The central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. Because this claim is load-bearing for the paper's contribution, the omission leaves the primary performance assertion unverifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting this important point about the abstract. We address the comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. Because this claim is load-bearing for the paper's contribution, the omission leaves the primary performance assertion unverifiable.

Authors: We acknowledge that the abstract, constrained by length, omits specifics on the evaluation protocol, baselines, datasets, metrics, and tables. The full manuscript details these in the Experiments section (including in-the-wild instructional video datasets, comparison baselines, quantitative metrics yielding up to 29% gains, and the human preference study protocol with results). To improve self-containment of the abstract while preserving its brevity, we will revise it to include a concise reference to the evaluation setting, key metrics, and human study.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes a retrieval-based method and training pipeline for assembling video demonstrations from multistep text descriptions, using weakly supervised data creation and hard negatives to promote accuracy and coherence. No equations, fitted parameters, self-citations, or derivation steps are present that reduce any claimed result to its own inputs by construction. The approach is presented as an independent engineering contribution validated on external in-the-wild videos, with no load-bearing self-referential definitions or renamings of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

Domain assumption: clips from heterogeneous sources can be stitched while preserving visual coherence and step accuracy.
Implicit premise required for the retrieval-and-assembly approach to succeed.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 6 internal anchors

[1] Mohamed A Abdelsalam, Samrudhdhi B Rangrej, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, and Afsaneh Fazly. Gepsan: Generative procedure step anticipation in cooking videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2988–2997,
[2] Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. In NeurIPS, 2023.
[3] Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23066–23078, 2023.
[4] Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. In Advances in Neural Information Processing Systems, pages 67833–67846. Curran Associates, Inc., 2023.
[5] Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36, 2024.
[6] Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, and Kristen Grauman. Detours for navigating instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18804–18815,
[7] Siddhant Bansal, Chetan Arora, and CV Jawahar. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024.
[8] Jing Bi, Jiebo Luo, and Chenliang Xu. Procedure planning in instructional videos via contextual modeling and model-based policy learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15611–15620, 2021.
[9] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.
[10] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[11] Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020.
[12] Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, and Dong Shen. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint arXiv:2109.04290, 2021.
[13] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022.
[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
[15] Mikita Dvornik, Isma Hadji, Konstantinos G Derpanis, Animesh Garg, and Allan Jepson. Drop-DTW: Aligning common signal between sequences while dropping outliers. Advances in Neural Information Processing Systems, 34:13782–13793, 2021.
[16] Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 319–335. Springer, 2022.
[17] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.
[18] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
[19] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,
[20] Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13505–13515, 2021.
[21] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.
[22] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. In European Conference on Computer Vision, pages 205–224. Springer, 2024.
[23] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
[24] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In CVPR, 2024.
[25] Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022.
[26] Hexiang Hu, Kelvin CK Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, et al. Instruct-imagen: Image generation with multi-modal instruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4754–4763, 2024.
[27] Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17980–17989, 2022.
[28] Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. Epic-sounds: A large-scale dataset of actions that sound. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[29] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[30] Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M Rehg, and Miao Liu. Lego: Learning egocentric action frame generation via visual instruction tuning. In European Conference on Computer Vision, pages 135–155. Springer, 2024.
[31] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022.
[32] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.
[33] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[34] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.
[35] Sherry Liang, Khalid Alanazi, and Kumail Al Hamoud. Set covering problem. Cornell University Computational Optimization Open Textbook. Cornell University, [online document], 2020.
[36] Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. In NeurIPS, 2022.
[37] Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, and Lorenzo Torresani. Learning to recognize procedural activities with distant supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13853–13863, 2022.
[38] Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, and Ming-Hsuan Yang. Text-driven image editing via learnable regions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7059–7068,
[39] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[40] Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15201–15213,
[41] Sachit Menon, Ishan Misra, and Rohit Girdhar. Generating illustrated instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6274–6284, 2024.
[42] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
[43] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020.
[44] Tushar Nagarajan and Lorenzo Torresani. Step differences in instructional video. In CVPR, 2024.
[45] Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. Grit: Faster and better image captioning transformer using dual visual features. arXiv preprint arXiv:2207.09666, 2022.
[46] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,
[47] Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023.
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[49]

Action scene graphs for long- form understanding of egocentric videos

Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, and Giovanni Maria Farinella. Action scene graphs for long- form understanding of egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18622–18632, 2024. 3

work page 2024
[50]

Inverse cooking: Recipe generation from food images

Amaia Salvador, Michal Drozdzal, Xavier Giro i Nieto, and Adriana Romero. Inverse cooking: Recipe generation from food images. InCVPR, 2019. 2, 3

work page 2019
[51]

Transferring knowledge from text to video: Zero-shot anticipation for pro- cedural actions.IEEE transactions on pattern analysis and machine intelligence, 45(6):7836–7852, 2022

Fadime Sener, Rishabh Saraf, and Angela Yao. Transferring knowledge from text to video: Zero-shot anticipation for pro- cedural actions.IEEE transactions on pattern analysis and machine intelligence, 45(6):7836–7852, 2022. 2

[52] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.

[53] Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. Showhowto: Generating scene-conditioned step-by-step visual instructions. arXiv preprint arXiv:2412.01987, 2024.

[54] Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. Genhowto: Learning to generate actions and state transformations from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[55] Dawn Surgenor, Lynsey Hollywood, Sinéad Furey, Fiona Lavelle, Laura McGowan, Michelle Spence, Monique Raats, Amanda McCloat, Elaine Mooney, Martin Caraher, et al. The impact of video technology on learning: A cooking skills experiment. Appetite, 114:306–312, 2017.

[56] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019.

[57] Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Larina, Diane Larlus, Dima Damen, and Andrea Vedaldi. EPIC Fields: Marrying 3D Geometry and Video Understanding. In Proceedings of the Neural Information Processing Systems (NeurIPS), 2023.

[58] Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, and Yu-Gang Jiang. Motioneditor: Editing video motion via content-aware diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7882–7891, 2024.

[59] Prateksha Udhayanan, Suryateja Bv, Parth Laturia, Dev Chauhan, Darshan Khandelwal, Stefano Petrangeli, and Balaji Vasan Srinivasan. Recipe2video: Synthesizing personalized videos from recipe texts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2268–2277, 2023.

[60] Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. Covr: Learning composed video retrieval from web video captions. arXiv preprint arXiv:2308.14746, 2023.

[61] Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, and Chen Feng. Vlm see, robot do: Human demo video to robot action plan via vision language model. arXiv preprint arXiv:2410.08792, 2024.

[62] Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. Pdpp: Projected diffusion for procedure planning in instructional videos. arXiv preprint arXiv:2303.14676, 2023.

[63] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.

[64] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning, 2022.

[65] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024.

[66] WikiHow. https://www.wikihow.com.

[67] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022.

[68] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021.

[69] Xin Xu, Kun Qian, Bo Zhou, Shenghao Chen, and Yitong Li. Two-stream 2d/3d residual networks for learning robot manipulations from human demonstration videos. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 3353–3358, 2021.

[70] Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18493–18503.

[71] Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18493–18503, 2024.

[72] Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. Recipeqa: A challenge dataset for multimodal comprehension of cooking recipes. arXiv preprint arXiv:1809.00812, 2018.

[73] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[74] Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021.

[75] He Zhao, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, Richard P Wildes, and Allan D Jepson. P3iv: Probabilistic procedure planning from instructional videos with weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2938–2948, 2022.

[76] Shuai Zhao, Linchao Zhu, Xiaohan Wang, and Yi Yang. Centerclip: Token clustering for efficient text-video retrieval. arXiv preprint arXiv:2205.00823, 2022.

[77] Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. arXiv preprint arXiv:2303.17839, 2023.

[78] Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. In CVPR, pages 10727–10738, 2023.

[79] Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[80] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598, 2018.


[1] Mohamed A Abdelsalam, Samrudhdhi B Rangrej, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, and Afsaneh Fazly. Gepsan: Generative procedure step anticipation in cooking videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2988–2997.

[2] Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. In NeurIPS, 2023.

[3] Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23066–23078, 2023.

[4] Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. In Advances in Neural Information Processing Systems, pages 67833–67846. Curran Associates, Inc., 2023.

[5] Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36, 2024.

[6] Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, and Kristen Grauman. Detours for navigating instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18804–18815.

[7] Siddhant Bansal, Chetan Arora, and CV Jawahar. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024.

[8] Jing Bi, Jiebo Luo, and Chenliang Xu. Procedure planning in instructional videos via contextual modeling and model-based policy learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15611–15620, 2021.

[9] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.

[10] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[11] Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020.

[12] Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, and Dong Shen. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint arXiv:2109.04290, 2021.

[13] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022.

[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[15] Mikita Dvornik, Isma Hadji, Konstantinos G Derpanis, Animesh Garg, and Allan Jepson. Drop-DTW: Aligning common signal between sequences while dropping outliers. Advances in Neural Information Processing Systems, 34:13782–13793, 2021.

[16] Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 319–335. Springer, 2022.

[17] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.

[18] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.

[19] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010.

[20] Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13505–13515, 2021.

[21] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.

[22] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. In European Conference on Computer Vision, pages 205–224. Springer, 2024.

[23] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

[24] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024.

[25] Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022.

[26] Hexiang Hu, Kelvin CK Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, et al. Instruct-imagen: Image generation with multi-modal instruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4754–4763, 2024.

[27] Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17980–17989, 2022.

[28] Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, and Andrew Zisserman. Epic-sounds: A large-scale dataset of actions that sound. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[29] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[30] Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M Rehg, and Miao Liu. Lego: Learning egocentric action frame generation via visual instruction tuning. In European Conference on Computer Vision, pages 135–155. Springer, 2024.

[31] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022.

[32] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.

[33] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.

[34] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.

[35] Sherry Liang, Khalid Alanazi, and Kumail Al Hamoud. Set covering problem. Cornell University Computational Optimization Open Textbook. Cornell University, [online document], 2020.

[36] Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. In NeurIPS, 2022.

[37] Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, and Lorenzo Torresani. Learning to recognize procedural activities with distant supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13853–13863, 2022.

[38] Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, and Ming-Hsuan Yang. Text-driven image editing via learnable regions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7059–7068.

[39] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.

[40] Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15201–15213.

[41] Sachit Menon, Ishan Misra, and Rohit Girdhar. Generating illustrated instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6274–6284, 2024.

[42] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.

[43] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020.

[44] Tushar Nagarajan and Lorenzo Torresani. Step differences in instructional video. In CVPR, 2024.

[45] Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. Grit: Faster and better image captioning transformer using dual visual features. arXiv preprint arXiv:2207.09666, 2022.

[46] [46]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Egovlpv2: Egocentric video-language pre-training with fusion in the backbone

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023. 2

work page 2023

[48] [48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings 10 of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 3, 4

work page 2021

[49] [49]

Action scene graphs for long- form understanding of egocentric videos

Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, and Giovanni Maria Farinella. Action scene graphs for long- form understanding of egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18622–18632, 2024. 3

work page 2024

[50] [50]

Inverse cooking: Recipe generation from food images

Amaia Salvador, Michal Drozdzal, Xavier Giro i Nieto, and Adriana Romero. Inverse cooking: Recipe generation from food images. InCVPR, 2019. 2, 3

work page 2019

[51] [51]

Transferring knowledge from text to video: Zero-shot anticipation for pro- cedural actions.IEEE transactions on pattern analysis and machine intelligence, 45(6):7836–7852, 2022

Fadime Sener, Rishabh Saraf, and Angela Yao. Transferring knowledge from text to video: Zero-shot anticipation for pro- cedural actions.IEEE transactions on pattern analysis and machine intelligence, 45(6):7836–7852, 2022. 2

work page 2022

[52] [52]

Mpnet: Masked and permuted pre-training for language understanding.Advances in Neural Information Processing Systems, 33:16857–16867, 2020

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding.Advances in Neural Information Processing Systems, 33:16857–16867, 2020. 5

work page 2020

[53] [53]

Showhowto: Generating scene-conditioned step-by-step visual instructions.arXiv preprint arXiv:2412.01987, 2024

Tom ´aˇs Sou ˇcek, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. Showhowto: Generating scene-conditioned step-by-step visual instructions.arXiv preprint arXiv:2412.01987, 2024. 2, 3, 8

work page arXiv 2024

[54] [54]

Genhowto: Learning to generate actions and state transformations from instructional videos

Tom ´aˇs Sou ˇcek, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. Genhowto: Learning to generate actions and state transformations from instructional videos. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3

work page 2024

[55] [55]

The impact of video technology on learning: A cooking skills experiment.Appetite, 114:306–312, 2017

Dawn Surgenor, Lynsey Hollywood, Sin ´ead Furey, Fiona Lavelle, Laura McGowan, Michelle Spence, Monique Raats, Amanda McCloat, Elaine Mooney, Martin Caraher, et al. The impact of video technology on learning: A cooking skills experiment.Appetite, 114:306–312, 2017. 1, 3

work page 2017

[56] [56]

Coin: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207– 1216, 2019. 2, 5, 6, 8, 3

work page 2019

[57] [57]

EPIC Fields: Marrying 3D Geometry and Video Understanding

Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Larina, Diane Larlus, Dima Damen, and Andrea Vedaldi. EPIC Fields: Marrying 3D Geometry and Video Understanding. InProceedings of the Neural Information Processing Systems (NeurIPS), 2023. 2

work page 2023

[58] [58]

Motioneditor: Editing video motion via content-aware diffusion

Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, and Yu-Gang Jiang. Motioneditor: Editing video motion via content-aware diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7882–7891, 2024. 2

work page 2024

[59] [59]

Recipe2video: Synthesizing person- alized videos from recipe texts

Prateksha Udhayanan, Suryateja Bv, Parth Laturia, Dev Chauhan, Darshan Khandelwal, Stefano Petrangeli, and Bal- aji Vasan Srinivasan. Recipe2video: Synthesizing person- alized videos from recipe texts. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2268–2277, 2023. 2, 3, 6, 7, 8

work page 2023

[60] [60]

Covr: Learning composed video retrieval from web video captions.arXiv preprint arXiv:2308.14746, 2023

Lucas Ventura, Antoine Yang, Cordelia Schmid, and G ¨ul Varol. Covr: Learning composed video retrieval from web video captions.arXiv preprint arXiv:2308.14746, 2023. 2, 6, 7, 3

work page arXiv 2023

[61] [61]

Vlm see, robot do: Human demo video to robot action plan via vision language model.arXiv preprint arXiv:2410.08792, 2024

Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, and Chen Feng. Vlm see, robot do: Human demo video to robot action plan via vision language model.arXiv preprint arXiv:2410.08792, 2024. 1

work page arXiv 2024

[62] [62]

Pdpp: Projected diffusion for procedure planning in instructional videos.arXiv preprint arXiv:2303.14676, 2023

Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. Pdpp: Projected diffusion for procedure planning in instructional videos.arXiv preprint arXiv:2303.14676, 2023. 2

work page arXiv 2023

[63] [63]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[64] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning, 2022. 2, 3, 6, 8

[65] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024. 2, 3, 6, 7

[66] WikiHow. Wikihow. https://www.wikihow.com,

[67] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022. 2

[68] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021. 2, 4

[69] Xin Xu, Kun Qian, Bo Zhou, Shenghao Chen, and Yitong Li. Two-stream 2d/3d residual networks for learning robot manipulations from human demonstration videos. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 3353–3358, 2021. 1

[70] Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18493–18503,

[71] Zihui Xue, Kumar Ashutosh, and Kristen Grauman. Learning object state changes in videos: An open-world perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18493–18503, 2024. 3

[72] Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. Recipeqa: A challenge dataset for multimodal comprehension of cooking recipes. arXiv preprint arXiv:1809.00812, 2018. 2, 3

[73] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2

[74] Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021. 2

[75] He Zhao, Isma Hadji, Nikita Dvornik, Konstantinos G Derpanis, Richard P Wildes, and Allan D Jepson. P3iv: Probabilistic procedure planning from instructional videos with weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2938–2948, 2022. 2

[76] Shuai Zhao, Linchao Zhu, Xiaohan Wang, and Yi Yang. Centerclip: Token clustering for efficient text-video retrieval. arXiv preprint arXiv:2205.00823, 2022. 2

[77] Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. arXiv preprint arXiv:2303.17839, 2023. 2

[78] Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. In CVPR, pages 10727–10738, 2023. 2

[79] Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

[80] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598, 2018. 3