Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
Pith reviewed 2026-05-23 00:25 UTC · model grok-4.3
The pith
Stitch-a-Demo assembles coherent video demonstrations by stitching clips that match each step in a multistep description.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence.
What carries the argument
Retrieval-based stitching trained with weakly supervised multistep data and hard negatives to ensure step matching and visual coherence.
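One way to picture how step matching and coherence can be traded off at assembly time is a small dynamic program over candidate clips. This is only a sketch under stated assumptions: the summary does not describe the paper's actual inference procedure, and the names `stitch`, `clip_text_vecs`, `clip_visual_vecs`, and `coherence_weight` are hypothetical. It assumes each clip already has an embedding comparable to the step text and a separate visual embedding, and for simplicity it allows the same clip to be picked for more than one step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def stitch(step_vecs, clip_text_vecs, clip_visual_vecs, coherence_weight=0.5):
    """Pick one clip index per step (hypothetical helper, not the paper's method).

    Maximizes the sum of step-to-clip match scores plus a weighted visual
    similarity between consecutive clips, using Viterbi-style dynamic programming.
    """
    n_clips = len(clip_text_vecs)
    # match[t][c]: how well clip c matches step t.
    match = [[cosine(s, c) for c in clip_text_vecs] for s in step_vecs]
    # coh[p][c]: visual similarity between clip p and clip c.
    coh = [[cosine(p, c) for c in clip_visual_vecs] for p in clip_visual_vecs]

    best = list(match[0])   # best total score of a sequence ending in each clip
    back = []               # back-pointers, one list per step after the first
    for t in range(1, len(step_vecs)):
        new_best, pointers = [], []
        for c in range(n_clips):
            prev = max(range(n_clips),
                       key=lambda p: best[p] + coherence_weight * coh[p][c])
            new_best.append(best[prev] + coherence_weight * coh[prev][c] + match[t][c])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the best final clip to recover the full sequence.
    c = max(range(n_clips), key=lambda i: best[i])
    path = [c]
    for pointers in reversed(back):
        c = pointers[c]
        path.append(c)
    return path[::-1]

# Toy data: clip 0 matches step 0 best, but clip 1 looks like clip 2.
steps = [[1.0, 0.0], [0.0, 1.0]]
clip_text = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
clip_visual = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(stitch(steps, clip_text, clip_visual, coherence_weight=0.0))
print(stitch(steps, clip_text, clip_visual, coherence_weight=1.0))
```

With the coherence weight at zero the selection reduces to handling each step in isolation, which is the failure mode the abstract warns about; raising the weight lets a slightly weaker per-step match win when it keeps the sequence visually consistent.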
If this is right
- Multistep descriptions receive visual illustrations in the form of a single assembled video.
- The method maintains accuracy to each step description individually.
- Visual coherence holds across clips sourced from separate videos.
- State-of-the-art performance is reached on in-the-wild instructional videos, with gains of up to 29%.
- Human preference studies show strong preference for the generated demonstrations.
Where Pith is reading between the lines
- The coherence training could transfer to sequencing other types of media like images or text segments.
- Hard negative sampling may address consistency issues in related retrieval problems involving ordered data.
- Real-world applications might include generating demos for DIY projects or software tutorials from user text.
Load-bearing premise
Clips retrieved from different sources can be assembled into a single video that remains visually coherent while accurately reflecting every step description in the multistep input.
What would settle it
Finding that many stitched videos show abrupt visual changes between clips or omit key elements from a step description on a test set of multistep instructions would disprove the method's effectiveness.
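The "abrupt visual changes" half of this test could be approximated with a simple transition statistic. The sketch below is an illustration only, not an evaluation protocol from the paper: `abrupt_transition_rate` and its `threshold` are hypothetical, and it assumes each clip in a stitched video has been reduced to a single visual embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def abrupt_transition_rate(clip_visual_vecs, threshold=0.5):
    """Fraction of adjacent clip pairs whose visual similarity is below threshold.

    Hypothetical diagnostic: the threshold value is an assumption and would
    need calibrating against human judgments of coherence.
    """
    pairs = list(zip(clip_visual_vecs, clip_visual_vecs[1:]))
    if not pairs:
        return 0.0  # a single clip has no transitions to judge
    abrupt = sum(1 for a, b in pairs if cosine(a, b) < threshold)
    return abrupt / len(pairs)

# Three clips: the first two look alike, the third is visually unrelated.
print(abrupt_transition_rate([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

A high rate across a test set of multistep instructions would point toward the failure described above; the step-omission half of the test would still need a separate check against each step description.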
Original abstract
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context - a caption, or an action description - and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.
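The abstract says hard negatives are injected during training but does not give the objective. As a hedged illustration, a standard InfoNCE-style contrastive loss shows why hard negatives matter: a negative clip that scores close to the positive contributes far more to the loss than an easy one. The function name, the split into easy and hard negatives, and the `temperature` value are assumptions made for this sketch, not details from the paper.

```python
import math

def info_nce_with_hard_negatives(pos_score, easy_neg_scores, hard_neg_scores,
                                 temperature=0.1):
    """Contrastive loss for one step query (illustrative, not the paper's loss).

    Returns -log of the softmax probability assigned to the positive clip,
    where the softmax runs over the positive and all negative clips.
    """
    logits = [pos_score] + list(easy_neg_scores) + list(hard_neg_scores)
    scaled = [s / temperature for s in logits]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[0]

# Same positive and easy negatives, with and without one near-miss clip.
easy_only = info_nce_with_hard_negatives(0.9, [0.1, 0.0], [])
with_hard = info_nce_with_hard_negatives(0.9, [0.1, 0.0], [0.85])
print(easy_only < with_hard)
```

Under this kind of objective, a hard negative could be, for instance, a clip that matches the step text but clashes visually with the neighbouring clips, which would tie the loss to both correctness and coherence; whether the paper builds its negatives this way is not stated in the abstract.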
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Stitch-a-Demo, a retrieval-based pipeline that assembles coherent video demonstrations from multistep text inputs (e.g., recipes) by retrieving and stitching clips from diverse sources. It introduces a weakly-supervised training procedure that injects hard negatives to enforce both per-step accuracy and cross-clip visual coherence, and reports state-of-the-art results on in-the-wild instructional videos together with large gains in a human preference study.
Significance. If the quantitative claims hold, the work would advance retrieval-based video synthesis for complex procedural content, offering a practical alternative to generative models when source footage already exists. The emphasis on coherence across independently sourced clips addresses a clear gap in current single-caption retrieval methods.
major comments (1)
- [Abstract] Abstract: the central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. This absence renders the primary performance assertion unverifiable and load-bearing for the paper's contribution.
Simulated Author's Rebuttal
We thank the referee for their detailed review and for highlighting this important point about the abstract. We address the comment below and will revise the manuscript accordingly.
Referee: [Abstract] Abstract: the central claim that the method 'achieves state-of-the-art performance, with gains up to 29%' and 'dramatic wins in a human preference study' is presented without any description of the evaluation protocol, baselines, datasets, metrics, or results tables. This absence renders the primary performance assertion unverifiable and load-bearing for the paper's contribution.
Authors: We acknowledge that the abstract, constrained by length, omits specifics on the evaluation protocol, baselines, datasets, metrics, and tables. The full manuscript details these in the Experiments section (including in-the-wild instructional video datasets, comparison baselines, quantitative metrics yielding up to 29% gains, and the human preference study protocol with results). To improve self-containment of the abstract while preserving its brevity, we will revise it to include a concise reference to the evaluation setting, key metrics, and human study.
Revision: yes
Circularity Check
No significant circularity identified
The paper describes a retrieval-based method and training pipeline for assembling video demonstrations from multistep text descriptions, using weakly supervised data creation and hard negatives to promote accuracy and coherence. No equations, fitted parameters, self-citations, or derivation steps are present that reduce any claimed result to its own inputs by construction. The approach is presented as an independent engineering contribution validated on external in-the-wild videos, with no load-bearing self-referential definitions or renamings of prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: clips from heterogeneous sources can be stitched while preserving visual coherence and step accuracy.