Recognition: 2 Lean theorem links
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Pith reviewed 2026-05-17 05:32 UTC · model grok-4.3
The pith
Multi-modal LLMs can provide live step-by-step guidance when they react asynchronously to streaming video, and a benchmark of precisely timed user mistakes makes that capability measurable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that effective live situated coaching requires models that react asynchronously to video streams, detect whether instructions were executed successfully, and alert users to mistakes at the moment they occur visually. This claim is carried by the Qualcomm Interactive Cooking dataset, with its dense timed annotations, and by LiveMamba, a streaming multi-modal baseline that turns continuous input into guidance.
What carries the argument
LiveMamba, a streaming multi-modal LLM designed to process video streams asynchronously and deliver timed instructions plus mistake alerts.
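To make the streaming requirement concrete, below is a minimal sketch of the interaction pattern such a model has to support. The class and method names (LiveGuidanceModel, ingest_frame, GuidanceEvent) are hypothetical placeholders, not the paper's API; the point is that output is decoupled from turn-taking, so the model may emit an instruction, confirmation, or mistake alert at any frame, or stay silent.

```python
# Minimal sketch of an asynchronous, frame-driven guidance loop.
# All class and method names here are hypothetical placeholders, not the paper's API.
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GuidanceEvent:
    timestamp: float  # seconds into the stream
    kind: str         # "instruction", "confirmation", or "mistake_alert"
    text: str


class LiveGuidanceModel:
    """Stands in for a streaming multi-modal LLM such as LiveMamba."""

    def ingest_frame(self, frame, timestamp: float) -> Optional[GuidanceEvent]:
        # Update internal (recurrent) state with the new frame and decide whether
        # anything needs to be said right now. Returning None means "stay silent",
        # which is the common case.
        raise NotImplementedError


def run_session(model: LiveGuidanceModel, frames: Iterable, fps: float = 2.0):
    """Feed frames strictly in order; surface guidance the moment it is produced."""
    events = []
    for i, frame in enumerate(frames):
        t = i / fps
        event = model.ingest_frame(frame, t)  # no access to future frames
        if event is not None:
            events.append(event)
            print(f"[{event.timestamp:6.1f}s] {event.kind}: {event.text}")
    return events
```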
If this is right
- State-of-the-art multi-modal LLMs can be systematically benchmarked on their ability to deliver timely feedback during task execution.
- The dataset enables training and evaluation of models for detecting user errors at the exact moments they appear visually.
- Streaming architectures support asynchronous responses that match the pace of ongoing human actions rather than discrete turns.
- AI systems for interactive coaching become feasible for situated tasks that involve physical steps and corrections.
Where Pith is reading between the lines
- The same timed-annotation approach could extend to other procedural activities such as equipment assembly or home repair.
- Real-time mistake alerting might reduce cumulative errors more than delayed feedback given after a task ends.
- Pairing this capability with physical robots could allow AI to guide hands-on interactions as they unfold.
Load-bearing premise
The densely annotated timed mistake alerts in the dataset accurately reflect real user errors and allow meaningful evaluation of asynchronous real-time performance using existing video data.
What would settle it
A live user study in which participants cook while the model streams guidance and alerts, followed by a check of whether the model's alert timestamps align with independently observed mistakes more closely than chance timing would.
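One way to operationalise "better than chance timing" is to score how many independently observed mistakes receive a model alert within a small time window, then compare against the same number of alerts placed uniformly at random. The sketch below illustrates that comparison; the five-second tolerance and the recall-style score are assumptions for illustration, not the benchmark's official metric.

```python
import random


def alert_recall(alert_times, mistake_times, tolerance_s=5.0):
    """Fraction of observed mistakes with a model alert within the tolerance window."""
    hits = sum(
        1 for m in mistake_times
        if any(abs(a - m) <= tolerance_s for a in alert_times)
    )
    return hits / len(mistake_times) if mistake_times else 0.0


def chance_baseline(n_alerts, video_length_s, mistake_times, trials=1000, tolerance_s=5.0):
    """Same number of alerts placed uniformly at random, averaged over many trials."""
    scores = []
    for _ in range(trials):
        random_alerts = [random.uniform(0, video_length_s) for _ in range(n_alerts)]
        scores.append(alert_recall(random_alerts, mistake_times, tolerance_s))
    return sum(scores) / len(scores)


# Example: model alerts vs. independently logged mistakes in a 10-minute session.
model_alerts = [42.0, 183.5, 410.2]
observed_mistakes = [40.1, 180.0, 515.0]
print("model :", alert_recall(model_alerts, observed_mistakes))
print("chance:", chance_baseline(len(model_alerts), 600.0, observed_mistakes))
```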
Original abstract
Multi-modal Large Language Models (LLM) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qualcomm Interactive Cooking, a new benchmark and dataset built on CaptainCook4D containing user mistakes during cooking tasks. It features densely annotated timed instructions, feedback, and precisely timestamped mistake alerts. The work evaluates state-of-the-art multi-modal LLMs on this benchmark for live step-by-step guidance and introduces LiveMamba, a streaming multi-modal LLM designed for asynchronous reaction to video streams, claiming to provide the first dedicated benchmark and strong baseline for live situated coaching.
Significance. If the evaluation protocol and results hold, the contribution would be significant for advancing interactive AI assistants beyond turn-based conversation toward real-time, situated coaching. The new dataset with timed mistake annotations addresses a clear gap in resources for training and evaluating asynchronous mistake detection and guidance. The introduction of LiveMamba as a streaming baseline is a concrete step forward, and the focus on video streams with corrections provides a falsifiable testbed for future models.
Major comments (1)
- [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.
Minor comments (1)
- [Abstract] The abstract states that the dataset 'contains user mistakes during task execution' and features 'densely annotated, timed instructions and feedback messages' but provides no quantitative details on annotation density, inter-annotator agreement, or the distribution of mistake types; adding these statistics would strengthen the dataset description.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment on the evaluation and benchmark setup below.
Point-by-point responses
Referee: [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.
Authors: We agree that a strict streaming protocol without future information is essential to support our claims of live, situated coaching. In the manuscript, we describe LiveMamba as a streaming model that reacts asynchronously to video streams. To address the concern directly: model evaluations are performed by feeding frames in sequential order without providing access to subsequent frames. The annotations in the dataset were created with full context to accurately timestamp mistakes, but this does not affect the model inputs during evaluation. We will expand the experimental setup section to explicitly detail the frame-by-frame input constraints and confirm the absence of any lookahead or batch processing in the reported results.
Revision: yes
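For readers who want the no-lookahead constraint to be mechanically checkable, an evaluation harness along the following lines can enforce it. The wrapper and the model.step call are invented for illustration and are not the paper's evaluation code.

```python
# Illustrative enforcement of a strict streaming (no-lookahead) evaluation protocol.
# CausalStreamGuard and model.step are hypothetical names, not the paper's code.
class CausalStreamGuard:
    """Wraps a frame source so a model can never read a frame it has not yet received."""

    def __init__(self, frames):
        self._frames = frames
        self._cursor = -1  # index of the latest frame released so far

    def __iter__(self):
        for i, frame in enumerate(self._frames):
            self._cursor = i
            yield frame

    def get(self, index: int):
        # Past frames may be revisited, but any request beyond the cursor fails loudly,
        # which rules out batch processing or lookahead during evaluation.
        if index > self._cursor:
            raise RuntimeError(f"lookahead violation: frame {index} not yet released")
        return self._frames[index]


def evaluate_streaming(model, frames):
    guard = CausalStreamGuard(frames)
    outputs = []
    for frame in guard:
        outputs.append(model.step(frame, history=guard))  # model.step is hypothetical
    return outputs
```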
Circularity Check
No circularity: new benchmark and model proposal
Full rationale
The paper introduces the Qualcomm Interactive Cooking benchmark built on CaptainCook4D with new dense timed annotations for mistakes and instructions, then evaluates existing MLLMs and proposes LiveMamba as a streaming baseline. No equations, fitted parameters, or derivations are present that reduce predictions to inputs by construction. Central claims rest on dataset creation and empirical evaluation rather than self-definitional loops, self-citation load-bearing premises, or renaming of prior results. The work is self-contained as an empirical contribution with external benchmarks for comparison.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tagged unclear)
Tag rationale: the relation between the paper passage below and the cited Recognition theorem is unclear.
Paper passage:
We introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tagged unclear)
Tag rationale: the relation between the paper passage below and the cited Recognition theorem is unclear.
Paper passage:
LIVEMAMBA utilizes a lightweight Mamba backbone, a 'when-to-say' mechanism, novel data augmentation for mistake recognition, and iterative re-planning for adaptive delivery.
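One plausible reading of the quoted "when-to-say" mechanism is a per-timestep speak-or-stay-silent gate over the streaming backbone's state, which triggers the language decoder only when its score crosses a threshold. The sketch below follows that assumption and is an interpretation, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class WhenToSayHead(nn.Module):
    """Per-timestep speak/stay-silent gate on top of a streaming backbone's state.

    Illustrative only: the paper names a 'when-to-say' mechanism, but its exact
    formulation may differ from this binary gating head.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, state: torch.Tensor, threshold: float = 0.5):
        # state: (batch, hidden_dim) recurrent state after ingesting the latest frame.
        speak_prob = torch.sigmoid(self.gate(state)).squeeze(-1)
        return speak_prob, speak_prob > threshold


# Usage sketch: only run the (expensive) language decoder when the gate fires.
head = WhenToSayHead(hidden_dim=512)
state = torch.randn(1, 512)
prob, should_speak = head(state)
if should_speak.item():
    pass  # decode an instruction or mistake alert for this timestep
```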
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow..., et al. Flamingo: a visual language model for few-shot learning. 2022.
- [3] Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. MiniGPT4-Video: Advancing multimodal LLMs for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.
- [4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
- [5] Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023.
- [6] Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. In ICLR, 2024.
- [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
- [8] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. VideoLLM-online: Online video large language model for streaming video. In CVPR, 2024.
- [9] Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. LiveCC: Learning video LLM with streaming speech transcription at scale. arXiv preprint arXiv:2504.16030, 2025.
- [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
- [11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- [12] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. In IJCV, 2022.
- [13] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video KV-cache retrieval. arXiv preprint arXiv:2503.00540, 2025.
- [14] Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, and Ting Cao. StreamMind: Unlocking full frame rate streaming video dialogue through event-gated cognition. arXiv preprint arXiv:2503.06220, 2025.
- [15] An Yang et al. Qwen2.5 technical report. CoRR, abs/2412.15115, 2024.
- [16] An Yang et al. Qwen3 technical report. CoRR, abs/2505.09388, 2025.
- [17] Grauman et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- [18] Marah I Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024.
- [19] Shuai Bai et al. Qwen2.5-VL technical report. CoRR, abs/2502.13923, 2025.
- [20] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
- [21] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [22] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- [23] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023.
- [24] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. URL https://doi.org/10.48550/arXiv.2312.00752.
- [25]
- [26] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [27] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [28] Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161, 2024.
- [29] Arthur B Kahn. Topological sorting of large networks. Communications of the ACM, 5(11):558–562, 1962.
- [30] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
- [31] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [32] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- [33] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
- [34] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.
- [35] Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. LION-FS: Fast & slow video-language thinker as online video assistant. In CVPR, 2025.
- [36] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
- [37] Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. OVO-Bench: How far is your video-LLMs from real-world online video understanding? arXiv preprint arXiv:2501.05510, 2025.
- [38] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- [39] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [40] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- [41] Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, and Jose M Alvare. StreamChat: Chatting with streaming video. arXiv preprint arXiv:2412.08646, 2024.
- [42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- [43] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, ACL, August 2024.
- [44] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
- [45] Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, and Roland Memisevic. What to say and when to say it: Live fitness coaching as a testbed for situated interaction. In NeurIPS, 2024.
- [46] Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric D. Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. CaptainCook4D: A dataset for understanding errors in procedural activities. In NeurIPS, 2024.
- [47] Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, and Roland Memisevic. Can vision-language models answer face to face questions in the real-world? arXiv preprint arXiv:2503.19356, 2025.
- [48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [49] Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In CVPR, 2022.
- [50] Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4D Goal-Step: Toward hierarchical understanding of procedural activities. In NeurIPS, 2023.
- [51] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019.
- [52] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019.
- [53] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025.
- [54] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023.
- [55] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [56] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [57] Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. HoloAssist: An egocentric human interaction dataset for interactive AI assistants in the real world. In ICCV, 2023.
- [58] Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. OmniMMI: A comprehensive multi-modal interaction benchmark in streaming video contexts. arXiv preprint arXiv:2503.22952, 2025.
- [59] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In CVPR, 2025.
- [60] Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. SVBench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810, 2025.
- [61] Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. TimeChat-Online: 80% visual tokens are naturally redundant in streaming videos. arXiv preprint arXiv:2504.17343, 2025.
- [62] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. 2025.
- [63] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In EMNLP System Demonstrations, 2023.
- [64] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. URL https://arxiv.org/abs/2306.02858.
- [65] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.
- [66] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
- [67] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. LanguageBind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023.
Appendix excerpts (annotation and prompting details rather than cited works):
- Mistake annotations build on the goal-step and FHO annotations of Ego4D [17]: the FHO annotations largely cover fine-grained, short-duration actions, while the goal-step annotations cover longer-ranged actions more closely aligned with the recipe steps in the Qualcomm Interactive Cooking benchmark; temporally localized counterfactual mistakes are created starting from the FHO actions.
- Example mistake summaries for an earlier tortilla step: toasting or heating the tortilla before placing it on the cutting board, placing it on a plate instead of a cutting board, or using an unclean surface instead of a clean cutting board.
- Example mistake summaries for the instruction "Now add 1/4 tsp salt to a bowl": spilling salt while measuring or adding it, adding 1/2 tsp instead of the required 1/4 tsp, confusing 1/3 tablespoon with 1/3 teaspoon, or accidentally adding the salt to the pan rather than the bowl.
- These mistake summaries are used to prompt the multi-modal LLM baselines Gemini-2.5-Flash [55], Qwen2.5-VL-7B-Instruct [19], Qwen2-VL-7B-Instruct [56], VideoLLaMA3-7B [62], and VideoChat2 [34].