Pith · machine review for the scientific record

arxiv: 2511.21998 · v2 · submitted 2025-11-27 · 💻 cs.CV

Recognition: 2 Lean theorem links

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-modal LLM · live guidance · interactive coaching · task execution · mistake detection · streaming model · cooking benchmark · asynchronous response

The pith

Multi-modal LLMs can provide live step-by-step guidance when given streaming video input and a benchmark of timed user mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that multi-modal LLMs need to move beyond turn-based conversation to deliver real-time instructional coaching during ongoing tasks. It creates the Qualcomm Interactive Cooking benchmark from videos of users cooking that include mistakes and corrections, complete with precise timestamps for instructions and error alerts. The authors also introduce LiveMamba, a streaming model built to process video continuously and respond with feedback at the right moments. A sympathetic reader would care because capable AI assistants must monitor actions as they happen and correct errors before they compound rather than waiting for the next turn.
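Concretely, each benchmark episode in this scheme pairs a task video with a list of timestamped events. A minimal sketch of how such a record could be laid out is below; the class and field names are editorial assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TimedEvent:
    # One timed annotation: an instruction, a feedback message, or a
    # mistake alert, pinned to a timestamp in the video stream.
    t_sec: float   # when the event occurs in the video
    kind: str      # "instruction" | "feedback" | "mistake_alert"
    text: str      # the message the assistant should deliver

@dataclass
class GuidanceEpisode:
    # A recorded task execution with its dense timed annotations.
    video_path: str
    recipe_steps: list[str]
    events: list[TimedEvent] = field(default_factory=list)

    def mistake_alerts(self) -> list[TimedEvent]:
        # Alerts are timestamped to the visual occurrence of the
        # error, per the paper's annotation scheme.
        return [e for e in self.events if e.kind == "mistake_alert"]
```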

Core claim

The paper establishes that effective live situated coaching requires models that can react asynchronously to a video stream, detect whether instructions were executed successfully, and alert the user to a mistake at the moment it occurs visually. That claim is carried by the Qualcomm Interactive Cooking dataset, with its dense timed annotations, and by LiveMamba, a streaming multi-modal baseline that processes continuous input and outputs guidance.

What carries the argument

LiveMamba, a streaming multi-modal LLM designed to process video streams asynchronously and deliver timed instructions plus mistake alerts.
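Figure 2 (reproduced below) names the pieces: an InternViT vision head emitting M tokens per frame, a Q-Former reducing them to K tokens, and a Mamba language backbone that decides what to say and when. The following PyTorch sketch captures that shape only; every module and dimension is an illustrative stand-in (a patchify conv for InternViT, a GRU for the Mamba backbone, both keeping O(1) per-frame state, which is the property that matters for streaming), not the authors' implementation.

```python
import torch
import torch.nn as nn

class StreamingGuidanceSketch(nn.Module):
    """Hedged sketch of a LiveMamba-style streaming pipeline:
    vision head -> M tokens -> Q-Former -> K tokens -> recurrent
    backbone -> 'when-to-say' gate. All modules are stand-ins."""

    def __init__(self, d_model=256, k_tokens=8, n_heads=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(k_tokens, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.speak_gate = nn.Linear(d_model, 1)  # speak now, or stay silent

    def step(self, frame, hidden=None):
        """Consume ONE frame of shape (b, 3, H, W); no future frames."""
        feat = self.patchify(frame)                   # (b, d, h, w)
        m_tok = feat.flatten(2).transpose(1, 2)       # (b, M = h*w, d)
        q = self.queries.unsqueeze(0).expand(frame.shape[0], -1, -1)
        k_tok, _ = self.cross_attn(q, m_tok, m_tok)   # (b, K, d)
        out, hidden = self.backbone(k_tok, hidden)    # carry state forward
        return self.speak_gate(out[:, -1]), hidden    # (b, 1) speak logit
```

Calling step() once per incoming frame, threading hidden state forward, is what makes the model streaming rather than turn-based.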

If this is right

  • State-of-the-art multi-modal LLMs can be systematically benchmarked on their ability to deliver timely feedback during task execution.
  • The dataset enables training and evaluation of models for detecting user errors at the exact moments they appear visually.
  • Streaming architectures support asynchronous responses that match the pace of ongoing human actions rather than discrete turns.
  • AI systems for interactive coaching become feasible for situated tasks that involve physical steps and corrections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same timed-annotation approach could extend to other procedural activities such as equipment assembly or home repair.
  • Real-time mistake alerting might reduce cumulative errors more than delayed feedback given after a task ends.
  • Pairing this capability with physical robots could allow AI to guide hands-on interactions as they unfold.

Load-bearing premise

The densely annotated timed mistake alerts in the dataset accurately reflect real user errors and allow meaningful evaluation of asynchronous real-time performance using existing video data.

What would settle it

A live user study in which participants cook while the model streams guidance and alerts, followed by a check of whether the model's alert timestamps align with independently observed mistakes better than chance timing.
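One way to make "better than chance" concrete: count a mistake as caught when an alert lands within a small window of it, then compare against alerts placed uniformly at random over the video. The ±2 s tolerance and the uniform baseline below are editorial assumptions, not a protocol from the paper.

```python
import numpy as np

def hit_rate(alert_times, mistake_times, tol=2.0):
    """Fraction of observed mistakes with a model alert within
    +/- tol seconds (tol is an assumed tolerance)."""
    alerts = np.asarray(alert_times, dtype=float)
    if len(mistake_times) == 0:
        return 0.0
    return float(np.mean([np.any(np.abs(alerts - m) <= tol)
                          for m in mistake_times]))

def better_than_chance(alert_times, mistake_times, duration_sec,
                       n_perm=10_000, tol=2.0, seed=0):
    """Permutation test against alerts placed uniformly at random
    over the video; returns the observed hit rate and a one-sided
    p-value for 'the model times its alerts better than chance'."""
    rng = np.random.default_rng(seed)
    observed = hit_rate(alert_times, mistake_times, tol)
    null = [hit_rate(rng.uniform(0.0, duration_sec, len(alert_times)),
                     mistake_times, tol) for _ in range(n_perm)]
    p_value = (1 + sum(h >= observed for h in null)) / (n_perm + 1)
    return observed, p_value
```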

Figures

Figures reproduced from arXiv: 2511.21998 by Apratim Bhattacharyya, Bicheng Xu, Leonid Sigal, Litian Liu, Pulkit Madan, Reza Pourreza, Roland Memisevic, Sanjay Haresh, Sunny Panchal.

Figure 1. At the stage where the tomatoes are being sliced, an instruction with the desired thickness …
Figure 1. An overview of the step-by-step task guidance scenario in our Qualcomm Interactive …
Figure 2. Our LIVEMAMBA model architecture. The input video stream is processed by an InternViT vision head which produces M tokens, and is then reduced to K tokens by a Q-Former. The language backbone produces feedback and invokes the Re-planner if necessary before the next instruction.
Figure 3. Our LIVEMAMBA is able to successfully recognize the person has added the black pepper as instructed and points out when the person should heat the oil in a non-stick frying pan, in the Qualcomm Interactive Cooking benchmark.
Figure 4. Data samples from the main set. Left: the user prepares spicy tuna avocado wraps. Right: …
Figure 5. Data samples from the advanced planning set. Left: the user is making ramen. Right: the …
Figure 6. Data samples from the advanced planning set. Left: the user is preparing butter corn cup. …
Figure 7. Predictions from our LIVEMAMBA from the main set of the Qualcomm Interactive Cooking benchmark.
original abstract

Multi-modal Large Language Models (LLM) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Qualcomm Interactive Cooking, a new benchmark and dataset built on CaptainCook4D containing user mistakes during cooking tasks. It features densely annotated timed instructions, feedback, and precisely timestamped mistake alerts. The work evaluates state-of-the-art multi-modal LLMs on this benchmark for live step-by-step guidance and introduces LiveMamba, a streaming multi-modal LLM designed for asynchronous reaction to video streams, claiming to provide the first dedicated benchmark and strong baseline for live situated coaching.

Significance. If the evaluation protocol and results hold, the contribution would be significant for advancing interactive AI assistants beyond turn-based conversation toward real-time, situated coaching. The new dataset with timed mistake annotations addresses a clear gap in resources for training and evaluating asynchronous mistake detection and guidance. The introduction of LiveMamba as a streaming baseline is a concrete step forward, and the focus on video streams with corrections provides a falsifiable testbed for future models.

major comments (1)
  1. [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.
minor comments (1)
  1. [Abstract] The abstract states that the dataset 'contains user mistakes during task execution' and features 'densely annotated, timed instructions and feedback messages' but provides no quantitative details on annotation density, inter-annotator agreement, or the distribution of mistake types; adding these statistics would strengthen the dataset description.
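Such statistics are cheap to compute once annotations are held as timed events. A hedged sketch, reusing the illustrative episode layout above and assuming per-episode durations and a per-alert mistake-type label (both hypothetical fields, not the dataset's published schema):

```python
from collections import Counter

def annotation_stats(episodes):
    """Annotation density and mistake-type distribution, assuming each
    episode exposes `events` (with a `kind` field), a `duration_sec`,
    and a `mistake_type` label on alerts; all hypothetical fields."""
    total_events = sum(len(ep.events) for ep in episodes)
    total_minutes = sum(ep.duration_sec for ep in episodes) / 60.0
    types = Counter(e.mistake_type
                    for ep in episodes
                    for e in ep.events if e.kind == "mistake_alert")
    return {"events_per_minute": total_events / total_minutes,
            "mistake_types": dict(types)}
```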

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on the evaluation and benchmark setup below.

point-by-point responses
  1. Referee: [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.

    Authors: We agree that a strict streaming protocol without future information is essential to support our claims of live, situated coaching. In the manuscript, we describe LiveMamba as a streaming model that reacts asynchronously to video streams. To address the concern directly: model evaluations are performed by feeding frames in sequential order without providing access to subsequent frames. The annotations in the dataset were created with full context to accurately timestamp mistakes, but this does not affect the model inputs during evaluation. We will expand the experimental setup section to explicitly detail the frame-by-frame input constraints and confirm the absence of any lookahead or batch processing in the reported results. revision: yes
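As an editorial illustration of what that commitment entails, a strictly causal evaluation loop looks like the sketch below; `model.step` is a hypothetical one-frame-at-a-time interface, not the authors' API.

```python
def evaluate_streaming(model, frames, fps=2.0):
    """Feed frames strictly in order; the model never sees a future
    frame. `model.step` is a hypothetical interface that consumes one
    frame plus recurrent state and returns (utterance_or_None, state).
    Returns (timestamp, message) pairs for each moment the model spoke."""
    state, outputs = None, []
    for i, frame in enumerate(frames):       # past-to-present only
        t = i / fps
        utterance, state = model.step(frame, state)
        if utterance is not None:            # the 'when-to-say' gate fired
            outputs.append((t, utterance))
        # No buffering of future frames and no second pass over the
        # clip: any batch re-processing here would leak lookahead.
    return outputs
```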

Circularity Check

0 steps flagged

No circularity: new benchmark and model proposal

full rationale

The paper introduces the Qualcomm Interactive Cooking benchmark built on CaptainCook4D with new dense timed annotations for mistakes and instructions, then evaluates existing MLLMs and proposes LiveMamba as a streaming baseline. No equations, fitted parameters, or derivations are present that reduce predictions to inputs by construction. Central claims rest on dataset creation and empirical evaluation rather than self-definitional loops, self-citation load-bearing premises, or renaming of prior results. The work is self-contained as an empirical contribution with external benchmarks for comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical applied paper; no mathematical derivations, free parameters, or invented physical entities are described in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 966 out tokens · 27337 ms · 2026-05-17T05:32:22.441929+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    LIVEMAMBA utilizes a lightweight Mamba backbone, a 'when-to-say' mechanism, novel data augmentation for mistake recognition, and iterative re-planning for adaptive delivery.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

  3. [3]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  5. [5]

    Can foundation models watch, talk and guide you step by step to make a cake? InEMNLP Findings, 2023

    Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? InEMNLP Findings, 2023

  6. [6]

    Look, remember and reason: Visual reasoning with grounded rationales

    Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. InICLR, 2024

  7. [7]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InECCV, 2020

  8. [8]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

  9. [9]

    Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

  11. [11]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InNeurIPS, 2023

  12. [12]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. InIJCV, 2022

  13. [13]

    Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

  14. [14]

    StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

    Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, and Ting Cao. StreamMind: Unlocking full frame rate streaming video dialogue through event-gated cognition. arXiv preprint arXiv:2503.06220, 2025

  15. [15]

    Qwen2.5 Technical Report

    An Yang et al. Qwen2.5 technical report. CoRR, abs/2412.15115, 2024

  16. [16]

    Qwen3 Technical Report

An Yang et al. Qwen3 technical report. CoRR, abs/2505.09388, 2025

  17. [17]

    Ego4d: Around the World in 3,000 Hours of Egocentric Video

    Kristen Grauman et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

  18. [18]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah I Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024

  19. [19]

    Qwen2.5-VL Technical Report

    Shuai Bai et al. Qwen2.5-vl technical report. CoRR, abs/2502.13923, 2025

  20. [20]

The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. InICCV, 2017

  21. [21]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  22. [22]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4d: Around the...

  23. [23]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023

  24. [24]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. doi: 10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752

  25. [25]

LVIS: A Dataset for Large Vocabulary Instance Segmentation

    Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. InCVPR, 2019

  26. [26]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  27. [27]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  28. [28]

    Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization.arXiv preprint arXiv:2402.03161, 2024

    Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization.arXiv preprint arXiv:2402.03161, 2024

  29. [29]

    Topological sorting of large networks.Communications of the ACM, 5(11): 558–562, 1962

    Arthur B Kahn. Topological sorting of large networks.Communications of the ACM, 5(11): 558–562, 1962

  30. [30]

    Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

  31. [31]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  32. [32]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models. InICML, 2023

  33. [33]

    VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  34. [34]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024

  35. [35]

    LION-FS: fast & slow video- language thinker as online video assistant

    Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. LION-FS: fast & slow video- language thinker as online video assistant. InCVPR, 2025

  36. [36]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  37. [37]

    Ovo-bench: How far is your video-llms from real-world online video understanding?arXiv preprint arXiv:2501.05510, 2025

    Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding?arXiv preprint arXiv:2501.05510, 2025

  38. [38]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

  39. [39]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  40. [40]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  41. [41]

    Streamchat: Chatting with streaming video.arXiv preprint arXiv:2412.08646, 2024

    Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, and Jose M Alvare. Streamchat: Chatting with streaming video.arXiv preprint arXiv:2412.08646, 2024

  42. [42]

    Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  43. [43]

    Video-ChatGPT: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,ACL, August 2024

  44. [44]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InICCV, 2019

  45. [45]

    What to say and when to say it: Live fitness coaching as a testbed for situated interaction

    Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, and Roland Memisevic. What to say and when to say it: Live fitness coaching as a testbed for situated interaction. InNeurIPS, 2024

  46. [46]

Captaincook4d: A dataset for understanding errors in procedural activities

    Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric D. Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for understanding errors in procedural activities. InNeurIPS, 2024

  47. [47]

    Can vision-language models answer face to face questions in the real-world? arXiv preprint arXiv:2503.19356, 2025

    Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, and Roland Memisevic. Can vision-language models answer face to face questions in the real-world? arXiv preprint arXiv:2503.19356, 2025

  48. [48]

    Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  49. [49]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. InCVPR, 2022

  50. [50]

    Ego4d goal-step: Toward hierarchical understanding of procedural activities

    Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: Toward hierarchical understanding of procedural activities. In NeurIPS, 2023

  51. [51]

    COIN: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019

  52. [52]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019

  53. [53]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.CoRR, abs/2507.06261, 2025

  54. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805, 2023

  55. [55]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  56. [56]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  57. [57]

    Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. InICCV, 2023

  58. [58]

    OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. OmniMMI: A comprehensive multi-modal interaction benchmark in streaming video contexts. arXiv preprint arXiv:2503.22952, 2025

  59. [59]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InCVPR, 2025

  60. [60]

    Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

    Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

  61. [61]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos.arXiv preprint arXiv:2504.17343, 2025

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos.arXiv preprint arXiv:2504.17343, 2025

  62. [62]

    Videollama 3: Frontier multimodal foundation models for image and video understanding, 2025

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. Videollama 3: Frontier multimodal foundation models for image and video understanding, 2025

  63. [63]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InEMNLP - System Demonstrations, 2023

  64. [64]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023. URL https://arxiv.org/abs/2306.02858

  65. [65]

    Flash-vstream: Memory-based real-time understanding for long video streams.arXiv preprint arXiv:2406.08085, 2024

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024

  66. [66]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. InAAAI, 2018

  67. [67]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023
