Pith · machine review for the scientific record

arxiv: 2511.21998 · v2 · submitted 2025-11-27 · 💻 cs.CV

Recognition: 2 Lean theorem links

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-modal LLM · live guidance · interactive coaching · task execution · mistake detection · streaming model · cooking benchmark · asynchronous response

The pith

Multi-modal LLMs can provide live step-by-step guidance when given streaming video input and a benchmark of timed user mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that multi-modal LLMs need to move beyond turn-based conversation to deliver real-time instructional coaching during ongoing tasks. It creates the Qualcomm Interactive Cooking benchmark from videos of users cooking that include mistakes and corrections, complete with precise timestamps for instructions and error alerts. The authors also introduce LiveMamba, a streaming model built to process video continuously and respond with feedback at the right moments. A sympathetic reader would care because capable AI assistants must monitor actions as they happen and correct errors before they compound rather than waiting for the next turn.
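Concretely, each benchmark episode in this scheme pairs a task video with a list of timestamped events. A minimal sketch of how such a record could be laid out is below; the class and field names are editorial assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TimedEvent:
    # One timed annotation: an instruction, a feedback message, or a
    # mistake alert, pinned to a timestamp in the video stream.
    t_sec: float   # when the event occurs in the video
    kind: str      # "instruction" | "feedback" | "mistake_alert"
    text: str      # the message the assistant should deliver

@dataclass
class GuidanceEpisode:
    # A recorded task execution with its dense timed annotations.
    video_path: str
    recipe_steps: list[str]
    events: list[TimedEvent] = field(default_factory=list)

    def mistake_alerts(self) -> list[TimedEvent]:
        # Alerts are timestamped to the visual occurrence of the
        # error, per the paper's annotation scheme.
        return [e for e in self.events if e.kind == "mistake_alert"]
```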

Core claim

The paper establishes that effective live situated coaching requires models that can react asynchronously to a video stream, detect whether instructions were executed successfully, and alert the user to a mistake at the moment it occurs visually. That claim is carried by the Qualcomm Interactive Cooking dataset, with its dense timed annotations, and by LiveMamba, a streaming multi-modal baseline that processes continuous input and outputs guidance.

What carries the argument

LiveMamba, a streaming multi-modal LLM designed to process video streams asynchronously and deliver timed instructions plus mistake alerts.
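Figure 2 (reproduced below) names the pieces: an InternViT vision head emitting M tokens per frame, a Q-Former reducing them to K tokens, and a Mamba language backbone that decides what to say and when. The following PyTorch sketch captures that shape only; every module and dimension is an illustrative stand-in (a patchify conv for InternViT, a GRU for the Mamba backbone, both keeping O(1) per-frame state, which is the property that matters for streaming), not the authors' implementation.

```python
import torch
import torch.nn as nn

class StreamingGuidanceSketch(nn.Module):
    """Hedged sketch of a LiveMamba-style streaming pipeline:
    vision head -> M tokens -> Q-Former -> K tokens -> recurrent
    backbone -> 'when-to-say' gate. All modules are stand-ins."""

    def __init__(self, d_model=256, k_tokens=8, n_heads=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(k_tokens, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.speak_gate = nn.Linear(d_model, 1)  # speak now, or stay silent

    def step(self, frame, hidden=None):
        """Consume ONE frame of shape (b, 3, H, W); no future frames."""
        feat = self.patchify(frame)                   # (b, d, h, w)
        m_tok = feat.flatten(2).transpose(1, 2)       # (b, M = h*w, d)
        q = self.queries.unsqueeze(0).expand(frame.shape[0], -1, -1)
        k_tok, _ = self.cross_attn(q, m_tok, m_tok)   # (b, K, d)
        out, hidden = self.backbone(k_tok, hidden)    # carry state forward
        return self.speak_gate(out[:, -1]), hidden    # (b, 1) speak logit
```

Calling step() once per incoming frame, threading hidden state forward, is what makes the model streaming rather than turn-based.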

If this is right

  • State-of-the-art multi-modal LLMs can be systematically benchmarked on their ability to deliver timely feedback during task execution.
  • The dataset enables training and evaluation of models for detecting user errors at the exact moments they appear visually.
  • Streaming architectures support asynchronous responses that match the pace of ongoing human actions rather than discrete turns.
  • AI systems for interactive coaching become feasible for situated tasks that involve physical steps and corrections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same timed-annotation approach could extend to other procedural activities such as equipment assembly or home repair.
  • Real-time mistake alerting might reduce cumulative errors more than delayed feedback given after a task ends.
  • Pairing this capability with physical robots could allow AI to guide hands-on interactions as they unfold.

Load-bearing premise

The densely annotated timed mistake alerts in the dataset accurately reflect real user errors and allow meaningful evaluation of asynchronous real-time performance using existing video data.

What would settle it

A live user study in which participants cook while the model streams guidance and alerts, followed by a check of whether the model's alert timestamps align with independently observed mistakes better than chance timing.
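One way to make "better than chance" concrete: count a mistake as caught when an alert lands within a small window of it, then compare against alerts placed uniformly at random over the video. The ±2 s tolerance and the uniform baseline below are editorial assumptions, not a protocol from the paper.

```python
import numpy as np

def hit_rate(alert_times, mistake_times, tol=2.0):
    """Fraction of observed mistakes with a model alert within
    +/- tol seconds (tol is an assumed tolerance)."""
    alerts = np.asarray(alert_times, dtype=float)
    if len(mistake_times) == 0:
        return 0.0
    return float(np.mean([np.any(np.abs(alerts - m) <= tol)
                          for m in mistake_times]))

def better_than_chance(alert_times, mistake_times, duration_sec,
                       n_perm=10_000, tol=2.0, seed=0):
    """Permutation test against alerts placed uniformly at random
    over the video; returns the observed hit rate and a one-sided
    p-value for 'the model times its alerts better than chance'."""
    rng = np.random.default_rng(seed)
    observed = hit_rate(alert_times, mistake_times, tol)
    null = [hit_rate(rng.uniform(0.0, duration_sec, len(alert_times)),
                     mistake_times, tol) for _ in range(n_perm)]
    p_value = (1 + sum(h >= observed for h in null)) / (n_perm + 1)
    return observed, p_value
```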

Figures

Figures reproduced from arXiv: 2511.21998 by Apratim Bhattacharyya, Bicheng Xu, Leonid Sigal, Litian Liu, Pulkit Madan, Reza Pourreza, Roland Memisevic, Sanjay Haresh, Sunny Panchal.

Figure 1. At the stage where the tomatoes are being sliced, an instruction with the desired thickness …
Figure 1. An overview of the step-by-step task guidance scenario in our Qualcomm Interactive …
Figure 2. Our LIVEMAMBA model architecture. The input video stream is processed by an InternViT vision head which produces M tokens, and is then reduced to K tokens by a Q-Former. The language backbone produces feedback and invokes the Re-planner if necessary before the next instruction.
Figure 3. Our LIVEMAMBA is able to successfully recognize the person has added the black pepper as instructed and points out when the person should heat the oil in a non-stick frying pan, in the Qualcomm Interactive Cooking benchmark.
Figure 4. Data samples from the main set. Left: the user prepares spicy tuna avocado wraps. Right: …
Figure 5. Data samples from the advanced planning set. Left: the user is making ramen. Right: the …
Figure 6. Data samples from the advanced planning set. Left: the user is preparing butter corn cup. …
Figure 7. Predictions from our LIVEMAMBA from the main set of the Qualcomm Interactive Cooking benchmark.
original abstract

Multi-modal Large Language Models (LLM) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Qualcomm Interactive Cooking, a new benchmark and dataset built on CaptainCook4D containing user mistakes during cooking tasks. It features densely annotated timed instructions, feedback, and precisely timestamped mistake alerts. The work evaluates state-of-the-art multi-modal LLMs on this benchmark for live step-by-step guidance and introduces LiveMamba, a streaming multi-modal LLM designed for asynchronous reaction to video streams, claiming to provide the first dedicated benchmark and strong baseline for live situated coaching.

Significance. If the evaluation protocol and results hold, the contribution would be significant for advancing interactive AI assistants beyond turn-based conversation toward real-time, situated coaching. The new dataset with timed mistake annotations addresses a clear gap in resources for training and evaluating asynchronous mistake detection and guidance. The introduction of LiveMamba as a streaming baseline is a concrete step forward, and the focus on video streams with corrections provides a falsifiable testbed for future models.

major comments (1)
  1. [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.
minor comments (1)
  1. [Abstract] The abstract states that the dataset 'contains user mistakes during task execution' and features 'densely annotated, timed instructions and feedback messages' but provides no quantitative details on annotation density, inter-annotator agreement, or the distribution of mistake types; adding these statistics would strengthen the dataset description.
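Such statistics are cheap to compute once annotations are held as timed events. A hedged sketch, reusing the illustrative episode layout above and assuming per-episode durations and a per-alert mistake-type label (both hypothetical fields, not the dataset's published schema):

```python
from collections import Counter

def annotation_stats(episodes):
    """Annotation density and mistake-type distribution, assuming each
    episode exposes `events` (with a `kind` field), a `duration_sec`,
    and a `mistake_type` label on alerts; all hypothetical fields."""
    total_events = sum(len(ep.events) for ep in episodes)
    total_minutes = sum(ep.duration_sec for ep in episodes) / 60.0
    types = Counter(e.mistake_type
                    for ep in episodes
                    for e in ep.events if e.kind == "mistake_alert")
    return {"events_per_minute": total_events / total_minutes,
            "mistake_types": dict(types)}
```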

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on the evaluation and benchmark setup below.

point-by-point responses
  1. Referee: [Evaluation / Benchmark Setup] The central claim that the Qualcomm Interactive Cooking benchmark enables meaningful evaluation of live, situated coaching (Abstract) requires that model evaluations strictly simulate streaming input with no access to future visual information. The skeptic concern is load-bearing here: if annotations were produced offline with full video context and if LiveMamba or baseline evaluations permit any batch processing or lookahead over the clip, the reported performance would not demonstrate true asynchronous real-time capability. Clarification is needed on the exact streaming protocol, frame-by-frame input constraints, and whether any future-frame information leaks into the model or annotation process.

    Authors: We agree that a strict streaming protocol without future information is essential to support our claims of live, situated coaching. In the manuscript, we describe LiveMamba as a streaming model that reacts asynchronously to video streams. To address the concern directly: model evaluations are performed by feeding frames in sequential order without providing access to subsequent frames. The annotations in the dataset were created with full context to accurately timestamp mistakes, but this does not affect the model inputs during evaluation. We will expand the experimental setup section to explicitly detail the frame-by-frame input constraints and confirm the absence of any lookahead or batch processing in the reported results. revision: yes
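As an editorial illustration of what that commitment entails, a strictly causal evaluation loop looks like the sketch below; `model.step` is a hypothetical one-frame-at-a-time interface, not the authors' API.

```python
def evaluate_streaming(model, frames, fps=2.0):
    """Feed frames strictly in order; the model never sees a future
    frame. `model.step` is a hypothetical interface that consumes one
    frame plus recurrent state and returns (utterance_or_None, state).
    Returns (timestamp, message) pairs for each moment the model spoke."""
    state, outputs = None, []
    for i, frame in enumerate(frames):       # past-to-present only
        t = i / fps
        utterance, state = model.step(frame, state)
        if utterance is not None:            # the 'when-to-say' gate fired
            outputs.append((t, utterance))
        # No buffering of future frames and no second pass over the
        # clip: any batch re-processing here would leak lookahead.
    return outputs
```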

Circularity Check

0 steps flagged

No circularity: new benchmark and model proposal

full rationale

The paper introduces the Qualcomm Interactive Cooking benchmark built on CaptainCook4D with new dense timed annotations for mistakes and instructions, then evaluates existing MLLMs and proposes LiveMamba as a streaming baseline. No equations, fitted parameters, or derivations are present that reduce predictions to inputs by construction. Central claims rest on dataset creation and empirical evaluation rather than self-definitional loops, self-citation load-bearing premises, or renaming of prior results. The work is self-contained as an empirical contribution with external benchmarks for comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical applied paper; no mathematical derivations, free parameters, or invented physical entities are described in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 966 out tokens · 27337 ms · 2026-05-17T05:32:22.441929+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    LIVEMAMBA utilizes a lightweight Mamba backbone, a 'when-to-say' mechanism, novel data augmentation for mistake recognition, and iterative re-planning for adaptive delivery.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

  3. [3]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  5. [5]

    Can foundation models watch, talk and guide you step by step to make a cake? InEMNLP Findings, 2023

    Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? InEMNLP Findings, 2023

  6. [6]

    Look, remember and reason: Visual reasoning with grounded rationales

    Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. InICLR, 2024

  7. [7]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InECCV, 2020

  8. [8]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

  9. [9]

    Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

  11. [11]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InNeurIPS, 2023

  12. [12]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. InIJCV, 2022

  13. [13]

    Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

  14. [14]

    StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

    Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, and Ting Cao. StreamMind: Unlocking full frame rate streaming video dialogue through event-gated cognition. arXiv preprint arXiv:2503.06220, 2025

  15. [15]

    Qwen2.5 Technical Report

    An Yang et al. Qwen2.5 technical report. CoRR, abs/2412.15115, 2024

  16. [16]

    Qwen3 Technical Report

An Yang et al. Qwen3 technical report. CoRR, abs/2505.09388, 2025

  17. [17]

    Ego4d: Around the World in 3,000 Hours of Egocentric Video

    Kristen Grauman et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

  18. [18]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah I Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024

  19. [19]

    Qwen2.5-VL Technical Report

    Shuai Bai et al. Qwen2.5-vl technical report. CoRR, abs/2502.13923, 2025

  20. [20]

The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. InICCV, 2017

  21. [21]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  22. [22]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4d: Around the...

  23. [23]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023

  24. [24]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. doi: 10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752

  25. [25]

LVIS: A Dataset for Large Vocabulary Instance Segmentation

    Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. InCVPR, 2019

  26. [26]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  27. [27]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  28. [28]

    Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization.arXiv preprint arXiv:2402.03161, 2024

    Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization.arXiv preprint arXiv:2402.03161, 2024

  29. [29]

    Topological sorting of large networks.Communications of the ACM, 5(11): 558–562, 1962

    Arthur B Kahn. Topological sorting of large networks.Communications of the ACM, 5(11): 558–562, 1962

  30. [30]

    Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

  31. [31]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  32. [32]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models. InICML, 2023

  33. [33]

    VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  34. [34]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024

  35. [35]

    LION-FS: fast & slow video- language thinker as online video assistant

    Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. LION-FS: fast & slow video- language thinker as online video assistant. InCVPR, 2025

  36. [36]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  37. [37]

    Ovo-bench: How far is your video-llms from real-world online video understanding?arXiv preprint arXiv:2501.05510, 2025

    Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding?arXiv preprint arXiv:2501.05510, 2025

  38. [38]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

  39. [39]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  40. [40]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  41. [41]

    Streamchat: Chatting with streaming video.arXiv preprint arXiv:2412.08646, 2024

    Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, and Jose M Alvare. Streamchat: Chatting with streaming video.arXiv preprint arXiv:2412.08646, 2024

  42. [42]

    Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  43. [43]

    Video-ChatGPT: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,ACL, August 2024

  44. [44]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InICCV, 2019

  45. [45]

    What to say and when to say it: Live fitness coaching as a testbed for situated interaction

    Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, and Roland Memisevic. What to say and when to say it: Live fitness coaching as a testbed for situated interaction. InNeurIPS, 2024

  46. [46]

Captaincook4d: A dataset for understanding errors in procedural activities

    Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric D. Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for understanding errors in procedural activities. InNeurIPS, 2024

  47. [47]

    Can vision-language models answer face to face questions in the real-world? arXiv preprint arXiv:2503.19356, 2025

    Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, and Roland Memisevic. Can vision-language models answer face to face questions in the real-world? arXiv preprint arXiv:2503.19356, 2025

  48. [48]

    Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  49. [49]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. InCVPR, 2022

  50. [50]

    Ego4d goal-step: Toward hierarchical understanding of procedural activities

    Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: Toward hierarchical understanding of procedural activities. In NeurIPS, 2023

  51. [51]

    COIN: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019

  52. [52]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019

  53. [53]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.CoRR, abs/2507.06261, 2025

  54. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805, 2023

  55. [55]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  56. [56]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  57. [57]

    Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. InICCV, 2023

  58. [58]

    OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. OmniMMI: A comprehensive multi-modal interaction benchmark in streaming video contexts. arXiv preprint arXiv:2503.22952, 2025

  59. [59]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InCVPR, 2025

  60. [60]

    Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

    Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

  61. [61]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos.arXiv preprint arXiv:2504.17343, 2025

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos.arXiv preprint arXiv:2504.17343, 2025

  62. [62]

    Videollama 3: Frontier multimodal foundation models for image and video understanding, 2025

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. Videollama 3: Frontier multimodal foundation models for image and video understanding, 2025

  63. [63]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InEMNLP - System Demonstrations, 2023

  64. [64]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023. URL https://arxiv.org/abs/2306.02858

  65. [65]

    Flash-vstream: Memory-based real-time understanding for long video streams.arXiv preprint arXiv:2406.08085, 2024

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024

  66. [66]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. InAAAI, 2018

  67. [67]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023
