arxiv: 2512.03479 · v2 · submitted 2025-12-03 · 💻 cs.CV

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

Wenliang Guo, Yu Kong

Pith reviewed 2026-05-17 03:02 UTC · model grok-4.3

keywords object-centric reasoning · procedural understanding · instructional videos · VideoQA benchmark · evidence grounding · multimodal models · state transitions · temporal localization

The pith

Models give plausible answers about object changes in videos but fail to point to the visual evidence for those answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Procedural activities in videos are driven by how objects change state over time, but existing benchmarks test actions instead of these object evolutions. ProcObject-10K supplies 10,522 open-ended questions from instructional videos, each linked to specific visual evidence and covering five reasoning types: preconditions, state evolution, counterfactuals, mistakes, and readiness. When 13 leading multimodal models are tested on the benchmark, they produce plausible answers yet localize the supporting video segments poorly (mIoU below 45%). This gap indicates the models lean on language patterns rather than tracking fine details of object dynamics. The authors also show that fine-tuning with object-level supervision lifts performance on this benchmark and improves results on related grounded question-answering and planning tasks.

Core claim

The paper establishes that current multimodal large language models display a clear answering-grounding gap on object-centric procedural tasks: they generate plausible answers to questions about object state transitions, preconditions, counterfactuals, mistakes, and readiness, yet achieve mean intersection-over-union below 45 percent when localizing the temporal and spatial evidence that supports those answers, revealing heavy dependence on linguistic priors instead of fine-grained visual object dynamics.
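To make the metric in this claim concrete, here is a minimal sketch of temporal IoU and its mean over a set of questions. It assumes each grounding is a single `(start_sec, end_sec)` interval, which is a simplification chosen for illustration; the paper's actual annotation format and any spatial component are not specified here, and the names `temporal_iou` and `mean_iou` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); assumed representation


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], golds: List[Interval]) -> float:
    """Average temporal IoU over paired predictions and annotations."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


# Toy intervals with made-up numbers (not taken from the paper).
preds = [(2.0, 6.0), (10.0, 12.0)]
golds = [(4.0, 8.0), (20.0, 25.0)]
print(mean_iou(preds, golds))
```

In this toy case the first prediction overlaps its annotation partially and the second misses entirely, so the mean is pulled down even though one prediction is roughly in the right place.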

What carries the argument

The ProcObject-10K benchmark, which supplies open-ended VideoQA pairs together with spatial-temporal grounding annotations for object state changes across egocentric and exocentric views.

If this is right

Fine-tuned models using pseudo object-level supervision and spatial-temporal constraints improve scores on the ProcObject-10K benchmark itself.
The same fine-tuned models show better transfer performance on other grounded VideoQA datasets and embodied planning tasks.
The benchmark jointly measures both answer correctness and evidence localization, unlike prior action-centric video benchmarks.
Evaluation spans 137 tasks from 9 domains and includes both egocentric and exocentric video perspectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future video models may need explicit training on localization objectives to reduce shortcut reliance on language patterns.
Similar object-state benchmarks could be created for robotics or simulation environments to test physical reasoning.
The observed gap suggests that simply increasing model size without grounding-focused data may not close the divide between answers and evidence.

Load-bearing premise

The 10,522 question-answer pairs and their grounding annotations correctly and without bias capture the object state transitions and temporal evidence needed for the five reasoning types.

What would settle it

Independent re-annotation of a random sample of the questions and groundings that finds frequent mismatches between the labeled video segments and the actual object changes described in the answers.

Figures

Figures reproduced from arXiv: 2512.03479 by Wenliang Guo, Yu Kong.

**Figure 2.** Overview of the data collection pipeline, which combines automatic video sampling and QA generation using LVLMs with …

**Figure 3.** QA types in our benchmark dataset. Different from existing instructional datasets focusing on action analysis, our benchmark …

**Figure 4.** Dataset statistics of our Object-IVQA benchmark dataset.

**Figure 5.** Our agent framework decomposes video QA into planning, processing, analyzing, and generation agents, along with an example …

read the original abstract

Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than fine-grained object dynamics. As a step toward closing this gap, we further provide an object-centric supervised fine-tuning baseline with pseudo object-level supervision and spatial-temporal constraints. Models fine-tuned on ProcObject-10K not only improve on the benchmark itself, but also transfer effectively to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and evaluation toolkit will be publicly released to support future research on object-centric procedural understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProcObject-10K adds a useful new benchmark for object-centric reasoning and grounding in instructional videos, but the reported answering-grounding gap rests on unverified annotation quality.

The paper introduces ProcObject-10K, a benchmark of 10,522 open-ended VideoQA pairs from 1,799 clips across egocentric and exocentric views, and shows that 13 leading MLLMs produce plausible answers yet fail to localize the supporting evidence (mIoU below 45 percent). The authors also provide a fine-tuning baseline using pseudo object-level supervision that improves results and transfers to other grounded QA and planning tasks.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProcObject-10K, the first benchmark for joint object-centric reasoning and temporal evidence grounding in instructional videos. It comprises 10,522 open-ended VideoQA pairs from 1,799 clips across 137 tasks in 9 domains and five reasoning types (preconditions, state evolution, counterfactuals, mistakes, readiness). Benchmarking 13 MLLMs reveals a substantial answering-grounding gap with mIoU <45%, interpreted as evidence of reliance on linguistic priors over fine-grained object dynamics. An object-centric SFT baseline with pseudo-supervision and spatial-temporal constraints is shown to improve benchmark performance and transfer to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and toolkit will be released publicly.

Significance. If the grounding annotations are shown to be reliable, the work has high significance: it provides the first large-scale evidence that current MLLMs fail at localizing object state transitions in procedural videos despite plausible answers, and the transfer results indicate the benchmark can drive progress toward more grounded procedural understanding. The planned public release of data and evaluation code is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Dataset construction / annotation section] No inter-annotator agreement statistics (Cohen's kappa, mean IoU, or per-reasoning-type agreement) or adjudication procedure are reported for the spatial-temporal grounding labels on the 10,522 QA pairs. This directly affects the central claim, because if multiple plausible evidence regions exist or label noise is high, the reported mIoU <45% gap could reflect annotation variance rather than model reliance on linguistic priors.
[Results / evaluation protocol] The paper reports aggregate mIoU <45% but does not include a human performance baseline on the same grounding task or per-reasoning-type breakdowns with confidence intervals. Without these, it is difficult to calibrate whether the gap is diagnostic of model failure or partly an artifact of the annotation protocol.
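One way the agreement statistic requested above could be computed is sketched below: each annotator's evidence interval is turned into per-second binary labels, and Cohen's kappa is taken over those labels. The whole-second discretization, the half-open `[start, end)` interval format, and the function names are assumptions made for illustration, not the authors' protocol.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # [start_sec, end_sec) in whole seconds; assumed format


def to_second_labels(interval: Interval, clip_len: int) -> List[int]:
    """Mark each second of the clip as evidence (1) or not (0)."""
    start, end = interval
    return [1 if start <= t < end else 0 for t in range(clip_len)]


def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two annotators' binary label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0.
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# Toy clip of 10 seconds with two made-up annotations.
labels_a = to_second_labels((2, 6), clip_len=10)
labels_b = to_second_labels((4, 8), clip_len=10)
print(cohens_kappa(labels_a, labels_b))
```

A low kappa on a re-annotated sample would support the concern that several evidence regions are plausible for the same question; a high kappa would make annotation variance a less likely explanation for the gap.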

minor comments (2)

[Evaluation section] Clarify in §4 or the evaluation section whether the mIoU is computed with a fixed threshold or as mean IoU, and whether it is averaged over all QA pairs or per reasoning type.
[Abstract] The abstract states 'mIoU < 45%' without specifying the exact aggregation; add a sentence in the main text that matches the abstract claim precisely.
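The aggregation question raised in these minor comments matters because the same per-question IoU scores can produce different headline numbers. The sketch below contrasts three plausible readings: a mean over all QA pairs, an unweighted mean of per-reasoning-type means, and a thresholded recall. The scores and function names are invented for illustration, and nothing here indicates which variant the paper uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (reasoning_type, IoU for one QA pair)


def micro_miou(rows: List[Row]) -> float:
    """Mean IoU averaged over all QA pairs, regardless of reasoning type."""
    return sum(iou for _, iou in rows) / len(rows)


def per_type_miou(rows: List[Row]) -> Dict[str, float]:
    """Mean IoU computed separately for each reasoning type."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for rtype, iou in rows:
        buckets[rtype].append(iou)
    return {rtype: sum(v) / len(v) for rtype, v in buckets.items()}


def macro_miou(rows: List[Row]) -> float:
    """Unweighted mean of the per-type means, so each type counts equally."""
    per_type = per_type_miou(rows)
    return sum(per_type.values()) / len(per_type)


def recall_at(rows: List[Row], threshold: float) -> float:
    """Fraction of QA pairs whose IoU meets a fixed threshold."""
    return sum(iou >= threshold for _, iou in rows) / len(rows)


# Invented scores for illustration only; not results from the paper.
rows = [
    ("precondition", 0.8),
    ("precondition", 0.6),
    ("precondition", 0.4),
    ("counterfactual", 0.2),
]
print(micro_miou(rows), macro_miou(rows), recall_at(rows, 0.5))
```

On this toy data the all-pairs mean is about 0.5 while the per-type mean is about 0.4, because the under-represented reasoning type scores lower. A one-line statement of the aggregation in the main text would remove this ambiguity.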

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects for strengthening the presentation of our benchmark and results. We address each major comment in detail below.

Referee: [Dataset construction / annotation section] No inter-annotator agreement statistics (Cohen's kappa, mean IoU, or per-reasoning-type agreement) or adjudication procedure are reported for the spatial-temporal grounding labels on the 10,522 QA pairs. This directly affects the central claim, because if multiple plausible evidence regions exist or label noise is high, the reported mIoU <45% gap could reflect annotation variance rather than model reliance on linguistic priors.

Authors: We agree that providing inter-annotator agreement statistics is necessary to support the reliability of the annotations and the validity of our central claim. In the revised version of the manuscript, we will add a dedicated subsection in the dataset construction section detailing the annotation process, including the adjudication procedure used for the spatial-temporal grounding labels. We will also report Cohen's kappa, mean IoU between annotators, and agreement statistics broken down by reasoning type. These additions will help demonstrate that annotation variance is low and does not explain the observed performance gap. revision: yes
Referee: [Results / evaluation protocol] The paper reports aggregate mIoU <45% but does not include a human performance baseline on the same grounding task or per-reasoning-type breakdowns with confidence intervals. Without these, it is difficult to calibrate whether the gap is diagnostic of model failure or partly an artifact of the annotation protocol.

Authors: We concur that a human baseline and more detailed breakdowns would better contextualize the results. We will incorporate a human performance baseline for the grounding task, where human annotators localize the evidence segments for a sampled set of questions, and report the corresponding mIoU. Additionally, we will provide per-reasoning-type mIoU results accompanied by confidence intervals in the updated results section. This will allow readers to better assess the significance of the model-human gap. revision: yes
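The confidence intervals mentioned in this response could be obtained in several ways. The sketch below uses a percentile bootstrap over per-question IoU scores, which is one common choice and not necessarily what the authors will adopt. The function name, defaults, and sample scores are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_ci(values: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Made-up IoU scores for one reasoning type (not from the paper).
ious = [0.10, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.70]
low, high = bootstrap_ci(ious)
print(f"mIoU = {sum(ious) / len(ious):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Running the function once per reasoning type, and once on human annotators' scores for the same sampled questions, would give the per-type intervals and the model-human comparison the referee asks for.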

Circularity Check

0 steps flagged

No significant circularity in this empirical benchmark paper

This is an empirical benchmark paper that introduces new annotated data (10,522 VideoQA pairs across 1,799 clips) and reports model performance metrics on 13 MLLMs without any mathematical derivations, equations, or first-principles claims. The central results consist of observed performance gaps (e.g., mIoU < 45%) and transfer improvements from fine-tuning, which are directly tied to the released dataset and external model evaluations rather than reducing to fitted parameters or self-referential definitions. No load-bearing steps invoke self-citations for uniqueness theorems, ansatzes, or renamings of known results; the work is self-contained against external benchmarks and falsifiable via the public annotations and toolkit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the newly collected annotations faithfully represent object state transitions and temporal evidence; no free parameters are fitted, no new physical entities are postulated, and no mathematical axioms beyond standard evaluation metrics are invoked.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
