arxiv: 2509.02949 · v2 · submitted 2025-09-03 · 💻 cs.CL · cs.CV

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal QA · procedural reasoning · assembly tasks · video and manual understanding · dataset · model benchmarking · task graphs

The pith

ProMQA-Assembly supplies 646 multimodal QA pairs on assembly videos and manuals to test procedural reasoning in AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset of 646 question-answer pairs drawn from human assembly videos paired with their instruction manuals. Questions are generated through a semi-automated process in which language models propose candidate pairs and humans verify them; fine-grained action labels are fed into generation to diversify question types. The authors also produce 81 instruction task graphs that outline the assembly sequences and support both human verification and model benchmarking. Benchmarks show that the questions remain challenging for the evaluated multimodal models, with reasoning models showing promising results. The resource is positioned to help develop assistants that can follow step-by-step physical tasks.

Core claim

The paper establishes ProMQA-Assembly as a collection of 646 QA pairs that require integrated understanding of assembly videos and their accompanying instruction manuals, posed in an online-style manner. The pairs are created via LLM-assisted candidate generation followed by human verification, and the dataset is augmented with 81 instruction task graphs. Model benchmarks indicate that the questions are challenging, with reasoning models showing stronger results.

What carries the argument

The semi-automated QA annotation pipeline that has LLMs generate candidate pairs which humans verify, combined with fine-grained action labels to increase question variety and 81 instruction task graphs to support verification and evaluation.

If this is right

  • Assembly-task assistants can now be evaluated on realistic multimodal procedural questions rather than text-only or video-only tests.
  • Reasoning-focused models appear better suited than standard multimodal models for handling sequences that span video demonstrations and written instructions.
  • The task graphs provide a structured way to verify question quality and to measure whether models follow correct assembly order.
  • The dataset supports development of systems that interact with humans during everyday or industrial assembly activities.
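
The task-graph point above can be made concrete with a small sketch. It assumes a task graph is stored as a mapping from each step to the set of steps that must precede it; that representation, the function names, and the toy step names are all hypothetical and are not taken from the paper. Only the "START" and "END" nodes mirror what the paper describes.

```python
from typing import Dict, List, Set

# Hypothetical representation: step -> set of steps that must come before it.
TaskGraph = Dict[str, Set[str]]


def order_violations(graph: TaskGraph, performed: List[str]) -> List[str]:
    """Return the steps that were performed before all their prerequisites."""
    done: Set[str] = set()
    violations: List[str] = []
    for step in performed:
        missing = graph.get(step, set()) - done
        if missing:
            violations.append(step)
        done.add(step)
    return violations


def next_valid_steps(graph: TaskGraph, performed: List[str]) -> List[str]:
    """Return the not-yet-performed steps whose prerequisites are all done."""
    done = set(performed)
    return sorted(
        step for step, prereqs in graph.items()
        if step not in done and prereqs <= done
    )


# Toy graph with made-up step names (illustration only).
toy_graph: TaskGraph = {
    "START": set(),
    "attach wheels": {"START"},
    "attach cabin": {"START"},
    "attach roof": {"attach cabin"},
    "END": {"attach wheels", "attach roof"},
}

if __name__ == "__main__":
    print(order_violations(toy_graph, ["START", "attach roof", "attach cabin"]))
    print(next_valid_steps(toy_graph, ["START", "attach cabin"]))
```

A prerequisite-set encoding allows several valid step orders for the same toy, which matches the observation that different users assemble the same toy in different orders.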

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semi-automated pipeline could be reused to build comparable QA resources for other procedural domains such as cooking or repair tasks.
  • Task graphs might be incorporated directly into model training to improve step-by-step consistency beyond evaluation.
  • The online-style presentation of questions could highlight differences between models that process information incrementally versus those that wait for complete input.

Load-bearing premise

The semi-automated process of LLM-generated QA candidates followed by human verification produces questions that genuinely demand combined use of video and manual information rather than being answerable from either source alone.

What would settle it

If leading multimodal models achieve near-perfect accuracy on the questions when given only the text of the manuals and no video, or only the video and no manual text, the claim that the questions require multimodal procedural understanding would be undermined.
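
One way to run this check is to score the same questions under three input conditions and compare accuracies. The sketch below assumes per-question correctness judgments are already available as booleans; the condition names, the function names, and the sample judgments are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List

FULL = "video+manual"  # hypothetical label for the full multimodal condition


def accuracy(judgments: List[bool]) -> float:
    """Fraction of questions judged correct (0.0 for an empty list)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def multimodal_gap(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Accuracy drop of each ablated condition relative to the full input.

    A gap near zero for some ablation suggests the questions can largely be
    answered without the modality that was removed.
    """
    full_acc = accuracy(results[FULL])
    return {
        condition: full_acc - accuracy(judgments)
        for condition, judgments in results.items()
        if condition != FULL
    }


# Placeholder judgments for four questions (illustration only).
sample_results = {
    "video+manual": [True, True, True, False],
    "manual_only": [True, False, False, False],
    "video_only": [True, True, False, False],
}

if __name__ == "__main__":
    for condition, gap in multimodal_gap(sample_results).items():
        print(f"{condition}: gap = {gap:.2f}")
```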

Figures

Figures reproduced from arXiv: 2509.02949 by Ken Fukuda, Kimihiro Hasegawa, Masaki Asada, Susan Holm, Teruko Mitamura, Vincent Zhou, Wiradee Imrattanatrai, Yuran Wang.

Figure 1: Task example: A user performs actions based on instructions. When the user asks a
Figure 2: QA generation prompts
Figure 3: Task graph annotation interface
    From Section 5.1 (Preprocess): The initial sets of nodes were collected based on the coarse action labels in Assembly101. Then, we add “START” and “END” steps for each graph. Also, we show a set of recordings, where multiple users assemble the same toy often in different step orders
Figure 4: QA generation prompt example: “Default” for “location” type.
Figure 5: QA generation prompt example: “With fine” for “missing” type. Note that some fine-grained actions are omitted for brevity.
Figure 6: QA generation prompt example: “With image” for “past” type. Note that a corresponding parts’ image as shown in the lower middle of
Figure 7: QA generation prompt example: prompt for question generation in
Figure 8: QA generation prompt example: prompt for answer generation in
Figure 9: QA verification interface. An annotator verifies the question and answers (left panel) based
Figure 10: The different view of QA verification interface.
Figure 11: Benchmarking prompt example. Note that the parts image and sampled frames are omitted
Figure 12: LLM-as-a-judge prompt example.
Original abstract

Assistants on assembly tasks show great potential to benefit humans ranging from helping with everyday tasks to interacting in industrial settings. However, evaluation resources in assembly activities are underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. For cost effectiveness in the data creation, we adopt a semi-automated QA annotation approach, where LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment, as well as in facilitating the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, where reasoning models showcase promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ProMQA-Assembly, a new multimodal QA dataset consisting of 646 question-answer pairs focused on assembly activities. It requires understanding of human activity videos paired with instruction manuals. The dataset is created using a semi-automated pipeline in which LLMs generate candidate QA pairs that humans then verify, with additional use of fine-grained action labels for diversity and 81 newly created instruction task graphs. The authors benchmark several models, including proprietary multimodal ones, and conclude that the questions are challenging while reasoning models show promising performance.

Significance. If the central claim holds—that the QA pairs genuinely require cross-modal integration of video and manual content—this dataset would address an underexplored area in procedural activity understanding and provide a useful evaluation resource for assistants in everyday and industrial assembly settings. The release of task graphs and the cost-effective annotation approach are additional strengths that could support reproducibility and extension by others.

major comments (2)
  1. [Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.
  2. [Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.
minor comments (2)
  1. [Abstract] The abstract and introduction could more clearly distinguish between the contribution of the dataset release versus the specific benchmarking results.
  2. [Task graphs] Notation for the 81 task graphs and how they are used in verification versus benchmarking could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the detailed feedback and will address each point below, proposing specific revisions to the manuscript.

  1. Referee: [Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.

    Authors: We thank the referee for highlighting this important aspect. The current manuscript describes the human verification but does not provide quantitative metrics or unimodal baselines. In the revised version, we will add explicit details on the verification mechanism, including the use of task graphs to guide annotators in ensuring questions require both video and manual information. We will also report a metric such as the percentage of questions that annotators deemed unimodal and include unimodal model baselines in the evaluation to substantiate the multimodal challenge. revision: yes

  2. Referee: [Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.

    Authors: We agree that more details would improve the substantiation of our claims. We will revise the paper to include quantitative error analysis of the benchmarked models, inter-annotator agreement scores for the human verification process, and expanded details on the verification procedure, such as annotator guidelines and the role of task graphs in the process. These additions will better support the reliability and difficulty assessments. revision: yes
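
The rebuttal does not say which agreement statistic would be reported. As one common option, the sketch below computes Cohen's kappa for two annotators who each label the same QA pairs; the choice of kappa, the label names, and the sample labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder verification decisions for four QA pairs (illustration only).
annotator_1 = ["ok", "ok", "ok", "bad"]
annotator_2 = ["ok", "ok", "bad", "bad"]

if __name__ == "__main__":
    print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")
```

Kappa corrects raw agreement for chance, which matters when most candidate pairs are accepted and raw agreement would look high regardless of annotator care.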

Circularity Check

0 steps flagged

No circularity: empirical dataset release with independent benchmarking


The paper introduces a new dataset (ProMQA-Assembly) via semi-automated LLM candidate generation plus human verification, augmented by task graphs and action labels. It then reports empirical benchmarks on a set of models that includes proprietary multimodal ones. No equations, fitted parameters, predictions, or derivation chains appear. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on the released data and observed model performance rather than any reduction to inputs by construction. This is a standard non-circular dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM-generated candidates plus human verification yield accurate and diverse multimodal questions without systematic bias or loss of procedural structure.

axioms (1)
  • domain assumption: LLMs can produce candidate QA pairs from videos and manuals that humans can efficiently verify for quality and diversity
    Invoked in the description of the cost-effective semi-automated annotation approach.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

    cs.CV 2026-05 unverdicted novelty 7.0

    AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.

  2. ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

    cs.CV 2025-12 conditional novelty 7.0

    ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    Claude 3.7 sonnet and claude code, 2025

    Anthropic. Claude 3.7 sonnet and claude code, 2025. URL https://www.anthropic.com/ news/claude-3-7-sonnet

  2. [2]

    Video-mined task graphs for keystep recognition in instructional videos

    Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36:67833–67846, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  4. [4]

    The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose

    Yizhak Ben-Shabat, Xin Yu, Fatemeh Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, and Stephen Gould. The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 847–859, January 2021

  5. [5]

    A virtual reality training curriculum for laparoscopic colorectal surgery

    Laura Beyer-Berjot, Stéphane Berdah, Daniel A Hashimoto, Ara Darzi, and Rajesh Aggarwal. A virtual reality training curriculum for laparoscopic colorectal surgery. Journal of surgical education, 73(6):932–941, 2016

  6. [6]

    The epic-kitchens dataset: Collection, challenges and baselines

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  8. [8]

    Temporal action segmentation: An analysis of modern techniques

    Guodong Ding, Fadime Sener, and Angela Yao. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46: 1011–1030, 2022. URL https://api.semanticscholar.org/CorpusID:252992530

  9. [9]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  10. [10]

    Flow graph to video grounding for weakly-supervised multi-step localization

    Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In European Conference on Computer Vision, pages 319–335. Springer, 2022

  11. [11]

    Prego: Online mistake detection in procedural egocentric videos

    Alessandro Flaborea, Guido Maria D’Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, Giovanni Maria Farinella, and Fabio Galasso. Prego: Online mistake detection in procedural egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18483–18492, June 2024

  12. [12]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 10

  13. [13]

    Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291, 2019

    Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291, 2019

  14. [14]

    Introducing gemini 2.0: our new ai model for the agentic era, 2024

    Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024. URL https://blog.google/technology/google-deepmind/ google-gemini-ai-update-december-2024/

  15. [15]

    Gemini 2.5: Our most intelligent ai model, 2025

    Google. Gemini 2.5: Our most intelligent ai model, 2025. URL https://blog.google/ technology/google-deepmind/gemini-model-thinking-updates-march-2025/ #gemini-2-5-thinking

  16. [16]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  17. [17]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages ...

  18. [18]

    Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024

    Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, and Shinsuke Mori. Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024. URL https://arxiv.org/abs/2410.05343

  19. [19]

    ProMQA: Question answering dataset for multimodal procedural activity understanding

    Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, and Teruko Mitamura. ProMQA: Question answering dataset for multimodal procedural activity understanding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio...

  20. [21]

    Epic-tent: An egocentric video dataset for camping tent assembly

    Youngkyoon Jang, Brian Sullivan, Casimir Ludwig, Iain Gilchrist, Dima Damen, and Walterio Mayol-Cuevas. Epic-tent: An egocentric video dataset for camping tent assembly. In Proceed- ings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019

  21. [22]

    Multimodal subtask graph generation from instructional videos

    Yunseok Jang, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Moontae Lee, and Honglak Lee. Multimodal subtask graph generation from instructional videos. arXiv preprint arXiv:2302.08672, 2023

  22. [23]

    A new measure of rank correlation

    Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938

  23. [24]

    The language of actions: Recovering the syntax and semantics of goal-directed human activities

    Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014

  24. [25]

    CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans

    Yash Kumar Lal, Vanya Cohen, Nathanael Chambers, Niranjan Balasubramanian, and Ray Mooney. CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19336–1935...

  25. [26]

    Error detection in egocentric procedural task videos

    Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, and Ehsan Elhamifar. Error detection in egocentric procedural task videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18655–18666, June 2024

  26. [27]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13872–13882, June 2024

  27. [28]

    A diversity-promoting objective function for neural conversation models

    Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119...

  28. [29]

    IKEA manuals at work: 4d grounding of assembly instructions on internet videos

    Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, and Jiajun Wu. IKEA manuals at work: 4d grounding of assembly instructions on internet videos. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  29. [30]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

  30. [31]

    Egoschema: A diag- nostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diag- nostic benchmark for very long-form video language understanding. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neu- ral Information Processing Systems , volume 36, pages 46212–46244. Curran Associates, Inc., 2023. URL https://...

  31. [32]

    The brio-ta dataset: Understanding anomalous assembly process in manufacturing

    Kosuke Moriwaki, Gaku Nakano, and Tetsuo Inoshita. The brio-ta dataset: Understanding anomalous assembly process in manufacturing. In 2022 IEEE International Conference on Image Processing (ICIP), pages 1991–1995. IEEE, 2022

  32. [33]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

  33. [34]

    Openai o3-mini, 2025

    OpenAI. Openai o3-mini, 2025. URL https://openai.com/index/openai-o3-mini/

  34. [35]

    Llm evaluators recognize and favor their own generations

    Arjun Panickssery, Samuel Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

  35. [36]

    Captaincook4d: A dataset for un- derstanding errors in procedural activities

    Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for un- derstanding errors in procedural activities. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, a...

  36. [37]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ f4a04396c2ed1342a5d8d05e94cb6101-Paper-Datasets_and_Benchmarks_Track. pdf

  37. [38]

    Large language models sensitivity to the order of options in multiple-choice questions

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.1...

  38. [39]

    The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain

    Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1569–1578, January 2021

  39. [40]

    Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension

    Anna Rogers, Matt Gardner, and Isabelle Augenstein. Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv., 55(10), February 2023. ISSN 0360-0300. doi: 10.1145/3560260. URL https://doi.org/10.1145/ 3560260

  40. [41]

    Cvqa: Culturally-diverse multilingual visual question answering benchmark

    David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, et al. Cvqa: Culturally-diverse multilingual visual question answering benchmark. arXiv preprint arXiv:2406.05967, 2024

  41. [42]

    emnlp-main.308/

    Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, and Yejin Choi. proScript: Partially ordered scripts generation. In Marie-Francine Moens, Xu- anjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021 , pages 2138–2149, Punta Cana, Dominican Re- public...

  42. [43]

    Scripts, plans, goals, and understanding: An inquiry into human knowledge structures

    Roger C Schank and Robert P Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Lawrence Erlbaum, 1977

  43. [44]

    Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting

    Tim J Schoonbeek, Tim Houben, Hans Onvlee, Fons van der Sommen, et al. Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4365–4374, 2024

  44. [45]

    Differentiable task graph learning: Procedural activity representation and online mistake detection from egocentric videos

    Luigi Seminara, Giovanni Maria Farinella, and Antonino Furnari. Differentiable task graph learning: Procedural activity representation and online mistake detection from egocentric videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=2HvgvB4aWq

  45. [46]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21096–21106, June 2022

  46. [47]

    Sebastian Stein and Stephen J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. InProceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’13, page 729–738, New York, NY , USA, 2013. Association for Computing Machinery. ISBN 9781450317702. doi: 10...

  47. [48]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  48. [49]

    Temporal evaluation

    Naushad UzZaman and James Allen. Temporal evaluation. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 351–356, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://aclanthology. org/P11-2061/

  49. [50]

    Lvbench: An extreme long video understanding benchmark, 2024

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark, 2024. 13

  50. [51]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV),...

  51. [52]

    MMLU-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processi...

  52. [53]

    Dataset bias mitigation in multiple-choice visual question answering and beyond

    Zhecan Wang, Long Chen, Haoxuan You, Keyang Xu, Yicheng He, Wenhao Li, Noel Codella, Kai-Wei Chang, and Shih-Fu Chang. Dataset bias mitigation in multiple-choice visual question answering and beyond. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8598–8617, Singapore, ...

  53. [54]

    WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

    Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Wang Yutong, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Cheng Ching Lam, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri,...

  54. [55]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, June 2021

  55. [56]

    Finebio: a fine-grained video dataset of biological experiments with hierarchical annotation

    Takuma Yagi, Misaki Ohashi, Yifei Huang, Ryosuke Furuta, Shungo Adachi, Toutai Mitsuyama, and Yoichi Sato. Finebio: a fine-grained video dataset of biological experiments with hierarchical annotation. arXiv preprint arXiv:2402.00293, 2024

  56. [57]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  57. [58]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024. URL https://arxiv.org/abs/2410.02713

  58. [59]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processin...

  59. [60]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.12342. URL https://ojs.aaai.org/index.php/AAAI/article/view/12342

  60. [61]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

A QA annotation

A.1 Preprocess

To make full use of fine-grained action labels, the preprocessin...