ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Ken Fukuda; Kimihiro Hasegawa; Masaki Asada; Susan Holm; Teruko Mitamura; Vincent Zhou; Wiradee Imrattanatrai; Yuran Wang

arXiv: 2509.02949 · v2 · submitted 2025-09-03 · cs.CL · cs.CV

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura

Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3

keywords: multimodal QA · procedural reasoning · assembly tasks · video and manual understanding · dataset · model benchmarking · task graphs

The pith

ProMQA-Assembly supplies 646 multimodal QA pairs on assembly videos and manuals to test procedural reasoning in AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset of 646 question-answer pairs drawn from human assembly videos paired with their instruction manuals. Questions are generated through a semi-automated process in which language models propose candidates that humans then review and refine using fine-grained action labels. The authors also produce 81 task graphs that outline the assembly sequences and support both verification and model evaluation. Benchmarking on the dataset reveals that the questions remain difficult for most multimodal models while reasoning-oriented systems perform better. This resource is positioned to help develop assistants that can follow step-by-step physical tasks.

Core claim

The paper establishes ProMQA-Assembly as a collection of 646 QA pairs that require integrated understanding of assembly videos and their accompanying instruction manuals, posed in an online-style manner. The pairs are created via LLM-assisted candidate generation followed by human verification, and the dataset is augmented with 81 instruction task graphs. Model benchmarks show that the questions are challenging, with reasoning models achieving stronger results.

What carries the argument

The argument rests on the semi-automated QA annotation pipeline, in which LLMs generate candidate pairs that humans verify. It is combined with fine-grained action labels to increase question variety and 81 instruction task graphs to support verification and evaluation.

If this is right

Assembly-task assistants can now be evaluated on realistic multimodal procedural questions rather than text-only or video-only tests.
Reasoning-focused models appear better suited than standard multimodal models for handling sequences that span video demonstrations and written instructions.
The task graphs provide a structured way to verify question quality and to measure whether models follow correct assembly order.
The dataset supports development of systems that interact with humans during everyday or industrial assembly activities.
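
To make the task-graph point concrete, here is a minimal sketch of how a graph with "START" and "END" steps could be used to check whether an observed assembly order respects the required step order. The edge list, the step names, and the function `first_order_violation` are hypothetical illustrations, not the paper's format or code.

```python
# Hypothetical task graph: each (before, after) edge means `before`
# must be completed before `after`. Step names are invented.
EDGES = [
    ("START", "attach_chassis"),
    ("attach_chassis", "attach_wheels"),
    ("attach_chassis", "attach_cabin"),
    ("attach_wheels", "END"),
    ("attach_cabin", "END"),
]


def first_order_violation(edges, observed):
    """Return (step, missing_prerequisites) for the first observed step
    performed before all of its direct prerequisites, or None if the
    observed order is consistent with the graph."""
    # Map every step to the set of steps that must directly precede it.
    preds = {}
    for before, after in edges:
        preds.setdefault(before, set())
        preds.setdefault(after, set()).add(before)

    done = {"START"}  # START is treated as already completed
    for step in observed:
        missing = preds.get(step, set()) - done
        if missing:
            return step, sorted(missing)
        done.add(step)
    return None


if __name__ == "__main__":
    # Two users may assemble the same toy in different valid orders.
    print(first_order_violation(EDGES, ["attach_chassis", "attach_cabin", "attach_wheels"]))
    print(first_order_violation(EDGES, ["attach_wheels", "attach_chassis"]))
```

Checking only direct prerequisites is enough here: each step is accepted only once its own prerequisites are done, so transitive dependencies are enforced step by step.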

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same semi-automated pipeline could be reused to build comparable QA resources for other procedural domains such as cooking or repair tasks.
Task graphs might be incorporated directly into model training to improve step-by-step consistency beyond evaluation.
The online-style presentation of questions could highlight differences between models that process information incrementally versus those that wait for complete input.
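
As an illustration of the online-style point, the sketch below assumes that "online-style" means a model may only see the recording up to the moment the question is asked. Under that assumption, frame sampling would be restricted to the interval ending at the question time. The function name `sample_timestamps` and its parameters are hypothetical and do not come from the paper.

```python
def sample_timestamps(question_time, num_frames):
    """Return `num_frames` evenly spaced timestamps (in seconds) that end
    exactly at `question_time`, so no frame after the question is used."""
    if question_time <= 0 or num_frames <= 0:
        raise ValueError("question_time and num_frames must be positive")
    step = question_time / num_frames
    return [step * (i + 1) for i in range(num_frames)]


if __name__ == "__main__":
    # A question asked 60 seconds into a recording, with 4 sampled frames.
    print(sample_timestamps(60.0, 4))
```

A model that waits for complete input would instead sample across the full recording, which is the contrast the bullet above points at.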

Load-bearing premise

The semi-automated process of LLM-generated QA candidates followed by human verification produces questions that genuinely demand combined use of video and manual information rather than being answerable from either source alone.

What would settle it

If leading multimodal models achieve near-perfect accuracy on the questions when given only the text of the manuals and no video, or only the video and no manual text, the claim that the questions require multimodal procedural understanding would be undermined.
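
The check described above can be sketched as a small ablation report. The code assumes per-question correctness flags are already available for three input conditions; the flags below are toy placeholders, not results from the paper, and `modality_report` is a hypothetical helper.

```python
def accuracy(flags):
    """Fraction of questions marked correct."""
    return sum(flags) / len(flags)


def modality_report(both, manual_only, video_only):
    """Summarize accuracy per input condition, plus the fraction of
    questions that at least one single-modality run answered correctly."""
    if not both or not (len(both) == len(manual_only) == len(video_only)):
        raise ValueError("all conditions must cover the same non-empty question list")
    solved_by_one = [m or v for m, v in zip(manual_only, video_only)]
    return {
        "acc_both": accuracy(both),
        "acc_manual_only": accuracy(manual_only),
        "acc_video_only": accuracy(video_only),
        "frac_solved_by_one_modality": accuracy(solved_by_one),
    }


# Toy placeholder flags for four questions (invented for illustration).
TOY_BOTH = [True, True, False, True]
TOY_MANUAL_ONLY = [True, False, False, False]
TOY_VIDEO_ONLY = [False, False, False, True]

if __name__ == "__main__":
    print(modality_report(TOY_BOTH, TOY_MANUAL_ONLY, TOY_VIDEO_ONLY))
```

A high `frac_solved_by_one_modality` would count against the claim that the questions need both sources; a low value alongside a clearly higher `acc_both` would support it.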

Figures

Figures reproduced from arXiv: 2509.02949 by Ken Fukuda, Kimihiro Hasegawa, Masaki Asada, Susan Holm, Teruko Mitamura, Vincent Zhou, Wiradee Imrattanatrai, Yuran Wang.

**Figure 2.** QA generation prompts

**Figure 3.** Task graph annotation interface

From Section 5.1 (Preprocess): The initial sets of nodes were collected based on the coarse action labels in Assembly101. Then, we add “START” and “END” steps for each graph. Also, we show a set of recordings, where multiple users assemble the same toy often in different step orders.

**Figure 4.** QA generation prompt example: “Default” for “location” type.

**Figure 5.** QA generation prompt example: “With fine” for “missing” type. Note that some fine-grained actions are omitted for brevity.

**Figure 6.** QA generation prompt example: “With image” for “past” type. Note that a corresponding parts’ image as shown in the lower middle of …

**Figure 7.** QA generation prompt example: prompt for question generation in …

**Figure 8.** QA generation prompt example: prompt for answer generation in …

**Figure 9.** QA verification interface. An annotator verifies the question and answers (left panel) based …

**Figure 10.** The different view of QA verification interface.

**Figure 11.** Benchmarking prompt example. Note that the parts image and sampled frames are omitted …

**Figure 12.** LLM-as-a-judge prompt example.

Original abstract

Assistants on assembly tasks show great potential to benefit humans ranging from helping with everyday tasks to interacting in industrial settings. However, evaluation resources in assembly activities are underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. For cost effectiveness in the data creation, we adopt a semi-automated QA annotation approach, where LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment, as well as in facilitating the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, where reasoning models showcase promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A new assembly QA dataset with 646 pairs and task graphs that fills a gap, but the multimodal challenge claim lacks solid checks on whether questions need both video and manual.

This paper releases ProMQA-Assembly, a dataset of 646 QA pairs drawn from assembly videos and instruction manuals, plus 81 task graphs. The main point is to give researchers a resource for testing procedural multimodal understanding in an area that has not had much dedicated evaluation data before. They built it with a semi-automated process where LLMs draft questions and humans verify them, then added fine-grained action labels to vary the question types and used the graphs to support both creation and benchmarking. That approach is practical and directly targets the assembly domain, which is a clear plus for anyone working on real-world human-AI assistants. Benchmarking a few models, including proprietary multimodal ones, and noting that reasoning models do better is a reasonable first step.

The soft spot is that we still do not have strong evidence the questions actually force integration of video and text. The description does not include unimodal baselines, checks that questions cannot be answered from the manual alone, or error analysis focused on modality necessity. Human verification helps, but without those controls it is possible some items are solvable from one source. That makes the claim about challenging multimodal questions rest on an assumption rather than demonstrated fact.

This work is aimed at people building or evaluating multimodal systems for procedural tasks like assembly or similar hands-on activities. Readers who need new test sets in that niche will get direct value from the released data and graphs. It deserves a serious referee because it supplies concrete new resources rather than just another model tweak. I would send it for peer review and ask reviewers to look closely at the question validation process and any quantitative results.

Referee Report

2 major / 2 minor

Summary. The paper presents ProMQA-Assembly, a new multimodal QA dataset consisting of 646 question-answer pairs focused on assembly activities. It requires understanding of human activity videos paired with instruction manuals. The dataset is created using a semi-automated pipeline in which LLMs generate candidate QA pairs that humans then verify, with additional use of fine-grained action labels for diversity and 81 newly created instruction task graphs. The authors benchmark several models, including proprietary multimodal ones, and conclude that the questions are challenging while reasoning models show promising performance.

Significance. If the central claim holds—that the QA pairs genuinely require cross-modal integration of video and manual content—this dataset would address an underexplored area in procedural activity understanding and provide a useful evaluation resource for assistants in everyday and industrial assembly settings. The release of task graphs and the cost-effective annotation approach are additional strengths that could support reproducibility and extension by others.

major comments (2)

[Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.
[Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.
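
The comment above asks for inter-annotator agreement scores without naming a metric. One common choice for two annotators is Cohen's kappa, sketched below as an assumption about what such a report could contain. The "accept"/"reject" labels are hypothetical toy data, not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verification decisions for four candidate QA pairs (invented).
ANNOTATOR_A = ["accept", "accept", "reject", "reject"]
ANNOTATOR_B = ["accept", "reject", "reject", "reject"]

if __name__ == "__main__":
    print(cohens_kappa(ANNOTATOR_A, ANNOTATOR_B))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most candidate pairs receive the same label.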

minor comments (2)

[Abstract] The abstract and introduction could more clearly distinguish between the contribution of the dataset release versus the specific benchmarking results.
[Task graphs] Notation for the 81 task graphs and how they are used in verification versus benchmarking could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the detailed feedback and will address each point below, proposing specific revisions to the manuscript.

Referee: [Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.

Authors: We thank the referee for highlighting this important aspect. The current manuscript describes the human verification but does not provide quantitative metrics or unimodal baselines. In the revised version, we will add explicit details on the verification mechanism, including the use of task graphs to guide annotators in ensuring questions require both video and manual information. We will also report a metric such as the percentage of questions that annotators deemed unimodal and include unimodal model baselines in the evaluation to substantiate the multimodal challenge.

Revision: yes

Referee: [Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.

Authors: We agree that more details would improve the substantiation of our claims. We will revise the paper to include quantitative error analysis of the benchmarked models, inter-annotator agreement scores for the human verification process, and expanded details on the verification procedure, such as annotator guidelines and the role of task graphs in the process. These additions will better support the reliability and difficulty assessments.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset release with independent benchmarking

The paper introduces a new dataset (ProMQA-Assembly) via semi-automated LLM candidate generation plus human verification, augmented by task graphs and action labels. It then reports empirical benchmarks on proprietary multimodal models. No equations, fitted parameters, predictions, or derivation chains appear. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on the released data and observed model performance rather than any reduction to inputs by construction. This is a standard non-circular dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM-generated candidates plus human verification yield accurate and diverse multimodal questions without systematic bias or loss of procedural structure.

axioms (1)

Domain assumption: LLMs can produce candidate QA pairs from videos and manuals that humans can efficiently verify for quality and diversity.
Invoked in the description of the cost-effective semi-automated annotation approach.

pith-pipeline@v0.9.0 · 5749 in / 1199 out tokens · 62596 ms · 2026-05-18T19:59:08.855514+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
cs.CV · 2026-05 · unverdicted · novelty 7.0

AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
cs.CV · 2025-12 · conditional · novelty 7.0

ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

Claude 3.7 sonnet and claude code, 2025

Anthropic. Claude 3.7 sonnet and claude code, 2025. URL https://www.anthropic.com/ news/claude-3-7-sonnet

work page 2025
[2]

Video-mined task graphs for keystep recognition in instructional videos

Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36:67833–67846, 2023

work page 2023
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose

Yizhak Ben-Shabat, Xin Yu, Fatemeh Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, and Stephen Gould. The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 847–859, January 2021

work page 2021
[5]

A virtual reality training curriculum for laparoscopic colorectal surgery

Laura Beyer-Berjot, Stéphane Berdah, Daniel A Hashimoto, Ara Darzi, and Rajesh Aggarwal. A virtual reality training curriculum for laparoscopic colorectal surgery. Journal of surgical education, 73(6):932–941, 2016

work page 2016
[6]

The epic-kitchens dataset: Collection, challenges and baselines

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

work page 2020
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Temporal action segmentation: An analysis of modern techniques

Guodong Ding, Fadime Sener, and Angela Yao. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46: 1011–1030, 2022. URL https://api.semanticscholar.org/CorpusID:252992530

work page 2022
[9]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

work page 2024
[10]

Flow graph to video grounding for weakly-supervised multi-step localization

Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In European Conference on Computer Vision, pages 319–335. Springer, 2022

work page 2022
[11]

Prego: Online mistake detection in procedural egocentric videos

Alessandro Flaborea, Guido Maria D’Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, Giovanni Maria Farinella, and Fabio Galasso. Prego: Online mistake detection in procedural egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18483–18492, June 2024

work page 2024
[12]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291, 2019

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291, 2019

work page arXiv 1909
[14]

Introducing gemini 2.0: our new ai model for the agentic era, 2024

Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024. URL https://blog.google/technology/google-deepmind/ google-gemini-ai-update-december-2024/

work page 2024
[15]

Gemini 2.5: Our most intelligent ai model, 2025

Google. Gemini 2.5: Our most intelligent ai model, 2025. URL https://blog.google/ technology/google-deepmind/gemini-model-thinking-updates-march-2025/ #gemini-2-5-thinking

work page 2025
[16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022
[17]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages ...

work page 2024
[18]

Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, and Shinsuke Mori. Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024. URL https://arxiv.org/abs/2410.05343

work page arXiv 2024
[19]

ProMQA: Question answering dataset for multimodal procedural activity understanding

Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, and Teruko Mitamura. ProMQA: Question answering dataset for multimodal procedural activity understanding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio...

work page 2025
[21]

Epic-tent: An egocentric video dataset for camping tent assembly

Youngkyoon Jang, Brian Sullivan, Casimir Ludwig, Iain Gilchrist, Dima Damen, and Walterio Mayol-Cuevas. Epic-tent: An egocentric video dataset for camping tent assembly. In Proceed- ings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019

work page 2019
[22]

Multimodal subtask graph generation from instructional videos

Yunseok Jang, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Moontae Lee, and Honglak Lee. Multimodal subtask graph generation from instructional videos. arXiv preprint arXiv:2302.08672, 2023

work page arXiv 2023
[23]

A new measure of rank correlation

Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938

work page 1938
[24]

The language of actions: Recovering the syntax and semantics of goal-directed human activities

Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014

work page 2014
[25]

CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans

Yash Kumar Lal, Vanya Cohen, Nathanael Chambers, Niranjan Balasubramanian, and Ray Mooney. CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19336–1935...

work page doi:10.18653/v1/2024.emnlp-main.1077 2024
[26]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, and Ehsan Elhamifar. Error detection in egocentric procedural task videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18655–18666, June 2024

work page 2024
[27]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13872–13882, June 2024

work page 2024
[28]

A diversity-promoting objective function for neural conversation models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119...

work page 2016
[29]

IKEA manuals at work: 4d grounding of assembly instructions on internet videos

Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, and Jiajun Wu. IKEA manuals at work: 4d grounding of assembly instructions on internet videos. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024
[30]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

work page 2024
[31]

Egoschema: A diag- nostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diag- nostic benchmark for very long-form video language understanding. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neu- ral Information Processing Systems , volume 36, pages 46212–46244. Curran Associates, Inc., 2023. URL https://...

work page 2023
[32]

The brio-ta dataset: Understanding anomalous assembly process in manufacturing

Kosuke Moriwaki, Gaku Nakano, and Tetsuo Inoshita. The brio-ta dataset: Understanding anomalous assembly process in manufacturing. In 2022 IEEE International Conference on Image Processing (ICIP), pages 1991–1995. IEEE, 2022

work page 2022
[33]

Hello gpt-4o, 2024

OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

work page 2024
[34]

Openai o3-mini, 2025

OpenAI. Openai o3-mini, 2025. URL https://openai.com/index/openai-o3-mini/

work page 2025
[35]

Llm evaluators recognize and favor their own generations

Arjun Panickssery, Samuel Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

work page 2024
[36]

Captaincook4d: A dataset for un- derstanding errors in procedural activities

Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for un- derstanding errors in procedural activities. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, a...

work page
[37]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ f4a04396c2ed1342a5d8d05e94cb6101-Paper-Datasets_and_Benchmarks_Track. pdf

work page 2024
[38]

Large language models sensitivity to the order of options in multiple-choice questions

Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.1...

work page 2024
[39]

The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain

Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1569–1578, January 2021

work page 2021
[40]

Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension

Anna Rogers, Matt Gardner, and Isabelle Augenstein. Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv., 55(10), February 2023. ISSN 0360-0300. doi: 10.1145/3560260. URL https://doi.org/10.1145/ 3560260

work page doi:10.1145/3560260 2023
[41]

Cvqa: Culturally-diverse multilingual visual question answering benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, et al. Cvqa: Culturally-diverse multilingual visual question answering benchmark. arXiv preprint arXiv:2406.05967, 2024

work page arXiv 2024
[42]

emnlp-main.308/

Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, and Yejin Choi. proScript: Partially ordered scripts generation. In Marie-Francine Moens, Xu- anjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021 , pages 2138–2149, Punta Cana, Dominican Re- public...

work page doi:10.18653/v1/2021 2021
[43]

Scripts, plans, goals, and understanding: An inquiry into human knowledge structures

Roger C Schank and Robert P Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Lawrence Erlbaum, 1977

work page 1977
[44]

Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting

Tim J Schoonbeek, Tim Houben, Hans Onvlee, Fons van der Sommen, et al. Industreal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4365–4374, 2024

work page 2024
[45]

Differentiable task graph learning: Procedural activity representation and online mistake detection from egocentric videos

Luigi Seminara, Giovanni Maria Farinella, and Antonino Furnari. Differentiable task graph learning: Procedural activity representation and online mistake detection from egocentric videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=2HvgvB4aWq

work page 2024
[46]

Assembly101: A large-scale multi-view video dataset for understanding procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21096–21106, June 2022

work page 2022
[47]

Combining embedded accelerometers with computer vision for recognizing food preparation activities

Sebastian Stein and Stephen J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’13, pages 729–738, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450317702. doi: 10...

work page doi:10.1145/2493432.2493482 2013
[48]

Coin: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019
[49]

Temporal evaluation

Naushad UzZaman and James Allen. Temporal evaluation. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 351–356, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://aclanthology.org/P11-2061/

work page 2011
[50]

Lvbench: An extreme long video understanding benchmark, 2024

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark, 2024.

work page 2024
[51]

Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV),...

work page 2023
[52]

MMLU-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processi...

work page 2024
[53]

Dataset bias mitigation in multiple-choice visual question answering and beyond

Zhecan Wang, Long Chen, Haoxuan You, Keyang Xu, Yicheng He, Wenhao Li, Noel Codella, Kai-Wei Chang, and Shih-Fu Chang. Dataset bias mitigation in multiple-choice visual question answering and beyond. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8598–8617, Singapore, ...

work page doi:10.18653/v1/2023 2023
[54]

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines

Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Wang Yutong, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Cheng Ching Lam, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri,...

work page 2025
[55]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, June 2021

work page 2021
[56]

Finebio: a fine-grained video dataset of biological experiments with hierarchical annotation

Takuma Yagi, Misaki Ohashi, Yifei Huang, Ryosuke Furuta, Shungo Adachi, Toutai Mitsuyama, and Yoichi Sato. Finebio: a fine-grained video dataset of biological experiments with hierarchical annotation. arXiv preprint arXiv:2402.00293, 2024

work page arXiv 2024
[57]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

work page 2024
[58]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024. URL https://arxiv.org/abs/2410.02713

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processin...

work page 2023
[60]

Towards automatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.12342. URL https://ojs.aaai.org/index.php/AAAI/article/view/12342

work page doi:10.1609/aaai.v32i1.12342 2018
[61]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

work page internal anchor Pith review Pith/arXiv arXiv 2025

A QA annotation A.1 Preprocess To make full use of fine-grained action labels, the preprocessin...

[1] [1]

Claude 3.7 sonnet and claude code, 2025

Anthropic. Claude 3.7 sonnet and claude code, 2025. URL https://www.anthropic.com/news/claude-3-7-sonnet

work page 2025

[2] [2]

Video-mined task graphs for keystep recognition in instructional videos

Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36:67833–67846, 2023

work page 2023

[3] [3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose

Yizhak Ben-Shabat, Xin Yu, Fatemeh Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, and Stephen Gould. The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 847–859, January 2021

work page 2021

[5] [5]

A virtual reality training curriculum for laparoscopic colorectal surgery

Laura Beyer-Berjot, Stéphane Berdah, Daniel A Hashimoto, Ara Darzi, and Rajesh Aggarwal. A virtual reality training curriculum for laparoscopic colorectal surgery. Journal of surgical education, 73(6):932–941, 2016

work page 2016

[6] [6]

The epic-kitchens dataset: Collection, challenges and baselines

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

work page 2020

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Temporal action segmentation: An analysis of modern techniques

Guodong Ding, Fadime Sener, and Angela Yao. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:1011–1030, 2022. URL https://api.semanticscholar.org/CorpusID:252992530

work page 2022

[9] [9]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

work page 2024

[10] [10]

Flow graph to video grounding for weakly-supervised multi-step localization

Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and Allan D Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In European Conference on Computer Vision, pages 319–335. Springer, 2022

work page 2022

[11] [11]

Prego: Online mistake detection in procedural egocentric videos

Alessandro Flaborea, Guido Maria D’Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, Giovanni Maria Farinella, and Fabio Galasso. Prego: Online mistake detection in procedural egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18483–18492, June 2024

work page 2024

[12] [12]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Question answering is a format; when is it useful?

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291, 2019

work page arXiv 2019

[14] [14]

Introducing gemini 2.0: our new ai model for the agentic era, 2024

Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024. URL https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/

work page 2024

[15] [15]

Gemini 2.5: Our most intelligent ai model, 2025

Google. Gemini 2.5: Our most intelligent ai model, 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

work page 2025

[16] [16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

work page 2022

[17] [17]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages ...

work page 2024

[18] [18]

Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, and Shinsuke Mori. Egooops: A dataset for mistake action detection from egocentric videos with procedural texts, 2024. URL https://arxiv.org/abs/2410.05343

work page arXiv 2024

[19] [19]

ProMQA: Question answering dataset for multimodal procedural activity understanding

Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, and Teruko Mitamura. ProMQA: Question answering dataset for multimodal procedural activity understanding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio...

work page 2025

[20] [21]

Epic-tent: An egocentric video dataset for camping tent assembly

Youngkyoon Jang, Brian Sullivan, Casimir Ludwig, Iain Gilchrist, Dima Damen, and Walterio Mayol-Cuevas. Epic-tent: An egocentric video dataset for camping tent assembly. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019

work page 2019

[21] [22]

Multimodal subtask graph generation from instructional videos

Yunseok Jang, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Moontae Lee, and Honglak Lee. Multimodal subtask graph generation from instructional videos. arXiv preprint arXiv:2302.08672, 2023

work page arXiv 2023

[22] [23]

A new measure of rank correlation

Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938

work page 1938

[23] [24]

The language of actions: Recovering the syntax and semantics of goal-directed human activities

Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014

work page 2014

[24] [25]

CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans

Yash Kumar Lal, Vanya Cohen, Nathanael Chambers, Niranjan Balasubramanian, and Ray Mooney. CaT-bench: Benchmarking language model understanding of causal and temporal dependencies in plans. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19336–1935...

work page doi:10.18653/v1/2024.emnlp-main.1077 2024

[25] [26]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, and Ehsan Elhamifar. Error detection in egocentric procedural task videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18655–18666, June 2024

work page 2024

[26] [27]

Mitigating object hallucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13872–13882, June 2024

work page 2024

[27] [28]

A diversity-promoting objective function for neural conversation models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119...

work page 2016

[28] [29]

IKEA manuals at work: 4d grounding of assembly instructions on internet videos

Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, and Jiajun Wu. IKEA manuals at work: 4d grounding of assembly instructions on internet videos. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024

[29] [30]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

work page 2024

[30] [31]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 46212–46244. Curran Associates, Inc., 2023. URL https://...

work page 2023

[31] [32]

The brio-ta dataset: Understanding anomalous assembly process in manufacturing

Kosuke Moriwaki, Gaku Nakano, and Tetsuo Inoshita. The brio-ta dataset: Understanding anomalous assembly process in manufacturing. In 2022 IEEE International Conference on Image Processing (ICIP), pages 1991–1995. IEEE, 2022

work page 2022

[32] [33]

Hello gpt-4o, 2024

OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

work page 2024

[33] [34]

Openai o3-mini, 2025

OpenAI. Openai o3-mini, 2025. URL https://openai.com/index/openai-o3-mini/

work page 2025

[34] [35]

Llm evaluators recognize and favor their own generations

Arjun Panickssery, Samuel Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

work page 2024

[35] [36]

Captaincook4d: A dataset for understanding errors in procedural activities

Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for understanding errors in procedural activities. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, a...

work page

[36] [37]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/f4a04396c2ed1342a5d8d05e94cb6101-Paper-Datasets_and_Benchmarks_Track.pdf

work page 2024

[37] [38]

Large language models sensitivity to the order of options in multiple-choice questions

Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.1...

work page 2024

[38] [39]

The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain

Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1569–1578, January 2021

work page 2021
