ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3
The pith
ProMQA-Assembly supplies 646 multimodal QA pairs on assembly videos and manuals to test procedural reasoning in AI systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes ProMQA-Assembly as a collection of 646 QA pairs that require integrated understanding of assembly videos and their accompanying instruction manuals, presented in an online style. The pairs are created via LLM-assisted candidate generation followed by human verification, and the dataset is augmented with 81 instruction task graphs. Model benchmarks indicate that these questions are challenging and that reasoning models show stronger results.
What carries the argument
The semi-automated QA annotation pipeline, in which LLMs generate candidate pairs and humans verify them, combined with fine-grained action labels that increase question variety and 81 instruction task graphs that support verification and evaluation.
If this is right
- Assembly-task assistants can now be evaluated on realistic multimodal procedural questions rather than text-only or video-only tests.
- Reasoning-focused models appear better suited than standard multimodal models for handling sequences that span video demonstrations and written instructions.
- The task graphs provide a structured way to verify question quality and to measure whether models follow correct assembly order.
- The dataset supports development of systems that interact with humans during everyday or industrial assembly activities.
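The point about task graphs and assembly order can be made concrete with a small sketch. If a task graph is treated as a directed acyclic graph whose edges mean "this step must come before that step", checking an observed step sequence reduces to a precedence test. Everything below is an assumption made for illustration: the graph encoding, the step names, and the `follows_task_graph` helper are hypothetical and are not the paper's data format or evaluation code.

```python
from typing import Dict, List, Set

# Hypothetical encoding: each step maps to the set of steps that must precede it.
TaskGraph = Dict[str, Set[str]]


def follows_task_graph(sequence: List[str], graph: TaskGraph) -> bool:
    """Return True if every step in `sequence` comes after all of its
    prerequisites, and no step is unknown or repeated.

    Partial sequences (prefixes of a full assembly) are accepted.
    """
    done: Set[str] = set()
    for step in sequence:
        # Reject steps that are not in the graph or were already performed.
        if step not in graph or step in done:
            return False
        # Reject steps whose prerequisites are not all completed yet.
        if not graph[step] <= done:
            return False
        done.add(step)
    return True


# Toy graph with invented step names (not taken from the dataset).
toy_graph: TaskGraph = {
    "attach_chassis": set(),
    "attach_wheels": {"attach_chassis"},
    "attach_cabin": {"attach_chassis"},
    "attach_roof": {"attach_cabin"},
}

valid_order = ["attach_chassis", "attach_cabin", "attach_wheels", "attach_roof"]
invalid_order = ["attach_chassis", "attach_roof", "attach_cabin", "attach_wheels"]

print(follows_task_graph(valid_order, toy_graph))
print(follows_task_graph(invalid_order, toy_graph))
```

Because prefixes are accepted, the same check could in principle be applied to the steps observed so far in a video, which loosely matches the online-style setting described above.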
Where Pith is reading between the lines
- The same semi-automated pipeline could be reused to build comparable QA resources for other procedural domains such as cooking or repair tasks.
- Task graphs might be incorporated directly into model training to improve step-by-step consistency beyond evaluation.
- The online-style presentation of questions could highlight differences between models that process information incrementally versus those that wait for complete input.
Load-bearing premise
The semi-automated process of LLM-generated QA candidates followed by human verification produces questions that genuinely demand combined use of video and manual information rather than being answerable from either source alone.
What would settle it
If leading multimodal models achieve near-perfect accuracy on the questions when given only the text of the manuals and no video, or only the video and no manual text, the claim that the questions require multimodal procedural understanding would be undermined.
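The test described above amounts to comparing accuracy across input conditions. The following is a minimal sketch under stated assumptions: the condition names (`video+manual`, `manual_only`, `video_only`), the `multimodal_gap` helper, and the per-question outcomes are all invented for illustration and are not results or tooling from the paper.

```python
from typing import Dict, List


def accuracy(correct_flags: List[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct_flags:
        raise ValueError("no results to score")
    return sum(correct_flags) / len(correct_flags)


def multimodal_gap(results: Dict[str, List[bool]]) -> float:
    """Accuracy with both inputs minus the best single-modality accuracy."""
    both = accuracy(results["video+manual"])
    best_single = max(
        accuracy(results["manual_only"]),
        accuracy(results["video_only"]),
    )
    return both - best_single


# Invented per-question outcomes for four questions (hypothetical data).
toy_results = {
    "video+manual": [True, True, True, False],
    "manual_only": [True, False, False, False],
    "video_only": [True, True, False, False],
}

gap = multimodal_gap(toy_results)
print(f"multimodal gap: {gap:.2f}")
```

A gap close to zero, together with high single-modality accuracy, would point toward the outcome that undermines the multimodal claim; a large positive gap would be consistent with the questions needing both sources.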
Original abstract
Assistants on assembly tasks show great potential to benefit humans ranging from helping with everyday tasks to interacting in industrial settings. However, evaluation resources in assembly activities are underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. For cost effectiveness in the data creation, we adopt a semi-automated QA annotation approach, where LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment, as well as in facilitating the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, where reasoning models showcase promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ProMQA-Assembly, a new multimodal QA dataset consisting of 646 question-answer pairs focused on assembly activities. It requires understanding of human activity videos paired with instruction manuals. The dataset is created using a semi-automated pipeline in which LLMs generate candidate QA pairs that humans then verify, with additional use of fine-grained action labels for diversity and 81 newly created instruction task graphs. The authors benchmark several models, including proprietary multimodal ones, and conclude that the questions are challenging while reasoning models show promising performance.
Significance. If the central claim holds—that the QA pairs genuinely require cross-modal integration of video and manual content—this dataset would address an underexplored area in procedural activity understanding and provide a useful evaluation resource for assistants in everyday and industrial assembly settings. The release of task graphs and the cost-effective annotation approach are additional strengths that could support reproducibility and extension by others.
major comments (2)
- [Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.
- [Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.
minor comments (2)
- [Abstract] The abstract and introduction could more clearly distinguish between the contribution of the dataset release versus the specific benchmarking results.
- [Task graphs] Notation for the 81 task graphs and how they are used in verification versus benchmarking could be clarified for readers.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the detailed feedback and will address each point below, proposing specific revisions to the manuscript.
- Referee: [Data creation] Data creation section: the human verification step in the semi-automated pipeline is described as ensuring quality, but the manuscript provides no explicit mechanism, reported metric, or unimodal baseline to confirm that questions cannot be solved from the manual text alone or from video frames alone. This assumption is load-bearing for the claim that ProMQA-Assembly contains challenging multimodal questions.
  Authors: We thank the referee for highlighting this important aspect. The current manuscript describes the human verification but does not provide quantitative metrics or unimodal baselines. In the revised version, we will add explicit details on the verification mechanism, including the use of task graphs to guide annotators in ensuring questions require both video and manual information. We will also report a metric such as the percentage of questions that annotators deemed unimodal and include unimodal model baselines in the evaluation to substantiate the multimodal challenge.
  revision: yes
- Referee: [Benchmarking] Benchmarking and evaluation: no quantitative error analysis, inter-annotator agreement scores, or details on how verification was performed are reported. This leaves the support for dataset reliability and question difficulty only moderately substantiated, directly affecting the strength of the benchmarking conclusions.
  Authors: We agree that more details would improve the substantiation of our claims. We will revise the paper to include quantitative error analysis of the benchmarked models, inter-annotator agreement scores for the human verification process, and expanded details on the verification procedure, such as annotator guidelines and the role of task graphs in the process. These additions will better support the reliability and difficulty assessments.
  revision: yes
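For readers unfamiliar with inter-annotator agreement, one common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the rebuttal does not say which agreement statistic would be used, and the `keep`/`reject` labels and annotator data are invented for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; by convention
        # this sketch reports perfect agreement instead of dividing by zero.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented verification decisions on six candidate QA pairs (hypothetical).
annotator_1 = ["keep", "keep", "reject", "keep", "reject", "keep"]
annotator_2 = ["keep", "reject", "reject", "keep", "reject", "keep"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on five of six items (observed agreement 5/6) and chance agreement is 0.5, so kappa works out to 2/3.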
Circularity Check
No circularity: empirical dataset release with independent benchmarking
The paper introduces a new dataset (ProMQA-Assembly) via semi-automated LLM candidate generation plus human verification, augmented by task graphs and action labels. It then reports empirical benchmarks on proprietary multimodal models. No equations, fitted parameters, predictions, or derivation chains appear. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on the released data and observed model performance rather than any reduction to inputs by construction. This is a standard non-circular dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can produce candidate QA pairs from videos and manuals that humans can efficiently verify for quality and diversity
Forward citations
Cited by 2 Pith papers
- AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
  AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.
- ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
  ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.