ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa , Wiradee Imrattanatrai , Masaki Asada , Susan Holm , Yuran Wang , Vincent Zhou , Ken Fukuda , Teruko Mitamura

Authors on Pith no claims yet

classification 💻 cs.CL cs.CV

keywords assemblydatasetmultimodalevaluationmodelspromqa-assemblytasksactivities

0 comments

read the original abstract

Assistants on assembly tasks show great potential to benefit humans ranging from helping with everyday tasks to interacting in industrial settings. However, evaluation resources in assembly activities are underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. For cost effectiveness in the data creation, we adopt a semi-automated QA annotation approach, where LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment, as well as in facilitating the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, where reasoning models showcase promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
cs.CV 2026-05 unverdicted novelty 7.0

AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.