Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Pith reviewed 2026-05-22 09:17 UTC · model grok-4.3
The pith
State-of-the-art large vision-language models struggle with fine-grained spatio-temporal reasoning on furniture assembly videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Flat-Pack Bench shows that state-of-the-art LVLMs exhibit substantial limitations in fine-grained spatio-temporal reasoning, including weak use of temporal information from videos, limited tracking ability, and incomplete understanding of spatial interactions such as physical contact.
What carries the argument
Multiple-choice questions paired with visual prompts that highlight relevant parts within furniture assembly videos.
If this is right
- Current models have limited ability to leverage temporal information across video frames.
- Tracking of individual parts through an assembly sequence remains unreliable.
- Recognition of spatial interactions such as physical contact between components is weak.
- Coarse-grained video benchmarks miss the procedural understanding needed for step-by-step tasks.
Where Pith is reading between the lines
- Similar benchmarks focused on other procedural activities such as cooking or mechanical repair could expose comparable gaps.
- Training approaches that add explicit temporal modeling and physical contact simulation may narrow the performance shortfall.
- The benchmark offers a direct metric for measuring whether future models can support humans through complex manual procedures.
Load-bearing premise
Multiple-choice questions paired with visual prompts on furniture assembly videos accurately isolate and measure the targeted fine-grained spatio-temporal capabilities without confounding effects from question phrasing or highlighting choices.
What would settle it
A model that scores near ceiling on the benchmark questions yet still fails to follow or guide real furniture assembly when given the same videos would indicate the questions do not fully isolate the claimed reasoning deficits.
Original abstract
The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Flat-Pack Bench, a new benchmark for assessing large vision-language models on fine-grained spatio-temporal reasoning in furniture assembly videos. Tasks include temporal ordering of assembly actions, temporal localization of states, part mating, and object tracking, evaluated via multiple-choice questions often paired with visual prompts that highlight relevant parts. Experiments on state-of-the-art LVLMs show significant struggles, which the authors attribute to limitations in leveraging temporal video information, tracking ability, and understanding physical interactions such as contact.
Significance. If the benchmark design validly isolates the claimed capabilities, the work would provide a useful new testbed for video understanding beyond coarse action recognition or captioning, targeting practical domains like assembly and instructional videos. The empirical evaluation on multiple models offers concrete evidence of current limitations, and the focus on in-the-wild complex scenarios with non-verbal entities is a strength.
major comments (2)
- [Abstract / Benchmark Construction] Abstract and benchmark description: The central claim that poor performance demonstrates limited temporal reasoning and tracking depends on the tasks requiring integration across video frames. However, the use of 'visual prompts highlighting relevant parts as references for fine-grained questions' risks allowing models to solve tasks (e.g., part mating or state localization) using only static spatial cues from highlighted regions in single frames, without processing temporal sequences or full video context. This potential confound must be ruled out with explicit controls, such as ablation on prompt timing or frame selection.
- [Experiments] Experimental results section: The reported struggles of SOTA LVLMs are summarized in the abstract, but without details on dataset construction (e.g., how videos are segmented, how questions are generated to avoid linguistic shortcuts, or quantitative breakdowns by task), it is difficult to assess whether the performance gaps specifically isolate spatio-temporal deficits rather than other factors like prompt sensitivity or visual highlighting artifacts.
minor comments (2)
- [Benchmark] Clarify the exact format and timing of visual prompts (e.g., whether highlights are overlaid on every frame or only key frames) to improve reproducibility.
- [Related Work] Add more discussion of how the benchmark avoids overlap with existing video QA datasets that use similar assembly or manipulation scenarios.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional controls and details that strengthen the presentation of our benchmark.
- Referee: [Abstract / Benchmark Construction] Abstract and benchmark description: The central claim that poor performance demonstrates limited temporal reasoning and tracking depends on the tasks requiring integration across video frames. However, the use of 'visual prompts highlighting relevant parts as references for fine-grained questions' risks allowing models to solve tasks (e.g., part mating or state localization) using only static spatial cues from highlighted regions in single frames, without processing temporal sequences or full video context. This potential confound must be ruled out with explicit controls, such as ablation on prompt timing or frame selection.
  Authors: We appreciate the referee's identification of this potential confound. While our visual prompts are designed to focus attention on relevant entities for fine-grained questions, we agree that explicit controls are necessary to confirm reliance on temporal integration. In the revised manuscript, we have added an ablation study evaluating model performance when provided with only single frames (or randomly selected frames) containing the same visual prompts. Results show substantial performance degradation compared to full video input, supporting that the tasks require cross-frame reasoning. We have also clarified in the benchmark description that questions for temporal ordering and state localization are constructed to depend on sequence information even when parts are highlighted.
  Revision: yes
- Referee: [Experiments] Experimental results section: The reported struggles of SOTA LVLMs are summarized in the abstract, but without details on dataset construction (e.g., how videos are segmented, how questions are generated to avoid linguistic shortcuts, or quantitative breakdowns by task), it is difficult to assess whether the performance gaps specifically isolate spatio-temporal deficits rather than other factors like prompt sensitivity or visual highlighting artifacts.
  Authors: We agree that expanded details on benchmark construction will improve transparency and allow readers to better evaluate the isolation of spatio-temporal capabilities. The revised manuscript now includes a dedicated subsection detailing video segmentation (assembly videos are divided into clips based on human-annotated action boundaries and state transitions), the question generation pipeline (hybrid human-AI templating with verification to eliminate linguistic shortcuts by ensuring answers require visual evidence), and per-task quantitative breakdowns with error analysis. These additions directly address concerns about prompt sensitivity and highlighting artifacts.
  Revision: yes
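The two controls discussed in this exchange (a single-frame ablation and per-task breakdowns) can be tabulated together. The following is a minimal sketch, not the authors' evaluation code: the record keys (`task`, `condition`, `predicted`, `correct`) and the condition names `full_video` and `single_frame` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: {condition: accuracy}} from a list of result records.

    Each record is a dict with hypothetical keys (assumptions, not the paper's schema):
      task      - e.g. "ordering", "localization", "mating", "tracking"
      condition - e.g. "full_video" or "single_frame"
      predicted - option letter chosen by the model
      correct   - ground-truth option letter
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for r in records:
        totals[r["task"]][r["condition"]] += 1
        if r["predicted"] == r["correct"]:
            hits[r["task"]][r["condition"]] += 1
    return {
        task: {cond: hits[task][cond] / n for cond, n in conds.items()}
        for task, conds in totals.items()
    }


def temporal_gain(table, full="full_video", ablated="single_frame"):
    """Per-task accuracy difference between full-video and ablated input.

    Tasks that lack either condition are skipped rather than guessed.
    """
    return {
        task: accs[full] - accs[ablated]
        for task, accs in table.items()
        if full in accs and ablated in accs
    }
```

A gain near zero for a task would suggest that the task can be answered from static cues alone, which is the confound the referee raises; a large positive gain would be consistent with the authors' claim that cross-frame reasoning is needed.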
Circularity Check
Empirical benchmark with no derivation chain or self-referential reductions
The paper introduces Flat-Pack Bench as a new evaluation suite for fine-grained spatio-temporal tasks in furniture assembly videos and reports direct experimental results on existing LVLMs. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; claims rest on new multiple-choice evaluations rather than reducing to inputs by construction, self-citations, or ansatzes. The work is self-contained as an empirical contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Furniture assembly videos require and can isolate fine-grained spatio-temporal reasoning distinct from coarse action recognition.
- domain assumption: Visual prompts highlighting parts provide a fair reference for evaluating model understanding without altering the core reasoning task.
Supplementary material excerpts
- We found that segmentation annotations are only provided for parts that are in the process of being connected in a particular key frame. This limits the questions to these particular key frames, and even so, to only the parts annotated therein.
- Furthermore, IMaW only includes information at the sub-assembly granularity, which precludes questions that one might want to ask about specific parts. Figure S1 shows some examples of available segmentations in IMaW to illustrate these issues. Thus, we annotate our own segmentation maps as described in Sec. 3. Which of the parts shown in the highlight...
- are visually quite distinct from Part 10. Their positioning, as if they are about to be inserted into Part 10 (not Part 7), also makes it easy to answer this question without any temporal context. Observe that even without the videos, the correct answer is easy to infer from the visual prompt and commonsense reasoning alone. Such examples motivated us t...
- A multiple-choice question (with its list of answer options) that refers to both the video and the image.
Some additional assumptions to keep in mind:
  - Two furniture parts are "connected" if they are directly attached (physically in contact), like they would be in the final assembly.
  - Parts simply touching each other physically, but not in their final a...
- Are {query part1} and {query part2} connected (physically in contact) in the shown image?
Table S3 shows the results. We observed poor performance across all settings, indicating that LVLMs struggle to understand even simpler concepts like physical contact.
Task Instructions for Collage Prompts
You are a furniture-assembly expert. You are given a vi...
- Right side: A frame from the furniture assembly video showing the assembly process.
- Left side: A labeled frame from the video on the right side, fixed for the entire video, displaying bright numeric IDs on each visible furniture part. Call this Image A.
- Center: Another labeled frame from the video on the right side, fixed for the entire video, also displaying bright numeric IDs on each visible furniture part. Call this Image B.
Both Image A and Image B remain constant throughout the video.
Note: The numeric IDs in Image A and Image B are not necessarily the same; the same ID may refer to different parts ...
- Use the right-side video frames to observe the assembly steps.
- Use the fixed left-side labeled frame to identify and relate the furniture parts mentioned in the question.
For TRACK Questions
- Use the fixed labeled frames on the left side (Image A) and in the center (Image B) to identify and relate the furniture parts mentioned in the question.
- Select the correct answer based on the video and labeled parts.
  - Respond **only** with a JSON object containing a single key, "answer", whose value is the letter of your chosen option (e.g., "A", "B", "C"). **Do not include any explanations or additional text--reply with only the JSON string.**
Now answer the following question:
Figure S4. Task Instructions ...
- Use the video frames after the first frame to observe the assembly steps.
- Use the first labeled frame to identify and relate the furniture parts mentioned in the question.
For TRACK Questions
- Use the video frames after the first two frames to observe the assembly steps.
- Use the first two labeled frames, i.e., Image A and Image B, to identify and relate the furniture parts mentioned in the question.
- Carefully read the question and all answer choices.
- Select the correct answer based on the video and labeled parts.
  - Respond **only** with a JSON object containing a single key, "answer", whose value is the letter of your chosen option (e.g., "A", "B", "C"). **Do not include any explanations or additional text--reply with only the JSON string.**
Now answer the following question:
Figure S5. Task Instructions ...
- Each question consists of:
  a. A furniture assembly video.
  b. 1-2 visual prompts or frames extracted from the video, with certain parts shaded and labeled.
  c. An MCQ question with at most 4 options.
- Examine the labeled frame(s) and understand the relationships between the highlighted parts using the video.
- Each question consists of:
  a. 1-2 images showing different stages in the assembly of a furniture item. Certain parts of the furniture will be shaded and labeled.
  b. An MCQ question about the furniture assembly process with at most 4 options.
- First, examine the labeled image(s) and understand the relationships between the highlighted parts.
- Read the question and then determine the correct option.
- Please explain this answer step-by-step
- Even if you cannot infer the answer from the question and the image alone, we request that you use your best judgment and select an option.
Figure S6. Human Evaluation Instructions. Left: Instructions for the standard task, where participants were provided the assembly video, visual prompt, and question text. Right: Instructions for the image-only task, they ...
- Zoomed-in/out cameras: Are there frames where the camera is too zoomed-in/zoomed-out such that it becomes difficult to understand the assembly video? Answer based on overall impact on your understanding, not peak moments where this occurs. Please specify this difficulty on a scale of 1-3 where:
  a. 1 means zoom-in/zoom-out effects did not affect your under...
- Out-of-frame rotations: How difficult is it to understand out-of-frame (outside the camera's view) rotations/attachments of parts in the video? Rate this on a scale of 1-3 where:
  a. 1 means that the parts go out-of-frame for <10 frames and even when they do, they remain partially visible or their motion is very unambiguous
  b. 2 means that a noticeable numb...
- Tracking or tracing the motion of moving parts: We are trying to understand the amount of motion and interactions between the parts in the video and whether it makes it difficult to understand which part was connected where. Rate the video from 1 to 3 where:
  a. 1 means that the parts are interacted with minimally, typically only just before they are a...
- Text/Visual overlay: Some of the videos you see might have text or pictures from the furniture assembly guide overlaid on the frames. Depending on how difficult the overlaid content made it for you to understand the video, either by obscuring the visuals or by providing ambiguous instructions, please assign a score from 1 to 3 where:
  a. 1 means that t...
- Initial orientation mismatch: Does the assembly start in a flipped, upside-down, or non-canonical orientation relative to standard furniture assembly guides?
  a. 0: Canonical/upright orientation
  b. 1: Non-canonical (e.g., upside-down or rotated)
Figure S8. Instructions for annotation difficulty of videos. We asked annotators to rate the videos in our benchma...
- A video of a furniture assembly in progress
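The task instructions quoted above ask the model to reply with a JSON object holding a single key, "answer", whose value is an option letter. A scorer therefore has to decide what to do with replies that deviate from that format. The following is a minimal sketch of one possible parser, not the authors' implementation: the function name `parse_answer` and the default option set A to D (suggested by the "at most 4 options" wording) are assumptions made for illustration.

```python
import json
import re


def parse_answer(reply, valid_options=("A", "B", "C", "D")):
    """Extract the option letter from a model reply, or None if malformed.

    Hypothetical helper: the strictness rules below are illustrative choices.
    """
    # Tolerate Markdown fences or stray text by taking the first {...} span.
    # The non-greedy match is enough for a flat, single-key object.
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Enforce the "single key" requirement from the instructions.
    if not isinstance(payload, dict) or set(payload) != {"answer"}:
        return None
    answer = payload["answer"]
    if not isinstance(answer, str):
        return None
    letter = answer.strip().upper()
    return letter if letter in valid_options else None
```

Returning None for malformed replies keeps format failures separate from wrong answers, so a scorer can report them as their own category or count them as incorrect, depending on the protocol.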