
arxiv: 2605.21625 · v1 · pith:DGNFU6XGnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI · cs.CL

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Pith reviewed 2026-05-22 09:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords Large Vision-Language Models · Spatio-Temporal Reasoning · Video Understanding · Furniture Assembly · Fine-Grained Evaluation · Temporal Ordering · Object Tracking · Benchmarks

The pith

State-of-the-art large vision-language models struggle with fine-grained spatio-temporal reasoning on furniture assembly videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Flat-Pack Bench, a new evaluation set built around furniture assembly videos to probe detailed step-by-step understanding. Existing benchmarks emphasize coarse actions or familiar objects that are easy to name, yet practical tasks require precise judgments about action sequence, state timing, part connections, and ongoing tracking. The authors test models with multiple-choice questions supported by visual highlights that point to specific parts. Experiments demonstrate that leading models have trouble drawing on temporal cues across frames, maintaining track of components, and recognizing physical contacts between pieces.

Core claim

Flat-Pack Bench shows that state-of-the-art LVLMs exhibit substantial limitations in fine-grained spatio-temporal reasoning, including weak use of temporal information from videos, limited tracking ability, and incomplete understanding of spatial interactions such as physical contact.

What carries the argument

Multiple-choice questions paired with visual prompts that highlight relevant parts within furniture assembly videos.
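
As a rough illustration of this setup, the Python sketch below models one multiple-choice item and scores a list of predictions against a uniform-guessing baseline. The `Question` fields and both function names are hypothetical, chosen for this example only; the paper's actual data schema and scoring code are not described on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Question:
    """One hypothetical benchmark item (field names are illustrative)."""
    task: str                # e.g. "temporal_ordering", "tracking", "mating"
    options: Sequence[str]   # answer choices shown to the model
    answer_index: int        # index of the correct option in `options`


def accuracy(questions: Sequence[Question], predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted option index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("no questions to score")
    correct = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


def chance_accuracy(questions: Sequence[Question]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not questions:
        raise ValueError("no questions to score")
    return sum(1 / len(q.options) for q in questions) / len(questions)
```

Reporting the chance level next to the raw score matters here because items may have different numbers of options, so a single fixed baseline such as 25% would not necessarily apply.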

If this is right

  • Current models have limited ability to leverage temporal information across video frames.
  • Tracking of individual parts through an assembly sequence remains unreliable.
  • Recognition of spatial interactions such as physical contact between components is weak.
  • Coarse-grained video benchmarks miss the procedural understanding needed for step-by-step tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks focused on other procedural activities such as cooking or mechanical repair could expose comparable gaps.
  • Training approaches that add explicit temporal modeling and physical contact simulation may narrow the performance shortfall.
  • The benchmark offers a direct metric for measuring whether future models can support humans through complex manual procedures.

Load-bearing premise

Multiple-choice questions paired with visual prompts on furniture assembly videos accurately isolate and measure the targeted fine-grained spatio-temporal capabilities without confounding effects from question phrasing or highlighting choices.

What would settle it

A model that scores near ceiling on the benchmark questions yet still fails to follow or guide real furniture assembly when given the same videos would indicate the questions do not fully isolate the claimed reasoning deficits.

Figures

Figures reproduced from arXiv: 2605.21625 by Aditya Chetan, Bharath Hariharan, Bharath Raj Nagoor Kani, Eric Cai, Noah Snavely, Peeyush Kushwaha, Qianqian Wang, Utkarsh Mall.

Figure 1
Figure 1: Motivation for FLAT-PACK BENCH. For AI assistants to understand an assembly process through observation, they need to be adept at fine-grained spatio-temporal reasoning about the video. We propose FLAT-PACK BENCH to evaluate Large Vision-Language Models on four such fine-grained video understanding tasks, namely Temporal Ordering, Temporal Localization, Tracking, and Mating.
Figure 2
Figure 2: Snapshot of FLAT-PACK BENCH. Each question consists of an assembly video (top row), one or two visual prompts (Images A, B), and a multiple-choice question. The corresponding visual inputs are shown within each question box. Videos are sourced from the internet and may include artifacts like overlaid text. For clarity, part labels are enlarged, as the visual prompts are shown at reduced scale.
Figure 3
Figure 3: Visual Data Ablation. We study the effect of different strategies of providing the visual prompt and video processing (a). Next, we analyze how the (b) color scheme, (c) mark type, and (d) mark size affect the LVLM’s performance on our benchmark.
Figure 4
Figure 4: Self-probing Explanations. Qualitative example from Gemini 2.5 Pro. We highlight the video with the relevant connection events for clarity. We can observe that the model looks at the video, but makes an error due to gaps in its spatio-temporal reasoning.
Figure 5
Figure 5: Temporal Video Agent. An overview of our agentic baseline. First, a Code LLM uses the API specification and the input question to generate a program. The generated program uses the assembly video and the visual prompt’s frame index and mask to produce a response for the question. We also show an example trace for a question. We can analyse the execution trace to pin-point the sources of error.
read the original abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Flat-Pack Bench, a new benchmark for assessing large vision-language models on fine-grained spatio-temporal reasoning in furniture assembly videos. Tasks include temporal ordering of assembly actions, temporal localization of states, part mating, and object tracking, evaluated via multiple-choice questions often paired with visual prompts that highlight relevant parts. Experiments on state-of-the-art LVLMs show significant struggles, which the authors attribute to limitations in leveraging temporal video information, tracking ability, and understanding physical interactions such as contact.

Significance. If the benchmark design validly isolates the claimed capabilities, the work would provide a useful new testbed for video understanding beyond coarse action recognition or captioning, targeting practical domains like assembly and instructional videos. The empirical evaluation on multiple models offers concrete evidence of current limitations, and the focus on in-the-wild complex scenarios with non-verbal entities is a strength.

major comments (2)
  1. [Abstract / Benchmark Construction] Abstract and benchmark description: The central claim that poor performance demonstrates limited temporal reasoning and tracking depends on the tasks requiring integration across video frames. However, the use of 'visual prompts highlighting relevant parts as references for fine-grained questions' risks allowing models to solve tasks (e.g., part mating or state localization) using only static spatial cues from highlighted regions in single frames, without processing temporal sequences or full video context. This potential confound must be ruled out with explicit controls, such as ablation on prompt timing or frame selection.
  2. [Experiments] Experimental results section: The reported struggles of SOTA LVLMs are summarized in the abstract, but without details on dataset construction (e.g., how videos are segmented, how questions are generated to avoid linguistic shortcuts, or quantitative breakdowns by task), it is difficult to assess whether the performance gaps specifically isolate spatio-temporal deficits rather than other factors like prompt sensitivity or visual highlighting artifacts.
minor comments (2)
  1. [Benchmark] Clarify the exact format and timing of visual prompts (e.g., whether highlights are overlaid on every frame or only key frames) to improve reproducibility.
  2. [Related Work] Add more discussion of how the benchmark avoids overlap with existing video QA datasets that use similar assembly or manipulation scenarios.
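
One way to read the control requested in major comment 1 is to answer the same question twice, once from frames spread across the whole video and once from a single frame. The Python sketch below shows only the frame-selection step, under assumptions of my own: the function names, the `mode` values, and the choice of the middle frame when one frame is requested are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional


def uniform_frame_indices(num_frames: int, k: int) -> List[int]:
    """Return up to k distinct frame indices spread evenly over the video."""
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    k = min(k, num_frames)
    if k == 1:
        return [num_frames // 2]  # assumption: use the middle frame
    # Integer arithmetic keeps the first and last frame and avoids rounding ties.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


def single_frame_control(num_frames: int, mode: str,
                         prompt_frame: Optional[int] = None,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Pick one frame for the no-temporal-context condition.

    mode="prompt": reuse the frame on which the visual prompt is drawn.
    mode="random": draw one frame uniformly at random.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if mode == "prompt":
        if prompt_frame is None or not 0 <= prompt_frame < num_frames:
            raise ValueError("prompt_frame must be a valid frame index")
        return [prompt_frame]
    if mode == "random":
        rng = rng or random.Random()
        return [rng.randrange(num_frames)]
    raise ValueError(f"unknown mode: {mode!r}")
```

If a model's accuracy barely changes between the two selections, that would suggest the question can be answered from static cues, which is the confound the comment raises.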

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional controls and details that strengthen the presentation of our benchmark.

read point-by-point responses
  1. Referee: [Abstract / Benchmark Construction] Abstract and benchmark description: The central claim that poor performance demonstrates limited temporal reasoning and tracking depends on the tasks requiring integration across video frames. However, the use of 'visual prompts highlighting relevant parts as references for fine-grained questions' risks allowing models to solve tasks (e.g., part mating or state localization) using only static spatial cues from highlighted regions in single frames, without processing temporal sequences or full video context. This potential confound must be ruled out with explicit controls, such as ablation on prompt timing or frame selection.

    Authors: We appreciate the referee's identification of this potential confound. While our visual prompts are designed to focus attention on relevant entities for fine-grained questions, we agree that explicit controls are necessary to confirm reliance on temporal integration. In the revised manuscript, we have added an ablation study evaluating model performance when provided with only single frames (or randomly selected frames) containing the same visual prompts. Results show substantial performance degradation compared to full video input, supporting that the tasks require cross-frame reasoning. We have also clarified in the benchmark description that questions for temporal ordering and state localization are constructed to depend on sequence information even when parts are highlighted.
    revision: yes

  2. Referee: [Experiments] Experimental results section: The reported struggles of SOTA LVLMs are summarized in the abstract, but without details on dataset construction (e.g., how videos are segmented, how questions are generated to avoid linguistic shortcuts, or quantitative breakdowns by task), it is difficult to assess whether the performance gaps specifically isolate spatio-temporal deficits rather than other factors like prompt sensitivity or visual highlighting artifacts.

    Authors: We agree that expanded details on benchmark construction will improve transparency and allow readers to better evaluate the isolation of spatio-temporal capabilities. The revised manuscript now includes a dedicated subsection detailing video segmentation (assembly videos are divided into clips based on human-annotated action boundaries and state transitions), the question generation pipeline (hybrid human-AI templating with verification to eliminate linguistic shortcuts by ensuring answers require visual evidence), and per-task quantitative breakdowns with error analysis. These additions directly address concerns about prompt sensitivity and highlighting artifacts.
    revision: yes
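
The per-task breakdown and the full-video versus single-frame comparison mentioned in these responses could be tabulated along the following lines. This is a minimal Python sketch under stated assumptions: the record layout, the condition labels `"full_video"` and `"single_frame"`, and the function names are hypothetical, and no numbers from the paper are reproduced.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One scored answer: (task name, input condition, whether the answer was correct).
Record = Tuple[str, str, bool]


def per_task_accuracy(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Accuracy for every (task, condition) pair found in `records`."""
    totals = defaultdict(lambda: defaultdict(int))
    hits = defaultdict(lambda: defaultdict(int))
    for task, condition, is_correct in records:
        totals[task][condition] += 1
        hits[task][condition] += int(is_correct)
    return {
        task: {cond: hits[task][cond] / n for cond, n in conds.items()}
        for task, conds in totals.items()
    }


def temporal_gain(acc: Dict[str, Dict[str, float]],
                  full: str = "full_video",
                  control: str = "single_frame") -> Dict[str, float]:
    """Per-task accuracy difference: full-video input minus the control input."""
    return {
        task: by_cond[full] - by_cond[control]
        for task, by_cond in acc.items()
        if full in by_cond and control in by_cond
    }
```

A gain near zero for a task would indicate that the model does about as well without temporal context, so the task as posed may not be isolating cross-frame reasoning.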

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

full rationale

The paper introduces Flat-Pack Bench as a new evaluation suite for fine-grained spatio-temporal tasks in furniture assembly videos and reports direct experimental results on existing LVLMs. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; claims rest on new multiple-choice evaluations rather than reducing to inputs by construction, self-citations, or ansatzes. The work is self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard assumptions about what constitutes spatio-temporal understanding and the validity of multiple-choice evaluation for video models; no free parameters or invented entities are introduced as this is an empirical evaluation suite rather than a theoretical model.

axioms (2)
  • domain assumption: Furniture assembly videos require and can isolate fine-grained spatio-temporal reasoning distinct from coarse action recognition.
    Invoked in the motivation for creating tasks around temporal ordering, state localization, part mating, and tracking.
  • domain assumption: Visual prompts highlighting parts provide a fair reference for evaluating model understanding without altering the core reasoning task.
    Stated in the description of how questions are paired with visual prompts.

pith-pipeline@v0.9.0 · 5772 in / 1413 out tokens · 21191 ms · 2026-05-22T09:17:31.049046+00:00 · methodology


Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 9 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report, ...

  3. [3]

    Yuwei Bao, Keunwoo Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alex de la Iglesia, Megan Su, Xiao Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? InFindings of the Associ- ation for Computational Linguistics: EMNLP 2023, pages 12325–12341, Singapore, 2023. Association for Computa- tional Linguistics. 2

  4. [4]

    Can multi-modal LLMs provide live step-by-step task guidance? InThe Thirty-ninth Annual Con- ference on Neural Information Processing Systems, 2025

    Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Leonid Sigal, and Roland Memisevic. Can multi-modal LLMs provide live step-by-step task guidance? InThe Thirty-ninth Annual Con- ference on Neural Information Processing Systems, 2025. 2

  5. [5]

    Activitynet: A large-scale video bench- mark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video bench- mark for human activity understanding. InProceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 1, 2

  6. [6]

    Perceptionlm: Open-access data and models for detailed visual understanding.arXiv:2504.13180, 2025

    Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Tri- antafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philip...

  7. [7]

    Gemini Thinking — Gemini API Documentation

    Google AI for Developers. Gemini Thinking — Gemini API Documentation. https://ai.google.dev/gemini- api/docs/thinking#summaries , 2025. Accessed: 2025-11-13. 7

  8. [8]

    Gemini 2.5 pro model card

    Google DeepMind. Gemini 2.5 pro model card. https: //modelcards.withgoogle.com/assets/docum ents/gemini-2.5-pro.pdf , 2025. Accessed: 2025- 11-10. 4, 7

  9. [9]

    The” something something” video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The” something something” video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...

  10. [10]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Pro- cessing Systems, 2022. 5

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 1, 4

  12. [12]

    MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark, 2024. arXiv:2311.17005 [cs]. 1, 2

  13. [13]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025

    Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?, 2025. arXiv:2503.23765 [cs]. 2

  14. [14]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2024. 4, 2

  15. [15]

    Coarse correspondences boost spatial-temporal reasoning in multimodal language model

    Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ran- jay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3783–3792, 2025. 6

  16. [16]

    IKEA manuals at work: 4d ground- ing of assembly instructions on internet videos

    Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, 9 Weiyu Liu, and Jiajun Wu. IKEA manuals at work: 4d ground- ing of assembly instructions on internet videos. InNeurIPS Datasets and Benchmarks Track, 2024. 2, 3, 8, 1

  17. [17]

    Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-Demand Spatial- Temporal Understanding at Arbitrary Resolution, 2025. arXiv:2409.12961 [cs]. 2

  18. [18]

    Egoschema: A diagnostic benchmark for very long- form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InThirty-seventh Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 2

  19. [19]

    VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19036–19046, 2025. 2

  20. [20]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. https://cdn.openai .com/gpt-5-system-card.pdf , 2025. Accessed: 2025-10-19. 1, 2, 4, 3

  21. [21]

    What to say and when to say it: Live fitness coaching as a testbed for situ- ated interaction

    Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius B¨ohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todor- ovich, Ingo Bax, and Roland Memisevic. What to say and when to say it: Live fitness coaching as a testbed for situ- ated interaction. InNeural Information Processing Systems Dataset...

  22. [22]

    Prolific

    Prolific. Prolific. https://www.prolific.com, 2025. London, UK. Version used: March 2026. Accessed: 2026-03-

  23. [23]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. InThe Thir- teenth Internat...

  24. [24]

    SAMA: Towards Multi-Turn Ref- erential Grounded Video Chat with Large Language Models

    Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, and Yu-Gang Jiang. SAMA: Towards Multi-Turn Ref- erential Grounded Video Chat with Large Language Models. arXiv preprint arXiv:2505.18812, 2025. 2

  25. [25]

    Vipergpt: Visual inference via python execution for reasoning.Proceed- ings of IEEE International Conference on Computer Vision (ICCV), 2023

    D´ıdac Sur´ıs, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning.Proceed- ings of IEEE International Conference on Computer Vision (ICCV), 2023. 2, 5, 7

  26. [26]

    Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025

    Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, and Kai Chen. Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025. 2

  27. [27]

    Human-centric spatio- temporal video grounding with visual transformers.IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021

    Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio- temporal video grounding with visual transformers.IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021. 2

  28. [28]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team et al. Gemini: A Family of Highly Capable Multimodal Models, 2025. arXiv:2312.11805 [cs]. 1, 4, 2

  29. [29]

    Demo2code: From sum- marizing demonstrations to synthesizing code via extended chain-of-thought

    Huaxiaoyue Wang, Gonzalo Gonzalez-Pumariega, Yash Sharma, and Sanjiban Choudhury. Demo2code: From sum- marizing demonstrations to synthesizing code via extended chain-of-thought. InThirty-seventh Conference on Neural Information Processing Systems, 2023. 5

  30. [30]

    Holoassist: an egocen- tric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bu- gra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocen- tric human interaction dataset for interactive ai assistants in the real world. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023. 2

  31. [31]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. 5, 7

  32. [32]

    Compositional 4d dynamic scenes understanding with physics priors for video question answering

    Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, and Alan Yuille. Compositional 4d dynamic scenes understanding with physics priors for video question answering. InInternational Conference on Learning Repre- sentations (ICLR), 2025. 1, 2

  33. [33]

    LongVideoBench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. InThe Thirty-eight Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 2

  34. [34]

    NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021. 1

  35. [35]

    Seeing the arrow of time in large multimodal models

    Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. InNeurIPS, 2026. 2, 4, 5

  36. [36]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3

  37. [37]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multi- modal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024. 2, 4, 7

  38. [38]

    Generative frame sampler for long video understanding

    Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caim- ing Xiong, Bei Chen, Xu Sun, and Junnan Li. Generative frame sampler for long video understanding. InFindings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 2025. 4, 2

  39. [39]

    VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM, 2025

    Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Bo- qiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, and Lidong Bing. VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM, 2025. arXiv:2501.00599 [cs]. 1, 2, 4, 5

  40. [40]

    The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?, 2025

    Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, and Yong Li. The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?, 2025. arXiv:2504.04540 [cs]. 2 10

  41. [41]

    Llava- next: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model. http s://llava-vl.github.io/blog/2024-04-30- llava-next-video/ , 2024. Accessed March 15,2026. 1, 4

  42. [42]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaV A-Video: Video Instruction Tuning With Synthetic Data, 2025. arXiv:2410.02713 [cs]. 1, 4, 3

  43. [43]

    Where does it exist: Spatio-temporal video grounding for multi-form sentences

    Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020. 2

  44. [44]

    Contphy: Continuum physical concept learning and reason- ing from videos

    Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B Tenenbaum, and Chuang Gan. Contphy: Continuum physical concept learning and reason- ing from videos. InInternational Conference on Machine Learning. PMLR, 2024. 1, 2

  45. [45]

    UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Ur- ban Scenarios, 2025

    Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, and Weijia Li. UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Ur- ban Scenarios, 2025. arXiv:2408.17267 [cs]. 2

  46. [46]

    Ryoo, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles

    Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data, 2025. arXiv:2509.03501 [cs]. 1, 2

  47. [47]

    VLM4D: To- wards Spatiotemporal Awareness in Vision Language Models

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. VLM4D: To- wards Spatiotemporal Awareness in Vision Language Models. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 2

  48. [48]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Con- ghui He, Botian Shi, Xingchen...

  49. [49]

    Apollo: An Exploration of Video Understanding in Large Multimodal Models, 2024

    Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia. Apollo: An Exploration of Video Understanding in Large Multimodal Models, 2024. arXiv:2412.10360 [cs]. 2 11 FLAT-PACKBENCH: Evaluating Spatio-Temporal Understanding in Large Vision...

  50. [50]

    This limits the questions to these particular key frames, and even so, to only the parts annotated therein

    We found that segmentation annotations are only provided for parts that are in the process of being connected in a particular key frame. This limits the questions to these particular key frames, and even so, to only the parts annotated therein

  51. [51]

    Figure S1 shows some examples of available segmentations in IMaW to illustrate these issues

    Furthermore, IMaW only includes information at the sub-assembly granularity, which precludes questions that one might want to ask about specific parts. Figure S1 shows some examples of available segmentations in IMaW to illustrate these issues. Thus, we annotate our own segmentation maps as described in Sec. 3. Which of the parts shown in the highlight...

  52. [52]

    Their positioning, as if they are about to be inserted into Part 10 (not Part 7), also makes it easy to answer this question without any temporal context

    are visually quite distinct from Part 10. Their positioning, as if they are about to be inserted into Part 10 (not Part 7), also makes it easy to answer this question without any temporal context. Observe that even without the videos, the correct answer is easy to infer from the visual prompt and commonsense reasoning alone. Such examples motivated us t...

  53. [56]

    connected

    A multiple-choice question (with its list of answer options) that refers to both the video and the image. Some additional assumptions to keep in mind: – Two furniture parts are "connected" if they are directly attached (physically in contact), like they would be in the final assembly. – Parts simply touching each other physically, but not in their final a...

  54. [57]

    connected

    In this task, "connected" means the two parts are in direct physical contact, in the same way they will be when the furniture is fully assembled (not merely near each other or partially aligned). Based on the given image, are {query part1} and {query part2} connected?

  55. [58]

    We observed poor performance across all settings, indicating that LVLMs struggle to understand even simpler concepts like physical contact

    Are {query part1} and {query part2} connected (physically in contact) in the shown image? Table S3 shows the results. We observed poor performance across all settings, indicating that LVLMs struggle to understand even simpler concepts like physical contact. 4 Task Instructions for Collage Prompts You are a furniture-assembly expert. You are given a vi...

  56. [59]

    Right side: A frame from the furniture assembly video showing the assembly process

  57. [60]

    Call this Image A

    Left side: A labeled frame from the video on the right side, fixed for the entire video, displaying bright numeric IDs on each visible furniture part. Call this Image A

  58. [61]

    Call this Image B

    Center: Another labeled frame from the video on the right side, fixed for the entire video, also displaying bright numeric IDs on each visible furniture part. Call this Image B. Both Image A and Image B remain constant throughout the video. Note: The numeric IDs in Image A and Image B are not necessarily the same; the same ID may refer to different parts ...

  59. [62]

    Use the right-side video frames to observe the assembly steps

  60. [63]

    For TRACK Questions

    Use the fixed left-side labeled frame to identify and relate the furniture parts mentioned in the question. For TRACK Questions

  61. [64]

    Use the fixed labeled frames on the left side (Image A) and in the center (Image B) to identify and relate the furniture parts mentioned in the question

  62. [66]

    answer", whose value is the letter of your chosen option (e.g.,

    Select the correct answer based on the video and labeled parts. – Respond **only** with a JSON object containing a single key, "answer", whose value is the letter of your chosen option (e.g., "A", "B", "C"). **Do not include any explanations or additional text--reply with only the JSON string.** Now answer the following question: Figure S4. Task Instructions ...

  63. [67]

    Use the video frames after the first frame to observe the assembly steps

  64. [68]

    For TRACK Questions

    Use the first labeled frame to identify and relate the furniture parts mentioned in the question. For TRACK Questions

  65. [69]

    Use the video frames after the first two frames to observe the assembly steps

  66. [70]

    Use the first two labeled frames, i.e., Image A and Image B, to identify and relate the furniture parts mentioned in the question

  67. [71]

    Carefully read the question and all answer choices

  68. [72]

    answer", whose value is the letter of your chosen option (e.g.,

    Select the correct answer based on the video and labeled parts. – Respond **only** with a JSON object containing a single key, "answer", whose value is the letter of your chosen option (e.g., "A", "B", "C"). **Do not include any explanations or additional text--reply with only the JSON string.** Now answer the following question: Figure S5. Task Instructions ...

  69. [73]

    A furniture assembly video

    Each question consists of: a. A furniture assembly video. b. 1-2 visual prompts or frames extracted from the video, with certain parts shaded and labeled. c. An MCQ question with at most 4 options

  70. [74]

    Examine the labeled frame(s) and understand the relationships between the highlighted parts using the video

  71. [76]

    1-2 images showing different stages in the assembly of a furniture item

    Each question consists of: a. 1-2 images showing different stages in the assembly of a furniture item. Certain parts of the furniture will be shaded and labeled. b. An MCQ question about the furniture assembly process with at most 4 options

  72. [77]

    First, examine the labeled image(s) and understand the relationships between the highlighted parts

  73. [78]

    Read the question and then determine the correct option

  74. [79]

    Please explain this answer step-by-step

    Even if you cannot infer the answer from the question and the image alone, we request that you use your best judgment and select an option. Figure S6. Human Evaluation Instructions. Left: Instructions for the standard task, where participants were provided the assembly video, visual prompt, and question text. Right: Instructions for the image-only task, they ...

  75. [80]

    Please specify this difficulty on a scale of 1-3 where: a

    Zoomed-in/out cameras: Are there frames where the camera is too zoomed-in/zoomed-out such that it becomes difficult to understand the assembly video? Answer based on overall impact on your understanding, not peak moments where this occurs. Please specify this difficulty on a scale of 1-3 where: a. 1 means zoom-in/zoom-out effects did not affect your under...

  76. [81]

    1 means that the parts go out-of-frame for <10 frames and even when they do, they remain partially visible or their motion is very unambiguous b

    Out-of-frame rotations: How difficult is it to understand out-of-frame (outside the camera’s view) rotations/attachments of parts in the video? Rate this on a scale of 1-3 where: a. 1 means that the parts go out-of-frame for <10 frames and even when they do, they remain partially visible or their motion is very unambiguous b. 2 means that a noticeable numb...

  77. [82]

    Rate the video from 1 to 3 where: a

    Tracking or tracing the motion of moving parts: We are trying to understand the amount of motion and interactions between the parts in the video and whether it makes it difficult to understand which part was connected where. Rate the video from 1 to 3 where: a. 1 means that the parts are interacted with minimally, typically only just before they are a...

  78. [83]

    Text/Visual overlay: Some of the videos you see might have text or pictures from the furniture assembly guide overlaid on the frames. Depending on how difficult did the overlaid content make it for you to understand the video either due to obscuring the visuals or providing ambiguous instructions, please assign a score from 1 to 3 where: a. 1 means that t...

  79. [84]

    0: Canonical/upright orientation b

    Initial orientation mismatch: Does the assembly start in a flipped, upside-down, or non-canonical orientation relative to standard furniture assembly guides? a. 0: Canonical/upright orientation b. 1: Non-canonical (e.g., upside-down or rotated) Figure S8. Instructions for annotation difficulty of videos. We asked annotators to rate the videos in our benchmar...

  80. [85]

    A video of a furniture assembly in progress

Showing first 80 references.
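
The task instructions quoted in the fragments above ask the model to reply only with a JSON object containing a single key, "answer", whose value is an option letter. A minimal sketch of how such a reply might be validated is given below. It is not the authors' evaluation code: the function name `parse_answer`, the option set `VALID_OPTIONS` (taken from the "at most 4 options" wording in the human-evaluation instructions), and the tolerance for surrounding text such as a code fence are all assumptions made for illustration.

```python
import json
import re
from typing import Optional

# Assumption: questions have at most four options, lettered A-D.
VALID_OPTIONS = ("A", "B", "C", "D")


def parse_answer(raw: str) -> Optional[str]:
    """Return the chosen option letter, or None if the reply is malformed."""
    # Take the span from the first "{" to the last "}" so that text wrapped
    # around the JSON object (for example a code fence) is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    answer = payload.get("answer")
    if not isinstance(answer, str):
        return None
    # Normalise case and whitespace before checking against the option set.
    answer = answer.strip().upper()
    return answer if answer in VALID_OPTIONS else None
```

Returning `None` for malformed replies, instead of guessing a letter from free text, keeps formatting failures separate from wrong answers when accuracy is computed.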