
arxiv: 2605.22570 · v1 · pith:4DUTBFXMnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Pith reviewed 2026-05-22 07:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatio-temporal reasoning · multimodal large language models · video benchmark · generative video synthesis · active data generation · hierarchical task evaluation · computer vision evaluation

The pith

VGenST-Bench actively synthesizes controlled videos to diagnose fine-grained spatio-temporal reasoning in multimodal language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new video benchmark that relies on generative models to create evaluation scenarios instead of collecting existing footage. This approach allows precise control over spatial scales, perspectives, and scene dynamics through a 3x2x2 taxonomy. A multi-agent pipeline with human oversight generates both the videos and paired question-answer sets. The benchmark separates low-level visual perception tasks from higher-level reasoning tasks in a hierarchical suite. The central goal is to expose specific weaknesses in how current MLLMs handle space and time that passive datasets obscure.
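
To make the 3x2x2 taxonomy concrete, the sketch below enumerates its 12 cells and pairs each with a task identifier. The axis values and the task identifiers are taken from the figure captions reproduced on this page; the `Cell` class, the `suffix` rule, and `build_taxonomy` are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from itertools import product

# Axis values as named in the Figure 3 caption.
SPATIAL_SCALES = ("Figural", "Vista", "Environmental")
PERSPECTIVES = ("Egocentric", "Exocentric")
SCENE_DYNAMICS = ("Static", "Dynamic")

# Task identifiers as they appear in the qualitative-example captions.
TASK_IDS = (
    "MC_F_EGO_STA", "QC_F_EGO_DYN", "CI_F_EXO_STA", "CM_F_EXO_DYN",
    "DE_V_EGO_STA", "IO_V_EGO_DYN", "HO_V_EXO_STA", "VI_V_EXO_DYN",
    "DS_E_EGO_STA", "RV_E_EGO_DYN", "LS_E_EXO_STA", "BT_E_EXO_DYN",
)


@dataclass(frozen=True)
class Cell:
    """One cell of the taxonomy (hypothetical helper type)."""
    scale: str
    perspective: str
    dynamics: str

    def suffix(self) -> str:
        # Assumed naming rule, e.g. Figural/Egocentric/Static -> "F_EGO_STA".
        return (f"{self.scale[0]}_{self.perspective[:3].upper()}"
                f"_{self.dynamics[:3].upper()}")


def build_taxonomy() -> dict:
    """Map each of the 12 cells to the task ID whose suffix matches it."""
    cells = [Cell(*axes)
             for axes in product(SPATIAL_SCALES, PERSPECTIVES, SCENE_DYNAMICS)]
    by_suffix = {task_id.split("_", 1)[1]: task_id for task_id in TASK_IDS}
    return {cell: by_suffix[cell.suffix()] for cell in cells}


taxonomy = build_taxonomy()
```

Looking up `taxonomy[Cell("Vista", "Egocentric", "Static")]` returns `"DE_V_EGO_STA"`, the Direction Estimation task shown in the qualitative examples.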

Core claim

By replacing passive curation of real videos with active synthesis from generative models, VGenST-Bench produces videos whose spatio-temporal properties are known and adjustable in advance. The resulting dataset and task hierarchy let researchers isolate whether an MLLM fails at basic perception, at integrating motion across frames, or at higher-order spatial-temporal inference. This controlled construction directly supports fine-grained diagnosis of model capabilities.

What carries the argument

The multi-agent pipeline that combines generative video models with human quality control to produce videos and QA pairs under an explicit 3x2x2 taxonomy of spatial scale, perspective, and scene dynamics.
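
The figure captions describe generator and validator prompts in which a validator emits rejection feedback that is passed back to the generator. A minimal sketch of that generate, validate, retry pattern follows. Every name here (`generate_with_validation`, the toy generator and validator, the retry cap of 3) is a hypothetical stand-in; the real agents are LLM-backed and the human quality-control stage is not modeled.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a generator takes the upstream input plus optional
# validator feedback; a validator returns (accepted, feedback_message).
Generator = Callable[[str, Optional[str]], str]
Validator = Callable[[str], Tuple[bool, str]]


def generate_with_validation(upstream: str, generate: Generator,
                             validate: Validator, max_attempts: int = 3) -> str:
    """Retry generation, feeding each rejection message into the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(upstream, feedback)
        accepted, message = validate(candidate)
        if accepted:
            return candidate
        feedback = message
    raise RuntimeError(
        f"no valid candidate after {max_attempts} attempts: {feedback}")


# Toy stand-ins for an LLM-backed scene-graph generator and validator.
def toy_generator(theme: str, feedback: Optional[str]) -> str:
    graph = f"theme={theme}"
    if feedback is not None:  # react to the validator's rejection
        graph += ";objects=obj_anchor"
    return graph


def toy_validator(candidate: str) -> Tuple[bool, str]:
    if "objects=" in candidate:
        return True, ""
    return False, "scene graph is missing an objects entry"


scene_graph = generate_with_validation(
    "Law Firm Office", toy_generator, toy_validator)
```

In this toy run the first candidate is rejected, the feedback string is handed back, and the second candidate is accepted. The same wrapper could in principle sit around each stage (scene graph, scenario, prompts, QA), though how the authors chain the stages is not specified on this page.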

If this is right

  • Existing MLLMs can be tested on decoupled perception versus reasoning subtasks to locate exact failure modes.
  • New models can be trained or fine-tuned against the controlled variations in spatial scale, viewpoint, and dynamics.
  • Benchmark scores become comparable across models because every video property is known and documented.
  • The taxonomy supports systematic expansion by adding new dimensions while keeping the synthesis pipeline fixed.
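
The perception-versus-reasoning split in the first bullet can be read off per-level accuracy. The sketch below computes accuracy for the three question levels and the L1−L3 gap that the Figure 5 caption denotes by ∆. The function names and the toy records are assumptions for illustration; the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """records: iterable of (level, is_correct) pairs, level in L1/L2/L3."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in totals}


def l1_l3_gap(acc):
    """Drop in accuracy from the perception level (L1) to the top level (L3)."""
    return acc["L1"] - acc["L3"]


# Hypothetical answers from one model on 12 questions (4 per level).
records = [
    ("L1", True), ("L1", True), ("L1", True), ("L1", False),
    ("L2", True), ("L2", True), ("L2", False), ("L2", False),
    ("L3", True), ("L3", False), ("L3", False), ("L3", False),
]
acc = accuracy_by_level(records)
gap = l1_l3_gap(acc)
```

A large gap with high L1 accuracy would point to a reasoning deficit, while low L1 accuracy would suggest the failure starts at perception.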

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthesis method scales reliably, future benchmarks in other domains such as causal or social reasoning could adopt active generation instead of scraping existing media.
  • The separation of perception and reasoning tasks suggests a template for auditing other multimodal capabilities where low-level feature extraction might mask higher-level deficits.

Load-bearing premise

The videos generated by the pipeline match real-world spatio-temporal properties closely enough that any model errors can be attributed to reasoning deficits rather than artifacts of the synthesis process.

What would settle it

Run the same MLLM suite on VGenST-Bench videos and on matched real-world videos that contain the same spatio-temporal events. If error patterns differ systematically between the two sets, synthesis artifacts, rather than reasoning deficits, are likely driving at least part of the benchmark results.
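
One way such a comparison might be scored is a per-task two-proportion z-test on accuracy. This is a sketch under stated assumptions: the matched real-world set is hypothetical, the counts below are invented, and the choice of a pooled z statistic with a 1.96 threshold is this page's illustration rather than anything the paper proposes.

```python
import math


def two_proportion_z(correct_a, total_a, correct_b, total_b):
    """Pooled two-proportion z statistic for accuracy A versus accuracy B."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return 0.0 if std_err == 0 else (p_a - p_b) / std_err


def flag_divergent_tasks(synthetic, real, z_threshold=1.96):
    """Return tasks whose synthetic-vs-real accuracy gap exceeds the threshold.

    Both arguments map task ID -> (num_correct, num_questions).
    """
    flagged = {}
    for task_id in sorted(synthetic.keys() & real.keys()):
        z = two_proportion_z(*synthetic[task_id], *real[task_id])
        if abs(z) > z_threshold:
            flagged[task_id] = round(z, 2)
    return flagged


# Hypothetical counts for one model; these are NOT results from the paper.
synthetic_counts = {"DE_V_EGO_STA": (60, 100), "HO_V_EXO_STA": (88, 100)}
real_counts = {"DE_V_EGO_STA": (90, 100), "HO_V_EXO_STA": (90, 100)}
divergent = flag_divergent_tasks(synthetic_counts, real_counts)
```

With these invented counts only `DE_V_EGO_STA` is flagged. A real analysis over all 12 tasks would also need a multiple-comparison correction and a check that the real and synthetic events are truly matched.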

Figures

Figures reproduced from arXiv: 2605.22570 by Eunbyung Park, Hogun Park, Jinho Park, Youbin Kim.

Figure 1. Examples of VGenST-Bench. Each example contains a generated video and a multiple-choice question targeting a specific spatio-temporal reasoning. Correct answers are highlighted.
Figure 2. Overview of VGenST-Bench. A) Dataset generation. Given input video themes, our multi-agent pipeline jointly synthesizes videos paired with scene graphs, scenarios, and QA sets. B) Task & level design. Videos are organized along a 3 × 2 × 2 taxonomy over Spatial scale, Perspective, and Scene dynamics, with one spatio-temporal task assigned per cell. QA pairs follow a three-level hierarchy: (L1) Visual perce…
Figure 3. Representative videos for the 12 tasks of VGenST-Bench. Each cell of the 3 × 2 × 2 taxonomy (Spatial scale × Perspective × Scene dynamics) is paired with one dedicated reasoning task. Rows correspond to spatial scales (Figural / Vista / Environmental); columns are grouped by perspective (Egocentric / Exocentric) and scene dynamics (Static / Dynamic). Each strip shows four sampled frames from a representati…
Figure 4. VGenST-Bench construction pipeline. Starting from a theme, four agents operate in sequence. The Scene Graph Agent produces a structured scene graph specifying objects and spatial composition; the Scenario Agent expands it into a temporally grounded scenario with reasoning goal and timeline; the Video Agent synthesizes the corresponding image and video through generative models; and the QA Agent generates b…
Figure 5. Hierarchical Analysis: Accuracy across the three question levels. (a) All models degrade consistently from L1 to L3, while humans remain near-ceiling. (b) Breakdown by model, with the L1−L3 gap (∆).
… evaluation. As expected, human annotators remain a clear upper bound, achieving 99.0% on average and near-saturation across all twelve tasks. Even the strongest evaluated MLLM, Gemini 3 Flash, achieves only 85.…
Figure 6. Robustness Analysis. (a) None-of-these variants show a clear asymmetry: V1 maintains base accuracy, while V2 produces dramatic drops across all models. (b) Open-ended evaluation by question level reveals large drop on L3 for all models. Together, these reveal that closed-form MCQ accuracy may overestimate spatio-temporal reasoning capability.
Figure 7. Reasoning failure of Direction Estimation task. The model’s reasoning trace correctly identifies the initial orientation, the leftward camera turn, and the final view, but inverts the resulting egocentric direction at the final step, concluding with the wrong answer.
… results show that current MLLMs perform multiple-choice reasoning by ranking the given options against each other rather than verifying the c…
Figure 8. Word cloud of the 1,000 themes in VGenST-Bench.
To maximize visual diversity, VGenST-Bench draws scenarios from a curated pool of themes that specify the visual and semantic context of each video. For each of the tasks in our taxonomy, we manually identified 10 theme categories that are semantically compatible with the task’s required scene properties, spanning everyday, industrial, sci-fi, and fantasy …
Figure 9. Construction Pipeline of VGenST-Bench. (i) Task Selector examines the input theme and determines which of the 12 tasks in our taxonomy is most appropriate for that theme. The selector returns a single task assignment (e.g., MC_F_EGO_STA). When constructing VGenST-Bench, we used a curated set of predefined themes (Appendix C.7). Task Selector is therefore bypassed during benchmark construction and is intend…
Figure 10. Task Selector: system prompt. Samples a (theme, task) pair from the curated theme pool of the target task (Appendix C.7).
Figure 11. Scene Graph Generator: system prompt template. Shared across all 12 tasks; the per-task scene-graph schema (required objects, attributes, relations) is injected into the template at runtime.
Figure 12. Scene Graph Validator: system prompt (part 1/2). Verifies schema compliance and emits a structured rejection feedback string when the candidate scene graph fails any required check.
Figure 13. Scene Graph Validator: system prompt (part 2/2).
Figure 14. Scene Graph Validator: user prompt. Carries the candidate scene graph and the task-specific schema for validation.
Figure 15.
Figure 16. Scenario Generator: system prompt (part 2/2).
Figure 17. Scenario Generator: user prompt. Carries the scene graph, the task definition, the task rules and guidelines, the reference few-shot examples, and any prior validator feedback.
Figure 18. Scenario Validator: system prompt (part 1/2). Checks that the candidate timeline is sufficient to derive the ground-truth answer and contains no contradictions with the underlying scene graph.
Figure 19. Scenario Validator: system prompt (part 2/2).
Figure 20. Scenario Validator: user prompt. Carries the candidate scenario and the scene graph for cross-checking.
Figure 21. Image Prompt Translator: system prompt (part 1/2). Produces the first-frame prompt that the text-to-image generator turns into an anchor frame for downstream video synthesis.
Figure 22. Image Prompt Translator: system prompt (part 2/2).
Figure 23. Image Prompt Translator: user prompt. Carries the scene graph and the scenario’s initial state.
Figure 24. Video Prompt Translator: system prompt (part 1/3). Composes a video prompt that conditions the image-to-video generator on the anchor frame, the scenario’s timeline, and the camera setup.
Figure 25. Video Prompt Translator: system prompt (part 2/3).
Figure 26. Video Prompt Translator: system prompt (part 3/3).
Figure 27. Video Prompt Translator: user prompt. Carries the scenario, the anchor-frame description, and the camera trajectory.
Figure 28. QA Generator: system prompt. Generates a base MCQ conditioned on the scene graph, the scenario, and the cell-specific QA template, with distractors drawn from the task’s distractor pool.
Figure 29. QA Generator: user prompt. Carries the scene graph, the scenario, the QA template, and the distractor pool entries.
Figure 30. Reformatter: system prompt (part 1/2). Expands a base MCQ into the three reformulation variants: V1 (None-of-these distractor), V2 (None-of-these answer), and V3 (open-ended).
Figure 31. Reformatter: system prompt (part 2/2).
Figure 32. Reformatter: user prompt. Carries the base MCQ and the target variant identifier.
H Qualitative Examples
This section provides per-task qualitative examples of VGenST-Bench. For each of the 12 tasks, we sample one representative video (random sample idx) and render four cards: 8-frames of video, underlying scene graph (verbatim JSON), scenario (verbatim JSON), and a representative QA pairs containing one…
Figure 33. Frames for MC_F_EGO_STA (Multi-Container Attribute Mapping), idx 81 (Tennis Player’s Courtside Bench).
Figure 34.
Figure 35. Scene graph (part 2/2) for MC_F_EGO_STA, idx 81.
Figure 36. Scenario for MC_F_EGO_STA, idx 81.
Figure 37. Sample QA pairs (one per cognitive level) for MC_F_EGO_STA, idx 81.
Figure 38. Frames for QC_F_EGO_DYN (Quantity Change Tracking), idx 14 (Retail Checkout Counter).
Figure 39.
Figure 40. Scene graph (part 2/2) for QC_F_EGO_DYN, idx 14.
Figure 41. Scenario for QC_F_EGO_DYN, idx 14.
Figure 42. Sample QA pairs for QC_F_EGO_DYN, idx 14.
Figure 43. Frames for CI_F_EXO_STA (Container Intersection Inference), idx 3 (Bathroom Vanity Counter).
Figure 44. Scene graph (part 1/3) for CI_F_EXO_STA, idx 3.
Figure 45. Scene graph (part 2/3) for CI_F_EXO_STA, idx 3.
Figure 46. Scene graph (part 3/3) for CI_F_EXO_STA, idx 3.
Figure 47. Scenario for CI_F_EXO_STA, idx 3.
Figure 48. Sample QA pairs for CI_F_EXO_STA, idx 3.
Figure 49. Frames for CM_F_EXO_DYN (Causal Mapping), idx 94 (Music Producer’s Synthesizer Stand).
Figure 50. Scene graph (part 1/2) for CM_F_EXO_DYN, idx 94.
Figure 51. Scene graph (part 2/2) for CM_F_EXO_DYN, idx 94.
Figure 52. Scenario for CM_F_EXO_DYN, idx 94.
Figure 53. Sample QA pairs for CM_F_EXO_DYN, idx 94.
Figure 54. Frames for DE_V_EGO_STA (Direction Estimation), idx 35 (Comedy Club Backstage L-Hallway).
Figure 55. Scene graph for DE_V_EGO_STA, idx 35.
Figure 56. Scenario for DE_V_EGO_STA, idx 35.
Figure 57. Sample QA pairs for DE_V_EGO_STA, idx 35.
Figure 58. Frames for IO_V_EGO_DYN (Interacted Object Identification), idx 31 (Farmhouse Kitchen with Prep Table and Hutch).
Figure 59. Scene graph (part 1/2) for IO_V_EGO_DYN, idx 31.
Figure 60. Scene graph (part 2/2) for IO_V_EGO_DYN, idx 31.
Figure 61. Scenario for IO_V_EGO_DYN, idx 31.
Figure 62. Sample QA pairs for IO_V_EGO_DYN, idx 31.
Figure 63. Frames for HO_V_EXO_STA (Height Ordering), idx 28 (Law Firm Office).
Figure 64.
Figure 65. Scene graph (part 2/2) for HO_V_EXO_STA, idx 28.
Figure 66. Scenario for HO_V_EXO_STA, idx 28.
Figure 67. Sample QA pairs for HO_V_EXO_STA, idx 28.
Figure 68. Frames for VI_V_EXO_DYN (Visibility Identification), idx 17 (Living Room with Tall Wooden Bookshelf).
Figure 69.
Figure 70. Scene graph (part 2/2) for VI_V_EXO_DYN, idx 17.
Figure 71. Scenario for VI_V_EXO_DYN, idx 17.
Figure 72. Sample QA pairs for VI_V_EXO_DYN, idx 17.
Figure 73. Frames for DS_E_EGO_STA (Directional Signage Grounding), idx 94 (Medieval Castle Dungeon Network).
Figure 74. Scene graph for DS_E_EGO_STA, idx 94.
Figure 75. Scenario for DS_E_EGO_STA, idx 94.
Figure 76. Sample QA pairs for DS_E_EGO_STA, idx 94.
Figure 77. Frames for RV_E_EGO_DYN (Relative Velocity Identification), idx 13 (Go-Kart Circuit).
Figure 78.
Figure 79. Scene graph (part 2/2) for RV_E_EGO_DYN, idx 13.
Figure 80. Scenario for RV_E_EGO_DYN, idx 13.
Figure 81. Sample QA pairs for RV_E_EGO_DYN, idx 13.
Figure 82. Frames for LS_E_EXO_STA (Landmark Spatial Composition), idx 86 (Polar Research Base).
Figure 83. Scene graph (part 1/2) for LS_E_EXO_STA, idx 86.
Figure 84. Scene graph (part 2/2) for LS_E_EXO_STA, idx 86.
Figure 85. Scenario for LS_E_EXO_STA, idx 86.
Figure 86. Sample QA pairs for LS_E_EXO_STA, idx 86.
Figure 87. Frames for BT_E_EXO_DYN (Behavioral Trigger Identification), idx 94 (Wasteland Highway).
Figure 88.
Figure 89. Scene graph (part 2/2) for BT_E_EXO_DYN, idx 94.
Figure 90. Scenario for BT_E_EXO_DYN, idx 94.
Figure 91. Sample QA pairs for BT_E_EXO_DYN, idx 94.
Original abstract

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces VGenST-Bench, a video benchmark for spatio-temporal reasoning in MLLMs that shifts from passive curation to active synthesis via generative models and a multi-agent pipeline with human QC. It defines a 3x2x2 taxonomy (Spatial Scale, Perspective, Scene Dynamics) and a hierarchical task suite decoupling low-level perception from high-level reasoning, claiming this enables fine-grained diagnosis of model capabilities.

Significance. If the synthesized videos prove free of systematic artifacts that confound reasoning evaluation, the active-synthesis paradigm could meaningfully improve diagnostic granularity over existing passive video benchmarks. The taxonomy and task hierarchy are well-motivated design choices that directly target the stated limitations of prior datasets.

major comments (1)
  1. [Abstract and §3] Abstract and construction pipeline (described in §3): the claim that VGenST-Bench supports fine-grained diagnosis presupposes that generative artifacts do not systematically bias performance across taxonomy cells. No quantitative checks—optical-flow statistics, depth-consistency metrics, motion-continuity scores, or side-by-side realism ratings versus real videos—are reported to verify that the synthesized distribution matches real-world spatio-temporal statistics closely enough for the diagnostic claim to hold.
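One way the per-cell check requested above could be set up is sketched below. It assumes that a single scalar statistic per video (for example, a mean optical-flow magnitude) has already been computed for both synthesized and real videos in each taxonomy cell. The two-sample Kolmogorov-Smirnov statistic is an illustrative choice made here, not a method named by the referee or the paper, and the cell names, threshold, and numbers are toy values.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def flag_cells(stats_by_cell, threshold=0.5):
    """Return cells whose synthetic-vs-real KS statistic exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    flagged = {}
    for cell, (synthetic, real) in stats_by_cell.items():
        d = ks_statistic(synthetic, real)
        if d > threshold:
            flagged[cell] = d
    return flagged


# Toy numbers for illustration only; these are not data from the paper.
example = {
    "cell_0": ([1.0, 1.1, 0.9, 1.2], [1.0, 1.05, 0.95, 1.15]),
    "cell_1": ([3.0, 3.2, 3.1, 2.9], [1.0, 1.1, 0.9, 1.2]),
}
flagged = flag_cells(example)
print(sorted(flagged))
```

A cell flagged this way would indicate that the chosen statistic is distributed differently in synthesized and real videos for that cell, which is the kind of systematic bias the comment asks the authors to rule out. In practice a significance test with multiple-comparison correction across the 12 cells would likely be preferable to a fixed threshold.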
minor comments (2)
  1. [§3] The manuscript should clarify the exact generative models and prompting strategies used in the multi-agent pipeline, including any failure modes observed during video synthesis.
  2. [Figures 2–4] Figure captions and taxonomy diagrams would benefit from explicit mapping to the hierarchical task levels to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help clarify the evidentiary requirements for our diagnostic claims. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and construction pipeline (described in §3): the claim that VGenST-Bench supports fine-grained diagnosis presupposes that generative artifacts do not systematically bias performance across taxonomy cells. No quantitative checks—optical-flow statistics, depth-consistency metrics, motion-continuity scores, or side-by-side realism ratings versus real videos—are reported to verify that the synthesized distribution matches real-world spatio-temporal statistics closely enough for the diagnostic claim to hold.

    Authors: We agree that the current manuscript lacks the quantitative distributional checks the referee identifies. The multi-agent pipeline and human QC stage are designed to minimize obvious artifacts, but these do not substitute for explicit statistical validation. In the revised manuscript we will add to §3 the suggested analyses: optical-flow statistics, depth-consistency metrics, motion-continuity scores, and side-by-side human realism ratings comparing synthesized videos to real-world counterparts. These additions will directly support the fine-grained diagnosis claim by demonstrating that the synthesized distribution does not systematically deviate from real spatio-temporal statistics across taxonomy cells. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

full rationale

The paper presents a descriptive construction of VGenST-Bench via a multi-agent generative pipeline, 3x2x2 taxonomy, and hierarchical task suite without equations, fitted parameters, predictions, or derivations. The central claim that active synthesis enables fine-grained diagnosis follows directly from the stated design choices (controlled scenarios, decoupling perception from reasoning, human QC) and does not reduce to any prior inputs or self-citations by construction. No load-bearing step equates to its own inputs; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative synthesis can produce sufficiently faithful spatio-temporal scenarios; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Generative models combined with multi-agent pipelines and human review can create videos whose spatio-temporal properties are suitable for fine-grained reasoning evaluation.
    Invoked when the paper claims that active synthesis enables precise diagnosis of MLLM capabilities.

pith-pipeline@v0.9.0 · 5721 in / 1277 out tokens · 51367 ms · 2026-05-22T07:08:44.791877+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios... multi-agent pipeline... 3×2×2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics... hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
