VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Eunbyung Park; Hogun Park; Jinho Park; Youbin Kim

arxiv: 2605.22570 · v1 · pith:4DUTBFXMnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 07:08 UTC · model grok-4.3

keywords: spatio-temporal reasoning · multimodal large language models · video benchmark · generative video synthesis · active data generation · hierarchical task evaluation · computer vision evaluation

The pith

VGenST-Bench actively synthesizes controlled videos to diagnose fine-grained spatio-temporal reasoning in multimodal language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new video benchmark that relies on generative models to create evaluation scenarios instead of collecting existing footage. This approach allows precise control over spatial scales, perspectives, and scene dynamics through a 3x2x2 taxonomy. A multi-agent pipeline with human oversight generates both the videos and paired question-answer sets. The benchmark separates low-level visual perception tasks from higher-level reasoning tasks in a hierarchical suite. The central goal is to expose specific weaknesses in how current MLLMs handle space and time that passive datasets obscure.
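
The 3x2x2 taxonomy can be made concrete by enumerating its cells. The sketch below is illustrative and is not the authors' code: the axis values and the twelve task identifiers are taken from the figure captions on this page, while `parse_task_id`, `CELL_TO_TASK`, and `ALL_CELLS` are hypothetical names introduced here, and the decoding of the ID suffixes (F/V/E, EGO/EXO, STA/DYN) is an assumption inferred from those captions.

```python
from itertools import product

# Axis values as named in the Figure 3 caption.
SPATIAL_SCALES = ("Figural", "Vista", "Environmental")
PERSPECTIVES = ("Egocentric", "Exocentric")
SCENE_DYNAMICS = ("Static", "Dynamic")

# Task IDs as they appear in the qualitative-example captions.
TASK_IDS = (
    "MC_F_EGO_STA", "QC_F_EGO_DYN", "CI_F_EXO_STA", "CM_F_EXO_DYN",
    "DE_V_EGO_STA", "IO_V_EGO_DYN", "HO_V_EXO_STA", "VI_V_EXO_DYN",
    "DS_E_EGO_STA", "RV_E_EGO_DYN", "LS_E_EXO_STA", "BT_E_EXO_DYN",
)

# Assumed decoding of the ID suffixes (inferred, not stated by the paper).
_SCALE = {"F": "Figural", "V": "Vista", "E": "Environmental"}
_PERSP = {"EGO": "Egocentric", "EXO": "Exocentric"}
_DYN = {"STA": "Static", "DYN": "Dynamic"}


def parse_task_id(task_id):
    """Split e.g. 'MC_F_EGO_STA' into its taxonomy cell (hypothetical helper)."""
    _task, scale, persp, dyn = task_id.split("_")
    return (_SCALE[scale], _PERSP[persp], _DYN[dyn])


# One task per cell: 3 * 2 * 2 = 12 cells in total.
CELL_TO_TASK = {parse_task_id(t): t for t in TASK_IDS}
ALL_CELLS = list(product(SPATIAL_SCALES, PERSPECTIVES, SCENE_DYNAMICS))

if __name__ == "__main__":
    for cell in ALL_CELLS:
        print(cell, "->", CELL_TO_TASK[cell])
```

Keying results by the `(scale, perspective, dynamics)` tuple rather than by the task ID makes it straightforward to aggregate scores along any single axis.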

Core claim

By replacing passive curation of real videos with active synthesis from generative models, VGenST-Bench produces videos whose spatio-temporal properties are known and adjustable in advance. The resulting dataset and task hierarchy let researchers isolate whether an MLLM fails at basic perception, at integrating motion across frames, or at higher-order spatio-temporal inference. This controlled construction directly supports fine-grained diagnosis of model capabilities.

What carries the argument

The multi-agent pipeline that combines generative video models with human quality control to produce videos and QA pairs under an explicit 3x2x2 taxonomy of spatial scale, perspective, and scene dynamics.
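
One way to picture a single stage of such a pipeline is a generate-then-validate loop in which a validator's rejection feedback is passed to the next generation attempt. The captions of Figures 12 and 17 point in this direction (a structured rejection feedback string, and a user prompt that carries prior validator feedback). The sketch below is a minimal illustration under that reading; `run_stage`, `max_attempts`, and the toy generator and validator are hypothetical stand-ins for model calls that this page does not specify.

```python
from typing import Callable, Optional, Tuple


def run_stage(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    task_input: str,
    max_attempts: int = 3,  # assumed retry budget; not stated in the source
) -> Optional[str]:
    """Generate a candidate, validate it, and retry with feedback on rejection."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task_input, feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate
    # What happens when the budget is exhausted is an assumption of this sketch.
    return None


# Toy stand-ins for the model-backed generator and validator (hypothetical).
def toy_generate(theme: str, feedback: Optional[str]) -> str:
    # The first attempt omits a required field; the retry adds it.
    if feedback is None:
        return f"scene:{theme}"
    return f"scene:{theme};objects=3"


def toy_validate(candidate: str) -> Tuple[bool, str]:
    if "objects=" in candidate:
        return True, ""
    return False, "missing required field: objects"


result = run_stage(toy_generate, toy_validate, "Retail Checkout Counter")
```

Returning `None` after the retry budget keeps the failure explicit, so a later human quality-control step can decide whether to discard or repair the sample.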

If this is right

Existing MLLMs can be tested on decoupled perception versus reasoning subtasks to locate exact failure modes.
New models can be trained or fine-tuned against the controlled variations in spatial scale, viewpoint, and dynamics.
Benchmark scores become comparable across models because every video property is known and documented.
The taxonomy supports systematic expansion by adding new dimensions while keeping the synthesis pipeline fixed.
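
The first point, locating failures by level, amounts to computing accuracy separately for each question level and looking at the drop from perception to reasoning. A minimal sketch follows. The level labels and the L1−L3 gap follow the Figure 5 caption; the record layout, the function names, and the sample records are invented for illustration and are not benchmark results.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """records: iterable of (level, is_correct) pairs, level in {'L1', 'L2', 'L3'}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, is_correct in records:
        total[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / total[level] for level in total}


def l1_l3_gap(acc):
    """Accuracy drop from L1 (perception) to L3 (highest reasoning level)."""
    return acc["L1"] - acc["L3"]


# Toy records, not real model outputs.
toy_records = [
    ("L1", True), ("L1", True), ("L1", True), ("L1", True),
    ("L2", True), ("L2", True), ("L2", True), ("L2", False),
    ("L3", True), ("L3", True), ("L3", False), ("L3", False),
]
acc = accuracy_by_level(toy_records)
gap = l1_l3_gap(acc)
```

A large gap with near-ceiling L1 accuracy would suggest the model sees the scene correctly but fails when it has to reason over it.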

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the synthesis method scales reliably, future benchmarks in other domains such as causal or social reasoning could adopt active generation instead of scraping existing media.
The separation of perception and reasoning tasks suggests a template for auditing other multimodal capabilities where low-level feature extraction might mask higher-level deficits.

Load-bearing premise

The videos generated by the pipeline match real-world spatio-temporal properties closely enough that any model errors can be attributed to reasoning deficits rather than artifacts of the synthesis process.

What would settle it

Run the same MLLM suite on VGenST-Bench videos and on matched real-world videos that contain identical spatio-temporal events. If error patterns differ systematically between the two sets, synthesis artifacts account for at least part of the benchmark results and the load-bearing premise fails.
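
A sketch of that comparison, assuming per-cell accuracies are already available for both the synthetic and the matched real videos: compute the per-cell accuracy difference and flag cells where it exceeds a tolerance. The function name, the tolerance value, and the numbers are hypothetical, and a real study would also need a significance test, which this sketch omits.

```python
def flag_divergent_cells(synthetic_acc, real_acc, tolerance=0.05):
    """Return {cell: synthetic - real} for cells whose absolute gap exceeds tolerance."""
    shared = synthetic_acc.keys() & real_acc.keys()
    gaps = {cell: synthetic_acc[cell] - real_acc[cell] for cell in shared}
    return {cell: gap for cell, gap in gaps.items() if abs(gap) > tolerance}


# Toy accuracies keyed by task ID; these are not reported results.
toy_synthetic = {"DE_V_EGO_STA": 0.50, "HO_V_EXO_STA": 0.80, "QC_F_EGO_DYN": 0.75}
toy_real = {"DE_V_EGO_STA": 0.75, "HO_V_EXO_STA": 0.80, "QC_F_EGO_DYN": 0.75}
divergent = flag_divergent_cells(toy_synthetic, toy_real)
```

Cells flagged here would be the first place to look for generation artifacts, since the underlying events are matched by construction.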

Figures

Figures reproduced from arXiv: 2605.22570 by Eunbyung Park, Hogun Park, Jinho Park, Youbin Kim.

**Figure 1.** Examples of VGenST-Bench. Each example contains a generated video and a multiple-choice question targeting a specific spatio-temporal reasoning task. Correct answers are highlighted.

**Figure 2.** Overview of VGenST-Bench. A) Dataset generation. Given input video themes, our multi-agent pipeline jointly synthesizes videos paired with scene graphs, scenarios, and QA sets. B) Task & level design. Videos are organized along a 3 × 2 × 2 taxonomy over Spatial scale, Perspective, and Scene dynamics, with one spatio-temporal task assigned per cell. QA pairs follow a three-level hierarchy: (L1) Visual perce…

**Figure 3.** Representative videos for the 12 tasks of VGenST-Bench. Each cell of the 3 × 2 × 2 taxonomy (Spatial scale × Perspective × Scene dynamics) is paired with one dedicated reasoning task. Rows correspond to spatial scales (Figural / Vista / Environmental); columns are grouped by perspective (Egocentric / Exocentric) and scene dynamics (Static / Dynamic). Each strip shows four sampled frames from a representati…

**Figure 4.** VGenST-Bench construction pipeline. Starting from a theme, four agents operate in sequence. The Scene Graph Agent produces a structured scene graph specifying objects and spatial composition; the Scenario Agent expands it into a temporally grounded scenario with reasoning goal and timeline; the Video Agent synthesizes the corresponding image and video through generative models; and the QA Agent generates b…

**Figure 5.** Hierarchical Analysis: Accuracy across the three question levels. (a) All models degrade consistently from L1 to L3, while humans remain near-ceiling. (b) Breakdown by model, with the L1−L3 gap (∆). … evaluation. As expected, human annotators remain a clear upper bound, achieving 99.0% on average and near-saturation across all twelve tasks. Even the strongest evaluated MLLM, Gemini 3 Flash, achieves only 85.…

**Figure 6.** Robustness Analysis. (a) None-of-these variants show a clear asymmetry: V1 maintains base accuracy, while V2 produces dramatic drops across all models. (b) Open-ended evaluation by question level reveals a large drop on L3 for all models. Together, these reveal that closed-form MCQ accuracy may overestimate spatio-temporal reasoning capability.

**Figure 7.** Reasoning failure on the Direction Estimation task. The model’s reasoning trace correctly identifies the initial orientation, the leftward camera turn, and the final view, but inverts the resulting egocentric direction at the final step, concluding with the wrong answer. … results show that current MLLMs perform multiple-choice reasoning by ranking the given options against each other rather than verifying the c…

**Figure 8.** Word cloud of the 1,000 themes in VGenST-Bench. To maximize visual diversity, VGenST-Bench draws scenarios from a curated pool of themes that specify the visual and semantic context of each video. For each of the tasks in our taxonomy, we manually identified 10 theme categories that are semantically compatible with the task’s required scene properties, spanning everyday, industrial, sci-fi, and fantasy …

**Figure 9.** Construction Pipeline of VGenST-Bench. (i) Task Selector examines the input theme and determines which of the 12 tasks in our taxonomy is most appropriate for that theme. The selector returns a single task assignment (e.g., MC_F_EGO_STA). When constructing VGenST-Bench, we used a curated set of predefined themes (Appendix C.7). Task Selector is therefore bypassed during benchmark construction and is intend…

**Figure 10.** Task Selector: system prompt. Samples a (theme, task) pair from the curated theme pool of the target task (Appendix C.7).

**Figure 11.** Scene Graph Generator: system prompt template. Shared across all 12 tasks; the per-task scene-graph schema (required objects, attributes, relations) is injected into the template at runtime.

**Figure 12.** Scene Graph Validator: system prompt (part 1/2). Verifies schema compliance and emits a structured rejection feedback string when the candidate scene graph fails any required check.

**Figure 13.** Scene Graph Validator: system prompt (part 2/2).

**Figure 14.** Scene Graph Validator: user prompt. Carries the candidate scene graph and the task-specific schema for validation.

**Figure 15.** Scenario Generator: system prompt (part 1/2).

**Figure 16.** Scenario Generator: system prompt (part 2/2).

**Figure 17.** Scenario Generator: user prompt. Carries the scene graph, the task definition, the task rules and guidelines, the reference few-shot examples, and any prior validator feedback.

**Figure 18.** Scenario Validator: system prompt (part 1/2). Checks that the candidate timeline is sufficient to derive the ground-truth answer and contains no contradictions with the underlying scene graph.

**Figure 19.** Scenario Validator: system prompt (part 2/2).

**Figure 20.** Scenario Validator: user prompt. Carries the candidate scenario and the scene graph for cross-checking.

**Figure 21.** Image Prompt Translator: system prompt (part 1/2). Produces the first-frame prompt that the text-to-image generator turns into an anchor frame for downstream video synthesis.

**Figure 22.** Image Prompt Translator: system prompt (part 2/2).

**Figure 23.** Image Prompt Translator: user prompt. Carries the scene graph and the scenario’s initial state.

**Figure 24.** Video Prompt Translator: system prompt (part 1/3). Composes a video prompt that conditions the image-to-video generator on the anchor frame, the scenario’s timeline, and the camera setup.

**Figure 25.** Video Prompt Translator: system prompt (part 2/3).

**Figure 26.** Video Prompt Translator: system prompt (part 3/3).

**Figure 27.** Video Prompt Translator: user prompt. Carries the scenario, the anchor-frame description, and the camera trajectory.

**Figure 28.** QA Generator: system prompt. Generates a base MCQ conditioned on the scene graph, the scenario, and the cell-specific QA template, with distractors drawn from the task’s distractor pool.

**Figure 29.** QA Generator: user prompt. Carries the scene graph, the scenario, the QA template, and the distractor pool entries.

**Figure 30.** Reformatter: system prompt (part 1/2). Expands a base MCQ into the three reformulation variants: V1 (None-of-these distractor), V2 (None-of-these answer), and V3 (open-ended).

**Figure 31.** Reformatter: system prompt (part 2/2).

**Figure 32.** Reformatter: user prompt. Carries the base MCQ and the target variant identifier. H Qualitative Examples. This section provides per-task qualitative examples of VGenST-Bench. For each of the 12 tasks, we sample one representative video (random sample idx) and render four cards: 8 frames of video, the underlying scene graph (verbatim JSON), the scenario (verbatim JSON), and representative QA pairs containing one…

**Figure 33.** Frames for MC_F_EGO_STA (Multi-Container Attribute Mapping), idx 81 (Tennis Player’s Courtside Bench).

**Figure 34.** Scene graph (part 1/2) for MC_F_EGO_STA, idx 81.

**Figure 35.** Scene graph (part 2/2) for MC_F_EGO_STA, idx 81.

**Figure 36.** Scenario for MC_F_EGO_STA, idx 81.

**Figure 37.** Sample QA pairs (one per cognitive level) for MC_F_EGO_STA, idx 81.

**Figure 38.** Frames for QC_F_EGO_DYN (Quantity Change Tracking), idx 14 (Retail Checkout Counter).

**Figure 39.** Scene graph (part 1/2) for QC_F_EGO_DYN, idx 14.

**Figure 40.** Scene graph (part 2/2) for QC_F_EGO_DYN, idx 14.

**Figure 41.** Scenario for QC_F_EGO_DYN, idx 14.

**Figure 42.** Sample QA pairs for QC_F_EGO_DYN, idx 14.

**Figure 43.** Frames for CI_F_EXO_STA (Container Intersection Inference), idx 3 (Bathroom Vanity Counter).

**Figure 44.** Scene graph (part 1/3) for CI_F_EXO_STA, idx 3.

**Figure 45.** Scene graph (part 2/3) for CI_F_EXO_STA, idx 3.

**Figure 46.** Scene graph (part 3/3) for CI_F_EXO_STA, idx 3.

**Figure 47.** Scenario for CI_F_EXO_STA, idx 3.

**Figure 48.** Sample QA pairs for CI_F_EXO_STA, idx 3.

**Figure 49.** Frames for CM_F_EXO_DYN (Causal Mapping), idx 94 (Music Producer’s Synthesizer Stand).

**Figure 50.** Scene graph (part 1/2) for CM_F_EXO_DYN, idx 94.

**Figure 51.** Scene graph (part 2/2) for CM_F_EXO_DYN, idx 94.

**Figure 52.** Scenario for CM_F_EXO_DYN, idx 94.

**Figure 53.** Sample QA pairs for CM_F_EXO_DYN, idx 94.

**Figure 54.** Frames for DE_V_EGO_STA (Direction Estimation), idx 35 (Comedy Club Backstage L-Hallway).

**Figure 55.** Scene graph for DE_V_EGO_STA, idx 35.

**Figure 56.** Scenario for DE_V_EGO_STA, idx 35.

**Figure 57.** Sample QA pairs for DE_V_EGO_STA, idx 35.

**Figure 58.** Frames for IO_V_EGO_DYN (Interacted Object Identification), idx 31 (Farmhouse Kitchen with Prep Table and Hutch).

**Figure 59.** Scene graph (part 1/2) for IO_V_EGO_DYN, idx 31.

**Figure 60.** Scene graph (part 2/2) for IO_V_EGO_DYN, idx 31.

**Figure 61.** Scenario for IO_V_EGO_DYN, idx 31.

**Figure 62.** Sample QA pairs for IO_V_EGO_DYN, idx 31.

**Figure 63.** Frames for HO_V_EXO_STA (Height Ordering), idx 28 (Law Firm Office).

**Figure 64.** Scene graph (part 1/2) for HO_V_EXO_STA, idx 28.

**Figure 65.** Scene graph (part 2/2) for HO_V_EXO_STA, idx 28.

**Figure 66.** Scenario for HO_V_EXO_STA, idx 28.

**Figure 67.** Sample QA pairs for HO_V_EXO_STA, idx 28.

**Figure 68.** Frames for VI_V_EXO_DYN (Visibility Identification), idx 17 (Living Room with Tall Wooden Bookshelf).

**Figure 69.** Scene graph (part 1/2) for VI_V_EXO_DYN, idx 17.

**Figure 70.** Scene graph (part 2/2) for VI_V_EXO_DYN, idx 17.

**Figure 71.** Scenario for VI_V_EXO_DYN, idx 17.

**Figure 72.** Sample QA pairs for VI_V_EXO_DYN, idx 17.

**Figure 73.** Frames for DS_E_EGO_STA (Directional Signage Grounding), idx 94 (Medieval Castle Dungeon Network).

**Figure 74.** Scene graph for DS_E_EGO_STA, idx 94.

**Figure 75.** Scenario for DS_E_EGO_STA, idx 94.

**Figure 76.** Sample QA pairs for DS_E_EGO_STA, idx 94.

**Figure 77.** Frames for RV_E_EGO_DYN (Relative Velocity Identification), idx 13 (Go-Kart Circuit).

**Figure 78.** Scene graph (part 1/2) for RV_E_EGO_DYN, idx 13.

**Figure 79.** Scene graph (part 2/2) for RV_E_EGO_DYN, idx 13.

**Figure 80.** Scenario for RV_E_EGO_DYN, idx 13.

**Figure 81.** Sample QA pairs for RV_E_EGO_DYN, idx 13.

**Figure 82.** Frames for LS_E_EXO_STA (Landmark Spatial Composition), idx 86 (Polar Research Base).

**Figure 83.** Scene graph (part 1/2) for LS_E_EXO_STA, idx 86.

**Figure 84.** Scene graph (part 2/2) for LS_E_EXO_STA, idx 86.

**Figure 85.** Scenario for LS_E_EXO_STA, idx 86.

**Figure 86.** Sample QA pairs for LS_E_EXO_STA, idx 86.

**Figure 87.** Frames for BT_E_EXO_DYN (Behavioral Trigger Identification), idx 94 (Wasteland Highway).

**Figure 88.** Scene graph (part 1/2) for BT_E_EXO_DYN, idx 94.

**Figure 89.** Scene graph (part 2/2) for BT_E_EXO_DYN, idx 94.

**Figure 90.** Scenario for BT_E_EXO_DYN, idx 94.

**Figure 91.** Sample QA pairs for BT_E_EXO_DYN, idx 94.

Original abstract

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

VGenST-Bench shifts to active synthesis for finer control in spatio-temporal MLLM testing, but the lack of validation data leaves the core claim unproven.

The punchline here is that VGenST-Bench tries to improve on existing spatio-temporal benchmarks by actively generating videos with generative models instead of curating passive ones, using a multi-agent pipeline and human checks to create controlled test cases. What comes across as new is the explicit 3x2x2 taxonomy that covers spatial scale, perspective, and scene dynamics, combined with a hierarchical task design that separates perception from reasoning. This structure gives a clearer way to pinpoint specific weaknesses in MLLMs for dynamic environments.

The paper explains the limitations of current datasets well and shows how active synthesis could provide more diversity and control. They earn credit for thinking through the construction process in detail, including the quality control stage.

The softer part is that without any reported results or validation experiments in the work, it's difficult to assess whether the generated videos are realistic enough. The central assumption that failures on these videos point to reasoning deficits rather than generation issues isn't backed by data like consistency metrics or comparisons to real videos. If the synthesis introduces systematic biases in certain taxonomy cells, the diagnostic value drops.

This kind of paper is for people focused on evaluating and advancing multimodal models for video and robotics applications. Readers working on benchmark creation or MLLM testing would find the taxonomy and task hierarchy practical to consider. I would recommend sending it for peer review. The idea has potential and the design choices are thoughtful, so referees can help strengthen the empirical side.

Referee Report

1 major / 2 minor

Summary. The paper introduces VGenST-Bench, a video benchmark for spatio-temporal reasoning in MLLMs that shifts from passive curation to active synthesis via generative models and a multi-agent pipeline with human QC. It defines a 3x2x2 taxonomy (Spatial Scale, Perspective, Scene Dynamics) and a hierarchical task suite decoupling low-level perception from high-level reasoning, claiming this enables fine-grained diagnosis of model capabilities.
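As a rough illustration of the 3x2x2 structure summarized above, the taxonomy's twelve cells can be enumerated as follows. This is a minimal sketch: the three dimension names come from the paper, but the level labels (`scale_a`, `perspective_a`, and so on) are placeholders assumed here, since the actual levels are not listed in this report.

```python
from itertools import product

# Dimension names follow the paper's taxonomy. The level labels are
# hypothetical placeholders; the report does not list the real levels.
TAXONOMY = {
    "spatial_scale": ["scale_a", "scale_b", "scale_c"],  # 3 levels
    "perspective": ["perspective_a", "perspective_b"],   # 2 levels
    "scene_dynamics": ["dynamics_a", "dynamics_b"],      # 2 levels
}


def enumerate_cells(taxonomy):
    """Return every combination of levels as a dict, one per taxonomy cell."""
    names = list(taxonomy)
    return [dict(zip(names, combo)) for combo in product(*taxonomy.values())]


cells = enumerate_cells(TAXONOMY)
print(len(cells))  # 3 * 2 * 2 = 12
```

Keeping each cell as a plain dict makes it easy to group per-video results by any single dimension later on.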

Significance. If the synthesized videos prove free of systematic artifacts that confound reasoning evaluation, the active-synthesis paradigm could meaningfully improve diagnostic granularity over existing passive video benchmarks. The taxonomy and task hierarchy are well-motivated design choices that directly target the stated limitations of prior datasets.

major comments (1)

[Abstract and §3] Abstract and construction pipeline (described in §3): the claim that VGenST-Bench supports fine-grained diagnosis presupposes that generative artifacts do not systematically bias performance across taxonomy cells. No quantitative checks—optical-flow statistics, depth-consistency metrics, motion-continuity scores, or side-by-side realism ratings versus real videos—are reported to verify that the synthesized distribution matches real-world spatio-temporal statistics closely enough for the diagnostic claim to hold.
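One of the checks named above, a motion-continuity score, could take a form like the sketch below. It is an assumed proxy, not a metric defined in the paper: it takes a per-frame motion signal (for example, the mean optical-flow magnitude of each frame, computed by whatever estimator is at hand) and averages the absolute second difference, so smooth motion scores near zero and frame-to-frame jumps score high. The function name and the toy inputs are hypothetical.

```python
def motion_continuity(signal):
    """Mean absolute second difference of a per-frame motion signal.

    `signal` is a list of floats, one value per frame. Lower scores mean
    smoother motion; abrupt changes between frames push the score up.
    """
    if len(signal) < 3:
        raise ValueError("need at least three frames")
    second_diffs = [
        signal[i + 1] - 2 * signal[i] + signal[i - 1]
        for i in range(1, len(signal) - 1)
    ]
    return sum(abs(d) for d in second_diffs) / len(second_diffs)


smooth = [1.0, 2.0, 3.0, 4.0, 5.0]  # steadily changing motion
jumpy = [1.0, 5.0, 1.0, 5.0, 1.0]   # flickering motion
print(motion_continuity(smooth))  # 0.0
print(motion_continuity(jumpy))   # 8.0
```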

minor comments (2)

[§3] The manuscript should clarify the exact generative models and prompting strategies used in the multi-agent pipeline, including any failure modes observed during video synthesis.
[Figures 2–4] Figure captions and taxonomy diagrams would benefit from explicit mapping to the hierarchical task levels to improve readability.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help clarify the evidentiary requirements for our diagnostic claims. We address the major comment point by point below.

Referee: [Abstract and §3] Abstract and construction pipeline (described in §3): the claim that VGenST-Bench supports fine-grained diagnosis presupposes that generative artifacts do not systematically bias performance across taxonomy cells. No quantitative checks—optical-flow statistics, depth-consistency metrics, motion-continuity scores, or side-by-side realism ratings versus real videos—are reported to verify that the synthesized distribution matches real-world spatio-temporal statistics closely enough for the diagnostic claim to hold.

Authors: We agree that the current manuscript lacks the quantitative distributional checks the referee identifies. The multi-agent pipeline and human QC stage are designed to minimize obvious artifacts, but these do not substitute for explicit statistical validation. In the revised manuscript we will add to §3 the suggested analyses: optical-flow statistics, depth-consistency metrics, motion-continuity scores, and side-by-side human realism ratings comparing synthesized videos to real-world counterparts. These additions will directly support the fine-grained diagnosis claim by demonstrating that the synthesized distribution does not systematically deviate from real spatio-temporal statistics across taxonomy cells.

Revision: yes
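A per-cell comparison of the kind promised here might be sketched as below. The approach is an assumption, not the authors' stated method: for each taxonomy cell it compares a scalar per-video statistic between synthesized and real videos using a standardized mean difference (Cohen's d with pooled standard deviation) and flags cells above a cutoff. The cell names, the sample values, and the 0.8 threshold are all made up for illustration, and each sample needs at least two values.

```python
from statistics import mean, stdev


def standardized_difference(synth, real):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n1, n2 = len(synth), len(real)
    pooled_var = (
        (n1 - 1) * stdev(synth) ** 2 + (n2 - 1) * stdev(real) ** 2
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        # No spread in either sample: any mean gap is treated as unbounded.
        return 0.0 if mean(synth) == mean(real) else float("inf")
    return (mean(synth) - mean(real)) / pooled_var ** 0.5


def flag_biased_cells(stats_by_cell, threshold=0.8):
    """Return cells whose |d| exceeds `threshold` (an assumed cutoff)."""
    flagged = {}
    for cell, (synth, real) in stats_by_cell.items():
        d = standardized_difference(synth, real)
        if abs(d) > threshold:
            flagged[cell] = d
    return flagged


# Invented per-video values of some scalar statistic, for illustration only.
example = {
    "cell_1": ([1.0, 1.2, 0.9, 1.1], [1.0, 1.1, 1.0, 1.1]),
    "cell_2": ([3.0, 3.2, 2.9, 3.1], [1.0, 1.1, 0.9, 1.2]),
}
flagged = flag_biased_cells(example)
print(sorted(flagged))  # ['cell_2']
```

A standardized difference only compares means; a fuller analysis would likely also look at the shape of each distribution.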

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

The paper presents a descriptive construction of VGenST-Bench via a multi-agent generative pipeline, 3x2x2 taxonomy, and hierarchical task suite without equations, fitted parameters, predictions, or derivations. The central claim that active synthesis enables fine-grained diagnosis follows directly from the stated design choices (controlled scenarios, decoupling perception from reasoning, human QC) and does not reduce to any prior inputs or self-citations by construction. No load-bearing step equates to its own inputs; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative synthesis can produce sufficiently faithful spatio-temporal scenarios; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Generative models combined with multi-agent pipelines and human review can create videos whose spatio-temporal properties are suitable for fine-grained reasoning evaluation.
Invoked when the paper claims that active synthesis enables precise diagnosis of MLLM capabilities.

pith-pipeline@v0.9.0 · 5721 in / 1277 out tokens · 51367 ms · 2026-05-22T07:08:44.791877+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios... multi-agent pipeline... 3×2×2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics... hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · 17 internal anchors

[1]

Claude sonnet 4.6, 2025

Anthropic. Claude sonnet 4.6, 2025. URL https://www.anthropic.com/claude/sonnet. Accessed: 2026-05

work page 2025
[2]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

work page 2015
[3]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real- world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Revisiting the" video" in video-language understanding

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the" video" in video-language understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022

work page 2022
[5]

Seedream: Bytedance image generation model

ByteDance Seed. Seedream: Bytedance image generation model. https://seed.bytedance. com/en/seedream5_0_lite, 2024. Accessed: 2026-05

work page 2024
[6]

Temporalbench: Towards fine-grained temporal understanding for multimodal video models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Towards fine-grained temporal understanding for multimodal video models. 2024

work page 2024
[7]

Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

work page arXiv 2025
[8]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

work page 2024
[9]

Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[10]

Compositional physical reasoning of objects and events from videos.IEEE transactions on pattern analysis and machine intelligence, 2025

Zhenfang Chen, Shilong Dong, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B Tenenbaum, and Chuang Gan. Compositional physical reasoning of objects and events from videos.IEEE transactions on pattern analysis and machine intelligence, 2025

work page 2025
[11]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

work page 2024
[12]

Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024

work page arXiv 2024
[13]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

work page 2017
[14]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

work page 2025
[17]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

work page 2024
[18]

Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025

Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, et al. Learning human-perceived fakeness in ai-generated videos via multimodal llms.arXiv preprint arXiv:2509.22646, 2025

work page arXiv 2025
[19]

Cater: A diagnostic dataset for compositional actions and temporal reasoning.arXiv preprint arXiv:1910.04744, 2019

Rohit Girdhar and Deva Ramanan. Cater: A diagnostic dataset for compositional actions and temporal reasoning.arXiv preprint arXiv:1910.04744, 2019

work page arXiv 1910
[20]

Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence, 2025

Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, and Rongrong Ji. Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence, 2025. URL https: //arxiv.org/abs/2506.07966

work page arXiv 2025
[21]

Veo 3, 2025

Google. Veo 3, 2025. URLhttps://aistudio.google.com/models/veo-3

work page 2025
[22]

Nano Banana: Gemini image generation model

Google DeepMind. Nano Banana: Gemini image generation model. https://deepmind. google/models/gemini/image/, 2025. Accessed: 2026-05

work page 2025
[23]

Gemini 3 Flash

Google DeepMind. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/,

work page
[24]

Google DeepMind. Gemma 4. https://ai.google.dev/gemma/docs/core/model_ card_4/, 2026. Accessed: 2026-05

work page 2026
[25]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

work page 2017
[26]

Egoexobench: A benchmark for first-and third-person view video understanding in mllms

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025

work page arXiv 2025
[27]

Spatial abilities at different scales: Individual differences in aptitude-test performance and spatial-layout learning.Intelligence, 34(2):151–176, 2006

Mary Hegarty, Daniel R Montello, Anthony E Richardson, Toru Ishikawa, and Kristin Lovelace. Spatial abilities at different scales: Individual differences in aptitude-test performance and spatial-layout learning.Intelligence, 34(2):151–176, 2006

work page 2006
[28]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

work page 2019
[30]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025

work page arXiv 2025
[32]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017. 11

work page 2017
[33]

What’s “up” with vision-language models? investigating their struggle with spatial reasoning

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, 2023

work page 2023
[34]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Allocentric and egocentric spatial representations: Definitions, distinctions, and interconnections

Roberta L Klatzky. Allocentric and egocentric spatial representations: Definitions, distinctions, and interconnections. InSpatial cognition: An interdisciplinary approach to representing and processing spatial knowledge, pages 1–17. Springer, 1998

work page 1998
[36]

Ku, M., Chong, T., Leung, J., Shah, K., Yu, A., and Chen, W

Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs.arXiv preprint arXiv:2506.09987, 2025

work page arXiv 2025
[37]

Kling AI: Kuaishou video generation model

Kuaishou Technology. Kling AI: Kuaishou video generation model. https://klingai.com/,

work page
[38]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[39]

Tvqa+: Spatio-temporal grounding for video question answering

Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020

work page 2020
[40]

Revealing single frame bias for video-and-language learning

Jie Lei, Tamara Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 487–507, 2023

work page 2023
[41]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Videocogqa: A control- lable benchmark for evaluating cognitive abilities in video-language models.arXiv preprint arXiv:2411.09105, 2024

Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, and Yin Zhang. Videocogqa: A control- lable benchmark for evaluating cognitive abilities in video-language models.arXiv preprint arXiv:2411.09105, 2024

work page arXiv 2024
[43]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

work page 2024
[44]

Unfolding spatial cognition: Evaluating multimodal models on visual simulations

Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. Unfolding spatial cognition: Evaluating multimodal models on visual simulations. arXiv preprint arXiv:2506.04633, 2025

work page arXiv 2025
[45]

Sti- bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti- bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

work page 2025
[46]

Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning

Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023

work page 2023
[47]

Videohallu: Evaluating and mitigating multi- modal hallucinations on synthetic video understanding.arXiv preprint arXiv:2505.01481, 2025

Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, and Jordan Lee Boyd-Graber. Videohallu: Evaluating and mitigating multi- modal hallucinations on synthetic video understanding.arXiv preprint arXiv:2505.01481, 2025. 12

work page arXiv 2025
[48]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

work page arXiv 2025
[49]

Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding. arXiv preprint arXiv:2507.07984, 2025

work page arXiv 2025
[50]

Multi-modal situated reasoning in 3d scenes.Advances in Neural Information Process- ing Systems, 37:140903–140936, 2024

Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, and Siyuan Huang. Multi-modal situated reasoning in 3d scenes.Advances in Neural Information Process- ing Systems, 37:140903–140936, 2024

work page 2024
[51]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

work page 2023
[52]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

work page arXiv 2025
[53]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

work page 2024
[54]

Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

work page 2024
[55]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025

work page 2025
[56]

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, et al. Wan-image: Pushing the boundaries of generative visual intelligence.arXiv preprint arXiv:2604.19858, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

Scale and multiple psychologies of space

Daniel R Montello. Scale and multiple psychologies of space. InEuropean conference on spatial information theory, pages 312–321. Springer, 1993

work page 1993
[58]

Kimi K2.6

Moonshot AI. Kimi K2.6. https://huggingface.co/moonshotai/Kimi-K2.6, 2026. Accessed: 2026-05

work page 2026
[59]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ ,

work page
[60]

Perception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36:42748–42761, 2023

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36:42748–42761, 2023

work page 2023
[61]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, 2023

work page 2023
[62]

Clevr-x: A visual reasoning dataset for natural language explanations

Leonard Salewski, A Sophia Koepke, Hendrik PA Lensch, and Zeynep Akata. Clevr-x: A visual reasoning dataset for natural language explanations. InInternational Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 69–88. Springer, 2020

work page 2020
[63]

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Vidu: Ai video generation model

Shengshu Technology. Vidu: Ai video generation model. https://www.vidu.com/, 2024. Accessed: 2026-05. 13

work page 2024
[65]

Both text and images leaked! a systematic analysis of data contamination in multimodal llm

Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, and Benyou Wang. Both text and images leaked! a systematic analysis of data contamination in multimodal llm. arXiv preprint arXiv:2411.03823, 2024

work page arXiv 2024
[66]

Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025

Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, and Kai Chen. Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025

work page arXiv 2025
[67]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[68]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026. URL https://arxiv.org/abs/2604.14148. ByteDance Seed

work page internal anchor Pith review Pith/arXiv arXiv 2026
[69]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

work page 2024
[71]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models

Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. Advances in Neural Information Processing Systems, 37:75392–75421, 2024

work page 2024
[73]

Spatialviz-bench: Automatically generated spatial visualization reasoning tasks for mllms.arXiv e-prints, pages arXiv–2507, 2025

Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, and Jun Wang. Spatialviz-bench: Automatically generated spatial visualization reasoning tasks for mllms.arXiv e-prints, pages arXiv–2507, 2025

work page 2025
[74]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[75]

3d-aware visual question answering about parts, poses and occlusions.Advances in Neural Information Processing Systems, 36:58717–58735, 2023

Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, and Alan L Yuille. 3d-aware visual question answering about parts, poses and occlusions.Advances in Neural Information Processing Systems, 36:58717–58735, 2023

work page 2023
[76]

Compositional 4d dynamic scenes understanding with physics priors for video question answer- ing.arXiv preprint arXiv:2406.00622, 2024

Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, and Alan Yuille. Compositional 4d dynamic scenes understanding with physics priors for video question answer- ing.arXiv preprint arXiv:2406.00622, 2024

[77]

Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models

Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24669–24679, 2025

[78]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

[79]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

[80]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

