4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Bo Li; Hongyu Li; Manyuan Zhang; Mingze Sun; Ruqi Huang; Shuang Chen; Xiang An; Xiaobin Hu; Xinlei Yu; Xin Xie

arxiv: 2605.05997 · v2 · pith:3FBW3A4Anew · submitted 2026-05-07 · 💻 cs.CV

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Zhangquan Chen , Manyuan Zhang , Xinlei Yu , Xiang An , Bo Li , Xin Xie , ZiDong Wang , Mingze Sun

show 4 more authors

Shuang Chen Hongyu Li Xiaobin Hu Ruqi Huang

This is my paper

Pith reviewed 2026-05-25 06:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D reasoningvision-language modelsdynamic spatial understandinglatent mental imageryreinforcement learningmonocular video

0 comments

The pith

Vision-language models can reason about moving scenes by simulating 4D changes inside their hidden space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that VLMs can develop an internal ability to simulate how scenes evolve over time in their continuous latent space rather than describing dynamics only in text or calling external geometry tools. It introduces an annotation-free pipeline to create 4D reasoning data from ordinary videos, then uses Dynamic-Imagery Fine-Tuning to supervise both text outputs and 4D latents together, followed by 4D Reinforcement Learning that applies outcome rewards while keeping gradients on text tokens only. If this works, models gain intrinsic dynamic spatial understanding that stays efficient at inference time. A reader would care because monocular video is the most common way machines see the physical world, yet current VLMs still struggle with its temporal and geometric demands.

Core claim

4DThinker is the first framework that lets VLMs think with 4D by internally simulating scene evolution inside the continuous hidden space; this is realized through a scalable data-generation pipeline from raw videos, Dynamic-Imagery Fine-Tuning that jointly supervises textual tokens and 4D latents, and 4D Reinforcement Learning that restricts policy gradients to text tokens for stable optimization on complex tasks.

What carries the argument

Dynamic latent mental imagery: the mechanism that lets the model simulate how scenes evolve within its continuous hidden space.

If this is right

VLMs achieve higher accuracy on multiple dynamic spatial reasoning benchmarks without added inference cost.
Reasoning stays inside the model rather than depending on separate geometry engines.
The same training recipe scales to new raw video sources because the data pipeline requires no manual 4D labels.
Policy optimization remains stable because gradients are confined to text tokens even when 4D latents are involved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could transfer to embodied agents that must predict object motion from onboard cameras.
Removing the text-only gradient restriction might allow end-to-end 4D control signals but would require new stability techniques.
The method opens a route to test whether other latent dimensions beyond 4D (such as force or material properties) can be learned the same way.

Load-bearing premise

The assumption that Dynamic-Imagery Fine-Tuning and 4D Reinforcement Learning can ground the model in dynamic visual semantics without any external geometric modules.

What would settle it

A controlled test in which 4DThinker is run on the same dynamic spatial reasoning benchmarks but with the 4D latent supervision removed; if accuracy stays the same as the full model, the claim that internal 4D simulation is necessary collapses.

Figures

Figures reproduced from arXiv: 2605.05997 by Bo Li, Hongyu Li, Manyuan Zhang, Mingze Sun, Ruqi Huang, Shuang Chen, Xiang An, Xiaobin Hu, Xinlei Yu, Xin Xie, Zhangquan Chen, Zidong Wang.

**Figure 1.** Figure 1: Overview of 4DThinker. Top: Inference architecture. The model interleaves text reasoning with latent visual tokens as “mental imagery” on a continuous manifold, enabling correct dynamic reasoning where purely textual CoT (e.g., Gemini-3.1-Pro) fails. Bottom: Two-stage training pipeline built on the data from view at source ↗

**Figure 2.** Figure 2: Overview of our scalable, annotation-free 4D data generation pipeline in three stages. (1) Video preprocessing: raw videos are processed via MegaSaM and SAM3 to extract camera trajectories and consistent mask overlays for landmarks. (2) Motion-centric QA construction: the pipeline formulates MCQs and imagery for both camera and object motions, grounded by sampled boundary or interval overlays. (3) Imagery-… view at source ↗

**Figure 3.** Figure 3: Prompt for landmark identification. Mhigh identifies one static and one dynamic object with short visual descriptions, which are subsequently used as text prompts for SAM3 mask extraction. Camera motion QA prompts. For camera motion data, the question generation prompt ( view at source ↗

**Figure 4.** Figure 4: Prompt for static mask consistency verification. Mhigh evaluates four criteria across all overlay frames to implement the consistency filter (Eq. (2)) view at source ↗

**Figure 5.** Figure 5: Prompt for dynamic object mask verification. Unlike the binary static check ( view at source ↗

**Figure 6.** Figure 6: Prompt for camera motion question generation. Given a time segment and answer options, Mhigh produces a natural-language MCQ view at source ↗

**Figure 7.** Figure 7: Prompt for object movement direction analysis. Mhigh analyzes centroid displacement and apparent scale variation across masked key frames to determine the primary movement direction, while explicitly separating camera ego-motion from the object’s own motion. C Candidate Question Types and Answer Choices Tab. 6 summarizes the five candidate question types in our data generation pipeline, and Tab. 7 lists th… view at source ↗

**Figure 8.** Figure 8: Prompt for object speed change analysis. Complementary to the view at source ↗

**Figure 9.** Figure 9: Prompts for object motion question generation (four types). Each variant probes a different aspect of dynamic understanding: (a) movement direction, (b) 4D question with boundingbox grounding, (c) distance change relative to the camera, and (d) speed variation over time. D Training and Evaluation Datasets Training data. The DIFT stage uses ∼38K samples synthesized by our pipeline from SpatialVID Wang et … view at source ↗

**Figure 10.** Figure 10: Prompt for camera motion CoT synthesis. Given the video, static mask overlays, and the correct answer, Mhigh produces reasoning trace where <output_image> placeholders represent the model’s own “mental imagery,” which are later replaced by latent visual tokens during DIFT training view at source ↗

**Figure 11.** Figure 11: Prompt for object motion CoT synthesis. Analogous to the camera motion variant ( view at source ↗

**Figure 12.** Figure 12: The system instruction appended before every question during DIFT training, 4DRL training, and inference. It specifies the output format that the model must follow. • Dir (Direction): The movement direction of a target object. • Ori (Orientation): How the orientation of an object evolves. • Spd (Speed): The speed change pattern of a target object. • SpdC (Speed Comparison): Comparing the speeds of two obj… view at source ↗

**Figure 13.** Figure 13: Qualitative example on DSR-Bench (fine-grained). 4DThinker correctly identifies a two-phase pattern (first becomes larger, then keeps constant) by mentally simulating the guinea pig’s trajectory via latent 4D imagery. Both Gemini-3 and the base Qwen2.5-VL-3B fail. H Implementation Details Base model. We build 4DThinker on a list of base VLM (e.g., Qwen2.5-VL Bai et al. (2025), Qwen3-VL Team (2025), Intern… view at source ↗

**Figure 14.** Figure 14: Qualitative example on Dyn-Bench (holistic). 4DThinker correctly identifies the player’s diagonal movement pattern across the full court by mentally tracking his position through 4D latents, while both Gemini-3 and the base Qwen2.5-VL-3B incorrectly conclude that the player stays in one half of the court, relying on local frame-level heuristics. Mask overlay parameters. For generating mask overlays (Eq. (… view at source ↗

**Figure 15.** Figure 15: Qualitative example. 4DThinker tracks the panda’s apparent size across frames through latent visual tokens and correctly determines that the size ratio remains stable, distinguishing the panda’s posture change from actual camera zoom or physical depth movement view at source ↗

**Figure 16.** Figure 16: Qualitative example. Given a first-person driving video, 4DThinker tracks gradual environmental transitions via 4D latents and predict the open fields to dense forest. 21 view at source ↗

read the original abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

4DThinker introduces latent 4D imagery for VLMs via DIFT and 4DRL, but the RL stage restricts gradients to text tokens only, which limits how much the imagery actually adapts on reasoning tasks.

read the letter

The paper's main move is to train VLMs to keep internal 4D latents that simulate scene dynamics over time, using an annotation-free video synthesis pipeline, then DIFT for joint text-and-latent supervision, and finally 4DRL with outcome rewards. The RL step explicitly blocks gradients from reaching the 4D latents to keep training stable. That design is the central technical choice and also the clearest limitation on the page. The synthesis pipeline and the DIFT stage look like the parts that actually add something new; prior work either stayed in text or bolted on external geometry modules. The data generation approach is practical and could stand alone as a contribution for anyone needing dynamic spatial examples. The abstract states that the model beats strong baselines across benchmarks, but supplies no numbers, no setup details, and no ablation on whether the latents are doing the work or just riding along. The gradient restriction in 4DRL means the 4D component receives no direct optimization signal once the model moves to the target tasks. If the synthetic data does not perfectly match benchmark distributions, the latents stay fixed while only the text policy improves. That undercuts the claim that the model is truly thinking with 4D rather than using 4D latents as a fixed feature extractor. The paper is aimed at groups working on VLMs for robotics or embodied reasoning. A reader who wants concrete alternatives to text-only or modular pipelines will find usable ideas here, even if the results section needs to be checked in full. It is worth sending to referees because the problem is well-motivated and the training split is distinct, though the gradient choice and the missing metrics are points that should be addressed in review.

Referee Report

2 major / 1 minor

Summary. The paper presents 4DThinker as the first framework enabling VLMs to perform dynamic spatial reasoning from monocular video by 'thinking with 4D' via internally simulated dynamic latent mental imagery. It introduces an annotation-free pipeline to synthesize 4D reasoning data from raw videos, Dynamic-Imagery Fine-Tuning (DIFT) for joint supervision of textual tokens and 4D latents, and 4D Reinforcement Learning (4DRL) using outcome-based rewards with policy gradients restricted to text tokens. Experiments across multiple benchmarks show consistent outperformance over strong baselines, with code released at the provided GitHub repository.

Significance. If the central claims hold after addressing the optimization design, the work offers a meaningful advance by shifting from external geometric modules or purely textual reasoning to intrinsic 4D latent simulation in VLMs. The annotation-free synthesis and code release are strengths that support reproducibility and further exploration of latent imagery for physical reasoning tasks.

major comments (2)

[4DRL] 4DRL description (following DIFT in the method): restricting policy gradients to text tokens only means the 4D latents receive no direct optimization signal from the outcome-based rewards on target reasoning tasks. This risks the imagery component remaining static after DIFT if the synthesized data distribution does not perfectly match benchmarks, directly undermining the claim that the model is 'thinking with 4D' during complex reasoning.
[DIFT] DIFT section: the joint supervision on textual tokens and 4D latents from synthesized data is presented as grounding dynamic visual semantics, but no analysis is provided on whether the 4D latents continue to evolve or contribute causally during 4DRL inference on held-out benchmarks.

minor comments (1)

[Abstract] Abstract: claims of outperformance are stated without even high-level metrics or baseline names; a one-sentence summary of key results would improve readability while full tables remain in the experiments section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential significance of 4DThinker. We address each major comment below with the strongest honest defense supported by the manuscript's design and results. Where clarification or additional analysis is warranted, we commit to revisions.

read point-by-point responses

Referee: [4DRL] 4DRL description (following DIFT in the method): restricting policy gradients to text tokens only means the 4D latents receive no direct optimization signal from the outcome-based rewards on target reasoning tasks. This risks the imagery component remaining static after DIFT if the synthesized data distribution does not perfectly match benchmarks, directly undermining the claim that the model is 'thinking with 4D' during complex reasoning.

Authors: The restriction of policy gradients to text tokens in 4DRL is a deliberate design choice for training stability: applying outcome-based rewards directly to the high-dimensional 4D latent space risks unstable or noisy updates given the sparsity of the signal. The 4D latents are already grounded via joint supervision in DIFT on the synthesized data, and during inference the model generates and conditions on these latents as part of its forward pass for all reasoning steps. The empirical gains over strong baselines that lack any 4D component indicate that the imagery remains functional and beneficial on held-out tasks. We will revise the 4DRL section to explicitly articulate this rationale and the implicit role of the latents. revision: partial
Referee: [DIFT] DIFT section: the joint supervision on textual tokens and 4D latents from synthesized data is presented as grounding dynamic visual semantics, but no analysis is provided on whether the 4D latents continue to evolve or contribute causally during 4DRL inference on held-out benchmarks.

Authors: We agree that direct evidence of the 4D latents' causal contribution after DIFT would strengthen the central claim. In the revision we will add an analysis subsection with (i) an ablation that freezes the 4D latents after DIFT and measures the resulting drop on 4DRL benchmarks, and (ii) qualitative visualizations of latent trajectories on held-out videos to illustrate continued dynamic simulation. These additions will be placed after the 4DRL description. revision: yes

Circularity Check

0 steps flagged

No significant circularity; methods are independent training procedures

full rationale

The paper's core claims rest on a new annotation-free synthesis pipeline, DIFT joint supervision, and 4DRL with explicit gradient restriction to text tokens. These are presented as novel procedural contributions without any reduction of outputs to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. No equations or derivations in the provided text equate a 'prediction' to its own training signal. The design choices (e.g., restricting gradients) are explicit engineering decisions, not hidden equivalences. This matches the default case of a self-contained methodological paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract provides high-level description; detailed assumptions would require full paper.

axioms (1)

domain assumption Internal 4D latent representations can capture dynamic spatial semantics better than text alone
Basis for the entire framework.

invented entities (1)

4D latent mental imagery no independent evidence
purpose: To allow VLMs to simulate scene dynamics internally
Core new concept introduced without external validation mentioned.

pith-pipeline@v0.9.0 · 5802 in / 1232 out tokens · 50283 ms · 2026-05-25T06:12:37.349708+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DIFT jointly supervises textual tokens and 4D latents … Lsim = 1 − (1/|Tlat|) Σ h⊤t−1 zt / (‖ht−1‖ ‖zt‖)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

first framework that enables VLMs to 'think with 4D' through dynamic latent mental imagery … internally simulating how scenes evolve within the continuous hidden space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 10 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

work page arXiv
[3]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Adagar: Adaptive gabor representation for dynamic scene reconstruction.arXiv preprint arXiv:2601.00796,

Jiewen Chan, Zhenjun Zhao, and Yu-Lun Liu. Adagar: Adaptive gabor representation for dynamic scene reconstruction.arXiv preprint arXiv:2601.00796,

work page arXiv
[5]

Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025a

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025a. Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, and Ruqi Huang. Sifthinker: ...

work page arXiv
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Think before you speak: Training language models with pause tokens.arXiv preprint arXiv:2310.02226,

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens.arXiv preprint arXiv:2310.02226,

work page arXiv
[10]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world.arXiv preprint arXiv:2603.12746,

Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, et al. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world.arXiv preprint arXiv:2603.12746,

work page arXiv
[12]

Latent Visual Reasoning

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025a. Haoang Li, Ji Zhao, Jean-Charles Bazin, Pyojin Kim, Kyungdon Joo, Zhenjun Zhao, and Yun-Hui Liu. Hong kong world: Leveraging structural regularity for line-based slam.IE...

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu. Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8592–8603, 2025b. Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang ...

work page arXiv
[14]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025a

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025a. Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Lon...

work page arXiv
[15]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025a

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025a. 11 Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie ...

work page arXiv
[19]

Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning.MATH-AI @ NeurIPS 2025,

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, and Zhang Hengyuan. Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning.MATH-AI @ NeurIPS 2025,

work page 2025
[20]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025a. Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangta...

work page arXiv
[21]

Mllm-4d: Towards visual-based spatial-temporal intelligence.arXiv preprint arXiv:2603.00515,

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, and Xiaodong Cun. Mllm-4d: Towards visual-based spatial-temporal intelligence.arXiv preprint arXiv:2603.00515,

work page arXiv
[22]

The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029,

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029,

work page arXiv
[23]

Dsi-bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873,

Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873,

work page arXiv
[24]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

work page arXiv
[25]

Learning to reason in 4d: Dynamic spatial understanding for vision language models.arXiv preprint arXiv:2512.20557, 2025a

Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, and Xiaojuan Qi. Learning to reason in 4d: Dynamic spatial understanding for vision language models.arXiv preprint arXiv:2512.20557, 2025a. Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, and Xiaojuan Qi. Learning to reason in 4d: Dynamic spatial understand...

work page arXiv
[26]

the red car on the left lane

12 A Object Selection Rules As described in Sec. 3.1, we define a set of predefined rules R to guide Mhigh in selecting a representative static object os and a dynamic object od from each video. Specifically, we instruct Mhigh with the following criteria: Static object selection. • The object must bestationarythroughout the entire video (e.g., the traffic...

work page 2025
[27]

mental imagery,

is appended before every question during DIFT training, 4DRL, and inference, specifying the output format that the model must follow. During 4DRL, the format reward Rfmt checks adherence to this think-answer structure. Table 6: Candidate question types, target objects, and descriptions in our data generation pipeline. Category Type Target Object Descripti...

work page 2026
[28]

Absolute

DSR-Bench subtasks.DSR-Bench Zhou et al. (2025a) organizes its 13 subtasks along two axes: viewpoint mobility(Absolute vs. Relative) andspatial attribute type. “Absolute” (A.) denotes that the viewpoint is fixed at a specific timestamp, while “Relative” (R.) denotes that the viewpoint moves with the observing agent over time. The attribute types are: •Dis...

work page 2026

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

work page arXiv

[3] [3]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Adagar: Adaptive gabor representation for dynamic scene reconstruction.arXiv preprint arXiv:2601.00796,

Jiewen Chan, Zhenjun Zhao, and Yu-Lun Liu. Adagar: Adaptive gabor representation for dynamic scene reconstruction.arXiv preprint arXiv:2601.00796,

work page arXiv

[5] [5]

Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025a

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025a. Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, and Ruqi Huang. Sifthinker: ...

work page arXiv

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Think before you speak: Training language models with pause tokens.arXiv preprint arXiv:2310.02226,

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens.arXiv preprint arXiv:2310.02226,

work page arXiv

[10] [10]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world.arXiv preprint arXiv:2603.12746,

Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, et al. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world.arXiv preprint arXiv:2603.12746,

work page arXiv

[12] [12]

Latent Visual Reasoning

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025a. Haoang Li, Ji Zhao, Jean-Charles Bazin, Pyojin Kim, Kyungdon Joo, Zhenjun Zhao, and Yun-Hui Liu. Hong kong world: Leveraging structural regularity for line-based slam.IE...

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu. Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8592–8603, 2025b. Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang ...

work page arXiv

[14] [14]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025a

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning.arXiv preprint arXiv:2501.10074, 2025a. Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Lon...

work page arXiv

[15] [15]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025a

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025a. 11 Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie ...

work page arXiv

[19] [19]

Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning.MATH-AI @ NeurIPS 2025,

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, and Zhang Hengyuan. Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning.MATH-AI @ NeurIPS 2025,

work page 2025

[20] [20]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025a. Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangta...

work page arXiv

[21] [21]

Mllm-4d: Towards visual-based spatial-temporal intelligence.arXiv preprint arXiv:2603.00515,

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, and Xiaodong Cun. Mllm-4d: Towards visual-based spatial-temporal intelligence.arXiv preprint arXiv:2603.00515,

work page arXiv

[22] [22]

The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029,

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029,

work page arXiv

[23] [23]

Dsi-bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873,

Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence.arXiv preprint arXiv:2510.18873,

work page arXiv

[24] [24]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

work page arXiv

[25] [25]

Learning to reason in 4d: Dynamic spatial understanding for vision language models.arXiv preprint arXiv:2512.20557, 2025a

Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, and Xiaojuan Qi. Learning to reason in 4d: Dynamic spatial understanding for vision language models.arXiv preprint arXiv:2512.20557, 2025a. Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, and Xiaojuan Qi. Learning to reason in 4d: Dynamic spatial understand...

work page arXiv

[26] [26]

the red car on the left lane

12 A Object Selection Rules As described in Sec. 3.1, we define a set of predefined rules R to guide Mhigh in selecting a representative static object os and a dynamic object od from each video. Specifically, we instruct Mhigh with the following criteria: Static object selection. • The object must bestationarythroughout the entire video (e.g., the traffic...

work page 2025

[27] [27]

mental imagery,

is appended before every question during DIFT training, 4DRL, and inference, specifying the output format that the model must follow. During 4DRL, the format reward Rfmt checks adherence to this think-answer structure. Table 6: Candidate question types, target objects, and descriptions in our data generation pipeline. Category Type Target Object Descripti...

work page 2026

[28] [28]

Absolute

DSR-Bench subtasks.DSR-Bench Zhou et al. (2025a) organizes its 13 subtasks along two axes: viewpoint mobility(Absolute vs. Relative) andspatial attribute type. “Absolute” (A.) denotes that the viewpoint is fixed at a specific timestamp, while “Relative” (R.) denotes that the viewpoint moves with the observing agent over time. The attribute types are: •Dis...

work page 2026