pith. machine review for the scientific record.

arxiv: 2504.01805 · v2 · submitted 2025-04-02 · 💻 cs.CV

Recognition: 2 theorem links

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords video spatial reasoning · multimodal large language models · reinforcement learning · RLVR · map imagination · spatial benchmarks · VSI-Bench

The pith

SpaceR uses RL with a map imagination step to lift open MLLMs above GPT-4o on video spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SpaceR framework to tackle video spatial reasoning in multimodal large language models, a task that requires inferring 3D layouts from 2D video frames. It builds the SpaceR-151k dataset with 91k verifiable spatial questions across varied scenarios plus 60k general samples, then applies Spatially-Guided RLVR that extends standard GRPO by adding a map imagination step during reasoning. Experiments show the resulting models reach state-of-the-art accuracy on spatial benchmarks while remaining competitive on general video understanding tasks. A sympathetic reader would care because the method offers a concrete path to strengthen open models' spatial capabilities using targeted reinforcement learning rather than ever-larger pretraining.
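
To make the "verifiable" part concrete, here is a minimal sketch of what one SpaceR-151k-style record and its rule-based answer check might look like. The field names, the example clip name, and the 10% tolerance are assumptions for illustration (the question text follows one of the paper's appendix templates); the actual schema and reward rules may differ.

    # Hypothetical SpaceR-151k-style record; all field names and values are illustrative.
    sample = {
        "video": "indoor_scene_0042.mp4",
        "question": "Measuring from the closest point of each object, what is the direct "
                    "distance between the chair and the sofa (in meters)?",
        "answer_type": "numeric",
        "ground_truth": 2.3,
    }

    def verify(pred: str, record: dict, rel_tol: float = 0.10) -> float:
        """Return 1.0 if the prediction is verifiably correct under the assumed rule, else 0.0."""
        if record["answer_type"] == "numeric":
            try:
                value = float(pred)
            except ValueError:
                return 0.0
            gt = record["ground_truth"]
            return float(abs(value - gt) <= rel_tol * abs(gt))
        # multiple-choice questions reduce to an exact match on the chosen option
        return float(pred.strip().lower() == str(record["ground_truth"]).strip().lower())

    print(verify("2.4", sample))  # 1.0 under the assumed 10% relative tolerance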

Core claim

SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks including VSI-Bench, STI-Bench, and SPAR-Bench by training on the SpaceR-151k dataset with Spatially-Guided RLVR, which extends GRPO with a map imagination mechanism that encourages the model to infer spatial layouts in its thinking process. At the same time, the model maintains competitive results on video understanding benchmarks such as Video-MME, TempCompass, and LongVideoBench.

What carries the argument

Spatially-Guided RLVR (SG-RLVR), which extends Group Relative Policy Optimization by inserting a map imagination mechanism that prompts the model to construct spatial layouts during the reasoning trace before producing an answer.
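
A minimal sketch of how a map-imagination reward term could sit inside GRPO-style group-relative advantages, assuming a tagged reasoning trace and the verify() helper sketched above. The <map> and <answer> tags, the 0.2/0.8 weighting, and the group-advantage details are assumptions, not the paper's implementation.

    import re
    import statistics

    def sg_reward(completion: str, record: dict) -> float:
        # Assumed composite reward: a small bonus for producing a map-imagination
        # block plus a verifiable-answer term; the weights are illustrative.
        has_map = bool(re.search(r"<map>.+?</map>", completion, re.S))
        answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
        return 0.2 * float(has_map) + 0.8 * verify(answer, record)

    def group_relative_advantages(rewards: list[float]) -> list[float]:
        # GRPO-style advantage: standardize each reward within its rollout group,
        # so no learned value function is needed.
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards) or 1.0  # guard against constant groups
        return [(r - mu) / sigma for r in rewards]

    # For each question, sample a group of completions from the policy, score them
    # with sg_reward, and use the standardized advantages in the clipped policy update.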

If this is right

  • SpaceR surpasses GPT-4o by 11.6% accuracy on VSI-Bench.
  • SpaceR reaches performance on par with Gemini-2.0-Flash on VSI-Bench.
  • State-of-the-art results hold on STI-Bench and SPAR-Bench.
  • General video understanding performance stays competitive on Video-MME, TempCompass, and LongVideoBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Map imagination guidance could be tested on image-only spatial tasks or on temporal reasoning to see if the same RL structure transfers.
  • Specialized verifiable-reward RL loops may narrow the gap between open and proprietary MLLMs on other narrow reasoning skills.
  • Independent spatial benchmarks created after model release would provide a stronger test of whether the gains reflect genuine layout understanding.

Load-bearing premise

The map imagination mechanism inside SG-RLVR genuinely improves spatial reasoning rather than merely increasing the chance of producing benchmark-correct answers during RL training.

What would settle it

An ablation that removes the map imagination step from SG-RLVR while keeping the rest of the training procedure would show whether the accuracy gains on VSI-Bench and similar benchmarks disappear.
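
Sketched under assumed configuration names, that comparison could be organized as two runs differing in a single flag; nothing below reflects the authors' actual training scripts.

    # Hypothetical ablation layout: identical data and GRPO settings, with only the
    # map-imagination step toggled. Keys and values are assumptions for illustration.
    base_config = {
        "dataset": "SpaceR-151k",
        "algorithm": "GRPO",
        "rollout_group_size": 8,        # illustrative value
        "use_map_imagination": True,    # SG-RLVR
    }
    ablation_config = {**base_config, "use_map_imagination": False}  # plain GRPO

    # Train both and evaluate on the same held-out benchmark split, e.g.:
    #   acc_sg   = evaluate(train(base_config),     benchmark="VSI-Bench")
    #   acc_grpo = evaluate(train(ablation_config), benchmark="VSI-Bench")
    # If acc_sg - acc_grpo is near zero, the gains cannot be attributed to the
    # map-imagination mechanism.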

read the original abstract

Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the $\textbf{SpaceR}$ framework. First, we present $\textbf{SpaceR-151k}$, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose $\textbf{Spatially-Guided RLVR (SG-RLVR)}$, a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6\% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the SpaceR framework to enhance MLLMs for video spatial reasoning. It contributes the SpaceR-151k dataset (91k verifiable spatial questions plus 60k general samples) and proposes Spatially-Guided RLVR (SG-RLVR), which extends GRPO by adding a map imagination step that encourages explicit spatial layout inference before answer generation. Experiments report SOTA results on VSI-Bench, STI-Bench, and SPAR-Bench, with SpaceR surpassing GPT-4o by 11.6% accuracy on VSI-Bench while remaining competitive on general video benchmarks such as Video-MME.

Significance. If the performance gains prove attributable to the map imagination mechanism rather than generic RL effects or dataset scale, the work would meaningfully advance open-source MLLM spatial reasoning and reduce reliance on proprietary models. The public release of code, model, and dataset supports reproducibility and follow-on research.

major comments (2)
  1. [Experiments] Experiments section: No ablation is reported that trains standard GRPO versus SG-RLVR on the identical SpaceR-151k data. Without this control, the 11.6% VSI-Bench gain cannot be confidently attributed to the map imagination mechanism rather than reward optimization or data effects alone.
  2. [Results] Results tables (VSI-Bench, STI-Bench): Baseline implementations for GPT-4o and Gemini-2.0-Flash are not detailed, nor are variance estimates or statistical significance tests provided for the reported accuracy differences, weakening the strength of the SOTA claim.
minor comments (1)
  1. [Abstract] The abstract and method description could more explicitly separate the contributions of the new dataset from those of the SG-RLVR algorithm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate the suggested revisions to strengthen the attribution of results and the transparency of our evaluations.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation is reported that trains standard GRPO versus SG-RLVR on the identical SpaceR-151k data. Without this control, the 11.6% VSI-Bench gain cannot be confidently attributed to the map imagination mechanism rather than reward optimization or data effects alone.

    Authors: We agree that this control experiment is essential to isolate the contribution of the map imagination step. In the revised manuscript we will add an ablation that trains the base model with standard GRPO on the exact same SpaceR-151k dataset and directly compares its performance against SG-RLVR on VSI-Bench, STI-Bench, and SPAR-Bench. This will allow readers to attribute gains more confidently to the spatial guidance mechanism. revision: yes

  2. Referee: [Results] Results tables (VSI-Bench, STI-Bench): Baseline implementations for GPT-4o and Gemini-2.0-Flash are not detailed, nor are variance estimates or statistical significance tests provided for the reported accuracy differences, weakening the strength of the SOTA claim.

    Authors: We acknowledge the need for greater transparency. In the revision we will expand the experimental details section to describe the exact prompts, input formatting, and inference settings used for GPT-4o and Gemini-2.0-Flash. We will also report standard deviations across multiple runs and include statistical significance tests (e.g., paired t-tests) for the key accuracy differences on VSI-Bench and STI-Bench. revision: yes
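
For the significance tests the rebuttal promises, one standard recipe is a paired test over per-question correctness on the shared benchmark items. The sketch below uses placeholder data and SciPy's paired t-test; for binary outcomes McNemar's test is the more conventional choice. Nothing here draws on the paper's actual results.

    import numpy as np
    from scipy.stats import ttest_rel

    # Placeholder per-item correctness vectors (1 = correct) for two systems on the
    # same benchmark questions; in a real analysis these come from the eval logs.
    rng = np.random.default_rng(0)
    spacer_correct = rng.integers(0, 2, size=500).astype(float)
    baseline_correct = rng.integers(0, 2, size=500).astype(float)

    t_stat, p_value = ttest_rel(spacer_correct, baseline_correct)
    print(f"paired t-test: t={t_stat:.3f}, p={p_value:.4f}")

    # McNemar's test on the discordant pairs, or a paired bootstrap over items,
    # are common alternatives for binary per-item outcomes.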

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks with independent evaluation

full rationale

The paper introduces the SpaceR-151k dataset and SG-RLVR training (extending GRPO with map imagination) and then reports accuracy gains on held-out benchmarks (VSI-Bench, STI-Bench, SPAR-Bench, Video-MME, etc.). No equations, fitted parameters, or self-citations reduce the headline 11.6% improvement to a quantity defined by the training data itself. The map imagination step is a proposed mechanism whose benefit is measured by downstream benchmark scores rather than by construction or renaming of inputs. All load-bearing claims rest on external test sets and standard RLVR training, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work assumes that RLVR-style training with verifiable rewards transfers from text LLMs to multimodal video spatial tasks, and that the introduced map imagination step adds genuine spatial structure without introducing new free parameters beyond standard RL hyperparameters.

free parameters (1)
  • RL training hyperparameters
    Standard learning rate, batch size, and reward scaling parameters typical of GRPO-style training; not enumerated in the abstract (an illustrative placeholder set is sketched after the ledger).
axioms (1)
  • domain assumption: Verifiable rewards can be reliably assigned to spatial reasoning questions in video
    Central to the RLVR paradigm applied here.
invented entities (1)
  • map imagination mechanism (no independent evidence)
    purpose: Encourage the model to infer spatial layouts during thinking
    New component added to GRPO; no independent falsifiable prediction supplied in the abstract.
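
As a purely illustrative placeholder for the free-parameter entry above, typical GRPO-style settings look something like the following; every value is an assumption, since the abstract does not enumerate them.

    # Placeholder GRPO-style hyperparameters; all values are assumptions for illustration.
    grpo_hparams = {
        "learning_rate": 1e-6,      # typical order of magnitude for RL fine-tuning of MLLMs
        "rollout_group_size": 8,    # completions sampled per question
        "clip_epsilon": 0.2,        # PPO/GRPO-style clipping range
        "kl_coefficient": 0.01,     # penalty toward the reference policy
        "batch_size": 128,
        "reward_scale": 1.0,
    }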

pith-pipeline@v0.9.0 · 5658 in / 1183 out tokens · 56384 ms · 2026-05-15T15:12:24.514112+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • DimensionForcing linking_requires_D3 · tag: unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Linked passage: "SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  2. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  3. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  4. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  5. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  6. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  7. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  8. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  9. Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

    cs.CV 2026-05 unverdicted novelty 6.0

    Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D...

  10. 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.

  11. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  12. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  13. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  14. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  15. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  16. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  17. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  18. MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

  19. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  20. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  21. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  22. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 20 Pith papers · 16 internal anchors

  1. [1]

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923

  3. [3]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271

  4. [4]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198

  5. [5]

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

  6. [6]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839

  7. [7]

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488

  8. [8]

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. 2025. Virgo: A preliminary exploration on reproducing o1-like mllm. arXiv preprint arXiv:2501.01904

  9. [9]

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. 2025. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776

  10. [10]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. 2024. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075

  11. [11]

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275

  12. [12]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  13. [13]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  14. [14]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720

  15. [15]

    Yang Jin, Zehuan Yuan, Yadong Mu, et al. 2022. Embracing consistency: A one-stage approach for spatio-temporal video grounding. Advances in Neural Information Processing Systems, 35:29192–29204

  16. [16]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

  17. [17]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. 2023. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005

  18. [18]

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. 2025. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765

  19. [19]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81

  20. [20]

    Kun-Yu Lin, Jia-Run Du, Yipeng Gao, Jiaming Zhou, and Wei-Shi Zheng. 2023. Diversifying spatial-temporal perception for video domain generalization. Advances in Neural Information Processing Systems, 36:56012–56026

  21. [21]

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268

  22. [22]

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. 2025. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982

  23. [23]

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. 2024. Tempcompass: Do video llms really understand videos? In Findings of the Association for Computational Linguistics ACL 2024, pages 8731–8772

  24. [24]

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang

  25. [25]

    Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785

  26. [26]

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. 2023. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761

  27. [27]

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. 2025. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536

  28. [28]

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649

  29. [29]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252

  30. [30]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  31. [31]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599

  32. [32]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. 2025. Kimi-vl technical report. arXiv preprint arXiv:2504.07491

  33. [33]

    Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1494–1504

  34. [34]

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. 2024. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857

  35. [35]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115

  36. [36]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2024. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171

  37. [37]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800

  38. [38]

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer

  39. [39]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986

  40. [40]

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. 2025. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106

  41. [41]

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. 2025. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976

  42. [42]–[45]

    Question templates from the paper's appendix (Figure 6: relative direction, appearance order, object/room size, absolute distance, room size, and counting), captured by reference extraction rather than true citations.