pith. sign in

arxiv: 2607.01784 · v1 · pith:OWZ6MIWJnew · submitted 2026-07-02 · 💻 cs.CV

SpaceEra++: A Unified Framework Towards 3D Spatial Reasoning in Video

Pith reviewed 2026-07-03 16:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D spatial reasoningvideo understandingvision-language modelsframe samplingspatial alignmentembodied interactionscene representation
0
0 comments X

The pith

SpaceEra++ adds selective video frame sampling and pairwise object alignment to improve 3D spatial reasoning from video in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends an earlier system to fix two limits in turning video into 3D scene understanding: videos often supply too little useful input, and training gives weak signals about object positions. ScenePick selects a compact set of frames that still cover the space and keep key objects visible. SpaceAlign adds training signals that force the model to respect both exact 3D coordinates and relative distances between objects at the same time. Tests on several benchmarks show the combined changes raise accuracy over earlier versions and over other strong models, while removing either part lowers results.

Core claim

SpaceEra++ overcomes insufficient video input and weak spatial constraints by introducing ScenePick, which samples frames to balance spatial coverage with semantic importance, and SpaceAlign, which jointly optimizes absolute coordinates and relative object relations during training, producing better 3D spatial understanding from video without altering base model size or data volume.

What carries the argument

ScenePick, a frame sampling strategy that balances spatial coverage with object semantics, together with SpaceAlign, a training step that enforces pairwise object constraints using both absolute coordinates and relative spatial relations.

If this is right

  • The combined use of ScenePick and SpaceAlign produces consistent accuracy gains across multiple spatial-reasoning benchmarks.
  • Removing either component individually reduces performance, confirming each contributes to the overall improvement.
  • The same design choices supply concrete directions for strengthening spatial capabilities in other video-based models.
  • The approach works without increasing training data size or replacing the underlying vision-language model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sampling and alignment steps succeed, similar selection rules could be tested on other video tasks such as action prediction or navigation planning.
  • The method may lower reliance on large 3D-labeled datasets by extracting more spatial signal from ordinary video.
  • Applying the same frame and constraint logic to real robot camera streams could test whether benchmark gains appear in physical environments.

Load-bearing premise

The assumption that insufficient scanning-video input and weak reasoning constraints are the primary bottlenecks and that ScenePick plus SpaceAlign will reliably fix them without changes to data scale or base model architecture.

What would settle it

A side-by-side test on the same benchmarks where the full SpaceEra++ model performs no better than the original SpaceEra or strong baselines, or where ablating ScenePick or SpaceAlign leaves accuracy unchanged, would show the new components do not deliver the claimed gains.

Figures

Figures reproduced from arXiv: 2607.01784 by Haoyu Zhang, Liqiang Nie, Meng Liu, Qianlong Xiang, Weili Guan, Yaowei Wang.

Figure 1
Figure 1. Figure 1: Two examples of spatial reasoning tasks: path planning (left) and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the SpaceEra++ framework. The blue text denotes newly [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The pipeline of ScanForgeQA data construction, which consists of scene construction, scan creation, and QA generation. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of our proposed ScenePick frame sampling strategy. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of our SpatailMind prompting strategy, which consists of two main steps: scene decomposition and question decomposition. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Parameter analysis under different hyperparameters [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of scene using different frame sampling methods. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Reward curves along training steps under different RL strategies. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: An example illustrating the thinking process of the SpaceAlign strategy, comparing predictions from Qwen2.5-VL-7B, the conference version [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
read the original abstract

Visual-spatial understanding, defined as the ability to infer object relationships and scene layouts from visual inputs, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, pre-trained vision-language models (VLMs) remain constrained by spatial uncertainty stemming from inherently 2D observations and by the scarcity of data for 3D spatial understanding. To address these limitations, we proposed a novel framework, SpaceEra, in the NeurIPS 2025 Spotlight paper. Although it achieved significant performance gains, we further observed that its effectiveness is hindered by insufficient input from scanning videos and weak reasoning constraints. To tackle these newly emerged challenges, we extend the original framework into a comprehensive system, termed SpaceEra++, which spans data construction, model design, training optimization, and prompting inference. Specifically, to alleviate input insufficiency, we introduce ScenePick, a frame sampling strategy that balances spatial coverage with object semantics to produce compact yet comprehensive scene representations. In addition, to enhance spatial reasoning, we develop SpaceAlign, which enforces pairwise object constraints by jointly exploiting absolute coordinates and relative spatial relations, thereby aligning optimization with spatial accuracy. Extensive experiments across multiple benchmarks demonstrate consistent improvements over strong baselines, while ablation studies validate both the individual and joint contributions of each component, and further analyses provide guidance for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper extends the prior SpaceEra framework (NeurIPS 2025 Spotlight) into SpaceEra++ to improve 3D spatial reasoning in video for VLMs. It identifies two new limitations (insufficient input from scanning videos and weak reasoning constraints) and introduces ScenePick (semantic-aware frame sampling for compact scene representations) and SpaceAlign (joint absolute+relative pairwise object constraints). The manuscript claims that these components, together with data construction, training optimization, and prompting, yield consistent gains over strong baselines on multiple benchmarks, with ablations confirming individual and joint contributions and additional analyses offering future guidance.

Significance. If the attribution of gains to ScenePick and SpaceAlign holds under controlled conditions, the work would offer a practical unified pipeline for mitigating 2D-to-3D spatial uncertainty in VLMs, building directly on a prior accepted paper and supplying concrete sampling and alignment techniques plus benchmark guidance. The emphasis on both data and model design elements is a strength.

major comments (1)
  1. [Abstract / Experiments section] The central attribution—that ScenePick and SpaceAlign reliably resolve the stated bottlenecks and produce the reported gains—requires controlled experiments that hold total training tokens, base VLM architecture, and training schedule fixed. The abstract asserts that ablation studies validate component contributions, but without such isolation the causal claim remains untested and load-bearing for the extension narrative.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our extension of the SpaceEra framework. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [Abstract / Experiments section] The central attribution—that ScenePick and SpaceAlign reliably resolve the stated bottlenecks and produce the reported gains—requires controlled experiments that hold total training tokens, base VLM architecture, and training schedule fixed. The abstract asserts that ablation studies validate component contributions, but without such isolation the causal claim remains untested and load-bearing for the extension narrative.

    Authors: We agree that rigorous isolation of ScenePick and SpaceAlign contributions is essential for the causal claims. All ablation variants in the manuscript use the identical base VLM architecture, the same pre-trained weights, and the exact same training schedule (optimizer, learning rate, epochs, and batch size). ScenePick operates purely at inference-time frame selection and does not change the per-sample token budget; SpaceAlign adds only a pairwise loss term without altering input token counts. We will revise the Experiments section to explicitly tabulate total training tokens per ablation variant and add a sentence confirming these controls, thereby strengthening the attribution without requiring new runs. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

full rationale

The paper introduces ScenePick and SpaceAlign as new modules extending prior SpaceEra work, with performance claims resting entirely on ablation studies and benchmark comparisons rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. The self-citation to the authors' NeurIPS 2025 paper provides background context but does not serve as load-bearing justification for the current results; no uniqueness theorems, ansatzes, or renamings are invoked. The derivation chain is self-contained because improvements are measured against independent external benchmarks and controlled ablations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5773 in / 1167 out tokens · 33351 ms · 2026-07-03T16:14:36.504253+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,”arXiv preprint arXiv:2412.14171, 2024

  2. [2]

    Llava-vsd: Large language-and-vision assistant for visual spatial description,

    Y . Jin, J. Li, J. Zhang, J. Hu, Z. Gan, X. Tan, Y . Liu, Y . Wang, C. Wang, and L. Ma, “Llava-vsd: Large language-and-vision assistant for visual spatial description,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 11 420–11 425

  3. [3]

    Drivevlm: The convergence of autonomous driving and large vision-language models,

    X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” in8th Annual Conference on Robot Learning

  4. [4]

    Attribute-guided collaborative learning for partial person re- identification,

    H. Zhang, M. Liu, Y . Li, M. Yan, Z. Gao, X. Chang, and L. Nie, “Attribute-guided collaborative learning for partial person re- identification,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 14 144–14 160, 2023

  5. [5]

    Palm-e: an embodied multimodal language model,

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yuet al., “Palm-e: an embodied multimodal language model,” inProceedings of the 40th International Conference on Machine Learning, 2023, pp. 8469–8488

  6. [6]

    Hourvideo: 1-hour video- language understanding,

    K. Chandrasegaran, A. Gupta, L. M. Hadzic, T. Kota, J. He, C. Eyza- guirre, Z. Durante, M. Li, J. Wu, and F.-F. Li, “Hourvideo: 1-hour video- language understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 53 168–53 197, 2024

  7. [7]

    Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding,

    H. Zhang, Q. Chu, M. Liu, H. Shi, Y . Wang, and L. Nie, “Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 502–12 510

  8. [8]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 455–14 465

  9. [9]

    Spatialrgpt: Grounded spatial reasoning in vision-language models,

    A.-C. Cheng, H. Yin, Y . Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, and S. Liu, “Spatialrgpt: Grounded spatial reasoning in vision-language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 135 062–135 093, 2025

  10. [10]

    Spatialbot: Precise spatial understanding with vision language models,

    W. Cai, I. Ponomarenko, J. Yuan, X. Li, W. Yang, H. Dong, and B. Zhao, “Spatialbot: Precise spatial understanding with vision language models,” arXiv preprint arXiv:2406.13642, 2024

  11. [11]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 428–26 438

  12. [12]

    3d-llm: Injecting the 3d world into large language models,

    Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 482–20 494, 2023

  13. [13]

    Gpt4scene: Under- stand 3d scenes from videos with vision-language models

    Z. Qi, Z. Zhang, Y . Fang, J. Wang, and H. Zhao, “Gpt4scene: Understand 3d scenes from videos with vision-language models,”arXiv preprint arXiv:2501.01428, 2025

  14. [14]

    Spatial understanding from videos: Structured prompts meet simulation data.arXiv preprint arXiv:2506.03642,

    H. Zhang, M. Liu, Z. Li, H. Wen, W. Guan, Y . Wang, and L. Nie, “Spatial understanding from videos: Structured prompts meet simulation data,” arXiv preprint arXiv:2506.03642, 2025

  15. [15]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14 vision-language models,

    M. Du, B. Wu, Z. Li, X.-J. Huang, and Z. Wei, “Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14 vision-language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2024, pp. 346–355

  16. [16]

    Sphere: Unveiling spatial blind spots in vision-language mod- els through hierarchical evaluation,

    W. Zhang, W. E. Ng, L. Ma, Y . Wang, J. Zhao, A. Koenecke, B. Li, and L. Wang, “Sphere: Unveiling spatial blind spots in vision-language mod- els through hierarchical evaluation,”arXiv preprint arXiv:2412.12693, 2024

  17. [17]

    Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors,

    C. Ma, K. Lu, T.-Y . Cheng, N. Trigoni, and A. Markham, “Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems

  18. [18]

    Rag-guided large language models for visual spatial description with adaptive hallucination corrector,

    J. Yu, Y . Zhang, Z. Zhang, Z. Yang, G. Zhao, F. Sun, F. Zhang, Q. Liu, J. Sun, J. Lianget al., “Rag-guided large language models for visual spatial description with adaptive hallucination corrector,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 11 407–11 413

  19. [19]

    Robopoint: A vision-language model for spatial affordance prediction in robotics,

    W. Yuan, J. Duan, V . Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox, “Robopoint: A vision-language model for spatial affordance prediction in robotics,” in8th Annual Conference on Robot Learning

  20. [20]

    Robospatial: Teaching spatial understanding to 2d and 3d vision- language models for robotics,

    C. H. Song, V . Blukis, J. Tremblay, S. Tyree, Y . Su, and S. Birchfield, “Robospatial: Teaching spatial understanding to 2d and 3d vision- language models for robotics,”arXiv preprint arXiv:2411.16537, 2024

  21. [21]

    Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning, 2025 b

    Y . Liu, D. Chi, S. Wu, Z. Zhang, Y . Hu, L. Zhang, Y . Zhang, S. Wu, T. Cao, G. Huanget al., “Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning,”arXiv preprint arXiv:2501.10074, 2025

  22. [22]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,

    G. Baruch, Z. Chen, A. Dehghan, Y . Feigin, P. Fu, T. Gebauer, D. Kurz, T. Dimry, B. Joffe, A. Schwartzet al., “Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)

  23. [23]

    Matterport3d: Learning from rgb-d data in indoor environments,

    A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niebner, M. Savva, S. Song, A. Zeng, and Y . Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” inInternational Conference on 3D Vision (3DV), 2017

  24. [24]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

  25. [25]

    Procthor: Large- scale embodied ai using procedural generation,

    M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, K. Ehsani, J. Salvador, W. Han, E. Kolve, A. Kembhavi, and R. Mottaghi, “Procthor: Large- scale embodied ai using procedural generation,”Advances in Neural Information Processing Systems, vol. 35, pp. 5982–5994, 2022

  26. [26]

    Multiscan: Scalable rgbd scanning for 3d environments with articulated objects,

    Y . Mao, Y . Zhang, H. Jiang, A. Chang, and M. Savva, “Multiscan: Scalable rgbd scanning for 3d environments with articulated objects,” Advances in neural information processing systems, vol. 35, pp. 9058– 9071, 2022

  27. [27]

    Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,

    S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Changet al., “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  28. [28]

    Scannet++: A high- fidelity dataset of 3d indoor scenes,

    C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high- fidelity dataset of 3d indoor scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12–22

  29. [29]

    Softgroup for 3d instance segmentation on point clouds,

    T. Vu, K. Kim, T. M. Luu, T. Nguyen, and C. D. Yoo, “Softgroup for 3d instance segmentation on point clouds,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 2708– 2717

  30. [30]

    Point transformer v3: Simpler faster stronger,

    X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao, “Point transformer v3: Simpler faster stronger,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4840–4851

  31. [31]

    Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance,

    P. Nguyen, T. D. Ngo, E. Kalogerakis, C. Gan, A. Tran, C. Pham, and K. Nguyen, “Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4018–4028

  32. [32]

    Unscene3d: Unsupervised 3d instance segmentation for indoor scenes,

    D. Rozenberszki, O. Litany, and A. Dai, “Unscene3d: Unsupervised 3d instance segmentation for indoor scenes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 957–19 967

  33. [33]

    arXiv preprint arXiv:2309.00615 , year=

    Z. Guo, R. Zhang, X. Zhu, Y . Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Liet al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,”arXiv preprint arXiv:2309.00615, 2023

  34. [34]

    Shapellm: Universal 3d object understanding for embodied interaction,

    Z. Qi, R. Dong, S. Zhang, H. Geng, C. Han, Z. Ge, L. Yi, and K. Ma, “Shapellm: Universal 3d object understanding for embodied interaction,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 214– 238

  35. [35]

    Lion: Linear group rnn for 3d object detection in point clouds,

    Z. Liu, J. Hou, X. Wang, X. Ye, J. Wang, H. Zhao, and X. Bai, “Lion: Linear group rnn for 3d object detection in point clouds,”Advances in Neural Information Processing Systems, vol. 37, pp. 13 601–13 626, 2024

  36. [36]

    Pointllm: Empowering large language models to understand point clouds,

    R. Xu, X. Wang, T. Wang, Y . Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” in European Conference on Computer Vision. Springer, 2024, pp. 131– 147

  37. [37]

    Lexicon3d: Probing visual foundation models for complex 3d scene understanding,

    Y . Man, S. Zheng, Z. Bao, M. Hebert, L. Gui, and Y .-X. Wang, “Lexicon3d: Probing visual foundation models for complex 3d scene understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 76 819–76 847, 2024

  38. [38]

    Improved visual-spatial reasoning via r1-zero-like training,

    Z. Liao, Q. Xie, Y . Zhang, Z. Kong, H. Lu, Z. Yang, and Z. Deng, “Improved visual-spatial reasoning via r1-zero-like training,”arXiv preprint arXiv:2504.00883, 2025

  39. [39]

    Struct2d: A perception-guided framework for spatial reasoning in large multimodal models.arXiv preprint arXiv:2506.04220,

    F. Zhu, H. Wang, Y . Xie, J. Gu, T. Ding, J. Yang, and H. Jiang, “Struct2d: A perception-guided framework for spatial reasoning in large multimodal models,”arXiv preprint arXiv:2506.04220, 2025

  40. [40]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    K. Ouyang, Y . Liu, H. Wu, Y . Liu, H. Zhou, J. Zhou, F. Meng, and X. Sun, “Spacer: Reinforcing mllms in video spatial reasoning,”arXiv preprint arXiv:2504.01805, 2025

  41. [41]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carneyet al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

  42. [42]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1 incentivizes reasoning in llms through reinforcement learning,”Nature, vol. 645, no. 8081, pp. 633–638, 2025

  43. [43]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liaoet al., “Kimi k1. 5: Scaling reinforcement learning with llms,”arXiv preprint arXiv:2501.12599, 2025

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

  45. [45]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Y . Peng, G. Zhang, M. Zhang, Z. You, J. Liu, Q. Zhu, K. Yang, X. Xu, X. Geng, and X. Yang, “Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl,”arXiv preprint arXiv:2503.07536, 2025

  46. [46]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    D. Wu, F. Liu, Y .-H. Hung, and Y . Duan, “Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence,”arXiv preprint arXiv:2505.23747, 2025

  47. [47]

    Spartun3d: Situated spatial understanding of 3d world in large language models,

    Y . Zhang, Z. Xu, Y . Shen, P. Kordjamshidi, and L. Huang, “Spartun3d: Situated spatial understanding of 3d world in large language models,” arXiv preprint arXiv:2410.03878, 2024

  48. [48]

    Multi- modal situated reasoning in 3d scenes,

    X. Linghu, J. Huang, X. Niu, X. S. Ma, B. Jia, and S. Huang, “Multi- modal situated reasoning in 3d scenes,”Advances in Neural Information Processing Systems, vol. 37, pp. 140 903–140 936, 2024

  49. [49]

    3d-front: 3d furnished rooms with layouts and semantics,

    H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhaoet al., “3d-front: 3d furnished rooms with layouts and semantics,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 933–10 942

  50. [50]

    Holodeck: Language guided generation of 3d embodied ai environments,

    Y . Yang, F.-Y . Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liuet al., “Holodeck: Language guided generation of 3d embodied ai environments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16 227–16 237

  51. [51]

    Objaverse: A universe of annotated 3d objects,

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 13 142–13 153

  52. [52]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294– 5306

  53. [53]

    Openeqa: Embod- ied question answering in the era of foundation models,

    A. Majumdar, A. Ajay, X. Zhang, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. Mcvay, O. Maksymets, S. Arnaudet al., “Openeqa: Embod- ied question answering in the era of foundation models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 16 488–16 498

  54. [54]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” inproceedings of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 15 the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 19 129–19 139

  55. [55]

    Sqa3d: Situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,” inThe Eleventh International Conference on Learning Representations

  56. [56]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

  57. [57]

    SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

    P. Sun, S. Lang, D. Wu, Y . Ding, K. Feng, H. Liu, Z. Ye, R. Liu, Y .-H. Liu, J. Wanget al., “Spacevista: All-scale visual spatial reasoning from mm to km,”arXiv preprint arXiv:2510.09606, 2025

  58. [58]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liuet al., “Expanding performance boundaries of open- source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024

  59. [59]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  60. [60]

    Qwen3-VL Technical Report

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025. Weili Guanreceived the master’s degree from National University of Singapore, and the Ph.D. degree from Monash University. She has about 6 years of working experience at the enterprise. She is curr...

  61. [61]

    degree at the School of Information Science and Techonology, Harbin Institute of Techonology, Shenzhen, China

    He is currently pursuing the Ph.D. degree at the School of Information Science and Techonology, Harbin Institute of Techonology, Shenzhen, China. His research has been published in top-tier confer- ences including CVPR. He has served as a reviewer for various conferences and journals, such as IEEE TPAMI, ACM MM and IEEE TCSVT. His main research interests ...