
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou · 2025 · cs.CV · arXiv 2504.01805

Mixed citation behavior. Most common role is background (67%).

29 Pith papers citing it

Background: 67% of classified citations


abstract

Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.
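As context for the GRPO component named in the abstract, the following is a minimal Python sketch of group-relative advantage normalization, the idea GRPO is generally known for: each sampled answer's reward is standardized against the other answers drawn for the same prompt. This is not SpaceR's implementation. The function name, the choice of population standard deviation, the epsilon term, and the example rewards are all assumptions for illustration, and the map imagination mechanism is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group (GRPO-style).

    Assumption: population standard deviation plus a small epsilon;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards (1.0 = correct, 0.0 = wrong) for four
# answers sampled for the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, so no separate learned value model is needed to obtain a baseline.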


citation-role summary

background 10 · baseline 3 · dataset 2
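The 67% background share quoted on this page can be checked against the role counts above. The short Python sketch below assumes that the percentage is taken over the 15 classified citations (10 + 3 + 2) rather than over all 29 citing papers; the variable names are illustrative only.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 10, "baseline": 3, "dataset": 2}

# Assumption: shares are computed over classified citations only.
total = sum(role_counts.values())
shares = {role: round(100 * n / total) for role, n in role_counts.items()}
```

Under that assumption, background accounts for 10 of 15 classified citations, which rounds to the 67% shown.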

citation-polarity summary

background 10 · baseline 3 · use dataset 2

representative citing papers

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

4DThinker enables VLMs to perform dynamic spatial reasoning by thinking with 4D latent mental imagery using new fine-tuning and reinforcement learning methods.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

Motion-o: Trajectory-Grounded Video Reasoning

cs.CV · 2026-03-19 · conditional · novelty 7.0

Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

cs.CV · 2025-12-18 · unverdicted · novelty 7.0

4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

cs.CV · 2025-05-22 · conditional · novelty 7.0

Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.

Cambrian-P: Pose-Grounded Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D VQA, grounding, and spatial benchmarks with shorter sequences.

Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

cs.CV · 2026-03-18 · unverdicted · novelty 6.0

Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

cs.CL · 2026-02-04 · unverdicted · novelty 6.0

Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.

Cambrian-S: Towards Spatial Supersensing in Video

cs.CV · 2025-11-06 · unverdicted · novelty 6.0

Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

cs.CV · 2026-04-19 · unverdicted · novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

citing papers explorer

Showing 29 of 29 citing papers.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 28 · internal anchor
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Count Anything at Any Granularity cs.CV · 2026-05-11 · unverdicted · none · ref 60 · internal anchor
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 15 · 2 links · internal anchor
4DThinker enables VLMs to perform dynamic spatial reasoning by thinking with 4D latent mental imagery using new fine-tuning and reinforcement learning methods.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 88 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 72 · internal anchor
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
Motion-o: Trajectory-Grounded Video Reasoning cs.CV · 2026-03-19 · conditional · none · ref 15 · internal anchor
Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 35 · internal anchor
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation cs.CV · 2025-12-18 · unverdicted · none · ref 14 · internal anchor
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 63 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 69 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images cs.CV · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 41 · internal anchor
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment cs.CV · 2026-05-08 · unverdicted · none · ref 29 · internal anchor
Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D VQA, grounding, and spatial benchmarks with shorter sequences.
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV · 2026-04-07 · unverdicted · none · ref 30 · internal anchor
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding cs.CV · 2026-03-18 · unverdicted · none · ref 45 · internal anchor
Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 12 · internal anchor
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
Cambrian-S: Towards Spatial Supersensing in Video cs.CV · 2025-11-06 · unverdicted · none · ref 99 · internal anchor
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency cs.CV · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 31 · 4 links · internal anchor
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 51 · internal anchor
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning cs.CV · 2026-04-19 · unverdicted · none · ref 37 · internal anchor
SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding cs.CV · 2026-04-10 · unverdicted · none · ref 34 · internal anchor
MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 34 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs cs.CV · 2026-03-29 · unverdicted · none · ref 28 · internal anchor
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning cs.CV · 2025-12-03 · unverdicted · none · ref 40 · internal anchor
TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 62 · 2 links · internal anchor
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 74 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CV · 2026-04-01 · unreviewed · ref 32 · internal anchor
