hub

Sat: Spa- tial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 3

Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al · 2024 · arXiv 2412.07755

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

read on arXiv browse 18 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

cs.AI · 2025-03-07 · conditional · novelty 7.0

RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.

MolmoAct2: Action Reasoning Models for Real-world Deployment

cs.RO · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

cs.AI · 2026-02-12 · unverdicted · novelty 6.0

MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

cs.CL · 2026-02-04 · unverdicted · novelty 6.0

Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.

Mull-Tokens: Modality-Agnostic Latent Thinking

cs.CV · 2025-12-11 · unverdicted · novelty 6.0

Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

cs.CV · 2025-05-22 · unverdicted · novelty 6.0

Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

cs.CL · 2025-01-13 · unverdicted · novelty 6.0

MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures while preserving language capabilities.

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

cs.CV · 2026-05-12 · unverdicted · novelty 4.0

LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.

citing papers explorer

Showing 18 of 18 citing papers.

Exploring Spatial Intelligence from a Generative Perspective cs.CV · 2026-04-22 · unverdicted · none · ref 24
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 38
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model cs.AI · 2025-03-07 · conditional · none · ref 10
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs cs.AI · 2026-05-22 · unverdicted · none · ref 25
VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.
MolmoAct2: Action Reasoning Models for Real-world Deployment cs.RO · 2026-05-04 · unverdicted · none · ref 30 · 2 links
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning cs.CV · 2026-04-29 · unverdicted · none · ref 41
Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CV · 2026-03-28 · unverdicted · none · ref 40
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation cs.AI · 2026-02-12 · unverdicted · none · ref 27
MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 27
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV · 2025-12-11 · unverdicted · none · ref 48
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO · 2025-11-20 · unverdicted · none · ref 44
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation cs.RO · 2025-08-19 · conditional · none · ref 28
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 50
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models cs.CV · 2025-05-22 · unverdicted · none · ref 54
Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought cs.CL · 2025-01-13 · unverdicted · none · ref 17
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
GeoWorld-VLM: Geometry from World Models for Vision-Language Models cs.CV · 2026-05-15 · unverdicted · none · ref 35
GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures while preserving language capabilities.
LychSim: A Controllable and Interactive Simulation Framework for Vision Research cs.CV · 2026-05-12 · unverdicted · none · ref 42
LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 83
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.

Sat: Spa- tial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 3

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer