Spatialcot: Advancing spatial reasoning through coordinate alignment and chain- of-thought for embodied task planning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al · 2025 · arXiv 2501.10074

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D VQA, grounding, and spatial benchmarks with shorter sequences.

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.

Spatio-Temporal Grounding of Large Language Models from Perception Streams

cs.RO · 2026-04-08 · unverdicted · novelty 6.0

FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 while staying far smaller.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

citing papers explorer

Showing 7 of 7 citing papers.

Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 57
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 31
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images cs.CV · 2026-05-12 · unverdicted · none · ref 31
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment cs.CV · 2026-05-08 · unverdicted · none · ref 24
Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D VQA, grounding, and spatial benchmarks with shorter sequences.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 14
4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.
Spatio-Temporal Grounding of Large Language Models from Perception Streams cs.RO · 2026-04-08 · unverdicted · none · ref 18
FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 while staying far smaller.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 141
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain- of-thought for embodied task planning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer