STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao · 2025 · arXiv 2503.23765

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes a Spatial Narrative Score (SNS) for evaluating VLMs' camera-motion understanding and introduces the CaMo model, which performs consistently on both SNS and direct QA.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

cs.AI · 2025-11-26 · unverdicted · novelty 7.0

SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

cs.CV · 2025-04-02 · unverdicted · novelty 7.0

SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

cs.CV · 2025-12-11 · conditional · novelty 6.0

SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

cs.CV · 2025-05-29 · unverdicted · novelty 6.0 · 2 refs

Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.

citing papers explorer

Showing 10 of 10 citing papers.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models · cs.CV · 2026-05-19 · unverdicted · none · ref 21
Proposes a Spatial Narrative Score (SNS) for evaluating VLMs' camera-motion understanding and introduces the CaMo model, which performs consistently on both SNS and direct QA.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models · cs.CV · 2026-05-11 · unverdicted · none · ref 24
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

SCP: Spatial Causal Prediction in Video · cs.CV · 2026-03-04 · unverdicted · none · ref 30
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition · cs.AI · 2025-11-26 · unverdicted · none · ref 41
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning · cs.CV · 2025-04-02 · unverdicted · none · ref 18
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly · cs.CV · 2026-05-20 · unverdicted · none · ref 13
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models · cs.CV · 2026-04-08 · unverdicted · none · ref 20
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning · cs.CV · 2026-03-28 · unverdicted · none · ref 28
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving · cs.CV · 2025-12-11 · conditional · none · ref 36
SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence · cs.CV · 2025-05-29 · unverdicted · none · ref 53 · 2 links
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
