pith. sign in

How far are vlms from visual spatial intel- ligence? a benchmark-driven perspective

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

citation-role summary

background 2

citation-polarity summary

fields

cs.CV 7

years

2026 7

roles

background 2

polarities

background 1 unclear 1

representative citing papers

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.

Why MLLMs Struggle to Determine Object Orientations

cs.CV · 2026-04-14 · accept · novelty 7.0

Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

citing papers explorer

Showing 7 of 7 citing papers.

  • Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs cs.CV · 2026-05-19 · unverdicted · none · ref 44

    Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.

  • Why MLLMs Struggle to Determine Object Orientations cs.CV · 2026-04-14 · accept · none · ref 37

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  • Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 116

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  • SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 61

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  • Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 83

    Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

  • SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 71

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  • World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning cs.CV · 2026-04-29 · unverdicted · none · ref 61

    Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.