Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

2025 · arXiv 2503.01773

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.

3D Primitives are a Spatial Language for VLMs

cs.CV · 2026-05-12 · conditional · novelty 7.0

3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

cs.CL · 2025-09-18 · conditional · novelty 7.0

V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.

Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.

AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

SatAgent is a UAV-satellite collaborative spatial reasoning model using geometric 3D encoding, multi-view alignment, and a new 130K dataset that reports 25.91% and 11.69% gains over general and specialized baselines.

Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework

cs.HC · 2026-06-29 · unverdicted · novelty 6.0

Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.

Self-Improving Small Object Grounding in LVLMs

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Attention maps in LVLMs enable an IoU regressor (Pearson r > 0.67) and a training-free entropy-based selector that improves small-object localization by up to 19% on COCO and Objects365.

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

EAGLE achieves up to 94.4% anomaly detection accuracy on MVTec-AD and 88.1% on VisA by guiding frozen MLLMs with expert-derived thresholds and confidence-aware attention without parameter updates.
