Can Large Vision Language Models Read Maps Like a Human?

Dezhen Song; Jiachen Li; Kaiyuan Chen; Shuangyu Xie; Shuo Xing; Yanjia Huang; Yuping Wang; Zezhou Sun; Zhengzhong Tu

arxiv: 2503.14607 · v1 · pith:6LA4VCVCnew · submitted 2025-03-18 · 💻 cs.CV

keywords: mapbench, lvlms, navigation, dataset, finding, language, maps, path

In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at https://github.com/taco-group/MapBench.
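
To make the task concrete, here is a minimal sketch of how a graph over landmarks could support path finding and route checking. It is only an illustration under stated assumptions: the abstract does not specify the MSSG format, so the adjacency-dict layout, the landmark names, and the helper functions `shortest_path` and `is_valid_route` are all hypothetical and are not taken from the MapBench code.

```python
from collections import deque

# Hypothetical toy graph (an assumption, not MapBench data): nodes are landmarks,
# and an edge means two landmarks are directly connected by a walkable path.
TOY_GRAPH = {
    "Entrance": ["Fountain", "Cafe"],
    "Fountain": ["Entrance", "Museum"],
    "Cafe": ["Entrance", "Museum"],
    "Museum": ["Fountain", "Cafe", "Garden"],
    "Garden": ["Museum"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return a fewest-hop landmark sequence, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def is_valid_route(graph, route, start, goal):
    """Check that a route starts and ends correctly and that every hop is an edge."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph.get(a, []) for a, b in zip(route, route[1:]))


if __name__ == "__main__":
    print(shortest_path(TOY_GRAPH, "Entrance", "Garden"))
    # A route parsed from model output (parsing is out of scope for this sketch).
    print(is_valid_route(TOY_GRAPH, ["Entrance", "Cafe", "Museum", "Garden"],
                         "Entrance", "Garden"))
```

In this sketch, a model's language instructions would first have to be mapped to a landmark sequence before `is_valid_route` can be applied; how MapBench performs that conversion and scores the result is not described in the abstract.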

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MapReason-OSM: Can Vision-Language Models Make Graph-Verifiable Mobility Decisions from Street Maps?
cs.CV 2026-06 unverdicted novelty 7.0

MapReason-OSM supplies 6000 graph-verifiable instances across 12 mobility tasks on rendered OSM maps from 10 U.S. downtowns and shows that seven VLMs succeed at simple routing but perform near chance on cost-based fac...
Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation
physics.soc-ph 2026-06 unverdicted novelty 7.0

A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.
Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
cs.CV 2026-06 unverdicted novelty 7.0

PERIA augments VLMs with vision perception and interaction tools trained via supervised trajectories and OR-GIGPO to deliver 10% and 4.4% gains on in- and out-of-distribution spatial reasoning benchmarks.
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
cs.CL 2026-05 unverdicted novelty 7.0

TransitLM is a large-scale dataset and benchmark for training LLMs to generate structurally valid map-free transit routes from origin-destination pairs.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
TraversalBench: Challenging Paths to Follow for Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

TraversalBench shows self-intersections cause the sharpest performance drops for VLMs on exact path traversal, with errors localized at the first crossing.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 conditional novelty 6.0

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 unverdicted novelty 6.0

MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models
cs.AI 2025-11 unverdicted novelty 6.0

RLVR on synthetic mazes enables VLMs to solve spatial reasoning tasks unreachable by the base model and generalizes to real-world navigation benchmarks.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.