arxiv: 2602.18600 · v3 · pith:F7OKRRH4new · submitted 2026-02-20 · 💻 cs.LG

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Pith reviewed 2026-05-22 10:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords multimodal large language models · multi-criteria reasoning · route planning · benchmark · map images · tabular data · visual perception · heterogeneous graphs

The pith

MapTab shows current MLLMs struggle with multi-criteria multimodal reasoning on route planning that mixes map images and tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MapTab, a benchmark that tests multimodal large language models on route planning tasks in which they must balance four criteria (Time, Price, Comfort, and Reliability) by grounding visual information from map images in structured attributes from tables. It covers metro networks in 160 cities and 168 tourist attractions, for a total of 196,800 route planning queries and 3,936 QA queries. Evaluations of 15 models find substantial difficulties with this holistic reasoning, and multimodal approaches often perform worse than unimodal ones when visual perception is limited. This matters because route planning reflects real decision tasks in which models must weigh competing factors from heterogeneous sources.

Core claim

MapTab is a multimodal benchmark for holistic multi-criteria reasoning in MLLMs via route planning that requires perceiving visual cues from map images alongside route attributes from tabular data. The benchmark includes Metromap for metro networks in 160 cities across 52 countries and Travelmap for 168 tourist attractions across 19 countries, with 328 images, 196,800 route planning queries, and 3,936 QA queries using the criteria Time, Price, Comfort, and Reliability. Extensive evaluations across 15 MLLMs show current models face substantial challenges in multi-criteria multimodal reasoning, and multimodal collaboration often underperforms unimodal approaches under limited visual perception.
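
To make the task concrete, the following is a minimal sketch of multi-criteria route selection over a graph whose edges carry tabular attributes. It is not the benchmark's scoring method, which this page does not describe: the toy network, the weighted-sum objective, the encoding of Comfort and Reliability as penalties, and the names `EDGES` and `best_route` are all assumptions made for illustration.

```python
import heapq

# Hypothetical toy network. Each edge carries one cost per criterion,
# lower is better. Encoding Comfort and Reliability as penalties is an
# assumption of this sketch, not something taken from the paper.
EDGES = {
    ("A", "B"): {"time": 5, "price": 2, "comfort": 1, "reliability": 1},
    ("B", "D"): {"time": 5, "price": 2, "comfort": 1, "reliability": 1},
    ("A", "C"): {"time": 3, "price": 6, "comfort": 2, "reliability": 2},
    ("C", "D"): {"time": 3, "price": 6, "comfort": 2, "reliability": 2},
}


def build_graph(edges):
    """Turn the edge table into an undirected adjacency list."""
    graph = {}
    for (u, v), attrs in edges.items():
        graph.setdefault(u, []).append((v, attrs))
        graph.setdefault(v, []).append((u, attrs))
    return graph


def best_route(edges, source, target, weights):
    """Dijkstra's algorithm on a weighted sum of the per-criterion costs."""
    graph = build_graph(edges)
    heap = [(0.0, source, [source])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, attrs in graph.get(node, []):
            if nxt not in settled:
                step = sum(weights[k] * attrs[k] for k in weights)
                heapq.heappush(heap, (cost + step, nxt, path + [nxt]))
    return float("inf"), []


time_only = {"time": 1, "price": 0, "comfort": 0, "reliability": 0}
price_only = {"time": 0, "price": 1, "comfort": 0, "reliability": 0}
print(best_route(EDGES, "A", "D", time_only))   # fastest route goes via C
print(best_route(EDGES, "A", "D", price_only))  # cheapest route goes via B
```

The two calls show why the criteria compete: the same source and destination yield different routes depending on which attribute the query prioritizes.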

What carries the argument

MapTab benchmark, which pairs map images for visual grounding with tabular route attributes to test multi-criteria reasoning in heterogeneous graphs.

If this is right

  • Multimodal collaboration does not reliably outperform unimodal processing when visual perception is constrained.
  • Current MLLMs have difficulty integrating visual map cues with quantitative tabular attributes for balanced decision-making.
  • Benchmarks like MapTab can serve as a testbed to measure progress in realistic multi-criteria planning tasks.
  • Improvements in visual grounding are needed before MLLMs can handle heterogeneous data sources in route planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The findings suggest that training data for MLLMs should emphasize more examples of map and table integration to reduce the observed multimodal penalty.
  • MapTab-style tasks could be adapted to evaluate models on other multi-criteria problems such as logistics scheduling or urban resource allocation.
  • Developers might prioritize separate visual perception modules that feed cleaner signals into reasoning stages rather than end-to-end multimodal fusion.
  • The benchmark points to broader difficulties MLLMs face when processing real-world graphs that combine spatial visuals with attribute tables.

Load-bearing premise

The constructed queries and criteria, together with the visual grounding from map images, provide a faithful and unbiased measure of holistic multi-criteria reasoning in MLLMs, free of artifacts from data generation or query design.

What would settle it

A controlled experiment where models are retrained specifically on MapTab-style map-tabular pairs and then retested to check if the reported performance gaps close or persist.

Figures

Figures reproduced from arXiv: 2602.18600 by Bin Liu, Lan-Zhe Guo, Lingyue Ge, Shi-Yu Tian, Weiming Wu, Wenbo Fu, Xiangwen Zhang, Yang Chen, Yu-Feng Li, Yulan Hu, Zhenyu Huang, Zi-Jian Cheng, Ziqiao Shang.

Figure 1: Composition and Statistical Overview of the MapTab Benchmark. MapTab ...
Figure 2: Schematic overview of the MapTab construction pipeline, comprising 5 main steps: ...
Figure 3: Impact of Image Resolution on RP and QA Tasks in Metromap and Travelmap Sce...
Figure 4: Distribution of model accuracy across Map Difficulty and Query Difficulty under different ...
Figure 5: Model accuracy matrix under different combinations of Map Difficulty and Query Difficulty.
Figure 6: All images have the exact same height. Widths are adjusted automatically.
Figure 7: Error Case 1: Wrong source and destination
Figure 8: Error Case 2: Missed Transfer Label
Figure 9: Error Case 3: Wrong source and destination
Figure 10: Error Case 4: Illegal Line Jumping
Figure 11: Error Case 5: Overthinking
Figure 12: Error Case 6: Failed multi-criteria Reasoning

Original abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MapTab, a multimodal benchmark for evaluating MLLMs on multi-criteria route planning in heterogeneous graphs. It comprises 328 map images (Metromap metro networks across 160 cities and Travelmap attractions), 196800 route-planning queries, and 3936 QA items using four fixed criteria (Time, Price, Comfort, Reliability). Evaluations across 15 MLLMs claim that current models face substantial challenges in multi-criteria multimodal reasoning and that multimodal inputs often underperform unimodal ones under limited visual perception.

Significance. If the queries prove free of systematic artifacts from programmatic generation over fixed criteria and map renderings, the benchmark would offer a useful large-scale testbed for probing how MLLMs integrate visual and tabular information under realistic constraints. The scale, the open code release, and the counter-intuitive observation that multimodal inputs can underperform unimodal ones could usefully guide future work on multimodal reasoning.

major comments (2)
  1. [Abstract] Abstract: the headline claims of 'substantial challenges' and multimodal underperformance rest on 196800 queries and 3936 QA items, yet no details are supplied on query-generation mechanics, statistical controls, or operationalization of 'limited visual perception.' This directly affects whether the reported performance gaps can be attributed to intrinsic reasoning limits.
  2. [Experiments] Experiments section: no ablation is presented that holds attribute values constant while removing the map image or randomizes criteria weights per query. Without such controls, surface correlations between textual attributes, topology, or rendering choices cannot be ruled out as drivers of success/failure, which is load-bearing for the central claim that results reflect genuine multi-criteria grounding.
minor comments (2)
  1. [Related Work] Add explicit comparison tables against prior multimodal reasoning benchmarks to sharpen the novelty claim.
  2. [Dataset Construction] Clarify in the dataset statistics whether all 328 images are used uniformly across the 196800 queries or whether some images contribute disproportionately.
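
The uniformity question in the second minor comment can be given a quick arithmetic check. The sketch below uses only the counts quoted on this page. It shows that the totals are consistent with a uniform allocation, not that the allocation actually is uniform, and treating each city and each attraction as contributing exactly one image is an assumption.

```python
# Counts quoted in the abstract.
metromap_cities = 160
travelmap_attractions = 168
total_images = 328
route_queries = 196_800
qa_queries = 3_936

# Assumption: one image per city and one per attraction.
images = metromap_cities + travelmap_attractions

# divmod returns (quotient, remainder); a zero remainder means the
# totals divide evenly across the images.
rp_per_image, rp_rem = divmod(route_queries, images)
qa_per_image, qa_rem = divmod(qa_queries, images)

print(images == total_images)   # True
print(rp_per_image, rp_rem)     # 600 0
print(qa_per_image, qa_rem)     # 12 0
```

Both totals divide evenly by 328, so a per-image breakdown in the dataset statistics would be enough to confirm or rule out disproportionate contributions.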

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have carefully reviewed the major comments and provide point-by-point responses below. Where appropriate, we outline specific revisions that will be incorporated into the next version of the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claims of 'substantial challenges' and multimodal underperformance rest on 196800 queries and 3936 QA items, yet no details are supplied on query-generation mechanics, statistical controls, or operationalization of 'limited visual perception.' This directly affects whether the reported performance gaps can be attributed to intrinsic reasoning limits.

    Authors: We appreciate the referee highlighting the need for greater transparency in the abstract. The query-generation mechanics, including the programmatic construction of the 196,800 route-planning queries and 3,936 QA items over the four fixed criteria (Time, Price, Comfort, Reliability) and the two map scenarios (Metromap and Travelmap), are described in detail in Section 3 of the manuscript. Statistical controls for query validity and diversity are also outlined there, along with the definition of limited visual perception as the use of low-resolution or cropped map images that restrict full visual access to topological and attribute information. To directly address the concern, we will revise the abstract to include a concise summary of the query-generation process and the operationalization of limited visual perception, ensuring readers can immediately assess the grounding of the reported performance gaps. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation is presented that holds attribute values constant while removing the map image or randomizes criteria weights per query. Without such controls, surface correlations between textual attributes, topology, or rendering choices cannot be ruled out as drivers of success/failure, which is load-bearing for the central claim that results reflect genuine multi-criteria grounding.

    Authors: We agree that these additional controls would strengthen the central claim regarding genuine multi-criteria multimodal reasoning. The current manuscript includes multimodal versus unimodal comparisons and evaluations across 15 MLLMs, but does not contain the specific ablations suggested. In the revised version, we will add two new ablation studies in the Experiments section: (1) an ablation that holds all attribute values (Time, Price, Comfort, Reliability) constant while removing the map image entirely, and (2) an ablation that randomizes the criteria weights on a per-query basis. These will help isolate whether performance differences arise from multimodal integration or from potential surface correlations with textual attributes, topology, or rendering choices. revision: yes
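
The two ablations promised above can be laid out as a paired design. The sketch below is a hypothetical illustration, not the authors' protocol: the function names, the uniform sampling range, and the condition fields are all assumptions. Each query receives one random weight vector over the four criteria and is then evaluated twice, with and without the map image, so that the attribute table and weights stay fixed across the pair.

```python
import random

CRITERIA = ("time", "price", "comfort", "reliability")


def random_weights(rng):
    """Sample positive weights and normalize them to sum to 1.

    The sampling range is an arbitrary choice for this sketch.
    """
    raw = [rng.uniform(0.05, 1.0) for _ in CRITERIA]
    total = sum(raw)
    return {c: r / total for c, r in zip(CRITERIA, raw)}


def ablation_conditions(query_ids, seed=0):
    """Build paired with-map / without-map conditions for each query."""
    rng = random.Random(seed)  # seeded so the design is reproducible
    conditions = []
    for qid in query_ids:
        weights = random_weights(rng)  # one weight vector per query
        for use_map in (True, False):
            conditions.append(
                {"query_id": qid, "use_map_image": use_map, "weights": weights}
            )
    return conditions


conds = ablation_conditions(["q1", "q2"])
for cond in conds:
    print(cond["query_id"], cond["use_map_image"], cond["weights"])
```

Pairing the two image conditions under identical weights means any accuracy difference within a pair can be attributed to the image input and not to a change in the query.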

Circularity Check

0 steps flagged

No circularity: new benchmark and direct empirical evaluations

The paper introduces MapTab as a fresh multimodal benchmark consisting of 328 map images, 196800 programmatically generated route-planning queries, and 3936 QA items over Metromap and Travelmap scenarios with four fixed criteria. All reported results are direct performance measurements of 15 MLLMs on this newly constructed dataset; no parameters are fitted to subsets of the target data, no predictions are derived from prior fits, and no self-citation chain is invoked to justify uniqueness or force the central claims. The evaluation methodology is therefore self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that route-planning queries with the chosen criteria constitute a valid proxy for general multi-criteria multimodal reasoning; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Route planning tasks with visual maps and tabular attributes can serve as a rigorous test of holistic multi-criteria reasoning in MLLMs.
    Invoked in the abstract when the benchmark is positioned as evaluating reasoning capabilities under multi-criteria constraints.

pith-pipeline@v0.9.0 · 5817 in / 1285 out tokens · 45855 ms · 2026-05-22T10:25:54.672433+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

Reference graph

Works this paper leans on

211 extracted references · 211 canonical work pages · cited by 1 Pith paper · 21 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

  2. [2]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

  3. [3]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

    Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

  5. [5]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  7. [7]

    doubao-seed-1.6-thinking

    ByteDance. doubao-seed-1.6-thinking. https://www.volcengine.com/docs/82379/ 1593702?utm_source=chatgpt.com&lang=zh, 2025. 10

  8. [8]

    Seed1.6: Tech introduction

    ByteDance Seed Team. Seed1.6: Tech introduction. https://seed.bytedance.com/en/ seed1_6, June 2025. Model ID: doubao-seed-1-6-251015. Accessed: 2025-12-25

  9. [9]

    Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

    Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

  10. [10]

    Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

    Chao Cao, Hongbiao Zhu, Zhongqiang Ren, Howie Choset, and Ji Zhang. Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

  11. [11]

    Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

    Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 21819–21830, 2024

  12. [12]

    Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

    Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

  13. [13]

    Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

    Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

  14. [14]

    Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

    Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

  15. [15]

    Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

    Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

  16. [16]

    ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

    Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

  17. [17]

    A survey on multimodal large language models for autonomous driving

    Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024

  18. [18]

    MapEval: A Map-Based Evaluation of Geo-Spatial Reason- ing in Foundation Models, 2025

    Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mape- val: A map-based evaluation of geo-spatial reasoning in foundation models.arXiv preprint arXiv:2501.00316, 2024

  19. [19]

    Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

    Bowen Fang, Zixiao Yang, and Xuan Di. Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

  20. [20]

    Citybench: Evaluating the capabilities of large language models for urban tasks

    Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language models for urban tasks. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5413–5424, 2025

  21. [21]

    Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

    Jie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

  22. [22]

    Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

    Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

  23. [23]

    Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025

    Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025. 11

  24. [24]

    Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

    Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

  25. [25]

    Drive like a human: Rethinking autonomous driving with large language models

    Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024

  26. [26]

    Gemini 3 flash: Frontier intelligence built for speed

    Google. Gemini 3 flash: Frontier intelligence built for speed. https://blog. google/products-and-platforms/products/gemini/gemini-3-flash/ , December

  27. [27]

    Accessed: 2025-12-25

    Model ID: gemini-3-flash-preview. Accessed: 2025-12-25

  28. [28]

    Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T Kwok, and Yu Zhang. Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

  29. [29]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  30. [30]

    Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

    Lingfeng Guo, Zihan Li, and Shengjie Min. Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

  31. [31]

    R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

    Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, et al. R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

  32. [32]

    Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

    Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

  33. [33]

    Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

    Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

  34. [34]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

  35. [35]

    Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

    Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

  36. [36]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  37. [37]

    Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks

    Varvara Krechetova and Denis Kochedykov. Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks. InProceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, pages 27–35, 2025

  38. [38]

    Mmcode: Evaluating multi-modal code large language models with visually rich programming problems

    Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024

  39. [39]

    Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark

    Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13337–13349, 2025. 12

  40. [40]

    Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

    Zekun Li, Malcolm Grossman, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang, et al. Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

  41. [41]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  42. [42]

    Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

    Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

  43. [43]

    Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

    Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen. Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

  44. [44]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  45. [45]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning.arXiv preprint arXiv:2105.04165, 2021

  46. [46]

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2. 5 technical report.arXiv preprint arXiv:2508.11737, 2025

  47. [47]

    Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

    Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

  48. [48]

    Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

    Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, and Jingrui He. Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

  49. [49]

    OpenAI o1

    OpenAI. OpenAI o1. https://openai.com/o1/, 2024

  50. [50]

    Gpt-4.1 model card

    OpenAI. Gpt-4.1 model card. https://platform.openai.com/docs/models/gpt-4.1, April 2025. Released on April 14, 2025

  51. [51]

    OpenAI o3 and o4-mini System Card

    OpenAI. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025

  52. [52]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  53. [53]

    Frieda: Benchmarking multi-step cartographic reasoning in vision-language models

    Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, et al. Frieda: Benchmarking multi-step cartographic reasoning in vision-language models. arXiv preprint arXiv:2512.08016, 2025

  54. [54]

    Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities

    Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, et al. Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities. arXiv preprint arXiv:2510.08759, 2025

  55. [55]

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

  56. [56]

    Navbench: Probing multimodal large language models for embodied navigation

    Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025

  57. [57]

    Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems

    Jarabala Ranga, A Arul Prasath, Neeraj Kumar, R Naveenkumar, Parashuram S Vadar, and AS Syed Fiaz. Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems. In 2025 8th International Conference on Trends in Electronics and Informatics (ICOEI), pages 1575–1582. IEEE, 2025

  58. [58]

    Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models

    Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025

  59. [59]

    Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition

    Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, and Weijun Hu. Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition. arXiv preprint arXiv:2502.14195, 2025

  60. [60]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  61. [61]

    A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems

    Mohamed R Shoaib, Heba M Emara, and Jun Zhao. A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems. In 2023 International Conference on Computer and Applications (ICCA), pages 1–7. IEEE, 2023

  62. [62]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

  63. [63]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024

  64. [64]

    Codedance: A dynamic tool-integrated mllm for executable visual reasoning

    Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated mllm for executable visual reasoning. arXiv preprint arXiv:2512.17312, 2025

  65. [65]

    Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge

    Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025

  66. [66]

    Mapiq: Evaluating multimodal large language models for map question answering

    Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, and Ross Maciejewski. Mapiq: Evaluating multimodal large language models for map question answering. arXiv preprint arXiv:2507.11625, 2025

  67. [67]

    Reason-rft: Reinforcement fine-tuning for visual reasoning

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025

  68. [68]

    Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley

    Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley. arXiv preprint arXiv:2507.07445, 2025

  69. [69]

    Lumine: An open recipe for building generalist agents in 3d open worlds

    Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3d open worlds. arXiv preprint arXiv:2511.08892, 2025

  70. [70]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  71. [71]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  72. [72]

    Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding

    Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, et al. Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding. In Proceedings of the 33rd ACM International Conference on Advances in Geographic I...

  73. [73]

    A comprehensive review of path planning algorithms for autonomous navigation

    Sangeeth Venu and Muralimohan Gurusamy. A comprehensive review of path planning algorithms for autonomous navigation. Results in Engineering, page 107750, 2025

  74. [74]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

  75. [75]

    Multi-level symmetric semantic alignment network for image–text matching

    Wenzhuang Wang, Xiaoguang Di, Maozhen Liu, and Feng Gao. Multi-level symmetric semantic alignment network for image–text matching. Neurocomputing, 599:128082, 2024

  76. [76]

    Perception-Aware Policy Optimization for Multimodal Reasoning

    Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025

  77. [77]

    Game-tars: Pretrained foundation models for scalable generalist multimodal game agents

    Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents. arXiv preprint arXiv:2510.23691, 2025

  78. [78]

    Dilu: A knowledge-driven approach to autonomous driving with large language models

    Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023

  79. [79]

    A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai

    Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai. arXiv preprint arXiv:2505.01458, 2025

  80. [80]

    SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

    Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025
