MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang; Lingyue Ge; Zi-Jian Cheng; Shi-Yu Tian; Zhenyu Huang; Wenbo Fu; Weiming Wu; Yang Chen; Xiangwen Zhang; Yulan Hu; Bin Liu; Yu-Feng Li; Lan-Zhe Guo

arxiv: 2602.18600 · v3 · pith:F7OKRRH4new · submitted 2026-02-20 · 💻 cs.LG


Pith reviewed 2026-05-22 10:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords multimodal large language models · multi-criteria reasoning · route planning · benchmark · map images · tabular data · visual perception · heterogeneous graphs


The pith

MapTab shows current MLLMs struggle with multi-criteria multimodal reasoning on route planning that mixes map images and tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MapTab as a benchmark that tests multimodal large language models on route planning tasks requiring them to balance four criteria—Time, Price, Comfort, and Reliability—by grounding visual information from map images with structured attributes from tables. It covers metro networks across 160 cities and 168 tourist attractions, generating hundreds of thousands of queries. Evaluations of 15 models find substantial difficulties in this holistic reasoning, with multimodal approaches often performing worse than unimodal ones when visual perception is limited. This matters because route planning reflects real decision tasks where models must weigh competing factors from heterogeneous sources.

Core claim

MapTab is a multimodal benchmark for holistic multi-criteria reasoning in MLLMs via route planning that requires perceiving visual cues from map images alongside route attributes from tabular data. The benchmark includes Metromap for metro networks in 160 cities across 52 countries and Travelmap for 168 tourist attractions across 19 countries, with 328 images, 196800 route planning queries, and 3936 QA queries using the criteria Time, Price, Comfort, and Reliability. Extensive evaluations across 15 MLLMs show current models face substantial challenges in multi-criteria multimodal reasoning, and multimodal collaboration often underperforms unimodal approaches under limited visual perception.

What carries the argument

MapTab benchmark, which pairs map images for visual grounding with tabular route attributes to test multi-criteria reasoning in heterogeneous graphs.

If this is right

Multimodal collaboration does not reliably outperform unimodal processing when visual perception is constrained.
Current MLLMs have difficulty integrating visual map cues with quantitative tabular attributes for balanced decision-making.
Benchmarks like MapTab can serve as a testbed to measure progress in realistic multi-criteria planning tasks.
Improvements in visual grounding are needed before MLLMs can handle heterogeneous data sources in route planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings suggest that training data for MLLMs should emphasize more examples of map and table integration to reduce the observed multimodal penalty.
MapTab-style tasks could be adapted to evaluate models on other multi-criteria problems such as logistics scheduling or urban resource allocation.
Developers might prioritize separate visual perception modules that feed cleaner signals into reasoning stages rather than end-to-end multimodal fusion.
The benchmark points to broader difficulties MLLMs face when processing real-world graphs that combine spatial visuals with attribute tables.

Load-bearing premise

The constructed queries and criteria plus the visual grounding from map images provide a faithful and unbiased measure of holistic multi-criteria reasoning capability in MLLMs without artifacts from data generation or query design.

What would settle it

A controlled experiment where models are retrained specifically on MapTab-style map-tabular pairs and then retested to check if the reported performance gaps close or persist.

Figures

Figures reproduced from arXiv: 2602.18600 by Bin Liu, Lan-Zhe Guo, Lingyue Ge, Shi-Yu Tian, Weiming Wu, Wenbo Fu, Xiangwen Zhang, Yang Chen, Yu-Feng Li, Yulan Hu, Zhenyu Huang, Zi-Jian Cheng, Ziqiao Shang.

**Figure 2.** Schematic overview of the MapTab construction pipeline, comprising 5 main steps:

**Figure 3.** Impact of Image Resolution on RP and QA Tasks in Metromap and Travelmap Sce

**Figure 4.** Distribution of model accuracy across Map Difficulty and Query Difficulty under different

**Figure 5.** Model accuracy matrix under different combinations of Map Difficulty and Query Difficulty.

**Figure 6.** All images have the exact same height. Widths are adjusted automatically.

**Figure 7.** Error Case 1: Wrong source and destination

**Figure 8.** Error Case 2: Missed Transfer Label

**Figure 9.** Error Case 3: Wrong source and destination

**Figure 10.** Error Case 4: Illegal Line Jumping

**Figure 11.** Error Case 5: Overthinking

**Figure 12.** Error Case 6: Failed multi-criteria Reasoning

Original abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MapTab gives a new scaled-up benchmark for MLLM multi-criteria planning but its multimodal underperformance claim needs more ablations to rule out artifacts.


The main takeaway is that MapTab introduces a new benchmark for testing MLLMs on multi-criteria route planning that mixes map images with tabular attributes like time, price, comfort, and reliability. The authors evaluate 15 models and note that multimodal setups can lag behind unimodal ones when visual perception is limited.

What the paper does well is the scale and coverage. It pulls together 328 images covering metro networks in 160 cities across 52 countries and 168 tourist attractions from 19 countries, which yield 196,800 route planning queries plus 3,936 QA queries. The geographic spread and the focus on four practical criteria make it more grounded than many existing benchmarks.

The soft spots are in the details around how the queries were generated and what supports the key finding. Without clear descriptions of the query generation process, or ablations that isolate the effect of the map images from that of the text attributes, the performance differences could come from artifacts in the data construction rather than fundamental limits in the models. The counter-intuitive result on multimodal collaboration needs tighter controls to hold up.

This paper is for researchers focused on multimodal large language models and their use in planning or decision tasks. Anyone building benchmarks or testing agents for real-world scenarios like navigation would find the dataset and results worth looking at. It deserves peer review because the benchmark itself is new and substantial, even though the current evidence for the claims could be firmer with additional experiments.

Referee Report

2 major / 2 minor

Summary. The paper introduces MapTab, a multimodal benchmark for evaluating MLLMs on multi-criteria route planning in heterogeneous graphs. It comprises 328 map images (Metromap metro networks across 160 cities and Travelmap attractions), 196800 route-planning queries, and 3936 QA items using four fixed criteria (Time, Price, Comfort, Reliability). Evaluations across 15 MLLMs claim that current models face substantial challenges in multi-criteria multimodal reasoning and that multimodal inputs often underperform unimodal ones under limited visual perception.

Significance. If the queries prove free of systematic artifacts from programmatic generation over fixed criteria and map renderings, the benchmark would offer a useful large-scale testbed for probing MLLM integration of visual and tabular information under realistic constraints. The scale, the open code release, and the counter-intuitive multimodal-under-unimodal observation could usefully guide future work on multimodal reasoning.

major comments (2)

[Abstract] Abstract: the headline claims of 'substantial challenges' and multimodal underperformance rest on 196800 queries and 3936 QA items, yet no details are supplied on query-generation mechanics, statistical controls, or operationalization of 'limited visual perception.' This directly affects whether the reported performance gaps can be attributed to intrinsic reasoning limits.
[Experiments] Experiments section: no ablation is presented that holds attribute values constant while removing the map image or randomizes criteria weights per query. Without such controls, surface correlations between textual attributes, topology, or rendering choices cannot be ruled out as drivers of success/failure, which is load-bearing for the central claim that results reflect genuine multi-criteria grounding.

minor comments (2)

[Related Work] Add explicit comparison tables against prior multimodal reasoning benchmarks to sharpen the novelty claim.
[Dataset Construction] Clarify in the dataset statistics whether all 328 images are used uniformly across the 196800 queries or whether some images contribute disproportionately.
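
One quick sanity check on the uniformity question in the last minor comment uses only the totals quoted in the abstract: both query counts divide evenly by the number of images. The sketch below does that arithmetic. The helper name `per_image` is illustrative, and an even split is merely consistent with a uniform per-image allocation; it does not establish that the paper actually allocates queries this way.

```python
# Totals quoted in the abstract.
NUM_IMAGES = 328      # 160 Metromap cities + 168 Travelmap attractions
RP_QUERIES = 196_800  # route planning queries
QA_QUERIES = 3_936    # QA queries


def per_image(total: int, images: int) -> tuple[int, int]:
    """Return (quotient, remainder) when splitting `total` queries over `images`."""
    return divmod(total, images)


rp_per_image = per_image(RP_QUERIES, NUM_IMAGES)
qa_per_image = per_image(QA_QUERIES, NUM_IMAGES)

print(rp_per_image)  # (600, 0)
print(qa_per_image)  # (12, 0)
```

A zero remainder in both cases means a uniform design of 600 route planning queries and 12 QA queries per image is arithmetically possible; confirming it would still require the per-image statistics the referee asks for.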

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have carefully reviewed the major comments and provide point-by-point responses below. Where appropriate, we outline specific revisions that will be incorporated into the next version of the manuscript to address the concerns raised.


Referee: [Abstract] Abstract: the headline claims of 'substantial challenges' and multimodal underperformance rest on 196800 queries and 3936 QA items, yet no details are supplied on query-generation mechanics, statistical controls, or operationalization of 'limited visual perception.' This directly affects whether the reported performance gaps can be attributed to intrinsic reasoning limits.

Authors: We appreciate the referee highlighting the need for greater transparency in the abstract. The query-generation mechanics, including the programmatic construction of the 196,800 route-planning queries and 3,936 QA items over the four fixed criteria (Time, Price, Comfort, Reliability) and the two map scenarios (Metromap and Travelmap), are described in detail in Section 3 of the manuscript. Statistical controls for query validity and diversity are also outlined there, along with the definition of limited visual perception as the use of low-resolution or cropped map images that restrict full visual access to topological and attribute information. To directly address the concern, we will revise the abstract to include a concise summary of the query-generation process and the operationalization of limited visual perception, ensuring readers can immediately assess the grounding of the reported performance gaps.
revision: yes

Referee: [Experiments] Experiments section: no ablation is presented that holds attribute values constant while removing the map image or randomizes criteria weights per query. Without such controls, surface correlations between textual attributes, topology, or rendering choices cannot be ruled out as drivers of success/failure, which is load-bearing for the central claim that results reflect genuine multi-criteria grounding.

Authors: We agree that these additional controls would strengthen the central claim regarding genuine multi-criteria multimodal reasoning. The current manuscript includes multimodal versus unimodal comparisons and evaluations across 15 MLLMs, but does not contain the specific ablations suggested. In the revised version, we will add two new ablation studies in the Experiments section: (1) an ablation that holds all attribute values (Time, Price, Comfort, Reliability) constant while removing the map image entirely, and (2) an ablation that randomizes the criteria weights on a per-query basis. These will help isolate whether performance differences arise from multimodal integration or from potential surface correlations with textual attributes, topology, or rendering choices.
revision: yes
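
To make the second proposed ablation concrete, here is a minimal sketch of per-query weight randomization. It assumes the weighted-sum objective quoted elsewhere on this page, w1·T + w2·P + w3·(1−C) + w4·(1−R), and further assumes all four attributes are normalized to [0, 1]. The names `Route`, `random_weights`, `cost`, and `best_route`, the sampling scheme, and the toy routes are all hypothetical; none of this is taken from the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    time: float         # T, assumed normalized to [0, 1]; lower is better
    price: float        # P, assumed normalized to [0, 1]; lower is better
    comfort: float      # C in [0, 1]; higher is better
    reliability: float  # R in [0, 1]; higher is better


def random_weights(rng: random.Random) -> tuple:
    """Draw four positive weights that sum to 1 (hypothetical sampling scheme)."""
    raw = [rng.random() + 1e-9 for _ in range(4)]
    total = sum(raw)
    return tuple(w / total for w in raw)


def cost(route: Route, w) -> float:
    """Weighted-sum cost: w1*T + w2*P + w3*(1 - C) + w4*(1 - R)."""
    w1, w2, w3, w4 = w
    return (w1 * route.time + w2 * route.price
            + w3 * (1 - route.comfort) + w4 * (1 - route.reliability))


def best_route(routes, w) -> Route:
    """Return the candidate route with the lowest weighted-sum cost."""
    return min(routes, key=lambda r: cost(r, w))


# Toy candidates: the reference label should change as the weights change.
routes = [
    Route("fast", time=0.2, price=0.9, comfort=0.5, reliability=0.6),
    Route("cheap", time=0.8, price=0.1, comfort=0.4, reliability=0.7),
]
rng = random.Random(0)
labels = [best_route(routes, random_weights(rng)).name for _ in range(5)]
print(labels)
```

If a model's accuracy holds up only under one fixed weighting and drops once the reference label moves with the sampled weights, that would point toward surface correlations instead of genuine multi-criteria reasoning.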

Circularity Check

0 steps flagged

No circularity: new benchmark and direct empirical evaluations


The paper introduces MapTab as a fresh multimodal benchmark consisting of 328 map images, 196800 programmatically generated route-planning queries, and 3936 QA items over Metromap and Travelmap scenarios with four fixed criteria. All reported results are direct performance measurements of 15 MLLMs on this newly constructed dataset; no parameters are fitted to subsets of the target data, no predictions are derived from prior fits, and no self-citation chain is invoked to justify uniqueness or force the central claims. The evaluation methodology is therefore self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that route-planning queries with the chosen criteria constitute a valid proxy for general multi-criteria multimodal reasoning; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Route planning tasks with visual maps and tabular attributes can serve as a rigorous test of holistic multi-criteria reasoning in MLLMs.
Invoked in the abstract when the benchmark is positioned as evaluating reasoning capabilities under multi-criteria constraints.

pith-pipeline@v0.9.0 · 5817 in / 1285 out tokens · 45855 ms · 2026-05-22T10:25:54.672433+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: MapTab comprises 328 images, 196800 route planning queries... four key criteria: Time, Price, Comfort, and Reliability... optimization objective minimizes weighted sum $w_1 T + w_2 P + w_3(1-C) + w_4(1-R)$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: shortest paths computed with Dijkstra algorithm... reference route collection and label annotation
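
The two quoted passages mention a weighted-sum objective and reference routes computed with Dijkstra's algorithm. As a rough illustration of how the two could fit together, the sketch below runs Dijkstra over a toy graph whose edge cost is the weighted sum w1·T + w2·P + w3·(1−C) + w4·(1−R). Applying the objective per edge and summing along the route is an assumption made here for illustration, as are the [0, 1] attribute ranges, the graph layout, and the names `edge_cost` and `dijkstra`; the paper may aggregate criteria at the route level instead.

```python
import heapq
from math import inf


def edge_cost(attrs: dict, w) -> float:
    """Per-edge weighted sum (assumed additive along a route)."""
    w1, w2, w3, w4 = w
    return (w1 * attrs["T"] + w2 * attrs["P"]
            + w3 * (1 - attrs["C"]) + w4 * (1 - attrs["R"]))


def dijkstra(graph: dict, source: str, target: str, w):
    """graph maps node -> list of (neighbor, attrs). Returns (path, cost) or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, attrs in graph.get(u, []):
            nd = d + edge_cost(attrs, w)  # non-negative when weights >= 0
            if nd < dist.get(v, inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Toy network: A-B-D is quick but pricey, A-C-D is slow but cheap.
quick = {"T": 0.1, "P": 0.5, "C": 0.5, "R": 0.9}
cheap = {"T": 0.4, "P": 0.1, "C": 0.8, "R": 0.9}
graph = {
    "A": [("B", quick), ("C", cheap)],
    "B": [("D", quick)],
    "C": [("D", cheap)],
}

time_only = dijkstra(graph, "A", "D", (1, 0, 0, 0))
price_only = dijkstra(graph, "A", "D", (0, 1, 0, 0))
print(time_only[0])   # ['A', 'B', 'D']
print(price_only[0])  # ['A', 'C', 'D']
```

Dijkstra stays valid here because every edge cost is non-negative when the weights are non-negative and the attributes lie in [0, 1]; with other normalizations that guarantee would need to be rechecked.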

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

Reference graph

Works this paper leans on

211 extracted references · 211 canonical work pages · cited by 1 Pith paper · 21 internal anchors

[1]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

work page 2024
[2]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

work page arXiv 2026
[5]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page 2025
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

doubao-seed-1.6-thinking

ByteDance. doubao-seed-1.6-thinking. https://www.volcengine.com/docs/82379/ 1593702?utm_source=chatgpt.com&lang=zh, 2025. 10

work page 2025
[8]

Seed1.6: Tech introduction

ByteDance Seed Team. Seed1.6: Tech introduction. https://seed.bytedance.com/en/ seed1_6, June 2025. Model ID: doubao-seed-1-6-251015. Accessed: 2025-12-25

work page 2025
[9]

Has gpt-5 achieved spatial intelligence? an empirical study.arXiv preprint arXiv:2508.13142, 2025

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

work page arXiv 2025
[10]

Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

Chao Cao, Hongbiao Zhu, Zhongqiang Ren, Howie Choset, and Ji Zhang. Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

work page 2023
[11]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024
[12]

Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

work page arXiv 2025
[13]

Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

work page 2025
[14]

Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

work page arXiv 2025
[15]

Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

work page arXiv 2025
[16]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024

work page 2024
[18]

MapEval: A Map-Based Evaluation of Geo-Spatial Reason- ing in Foundation Models, 2025

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mape- val: A map-based evaluation of geo-spatial reasoning in foundation models.arXiv preprint arXiv:2501.00316, 2024

work page arXiv 2024
[19]

Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

Bowen Fang, Zixiao Yang, and Xuan Di. Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

work page arXiv 2024
[20]

Citybench: Evaluating the capabilities of large language models for urban tasks

Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language models for urban tasks. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5413–5424, 2025

work page 2025
[21]

Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

Jie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

work page 2024
[22]

Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

work page arXiv 2025
[23]

Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025. 11

work page arXiv 2025
[24]

Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

work page arXiv 2025
[25]

Drive like a human: Rethinking autonomous driving with large language models

Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024

work page 2024
[26]

Gemini 3 flash: Frontier intelligence built for speed

Google. Gemini 3 flash: Frontier intelligence built for speed. https://blog. google/products-and-platforms/products/gemini/gemini-3-flash/ , December

work page
[27]

Accessed: 2025-12-25

Model ID: gemini-3-flash-preview. Accessed: 2025-12-25

work page 2025
[28]

Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T Kwok, and Yu Zhang. Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

work page arXiv 2025
[29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

Lingfeng Guo, Zihan Li, and Shengjie Min. Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

work page 2025
[31]

R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, et al. R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

work page arXiv 2025
[32]

Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

work page arXiv 2025
[33]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

work page arXiv 2025
[34]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

work page 2024
[35]

Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

work page arXiv 2025
[36]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks

Varvara Krechetova and Denis Kochedykov. Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks. InProceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, pages 27–35, 2025

work page 2025
[38]

Mmcode: Evaluating multi-modal code large language models with visually rich programming problems

Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024

work page arXiv 2024
[39]

Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark

Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13337–13349, 2025. 12

work page 2025
[40] Zekun Li, Malcolm Grossman, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang, et al. Mapqa: Open-domain geospatial question answering on map data. arXiv preprint arXiv:2503.07871, 2025
[41] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
[42] Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy. arXiv preprint arXiv:2506.13284, 2025
[43] Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen. Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving. arXiv preprint arXiv:2512.09864, 2025
[44] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023
[45] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021
[46] Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025
[47] Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. arXiv preprint arXiv:2506.04277, 2025
[48] Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, and Jingrui He. Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains. arXiv preprint arXiv:2603.00873, 2026
[49] OpenAI. OpenAI o1. https://openai.com/o1/, 2024
[50] OpenAI. Gpt-4.1 model card. https://platform.openai.com/docs/models/gpt-4.1, April 2025. Released on April 14, 2025
[51] OpenAI. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, 2025
[52] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023
[53] Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, et al. Frieda: Benchmarking multi-step cartographic reasoning in vision-language models. arXiv preprint arXiv:2512.08016, 2025
[54] Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, et al. Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities. arXiv preprint arXiv:2510.08759, 2025
[55] Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...
[56] Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025
[57] Jarabala Ranga, A Arul Prasath, Neeraj Kumar, R Naveenkumar, Parashuram S Vadar, and AS Syed Fiaz. Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems. In 2025 8th International Conference on Trends in Electronics and Informatics (ICOEI), pages 1575–1582. IEEE, 2025
[58] Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025
[59] Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, and Weijun Hu. Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition. arXiv preprint arXiv:2502.14195, 2025
[60] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[61] Mohamed R Shoaib, Heba M Emara, and Jun Zhao. A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems. In 2023 International Conference on Computer and Applications (ICCA), pages 1–7. IEEE, 2023
[62] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020
[63] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024
[64] Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated mllm for executable visual reasoning. arXiv preprint arXiv:2512.17312, 2025
[65] Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025
[66] Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, and Ross Maciejewski. Mapiq: Evaluating multimodal large language models for map question answering. arXiv preprint arXiv:2507.11625, 2025
[67] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025
[68] Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley. arXiv preprint arXiv:2507.07445, 2025
[69] Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3d open worlds. arXiv preprint arXiv:2511.08892, 2025
[70] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[71] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025
[72] Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, et al. Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding. In Proceedings of the 33rd ACM International Conference on Advances in Geographic I...
[73] Sangeeth Venu and Muralimohan Gurusamy. A comprehensive review of path planning algorithms for autonomous navigation. Results in Engineering, page 107750, 2025
[74] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024
[75] Wenzhuang Wang, Xiaoguang Di, Maozhen Liu, and Feng Gao. Multi-level symmetric semantic alignment network for image–text matching. Neurocomputing, 599:128082, 2024
[76] Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025
[77] Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents. arXiv preprint arXiv:2510.23691, 2025
[78] Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023
[79] Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai. arXiv preprint arXiv:2505.01458, 2025
[80] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025

[1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...
[2] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
[4] Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms. arXiv preprint arXiv:2602.06566, 2026
[5] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
[6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
[7] ByteDance. doubao-seed-1.6-thinking. https://www.volcengine.com/docs/82379/1593702?utm_source=chatgpt.com&lang=zh, 2025
[8] ByteDance Seed Team. Seed1.6: Tech introduction. https://seed.bytedance.com/en/seed1_6, June 2025. Model ID: doubao-seed-1-6-251015. Accessed: 2025-12-25
[9] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence. arXiv preprint arXiv:2508.13142, 2025
[10] Chao Cao, Hongbiao Zhu, Zhongqiang Ren, Howie Choset, and Ji Zhang. Representation granularity enables time-efficient autonomous exploration in large, complex worlds. Science Robotics, 8(80):eadf0970, 2023
[11] Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024
[12] Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning. arXiv preprint arXiv:2509.06948, 2025
[13] Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion. PLoS One, 20(5):e0321257, 2025
[14] Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800, 2025
[15] Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330, 2025
[16] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, 2025
[17] Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024
[18] Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapeval: A map-based evaluation of geo-spatial reasoning in foundation models. arXiv preprint arXiv:2501.00316, 2024
[19] Bowen Fang, Zixiao Yang, and Xuan Di. Travellm: Could you plan my new public transit route in face of a network disruption? arXiv preprint arXiv:2407.14926, 2024
[20] Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language models for urban tasks. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5413–5424, 2025
[21] Jie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language model as world model. arXiv e-prints, pages arXiv–2406, 2024
[22] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025
[23] Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning. arXiv preprint arXiv:2510.02240, 2025
[24] Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps. arXiv preprint arXiv:2505.18675, 2025
[25] Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024
[26] Google. Gemini 3 flash: Frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December
[27] Model ID: gemini-3-flash-preview. Accessed: 2025-12-25
[28] Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T Kwok, and Yu Zhang. Reasoning-aligned perception decoupling for scalable multi-modal reasoning. arXiv preprint arXiv:2506.04559, 2025
[29] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
[30] Lingfeng Guo, Zihan Li, and Shengjie Min. Enhanced natural language annotation and query for semantic mapping in visual slam using large language models. Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025
[31] Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, et al. R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation. arXiv preprint arXiv:2505.02018, 2025
[32] Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence. arXiv preprint arXiv:2506.15677, 2025
[33] Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025
[34] [34]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

work page 2024

[35] [35]

Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

work page arXiv 2025

[36] [36]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks

Varvara Krechetova and Denis Kochedykov. Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks. InProceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, pages 27–35, 2025

work page 2025

[38] [38]

Mmcode: Evaluating multi-modal code large language models with visually rich programming problems

Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024

work page arXiv 2024

[39] [39]

Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark

Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13337–13349, 2025. 12

work page 2025

[40] [40]

Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

Zekun Li, Malcolm Grossman, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang, et al. Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

work page arXiv 2025

[41] [41]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

work page 2023

[42] [42]

Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

work page arXiv 2025

[43] [43]

Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen. Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

work page arXiv 2025

[44] [44]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning.arXiv preprint arXiv:2105.04165, 2021

work page arXiv 2021

[46] [46]

Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2. 5 technical report.arXiv preprint arXiv:2508.11737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

work page arXiv 2025

[48] [48]

Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, and Jingrui He. Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

work page arXiv 2026

[49] [49]

OpenAI o1.https://openai.com/o1/, 2024

OpenAI. OpenAI o1.https://openai.com/o1/, 2024

work page 2024

[50] [50]

Gpt-4.1 model card

OpenAI. Gpt-4.1 model card. https://platform.openai.com/docs/models/gpt-4.1, April 2025. Released on April 14, 2025

work page 2025

[51] [51]

OpenAI o3 and o4-mini System Card

OpenAI. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/ 2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf , 2025

work page 2025

[52] [52]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, et al. Frieda: Benchmarking multi-step cartographic reasoning in vision-language models. arXiv preprint arXiv:2512.08016, 2025

[54] Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, et al. Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities. arXiv preprint arXiv:2510.08759, 2025

[55] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

[56] Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025

[57] Jarabala Ranga, A Arul Prasath, Neeraj Kumar, R Naveenkumar, Parashuram S Vadar, and AS Syed Fiaz. Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems. In 2025 8th International Conference on Trends in Electronics and Informatics (ICOEI), pages 1575–1582. IEEE, 2025

[58] Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models. arXiv preprint arXiv:2503.23064, 2025

[59] Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, and Weijun Hu. Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition. arXiv preprint arXiv:2502.14195, 2025

[60] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[61] Mohamed R Shoaib, Heba M Emara, and Jun Zhao. A survey on the applications of frontier AI, foundation models, and large language models to intelligent transportation systems. In 2023 International Conference on Computer and Applications (ICCA), pages 1–7. IEEE, 2023

[62] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

[63] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024

[64] Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated MLLM for executable visual reasoning. arXiv preprint arXiv:2512.17312, 2025

[65] Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuzzles: Decoupling multimodal reasoning evaluation from domain knowledge. arXiv preprint arXiv:2504.10342, 2025

[66] Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, and Ross Maciejewski. Mapiq: Evaluating multimodal large language models for map question answering. arXiv preprint arXiv:2507.11625, 2025

[67] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025

[68] Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. Stardojo: Benchmarking open-ended behaviors of agentic multimodal LLMs in production-living simulations with Stardew Valley. arXiv preprint arXiv:2507.07445, 2025

[69] Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3D open worlds. arXiv preprint arXiv:2511.08892, 2025

[70] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[71] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025

[72] Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, et al. Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding. In Proceedings of the 33rd ACM International Conference on Advances in Geographic I...

[73] Sangeeth Venu and Muralimohan Gurusamy. A comprehensive review of path planning algorithms for autonomous navigation. Results in Engineering, page 107750, 2025

[74] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

[75] Wenzhuang Wang, Xiaoguang Di, Maozhen Liu, and Feng Gao. Multi-level symmetric semantic alignment network for image–text matching. Neurocomputing, 599:128082, 2024

[76] Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025

[77] Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents. arXiv preprint arXiv:2510.23691, 2025

[78] Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023

[79] Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied AI. arXiv preprint arXiv:2505.01458, 2025

[80] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. SpatialScore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025