MobilityBench: A benchmark for evaluating route-planning agents in real-world mobility scenarios

Zhiheng Song, Jingshuai Zhang, Chuan Qin, et al. · 2026 · cs.AI · arXiv 2602.22638

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

TransitLM is a large-scale dataset and benchmark for training LLMs to generate structurally valid map-free transit routes from origin-destination pairs.

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

cs.AI · 2026-05-31 · unverdicted · novelty 2.0

A survey synthesizing LLM and MM-LLM uses in transportation operations, mobility services, and decision support while noting challenges like data heterogeneity and real-time needs.

citing papers explorer

Showing 1 of 1 citing paper after filters.

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

cs.AI · 2026-05-11 · unverdicted · none · ref 22 · internal anchor

TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.
