pith. sign in

arxiv: 2606.01046 · v1 · pith:XPZZPEHFnew · submitted 2026-05-31 · 💻 cs.AI

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

Pith reviewed 2026-06-28 17:18 UTC · model grok-4.3

classification 💻 cs.AI
keywords travel planningLLM evaluationbenchmarking frameworkmulti-dimensional assessmentspatio-temporal reasoningagentic strategiessimulation-based evaluation
0
0 comments X

The pith

TravelEval introduces a six-dimensional benchmark and global simulation method that exposes LLMs' consistent failures in optimized multi-day travel planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prior benchmarks for LLM travel agents over-focus on isolated constraint checks while ignoring holistic plan qualities such as timing across days, spatial movement efficiency, and total budget adherence. It introduces TravelEval to fix this with realistic lodging and transport data plus a full-trip simulator that tracks cumulative effects like queuing times and accommodation carry-over. When the authors apply the benchmark to twelve common LLM approaches, the results indicate that models still produce plans that violate global constraints even when local rules are met, and that adding agentic scaffolding does not reliably close the gap. A sympathetic reader would care because travel planning is a concrete, high-stakes test of whether LLMs can integrate multiple conflicting objectives over time rather than just answer single questions.

Core claim

TravelEval supplies a six-dimensional evaluation framework covering accuracy, compliance, temporality, spatiality, economy and utility, a data sandbox with precise accommodation prices and intercity transport, and a simulation engine that runs complete itineraries with geographic APIs and fine-grained queuing. When twelve mainstream LLM and agent pipelines are tested inside this environment, they exhibit systematic shortfalls in globally consistent spatio-temporal reasoning and budget compliance, with agentic reasoning variants showing no uniform advantage over simpler prompting.

What carries the argument

The six-dimensional evaluation framework together with the simulation-based global evaluation method that emulates full travel plans rather than isolated daily segments.

If this is right

  • LLM travel planners must be evaluated on cumulative effects across an entire trip rather than per-day compliance alone.
  • Improvements in spatio-temporal reasoning and budget tracking are required before these systems can be deployed for realistic multi-city itineraries.
  • Agentic scaffolding alone does not guarantee better global optimization in this domain.
  • Future benchmarks should incorporate authentic pricing and transport data to avoid optimistic results from synthetic datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-simulation approach could be adapted to evaluate LLM agents in other sequential planning tasks such as logistics routing or event scheduling.
  • If the benchmark's weaknesses prove real, training objectives for travel LLMs should shift from local constraint satisfaction toward explicit multi-objective optimization across time and space.
  • Developers might need hybrid systems that combine LLMs with dedicated optimization solvers for the dimensions the models handle least reliably.

Load-bearing premise

The six-dimensional scores plus the simulation engine together produce measurements that correspond to the quality and outcomes that real travelers would experience.

What would settle it

A controlled user study in which plans ranked highest by TravelEval are actually booked and executed, then rated for satisfaction, missed connections, and total cost overruns; if those real-world outcomes diverge sharply from the benchmark rankings, the framework's validity is undermined.

Figures

Figures reproduced from arXiv: 2606.01046 by Chen Jason Zhang, Kaichun Hu, Lei Chen, Shimin Di, Shuaixiong Wang, Wangze Ni, Weiyi Chen, Ziyun Gao.

Figure 1
Figure 1. Figure 1: How travel plans generated by an agent 1 Introduction According to the World Travel & Tourism Council (WTTC), travel and tourism contributed US$10.9 trillion to the global GDP in 2024, accounting for 10% of the global economy and supporting 357 mil￾lion jobs worldwide [27]. With the development of LLM-powered agents, their application in travel planning has attracted widespread attention from academia (e.g… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the TravelEval framework. compliance metrics and propose novel ones (marked with *) to re￾fine fine-grained constraint compliance assessment. This dimen￾sion evaluates whether a plan meets the hard constraints speci￾fied by users, including: (1) Budget Compliance Sign (BCS): Indi￾cate whether total travel cost is strictly within the user-set budget, measuring the model’s budget adherence abilit… view at source ↗
Figure 3
Figure 3. Figure 3: All performances in BCS. Differences across Models under the Same Strategy. To fur￾ther answer Q1.a, we analyze model performance under a uni￾fied strategy, with results in Figure 4a. Under the same strategy, Claude-Sonnet-4.5 performs the best, particularly in Accuracy and Utility dimensions. However, it displays notably poor performance in the TTCS and Profit, indicating that even the state-of-the-art (S… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of six-dimension normalized scores: (a) Overall Dataset, (b) Easy Mode, (c) Medium Mode and (d) Hard [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pearson Correlation Coefficients between metrics. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan's evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TravelEval, a benchmark for LLM-powered travel planning agents that addresses gaps in prior work via (1) a six-dimensional evaluation framework (accuracy, compliance, temporality, spatiality, economy, utility), (2) a realistic data sandbox with precise accommodation pricing and authentic intercity transport data, and (3) a simulation-based global evaluator that incorporates API-integrated geography and fine-grained queuing times. Evaluating 12 mainstream approaches yields the claims that LLMs struggle with globally-optimized multi-dimensional planning (especially spatio-temporal reasoning and budget compliance) and that agentic reasoning strategies provide no consistent improvement.

Significance. If the six-dimensional framework and simulation method are shown to track real-world user outcomes, TravelEval would supply a more holistic evaluation tool than existing constraint-focused or daily-isolated benchmarks, potentially guiding progress on complex planning agents. The reported insights on LLM limitations would then be actionable for the field.

major comments (2)
  1. [Abstract / Evaluation Framework] Abstract and § on the six-dimensional framework: the headline finding that LLMs struggle with globally-optimized multi-dimensional planning is derived entirely from scores produced by the new benchmark. No external validation (human ratings, correlation with existing user studies, or ablation demonstrating that the six dimensions predict downstream success) is supplied, leaving open the possibility that reported deficiencies are metric artifacts. This assumption is load-bearing for all reported insights.
  2. [Abstract / Simulation Method] Abstract and § on simulation-based global evaluation: the claim that isolated daily assessments miss critical inter-day effects (accommodation, pacing) is used to motivate the new method, yet the manuscript supplies no implementation details, reproducibility artifacts, or comparison against human judgments of complete itineraries. Without such grounding the quantitative results on the 12 approaches cannot be interpreted reliably.
minor comments (1)
  1. [Abstract] The abstract states that the data sandbox includes 'precise accommodation pricing and authentic intercity transportation data' but does not name the sources, coverage, or update frequency; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation and grounding of the evaluation framework and simulation method. We respond to each major comment below, proposing targeted revisions to qualify claims and add details where feasible while remaining faithful to the manuscript's content.

read point-by-point responses
  1. Referee: [Abstract / Evaluation Framework] Abstract and § on the six-dimensional framework: the headline finding that LLMs struggle with globally-optimized multi-dimensional planning is derived entirely from scores produced by the new benchmark. No external validation (human ratings, correlation with existing user studies, or ablation demonstrating that the six dimensions predict downstream success) is supplied, leaving open the possibility that reported deficiencies are metric artifacts. This assumption is load-bearing for all reported insights.

    Authors: We agree that external validation such as human ratings or correlations with user studies would strengthen the claims. The six dimensions are motivated by gaps identified in prior benchmarks (constraint-only focus) and travel planning literature on multi-faceted user needs, but the manuscript does not include such validation or ablations. In revision we will add a subsection detailing the literature-based rationale for each dimension and revise the abstract, results, and conclusion to present the LLM limitations as observations within the TravelEval benchmark rather than unvalidated real-world conclusions. This tempers the headline findings appropriately. revision: yes

  2. Referee: [Abstract / Simulation Method] Abstract and § on simulation-based global evaluation: the claim that isolated daily assessments miss critical inter-day effects (accommodation, pacing) is used to motivate the new method, yet the manuscript supplies no implementation details, reproducibility artifacts, or comparison against human judgments of complete itineraries. Without such grounding the quantitative results on the 12 approaches cannot be interpreted reliably.

    Authors: The simulation method is described in the methods section as emulating full itineraries via API-integrated geography and queuing times to capture inter-day effects. However, we acknowledge the referee's point that additional implementation details and reproducibility artifacts are needed. In revision we will expand the methods with pseudocode for the simulation process, explicit data sources, and example itineraries illustrating inter-day impacts. Direct human judgment comparisons are not present in the current work; we will add a limitations paragraph noting this and suggesting it as future validation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark is independently defined and applied

full rationale

The paper introduces TravelEval as a novel six-dimensional evaluation framework plus simulation-based global evaluator, then reports empirical scores obtained by running 12 external LLM approaches through it. No equations, parameters, or results are fitted inside the paper and then renamed as predictions; the six dimensions (accuracy, compliance, temporality, spatiality, economy, utility) and the simulation method are presented as author-defined contributions rather than derived from the LLM outputs themselves. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claims. The reported finding that LLMs struggle with globally-optimized planning is therefore an external measurement, not a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim that TravelEval reveals LLM limitations rests on the untested premise that the chosen six dimensions and simulation capture the relevant qualities of travel plans; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption The six dimensions (accuracy, compliance, temporality, spatiality, economy, utility) together provide a holistic assessment of travel plans.
    Invoked when the abstract states that existing benchmarks overemphasize constraint compliance while neglecting multi-dimensional qualities.
  • domain assumption The simulation-based global evaluation with API-integrated geographic information and queuing times produces outcomes that reflect real travel feasibility.
    Invoked when the abstract claims that isolated daily plan assessments miss critical details such as accommodation impact and visit pacing.

pith-pipeline@v0.9.1-grok · 5801 in / 1431 out tokens · 22991 ms · 2026-06-28T17:18:23.721464+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1]

    Yu, Qiang Yang, and Xing Xie

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol.15, 3, Article 39 (March 2024), 45 pages. doi:10.1145/3641289

  2. [2]

    Soumyabrata Chaudhuri, Pranav Purkar, Ritwik Raghav, Shubhojit Mallick, Manish Gupta, Abhik Jana, and Shreya Ghosh. 2025. TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning. arXiv: 2502.20508 [cs.CL] https://arxiv.org/abs/2502.20508

  3. [3]

    Aili Chen, Xuyang Ge, Ziquan Fu, Yanghua Xiao, and Jiangjie Chen. 2024. TravelAgent: An AI Assistant for Personalized Travel Planning. (2024). arXiv:2409.08069 [cs.AI] https://arxiv.org/abs/2409.08069

  4. [4]

    Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Zhangze Chen, Ling Yue, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, Min-Ling Zhang, and Yong Rui. [n. d.]. ToolRosetta: Bridging Open-Source Repositories and Large Lan- guage Model Agents through Automated Tool Standardization. https://api. semanticscholar.org/CorpusID:287352048

  5. [5]

    Fraga, Aline C

    João Paulo Esper, Luciano de S. Fraga, Aline C. Viana, Kleber Vieira Cardoso, and Sand Luz Correa. 2025. +Tour: Recommending personalized itineraries for smart tourism. Computer Networks 260 (2025), 111118. doi:10.1016/j.comnet. 2025.111118

  6. [6]

    Fedor and I.F

    P.J. Fedor and I.F. Spellerberg. 2013. Shannon ⚶Wiener Index. In Reference Mod- ule in Earth Systems and Environmental Sciences . Elsevier. doi:10.1016/B978-0- 12-409548-9.00602-3

  7. [7]

    Damianos Gavalas, Charalampos Konstantopoulos, Konstantinos Mastakas, and Grammati Pantziou. 2014. A survey on algorithmic approaches for solving tourist trip design problems. Journal of Heuristics 20 (06 2014), 291–328. doi:10. 1007/s10732-014-9242-5

  8. [8]

    Sunyoung Hlee, Zhijun Yan, and Ping Li. 2025. AI-powered travel recom- mendations and decision-making: The role of spatio-temporal efficiency, des- tination type, and travel party composition. Electron. Mark. 35, 1 (2025), 75. doi:10.1007/S12525-025-00825-4

  9. [9]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. ArXiv abs/2205.11916 (2022). https://api.semanticscholar.org/CorpusID:249017743

  10. [10]

    Haoyang Li, Xuejia Chen, Zhanchao Xu, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Li Qing, et al. 2025. Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language mod- els. In Findings of the Association for Computational Linguistics: ACL 2025. 20004– 20026

  11. [11]

    Kwan Hui Lim, Jeffrey Chan, Christopher Leckie, and Shanika Karunasekera

  12. [12]

    In IJCAI, Vol

    Personalized tour recommendation based on user interests and points of interest visit durations.. In IJCAI, Vol. 15. 1778–1784

  13. [13]

    Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, and Hao Liu. 2025. TP-RAG: Benchmarking Retrieval-Augmented Large Lan- guage Model Agents for Spatiotemporal-Aware Travel Planning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics, 12403 ⚶12429....

  14. [14]

    Juhyun Oh, Eunsu Kim, and Alice Oh. 2025. Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents. arXiv: 2506.04649 [cs.CL] https: //arxiv.org/abs/2506.04649

  15. [15]

    Chao Ouyang, Ling Yue, Shimin Di, Libin Zheng, Linan Yue, Shaowu Pan, Jian Yin, and Min-Ling Zhang. 2025. Code2MCP: Transforming Code Repositories into MCP Services. ArXiv abs/2509.05941 (2025). https://api.semanticscholar. org/CorpusID:281203560

  16. [16]

    Shaowu Pan, Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Zhangze Chen, and Shimin Di. 2026. Automating Computational Fluid Dynam- ics with LLM-based Multi-Agent Systems. (2026)

  17. [17]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large lan- guage models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789 (2023)

  18. [18]

    Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai, and Xi- aoru Dai. 2025. TripScore: Benchmarking and rewarding real-world travel plan- ning with fine-grained evaluation. arXiv: 2510.09011 [cs.AI] https://arxiv.org/ abs/2510.09011

  19. [19]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36 (2023), 68539–68551

  20. [20]

    Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, and Yu-Feng Li. 2024. Chi- naTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning. arXiv preprint arXiv:2412.13682 (2024). arXiv: 2412.13682 [cs.AI]

  21. [21]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv: 2303.11366 [cs.AI] https://arxiv.org/abs/2303. 11366

  22. [22]

    Yihong Tang, Zhaokai Wang, Ao Qu, Yihao Yan, Zhaofeng Wu, Dingyi Zhuang, Jushi Kai, Kebing Hou, Xiaotong Guo, Jinhua Zhao, Zhan Zhao, and Wei Ma

  23. [23]

    ItiNera: Integrating Spatial Optimization with Large Language Models for Open-domain Urban Itinerary Planning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , Franck Der- noncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina (Eds.). Association for Computational Linguistics, Miami, Florida, U...

  24. [24]

    Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, and Lei Chen. 2026. When AI reviews sci- ence: Can we trust the referee? The Innovation Informatics (2026). https: //api.semanticscholar.org/CorpusID:285573947

  25. [25]

    Kaimin Wang, Yuanzhe Shen, Changze Lv, Xiaoqing Zheng, and Xuan-Jing Huang. 2025. Triptailor: A real-world benchmark for personalized travel plan- ning. In Findings of the Association for Computational Linguistics: ACL 2025 . 9705–9723

  26. [26]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345. doi:10.1007/s11704-024-40231-1

  27. [27]

    Yishu Wang, Wei Liu, Yifan Li, Shengxiang Xu, Xujie Yuan, Ran Li, Yuyu Luo, Jia Zhu, Shimin Di, Min-Ling Zhang, et al. 2026. PeroMAS: A Multi-agent System of Perovskite Material Discovery. arXiv preprint arXiv:2602.13312 (2026)

  28. [28]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv: 2201.11903 [cs.CL] https: //arxiv.org/abs/2201.11903

  29. [29]

    World Travel & Tourism Council. 2024. Travel & Tourism Economic Impact. https://wttc.org/research/economic-impact. Accessed: 2026-01-12

  30. [30]

    Xiongbin Wu, Hongzhi Guan, Yan Han, and Jiaqi Ma. 2017. A tour route planning model for tourism experience utility maximization. Advances in Mechanical Engineering 9, 10 (2017), 1687814017732309. arXiv:https://doi.org/10.1177/1687814017732309 doi:10.1177/1687814017732309

  31. [31]

    Jian Xie, Yidan Liang, Jingping Liu, Yanghua Xiao, Baohua Wu, and Shenghua Ni. 2023. QUERT: Continual Pre-training of Language Model for Query Un- derstanding in Travel Domain Search. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23) . Association for Computing Machinery, New York, NY, ...

  32. [32]

    doi:10.1145/3580305.3599891

  33. [33]

    Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. TravelPlanner: A Benchmark for Real-World Planning with Language Agents. In Proceedings of the 41st International Confer- ence on Machine Learning (ICML) . arXiv: 2402.01622 [cs.CL] Spotlight Paper

  34. [34]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv: 2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629

  35. [35]

    Undeclared Preferences

    Lin Zhong, Jun Zeng, Ziwei Wang, Wei Zhou, and Junhao Wen. 2024. SCFL: Spatio-temporal consistency federated learning for next POI recommendation. Information Processing & Management 61, 6 (2024), 103852. doi:10.1016/j.ipm. 2024.103852 TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents Conference’17, July ...