Trip+: Benchmarking Agents in Personalized Interactive Travel Planning
Pith reviewed 2026-06-26 14:36 UTC · model grok-4.3
The pith
Language models generate feasible but exhausting travel itineraries that diverge from user preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trip+ requires agents to produce and revise minute-level itineraries given traveler profiles and dynamic interactions, with end-to-end experiences scored by an LLM simulator on subjective metrics including fatigue. Evaluations across 18 LMs demonstrate a consistent gap in experiential quality, where models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.
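To make the distinction between "feasible" and "matches the profile" concrete, here is a minimal sketch. It assumes an itinerary is a list of minute-stamped slots and that a traveler profile carries a daily activity budget. The names `Slot`, `Profile`, and `max_active_minutes` are hypothetical and are not taken from the paper, whose actual schema and scoring are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slot:
    label: str
    start: int  # minutes after midnight
    end: int    # minutes after midnight


@dataclass(frozen=True)
class Profile:
    # Hypothetical preference: most active minutes the traveler wants per day.
    max_active_minutes: int


def is_feasible(slots: List[Slot]) -> bool:
    """A day is feasible here if every slot has positive length and none overlap."""
    ordered = sorted(slots, key=lambda s: s.start)
    if any(s.end <= s.start for s in ordered):
        return False
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))


def active_minutes(slots: List[Slot]) -> int:
    """Total scheduled minutes across all slots."""
    return sum(s.end - s.start for s in slots)


def respects_profile(slots: List[Slot], profile: Profile) -> bool:
    """Check the (assumed) daily activity budget from the profile."""
    return active_minutes(slots) <= profile.max_active_minutes


# Illustrative data only: a back-to-back day for a traveler who prefers a slow pace.
day = [
    Slot("museum", 540, 720),
    Slot("hike", 720, 960),
    Slot("night market", 960, 1140),
]
relaxed_traveler = Profile(max_active_minutes=360)

print("feasible:", is_feasible(day))
print("respects profile:", respects_profile(day, relaxed_traveler))
```

The example day passes the overlap check while exceeding the assumed activity budget, which is the kind of gap the core claim describes. A single budget is a crude stand-in for the subjective metrics the benchmark's simulator is said to score.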
What carries the argument
Trip+ benchmark with LLM-based simulator for assessing subjective experiential quality in profile-conditioned interactive itinerary planning.
If this is right
- Agents require stronger mechanisms to incorporate and maintain profile preferences across interaction turns.
- Evaluation of travel planning must extend beyond technical feasibility to include experiential costs like fatigue.
- Performance gaps widen in complex replanning scenarios compared to simple request handling.
- Current models systematically underweight long-term traveler comfort when optimizing itineraries.
Where Pith is reading between the lines
- Similar simulator-driven evaluation could be applied to other multi-turn agent planning domains such as logistics or daily scheduling.
- The observed preference divergence suggests limits in how models integrate and prioritize user profiles over multiple revisions.
- Deployment of such agents in real travel services would likely need additional layers to detect and mitigate user fatigue.
Load-bearing premise
The LLM-based simulator provides a reliable proxy for subjective human judgments of fatigue and overall experiential quality.
What would settle it
A direct comparison study where human travelers rate fatigue and quality for the same set of generated itineraries and the scores are checked against simulator outputs.
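One way such a comparison could be scored is with a rank correlation between human ratings and simulator scores for the same itineraries. The sketch below computes Spearman's rho in plain Python, using average ranks for ties. The rating lists are made-up placeholders, not data from the paper, and the choice of Spearman's rho is an assumption about how agreement might be measured.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder ratings for six itineraries (hypothetical, for illustration only).
human_fatigue = [2, 5, 3, 4, 1, 5]
simulator_fatigue = [3, 4, 3, 5, 1, 4]

rho = spearman(human_fatigue, simulator_fatigue)
print(f"Spearman rho between human and simulator fatigue: {rho:.3f}")
```

A high rho would indicate that the simulator orders itineraries similarly to humans. It would not show that the absolute scores agree, so a calibration study would likely also report agreement on the rating scale itself.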
Original abstract
Interactive travel planning has become a popular use case for language models. Agents are deployed to manage evolving preferences and unexpected disruptions over multiple turns. Such settings require models to make complex, profile-conditioned planning decisions. However, existing benchmarks often evaluate feasibility, personalization, or interaction in relatively isolated settings. We therefore introduce Trip+ to measure the ability of agents to plan travel holistically. In Trip+, given traveler profiles and dynamic interactions, agents must generate and revise minute-level itineraries. End-to-end traveler experiences are evaluated via an LLM-based simulator, enabling the assessment of subjective metrics like fatigue. Our scenarios range from simple request resolutions to complex environment-driven replanning. We evaluate 18 LMs and find a consistent gap in experiential quality. Models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trip+, a benchmark for language model agents in personalized interactive travel planning. Given traveler profiles and multi-turn dynamic interactions, agents generate and revise minute-level itineraries; end-to-end experiences are scored by an LLM-based simulator on subjective metrics including fatigue. The authors evaluate 18 LMs and report a consistent gap: models produce technically feasible but exhausting itineraries that diverge from profiled preferences.
Significance. If the simulator is shown to track human judgments, the benchmark would offer a useful addition to existing feasibility- or personalization-focused evaluations by stressing holistic, profile-conditioned replanning under realistic disruptions.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation setup: the headline result (consistent gap in experiential quality) is obtained entirely from the LLM-based simulator scoring fatigue and related subjective metrics, yet the manuscript supplies no human calibration study, correlation coefficients, or inter-rater agreement data with human raters. This directly affects the validity of the reported gap.
- [Methods] Methods (simulator description): without an ablation or sensitivity analysis showing that the simulator does not systematically over-penalize dense but feasible schedules, it remains possible that the observed divergence from traveler profiles is an artifact of the evaluator rather than agent behavior.
minor comments (2)
- [Evaluation] Clarify the exact prompting template and temperature settings used for the LLM simulator so that the evaluation is reproducible.
- [Experiments] The abstract states '18 LMs' but does not list model names or sizes; add this information in the main text or a table.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the Trip+ benchmark. We address each major comment below in a point-by-point manner.
- Referee: [Abstract / Evaluation] Abstract and evaluation setup: the headline result (consistent gap in experiential quality) is obtained entirely from the LLM-based simulator scoring fatigue and related subjective metrics, yet the manuscript supplies no human calibration study, correlation coefficients, or inter-rater agreement data with human raters. This directly affects the validity of the reported gap.
  Authors: We agree that the absence of a human calibration study, correlation coefficients, or inter-rater agreement data is a limitation in the current manuscript. The headline results on the experiential quality gap rely on the LLM-based simulator without direct validation against human judgments. In the revision, we will add an explicit discussion of this limitation in the evaluation section, including the simulator's design rationale based on profile-conditioned criteria, and note it as an important direction for future work. The consistency of the gap across 18 models provides some supporting evidence, but we recognize this does not substitute for human validation.
  Revision: partial
- Referee: [Methods] Methods (simulator description): without an ablation or sensitivity analysis showing that the simulator does not systematically over-penalize dense but feasible schedules, it remains possible that the observed divergence from traveler profiles is an artifact of the evaluator rather than agent behavior.
  Authors: The manuscript does not include an ablation or sensitivity analysis examining the simulator's response to schedule density. This is a valid concern, as it leaves open the possibility that evaluator artifacts contribute to the reported divergence. We will incorporate a sensitivity analysis in the methods section of the revised manuscript, testing the simulator on varied schedule densities while holding other factors constant, to address potential systematic biases.
  Revision: yes
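The density sensitivity analysis discussed in the second response could be set up roughly as follows. This is a minimal sketch under stated assumptions: the activity set and durations are held fixed, only the gap between activities varies, and `score_fn` is a hypothetical stand-in for a call to the LLM simulator. `toy_fatigue_score` is a placeholder so the example runs, and is not the paper's evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Activity:
    name: str
    start: int  # minutes after midnight
    end: int    # minutes after midnight


def build_schedule(durations: Sequence[int], day_start: int, gap: int) -> List[Activity]:
    """Lay out the same activities in order, separated by a fixed gap in minutes."""
    schedule = []
    t = day_start
    for idx, minutes in enumerate(durations):
        schedule.append(Activity(f"activity_{idx}", t, t + minutes))
        t += minutes + gap
    return schedule


def density(schedule: List[Activity]) -> float:
    """Fraction of the day's span (first start to last end) spent in activities."""
    if not schedule:
        raise ValueError("schedule must contain at least one activity")
    active = sum(a.end - a.start for a in schedule)
    span = schedule[-1].end - schedule[0].start
    return active / span


def density_sweep(
    durations: Sequence[int],
    gaps: Sequence[int],
    score_fn: Callable[[List[Activity]], float],
    day_start: int = 540,
) -> List[Tuple[float, float]]:
    """Score one schedule per gap value and return (density, score) pairs."""
    results = []
    for gap in gaps:
        schedule = build_schedule(durations, day_start, gap)
        results.append((density(schedule), score_fn(schedule)))
    return results


def toy_fatigue_score(schedule: List[Activity]) -> float:
    # Placeholder scorer for demonstration only; a real study would query the simulator.
    return round(10 * density(schedule), 2)


for d, score in density_sweep([60, 90, 60], [0, 30, 60], toy_fatigue_score):
    print(f"density={d:.2f} score={score}")
```

Widening the gaps also pushes the end of the day later, so this design holds the activities constant but not the total day length. A fuller analysis would likely vary one of these at a time and compare the simulator's scores against a reference, such as human ratings, at each density level.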
Circularity Check
No circularity: empirical benchmark with no derivations or self-referential reductions
The paper introduces Trip+ as an empirical benchmark for evaluating language model agents on personalized interactive travel planning tasks. It describes scenarios, an LLM-based simulator for scoring subjective metrics such as fatigue, and results from evaluating 18 models. No equations, derivations, fitted parameters, or mathematical claims are present that could reduce to inputs by construction. The simulator is presented as an explicit methodological choice for assessing experiential quality, not as a derived or self-defined quantity. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is a standard benchmark study whose central claims rest on the reported evaluations rather than any circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the LLM-based simulator accurately measures subjective experiential quality, including fatigue.