Trip+: Benchmarking Agents in Personalized Interactive Travel Planning

Junle Chen; Kai Wang; Lei Wang; Wei Chen; Xiaofang Zhou; Yehong Xu; Yuqian Wu; Zhengjun Huang; Zhoujin Tian

arxiv: 2606.21169 · v1 · pith:PZIBQU7Bnew · submitted 2026-06-19 · 💻 cs.AI


Pith reviewed 2026-06-26 14:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords travel planning · language models · benchmarking · personalization · interactive agents · experiential quality · fatigue · itineraries


The pith

Language models generate feasible but exhausting travel itineraries that diverge from user preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Trip+, a benchmark for evaluating language model agents on personalized interactive travel planning across multiple turns with evolving preferences and disruptions. Agents must generate and revise minute-level itineraries conditioned on traveler profiles, with evaluation performed via an LLM-based simulator that scores subjective aspects such as fatigue. Testing 18 models reveals a consistent gap where technically feasible plans are selected even when they produce tiring experiences that sharply mismatch profiled preferences. The benchmark spans simple request resolutions to complex replanning driven by environment changes. This setup isolates the challenge of holistic, profile-aware planning beyond isolated feasibility or interaction metrics.

Core claim

Trip+ requires agents to produce and revise minute-level itineraries given traveler profiles and dynamic interactions, with end-to-end experiences scored by an LLM simulator on subjective metrics including fatigue. Evaluations across 18 LMs demonstrate a consistent gap in experiential quality, where models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.
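The distinction between a feasible plan and a tolerable one can be made concrete with a minimal Python sketch. This is not the paper's evaluation: the `Stop` type, the non-overlap feasibility rule, and the `max_active_minutes` profile limit are all hypothetical stand-ins, since the abstract does not say how feasibility or fatigue is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stop:
    """One itinerary entry at minute resolution (hypothetical schema)."""
    name: str
    start: int  # minutes after midnight
    end: int    # minutes after midnight


def is_feasible(stops: List[Stop]) -> bool:
    """Feasible here means only: every stop has positive length and none overlap."""
    ordered = sorted(stops, key=lambda s: s.start)
    if any(s.end <= s.start for s in ordered):
        return False
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))


def active_minutes(stops: List[Stop]) -> int:
    """Total scheduled minutes, used as a crude stand-in for load."""
    return sum(s.end - s.start for s in stops)


def exceeds_profile(stops: List[Stop], max_active_minutes: int) -> bool:
    """True when the plan schedules more activity than the profile tolerates."""
    return active_minutes(stops) > max_active_minutes


# A back-to-back day from 08:00 to 22:00 with no breaks.
plan = [
    Stop("temple", 480, 600),
    Stop("market", 600, 780),
    Stop("hike", 780, 1080),
    Stop("night tour", 1080, 1320),
]

feasible = is_feasible(plan)               # no overlaps, so this passes
too_much = exceeds_profile(plan, 360)      # assumed profile: at most 6 active hours
```

The plan passes the feasibility check and still exceeds the assumed profile limit, which is the pattern the core claim describes. A real fatigue measure would need far more than a minute count.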

What carries the argument

Trip+ benchmark with LLM-based simulator for assessing subjective experiential quality in profile-conditioned interactive itinerary planning.

If this is right

Agents require stronger mechanisms to incorporate and maintain profile preferences across interaction turns.
Evaluation of travel planning must extend beyond technical feasibility to include experiential costs like fatigue.
Performance gaps widen in complex replanning scenarios compared to simple request handling.
Current models systematically underweight long-term traveler comfort when optimizing itineraries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar simulator-driven evaluation could be applied to other multi-turn agent planning domains such as logistics or daily scheduling.
The observed preference divergence suggests limits in how models integrate and prioritize user profiles over multiple revisions.
Deployment of such agents in real travel services would likely need additional layers to detect and mitigate user fatigue.

Load-bearing premise

The LLM-based simulator provides a reliable proxy for subjective human judgments of fatigue and overall experiential quality.

What would settle it

A direct comparison study where human travelers rate fatigue and quality for the same set of generated itineraries and the scores are checked against simulator outputs.
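One way such a comparison could be scored is with a rank correlation between human and simulator ratings. The sketch below is a plain-Python Spearman coefficient with average ranks for ties. The rating lists are invented placeholders for illustration, not data from the paper, and the choice of Spearman over other agreement measures is an assumption.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder fatigue ratings for five itineraries (hypothetical values).
human_fatigue = [2, 4, 5, 1, 3]
simulator_fatigue = [1, 5, 4, 2, 3]
rho = spearman(human_fatigue, simulator_fatigue)
```

A coefficient near 1 would indicate that the simulator orders itineraries the way human raters do. It would not show that the two agree on absolute fatigue levels.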

Original abstract

Interactive travel planning has become a popular use case for language models. Agents are deployed to manage evolving preferences and unexpected disruptions over multiple turns. Such settings require models to make complex, profile-conditioned planning decisions. However, existing benchmarks often evaluate feasibility, personalization, or interaction in relatively isolated settings. We therefore introduce Trip+ to measure the ability of agents to plan travel holistically. In Trip+, given traveler profiles and dynamic interactions, agents must generate and revise minute-level itineraries. End-to-end traveler experiences are evaluated via an LLM-based simulator, enabling the assessment of subjective metrics like fatigue. Our scenarios range from simple request resolutions to complex environment-driven replanning. We evaluate 18 LMs and find a consistent gap in experiential quality. Models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Trip+ adds a benchmark for multi-turn personalized travel planning with LLM-scored experiential metrics, but the headline gap finding rests on an uncalibrated simulator.


The main thing here is a new benchmark called Trip+ that tries to test agents on realistic travel planning: traveler profiles, back-and-forth changes, minute-level itineraries, and then scoring things like fatigue through an LLM simulator. That combination of elements is not in the prior benchmarks mentioned in the abstract, so the setup itself is the clearest addition.

What works is the focus on end-to-end experience rather than just feasibility or single-turn personalization. Evaluating 18 models and noting that they produce technically ok but exhausting plans that ignore the profile is a concrete observation worth checking.

The soft spot is the simulator. The central claim about a consistent gap in experiential quality comes entirely from LLM judgments of subjective factors like fatigue. The abstract gives no correlation numbers with human raters, no inter-rater stats, and no ablation showing the simulator tracks actual preferences instead of its own biases. If the evaluator systematically dislikes dense but feasible schedules, the reported gap is an artifact. That is not a minor detail; it is load-bearing for the main result.

This is for people building or testing applied planning agents in consumer domains. A reader working on LLM benchmarks or travel-related applications could get value from the scenario design and the reported model behaviors, even if the numbers need re-checking.

It deserves a serious referee once the simulator validation is addressed or at least clearly documented. Without that, the empirical claims stay provisional.

Referee Report

2 major / 2 minor

Summary. The paper introduces Trip+, a benchmark for language model agents in personalized interactive travel planning. Given traveler profiles and multi-turn dynamic interactions, agents generate and revise minute-level itineraries; end-to-end experiences are scored by an LLM-based simulator on subjective metrics including fatigue. The authors evaluate 18 LMs and report a consistent gap: models produce technically feasible but exhausting itineraries that diverge from profiled preferences.

Significance. If the simulator is shown to track human judgments, the benchmark would offer a useful addition to existing feasibility- or personalization-focused evaluations by stressing holistic, profile-conditioned replanning under realistic disruptions.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation setup: the headline result (consistent gap in experiential quality) is obtained entirely from the LLM-based simulator scoring fatigue and related subjective metrics, yet the manuscript supplies no human calibration study, correlation coefficients, or inter-rater agreement data with human raters. This directly affects the validity of the reported gap.
[Methods] Methods (simulator description): without an ablation or sensitivity analysis showing that the simulator does not systematically over-penalize dense but feasible schedules, it remains possible that the observed divergence from traveler profiles is an artifact of the evaluator rather than agent behavior.
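For the inter-rater agreement the first comment asks about, one common statistic is Cohen's kappa between two raters. The sketch below is a plain-Python version for categorical labels. The label values are hypothetical, and because fatigue ratings are ordinal, a weighted variant may suit them better than this unweighted form.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(rater_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Hypothetical fatigue labels from two human raters on four itineraries.
rater_1 = ["low", "low", "high", "high"]
rater_2 = ["low", "high", "high", "high"]
kappa = cohens_kappa(rater_1, rater_2)
```

Agreement among human raters sets a ceiling for the simulator: it cannot reasonably be expected to agree with humans more than they agree with each other.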

minor comments (2)

[Evaluation] Clarify the exact prompting template and temperature settings used for the LLM simulator so that the evaluation is reproducible.
[Experiments] The abstract states '18 LMs' but does not list model names or sizes; add this information in the main text or a table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the Trip+ benchmark. We address each major comment below in a point-by-point manner.


Referee: [Abstract / Evaluation] Abstract and evaluation setup: the headline result (consistent gap in experiential quality) is obtained entirely from the LLM-based simulator scoring fatigue and related subjective metrics, yet the manuscript supplies no human calibration study, correlation coefficients, or inter-rater agreement data with human raters. This directly affects the validity of the reported gap.

Authors: We agree that the absence of a human calibration study, correlation coefficients, or inter-rater agreement data is a limitation in the current manuscript. The headline results on the experiential quality gap rely on the LLM-based simulator without direct validation against human judgments. In the revision, we will add an explicit discussion of this limitation in the evaluation section, including the simulator's design rationale based on profile-conditioned criteria, and note it as an important direction for future work. The consistency of the gap across 18 models provides some supporting evidence, but we recognize this does not substitute for human validation.

revision: partial

Referee: [Methods] Methods (simulator description): without an ablation or sensitivity analysis showing that the simulator does not systematically over-penalize dense but feasible schedules, it remains possible that the observed divergence from traveler profiles is an artifact of the evaluator rather than agent behavior.

Authors: The manuscript does not include an ablation or sensitivity analysis examining the simulator's response to schedule density. This is a valid concern, as it leaves open the possibility that evaluator artifacts contribute to the reported divergence. We will incorporate a sensitivity analysis in the methods section of the revised manuscript, testing the simulator on varied schedule densities while holding other factors constant, to address potential systematic biases.

revision: yes
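A density sweep of the kind described in this response might look like the following Python sketch. Everything here is assumed for illustration: the day window, the fixed 60-minute activity, and the `score_fatigue` callable that stands in for a call to the LLM simulator. The `toy_scorer` exists only so the sketch runs and says nothing about how the real simulator behaves.

```python
from typing import Callable, List, Sequence, Tuple

Activity = Tuple[int, int, str]  # (start_minute, end_minute, label)

DAY_START = 9 * 60   # 09:00, assumed window start
DAY_END = 21 * 60    # 21:00, assumed window end
DURATION = 60        # every activity lasts 60 minutes, held constant


def build_itinerary(n_activities: int) -> List[Activity]:
    """Spread n identical activities evenly across the fixed day window."""
    window = DAY_END - DAY_START
    if n_activities < 1 or n_activities * DURATION > window:
        raise ValueError("activities do not fit in the day window")
    gap = (window - n_activities * DURATION) // (n_activities + 1)
    itinerary = []
    t = DAY_START + gap
    for i in range(n_activities):
        itinerary.append((t, t + DURATION, f"museum visit {i + 1}"))
        t += DURATION + gap
    return itinerary


def density(itinerary: Sequence[Activity]) -> float:
    """Fraction of the day window occupied by activities."""
    busy = sum(end - start for start, end, _ in itinerary)
    return busy / (DAY_END - DAY_START)


def sensitivity_sweep(
    score_fatigue: Callable[[Sequence[Activity]], float],
    counts: Sequence[int],
) -> List[Tuple[float, float]]:
    """Score itineraries that differ only in density; return (density, score) pairs."""
    results = []
    for n in counts:
        itinerary = build_itinerary(n)
        results.append((density(itinerary), score_fatigue(itinerary)))
    return results


def toy_scorer(itinerary: Sequence[Activity]) -> float:
    """Placeholder only: a real run would query the LLM simulator here."""
    return float(len(itinerary))


sweep = sensitivity_sweep(toy_scorer, [3, 6, 12])
```

Because activity type and duration are fixed, any change in the returned scores can be attributed to density alone. Repeating the sweep across several traveler profiles would show whether the simulator's penalty depends on the profile or is applied uniformly.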

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential reductions


The paper introduces Trip+ as an empirical benchmark for evaluating language model agents on personalized interactive travel planning tasks. It describes scenarios, an LLM-based simulator for scoring subjective metrics such as fatigue, and results from evaluating 18 models. No equations, derivations, fitted parameters, or mathematical claims are present that could reduce to inputs by construction. The simulator is presented as an explicit methodological choice for assessing experiential quality, not as a derived or self-defined quantity. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is a standard benchmark study whose central claims rest on the reported evaluations rather than any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the central claim rests on the unstated assumption that the LLM simulator faithfully captures human-like fatigue and preference alignment without additional validation data.

axioms (1)

domain assumption: the LLM-based simulator accurately measures subjective experiential quality, including fatigue.
Invoked to enable end-to-end evaluation of itineraries; location: abstract description of evaluation method.



