COMPASS: Benchmarking Constrained Optimization in LLM Agents
Pith reviewed 2026-05-18 08:41 UTC · model grok-4.3
The pith
LLM agents satisfy constraints in travel planning at 70-90% rates but optimize user utility only 20-60% of the time because they explore too little of the solution space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the COMPASS benchmark, LLM agents must engage in multi-turn conversations to gather task details and use tools to query a database, after which they propose a travel plan that satisfies all hard constraints while optimizing the user's utility objective. State-of-the-art models achieve 70-90 percent feasibility but only 20-60 percent optimality. Tool use is not the limiting factor; the core limitation is insufficient exploration of the search space, and performance correlates strongly with the amount of information gathered during interaction. Coding agents show promise in reducing this feasible-optimal gap.
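The distinction between feasibility and optimality can be made concrete with a small sketch. The `Plan` fields, the single budget constraint, and the use of `rating` as the utility are illustrative assumptions; the paper's own constraint set and scoring may be defined differently.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Plan:
    cost: float    # hypothetical field: total price of the package
    rating: float  # hypothetical field: utility the user wants maximized


def is_feasible(plan: Plan, budget: float) -> bool:
    """A plan is feasible when it satisfies every hard constraint (here: budget only)."""
    return plan.cost <= budget


def is_optimal(plan: Plan, candidates: Sequence[Plan], budget: float) -> bool:
    """A plan is optimal when it is feasible and no feasible candidate has higher utility."""
    if not is_feasible(plan, budget):
        return False
    best = max(c.rating for c in candidates if is_feasible(c, budget))
    return plan.rating >= best


def rates(episodes: Sequence[Tuple[Plan, Sequence[Plan], float]]) -> Tuple[float, float]:
    """Return (feasibility rate, optimality rate) over (proposed, candidates, budget) triples."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    feasible = sum(is_feasible(p, b) for p, _, b in episodes)
    optimal = sum(is_optimal(p, c, b) for p, c, b in episodes)
    return feasible / n, optimal / n


# Toy database: the 850 option has the best rating but breaks an 800 budget.
database = [Plan(700.0, 4.0), Plan(780.0, 4.6), Plan(850.0, 4.9)]
episodes = [(proposed, database, 800.0) for proposed in database]
print(rates(episodes))
```

In this toy setup two of the three proposals are feasible but only one is optimal, which is the shape of gap the benchmark reports at scale.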
What carries the argument
The COMPASS benchmark of travel planning tasks that require multi-turn user interaction, tool-based database queries, constraint satisfaction, and utility optimization.
If this is right
- Better exploration strategies are required for LLM agents to move from feasible to near-optimal solutions in constrained tasks.
- The amount of information gathered through interaction directly predicts success in balancing constraints and utility.
- Coding agents provide one practical route to improve exploration and narrow the observed optimality shortfall.
- Deployed agents for tasks such as scheduling or shopping will need explicit mechanisms to search solution spaces more thoroughly.
Where Pith is reading between the lines
- If the exploration deficit appears across other constrained domains, then general methods for prompting or training wider search may improve utility optimization in many agent applications.
- Extending COMPASS-style evaluations to non-travel tasks such as resource allocation would test whether the feasible-optimal gap is domain-specific or general.
- Measuring exploration depth via metrics like number of distinct candidates considered could become a standard diagnostic for agent performance on optimization problems.
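One way to read the exploration-depth diagnostic in the last bullet is as a count of distinct candidate IDs returned by an agent's tool calls. The sketch below assumes each tool result can be reduced to a list of candidate IDs; the function names and the coverage ratio are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set


def explored_ids(tool_results: Iterable[Sequence[str]]) -> Set[str]:
    """Union of all candidate IDs the agent retrieved across its tool calls."""
    seen: Set[str] = set()
    for ids in tool_results:
        seen.update(ids)
    return seen


def exploration_depth(tool_results: Iterable[Sequence[str]]) -> int:
    """Number of distinct candidates the agent actually looked at."""
    return len(explored_ids(tool_results))


def feasible_coverage(tool_results: Iterable[Sequence[str]],
                      feasible_ids: Iterable[str]) -> float:
    """Fraction of the feasible set that appeared in at least one tool result."""
    feasible = set(feasible_ids)
    if not feasible:
        return 0.0
    return len(explored_ids(tool_results) & feasible) / len(feasible)


# Hypothetical trace: three tool calls, one of which returned nothing.
trace = [["H1", "H2"], ["H2", "H3"], []]
print(exploration_depth(trace))
print(feasible_coverage(trace, ["H2", "H3", "H4", "H5"]))
```

Depth alone rewards broad but irrelevant queries, so pairing it with coverage of the feasible set gives a more targeted signal.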
Load-bearing premise
The travel planning tasks, constraint definitions, and utility objectives chosen for COMPASS are representative of the broader class of constrained optimization problems that LLM agents would encounter in deployed real-world settings.
What would settle it
A controlled experiment in which models are forced to generate and evaluate a fixed larger number of candidate solutions on the same COMPASS tasks and still show no rise in optimality rates would falsify the claim that insufficient exploration is the primary cause of the gap.
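The design of such an experiment can be sketched with a simulated agent. Here a random sampler stands in for the LLM and the tasks are synthetic, so the numbers it prints say nothing about real models. The sketch only shows the protocol: fix the number of candidates evaluated, `k`, and measure the optimality rate as `k` grows. All names and the task generator are assumptions.

```python
import random
from typing import List, Optional, Tuple

Candidate = Tuple[float, float]        # (cost, utility)
Task = Tuple[List[Candidate], float]   # (candidate pool, budget)


def best_of_k(pool: List[Candidate], budget: float, k: int,
              rng: random.Random) -> Optional[Candidate]:
    """Evaluate k distinct candidates and return the best feasible one, if any."""
    sampled = rng.sample(pool, min(k, len(pool)))
    feasible = [c for c in sampled if c[0] <= budget]
    return max(feasible, key=lambda c: c[1], default=None)


def optimality_rate(tasks: List[Task], k: int, seed: int = 0) -> float:
    """Share of tasks where the forced-k search returns a utility-optimal plan.

    Assumes every task has at least one feasible candidate.
    """
    rng = random.Random(seed)
    hits = 0
    for pool, budget in tasks:
        best = max(u for cost, u in pool if cost <= budget)
        choice = best_of_k(pool, budget, k, rng)
        hits += choice is not None and choice[1] == best
    return hits / len(tasks)


def make_tasks(n_tasks: int = 50, pool_size: int = 20, seed: int = 1) -> List[Task]:
    """Synthetic tasks; one cheap candidate is appended so each task is feasible."""
    rng = random.Random(seed)
    tasks: List[Task] = []
    for _ in range(n_tasks):
        pool = [(rng.uniform(300.0, 1200.0), rng.random()) for _ in range(pool_size)]
        pool.append((500.0, 0.5))
        tasks.append((pool, 800.0))
    return tasks


tasks = make_tasks()
for k in (1, 5, 10, 21):
    print(k, optimality_rate(tasks, k))
```

If real agents forced to the largest `k` still failed to improve, the exploration explanation would be in trouble; in this simulation exhaustive evaluation reaches the optimum by construction.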
Original abstract
Human decision-making often involves constrained optimization. As LLM agents are deployed to assist with real-world tasks like travel planning, shopping, and scheduling, they must mirror this capability. We introduce COMPASS, a benchmark that evaluates whether LLM agents can perform constrained optimization in realistic travel planning settings. To succeed in these tasks, agents must engage in multi-turn conversations with the user to gather task information and use tools to gather information from the database. Agents must then propose a solution that not only satisfies hard constraints but also optimizes the user's utility objective. Evaluating state-of-the-art models, we reveal a significant feasible-optimal gap: while models achieve 70-90% feasibility (constraint satisfaction), they reach only 20-60% optimality (utility optimization). Our analysis shows that tool use is not the bottleneck. Instead, the core limitation is insufficient exploration of the search space, with success strongly correlating with information gathered. Coding agents offer a promising approach to mitigate this gap. Together, COMPASS provides a testbed for developing LLM agents that can truly mirror human decision-making by both satisfying constraints and optimizing objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COMPASS, a benchmark for LLM agents performing constrained optimization in multi-turn travel planning tasks. Agents must gather information via conversation and tools, then output solutions satisfying hard constraints (budget, time, preferences) while maximizing a scalar utility objective. Evaluations of state-of-the-art models reveal a feasible-optimal gap (70-90% feasibility vs. 20-60% optimality), with analysis attributing the gap primarily to insufficient search-space exploration rather than tool-use failures; coding agents are shown to narrow the gap.
Significance. If the central empirical findings hold under the stated task definitions, COMPASS supplies a reproducible, multi-turn benchmark that isolates exploration as a bottleneck in LLM agents for constrained decision-making. The reported correlation between information gathered and optimality success offers a concrete, falsifiable signal for future agent work. The benchmark's emphasis on both constraint satisfaction and utility optimization fills a gap between existing planning and tool-use evaluations.
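The correlation signal mentioned here could be checked with a plain Pearson coefficient between a per-episode count of information gathered and a 0/1 optimality outcome (the point-biserial case). The data below are invented for illustration, and the choice of "number of distinct candidates retrieved" as the information measure is an assumption.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with a 0/1 variable this is the point-biserial coefficient."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical episodes: candidates retrieved, and whether the final plan was optimal.
info_gathered = [2, 3, 5, 8, 9, 12]
was_optimal = [0, 0, 0, 1, 1, 1]
print(pearson(info_gathered, was_optimal))
```

A positive coefficient on real logs would match the reported pattern, though correlation alone would not show that gathering more information causes better plans.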
major comments (2)
- [§3] §3 (Benchmark Design): The central claim that the feasible-optimal gap stems from insufficient exploration (rather than domain-specific artifacts) rests on the untested assumption that travel-planning tasks are representative of broader constrained optimization. No cross-domain validation or argument is provided showing that the search-space geometry, constraint density, or information-gathering cost in travel planning generalizes to domains such as job-shop scheduling or supply-chain allocation. A minimal fix would be to add at least one additional domain with materially different feasible-set structure and report whether the same exploration diagnosis holds.
- [§5] §5 (Experimental Analysis): The statement that 'tool use is not the bottleneck' is load-bearing for the diagnosis yet lacks quantitative controls. The paper should report per-model tool-call success rates, the fraction of failures attributable to tool errors versus exploration, and an ablation that forces exhaustive tool use to isolate whether exploration remains the dominant factor once tool access is perfect.
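The controls requested in the second major comment could take the following shape. The `Episode` fields and the four-way failure taxonomy are hypothetical; they show one way to separate tool errors from exploration failures and are not the authors' analysis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    tool_calls_ok: Tuple[bool, ...]  # one flag per tool call: did it execute correctly?
    optimum_retrieved: bool          # did any tool result contain the optimal plan?
    optimum_proposed: bool           # did the agent finally recommend the optimal plan?


def attribute(ep: Episode) -> str:
    """Assign each episode to exactly one outcome label (assumed taxonomy)."""
    if ep.optimum_proposed:
        return "optimal"
    if not all(ep.tool_calls_ok):
        return "tool_error"
    if not ep.optimum_retrieved:
        return "insufficient_exploration"
    return "selection_error"


def breakdown(episodes: Sequence[Episode]) -> Dict[str, float]:
    """Fraction of episodes falling under each outcome label."""
    counts = Counter(attribute(ep) for ep in episodes)
    return {label: count / len(episodes) for label, count in counts.items()}


def tool_call_success_rate(episodes: Sequence[Episode]) -> float:
    """Share of individual tool calls that executed correctly."""
    calls = [ok for ep in episodes for ok in ep.tool_calls_ok]
    return sum(calls) / len(calls) if calls else 0.0


# Hypothetical log with one episode per outcome.
log = [
    Episode((True, True), True, True),
    Episode((True, False), False, False),
    Episode((True, True), False, False),
    Episode((True, True), True, False),
]
print(breakdown(log))
print(tool_call_success_rate(log))
```

Checking tool errors before exploration means an episode with both problems is counted as a tool error, which is the conservative choice when testing the claim that tool use is not the bottleneck.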
minor comments (2)
- [Table 1] Table 1 and §4.2: Clarify whether the reported feasibility and optimality percentages are macro-averages across tasks or micro-averages across runs; include standard deviations or confidence intervals.
- [§2] §2 (Related Work): The discussion of prior agent benchmarks omits several recent works on constrained planning and multi-objective optimization; add citations to ensure the novelty claim is fully contextualized.
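The macro versus micro distinction raised in the first minor comment matters whenever tasks have unequal numbers of runs. The sketch below computes both and adds a Wilson score interval as one common choice of confidence interval for a proportion. The per-task counts are made up, and the paper may aggregate differently.

```python
from math import sqrt
from typing import Sequence, Tuple

TaskRuns = Tuple[int, int]  # (successful runs, total runs) for one task


def macro_average(tasks: Sequence[TaskRuns]) -> float:
    """Mean of per-task success rates: every task counts equally."""
    return sum(s / n for s, n in tasks) / len(tasks)


def micro_average(tasks: Sequence[TaskRuns]) -> float:
    """Pooled success rate: every run counts equally."""
    return sum(s for s, _ in tasks) / sum(n for _, n in tasks)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical results: task A solved in 1 of 2 runs, task B in 8 of 8 runs.
results = [(1, 2), (8, 8)]
print(macro_average(results))
print(micro_average(results))
print(wilson_interval(9, 10))
```

With these counts the two averages differ noticeably, which is why the report asks the authors to state which one Table 1 uses.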
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate while maintaining the scope and focus of the current work.
Referee: [§3] §3 (Benchmark Design): The central claim that the feasible-optimal gap stems from insufficient exploration (rather than domain-specific artifacts) rests on the untested assumption that travel-planning tasks are representative of broader constrained optimization. No cross-domain validation or argument is provided showing that the search-space geometry, constraint density, or information-gathering cost in travel planning generalizes to domains such as job-shop scheduling or supply-chain allocation. A minimal fix would be to add at least one additional domain with materially different feasible-set structure and report whether the same exploration diagnosis holds.
Authors: We acknowledge that explicit cross-domain validation would strengthen claims of broader applicability. COMPASS was intentionally scoped to travel planning as a representative real-world task that integrates multi-turn user interaction, tool-based information retrieval, hard constraints, and scalar utility optimization—challenges that recur across constrained decision-making settings. The discrete choice structure with budget/time constraints and information-gathering costs shares geometric properties with other domains. To address the concern, we will add a dedicated paragraph in §3 explaining these structural similarities and the rationale for the chosen domain, while explicitly noting the absence of cross-domain experiments as a limitation. A full additional domain implementation exceeds the resources available for this revision. revision: partial
Referee: [§5] §5 (Experimental Analysis): The statement that 'tool use is not the bottleneck' is load-bearing for the diagnosis yet lacks quantitative controls. The paper should report per-model tool-call success rates, the fraction of failures attributable to tool errors versus exploration, and an ablation that forces exhaustive tool use to isolate whether exploration remains the dominant factor once tool access is perfect.
Authors: We agree that additional quantitative controls will make the analysis more rigorous. Our current results already indicate high tool-call success rates and that most optimality failures occur despite successful tool use, but we will expand §5 to include explicit per-model tool-call success rates and a breakdown of failure attributions (tool errors versus insufficient exploration). For the suggested ablation with forced exhaustive tool use, we will attempt to incorporate preliminary results from modified agent runs in the revision; if computational limits prevent full execution, we will report the available data and discuss the ablation as future work. revision: partial
- Implementation and evaluation of at least one additional domain with materially different feasible-set structure (e.g., job-shop scheduling).
Circularity Check
No significant circularity in this empirical benchmark paper
This is an empirical benchmark paper that defines travel-planning tasks, evaluates state-of-the-art LLM agents on them through direct model runs, and reports observed feasibility and optimality rates plus correlations with information gathered. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claims about the feasible-optimal gap and exploration limitations are grounded in the experimental results on the chosen tasks rather than reducing to any input by construction, satisfying the criteria for a self-contained benchmark evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Travel planning scenarios with hard constraints and user utility objectives are representative of real-world constrained optimization problems for LLM agents.
invented entities (1)
- COMPASS benchmark: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We introduce COMPASS... agents must satisfy hard constraints while simultaneously optimizing soft user preferences... acceptable rate... optimal rate... feasible-optimal gap"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Task Levels... Level III (Hotel, Flight, Permit)... plan-coordination gap"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
  TravelBench is a new benchmark with three subtasks and ten cached real-world tools to evaluate LLM agents on realistic multi-turn travel planning and capability boundaries.
- MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs
  MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.