
arxiv: 2510.07043 · v2 · submitted 2025-10-08 · 💻 cs.LG

COMPASS: Benchmarking Constrained Optimization in LLM Agents

Pith reviewed 2026-05-18 08:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM agents · constrained optimization · benchmark · feasibility · optimality · travel planning · search space exploration · tool use

The pith

LLM agents satisfy constraints in travel planning at 70-90% rates but optimize user utility only 20-60% of the time because they explore too little of the solution space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMPASS as a benchmark for testing whether LLM agents can carry out constrained optimization in realistic settings such as travel planning. Agents must collect information through multi-turn conversations with users and tool calls to a database, then output plans that meet hard constraints while maximizing a utility objective. Evaluations of current models reveal a clear gap: high feasibility rates but much lower optimality. The analysis identifies insufficient exploration of the search space as the central bottleneck rather than difficulties with tool use. Success tracks closely with how much information agents gather, and coding agents appear to help close the gap by supporting broader search.

Core claim

In the COMPASS benchmark, LLM agents must engage in multi-turn conversations to gather task details and use tools to query a database, after which they propose a travel plan that satisfies all hard constraints while optimizing the user's utility objective. State-of-the-art models achieve 70-90 percent feasibility but only 20-60 percent optimality. Tool use is not the limiting factor; the core limitation is insufficient exploration of the search space, and performance correlates strongly with the amount of information gathered during interaction. Coding agents show promise in reducing this feasible-optimal gap.
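
To make the task structure concrete, here is a minimal sketch of constrained optimization by exhaustive search over a toy candidate set: filter by hard constraints, then maximize utility. The `Plan` fields, the constraint thresholds, and the utility function are hypothetical illustrations, not the COMPASS schema or its tools.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Plan:
    # Hypothetical fields for illustration; not the COMPASS schema.
    name: str
    total_cost: float
    pet_friendly: bool
    rating: float


def best_feasible(
    plans: Iterable[Plan],
    constraints: Sequence[Callable[[Plan], bool]],
    utility: Callable[[Plan], float],
) -> Optional[Plan]:
    """Return the highest-utility plan that satisfies every hard constraint."""
    feasible = [p for p in plans if all(c(p) for c in constraints)]
    return max(feasible, key=utility, default=None)


plans = [
    Plan("A", 850.0, True, 4.8),   # violates the budget constraint
    Plan("B", 780.0, True, 4.1),   # feasible
    Plan("C", 640.0, True, 4.5),   # feasible, higher utility than B
    Plan("D", 600.0, False, 4.9),  # violates the pet constraint
]
constraints = [lambda p: p.total_cost <= 800, lambda p: p.pet_friendly]
best = best_feasible(plans, constraints, utility=lambda p: p.rating)
print(best.name if best else "no feasible plan")
```

In this toy setup, an agent that stops at plan B has produced a feasible answer but not the optimal one, which is the kind of outcome the feasible-optimal gap describes.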

What carries the argument

The COMPASS benchmark of travel planning tasks that require multi-turn user interaction, tool-based database queries, constraint satisfaction, and utility optimization.

If this is right

  • Better exploration strategies are required for LLM agents to move from feasible to near-optimal solutions in constrained tasks.
  • The amount of information gathered through interaction directly predicts success in balancing constraints and utility.
  • Coding agents provide one practical route to improve exploration and narrow the observed optimality shortfall.
  • Deployed agents for tasks such as scheduling or shopping will need explicit mechanisms to search solution spaces more thoroughly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the exploration deficit appears across other constrained domains, then general methods for prompting or training wider search may improve utility optimization in many agent applications.
  • Extending COMPASS-style evaluations to non-travel tasks such as resource allocation would test whether the feasible-optimal gap is domain-specific or general.
  • Measuring exploration depth via metrics like number of distinct candidates considered could become a standard diagnostic for agent performance on optimization problems.
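
One way such an exploration-depth diagnostic might be computed is sketched below. It counts distinct tool queries in an agent's call log as a proxy for distinct candidates considered. The log format and the tool names (`search_flights`, `search_hotels`) are assumptions made for illustration, not the benchmark's actual interface.

```python
import json


def distinct_queries(tool_calls):
    """Count distinct (tool, arguments) pairs in a log of tool calls."""
    seen = set()
    for call in tool_calls:
        # Canonicalize arguments so key order does not create false distinctions.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        seen.add(key)
    return len(seen)


# Hypothetical log: the second call repeats the first with reordered keys.
log = [
    {"tool": "search_flights", "args": {"date": "2025-08-15", "max_price": 500}},
    {"tool": "search_flights", "args": {"max_price": 500, "date": "2025-08-15"}},
    {"tool": "search_flights", "args": {"date": "2025-08-16", "max_price": 500}},
    {"tool": "search_hotels", "args": {"date": "2025-08-15"}},
]
print(distinct_queries(log))
```

Counting queries rather than returned items is a deliberate simplification; a fuller diagnostic would probably also count the distinct options each query returns.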

Load-bearing premise

The travel planning tasks, constraint definitions, and utility objectives chosen for COMPASS are representative of the broader class of constrained optimization problems that LLM agents would encounter in deployed real-world settings.

What would settle it

A controlled experiment in which models are forced to generate and evaluate a fixed larger number of candidate solutions on the same COMPASS tasks and still show no rise in optimality rates would falsify the claim that insufficient exploration is the primary cause of the gap.
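
As a rough reference point for such an experiment, consider an idealized baseline (an assumption for illustration, not a model from the paper): if an agent evaluated k feasible candidates drawn independently and uniformly at random, and ties are ignored, the chance that its best candidate lands in the top fraction q of feasible solutions is 1 - (1 - q)^k. The sketch below tabulates this for q = 0.10, the threshold that Figure 2's caption uses for the optimal rate.

```python
def p_best_in_top(k: int, q: float = 0.10) -> float:
    """Probability that the best of k uniform random draws is in the top-q fraction.

    Idealized baseline only: assumes independent uniform sampling and no ties.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return 1.0 - (1.0 - q) ** k


for k in (1, 5, 10, 20, 40):
    print(f"k={k:>2}  P(best in top 10%) = {p_best_in_top(k):.3f}")
```

If forced evaluation of more candidates left measured optimality far below even this random-sampling curve, that would point away from exploration as the main bottleneck.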

Figures

Figures reproduced from arXiv: 2510.07043 by Bowen Jin, Felix Bai, Hema Swetha Koppula, Jiarui Lu, Meng Cao, Mert Cemri, Raviteja Vemulapalli, Tian Qin, Ting-Yao Hu, Zhiyang Xu, Zirui Wang.

Figure 1: COMPASS benchmark framework. The environment integrates three key components for realistic evaluation of agentic capabilities in travel planning. (A) A modular LLM-based user simulator enables controllable multi-turn interactions, progressive constraint revelation, and diverse user personas. (B) We formalize travel planning as constrained preference optimization, where agents must satisfy hard constraints …
Figure 2: COMPASS benchmark main results. Acceptable rate measures feasibility (satisfying all hard constraints). Optimal rate measures preference optimization (achieving utility within the top 10% of feasible solutions). All models show a ∼20% gap between high acceptable rates and low optimal rates, revealing that agents settle for feasible solutions rather than optimizing preferences. Encouragingly, open-source m…
Figure 3: Examples of task types and dynamic user simulator prompt. (A) Two task types are defined based on the soft preference-optimization objective. Each task type includes hard constraints but differs in optimization objective (Sec. 3.2). (B) The dynamic LLM user simulator prompt (Sec. 3.4) controls multi-turn conversation dynamics. The system prompt consists of static instructions (orange), fixed for each conve…
Figure 4: Performance breakdown across benchmark dimensions. (A) Performance degrades with increasing plan coordination complexity (Levels II–III), with open-source models showing especially steep declines (green). (B) Constraint satisfaction rates drop as the number of hard constraints increases, with only the strongest models handling 8+ constraints reliably. (C) Preference optimization weakens as search complexit…
Figure 5: Case study of tool calls and reasoning traces. Top: Prompt given to models for a Level II task, with explicit reasoning requested. Bottom: GPT-5 (left) demonstrates strategic planning by avoiding weekends, systematically exploring date ranges, and using optional parameters (e.g., price filters) to narrow searches. Claude-Sonnet-4 (right) applies optional parameters but searches only two arbitrary dates wit…
Figure 6: Task level breakdown performances. Full result for …
Figure 7: Conversation efficiency versus Acceptable Rate. Additional results for …
Figure 8: Conversation efficiency controls. Top: Distribution of user turn t* when all task information is revealed, showing consistent user behavior across models. Bottom: Box plots of overall conversation length, illustrating efficiency differences across agent models. Colors indicate model categories (e.g., open-source, closed-source).
Figure 9: Distribution of total constraints (initial + progressive) across benchmark tasks.
Figure 10: Distribution of optimization objectives for continuous metric tasks.
Figure 11: Distribution of target attribute counts for feature count maximization tasks.
Original abstract

Human decision-making often involves constrained optimization. As LLM agents are deployed to assist with real-world tasks like travel planning, shopping, and scheduling, they must mirror this capability. We introduce COMPASS, a benchmark that evaluates whether LLM agents can perform constrained optimization in realistic travel planning settings. To succeed in these tasks, agents must engage in multi-turn conversations with users to gather task information and use tools to gather information from the database. Agents must then propose a solution that not only satisfies hard constraints but also optimizes the user's utility objective. Evaluating state-of-the-art models, we reveal a significant feasible-optimal gap: while models achieve 70-90% feasibility (constraint satisfaction), they reach only 20-60% optimality (utility optimization). Our analysis shows that tool use is not the bottleneck. Instead, the core limitation is insufficient exploration of the search space, with success strongly correlating with information gathered. Coding agents show a promising approach to mitigate this gap. Together, COMPASS provides a testbed for developing LLM agents that can truly mirror human decision-making by both satisfying constraints and optimizing objectives.
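
The two headline metrics can be illustrated with a short sketch. Figure 2's caption defines the acceptable rate as satisfying all hard constraints and the optimal rate as achieving utility within the top 10% of feasible solutions. The code below assumes a rank-based reading of "top 10%" and a hypothetical per-task record format; neither is confirmed by the source, and the numbers are toy data.

```python
def is_optimal(utility, feasible_utilities, top_percent=10):
    """True if `utility` reaches the rank cutoff of the top `top_percent` of the pool."""
    ranked = sorted(feasible_utilities, reverse=True)
    if not ranked:
        return False
    # Integer ceiling of top_percent% of the pool size, with at least one slot.
    n_top = max(1, (top_percent * len(ranked) + 99) // 100)
    return utility >= ranked[n_top - 1]


def rates(results):
    """Return (acceptable_rate, optimal_rate) over a list of per-task records."""
    if not results:
        raise ValueError("results must be non-empty")
    acceptable = sum(1 for r in results if r["feasible"])
    optimal = sum(
        1
        for r in results
        if r["feasible"] and is_optimal(r["utility"], r["feasible_utilities"])
    )
    return acceptable / len(results), optimal / len(results)


pool = [float(u) for u in range(1, 11)]  # toy utilities of all feasible plans
results = [
    {"feasible": True, "utility": 10.0, "feasible_utilities": pool},  # optimal
    {"feasible": True, "utility": 7.0, "feasible_utilities": pool},   # feasible only
    {"feasible": True, "utility": 9.0, "feasible_utilities": pool},   # feasible only
    {"feasible": False, "utility": 0.0, "feasible_utilities": pool},  # infeasible
]
print(rates(results))
```

An optimal plan is always feasible under this definition, so the optimal rate can never exceed the acceptable rate; the gap between the two is what the abstract calls the feasible-optimal gap.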

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces COMPASS, a benchmark for LLM agents performing constrained optimization in multi-turn travel planning tasks. Agents must gather information via conversation and tools, then output solutions satisfying hard constraints (budget, time, preferences) while maximizing a scalar utility objective. Evaluations of state-of-the-art models reveal a feasible-optimal gap (70-90% feasibility vs. 20-60% optimality), with analysis attributing the gap primarily to insufficient search-space exploration rather than tool-use failures; coding agents are shown to narrow the gap.

Significance. If the central empirical findings hold under the stated task definitions, COMPASS supplies a reproducible, multi-turn benchmark that isolates exploration as a bottleneck in LLM agents for constrained decision-making. The reported correlation between information gathered and optimality success offers a concrete, falsifiable signal for future agent work. The benchmark's emphasis on both constraint satisfaction and utility optimization fills a gap between existing planning and tool-use evaluations.

major comments (2)
  1. [§3] §3 (Benchmark Design): The central claim that the feasible-optimal gap stems from insufficient exploration (rather than domain-specific artifacts) rests on the untested assumption that travel-planning tasks are representative of broader constrained optimization. No cross-domain validation or argument is provided showing that the search-space geometry, constraint density, or information-gathering cost in travel planning generalizes to domains such as job-shop scheduling or supply-chain allocation. A minimal fix would be to add at least one additional domain with materially different feasible-set structure and report whether the same exploration diagnosis holds.
  2. [§5] §5 (Experimental Analysis): The statement that 'tool use is not the bottleneck' is load-bearing for the diagnosis yet lacks quantitative controls. The paper should report per-model tool-call success rates, the fraction of failures attributable to tool errors versus exploration, and an ablation that forces exhaustive tool use to isolate whether exploration remains the dominant factor once tool access is perfect.
minor comments (2)
  1. [Table 1] Table 1 and §4.2: Clarify whether the reported feasibility and optimality percentages are macro-averages across tasks or micro-averages across runs; include standard deviations or confidence intervals.
  2. [§2] §2 (Related Work): The discussion of prior agent benchmarks omits several recent works on constrained planning and multi-objective optimization; add citations to ensure the novelty claim is fully contextualized.
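
The averaging distinction raised in minor comment 1 can change the reported numbers. The sketch below uses toy data and a hypothetical `runs_by_task` layout (not taken from the paper) to show how the two averages diverge when tasks have unequal numbers of runs.

```python
def micro_average(runs_by_task):
    """Pool every run, then average: tasks with more runs weigh more."""
    all_runs = [ok for runs in runs_by_task.values() for ok in runs]
    return sum(all_runs) / len(all_runs)


def macro_average(runs_by_task):
    """Average within each task, then across tasks: every task weighs the same."""
    per_task = [sum(runs) / len(runs) for runs in runs_by_task.values()]
    return sum(per_task) / len(per_task)


# Toy data: True means the run produced an optimal plan.
runs_by_task = {
    "task_1": [True, True, True, True],  # 4 runs, all optimal
    "task_2": [False, False],            # 2 runs, none optimal
}
print(f"micro = {micro_average(runs_by_task):.3f}")
print(f"macro = {macro_average(runs_by_task):.3f}")
```

With equal run counts per task the two averages coincide, so stating the number of runs per task would also resolve the ambiguity.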

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate while maintaining the scope and focus of the current work.

point-by-point responses
  1. Referee: [§3] §3 (Benchmark Design): The central claim that the feasible-optimal gap stems from insufficient exploration (rather than domain-specific artifacts) rests on the untested assumption that travel-planning tasks are representative of broader constrained optimization. No cross-domain validation or argument is provided showing that the search-space geometry, constraint density, or information-gathering cost in travel planning generalizes to domains such as job-shop scheduling or supply-chain allocation. A minimal fix would be to add at least one additional domain with materially different feasible-set structure and report whether the same exploration diagnosis holds.

    Authors: We acknowledge that explicit cross-domain validation would strengthen claims of broader applicability. COMPASS was intentionally scoped to travel planning as a representative real-world task that integrates multi-turn user interaction, tool-based information retrieval, hard constraints, and scalar utility optimization—challenges that recur across constrained decision-making settings. The discrete choice structure with budget/time constraints and information-gathering costs shares geometric properties with other domains. To address the concern, we will add a dedicated paragraph in §3 explaining these structural similarities and the rationale for the chosen domain, while explicitly noting the absence of cross-domain experiments as a limitation. A full additional domain implementation exceeds the resources available for this revision. revision: partial

  2. Referee: [§5] §5 (Experimental Analysis): The statement that 'tool use is not the bottleneck' is load-bearing for the diagnosis yet lacks quantitative controls. The paper should report per-model tool-call success rates, the fraction of failures attributable to tool errors versus exploration, and an ablation that forces exhaustive tool use to isolate whether exploration remains the dominant factor once tool access is perfect.

    Authors: We agree that additional quantitative controls will make the analysis more rigorous. Our current results already indicate high tool-call success rates and that most optimality failures occur despite successful tool use, but we will expand §5 to include explicit per-model tool-call success rates and a breakdown of failure attributions (tool errors versus insufficient exploration). For the suggested ablation with forced exhaustive tool use, we will attempt to incorporate preliminary results from modified agent runs in the revision; if computational limits prevent full execution, we will report the available data and discuss the ablation as future work. revision: partial
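
A breakdown of the kind discussed in this exchange could be tabulated roughly as follows. The episode record format and the labeling rule (a non-optimal episode counts as `tool_error` if any tool call failed, otherwise `no_tool_error`) are hypothetical simplifications, not the authors' method, and the data are invented for illustration.

```python
from collections import Counter


def tool_call_success_rate(episodes):
    """Fraction of tool calls that succeeded across all episodes."""
    total = sum(ep["tool_calls"] for ep in episodes)
    errors = sum(ep["tool_errors"] for ep in episodes)
    return (total - errors) / total if total else 0.0


def attribute_failures(episodes):
    """Return the share of non-optimal episodes with and without tool errors."""
    labels = Counter()
    for ep in episodes:
        if ep["optimal"]:
            continue  # only failures are attributed
        label = "tool_error" if ep["tool_errors"] > 0 else "no_tool_error"
        labels[label] += 1
    total = sum(labels.values())
    return {k: v / total for k, v in labels.items()} if total else {}


# Toy episodes for a single hypothetical model.
episodes = [
    {"optimal": True, "tool_calls": 10, "tool_errors": 0},
    {"optimal": False, "tool_calls": 4, "tool_errors": 0},
    {"optimal": False, "tool_calls": 6, "tool_errors": 2},
    {"optimal": False, "tool_calls": 5, "tool_errors": 0},
]
print(tool_call_success_rate(episodes))
print(attribute_failures(episodes))
```

Failures labeled `no_tool_error` are only candidates for an exploration deficit; confirming that diagnosis would still require inspecting how much of the search space each episode covered.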

standing simulated objections not resolved
  • Implementation and evaluation of at least one additional domain with materially different feasible-set structure (e.g., job-shop scheduling).

Circularity Check

0 steps flagged

No significant circularity in this empirical benchmark paper

full rationale

This is an empirical benchmark paper that defines travel-planning tasks, evaluates state-of-the-art LLM agents on them through direct model runs, and reports observed feasibility and optimality rates plus correlations with information gathered. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claims about the feasible-optimal gap and exploration limitations are grounded in the experimental results on the chosen tasks rather than reducing to any input by construction, satisfying the criteria for a self-contained benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the realism of the travel-planning domain and the assumption that multi-turn tool use plus solution proposal captures constrained optimization; no free parameters are fitted in a modeling sense.

axioms (1)
  • domain assumption: Travel planning scenarios with hard constraints and user utility objectives are representative of real-world constrained optimization problems for LLM agents.
    Invoked when generalizing the benchmark results to broader agent capabilities.
invented entities (1)
  • COMPASS benchmark (no independent evidence)
    purpose: To evaluate LLM agents on constrained optimization via feasibility and optimality metrics.
    Newly constructed testbed introduced in this work.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

    cs.AI 2025-12 accept novelty 7.0

    TravelBench is a new benchmark with three subtasks and ten cached real-world tools to evaluate LLM agents on realistic multi-turn travel planning and capability boundaries.

  2. MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs

    cs.HC 2026-04 unverdicted novelty 6.0

    MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.
