Let's Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization
Pith reviewed 2026-05-13 20:40 UTC · model grok-4.3
The pith
Conversational optimization agents reach much higher solution quality than one-shot queries with the same model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through thousands of simulated conversations on a school scheduling problem, an optimization agent converges to higher-quality solutions when it interacts conversationally with role-playing stakeholder agents, each governed by an internal utility function, compared with one-shot queries. Tailored agents that incorporate domain-specific prompts and structured tools further increase solution quality and reduce the number of interactions needed relative to general-purpose chatbots.
What carries the argument
LLM-powered stakeholder agents that role-play decision-makers with fixed internal utility functions, generating conversations to evaluate how an optimization agent iteratively proposes, interprets, and refines solutions.
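The propose, react, refine loop behind this evaluation can be sketched as a toy. This is a minimal illustration under stated assumptions, not the paper's implementation: the hidden utility depends on a single start time, the stakeholder's reply is a one-word hint standing in for LLM-generated dialogue, and every name here (`hidden_utility`, `stakeholder_feedback`, `run_conversation`) is hypothetical.

```python
from typing import Tuple


def hidden_utility(start_minute: int, preferred: int = 480) -> float:
    """Toy stakeholder utility in [0, 1], peaking at the preferred start time.

    Times are minutes after midnight; the 120-minute scale is an arbitrary assumption.
    """
    return max(0.0, 1.0 - abs(start_minute - preferred) / 120)


def stakeholder_feedback(start_minute: int, preferred: int = 480) -> str:
    """Stand-in for dialogue: hint at a direction without revealing the utility."""
    if start_minute > preferred:
        return "earlier"
    if start_minute < preferred:
        return "later"
    return "END"


def run_conversation(initial: int, max_turns: int, step: int = 15) -> Tuple[int, float, int]:
    """Optimizer proposes, stakeholder reacts, optimizer refines.

    Returns (final proposal, its hidden utility, stakeholder turns used).
    With max_turns=0 this reduces to a one-shot evaluation of the initial proposal.
    """
    proposal = initial
    turns = 0
    for turns in range(1, max_turns + 1):
        reply = stakeholder_feedback(proposal)
        if reply == "END":
            break
        proposal += -step if reply == "earlier" else step
    return proposal, hidden_utility(proposal), turns


one_shot = run_conversation(initial=540, max_turns=0)
conversational = run_conversation(initial=540, max_turns=10)
print("one-shot:", one_shot)
print("conversational:", conversational)
```

Scoring both conditions with the same hidden function mirrors the comparison the paper describes. The toy also makes the load-bearing premise visible: the measured gain is only as meaningful as the utility function is representative.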
If this is right
- The same optimization agent produces better solutions when it receives iterative feedback rather than answering once.
- Domain-specific prompts and structured tools let tailored agents reach high-quality outcomes in fewer turns than general chatbots.
- Simulated multi-stakeholder conversations provide a scalable way to test agent designs without recruiting human subjects.
- Interactive agents expand the reach of optimization by handling clarification of objectives and trade-offs in natural dialogue.
Where Pith is reading between the lines
- Real optimization software could adopt persistent chat interfaces that maintain context across turns to improve practical usability.
- The same evaluation approach could be applied to other planning domains such as supply chain or workforce scheduling to check whether conversational gains generalize.
- If simulated utility functions diverge from human preferences, the measured quality improvements may overstate real-world benefits.
Load-bearing premise
That conversations and solution-quality judgments produced by LLM agents role-playing stakeholders with internal utilities accurately reflect real human decision-makers.
What would settle it
A direct comparison of final schedules and interaction patterns from the simulated conversations against outcomes from actual human stakeholders performing the same school scheduling task.
Original abstract
Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade-offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision-makers with optimization capabilities through interactive optimization agents that can propose, interpret and refine solutions. However, it is fundamentally harder to evaluate a conversation-based interaction than traditional one-shot approaches. This paper proposes a scalable and replicable methodology for evaluating optimization agents through conversations. We build LLM-powered decision agents that role-play diverse stakeholders, each governed by an internal utility function but communicating like a real decision-maker. We generate thousands of conversations in a school scheduling case study. Results show that one-shot evaluation is severely limiting: the same optimization agent converges to much higher-quality solutions through conversations. Then, this paper uses this methodology to demonstrate that tailored optimization agents, endowed with domain-specific prompts and structured tools, can lead to significant improvements in solution quality in fewer interactions, as compared to general-purpose chatbots. These findings provide evidence of the benefits of emerging solutions at the AI-optimization interface to expand the reach of optimization technologies in practice. They also uncover the impact of operations research expertise to facilitate interactive deployments through the design of effective and reliable optimization agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a scalable methodology for evaluating LLM-powered interactive optimization agents by generating thousands of simulated conversations in a school scheduling case study. Stakeholder agents are role-played with internal utility functions that shape both their dialogue and the ground-truth solution quality metric. Results indicate that the same optimization agent reaches substantially higher-quality solutions via multi-turn conversations than via one-shot evaluation, and that domain-tailored agents (with specialized prompts and tools) outperform general-purpose chatbots in both quality and interaction efficiency.
Significance. If the simulation faithfully captures real stakeholder behavior, the work supplies a replicable evaluation framework for interactive optimization systems and concrete evidence that conversational interfaces plus domain-specific agent design can materially improve solution quality and reduce interaction count. This would strengthen the case for embedding OR expertise into LLM agents to broaden practical deployment of optimization technology.
major comments (3)
- [Methodology] Methodology section: The same LLM class generates the stakeholder utilities, produces the dialogue, and supplies the implicit scoring signal for solution quality. No ablation that removes the utility scaffolding, no inter-rater agreement with domain experts, and no external calibration against actual human stakeholders are reported; this makes it impossible to rule out that reported gains are artifacts of the closed simulation loop rather than genuine interactive optimization benefits.
- [Results] Results section: The abstract and evaluation description provide only directional statements on solution quality; no concrete metrics (e.g., normalized utility, feasibility rate), statistical tests, variance estimates, or implementation details for the one-shot versus conversational baselines are supplied, preventing assessment of effect size or robustness of the central claim.
- [Section 5] Section 5 (comparison of tailored vs. general agents): The reported advantage of tailored agents rests on the same internal-utility evaluation; without an independent human or expert panel validation, it remains unclear whether the observed reduction in interactions and quality lift would survive when the utility functions are replaced by real decision-maker preferences.
minor comments (2)
- [Abstract] The abstract states that 'one-shot evaluation is severely limiting' but does not define the precise one-shot protocol (prompt length, number of proposals allowed, stopping criterion), making the comparison hard to replicate.
- [Methodology] Notation for the internal utility functions is introduced without an explicit equation or pseudocode listing the components (objectives, constraints, trade-off weights), which would aid reproducibility.
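One form the listing requested in the last minor comment could take is sketched here. It is a hypothetical structure, not the paper's definition: the weighted-sum form, the linear normalization, the hard upper limits, and the objective names (`schedule_deviation_min`, `bus_count`) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StakeholderUtility:
    """Hypothetical internal utility: normalized weighted sum with hard limits."""

    weights: Dict[str, float]      # trade-off weight per objective
    worst: Dict[str, float]        # objective value that scores 0
    best: Dict[str, float]         # objective value that scores 1
    hard_limits: Dict[str, float]  # upper bounds; exceeding any gives utility 0

    def score(self, solution: Dict[str, float]) -> float:
        # A violated hard limit makes the solution unacceptable outright.
        for name, limit in self.hard_limits.items():
            if solution[name] > limit:
                return 0.0
        total = 0.0
        for name, weight in self.weights.items():
            span = self.best[name] - self.worst[name]
            normalized = (solution[name] - self.worst[name]) / span
            # Clamp so values beyond best or worst do not leave [0, 1].
            total += weight * min(1.0, max(0.0, normalized))
        return total / sum(self.weights.values())


# Invented example values, for illustration only.
coordinator = StakeholderUtility(
    weights={"schedule_deviation_min": 0.75, "bus_count": 0.25},
    worst={"schedule_deviation_min": 40.0, "bus_count": 200.0},
    best={"schedule_deviation_min": 0.0, "bus_count": 100.0},
    hard_limits={"schedule_deviation_min": 30.0},
)
feasible = coordinator.score({"schedule_deviation_min": 10.0, "bus_count": 150.0})
infeasible = coordinator.score({"schedule_deviation_min": 35.0, "bus_count": 100.0})
print(feasible, infeasible)
```

Listing objectives, weights, and constraints as separate fields would let a reader reproduce a stakeholder's score without access to the prompts that drive the dialogue.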
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our simulation-based evaluation framework. We address each major comment below, clarifying our design choices while incorporating revisions to strengthen the presentation of metrics, add ablations, and discuss limitations. The core contribution remains a scalable methodology for evaluating conversational optimization agents, with the simulation serving as a controlled, replicable proxy rather than a direct substitute for human studies.
Point-by-point responses
- Referee: [Methodology] Methodology section: The same LLM class generates the stakeholder utilities, produces the dialogue, and supplies the implicit scoring signal for solution quality. No ablation that removes the utility scaffolding, no inter-rater agreement with domain experts, and no external calibration against actual human stakeholders are reported; this makes it impossible to rule out that reported gains are artifacts of the closed simulation loop rather than genuine interactive optimization benefits.
  Authors: We designed the simulation with separate LLM instances and prompts for utility generation versus dialogue to reduce direct circularity, allowing stakeholder agents to communicate naturally while utilities provide an independent ground-truth metric. This enables the scalable generation of thousands of conversations as described. We acknowledge the closed-loop concern and have added an ablation study in the revised Section 3 that removes explicit utility scaffolding from stakeholder prompts, showing that conversational gains persist (though attenuated). We also expanded the limitations subsection to discuss the absence of inter-rater agreement and human calibration, noting these as directions for future work rather than claiming the simulation fully replicates real stakeholders.
  Revision: yes
- Referee: [Results] Results section: The abstract and evaluation description provide only directional statements on solution quality; no concrete metrics (e.g., normalized utility, feasibility rate), statistical tests, variance estimates, or implementation details for the one-shot versus conversational baselines are supplied, preventing assessment of effect size or robustness of the central claim.
  Authors: The original results section reports normalized utility improvements and interaction counts, but we agree the presentation was insufficiently detailed. In the revision, we have added explicit tables with mean normalized utility scores, feasibility rates, standard deviations across 10 random seeds, and paired t-test p-values comparing one-shot versus conversational conditions. We also include hyperparameter details for the baselines and effect size calculations (Cohen's d) to quantify the gains.
  Revision: yes
- Referee: [Section 5] Section 5 (comparison of tailored vs. general agents): The reported advantage of tailored agents rests on the same internal-utility evaluation; without an independent human or expert panel validation, it remains unclear whether the observed reduction in interactions and quality lift would survive when the utility functions are replaced by real decision-maker preferences.
  Authors: The tailored-agent advantages are demonstrated within the same reproducible simulation framework to isolate the impact of domain-specific prompts and tools. We concur that real-stakeholder validation is ultimately needed for deployment claims. The revised Section 5 now includes a dedicated paragraph on generalizability, potential biases from synthetic utilities, and a proposed protocol for future human-subject experiments. This positions the current results as evidence for the methodology's utility in controlled settings.
  Revision: partial
- Full external calibration against actual human stakeholders and inter-rater agreement studies, which would require new data collection, ethics approvals, and resources outside the scope of the current work.
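The paired comparison named in the rebuttal (a paired t-test and Cohen's d across seeds) could be computed along these lines. This is a sketch only: the scores are invented placeholders rather than results from the paper, the effect size is the paired variant d_z (mean difference divided by the standard deviation of the differences), which is an assumption since the rebuttal does not say which variant is used, and the function name `paired_effect` is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_effect(baseline: Sequence[float], treatment: Sequence[float]) -> Tuple[float, float]:
    """Return (paired t statistic, Cohen's d_z) for matched samples.

    Each index is one seed evaluated under both conditions.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    d_z = mean_diff / sd_diff
    t_stat = d_z * math.sqrt(len(diffs))  # equals mean_diff / (sd_diff / sqrt(n))
    return t_stat, d_z


# Invented normalized-utility scores for five seeds, for illustration only.
one_shot_scores = [0.50, 0.55, 0.45, 0.60, 0.50]
conversational_scores = [0.70, 0.80, 0.60, 0.85, 0.75]

t_stat, d_z = paired_effect(one_shot_scores, conversational_scores)
print(f"t = {t_stat:.2f}, d_z = {d_z:.2f}, df = {len(one_shot_scores) - 1}")
```

A p-value then follows from the t distribution with n - 1 degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` returns the statistic and p-value directly.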
Circularity Check
No significant circularity; simulation framework is self-contained
Full rationale
The paper defines a simulation methodology using author-specified internal utility functions to generate stakeholder dialogues and to score final solution quality. The core empirical claims compare one-shot versus multi-turn interactions and tailored versus general-purpose agents, all measured inside the same fixed simulation. These differences are not forced by construction: the conversational advantage arises from the interaction protocol itself rather than from re-using the same fitted values or renaming inputs as outputs. No equations reduce a claimed prediction to its own inputs, no load-bearing self-citations are invoked to justify uniqueness, and no ansatz is smuggled via prior work. The study is therefore self-contained against its own benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM role-play agents governed by internal utility functions can generate conversations that meaningfully represent real stakeholder decision-making processes.