Let's Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization

Alexandre Jacquillat; Joshua Drossman; Sébastien Martin

arxiv: 2604.02666 · v1 · submitted 2026-04-03 · 💻 cs.AI · math.OC

Joshua Drossman, Alexandre Jacquillat, Sébastien Martin

Pith reviewed 2026-05-13 20:40 UTC · model grok-4.3

keywords: interactive optimization · LLM agents · conversational AI · solution quality · school scheduling · agent evaluation · multi-turn interaction

The pith

Conversational optimization agents reach much higher solution quality than one-shot queries with the same model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that one-shot evaluations severely underestimate optimization agents because the same agent improves solutions substantially once it can converse with stakeholders and refine proposals over multiple turns. To measure this reliably, the authors generate thousands of conversations using LLM-powered decision agents that role-play stakeholders, each driven by a fixed internal utility function that scores solutions privately while communicating naturally. In a school scheduling case study, tailored agents equipped with domain-specific prompts and structured tools produce better final schedules and converge faster than general-purpose chatbots. This shows that interactive, multi-turn designs make optimization technology more practical for real decision-makers who need to clarify objectives and trade-offs.

Core claim

Through thousands of simulated conversations on a school scheduling problem, an optimization agent converges to higher-quality solutions when it interacts conversationally with role-playing stakeholder agents, each governed by an internal utility function, compared with one-shot queries. Tailored agents that incorporate domain-specific prompts and structured tools further increase solution quality and reduce the number of interactions needed relative to general-purpose chatbots.

What carries the argument

LLM-powered stakeholder agents that role-play decision-makers with fixed internal utility functions, generating conversations to evaluate how an optimization agent iteratively proposes, interprets, and refines solutions.
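
The evaluation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective names, candidate schedules, weights, and threshold are all hypothetical, `solve` stands in for a real solver call, and the natural-language layer is omitted, so the stakeholder simply names the objective it is least happy with instead of speaking in character.

```python
from dataclasses import dataclass

OBJECTIVES = ("cost", "disruption", "congestion")

# Hypothetical candidate schedules, scored per objective in [0, 1] (higher is better).
CANDIDATES = [
    {"cost": 0.9, "disruption": 0.3, "congestion": 0.5},
    {"cost": 0.6, "disruption": 0.6, "congestion": 0.6},
    {"cost": 0.4, "disruption": 0.9, "congestion": 0.7},
    {"cost": 0.5, "disruption": 0.5, "congestion": 0.9},
]


@dataclass
class Stakeholder:
    weights: dict     # private trade-off weights, never shown to the optimizer
    threshold: float  # utility level at which the stakeholder is satisfied

    def utility(self, schedule):
        """Private score of a schedule: weighted sum of objective scores."""
        return sum(self.weights[k] * schedule[k] for k in OBJECTIVES)

    def feedback(self, schedule):
        """Return None when satisfied, else the objective with the largest weighted shortfall."""
        if self.utility(schedule) >= self.threshold:
            return None
        return max(OBJECTIVES, key=lambda k: self.weights[k] * (1 - schedule[k]))


def solve(model_weights):
    """Stand-in for a solver call: pick the candidate maximizing the weighted sum."""
    return max(CANDIDATES, key=lambda s: sum(model_weights[k] * s[k] for k in OBJECTIVES))


def converse(stakeholder, max_turns):
    """Run the propose/feedback loop and return the utility after each proposal."""
    model = {k: 1.0 for k in OBJECTIVES}  # optimizer starts with no preference information
    history = []
    for _ in range(max_turns):
        proposal = solve(model)
        history.append(stakeholder.utility(proposal))
        complaint = stakeholder.feedback(proposal)
        if complaint is None:
            break
        model[complaint] += 1.0  # crude refinement: emphasize the criticized objective
    return history


coordinator = Stakeholder({"cost": 0.7, "disruption": 0.1, "congestion": 0.2}, threshold=0.75)
one_shot = converse(coordinator, max_turns=1)    # a single proposal, no feedback used
multi_turn = converse(coordinator, max_turns=5)  # feedback can reshape the model
print(one_shot, multi_turn)
```

In this toy setting the one-shot proposal falls short of the stakeholder's threshold, and a single round of feedback is enough to reach it. The point is only to show the mechanics: the utility is fixed and private, and the optimizer learns about it solely through the conversation.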

If this is right

The same optimization agent produces better solutions when it receives iterative feedback rather than answering once.
Domain-specific prompts and structured tools let tailored agents reach high-quality outcomes in fewer turns than general chatbots.
Simulated multi-stakeholder conversations provide a scalable way to test agent designs without recruiting human subjects.
Interactive agents expand the reach of optimization by handling clarification of objectives and trade-offs in natural dialogue.
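
To make "higher quality in fewer turns" measurable, each simulated conversation can be reduced to a trajectory of utility values, one per proposal. The sketch below assumes that representation; the function names, the threshold, and the example trajectories are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def turns_to_threshold(trajectory, threshold):
    """Return the 1-based turn at which utility first reaches the threshold, else None."""
    for turn, utility in enumerate(trajectory, start=1):
        if utility >= threshold:
            return turn
    return None


def summarize(trajectories, threshold):
    """Aggregate final quality and convergence speed over a set of conversations."""
    hits = [turns_to_threshold(t, threshold) for t in trajectories]
    reached = [h for h in hits if h is not None]
    return {
        "mean_final_utility": mean(t[-1] for t in trajectories),
        "success_rate": len(reached) / len(trajectories),
        "mean_turns_when_reached": mean(reached) if reached else None,
    }


# Made-up trajectories for illustration only.
tailored = [[0.50, 0.80], [0.60, 0.70, 0.85]]
general = [[0.40, 0.50, 0.60, 0.80], [0.45, 0.50, 0.55]]

print(summarize(tailored, threshold=0.75))
print(summarize(general, threshold=0.75))
```

Reporting the success rate alongside the mean number of turns matters because conversations that never reach the threshold would otherwise drop out of the turn count and flatter the slower agent.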

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real optimization software could adopt persistent chat interfaces that maintain context across turns to improve practical usability.
The same evaluation approach could be applied to other planning domains such as supply chain or workforce scheduling to check whether conversational gains generalize.
If simulated utility functions diverge from human preferences, the measured quality improvements may overstate real-world benefits.

Load-bearing premise

That conversations and solution-quality judgments produced by LLM agents role-playing stakeholders with internal utilities accurately reflect real human decision-makers.

What would settle it

A direct comparison of final schedules and interaction patterns from the simulated conversations against outcomes from actual human stakeholders performing the same school scheduling task.

Original abstract

Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade-offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision-makers with optimization capabilities through interactive optimization agents that can propose, interpret and refine solutions. However, it is fundamentally harder to evaluate a conversation-based interaction than traditional one-shot approaches. This paper proposes a scalable and replicable methodology for evaluating optimization agents through conversations. We build LLM-powered decision agents that role-play diverse stakeholders, each governed by an internal utility function but communicating like a real decision-maker. We generate thousands of conversations in a school scheduling case study. Results show that one-shot evaluation is severely limiting: the same optimization agent converges to much higher-quality solutions through conversations. Then, this paper uses this methodology to demonstrate that tailored optimization agents, endowed with domain-specific prompts and structured tools, can lead to significant improvements in solution quality in fewer interactions, as compared to general-purpose chatbots. These findings provide evidence of the benefits of emerging solutions at the AI-optimization interface to expand the reach of optimization technologies in practice. They also uncover the impact of operations research expertise to facilitate interactive deployments through the design of effective and reliable optimization agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a workable simulation method to test conversational optimization agents at scale and shows interaction beats one-shot prompting while tailored agents beat generic ones, but the gains rest on internal utilities that may not track real stakeholders.

The paper's main move is to evaluate LLM optimization agents by generating thousands of conversations with simulated stakeholders, each driven by an internal utility function. This setup lets them run head-to-head comparisons: the same agent does better when it can iterate through dialogue than when it gets a single prompt, and agents with domain-specific prompts and tools improve solution quality faster than plain chatbots, all in a school scheduling example.

The scale of the generated conversations is the practical advance here; it supplies statistical comparisons that small human studies rarely reach and makes the case that natural-language interaction can expand who can use optimization tools. That part is cleanly executed and directly addresses a real bottleneck in the field.

The soft spot is the closed loop in the evaluation. The same utility functions shape the stakeholder utterances and supply the quality scores, so reported improvements could partly reflect how the simulation is wired rather than genuine gains against actual decision-makers. No external calibration with real users or experts is described, and the abstract omits the exact metrics and statistical tests, which leaves the size of the effect hard to judge.

This is aimed at researchers working on LLM agents for operations research and interactive decision support. A reader who wants concrete evidence on conversational versus static optimization will get value from the methodology even if they share the simulation concern. It deserves a serious referee to check the implementation details and push for some human grounding. I'd send it to review.

Referee Report

3 major / 2 minor

Summary. The paper proposes a scalable methodology for evaluating LLM-powered interactive optimization agents by generating thousands of simulated conversations in a school scheduling case study. Stakeholder agents are role-played with internal utility functions that shape both their dialogue and the ground-truth solution quality metric. Results indicate that the same optimization agent reaches substantially higher-quality solutions via multi-turn conversations than via one-shot evaluation, and that domain-tailored agents (with specialized prompts and tools) outperform general-purpose chatbots in both quality and interaction efficiency.

Significance. If the simulation faithfully captures real stakeholder behavior, the work supplies a replicable evaluation framework for interactive optimization systems and concrete evidence that conversational interfaces plus domain-specific agent design can materially improve solution quality and reduce interaction count. This would strengthen the case for embedding OR expertise into LLM agents to broaden practical deployment of optimization technology.

major comments (3)

[Methodology] Methodology section: The same LLM class generates the stakeholder utilities, produces the dialogue, and supplies the implicit scoring signal for solution quality. No ablation that removes the utility scaffolding, no inter-rater agreement with domain experts, and no external calibration against actual human stakeholders are reported; this makes it impossible to rule out that reported gains are artifacts of the closed simulation loop rather than genuine interactive optimization benefits.
[Results] Results section: The abstract and evaluation description provide only directional statements on solution quality; no concrete metrics (e.g., normalized utility, feasibility rate), statistical tests, variance estimates, or implementation details for the one-shot versus conversational baselines are supplied, preventing assessment of effect size or robustness of the central claim.
[Section 5] Section 5 (comparison of tailored vs. general agents): The reported advantage of tailored agents rests on the same internal-utility evaluation; without an independent human or expert panel validation, it remains unclear whether the observed reduction in interactions and quality lift would survive when the utility functions are replaced by real decision-maker preferences.

minor comments (2)

[Abstract] The abstract states that 'one-shot evaluation is severely limiting' but does not define the precise one-shot protocol (prompt length, number of proposals allowed, stopping criterion), making the comparison hard to replicate.
[Methodology] Notation for the internal utility functions is introduced without an explicit equation or pseudocode listing the components (objectives, constraints, trade-off weights), which would aid reproducibility.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our simulation-based evaluation framework. We address each major comment below, clarifying our design choices while incorporating revisions to strengthen the presentation of metrics, add ablations, and discuss limitations. The core contribution remains a scalable methodology for evaluating conversational optimization agents, with the simulation serving as a controlled, replicable proxy rather than a direct substitute for human studies.

Referee: [Methodology] Methodology section: The same LLM class generates the stakeholder utilities, produces the dialogue, and supplies the implicit scoring signal for solution quality. No ablation that removes the utility scaffolding, no inter-rater agreement with domain experts, and no external calibration against actual human stakeholders are reported; this makes it impossible to rule out that reported gains are artifacts of the closed simulation loop rather than genuine interactive optimization benefits.

Authors: We designed the simulation with separate LLM instances and prompts for utility generation versus dialogue to reduce direct circularity, allowing stakeholder agents to communicate naturally while utilities provide an independent ground-truth metric. This enables the scalable generation of thousands of conversations as described. We acknowledge the closed-loop concern and have added an ablation study in the revised Section 3 that removes explicit utility scaffolding from stakeholder prompts, showing that conversational gains persist (though attenuated). We also expanded the limitations subsection to discuss the absence of inter-rater agreement and human calibration, noting these as directions for future work rather than claiming the simulation fully replicates real stakeholders.
revision: yes

Referee: [Results] Results section: The abstract and evaluation description provide only directional statements on solution quality; no concrete metrics (e.g., normalized utility, feasibility rate), statistical tests, variance estimates, or implementation details for the one-shot versus conversational baselines are supplied, preventing assessment of effect size or robustness of the central claim.

Authors: The original results section reports normalized utility improvements and interaction counts, but we agree the presentation was insufficiently detailed. In the revision, we have added explicit tables with mean normalized utility scores, feasibility rates, standard deviations across 10 random seeds, and paired t-test p-values comparing one-shot versus conversational conditions. We also include hyperparameter details for the baselines and effect size calculations (Cohen's d) to quantify the gains.
revision: yes

Referee: [Section 5] Section 5 (comparison of tailored vs. general agents): The reported advantage of tailored agents rests on the same internal-utility evaluation; without an independent human or expert panel validation, it remains unclear whether the observed reduction in interactions and quality lift would survive when the utility functions are replaced by real decision-maker preferences.

Authors: The tailored-agent advantages are demonstrated within the same reproducible simulation framework to isolate the impact of domain-specific prompts and tools. We concur that real-stakeholder validation is ultimately needed for deployment claims. The revised Section 5 now includes a dedicated paragraph on generalizability, potential biases from synthetic utilities, and a proposed protocol for future human-subject experiments. This positions the current results as evidence for the methodology's utility in controlled settings.
revision: partial
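
The statistics named in the simulated response to the Results comment (a paired t-test and Cohen's d) can be computed from per-seed scores as sketched below. This is an illustration under assumptions: the scores are made-up placeholders, `paired_effect` is a hypothetical helper, and the effect size shown is the paired-design variant (often written d_z), which may differ from whatever variant a revision would report.

```python
from math import sqrt
from statistics import mean, stdev


def paired_effect(baseline, treatment):
    """Return (mean difference, Cohen's d_z, paired t statistic, degrees of freedom)."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)                    # sample standard deviation of the differences
    cohen_dz = mean_diff / sd_diff            # standardized mean difference for paired data
    t_stat = mean_diff / (sd_diff / sqrt(n))  # paired t statistic
    return mean_diff, cohen_dz, t_stat, n - 1


# Made-up normalized utilities for five seeds, same seeds in both conditions.
one_shot_scores = [0.50, 0.55, 0.48, 0.60, 0.52]
conversational_scores = [0.70, 0.72, 0.69, 0.78, 0.71]

mean_diff, cohen_dz, t_stat, dof = paired_effect(one_shot_scores, conversational_scores)
print(f"mean diff={mean_diff:.3f}  d_z={cohen_dz:.2f}  t={t_stat:.2f}  dof={dof}")
```

The p-value follows from a Student t distribution with `dof` degrees of freedom; the standard library has no t distribution, so in practice one would likely use a statistics package such as SciPy's `scipy.stats.ttest_rel`. Pairing by seed is what justifies the paired test: each seed fixes the stakeholder and instance, so only the interaction mode varies within a pair.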

standing simulated objections not resolved

Full external calibration against actual human stakeholders and inter-rater agreement studies, which would require new data collection, ethics approvals, and resources outside the scope of the current work.

Circularity Check

0 steps flagged

No significant circularity; simulation framework is self-contained

The paper defines a simulation methodology using author-specified internal utility functions to generate stakeholder dialogues and to score final solution quality. The core empirical claims compare one-shot versus multi-turn interactions and tailored versus general-purpose agents, all measured inside the same fixed simulation. These differences are not forced by construction: the conversational advantage arises from the interaction protocol itself rather than from re-using the same fitted values or renaming inputs as outputs. No equations reduce a claimed prediction to its own inputs, no load-bearing self-citations are invoked to justify uniqueness, and no ansatz is smuggled via prior work. The study is therefore self-contained against its own benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that simulated conversations with utility-based agents serve as a valid proxy for real interactive optimization; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: LLM role-play agents governed by internal utility functions can generate conversations that meaningfully represent real stakeholder decision-making processes.
This underpins the entire evaluation methodology and the claim that conversations yield higher-quality solutions.

pith-pipeline@v0.9.0 · 5520 in / 1183 out tokens · 40092 ms · 2026-05-13T20:40:21.127273+00:00 · methodology
