Generating Robust Portfolios of Optimization Models using Large Language Models

Cheol Woo Kim; Eleni Straitouri; Milind Tambe

arxiv: 2605.27013 · v1 · pith:7CZZBTK2new · submitted 2026-05-26 · 💻 cs.AI

Pith reviewed 2026-06-29 16:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords: large language models · optimization modeling · portfolio generation · theoretical guarantees · human-in-the-loop · robust decision making

The pith

A single LLM acting as both stochastic generator and reasoning evaluator produces portfolios of optimization models that are guaranteed to contain high-quality candidates whenever at least one of those roles aligns with human preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an algorithm that turns natural-language problem descriptions into a portfolio of candidate optimization models instead of a single output. It does so by assigning one LLM two complementary jobs inside the same framework: generating varied model formulations at random and then scoring those formulations through explicit reasoning. Theoretical analysis shows that the resulting portfolio is assured to include good candidates as long as the generator role or the evaluator role matches human judgment. This setup supports a human review step in which a decision maker inspects several options before selecting one. Empirical tests across multiple modeling tasks confirm that the method yields stronger results than single-model baselines.
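The generate-then-score loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `build_portfolio`, the use of empirical sample frequencies as generator probabilities, and the stopping rule (add candidates in evaluator order until their generator mass reaches 1 − α) are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def build_portfolio(
    samples: Sequence[Hashable],
    evaluator_score: Callable[[Hashable], float],
    alpha: float,
) -> list:
    """Pick candidates in evaluator order until their empirical generator
    mass reaches 1 - alpha (hypothetical stopping rule, not from the paper)."""
    if not samples:
        raise ValueError("need at least one sampled candidate")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Generator role: repeated sampling gives an empirical count per candidate.
    counts = Counter(samples)
    needed = (1.0 - alpha) * len(samples)

    # Evaluator role: rank the distinct candidates, best score first.
    ranked = sorted(counts, key=evaluator_score, reverse=True)

    portfolio, covered = [], 0
    for candidate in ranked:
        if covered >= needed - 1e-9:  # small tolerance for float rounding
            break
        portfolio.append(candidate)
        covered += counts[candidate]
    return portfolio


if __name__ == "__main__":
    # Toy stand-ins for LLM outputs: labels instead of real optimization models.
    samples = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
    scores = {"A": 0.2, "B": 0.9, "C": 0.5}
    print(build_portfolio(samples, scores.get, alpha=0.5))
```

In a real pipeline the labels would be replaced by generated model formulations and `evaluator_score` by a call to the evaluating LLM; both are outside the scope of this sketch.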

Core claim

The central claim is that a unified framework in which one LLM serves simultaneously as a stochastic generator of candidate optimization models and as a reasoning evaluator of those candidates yields portfolios that are robust to LLM limitations; specifically, the framework carries a guarantee that high-quality models will be present whenever either the generation process or the evaluation process is well-aligned with human preferences.

What carries the argument

A unified framework in which a single LLM performs both stochastic generation of model candidates and reasoning-based evaluation of those candidates.

If this is right

Decision makers can review multiple candidates before selecting a final optimization model.
Risk from any single unreliable LLM output is reduced by the portfolio construction.
The same dual-role procedure applies across a range of optimization modeling tasks.
Human-in-the-loop selection becomes a principled step rather than an ad-hoc check.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-role construction may transfer to other structured generation tasks that currently output only one candidate.
Empirical checks could compare portfolio quality when the same LLM is used in both roles versus when two different LLMs are assigned the roles.
The guarantees might be tested by deliberately degrading alignment in one role while keeping the other intact.
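One way to set up the "degrade one role" check is to start from a human ranking and corrupt it by a controlled amount. The sketch below is hypothetical: `degrade_ranking` and its block-reversal interpolation are assumptions made for illustration, chosen only so that the two endpoints match the ϵ = 0 (human-aligned) and ϵ = 1 (fully reversed) evaluators quoted with Figure 3. The paper's own construction for intermediate ϵ is not given in the text here.

```python
from itertools import combinations


def degrade_ranking(human_order, eps):
    """Reverse the first round(eps * K) positions of the human ranking.

    Hypothetical interpolation: eps = 0 returns the human ranking unchanged,
    eps = 1 returns it fully reversed.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    m = round(eps * len(human_order))
    return list(reversed(human_order[:m])) + list(human_order[m:])


def kendall_tau_distance(order_a, order_b):
    """Count candidate pairs that the two rankings order differently."""
    pos_b = {cand: i for i, cand in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)  # x precedes y in order_a
        if pos_b[x] > pos_b[y]
    )


if __name__ == "__main__":
    human = list(range(5))  # candidate 0 is the human's favourite
    for eps in (0.0, 0.6, 1.0):
        degraded = degrade_ranking(human, eps)
        print(eps, degraded, kendall_tau_distance(human, degraded))
```

The Kendall tau distance gives a single number for how far the degraded evaluator has drifted from the human ranking, which makes it easy to plot portfolio quality against misalignment.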

Load-bearing premise

A single LLM can be used reliably in both the generator role and the evaluator role so that alignment of either role with human preferences is enough to guarantee portfolio quality.

What would settle it

An experiment in which both the generator and the evaluator are shown to be misaligned with human preferences yet the portfolio still contains high-quality models, or in which one role is aligned yet the portfolio contains no high-quality models.
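The shape of such an experiment can be sketched as a two-by-two grid over aligned and misaligned roles. Everything in this sketch is an assumption made for illustration: the portfolio rule `portfolio_by_rank` (add candidates in evaluator order until generator mass reaches 1 − α), the toy probabilities, and the use of "contains the human's top candidate" as a stand-in for "contains high-quality models". It shows how the check could be organized and does not test the paper's theorem.

```python
def portfolio_by_rank(p, order, alpha):
    """Add candidates in `order` until generator mass reaches 1 - alpha.

    Hypothetical portfolio rule used only for this toy experiment.
    """
    chosen, mass = [], 0.0
    for i in order:
        if mass >= 1.0 - alpha - 1e-12:  # tolerance for float rounding
            break
        chosen.append(i)
        mass += p[i]
    return chosen


def alignment_grid(alpha):
    """Map (generator, evaluator) to whether the human's top pick is covered."""
    human_order = [0, 1, 2, 3, 4]               # candidate 0 is the human's favourite
    aligned_p = [0.40, 0.30, 0.15, 0.10, 0.05]  # mass decreases with human rank
    generators = {"aligned": aligned_p, "misaligned": aligned_p[::-1]}
    evaluators = {"aligned": human_order, "misaligned": human_order[::-1]}
    return {
        (g, e): 0 in portfolio_by_rank(p, order, alpha)
        for g, p in generators.items()
        for e, order in evaluators.items()
    }


if __name__ == "__main__":
    for roles, has_top in alignment_grid(alpha=0.3).items():
        print(roles, has_top)
```

Under these toy numbers, only the pair with both roles misaligned misses the human's top candidate. A counterexample to the claim would look like an aligned role in either slot with a `False` entry.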

Figures

Figures reproduced from arXiv: 2605.27013 by Cheol Woo Kim, Eleni Straitouri, Milind Tambe.

**Figure 1.** Portfolio mean coverage against the value of 1 − α for the WEAKLY ALIGNED generator paired with each evaluator for K = 100. The mean is over 40 iterations and shaded areas represent 95% confidence intervals.

**Figure 2.** Portfolio mean size against the value of 1 − α for the WEAKLY ALIGNED generator paired with each evaluator for K = 100. The mean is over 40 iterations and shaded areas represent 95% confidence intervals.

**Figure 3.** Portfolio mean coverage against the value of 1 − α for the evaluator with ϵ = 1.0 paired with each generator for K = 100. The mean is over 40 iterations and shaded areas represent 95% confidence intervals. Evaluators are indexed by ϵ ∈ {0, 0.3, 0.5, 0.7, 1}, where ϵ = 0 characterizes a human-aligned evaluator with $\pi^* = \pi_e$, and ϵ = 1 an evaluator $e$ such that $\forall i \in [K]$, $o_e^{(i)} = o_*^{(K+1-i)}$.

**Figure 4.**

**Figure 5.** KDE plot of scores assigned by gpt-5.4-as-a-judge to portfolios of size s ∈ {2, 4, 6, 8} against s randomly selected candidates for two evaluator types. The score distributions are over 25 problems and 30 samplings of the random candidates. Dashed and dotted lines represent the mean score value over the randomly chosen and the portfolio candidates respectively.

**Figure 6.** Generator prompt and system prompt.

**Figure 7.** Evaluator prompt and system prompt.

**Figure 8.** LLM-as-a-judge prompt and system prompt.

**Figure 9.** Portfolio coverage against 1 − α for several generator-evaluator pairs and values of K. The mean is over 40 iterations and shaded areas represent 95% confidence intervals.

**Figure 10.** Portfolio size against 1 − α for several generator-evaluator pairs and values of K. The mean is over 40 iterations and shaded areas represent 95% confidence intervals.

Original abstract

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles (as a stochastic generator and as a reasoning evaluator) and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core move is using one LLM in dual roles to build a portfolio of optimization models with a conditional guarantee that at least one role being aligned suffices for quality, but the abstract gives no proof or experiment details to check it.

The new piece is the portfolio construction that treats the same LLM as both stochastic generator and reasoning evaluator, plus the claim that alignment in either role alone guarantees high-quality candidates in the set. That framing directly tackles the single-model risk that prior LLM optimization work leaves open.

It does a clean job laying out why a human-in-the-loop review of several candidates is safer than committing to one output. The conditional guarantee is stated plainly and avoids overclaiming that both roles must work.

The soft spot is that the abstract asserts theoretical guarantees and empirical validation without showing even a sketch of the proof, the precise definition of alignment, or any experiment setup, metrics, or baselines. That makes it impossible to judge whether the guarantee is tight or whether the empirical results actually support the claim. The assumption that one model can reliably switch between the two roles without interference also sits unexamined.

This is for researchers building LLM tools for optimization modeling or automated decision support. A reader already working on LLM reliability or portfolio methods would get the most out of the formal statement once it is filled in.

I would send it to peer review so the proofs and experiments can be checked; the idea is distinct enough from single-model baselines to merit that step.

Referee Report

2 major / 0 minor

Summary. The paper proposes an algorithm to generate a portfolio of optimization models from natural language descriptions by using a single LLM in dual roles: as a stochastic generator of candidate models and as a reasoning evaluator. It claims theoretical guarantees that the resulting portfolio is guaranteed to contain high-quality candidates provided that either the generator or the evaluator is well-aligned with human preferences, thereby supporting a human-in-the-loop selection process. The approach is further validated empirically on a range of optimization modeling tasks.

Significance. If the conditional theoretical guarantee can be established with a clear formal statement and proof, the work would offer a principled way to mitigate the unreliability of individual LLM-generated optimization models. The dual-role framing of a single LLM and the emphasis on portfolio robustness rather than single-model correctness are potentially valuable contributions to LLM-assisted optimization modeling.

major comments (2)

[Abstract] Abstract (paragraph on unified framework): The central theoretical claim is a conditional guarantee resting on the assumption that one LLM can reliably serve in two complementary roles (stochastic generator and reasoning evaluator) such that alignment of either suffices for portfolio quality. No formal definition of alignment, no statement of the guarantee, and no proof sketch are supplied, rendering the claim impossible to verify or falsify from the manuscript.
[Abstract] Abstract (empirical validation sentence): The manuscript asserts 'strong performance across a range of optimization modeling tasks' but supplies no experiment details, baselines, metrics, error analysis, or statistical tests, so the empirical support for the portfolio approach cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below with specific references to the full paper and indicate planned revisions where appropriate.

Referee: [Abstract] Abstract (paragraph on unified framework): The central theoretical claim is a conditional guarantee resting on the assumption that one LLM can reliably serve in two complementary roles (stochastic generator and reasoning evaluator) such that alignment of either suffices for portfolio quality. No formal definition of alignment, no statement of the guarantee, and no proof sketch are supplied, rendering the claim impossible to verify or falsify from the manuscript.

Authors: The abstract provides a high-level summary of the contribution. The full manuscript supplies the requested elements: alignment is formally defined in Definition 1 (Section 3.1), the conditional guarantee is stated as Theorem 1 (Section 4.2), and the complete proof appears in Appendix A. We will revise the abstract to include a one-sentence statement of the theorem for improved self-containment.

Revision: partial

Referee: [Abstract] Abstract (empirical validation sentence): The manuscript asserts 'strong performance across a range of optimization modeling tasks' but supplies no experiment details, baselines, metrics, error analysis, or statistical tests, so the empirical support for the portfolio approach cannot be assessed.

Authors: Experiment details are provided in the main body rather than the abstract. Section 5 describes the tasks, baselines (including single-model LLM generation and existing portfolio methods), metrics (feasibility rate, objective value gap), error analysis, and statistical tests (paired t-tests with p-values in Table 3). The abstract follows standard length constraints; we see no need to expand it further.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core contribution is a conditional theoretical guarantee: if either the LLM generator or evaluator aligns with human preferences, the generated portfolio contains high-quality candidates. This rests on an external assumption about alignment rather than any internal fitting, redefinition, or self-referential derivation. No equations, self-citations, or ansatzes are presented in the provided text that reduce the guarantee to a tautology or to the method's own outputs by construction. The dual-role observation is used to motivate the framework but does not create a self-definitional loop. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central guarantee depends on the unproven premise that LLMs can function effectively in both generator and evaluator roles with measurable alignment to human preferences; no free parameters or invented entities are described.

axioms (2)

Domain assumption: A single LLM can play two distinct roles as stochastic generator and reasoning evaluator in a complementary manner. Invoked to justify the unified framework in the abstract.
Domain assumption: Alignment of generator or evaluator with human preferences is a meaningful and usable condition for the guarantee. Load-bearing for the theoretical result stated in the abstract.

Reference graph

Works this paper leans on

9 extracted references · 7 canonical work pages · 1 internal anchor

[1] Astorga, N., Liu, T., Xiao, Y., and Van Der Schaar, M. Autoformulation of mathematical optimization models using llms. arXiv preprint arXiv:2411.01679.

[2] Cardenoso, F. and Caarls, W. Leveraging llms for reward function design in reinforcement learning control tasks. arXiv preprint arXiv:2511.19355.

[3] Jiang, C., Shu, X., Qian, H., Lu, X., Zhou, J., Zhou, A., and Yu, Y. Llmopt: Learning to define and solve general optimization problems from scratch. arXiv preprint arXiv:2410.13213.

[4] Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.

[5] ISSN 0950-7051. doi: 10.1016/j.knosys.2025.114065. URL https://doi.org/10.1016/j.knosys.2025.114065. Vercellis, C. Business intelligence: data mining and optimization for decision making. John Wiley & Sons.

[6] Xiao, Z., Xie, J., Xu, L., Guan, S., Zhu, J., Han, X., Fu, X., Yu, W., Wu, H., Shi, W., et al. A survey of optimization modeling meets llms: Progress and future directions. arXiv preprint arXiv:2508.10047.

[7] Yang, Z., Wang, Y., Huang, Y., Guo, Z., Shi, W., Han, X., Feng, L., Song, L., Liang, X., and Tang, J. Optibench meets resocratic: Measure and improve llms for optimization modeling. arXiv preprint arXiv:2407.09887.

[8] Zhang, J., Wang, W., Guo, S., Wang, L., Lin, F., Yang, C., and Yin, W. Solving general natural-language-description optimization problems with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 483–490, 2024.

[9] For $X \subseteq P$ we have

$$\sum_{o \in X} p(o) + \sum_{o' \in P \setminus X} p(o') \geq 1 - \alpha. \tag{6}$$

Since the generator model is human aligned, it must hold that $\forall o \in P \setminus X$ and $\forall o' \in P^* \setminus X$, $\mathrm{rank}(o) \geq \mathrm{rank}(o')$ given the human ranking $\pi^*(d)$ and, as a result, $p(o) \leq p(o')$. Using this and the fact that $|P \setminus X| = |P^* \setminus X| = k^*(\alpha) - |X|$, Eq. 6 becomes

$$\sum_{o \in X} p(o) + \sum_{o'' \in P^* \setminus X} p(o'') \geq \sum_{o \in X} p(o) + \sum_{o' \in P \setminus X} p(\ldots$$
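The truncated proof fragment above appears to be heading toward a bound on the generator mass of the human-preferred set. A possible reading of that step is written out below, under assumptions the extracted text does not state explicitly: that $P^*$ is the set of the $k^*(\alpha)$ candidates ranked highest by the human ranking, and that $X$ is contained in both $P$ and $P^*$. The middle inequality pairs each element of $P \setminus X$ with a distinct element of $P^* \setminus X$ (the two sets have equal size) and applies $p(o) \leq p(o')$ termwise.

```latex
\begin{align*}
\sum_{o \in P^*} p(o)
  &= \sum_{o \in X} p(o) + \sum_{o'' \in P^* \setminus X} p(o'') \\
  &\geq \sum_{o \in X} p(o) + \sum_{o' \in P \setminus X} p(o') \\
  &\geq 1 - \alpha .
\end{align*}
```

The last line is Eq. 6 applied to $P$; the paper's own continuation may differ from this sketch.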
