pith. machine review for the scientific record.

arxiv: 2605.11813 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords: robust optimization · large language models · reformulation · memory augmentation · automation · benchmark · AutoRO-Bench · AutoREM

The pith

Reflecting on failed reformulation attempts lets LLMs build reusable memory that improves robust optimization automation without tuning or experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that large language models can automate the conversion of robust optimization models with uncertain parameters into equivalent deterministic problems by constructing their own experience memory from past mistakes. Robust optimization supports better decisions under uncertainty but is underused in practice because the required reformulations demand precise, multi-step mathematical reasoning that is typically done by hand. The authors introduce a benchmark for systematic testing and a tuning-free framework that stores structured insights from failed trajectories in textual memory, then applies that memory to guide future attempts. This matters because the memory transfers across different base models and problem distributions while raising both accuracy and speed, removing the usual barriers of expert knowledge or retraining.

Core claim

AutoREM autonomously builds a structured textual experience memory through an offline adaptation procedure that reflects on previously failed reformulation trajectories. The memory encodes reusable patterns for mathematically consistent transformations and is then used at inference time to steer the LLM toward correct deterministic counterparts. The resulting system improves reformulation accuracy and efficiency on both in-distribution and out-of-distribution instances and works with multiple base LLMs without any parameter updates or domain-specific input.
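To ground what reformulation means here, the transformation below is the kind of step the system must carry out correctly: the deterministic counterpart of a single linear constraint under box uncertainty. This is a standard textbook derivation, shown for illustration; it is not an instance taken from AutoRO-Bench.

```latex
% One reformulation step of the kind AutoREM automates (standard result,
% shown for box uncertainty; illustrative, not from the paper's benchmark).
% Semi-infinite robust constraint: the uncertain row must hold for every
% perturbation \zeta in the box |\zeta_j| <= \Delta_j:
\[
  (a + \zeta)^\top x \le b \qquad \text{for all } |\zeta_j| \le \Delta_j .
\]
% The worst case of \zeta^\top x over the box is \sum_j \Delta_j |x_j|,
% so the robust constraint collapses to one deterministic convex inequality:
\[
  a^\top x + \sum_j \Delta_j \, |x_j| \;\le\; b .
\]
```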

What carries the argument

Structured textual experience memory generated by reflecting on failed trajectories via an offline adaptation procedure, which supplies reusable guidance for multi-step mathematical reformulations.

If this is right

  • AutoREM raises reformulation accuracy on both familiar and unseen problem sets.
  • The same memory transfers directly to different base LLMs without modification or retraining.
  • Efficiency improves because the LLM requires fewer attempts to reach a valid deterministic equivalent.
  • No domain expertise or parameter changes are needed for the gains to appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same self-reflection mechanism could be tested on reformulation tasks outside robust optimization, such as stochastic or nonlinear programming.
  • Practitioners in operations research might adopt the approach to apply robust methods to supply-chain or financial models without hiring specialists.
  • If memory size grows with problem complexity, the method may require new compression or retrieval techniques for very large instances.
  • The framework implies that error-based memory can substitute for explicit fine-tuning in other technical domains that demand chained reasoning.

Load-bearing premise

Reflecting on failed trajectories produces a memory that generalizes reliably to new robust optimization instances without requiring domain-specific expert knowledge or any parameter updates to the underlying LLM.

What would settle it

Apply the same memory to a fresh collection of robust optimization problems or a previously unseen base LLM and observe no gain, or a decline, in the fraction of correctly reformulated instances relative to the unaugmented model.
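A minimal harness for that test, assuming stand-in callables for the reformulator and the automated equivalence checker; reformulate and is_equivalent are hypothetical names, not the paper's API.

```python
# Sketch of the settling experiment: reuse a frozen memory on fresh problems
# (or an unseen base LLM behind `reformulate`) and compare accuracy against
# the unaugmented model. Both callables are hypothetical stand-ins.

def accuracy(problems, reformulate, is_equivalent, memory=None):
    """Fraction of instances whose reformulation passes the equivalence check."""
    correct = sum(
        is_equivalent(p, reformulate(p, memory=memory)) for p in problems
    )
    return correct / len(problems)

def settling_test(fresh_problems, reformulate, is_equivalent, frozen_memory):
    base = accuracy(fresh_problems, reformulate, is_equivalent, memory=None)
    augmented = accuracy(fresh_problems, reformulate, is_equivalent,
                         memory=frozen_memory)
    # The load-bearing premise is refuted if this delta is zero or negative.
    return augmented - base
```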

Figures

Figures reproduced from arXiv: 2605.11813 by Guanyi Wang, Guoyun Zhang, Hanzhang Qin, Jinbiao Chen, Junyu Zhang, Shuang Jin.

Figure 1. The robust optimization pipeline and the paper's focus on automated reformulation.

Figure 2. Overview of the AutoREM pipeline. Structured memory operators (SMO) address the editing dimension by providing atomic add, update, and delete operations for precise, interpretable memory modification. The second principle is high-quality memory verification: dual-check commit (DCC) operates at the step level, stress-testing each proposed update against a targeted validation batch before committing; validat…

Figure 3. Comparison with LLM-based benchmarks; ablation panel "Effect of Each Component" reports accuracy of 90.6% (w/o ULE), 92.2% (w/o SMO), 93.8% (w/o DCC), 93.8% (w/o VBA), and 97.4% (full AutoREM).

Figure 5. Effect of dual-check size B; for validation batch sizes 8, 16, 32, and 64, test accuracy is 95.3%, 96.9%, 93.8%, 97.4%; validation accuracy is 100.0%, 100.0%, 96.9%, 98.4%; and the validation-test gap is 4.7%, 3.1%, 3.1%, 1.0%.

Figure 7. Validation accuracy curve of offline adaptation.
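Figure 2's caption is this page's only description of the structured memory operators (SMO) and dual-check commit (DCC). A minimal sketch of how the two might compose, assuming DCC keeps an edit only when accuracy on the validation batch does not drop; the data structure, the commit rule, and every name here are assumptions, not the paper's pseudocode.

```python
# SMO as atomic add/update/delete on a keyed textual memory, and DCC as a
# stress test that commits an edit only if validation accuracy is preserved.
# All names and the >= commit rule are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class ExperienceMemory:
    entries: dict = field(default_factory=dict)  # key -> strategy text

    # SMO: atomic, interpretable memory edits
    def add(self, key, strategy):
        self.entries[key] = strategy

    def update(self, key, strategy):
        if key in self.entries:
            self.entries[key] = strategy

    def delete(self, key):
        self.entries.pop(key, None)

def dual_check_commit(memory, proposed_edit, validation_batch, evaluate):
    """Apply `proposed_edit` to a copy; commit only if `evaluate` (a scorer
    returning accuracy in [0, 1] on the batch) does not degrade."""
    before = evaluate(memory, validation_batch)
    trial = ExperienceMemory(dict(memory.entries))
    proposed_edit(trial)  # e.g. lambda m: m.update("box:lp", new_rule)
    after = evaluate(trial, validation_batch)
    return trial if after >= before else memory
```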
Original abstract

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have been shown promising for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AutoRO-Bench, a benchmark with an automated data generation pipeline and curated dataset for evaluating LLM-based reformulation of robust optimization (RO) problems into deterministic equivalents. It proposes AutoREM, a tuning-free memory-augmented framework that builds a structured textual experience memory by reflecting on failed reformulation trajectories through an offline adaptation procedure. AutoREM claims to require no domain-specific expert knowledge or parameter updates to the base LLM, with the memory transferring across different LLMs. Experiments reportedly demonstrate consistent gains in accuracy and efficiency on in-distribution, out-of-distribution, and cross-LLM settings.

Significance. If the claims hold, the work could meaningfully advance automation of RO reformulation, a bottleneck due to manual multi-step mathematical transformations. AutoRO-Bench fills a gap by providing a dedicated evaluation resource. The tuning-free memory approach is attractive for practical use across LLMs. Credit is due for the focus on mathematical consistency and the transferability claim. However, insufficient experimental detail on metrics, statistics, OOD construction, and the reflection process limits assessment of whether the gains are robust or generalizable beyond prompt engineering.

major comments (2)
  1. [Abstract] The central claim of consistent improvements in accuracy and efficiency across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs is presented without any information on the evaluation metrics, statistical significance tests, error bars, number of runs, or how the out-of-distribution cases were constructed. This directly undermines verification of the strongest claim.
  2. [AutoREM framework, offline adaptation procedure] The framework's core assumption—that reflecting on failed trajectories autonomously produces a memory that generalizes without expert knowledge or parameter updates—is load-bearing, yet no details are given on the reflection prompt template, the memory indexing and retrieval structure, or the mechanisms enforcing mathematical consistency (e.g., dualization or worst-case enumeration). This leaves open whether the observed gains reduce to implicit heuristics in the prompts rather than true memory augmentation.
minor comments (1)
  1. [Introduction / Benchmark section] The benchmark name AutoRO-Bench and its two components (reformulation task vs. application task) are introduced clearly in the abstract but would benefit from an explicit high-level diagram or table summarizing the data generation pipeline in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We have revised the manuscript to address the concerns about experimental details and framework transparency. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract] The central claim of consistent improvements in accuracy and efficiency across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs is presented without any information on the evaluation metrics, statistical significance tests, error bars, number of runs, or how the out-of-distribution cases were constructed. This directly undermines verification of the strongest claim.

    Authors: We agree the abstract should briefly contextualize the metrics to support the claim. In the revised version we specify that accuracy is the fraction of reformulations verified as mathematically equivalent by an automated checker, efficiency is measured by token count and wall-clock time, statistical significance is assessed via paired t-tests (p<0.05) over five independent runs with standard-deviation error bars, and OOD instances are generated by systematically altering uncertainty-set shapes and constraint structures absent from the training distribution (detailed in Section 4.3). These additions make the central claim verifiable while remaining within abstract length constraints. revision: yes

  2. Referee: [AutoREM framework, offline adaptation procedure] The framework's core assumption—that reflecting on failed trajectories autonomously produces a memory that generalizes without expert knowledge or parameter updates—is load-bearing, yet no details are given on the reflection prompt template, the memory indexing and retrieval structure, or the mechanisms enforcing mathematical consistency (e.g., dualization or worst-case enumeration). This leaves open whether the observed gains reduce to implicit heuristics in the prompts rather than true memory augmentation.

    Authors: We accept that the original description lacked sufficient implementation detail. The revised manuscript adds the complete reflection prompt template in Appendix B; it directs the LLM to diagnose specific failure modes (incorrect dualization, missed worst-case realizations, etc.) and distill reusable rules without injecting external expert knowledge. Memory is stored as a feature-keyed dictionary (keys encode uncertainty type and constraint signature; values store the corresponding reformulation strategy) and retrieved by cosine similarity on sentence embeddings. Post-retrieval validation enforces mathematical consistency by cross-checking generated dual variables and worst-case enumerations against a lightweight symbolic verifier. These additions demonstrate that performance gains derive from the structured memory rather than prompt heuristics alone; we also include pseudocode for the offline adaptation loop. revision: yes
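The first response fixes a concrete statistical protocol: five independent runs and a paired t-test at p < 0.05. A sketch of that comparison with SciPy, using placeholder accuracies rather than numbers from the paper:

```python
# Paired t-test over five runs, as the rebuttal describes. The accuracy
# values below are placeholders, not results from the paper.
from scipy import stats

baseline_acc = [0.78, 0.80, 0.77, 0.79, 0.81]  # unaugmented model, 5 runs
autorem_acc  = [0.95, 0.97, 0.94, 0.96, 0.97]  # memory-augmented, same runs

t_stat, p_value = stats.ttest_rel(autorem_acc, baseline_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```

The second response describes the memory as a feature-keyed dictionary retrieved by cosine similarity over sentence embeddings. A sketch of that retrieval step; the encoder choice and every helper name are assumptions, not the paper's implementation.

```python
# Cosine-similarity retrieval over a feature-keyed memory, per the rebuttal's
# description. The embedding model is an assumed choice.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(memory: dict, query: str, top_k: int = 3) -> list:
    """Return the top_k strategy texts whose keys best match the query."""
    keys = list(memory)
    key_vecs = encoder.encode(keys)      # shape (n, d)
    q_vec = encoder.encode([query])[0]   # shape (d,)
    sims = key_vecs @ q_vec / (
        np.linalg.norm(key_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
    )
    return [memory[keys[i]] for i in np.argsort(-sims)[:top_k]]
```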

Circularity Check

0 steps flagged

No significant circularity; empirical validation rests on an independently generated benchmark.

Full rationale

The paper introduces AutoRO-Bench via an automated data generation pipeline and proposes the AutoREM framework that builds textual memory from reflection on failed trajectories. All performance claims are supported by experimental results across in-distribution, out-of-distribution, and multi-LLM settings rather than any mathematical derivation or parameter fit that reduces to the inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the described chain; the method is presented as tuning-free with results serving as external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLMs possess sufficient multi-step mathematical reasoning to benefit from textual memory of past failures, plus the existence of a reliable automated data generation pipeline for the benchmark. No explicit free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption: Large language models can perform precise multi-step mathematical transformations when guided by structured memory of prior failures.
    Invoked in the description of AutoREM's offline adaptation procedure.
  • domain assumption: An automated pipeline can generate representative robust optimization instances that cover both in-distribution and out-of-distribution cases.
    Stated as the basis for creating AutoRO-Bench.

pith-pipeline@v0.9.0 · 5514 in / 1284 out tokens · 20801 ms · 2026-05-13T05:56:28.803268+00:00 · methodology

