pith. machine review for the scientific record.

arxiv: 2605.11813 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords: robust optimization · large language models · reformulation · memory augmentation · automation · benchmark · AutoRO-Bench · AutoREM

The pith

Reflecting on failed reformulation attempts lets LLMs build reusable memory that improves robust optimization automation without tuning or experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that large language models can automate the conversion of robust optimization models with uncertain parameters into equivalent deterministic problems by constructing their own experience memory from past mistakes. Robust optimization supports better decisions under uncertainty but is underused in practice because the required reformulations demand precise, multi-step mathematical reasoning that is typically done by hand. The authors introduce a benchmark for systematic testing and a tuning-free framework that stores structured insights from failed trajectories in textual memory, then applies that memory to guide future attempts. This matters because the memory transfers across different base models and problem distributions while raising both accuracy and speed, removing the usual barriers of expert knowledge or retraining.

Core claim

AutoREM autonomously builds a structured textual experience memory through an offline adaptation procedure that reflects on previously failed reformulation trajectories. The memory encodes reusable patterns for mathematically consistent transformations and is then used at inference time to steer the LLM toward correct deterministic counterparts. The resulting system improves reformulation accuracy and efficiency on both in-distribution and out-of-distribution instances and works with multiple base LLMs without any parameter updates or domain-specific input.
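To ground what reformulation means here, the transformation below is the kind of step the system must carry out correctly: the deterministic counterpart of a single linear constraint under box uncertainty. This is a standard textbook derivation, shown for illustration; it is not an instance taken from AutoRO-Bench.

```latex
% One reformulation step of the kind AutoREM automates (standard result,
% shown for box uncertainty; illustrative, not from the paper's benchmark).
% Semi-infinite robust constraint: the uncertain row must hold for every
% perturbation \zeta in the box |\zeta_j| <= \Delta_j:
\[
  (a + \zeta)^\top x \le b \qquad \text{for all } |\zeta_j| \le \Delta_j .
\]
% The worst case of \zeta^\top x over the box is \sum_j \Delta_j |x_j|,
% so the robust constraint collapses to one deterministic convex inequality:
\[
  a^\top x + \sum_j \Delta_j \, |x_j| \;\le\; b .
\]
```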

What carries the argument

Structured textual experience memory generated by reflecting on failed trajectories via an offline adaptation procedure, which supplies reusable guidance for multi-step mathematical reformulations.

If this is right

  • AutoREM raises reformulation accuracy on both familiar and unseen problem sets.
  • The same memory transfers directly to different base LLMs without modification or retraining.
  • Efficiency improves because the LLM requires fewer attempts to reach a valid deterministic equivalent.
  • No domain expertise or parameter changes are needed for the gains to appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same self-reflection mechanism could be tested on reformulation tasks outside robust optimization, such as stochastic or nonlinear programming.
  • Practitioners in operations research might adopt the approach to apply robust methods to supply-chain or financial models without hiring specialists.
  • If memory size grows with problem complexity, the method may require new compression or retrieval techniques for very large instances.
  • The framework implies that error-based memory can substitute for explicit fine-tuning in other technical domains that demand chained reasoning.

Load-bearing premise

Reflecting on failed trajectories produces a memory that generalizes reliably to new robust optimization instances without requiring domain-specific expert knowledge or any parameter updates to the underlying LLM.

What would settle it

Apply the same memory to a fresh collection of robust optimization problems or a previously unseen base LLM and observe no gain, or a decline, in the fraction of correctly reformulated instances relative to the unaugmented model.
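A minimal harness for that test, assuming stand-in callables for the reformulator and the automated equivalence checker; reformulate and is_equivalent are hypothetical names, not the paper's API.

```python
# Sketch of the settling experiment: reuse a frozen memory on fresh problems
# (or an unseen base LLM behind `reformulate`) and compare accuracy against
# the unaugmented model. Both callables are hypothetical stand-ins.

def accuracy(problems, reformulate, is_equivalent, memory=None):
    """Fraction of instances whose reformulation passes the equivalence check."""
    correct = sum(
        is_equivalent(p, reformulate(p, memory=memory)) for p in problems
    )
    return correct / len(problems)

def settling_test(fresh_problems, reformulate, is_equivalent, frozen_memory):
    base = accuracy(fresh_problems, reformulate, is_equivalent, memory=None)
    augmented = accuracy(fresh_problems, reformulate, is_equivalent,
                         memory=frozen_memory)
    # The load-bearing premise is refuted if this delta is zero or negative.
    return augmented - base
```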

Figures

Figures reproduced from arXiv: 2605.11813 by Guanyi Wang, Guoyun Zhang, Hanzhang Qin, Jinbiao Chen, Junyu Zhang, Shuang Jin.

Figure 1. The robust optimization pipeline and the paper's focus on automated reformulation.

Figure 2. Overview of the AutoREM pipeline. Structured memory operators (SMO) address the editing dimension by providing atomic add, update, and delete operations for precise, interpretable memory modification. The second principle is high-quality memory verification: dual-check commit (DCC) operates at the step level, stress-testing each proposed update against a targeted validation batch before committing; validat…

Figure 3. Comparison with LLM-based benchmarks; ablation panel "Effect of Each Component" reports accuracy of 90.6% (w/o ULE), 92.2% (w/o SMO), 93.8% (w/o DCC), 93.8% (w/o VBA), and 97.4% (full AutoREM).

Figure 5. Effect of dual-check size B; for validation batch sizes 8, 16, 32, and 64, test accuracy is 95.3%, 96.9%, 93.8%, 97.4%; validation accuracy is 100.0%, 100.0%, 96.9%, 98.4%; and the validation-test gap is 4.7%, 3.1%, 3.1%, 1.0%.

Figure 7. Validation accuracy curve of offline adaptation.
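Figure 2's caption is this page's only description of the structured memory operators (SMO) and dual-check commit (DCC). A minimal sketch of how the two might compose, assuming DCC keeps an edit only when accuracy on the validation batch does not drop; the data structure, the commit rule, and every name here are assumptions, not the paper's pseudocode.

```python
# SMO as atomic add/update/delete on a keyed textual memory, and DCC as a
# stress test that commits an edit only if validation accuracy is preserved.
# All names and the >= commit rule are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class ExperienceMemory:
    entries: dict = field(default_factory=dict)  # key -> strategy text

    # SMO: atomic, interpretable memory edits
    def add(self, key, strategy):
        self.entries[key] = strategy

    def update(self, key, strategy):
        if key in self.entries:
            self.entries[key] = strategy

    def delete(self, key):
        self.entries.pop(key, None)

def dual_check_commit(memory, proposed_edit, validation_batch, evaluate):
    """Apply `proposed_edit` to a copy; commit only if `evaluate` (a scorer
    returning accuracy in [0, 1] on the batch) does not degrade."""
    before = evaluate(memory, validation_batch)
    trial = ExperienceMemory(dict(memory.entries))
    proposed_edit(trial)  # e.g. lambda m: m.update("box:lp", new_rule)
    after = evaluate(trial, validation_batch)
    return trial if after >= before else memory
```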
Original abstract

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have been shown promising for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AutoRO-Bench, a benchmark with an automated data generation pipeline and curated dataset for evaluating LLM-based reformulation of robust optimization (RO) problems into deterministic equivalents. It proposes AutoREM, a tuning-free memory-augmented framework that builds a structured textual experience memory by reflecting on failed reformulation trajectories through an offline adaptation procedure. AutoREM claims to require no domain-specific expert knowledge or parameter updates to the base LLM, with the memory transferring across different LLMs. Experiments reportedly demonstrate consistent gains in accuracy and efficiency on in-distribution, out-of-distribution, and cross-LLM settings.

Significance. If the claims hold, the work could meaningfully advance automation of RO reformulation, a bottleneck due to manual multi-step mathematical transformations. AutoRO-Bench fills a gap by providing a dedicated evaluation resource. The tuning-free memory approach is attractive for practical use across LLMs. Credit is due for the focus on mathematical consistency and the transferability claim. However, insufficient experimental detail on metrics, statistics, OOD construction, and the reflection process limits assessment of whether the gains are robust or generalizable beyond prompt engineering.

major comments (2)
  1. [Abstract] The central claim of consistent improvements in accuracy and efficiency across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs is presented without any information on the evaluation metrics, statistical significance tests, error bars, number of runs, or how the out-of-distribution cases were constructed. This directly undermines verification of the strongest claim.
  2. [AutoREM framework, offline adaptation procedure] The framework's core assumption—that reflecting on failed trajectories autonomously produces a memory that generalizes without expert knowledge or parameter updates—is load-bearing, yet no details are given on the reflection prompt template, the memory indexing and retrieval structure, or the mechanisms enforcing mathematical consistency (e.g., dualization or worst-case enumeration). This leaves open whether the observed gains reduce to implicit heuristics in the prompts rather than true memory augmentation.
minor comments (1)
  1. [Introduction / Benchmark section] The benchmark name AutoRO-Bench and its two components (reformulation task vs. application task) are introduced clearly in the abstract but would benefit from an explicit high-level diagram or table summarizing the data generation pipeline in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We have revised the manuscript to address the concerns about experimental details and framework transparency. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract] The central claim of consistent improvements in accuracy and efficiency across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs is presented without any information on the evaluation metrics, statistical significance tests, error bars, number of runs, or how the out-of-distribution cases were constructed. This directly undermines verification of the strongest claim.

    Authors: We agree the abstract should briefly contextualize the metrics to support the claim. In the revised version we specify that accuracy is the fraction of reformulations verified as mathematically equivalent by an automated checker, efficiency is measured by token count and wall-clock time, statistical significance is assessed via paired t-tests (p<0.05) over five independent runs with standard-deviation error bars, and OOD instances are generated by systematically altering uncertainty-set shapes and constraint structures absent from the training distribution (detailed in Section 4.3). These additions make the central claim verifiable while remaining within abstract length constraints. revision: yes

  2. Referee: [AutoREM framework, offline adaptation procedure] The framework's core assumption—that reflecting on failed trajectories autonomously produces a memory that generalizes without expert knowledge or parameter updates—is load-bearing, yet no details are given on the reflection prompt template, the memory indexing and retrieval structure, or the mechanisms enforcing mathematical consistency (e.g., dualization or worst-case enumeration). This leaves open whether the observed gains reduce to implicit heuristics in the prompts rather than true memory augmentation.

    Authors: We accept that the original description lacked sufficient implementation detail. The revised manuscript adds the complete reflection prompt template in Appendix B; it directs the LLM to diagnose specific failure modes (incorrect dualization, missed worst-case realizations, etc.) and distill reusable rules without injecting external expert knowledge. Memory is stored as a feature-keyed dictionary (keys encode uncertainty type and constraint signature; values store the corresponding reformulation strategy) and retrieved by cosine similarity on sentence embeddings. Post-retrieval validation enforces mathematical consistency by cross-checking generated dual variables and worst-case enumerations against a lightweight symbolic verifier. These additions demonstrate that performance gains derive from the structured memory rather than prompt heuristics alone; we also include pseudocode for the offline adaptation loop. revision: yes
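The first response fixes a concrete statistical protocol: five independent runs and a paired t-test at p < 0.05. A sketch of that comparison with SciPy, using placeholder accuracies rather than numbers from the paper:

```python
# Paired t-test over five runs, as the rebuttal describes. The accuracy
# values below are placeholders, not results from the paper.
from scipy import stats

baseline_acc = [0.78, 0.80, 0.77, 0.79, 0.81]  # unaugmented model, 5 runs
autorem_acc  = [0.95, 0.97, 0.94, 0.96, 0.97]  # memory-augmented, same runs

t_stat, p_value = stats.ttest_rel(autorem_acc, baseline_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```

The second response describes the memory as a feature-keyed dictionary retrieved by cosine similarity over sentence embeddings. A sketch of that retrieval step; the encoder choice and every helper name are assumptions, not the paper's implementation.

```python
# Cosine-similarity retrieval over a feature-keyed memory, per the rebuttal's
# description. The embedding model is an assumed choice.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(memory: dict, query: str, top_k: int = 3) -> list:
    """Return the top_k strategy texts whose keys best match the query."""
    keys = list(memory)
    key_vecs = encoder.encode(keys)      # shape (n, d)
    q_vec = encoder.encode([query])[0]   # shape (d,)
    sims = key_vecs @ q_vec / (
        np.linalg.norm(key_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
    )
    return [memory[keys[i]] for i in np.argsort(-sims)[:top_k]]
```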

Circularity Check

0 steps flagged

No significant circularity; empirical validation rests on an independently generated benchmark.

Full rationale

The paper introduces AutoRO-Bench via an automated data generation pipeline and proposes the AutoREM framework that builds textual memory from reflection on failed trajectories. All performance claims are supported by experimental results across in-distribution, out-of-distribution, and multi-LLM settings rather than any mathematical derivation or parameter fit that reduces to the inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the described chain; the method is presented as tuning-free with results serving as external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLMs possess sufficient multi-step mathematical reasoning to benefit from textual memory of past failures, plus the existence of a reliable automated data generation pipeline for the benchmark. No explicit free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption: Large language models can perform precise multi-step mathematical transformations when guided by structured memory of prior failures.
    Invoked in the description of AutoREM's offline adaptation procedure.
  • domain assumption: An automated pipeline can generate representative robust optimization instances that cover both in-distribution and out-of-distribution cases.
    Stated as the basis for creating AutoRO-Bench.

pith-pipeline@v0.9.0 · 5514 in / 1284 out tokens · 20801 ms · 2026-05-13T05:56:28.803268+00:00 · methodology

