ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization

Chaoyu Zhang; Chung-Piaw Teo; Hanzhang Qin; Huiling Chen; Junbo Jacob Lian; Yujun Sun

arXiv: 2602.15983 · v2 · submitted 2026-02-17 · 💻 cs.SE · cs.AI · cs.LG · math.OC


Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Hanzhang Qin, Chung-Piaw Teo

Pith reviewed 2026-05-15 21:18 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG · math.OC

keywords LLM code generation · optimization modeling · structured reasoning · behavioral verification · semantic correctness · RetailOpt-190 · feasibility-correctness gap · parameter perturbation


The pith

ReLoop narrows the gap between executable and semantically correct optimization code from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can produce optimization code that runs without crashing yet encodes mathematically wrong formulations, creating a feasibility-correctness gap as large as 90 percentage points on problems with many interacting constraints. ReLoop counters this with two linked mechanisms: a structured generation process that forces the model through four explicit stages of understanding the request, formalizing the math, writing the code, and checking it internally, plus an external behavioral check that perturbs input parameters and verifies whether the solver's output changes as the formulation requires. The two mechanisms complement each other by error type, with structured generation helping most on compositional retail problems and behavioral verification catching localized mistakes. When paired with diagnostic execution recovery, the method produces 100% executable code on Claude Opus 4.6 and raises accuracy on chat-tuned models across three benchmarks. The authors also release RetailOpt-190, a set of 190 compositional retail scenarios targeting the multi-constraint interactions where LLMs most frequently fail.

Core claim

ReLoop decomposes LLM code generation for optimization into a four-stage reasoning chain (understand, formalize, synthesize, verify) and augments it with behavioral verification, which tests whether the formulation produces the expected changes under solver-based parameter perturbations. The paper reports that this addresses the feasibility-correctness gap, reaches 100 percent executable code on Claude Opus 4.6 when combined with diagnostic execution recovery, and delivers consistent accuracy gains on chat-tuned foundation models across three benchmarks, including RetailOpt-190 and MAMO-ComplexLP.

What carries the argument

ReLoop's two complementary mechanisms: a four-stage structured reasoning chain that prevents formulation errors at generation time and solver-driven parameter perturbation testing that supplies an external semantic signal without ground truth.
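As a rough illustration of the second mechanism, the sketch below perturbs one input parameter, re-solves, and reports whether the optimum moved. Everything in it is an assumption made for illustration: the toy model (maximize `price * x` subject to `x <= capacity` and `x <= demand`), the closed-form functions standing in for a real solver call, and the names `solve_full`, `solve_missing_capacity`, and `responds_to`. It is not the paper's implementation.

```python
from typing import Callable, Dict

Params = Dict[str, float]
Solver = Callable[[Params], float]  # maps input parameters to an optimal objective value


def solve_full(p: Params) -> float:
    """Toy optimum of: max price*x  s.t. 0 <= x <= capacity, x <= demand."""
    return p["price"] * min(p["capacity"], p["demand"])


def solve_missing_capacity(p: Params) -> float:
    """Hypothetical defective model: the capacity constraint was silently omitted."""
    return p["price"] * p["demand"]


def responds_to(solve: Solver, params: Params, key: str,
                factor: float = 0.5, rel_tol: float = 1e-9) -> bool:
    """Scale one parameter, re-solve, and report whether the optimum changed."""
    base = solve(params)
    perturbed = {**params, key: params[key] * factor}  # copy, do not mutate the input
    shifted = solve(perturbed)
    return abs(shifted - base) > rel_tol * max(1.0, abs(base))


# Illustrative values only (not taken from any benchmark).
params = {"price": 4.0, "capacity": 10.0, "demand": 25.0}
full_sees_capacity = responds_to(solve_full, params, "capacity")
buggy_sees_capacity = responds_to(solve_missing_capacity, params, "capacity")
full_sees_demand = responds_to(solve_full, params, "demand")
```

With these toy values, halving `capacity` moves the optimum of the full model but leaves the defective model unchanged, which is the kind of signal a presence test looks for. Halving `demand` leaves the full model unchanged too, because that constraint is slack here, so a non-response is a hint rather than proof that a constraint is missing.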

If this is right

Structured generation yields its largest accuracy lift of 8.5 percentage points on compositional problems such as RetailOpt-190 with Claude Opus 4.6.
Behavioral verification contributes its largest gain of 4.4 percentage points on benchmarks dominated by localized defects, such as MAMO-ComplexLP.
Diagnostic execution recovery combined with the two mechanisms produces 100 percent executable code on Claude Opus 4.6.
Chat-tuned foundation models receive consistent accuracy improvements across the three evaluated benchmarks.
Narrowly-tuned supervised fine-tuned models remain brittle when chain-of-thought prompts alter their learned output formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The perturbation-testing approach could extend to other domains where generated code or models must satisfy implicit mathematical constraints beyond simple executability.
RetailOpt-190 supplies a focused testbed that future work can use to measure progress on multi-constraint optimization formulations.
The documented interaction between chain-of-thought prompting and supervised fine-tuned output formats suggests targeted fine-tuning strategies may be needed for reliable reasoning in narrow domains.

Load-bearing premise

That a formulation's correct response to solver-based parameter perturbations reliably indicates its semantic correctness without any ground-truth solution.

What would settle it

A formulation that passes all perturbation tests yet produces an optimal objective value differing from the known correct optimum on a verifiable problem instance.
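A minimal sketch of what such a counterexample could look like, under stated assumptions: the single-period cost model, the mis-scaled holding cost in `candidate_cost`, and the directional test in `passes_all_perturbations` are all hypothetical and stand in for the paper's actual test suite, which this page does not reproduce.

```python
from typing import Callable, Dict, Iterable

Params = Dict[str, float]
Model = Callable[[Params], float]  # maps parameters to an optimal objective value


def reference_cost(p: Params) -> float:
    """Known-correct toy optimum: buy and hold exactly the demand."""
    return (p["unit_cost"] + p["holding_cost"]) * p["demand"]


def candidate_cost(p: Params) -> float:
    """Hypothetical flawed formulation: holding cost charged at half its rate."""
    return (p["unit_cost"] + 0.5 * p["holding_cost"]) * p["demand"]


def passes_all_perturbations(model: Model, params: Params,
                             keys: Iterable[str], factor: float = 1.2) -> bool:
    """Assumed test: raising any cost driver must strictly raise the optimal cost."""
    base = model(params)
    for key in keys:
        bumped = {**params, key: params[key] * factor}
        if model(bumped) <= base:
            return False
    return True


def settles_against_premise(candidate: Model, reference: Model, params: Params,
                            keys: Iterable[str], rel_tol: float = 1e-6) -> bool:
    """True when the candidate passes every perturbation test yet misses the known optimum."""
    gap = abs(candidate(params) - reference(params))
    wrong = gap > rel_tol * abs(reference(params))
    return passes_all_perturbations(candidate, params, keys) and wrong


# Illustrative values only.
params = {"unit_cost": 3.0, "holding_cost": 2.0, "demand": 10.0}
keys = ("unit_cost", "holding_cost", "demand")
found = settles_against_premise(candidate_cost, reference_cost, params, keys)
```

In this toy case the flawed model reacts in the right direction to every perturbation while its optimum (40) differs from the reference (50). Whether the paper's tests would also miss such a mis-scaled coefficient is not something this sketch can establish.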

Figures

Figures reproduced from arXiv: 2602.15983 by Chaoyu Zhang, Chung-Piaw Teo, Hanzhang Qin, Huiling Chen, Junbo Jacob Lian, Yujun Sun.

**Figure 1.** ReLoop overview. Structured Generation mirrors expert modeling practice: understand the problem, formalize the mathematical model with explicit variable-type reasoning, synthesize Gurobi code with data extraction, and self-verify completeness. Behavioral Verification: L1 checks execution correctness (FATAL blocks output); L2 tests constraint (CPT) and objective (OPT) presence via solver-based perturbation …

Original abstract

Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations -- a feasibility-correctness gap reaching 90 percentage points on compositional problems. We introduce ReLoop, which addresses this gap through two complementary mechanisms. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify), preventing formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation -- an external semantic signal that bypasses LLM self-review and requires no ground truth. The two mechanisms are complementary by error structure: structured generation drives the largest gains on compositional problems (+8.5pp accuracy on RetailOpt-190 with Claude Opus 4.6), while behavioral verification dominates on localized defects (+4.4pp on MAMO-ComplexLP, its largest contribution across benchmarks). Combined with diagnostic execution recovery, ReLoop reaches 100% executable code on Claude Opus 4.6 and consistently improves accuracy on chat-tuned foundation models across three benchmarks; we further identify a known limitation of narrowly-tuned SFT models, whose learned output formats are brittle to chain-of-thought prompts -- an interaction we document and analyze. We release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReLoop adds a four-stage generation chain and solver-based parameter perturbation checks to catch semantic errors in LLM optimization code, but the verification step still needs clearer rules for what counts as a correct response without ground truth.


ReLoop breaks LLM code generation for optimization into four stages (understand, formalize, synthesize, verify) and adds a behavioral check that perturbs problem parameters and re-solves to flag formulations that do not behave as expected. The authors release RetailOpt-190, a new set of 190 retail scenarios built around the multi-constraint cases where models usually fail. On the reported runs the structured chain lifts accuracy most on compositional problems while the perturbation step helps more with localized defects, and they reach 100% executability on Claude Opus 4.6 plus steady gains on other chat models. They also document that some narrow SFT models break when given chain-of-thought prompts, which is a concrete observation for anyone running these pipelines in practice. That combination of mechanism and data is the part worth looking at.

The verification method is the soft spot. To decide whether a perturbed run is “correct” you still need a rule for the expected change in objective or feasibility. The abstract does not spell out how that rule is derived from the formulation and solver output alone, so any formulation that happens to pass the chosen test gets accepted even if it is wrong. That assumption sits behind the accuracy numbers, and without seeing the exact test definitions or ablations it is hard to judge how often it holds. The experiments are described at a high level with no error bars or full baseline tables visible in the summary, so the size of the gains is difficult to assess.

This is aimed at researchers building LLM tools for operations research or automated modeling. A reader who wants concrete prompting patterns and a new benchmark focused on real failure modes will find usable pieces. The work shows clear thinking about the feasibility-correctness gap and ships new data, so it deserves a serious referee to examine the verification logic and experimental details.

Referee Report

1 major / 2 minor

Summary. The paper introduces ReLoop for reliable LLM-based optimization modeling. It combines structured generation (a four-stage chain: understand, formalize, synthesize, verify) with behavioral verification that tests whether generated formulations respond correctly to solver-based parameter perturbations. The approach is claimed to close the feasibility-correctness gap, yielding 100% executable code on Claude Opus 4.6, accuracy gains of +8.5pp on the new RetailOpt-190 benchmark and +4.4pp on MAMO-ComplexLP, and consistent improvements across three benchmarks on chat-tuned models. The paper also releases RetailOpt-190 and documents brittleness in narrowly-tuned SFT models to chain-of-thought prompts.

Significance. If the experimental claims are substantiated with full details, ReLoop would offer a practical advance in reducing silent semantic errors in LLM-generated optimization code, a known pain point in operations research and software engineering. The release of a targeted compositional benchmark and the explicit complementarity analysis between generation and verification steps add reusable value beyond the specific method.

major comments (1)

[Behavioral verification mechanism] The behavioral verification procedure is load-bearing for the reported accuracy gains, yet the abstract supplies no explicit definition or derivation of the 'correct response' criterion (e.g., monotonicity of objective value, feasibility after perturbation, or sensitivity bounds) that can be computed solely from the formulation and solver output without ground truth. Without this, it is unclear how the method distinguishes formulation errors from coincidental satisfaction of the chosen test, directly affecting attribution of the +8.5pp and +4.4pp improvements.

minor comments (2)

[Abstract] The abstract states accuracy improvements and 100% executability but omits baseline comparisons, ablation results, error bars, and experimental protocol details; these must be supplied in the main text and appendix for reproducibility.
[Discussion of SFT limitations] The observation on SFT model brittleness to chain-of-thought prompts is noted but would benefit from a short quantitative table showing the interaction across the evaluated models.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of ReLoop in addressing the feasibility-correctness gap. We respond to the single major comment below and will incorporate clarifications in the revised manuscript.


Referee: [Behavioral verification mechanism] The behavioral verification procedure is load-bearing for the reported accuracy gains, yet the abstract supplies no explicit definition or derivation of the 'correct response' criterion (e.g., monotonicity of objective value, feasibility after perturbation, or sensitivity bounds) that can be computed solely from the formulation and solver output without ground truth. Without this, it is unclear how the method distinguishes formulation errors from coincidental satisfaction of the chosen test, directly affecting attribution of the +8.5pp and +4.4pp improvements.

Authors: We agree that the abstract would benefit from an explicit definition of the criterion. In the full manuscript (Section 3.2), the 'correct response' is defined via perturbation tests grounded in optimization sensitivity: for an objective-coefficient increase, the optimal value must not decrease; for a constraint tightening, feasibility must be preserved or the objective must adjust monotonically. These checks use only pre/post-perturbation solver outputs (objective value and feasibility status) and require no ground truth. Multiple independent perturbations reduce the chance of coincidental satisfaction, as common formulation errors (e.g., sign flips or omitted terms) systematically violate the expected monotonicity or feasibility response. Ablation results in the paper attribute the +4.4pp gain on MAMO-ComplexLP specifically to this step. We will revise the abstract to include a concise statement of the criterion and its derivation.

revision: yes
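One way to read the criteria in this response is as sign checks on pre- and post-perturbation solver outputs. The sketch below encodes two of them; `SolveResult` and both function names are hypothetical, and the coefficient check additionally assumes every decision variable is nonnegative, which the response does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SolveResult:
    """Minimal view of a solver run: feasibility flag and optimal value."""
    feasible: bool
    objective: Optional[float] = None


def tightening_is_consistent(before: SolveResult, after: SolveResult,
                             sense: str = "max", tol: float = 1e-9) -> bool:
    """Tightening a constraint shrinks the feasible set, so the optimum cannot improve."""
    if not before.feasible:
        # A smaller feasible set cannot become feasible again.
        return not after.feasible
    if not after.feasible:
        # Losing feasibility after tightening is an admissible response.
        return True
    if sense == "max":
        return after.objective <= before.objective + tol
    return after.objective >= before.objective - tol


def coefficient_increase_is_consistent(before: SolveResult, after: SolveResult,
                                       tol: float = 1e-9) -> bool:
    """Raising an objective coefficient on a nonnegative variable (assumption)
    leaves the feasible set unchanged and cannot lower the optimal value."""
    if before.feasible != after.feasible:
        return False
    if not before.feasible:
        return True
    return after.objective >= before.objective - tol
```

Both checks read only feasibility status and objective value, matching the response's point that no ground-truth optimum is needed. They are necessary conditions, so passing them does not by itself show that a formulation is correct.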

Circularity Check

0 steps flagged

No circularity; claims rest on new benchmarks and external solver signals


The paper introduces ReLoop via structured four-stage generation and behavioral verification through solver-based parameter perturbations as an external semantic signal requiring no ground truth. These mechanisms are evaluated on newly released benchmarks (RetailOpt-190) and existing ones (MAMO-ComplexLP) with reported accuracy deltas attributed to the methods rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivations reduce the central claims to their inputs by construction; the verification procedure is presented as independent of LLM self-review and falsifiable via solver outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLMs exhibit a large feasibility-correctness gap on compositional problems and that external solver signals can detect semantic errors.

axioms (1)

domain assumption: LLMs can translate natural language into optimization code but produce a feasibility-correctness gap reaching 90 percentage points on compositional problems
Stated directly in the abstract as the core motivation and risk.

invented entities (1)

ReLoop (no independent evidence)
purpose: Framework combining structured generation and behavioral verification to close the feasibility-correctness gap
New system introduced to address the identified gap



[1] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023.

[2] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024.

[3] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.

[4] Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I Wang, and Xi Victoria Lin. LEVER: Learning to verify language-to-code generation with execution. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 26106–26128. PMLR, 2023.

[5] Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 1015–1029. PMLR, 2024.

[6] Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, et al. Chain-of-experts: When LLMs meet complex operations research problems. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024.

[7] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A customizable framework in training large models for automated optimization modeling. Operations Research, 73(6):2986–3009, 2025.

[8] Caigao Jiang, Xiang Shu, Hong Qian, Xingyu Lu, Jun Zhou, Aimin Zhou, and Yang Yu. LLMOPT: Learning to define and solve general optimization problems from scratch. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[9] Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen. OptMATH: A scalable bidirectional data synthesis framework for optimization modeling. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.

[10] Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, and Yinyu Ye. Solver-informed RL: Grounding large language models for authentic optimization modeling. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025.

[11] Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, and Chung-Piaw Teo. LLM for large-scale optimization model auto-formulation: A lightweight few-shot learning approach. arXiv preprint arXiv:2601.09635, 2026.

[12] Hao Chen, Gonzalo Esteban Constante-Flores, Krishna Sri Ipsit Mantri, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, and Can Li. OptiChat: Bridging optimization models and practitioners with large language models. INFORMS Journal on Data Science, 2025. doi:10.1287/ijds.2025.0074

[13] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code generation with generated tests. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.

[14] Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, Belmont, MA, 1997.

[15] Robert J. Vanderbei. Linear Programming: Foundations and Extensions. International Series in Operations Research & Management Science. Springer, 5th edition, 2020.

[16] Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL), pages 238–252. ACM, 1977.

[17] Edmund M. Clarke, Orna Grumberg, and Doron A. Peled. Model Checking. MIT Press, Cambridge, MA, 1999.

[18] Rindranirina Ramamonjison, Timothy Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. NL4Opt competition: Formulating optimization problems based on their natural language descriptions. In Proceedings of the NeurIPS 2022 Competitions Track, pages 189–203. PMLR, 2023.

[19] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. MAMO: A mathematical modeling benchmark with solvers. arXiv preprint arXiv:2405.13144, 2024.

[20] Gurobi Optimization, LLC. Gurobi optimizer reference manual. https://www.gurobi.com, 2024.

[21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[22] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023.

A Reference MILP Formulation

This appendix provi...

[24] Compute a deterministic seed: $\text{seed} = \text{uint32\_le}\big(\text{SHA256}(n \,\Vert\, \text{"|"} \,\Vert\, v)[:4]\big)$, where the input string is "{name}|{v}" and uint32_le reads the first 4 bytes of the digest as an unsigned 32-bit little-endian integer.

[25] Initialize a NumPy random generator: rng = default_rng(seed).

[26] Perturb demand curves: $\tilde{d}_{p,t} = \lfloor d_{p,t} \cdot U(1-\alpha, 1+\alpha) \rfloor$ for each product and period.

[27] Perturb storage capacities: $\tilde{C}_l = C_l \cdot U(1-\alpha, 1+\alpha)$ for each location,

where $U(a, b)$ denotes independent draws from Uniform$(a, b)$ and $\alpha = 0.15$.

Rationale. The perturbation targets the two parameter groups most likely to shift constraint binding patterns, namely demand volumes and storage capacities, while preserving the structural skeleton (shelf life, network topology, cost...
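The seeding and perturbation steps above can be sketched as follows. This is a minimal illustration, not the authors' generator: the UTF-8 encoding of the seed string, the dictionary layout of the instance, and the names `deterministic_seed` and `perturb` are assumptions. It substitutes the standard-library `random.Random` for NumPy's `default_rng` so that it runs without dependencies. Any generator with a `uniform(low, high)` method can be passed in, including a NumPy `Generator`, though the draws will differ between generators.

```python
import hashlib
import math
import random


def deterministic_seed(name: str, v: int) -> int:
    """First 4 bytes of SHA256("{name}|{v}") as an unsigned little-endian uint32."""
    # Assumption: the input string is UTF-8 encoded before hashing.
    digest = hashlib.sha256(f"{name}|{v}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], byteorder="little", signed=False)


def perturb(demand, capacity, rng, alpha=0.15):
    """Scale every demand (then floor) and every capacity by its own U(1-alpha, 1+alpha) draw."""
    new_demand = {
        key: math.floor(d * rng.uniform(1 - alpha, 1 + alpha))
        for key, d in demand.items()
    }
    new_capacity = {
        loc: c * rng.uniform(1 - alpha, 1 + alpha)
        for loc, c in capacity.items()
    }
    return new_demand, new_capacity


# Hypothetical toy instance: demand keyed by (product, period), capacity by location.
demand = {("milk", 1): 100, ("milk", 2): 80}
capacity = {"store_A": 500.0}

seed = deterministic_seed("toy_instance", 1)
rng = random.Random(seed)  # stand-in for the NumPy generator named in the source
new_demand, new_capacity = perturb(demand, capacity, rng)
print(seed, new_demand, new_capacity)
```

Because the seed depends only on the instance name and variant index, rerunning the sketch reproduces the same perturbed instance.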

[28] Schema-based has [DATA SCHEMA] (types only) + [DATA ACCESS] (tells the LLM: “data is pre-loaded, do NOT use file I/O”).

[29] Data-embedded replaces both with [DATA] containing the full JSON (694 lines for this instance). No [DATA SCHEMA] or [DATA ACCESS] sections appear.

[30] The [OUTPUT FORMAT] in data-embedded additionally requires import json, and the [TASK] explicitly instructs “parse the JSON data above (use json.loads on the string)”. Listing 3 shows only the sections that structurally differ from the schema-based prompt (Listing 2). This is the default reporting format, as it enables self-contained evaluation without exter...

[31] Parses the JSON data above (use json.loads on the string)

[32] Models and solves the optimization problem

[33] Prints status and objective value

Listing 3: Data-embedded prompt (.full.txt): sections that differ from the schema-based format. The [BUSINESS DESCRIPTION] is identical and omitted here.

Table 16 summarizes the structural contrast between the two formats.

D Solver Configuration and Evaluation Protocol

D.1 Ground Truth Solver Settings

All 190 ground...

[34] 1. Status match: predicted and ground-truth statuses agree (both feasible, or both infeasible).

2. Objective match: for feasible instances, $|y_{\text{pred}} - y_{\text{ref}}| / |y_{\text{ref}}| < \epsilon$.

The tolerance $\epsilon$ is family-dependent because problem structure determines the achievable precision within the 60-second time limit (Table 18). Families F1–F5 and F7–F8 produce LP relaxations that ...
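The two acceptance criteria above translate into a short check. This is a sketch under stated assumptions: the status strings, the function name `is_correct`, and the tolerance used in the example call are illustrative, since the family-specific values of $\epsilon$ live in Table 18 and are not reproduced here. The handling of a zero reference objective is also an assumption, because the text does not say what happens when $|y_{\text{ref}}| = 0$.

```python
def is_correct(pred_status, ref_status, y_pred=None, y_ref=None, eps=1e-2):
    """Return True when both the status match and the objective match hold."""
    # Criterion 1: statuses must agree (both feasible, or both infeasible).
    if pred_status != ref_status:
        return False
    # Infeasible instances have no objective to compare.
    if ref_status != "feasible":
        return True
    # Assumption: fall back to an absolute comparison when y_ref is exactly zero.
    if y_ref == 0:
        return abs(y_pred) < eps
    # Criterion 2: relative objective error below the tolerance.
    return abs(y_pred - y_ref) / abs(y_ref) < eps


# Illustrative tolerance only; real values are family-dependent.
print(is_correct("feasible", "feasible", y_pred=1005.0, y_ref=1000.0, eps=1e-2))
print(is_correct("feasible", "infeasible", y_pred=1005.0))
```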

[35] Try extraction: An LLM call extracts all numerical parameters from the problem description into a structured JSON dictionary. If successful, a CoT prompt instructs the model to reference this pre-loaded data variable (e.g., data["capacity"]) rather than embedding values.

[36] Fallback: If extraction fails (invalid JSON, empty result, or the generated code contains json.loads), the pipeline falls back to self-contained generation where the LLM embeds all data directly in the code.

The extraction path enables L2 behavioral testing via data-dict perturbation. When extraction fails, L2 falls back to source-code AST perturbation (S...
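The extraction-or-fallback decision described above might look like the sketch below. The function name `choose_mode`, the mode labels, and the plain substring test for json.loads are assumptions made for illustration; the pipeline's actual detection logic is not shown in the text.

```python
import json


def choose_mode(extraction_output: str, generated_code: str) -> str:
    """Pick "data-dict" when extraction succeeded, else "self-contained"."""
    try:
        data = json.loads(extraction_output)
    except json.JSONDecodeError:
        return "self-contained"  # invalid JSON
    if not isinstance(data, dict) or not data:
        return "self-contained"  # empty or non-dictionary result
    if "json.loads" in generated_code:
        return "self-contained"  # code re-parses data instead of using the data variable
    return "data-dict"


print(choose_mode('{"capacity": 120}', 'cap = data["capacity"]'))
print(choose_mode("not json", 'cap = data["capacity"]'))
```

Only the "data-dict" outcome allows parameters to be perturbed in the external dictionary; the other outcome implies the source-code perturbation path.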

[37] Define ALL data within your code (extract numbers from the problem description above)

[38] Model variable must be named ‘m‘

[39] Set ‘m.Params.OutputFlag = 0‘

[40] Print exactly: ‘print(f"status: {m.Status}")‘ and ‘print(f"objective: {m.ObjVal}")‘

[41] Implement ALL constraints mentioned in the problem description (not just those in Step 2 -- re-read the problem to ensure nothing is missed)

[42] Include ALL cost/revenue terms from the problem in the objective function

**Big-M Guidelines (if using indicator/logical constraints):**
- NEVER hardcode Big-M values like ‘M = 1e6‘
- ALWAYS compute M dynamically from data parameters

**Edge Case Handling:**
- Check array length before iteration
- Avoid division by zero: ‘max(value, 1e-...

Parameters (reference the data keys listed below)

[43] Solver status: Check for INFEASIBLE (with IIS diagnostics), UNBOUNDED (with unbounded ray variables), or TIMEOUT.

[44] Duality check: If OPTIMAL, compare primal–dual gap. A gap > 1% emits INFO (does not trigger repair).

Any check failure emits a FATAL diagnostic, triggering the regeneration loop (up to 3 attempts).

L1 Regeneration Prompt. When L1 detects a FATAL error, the pipeline regenerates code using the error message as feedback. The data instructions section is conditional: if...

Code is self-contained -- all data is defined within the code itself

[45] Analyze why the previous code failed

[46] Generate completely new code that avoids the error

[47] Handle edge cases (empty arrays, division by zero)

Return ONLY the corrected Python code in a ‘‘‘python block.

Where {data_instructions} expands to either:
• Data-dict mode: “## Data Structure / The data variable is PRE-DEFINED with these keys: {schema} / CRITICAL: Do NOT create data = {...}. Just use data["key"] directly.”
• Self-contained mode: “## Not...

[48] Capacity constraints (resource limits, maximum values)

[49] Demand constraints (minimum requirements, must-satisfy conditions)

[50] Balance constraints (flow balance, inventory balance)

## Output Format
Return ONLY a JSON array with this exact format:
‘‘‘json
[
{"description": "minimum protein requirement",
"type": "demand", "parameters": ["min_protein"]},
{"description": "capacity limit on production",
"type": "capacity", "parameters": ["capacity"]}
]
‘‘‘

...

[51] Strategy 1 (data-dict): Perturb the parameter in the external data dictionary and re-execute the original code. Used when the code reads from the data variable.

[52] Strategy 2 (source-code fallback): If the code embeds data directly in source (detected automatically), or if Strategy 1 produced <1% objective change in hybrid mode, the pipeline falls back to AST-based source-code perturbation: it locates the parameter’s assignment in the code via fuzzy name matching, modifies the literal value, and re-executes. The p...
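Strategy 2 can be illustrated with Python's `ast` and `difflib` modules. This is a hedged sketch, not the pipeline's code: the function name `perturb_literal`, the 0.6 similarity cutoff, and the restriction to simple `name = <number>` assignments are assumptions. Negative literals, lists, and dictionaries are not handled here.

```python
import ast
import difflib


def perturb_literal(source: str, param: str, factor: float, cutoff: float = 0.6) -> str:
    """Scale the numeric literal assigned to the variable whose name best matches `param`."""
    tree = ast.parse(source)
    # Collect simple assignments of the form `name = <int or float literal>`.
    candidates = {}
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, (int, float))
            and not isinstance(node.value.value, bool)
        ):
            candidates[node.targets[0].id] = node.value
    # Fuzzy name matching: best candidate above the similarity cutoff.
    match = difflib.get_close_matches(param, list(candidates), n=1, cutoff=cutoff)
    if not match:
        raise LookupError(f"no numeric assignment resembling {param!r}")
    literal = candidates[match[0]]
    literal.value = literal.value * factor  # modify the literal in place
    return ast.unparse(tree)  # requires Python 3.9+


# Hypothetical generated code with an embedded parameter.
source = "max_capacity = 100\ntotal = max_capacity * 2\n"
perturbed = perturb_literal(source, "capacity", 1.25)

namespace = {}
exec(perturbed, namespace)  # re-execute the perturbed program
print(perturbed)
print(namespace["max_capacity"], namespace["total"])
```

In the pipeline the re-executed program would be the full optimization model, and the resulting objective would be compared against the unperturbed one.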

[53] If perturbation causes INFEASIBLE: constraint is present → PASS

[54] Compute change ratio: $r = |z_{\text{new}} - z^*| / |z^*|$ (absolute change used when $|z^*| < \varepsilon$)

[55] Classify (consistent with §3.3.2):
• $r < \tau_\ell = 5\%$: Constraint likely missing → WARNING (triggers repair)
• $\tau_\ell \le r \le \tau_h = 30\%$: Uncertain → INFO (no repair)
• $r > \tau_h$ or infeasibility: Constraint present → PASS

E.5 L2: Objective Presence Testing (OPT)

L2 OPT tests whether expected cost and revenue terms are present in the generated objective function, using the same pe...

Objective Term Extraction Prompt
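The change-ratio computation and the three-way classification of the constraint presence test can be written compactly. The thresholds $\tau_\ell = 5\%$ and $\tau_h = 30\%$ come from the text; the function name `classify_cpt` and the value of $\varepsilon$ (here `1e-9`) are assumptions, since the text does not give the cutoff below which the absolute change is used.

```python
def classify_cpt(z_star, z_new=None, infeasible=False,
                 tau_low=0.05, tau_high=0.30, eps=1e-9):
    """Classify one perturbation outcome as WARNING, INFO, or PASS."""
    # Infeasibility after perturbation means the constraint is present.
    if infeasible:
        return "PASS"
    delta = abs(z_new - z_star)
    # Relative change, or absolute change when the baseline objective is near zero.
    r = delta if abs(z_star) < eps else delta / abs(z_star)
    if r < tau_low:
        return "WARNING"  # constraint likely missing, triggers repair
    if r <= tau_high:
        return "INFO"     # uncertain, no repair
    return "PASS"         # constraint present


for z_new in (101.0, 120.0, 150.0):
    print(z_new, classify_cpt(100.0, z_new))
```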

[56] **Cost terms**: purchasing/procurement cost, holding/storage cost, transportation cost, shortage/backorder cost, setup/fixed cost, penalty cost

[57] **Revenue terms**: sales revenue, demand revenue, return/salvage value

For each term, identify which data parameter(s) provide its coefficient.

## Output Format
Return ONLY a JSON array with this exact format:
‘‘‘json
[
{"description": "unit purchasing cost",
"role": "cost", "parameters": ["unit_cost"]},
{"description": "sales re...

[58] ONLY fix the actionable issues listed in the ISSUES DETECTED section

[59] Items in REFERENCE ONLY are for context -- DO NOT modify code based on them

[60] Be conservative -- only make changes that are clearly necessary

[61] Preserve all working code -- only change what is broken

[62] Do NOT change hardcoded data values unless the diagnostic evidence specifically requires it

Fix this optimization code based on the behavioral verification report.

## Problem
{problem_description}

{data_section}

## Current Code
‘‘‘python
{code}
‘‘‘

## Current objective value: {current_obj}

---
## ISSUES DETECTED...

[63] [{layer}] {issue_type} -- {target_name}
{evidence}
Action: DO NOT FIX (unless 100% certain this is an error)

---
## REPAIR INSTRUCTIONS

[64] Read each Issue carefully, especially the Evidence field

[65] Identify the root cause in your code for each actionable issue

[66] Fix ALL actionable issues above

[67] DO NOT fix items in the REFERENCE section -- they are likely normal

[68] Preserve all working code -- only change what is broken

{safety_rules}

Return the COMPLETE fixed code in a ‘‘‘python block.

Where {data_section} and {safety_rules} are conditional:
• Data-dict mode: The data section shows the schema (keys, types, dimensions but not values). Safety rules prohibit redefining data, using json.loads(), and mutating data...

Code is self-contained

[69] 1. Data reassignment: Redefining data with a dict literal (data = {...}) is blocked. Exception: data = json.loads(...) is allowed as it re-parses existing data rather than fabricating values.

2. Data mutation: Modifying data contents (data["key"] = value) is blocked.

3. Dangerous imports: os and subprocess modules are blocked.

If violations are detected, a guide...
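The three safety rules above lend themselves to a static check over the repaired code. The sketch below uses Python's `ast` module and is only one plausible way to detect such violations; the function name `find_violations` and the message strings are hypothetical, and only top-level subscripts such as data["key"] = value are caught, not nested ones.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess"}


def find_violations(source: str) -> list:
    """Return a list of safety-rule violations found in `source`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                # Rule 1: data = {...} is blocked; data = json.loads(...) is a Call, so it passes.
                if (isinstance(target, ast.Name) and target.id == "data"
                        and isinstance(node.value, ast.Dict)):
                    violations.append("data reassignment")
                # Rule 2: data["key"] = value is blocked.
                if (isinstance(target, ast.Subscript)
                        and isinstance(target.value, ast.Name)
                        and target.value.id == "data"):
                    violations.append("data mutation")
        elif isinstance(node, ast.Import):
            # Rule 3: import os / import subprocess (including submodules).
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"dangerous import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"dangerous import: {node.module}")
    return violations


print(find_violations('data = {"capacity": 1}'))
print(find_violations("data = json.loads(raw)"))
```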

[70] Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...