arXiv: 2602.15983 · v2 · submitted 2026-02-17 · 💻 cs.SE · cs.AI · cs.LG · math.OC

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization

Pith reviewed 2026-05-15 21:18 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG · math.OC
keywords LLM code generation · optimization modeling · structured reasoning · behavioral verification · semantic correctness · RetailOpt-190 · feasibility-correctness gap · parameter perturbation

The pith

ReLoop closes the gap between executable and semantically correct optimization code from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can produce optimization code that runs without crashing yet encodes mathematically wrong formulations, creating a feasibility-correctness gap as large as 90 percentage points on problems with many interacting constraints. ReLoop counters this with two linked mechanisms. The first is a structured generation process that forces the model through four explicit stages: understanding the request, formalizing the math, writing the code, and checking it internally. The second is an external behavioral check that perturbs input parameters and verifies whether the solver's output changes as the formulation requires. The two mechanisms complement each other by error type, with structured generation helping most on compositional retail problems and behavioral verification catching localized mistakes. When paired with diagnostic execution recovery, the method reaches 100% executable code on Claude Opus 4.6 and raises accuracy on chat-tuned models across three benchmarks. The authors also release RetailOpt-190, a set of 190 compositional retail scenarios targeting the multi-constraint interactions where LLMs most often fail.

Core claim

ReLoop decomposes LLM code generation for optimization into a four-stage reasoning chain (understand, formalize, synthesize, verify) and augments it with behavioral verification that tests whether the formulation produces expected changes under solver-based parameter perturbations, thereby closing the feasibility-correctness gap, reaching 100 percent executable code on Claude Opus 4.6, and delivering consistent accuracy gains on foundation models across RetailOpt-190, MAMO-ComplexLP, and related benchmarks.

What carries the argument

ReLoop's two complementary mechanisms: a four-stage structured reasoning chain that prevents formulation errors at generation time and solver-driven parameter perturbation testing that supplies an external semantic signal without ground truth.

If this is right

  • Structured generation yields its largest accuracy lift of 8.5 percentage points on compositional problems such as RetailOpt-190 with Claude Opus 4.6.
  • Behavioral verification contributes its largest gain of 4.4 percentage points on MAMO-ComplexLP, a benchmark dominated by localized defects.
  • Diagnostic execution recovery combined with the two mechanisms produces 100 percent executable code on Claude Opus 4.6.
  • Chat-tuned foundation models receive consistent accuracy improvements across the three evaluated benchmarks.
  • Narrowly tuned supervised fine-tuned (SFT) models remain brittle when chain-of-thought prompts alter their learned output formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The perturbation-testing approach could extend to other domains where generated code or models must satisfy implicit mathematical constraints beyond simple executability.
  • RetailOpt-190 supplies a focused testbed that future work can use to measure progress on multi-constraint optimization formulations.
  • The documented interaction between chain-of-thought prompting and supervised fine-tuned output formats suggests targeted fine-tuning strategies may be needed for reliable reasoning in narrow domains.

Load-bearing premise

That a formulation's correct response to solver-based parameter perturbations reliably indicates its semantic correctness without any ground-truth solution.
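
The premise can be made concrete with two standard sensitivity facts. The derivation below is a sketch under an assumed setting (a maximization problem with nonnegative variables and constraints of the form Ax ≤ b); the symbols z*, c, b, and A are illustrative and are not taken from the paper.

```latex
\[
z^*(c, b) \;=\; \max_{x \ge 0} \{\, c^\top x \;:\; A x \le b \,\}
\]
% (i) Raising objective coefficients cannot lower the optimum.
\[
c' \ge c \;\Longrightarrow\; c'^\top x \ge c^\top x \ \ \forall x \ge 0
\;\Longrightarrow\; z^*(c', b) \ge z^*(c, b)
\]
% (ii) Tightening right-hand sides shrinks the feasible set.
\[
b' \le b \;\Longrightarrow\; \{x \ge 0 : A x \le b'\} \subseteq \{x \ge 0 : A x \le b\}
\;\Longrightarrow\; z^*(c, b') \le z^*(c, b) \ \text{or the perturbed problem is infeasible}
\]
```

A generated formulation that violates (i) or (ii) cannot have the assumed structure, so a violation is informative. A formulation that satisfies both may still be wrong, which is why the premise is load-bearing: these are necessary conditions, not sufficient ones.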

What would settle it

A formulation that passes all perturbation tests yet produces an optimal objective value differing from the known correct optimum on a verifiable problem instance.
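
To illustrate what such an instance might look like, here is a minimal sketch on a toy single-constraint profit model. Everything in it is hypothetical: the model, the `wrong_units` defect, the data, and the two direction tests are stand-ins chosen for illustration, not the paper's benchmarks or its verification procedure.

```python
def greedy_profit(price, weight, demand, capacity):
    """Exact optimum of: max sum(p*x) s.t. sum(w*x) <= capacity, 0 <= x <= demand.

    With a single capacity constraint, filling by profit-per-weight is optimal.
    """
    profit, remaining = 0.0, capacity
    items = sorted(zip(price, weight, demand), key=lambda t: t[0] / t[1], reverse=True)
    for p, w, d in items:
        qty = min(d, remaining / w)
        profit += p * qty
        remaining -= w * qty
    return profit


def reference(data):
    """Intended formulation: capacity limits total weight."""
    return greedy_profit(data["price"], data["weight"], data["demand"], data["capacity"])


def wrong_units(data):
    """Hypothetical defect: capacity is applied to unit counts, ignoring weight."""
    ones = [1.0] * len(data["price"])
    return greedy_profit(data["price"], ones, data["demand"], data["capacity"])


def passes_direction_tests(solve, data):
    """Two illustrative perturbation tests on a maximization model."""
    base = solve(data)
    tighter = solve({**data, "capacity": 0.8 * data["capacity"]})
    pricier = solve({**data, "price": [1.1 * p for p in data["price"]]})
    # Tightening capacity should lower profit; raising prices should raise it.
    return tighter < base and pricier > base


# Hypothetical instance data.
DATA = {"price": [5.0, 4.0], "weight": [2.0, 1.0], "demand": [10.0, 10.0], "capacity": 15.0}

print(passes_direction_tests(reference, DATA), reference(DATA))
print(passes_direction_tests(wrong_units, DATA), wrong_units(DATA))
```

In this toy case both formulations respond to each perturbation in the expected direction, yet their optimal objective values differ. Whether ReLoop's actual tests would miss a defect of this kind is not something this sketch can establish.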

Figures

Figures reproduced from arXiv: 2602.15983 by Chaoyu Zhang, Chung-Piaw Teo, Hanzhang Qin, Huiling Chen, Junbo Jacob Lian, Yujun Sun.

Figure 1: ReLoop overview. Structured Generation mirrors expert modeling practice: understand the problem, formalize the mathematical model with explicit variable-type reasoning, synthesize Gurobi code with data extraction, and self-verify completeness. Behavioral Verification: L1 checks execution correctness (FATAL blocks output); L2 tests constraint (CPT) and objective (OPT) presence via solver-based perturbation …
Original abstract

Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations -- a feasibility-correctness gap reaching 90 percentage points on compositional problems. We introduce ReLoop, which addresses this gap through two complementary mechanisms. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify), preventing formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation -- an external semantic signal that bypasses LLM self-review and requires no ground truth. The two mechanisms are complementary by error structure: structured generation drives the largest gains on compositional problems (+8.5pp accuracy on RetailOpt-190 with Claude Opus 4.6), while behavioral verification dominates on localized defects (+4.4pp on MAMO-ComplexLP, its largest contribution across benchmarks). Combined with diagnostic execution recovery, ReLoop reaches 100% executable code on Claude Opus 4.6 and consistently improves accuracy on chat-tuned foundation models across three benchmarks; we further identify a known limitation of narrowly-tuned SFT models, whose learned output formats are brittle to chain-of-thought prompts -- an interaction we document and analyze. We release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ReLoop for reliable LLM-based optimization modeling. It combines structured generation (a four-stage chain: understand, formalize, synthesize, verify) with behavioral verification that tests whether generated formulations respond correctly to solver-based parameter perturbations. The approach is claimed to close the feasibility-correctness gap, yielding 100% executable code on Claude Opus 4.6, accuracy gains of +8.5pp on the new RetailOpt-190 benchmark and +4.4pp on MAMO-ComplexLP, and consistent improvements across three benchmarks on chat-tuned models. The paper also releases RetailOpt-190 and documents brittleness in narrowly-tuned SFT models to chain-of-thought prompts.

Significance. If the experimental claims are substantiated with full details, ReLoop would offer a practical advance in reducing silent semantic errors in LLM-generated optimization code, a known pain point in operations research and software engineering. The release of a targeted compositional benchmark and the explicit complementarity analysis between generation and verification steps add reusable value beyond the specific method.

major comments (1)
  1. [Behavioral verification mechanism] The behavioral verification procedure is load-bearing for the reported accuracy gains, yet the abstract supplies no explicit definition or derivation of the 'correct response' criterion (e.g., monotonicity of objective value, feasibility after perturbation, or sensitivity bounds) that can be computed solely from the formulation and solver output without ground truth. Without this, it is unclear how the method distinguishes formulation errors from coincidental satisfaction of the chosen test, directly affecting attribution of the +8.5pp and +4.4pp improvements.
minor comments (2)
  1. [Abstract] The abstract states accuracy improvements and 100% executability but omits baseline comparisons, ablation results, error bars, and experimental protocol details; these must be supplied in the main text and appendix for reproducibility.
  2. [Discussion of SFT limitations] The observation on SFT model brittleness to chain-of-thought prompts is noted but would benefit from a short quantitative table showing the interaction across the evaluated models.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of ReLoop in addressing the feasibility-correctness gap. We respond to the single major comment below and will incorporate clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Behavioral verification mechanism] The behavioral verification procedure is load-bearing for the reported accuracy gains, yet the abstract supplies no explicit definition or derivation of the 'correct response' criterion (e.g., monotonicity of objective value, feasibility after perturbation, or sensitivity bounds) that can be computed solely from the formulation and solver output without ground truth. Without this, it is unclear how the method distinguishes formulation errors from coincidental satisfaction of the chosen test, directly affecting attribution of the +8.5pp and +4.4pp improvements.

    Authors: We agree that the abstract would benefit from an explicit definition of the criterion. In the full manuscript (Section 3.2), the 'correct response' is defined via perturbation tests grounded in optimization sensitivity: for an objective-coefficient increase, the optimal value must not decrease; for a constraint tightening, feasibility must be preserved or the objective must adjust monotonically. These checks use only pre/post-perturbation solver outputs (objective value and feasibility status) and require no ground truth. Multiple independent perturbations reduce the chance of coincidental satisfaction, as common formulation errors (e.g., sign flips or omitted terms) systematically violate the expected monotonicity or feasibility response. Ablation results in the paper attribute the +4.4pp gain on MAMO-ComplexLP specifically to this step. We will revise the abstract to include a concise statement of the criterion and its derivation.

    Revision: yes
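
The criterion in this response can be sketched as a check that uses only pre- and post-perturbation solver outputs. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the toy solvers, the `perturbation_check` helper, the `no_response` label, and the data are all hypothetical, and a pure-Python greedy solve stands in for a real solver call.

```python
def solve_correct(data):
    """Toy model: max sum(p*x) s.t. sum(w*x) <= capacity, 0 <= x <= demand.

    A single capacity constraint makes the profit-per-weight greedy fill exact.
    """
    items = sorted(zip(data["price"], data["weight"], data["demand"]),
                   key=lambda t: t[0] / t[1], reverse=True)
    remaining, profit = data["capacity"], 0.0
    for price, weight, demand in items:
        qty = min(demand, remaining / weight)
        profit += price * qty
        remaining -= weight * qty
    return "OPTIMAL", profit


def solve_missing_capacity(data):
    """Hypothetical defect: the capacity constraint was silently dropped."""
    return "OPTIMAL", float(sum(p * d for p, d in zip(data["price"], data["demand"])))


def perturbation_check(solve, data, key, factor, expected_sign, tol=1e-9):
    """Scale data[key] by factor, re-solve, and classify the objective response.

    expected_sign = +1: objective must not decrease; -1: must not increase.
    """
    _, base = solve(data)
    value = data[key]
    scaled = [v * factor for v in value] if isinstance(value, list) else value * factor
    status, new = solve({**data, key: scaled})
    if status == "INFEASIBLE":
        return "ok"  # the perturbed constraint clearly binds
    delta = new - base
    if expected_sign * delta < -tol:
        return "violation"    # moved in the wrong direction
    if abs(delta) <= tol:
        return "no_response"  # assumption: flag for review, parameter may be unused
    return "ok"


# Hypothetical instance data.
DATA = {"price": [5.0, 4.0], "weight": [2.0, 1.0], "demand": [10.0, 10.0], "capacity": 15.0}

for solver in (solve_correct, solve_missing_capacity):
    print(solver.__name__,
          perturbation_check(solver, DATA, "capacity", 0.8, -1),
          perturbation_check(solver, DATA, "price", 1.1, +1))
```

Treating an unchanged objective as suspicious is a design choice of this sketch. A correct formulation with a non-binding constraint would also show no response, so a flag of this kind is a prompt for review rather than proof of an omitted term.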

Circularity Check

0 steps flagged

No circularity; claims rest on new benchmarks and external solver signals

full rationale

The paper introduces ReLoop via structured four-stage generation and behavioral verification through solver-based parameter perturbations as an external semantic signal requiring no ground truth. These mechanisms are evaluated on newly released benchmarks (RetailOpt-190) and existing ones (MAMO-ComplexLP) with reported accuracy deltas attributed to the methods rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivations reduce the central claims to their inputs by construction; the verification procedure is presented as independent of LLM self-review and falsifiable via solver outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLMs exhibit a large feasibility-correctness gap on compositional problems and that external solver signals can detect semantic errors.

axioms (1)
  • domain assumption: LLMs can translate natural language into optimization code but produce a feasibility-correctness gap reaching 90 percentage points on compositional problems
    Stated directly in the abstract as the core motivation and risk.
invented entities (1)
  • ReLoop (no independent evidence)
    purpose: Framework combining structured generation and behavioral verification to close the feasibility-correctness gap
    New system introduced to address the identified gap

pith-pipeline@v0.9.0 · 5584 in / 1215 out tokens · 27988 ms · 2026-05-15T21:18:38.415078+00:00 · methodology

