Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

Changhong He; Chenggong Shi; Dongdong Geng; Xizi Luo; Yu Mei

arxiv: 2606.04816 · v1 · pith:7FBCY6F7new · submitted 2026-06-03 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-28 05:49 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords large language models · vehicle routing problems · constraint injection · optimization modeling · code generation · Gurobi · policy optimization

The pith

Constraint injection with probes catches errors in LLM-generated VRP solver code that objective checks miss

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce solver code for optimization problems that passes objective-equivalence tests but contains extra or missing constraints when those constraints do not affect the tested cases. The paper proposes constraint injection to address this by using feasible probes to detect over-constraining and single-violation probes to detect omitted constraints. These are combined with differential testing into a dual verifier. The verifier is applied during data synthesis and as a reward in training an 8B model called VRPCoder-GRPO for translating VRP descriptions into Gurobi code. This leads to high accuracy on benchmarks covering 21 VRP variants.

Core claim

Constraint injection using feasible probes to expose spurious over-constraint and one-constraint-violating probes to reveal silent constraint omission, combined with differential testing, forms a dual verifier that enables an 8B model to reach 93% average Pass@1 on four VRP benchmarks while outperforming larger models on several.

What carries the argument

Constraint injection via feasible and one-constraint-violating probes within a dual verifier alongside differential testing
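
To make the probe idea concrete, here is a minimal sketch in plain Python rather than Gurobi. Everything in it is an illustrative assumption, not the authors' implementation: the toy instance (a depot and four customers on a line, two vehicles), the constraint names, the brute-force `optimum`, and the hand-written probes. A "model" is treated as a list of predicates over a route plan; this page does not say how the paper constructs or injects its probes.

```python
from itertools import permutations, product

# Hypothetical toy instance: depot 0 and four customers on a line, two vehicles.
POS = {0: 0, 1: -2, 2: -3, 3: 2, 4: 3}
DEMAND = {1: 4, 2: 4, 3: 4, 4: 4}
CAPACITY = 10
CUSTOMERS = sorted(DEMAND)

def route_length(route):
    stops = [0, *route, 0]  # every route starts and ends at the depot
    return sum(abs(POS[a] - POS[b]) for a, b in zip(stops, stops[1:]))

def cost(plan):
    return sum(route_length(r) for r in plan)

# Constraints are predicates over a plan (a list of routes).
def visits_each_once(plan):
    return sorted(c for r in plan for c in r) == CUSTOMERS

def within_capacity(plan):
    return all(sum(DEMAND[c] for c in r) <= CAPACITY for r in plan)

def short_routes(plan):  # spurious constraint, absent from the problem statement
    return all(route_length(r) <= 6 for r in plan)

REFERENCE = [visits_each_once, within_capacity]
MISSING_CAPACITY = [visits_each_once]                               # silent omission
OVER_CONSTRAINED = [visits_each_once, within_capacity, short_routes]

def accepts(model, plan):
    return all(constraint(plan) for constraint in model)

def all_plans():
    """Brute-force enumeration of two-vehicle plans (fine for four customers)."""
    for assign in product((0, 1), repeat=len(CUSTOMERS)):
        groups = [[c for c, v in zip(CUSTOMERS, assign) if v == k] for k in (0, 1)]
        for r0 in permutations(groups[0]):
            for r1 in permutations(groups[1]):
                yield [list(r0), list(r1)]

def optimum(model):
    return min(cost(p) for p in all_plans() if accepts(model, p))

def dual_verify(candidate, feasible_probes, violating_probes):
    return {
        # differential test: same optimal objective as the reference model
        "objective_match": optimum(candidate) == optimum(REFERENCE),
        # feasible probes must be accepted, otherwise the candidate over-constrains
        "accepts_feasible": all(accepts(candidate, p) for p in feasible_probes),
        # one-constraint-violating probes must be rejected, otherwise a constraint is missing
        "rejects_violating": all(not accepts(candidate, p) for p in violating_probes),
    }

FEASIBLE_PROBES = [[[1, 3], [2, 4]]]   # satisfies every reference constraint
VIOLATING_PROBES = [[[1, 2, 3], [4]]]  # violates capacity only (load 12 > 10)

for name, model in [("reference", REFERENCE),
                    ("missing capacity", MISSING_CAPACITY),
                    ("over-constrained", OVER_CONSTRAINED)]:
    print(name, dual_verify(model, FEASIBLE_PROBES, VIOLATING_PROBES))
```

On this instance both faulty candidates reach the same optimal objective as the reference, because the omitted and the spurious constraint are non-binding at the optimum, so a differential test alone would pass them. The capacity-violating probe exposes the omission, and the feasible probe exposes the spurious route-length limit.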

If this is right

VRPCoder-GRPO achieves 93% average Pass@1 across four VRP benchmarks
The model outperforms Gemini-3.1-Pro Preview on three of the benchmarks
It exceeds Claude-Sonnet-4.5 performance by 28 points on average
It surpasses prior OR-specific LLMs by 78 points on average

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may apply to other constraint-dense optimization domains if suitable probe designs can be created for their constraints.
Focusing verification on explicit constraint behavior rather than solution equivalence could improve reliability in code generation for operations research.
Extending the benchmark to include more interacting constraints would test the generalizability of the probe method.

Load-bearing premise

The expert-verified set of 21 VRP variants covers the space of relevant coupled operational constraints sufficiently for the method to generalize.

What would settle it

A counterexample would be an LLM-generated VRP solver script that passes all probes and differential tests yet produces incorrect solutions on a new operational constraint combination not covered in the 21 variants.

Figures

Figures reproduced from arXiv: 2606.04816 by Changhong He, Chenggong Shi, Dongdong Geng, Xizi Luo, Yu Mei.

**Figure 2.** Two illustrative attack operators for construct…

**Figure 3.** Data synthesis pipeline. Stage 1 assembles…

**Figure 4.** An overview of our training pipeline.

4.3.1 Supervised Fine-Tuning. For each sample, $q$ is the input and $y = (y_1, \dots, y_T)$ is the tokenization of $C_{\mathrm{regen}}$ used as the target. The SFT loss is the standard causal LM cross-entropy:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(q,y)\sim\mathcal{D}}\left[\sum_{t=1}^{T} \log p_\theta(y_t \mid q, y_{<t})\right]. \tag{10}$$

The fine-tuned model is denoted VRPCoder-SFT.

4.3.2 GRPO Training. We further fine-tune the policy $\pi_\theta$ initialized from VRPCode…

**Figure 5.** Scenario Instantiation Prompts

**Figure 6.** Self-Critique and Repair Prompts. Critic and repair prompts used in the generate–critique–repair loop.

**Figure 7.** Condensation prompts. The prompt template used for condensation, consisting of a shared system…

**Figure 8.** Index Rewriting Prompts

**Figure 9.** Index Rewriting Check Prompts

**Figure 10.** Shared code-regeneration prompt used to produce…
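
The SFT objective quoted in the text captured with Figure 4 (Eq. 10) is a sum of per-token negative log-likelihoods, averaged over training pairs. A small numeric sketch follows, assuming hypothetical probabilities that a model assigns to each target token; the function names and the numbers are illustrative and do not come from the paper.

```python
import math

def sequence_nll(target_token_probs):
    """-sum_t log p(y_t | q, y_<t) for a single (q, y) pair."""
    return -sum(math.log(p) for p in target_token_probs)

def sft_loss(batch):
    """Sample estimate of the expectation over (q, y): mean of per-sequence NLLs."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)

# Hypothetical probabilities assigned to the correct next token at each step.
batch = [
    [0.5, 0.5],    # two-token target, each token predicted with probability 0.5
    [1.0, 0.25],   # first token certain, second predicted with probability 0.25
]
print(sft_loss(batch))
```

Note that Eq. 10 sums over the $T$ target tokens without dividing by $T$, so longer targets contribute more to the loss; the sketch mirrors that choice.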

Original abstract

Large language models (LLMs) increasingly translate natural-language optimization problems into executable solver code. Yet for constraint-dense operations research (OR) problems, existing data-filtering and training pipelines largely rely on objective-equivalence signals such as differential testing and answer agreement, which a program can pass while adding spurious constraints or silently omitting required ones, whenever those constraints are non-binding on the tested instance. We propose constraint injection, which uses feasible probes to expose spurious over-constraint and one-constraint-violating probes to reveal silent constraint omission. Combined with differential testing, it forms a dual verifier. We instantiate and evaluate it on vehicle routing problems (VRPs), a representative constraint-dense combinatorial optimization testbed with coupled operational constraints. We develop VRPCoder, an 8B end-to-end model that translates natural-language VRP scenarios into Gurobi scripts, together with an expert-verified VRP benchmark suite covering 21 variants. The verifier is reused as a rejection-sampling filter during data synthesis and as a per-rollout reward in group relative policy optimization (GRPO). Across four VRP benchmarks, VRPCoder-GRPO reaches 93% average Pass@1, outperforms Gemini-3.1-Pro Preview on three benchmarks, exceeds Claude-Sonnet-4.5 by 28 average points, and surpasses prior OR-LLMs by 78 average points.
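
The abstract's "per-rollout reward in GRPO" can be illustrated with the group-relative normalization that GRPO applies: each rollout's reward is compared against the mean and standard deviation of the rollouts sampled for the same prompt. The sketch below assumes a binary reward (1 if a rollout passes the dual verifier, 0 otherwise); the binary reward, the `eps` value, and the use of the population standard deviation are assumptions made here, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier outcomes for six rollouts of one prompt.
mixed = group_relative_advantages([1, 0, 0, 1, 1, 0])
uniform = group_relative_advantages([1, 1, 1, 1, 1, 1])
print(mixed)
print(uniform)
```

A group whose rollouts all receive the same reward yields zero advantages, so that prompt contributes no learning signal under this normalization.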

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Constraint injection gives a practical way to catch extra or missing constraints in LLM-generated VRP code that objective checks miss, but the 21-variant benchmark is the main limit on how far the claims travel.

The new piece is the dual verifier: feasible probes to flag spurious over-constraints and one-constraint-violating probes to flag silent omissions, run alongside differential testing. They reuse this verifier both to filter synthetic data and as the reward signal in GRPO for an 8B model called VRPCoder. On their four VRP benchmarks the model reaches 93% average Pass@1 and beats the cited larger models on most of them.

The approach is straightforward and targets a real weakness in current LLM-to-solver pipelines for constraint-heavy problems. Building an expert-verified suite of 21 variants and showing the numbers is concrete work.

The soft spot is exactly the one the stress test flags. Everything is measured inside that 21-variant set. If certain couplings (time windows plus precedence plus capacity in ways not represented there) appear in practice, the probes could pass code that still has undetected errors. The abstract gives no ablation on how the probes are built, no error bars, and no separate hold-out set, so the 93% figure is hard to read as a general reliability number.

This is for people working on LLM-assisted optimization modeling, especially anyone who has seen generated code pass objective tests but fail on real instances. It is worth sending to referees because the verification primitive is new enough and the results are reported clearly enough to merit a full review, even if the generalization question needs more data.

Referee Report

2 major / 2 minor

Summary. The paper claims that constraint injection (feasible probes to expose spurious over-constraint, one-constraint-violating probes to reveal silent omission), combined with differential testing, forms a dual verifier that enables reliable LLM-based generation of VRP solver code. The authors introduce VRPCoder, an 8B model trained via GRPO, with the verifier serving as both a rejection-sampling filter and a per-rollout reward. They report 93% average Pass@1 across four VRP benchmarks, outperforming Gemini-3.1-Pro Preview on three, Claude-Sonnet-4.5 by 28 points, and prior OR-LLMs by 78 points, on an expert-verified suite of 21 VRP variants.

Significance. If the dual verifier generalizes, the approach addresses a genuine limitation of objective-equivalence signals in constraint-dense OR problems and could improve reliability of LLM-generated optimization models. The empirical scale of the gains (93% Pass@1, large margins over frontier models) would be notable if independently reproducible; the paper ships no machine-checked proofs or parameter-free derivations but does provide a concrete, reusable probe-based mechanism.

major comments (2)

[Benchmark suite description (abstract and evaluation sections)] Benchmark construction and evaluation protocol: the central claim that the dual verifier reliably detects both over- and under-constraint rests on the premise that the expert-verified 21-variant suite sufficiently covers coupled operational constraints (time-window + capacity + precedence interactions). No analysis or coverage argument is supplied showing that probe-based detection generalizes beyond these instances; this assumption directly supports both the 93% Pass@1 figure and the training pipeline.
[VRPCoder-GRPO training description] Data synthesis and training pipeline: the verifier is reused for rejection sampling during data synthesis and as the GRPO reward signal. This creates a circularity risk where Pass@1 performance may reflect the verifier’s own inductive biases rather than independent correctness; no ablation isolating the contribution of constraint-injection probes versus differential testing alone is reported.

minor comments (2)

[Results] Abstract and results sections report strong aggregate Pass@1 numbers but supply no error bars, per-benchmark breakdowns, or statistical significance tests.
[Constraint injection method] No details are given on how feasible and one-constraint-violating probes are automatically constructed or on their sensitivity to instance size.
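
One way to attach the error bars requested in the first minor comment is a binomial confidence interval on Pass@1. The sketch below uses the standard unbiased pass@k estimator and a Wilson score interval. The counts are hypothetical placeholders, since this page does not give benchmark sizes, and nothing here reflects how the paper computed its numbers.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical benchmark: 200 problems, one sample each, 186 solved (93%).
low, high = wilson_interval(186, 200)
print(f"Pass@1 = {186 / 200:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

With one sample per problem, pass@1 reduces to the fraction of problems solved, which is why a binomial interval is a reasonable first choice; it treats problems as independent, which may not hold across closely related VRP variants.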

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that directly respond to the concerns raised.

Referee: Benchmark construction and evaluation protocol: the central claim that the dual verifier reliably detects both over- and under-constraint rests on the premise that the expert-verified 21-variant suite sufficiently covers coupled operational constraints (time-window + capacity + precedence interactions). No analysis or coverage argument is supplied showing that probe-based detection generalizes beyond these instances; this assumption directly supports both the 93% Pass@1 figure and the training pipeline.

Authors: We agree that an explicit discussion of coverage would strengthen the manuscript. The 21-variant suite was constructed with OR domain experts specifically to include representative coupled interactions (time windows with capacity, precedence with time windows, etc.), as described in Section 4.2 and the benchmark appendix. However, we did not supply a formal coverage argument or generalization analysis. In revision we will add a new subsection in the evaluation section that details the expert-driven variant selection process, enumerates the covered constraint interaction classes, and explicitly acknowledges the limits of the current suite with respect to broader generalization of the probe mechanism. revision: yes
Referee: Data synthesis and training pipeline: the verifier is reused for rejection sampling during data synthesis and as the GRPO reward signal. This creates a circularity risk where Pass@1 performance may reflect the verifier’s own inductive biases rather than independent correctness; no ablation isolating the contribution of constraint-injection probes versus differential testing alone is reported.

Authors: We acknowledge the circularity risk as a substantive methodological concern. The constraint-injection probes are defined via instance-independent logical properties (feasible probes and single-constraint violation probes) rather than the evaluation instances themselves, and differential testing supplies an orthogonal objective-equivalence signal. Nevertheless, the reuse of the same verifier for both filtering and reward does warrant explicit isolation. We will add an ablation study in the revised training section that compares (i) differential testing alone versus (ii) the full dual verifier for both rejection sampling and GRPO reward, reporting the resulting Pass@1 differences on the held-out benchmarks. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; dual verifier and performance metrics remain independent of fitted inputs

The paper presents the dual verifier (constraint injection via feasible and one-constraint-violating probes plus differential testing) as an independent mechanism that is applied both for data filtering/reward in GRPO training and for evaluation. The reported 93% average Pass@1 and outperformance figures are empirical outcomes measured on the expert-verified benchmark suite; no equations, parameter fits, or self-citations are exhibited that reduce these outcomes to quantities defined by construction from the training process itself. The coverage of the 21 variants is an external validity assumption rather than a definitional reduction. This qualifies as at most minor (score 2) with no steps meeting the strict quotation-and-reduction criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of an expert-verified benchmark that adequately samples constraint interactions and on the assumption that probe violations reliably indicate model errors without introducing new false positives.

axioms (1)

standard math · Standard mathematical properties of linear programming feasibility and constraint satisfaction hold for the Gurobi solver outputs.
Invoked implicitly when using solver scripts as ground truth.

invented entities (1)

Constraint injection probes · no independent evidence
purpose: Detect spurious or omitted constraints via feasible and violating test instances.
New verification construct introduced to address limitations of objective equivalence.

pith-pipeline@v0.9.1-grok · 5790 in / 1326 out tokens · 19793 ms · 2026-06-28T05:49:14.208523+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Proximal Policy Optimization Algorithms

Association for Computing Machinery. Gerhard Reinelt. 1991. TSPLIB–A Traveling Salesman Problem Library. ORSA Journal on Computing, 3(4):376–384. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. Preprint, arXiv:1707.06347. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx...

work page internal anchor Pith review Pith/arXiv arXiv 1991
[2]

“locked” means that the profile family directly uses the resource values from its grid rather than the default rules in Appendix B.2; “–

New Benchmark Instances for the Capacitated Vehicle Routing Problem. European Journal of Operational Research, 257(3):845–858. Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, and Dongxiang Zhang. ...

work page arXiv 2025
[3]

We train for 3 epochs with sequence length
[4]

Frontier filtering

The optimizer is AdamW with learning rate $2 \times 10^{-4}$, cosine scheduling, warmup ratio 0.1, weight decay 0.01, and maximum gradient norm 1.0. Frontier filtering. For each prompt in the SFT pool, we sample M = 6 offline rollouts from VRPCoder-SFT under the same decoding settings as online GRPO (temperature 0.4, top-p 0.95), score each rollout with the dual ver...

2020
[5]

Do not omit, invent, or round any number

Preserve ALL numerical data exactly (coordinates, demands, capacities, fleet size, time windows, service times, max distances, etc.). Do not omit, invent, or round any number
[6]

No math symbols, variable names, or code jargon

State the optimization objective and every constraint in business language. No math symbols, variable names, or code jargon
[7]

Truck 0 (capacity 50.0, fixed cost 70.0)

Index preservation (critical): every vehicle, customer, and depot must carry its explicit numeric ID from the code. Write "Truck 0 (capacity 50.0, fixed cost 70.0)" -- never describe entities by type alone without their ID. Never renumber or reorder. Unit consistency:
[8]

3.4 pallets

Demand/capacity values may be fractional (e.g. 3.4). Use continuous units only (tons, kilograms, liters). Never use bare countable nouns (pallets, boxes, items) -- "3.4 pallets" is nonsensical
[9]

Use a tangible goods unit instead

If the problem has time-window or scheduling constraints, do NOT also use time-based units for demand/capacity -- this creates a logical contradiction. Use a tangible goods unit instead. Style:
[10]

Structure: background -> resources/data -> operational constraints -> objective
[11]

No Markdown, no bullets/tables/dialogue/email tone

Plain prose only. No Markdown, no bullets/tables/dialogue/email tone
[12]

a capacity of 50.0

If a scene hint is given, adapt terminology naturally while preserving the optimization structure. Scenario Instantiation: User Prompt Generate a natural-language business problem statement from the code below. ===== Scene and Style Instructions ===== {scene_hint} ===== Gurobi Code (with complete data) ===== {code} ===== Critical Numbers That Must Appear ...
[13]

Numerical completeness: all critical values (depot/customer/vehicle counts, capacities, demands, coordinates, time windows, etc.) present?
[14]

Constraint coverage: every code constraint has a corresponding business description?
[15]

Objective: optimization direction correctly stated?
[16]

No fabrication: no numbers or conditions invented beyond the code?
[17]

Index-attribute binding: every entity carries its numeric ID, and ID-to-attribute mappings (ID -> capacity, cost, type) are faithful to the code? Missing or ambiguous IDs = INCOMPLETE
[18]

Unit plausibility: demand/capacity units are continuous (not countable nouns) and do not conflict with time-window dimensions? Output exactly one of: - COMPLETE - INCOMPLETE: <discrepancy_1> | <discrepancy_2> | ... Self-Critique: Critic User Prompt Please compare the following content and determine whether the natural-language description fully and accurat...
[19]

Only modify the parts identified as problematic
[20]

Preserve the correct and fluent parts of the original description
[21]

Ensure all critical numbers appear accurately after revision
[22]

Do not introduce new errors. Repair: User Prompt Review feedback: {criticism} Original code and data: {code} Current natural-language description: {nl_description} Please output the revised complete natural-language description (only modify the parts identified as problematic; preserve the correct parts): Figure 6: Self-Critique and Repair Prompts. Critic ...
[23]

Do not delete, alter, round, merge, or restate numbers in a different form

Preserve every numeric value exactly as written. Do not delete, alter, round, merge, or restate numbers in a different form
[24]

Do not renumber, reorder, relabel, or collapse depots, customers, vehicles, compartments, pickup-delivery pairs, or any other named entities

Preserve every explicit entity ID exactly as written. Do not renumber, reorder, relabel, or collapse depots, customers, vehicles, compartments, pickup-delivery pairs, or any other named entities
[25]

Preserve every fact that changes the optimization problem: objective terms, route start/end behavior, depot assignment, fleet size, capacities, coordinates, demands, service times, time windows, maximum distance or duration, fixed or travel costs, heterogeneous vehicle attributes, compartment rules, pickup-delivery logic, backhaul ordering, split-delivery p...
[26]

This is condensation, not stylistic rewriting

Keep the original business scene and terminology. This is condensation, not stylistic rewriting
[27]

<See variant-specific block below>
[28]

each node has exactly one incoming arc and exactly one outgoing arc

Micro-examples of allowed compression: (1) "each node has exactly one incoming arc and exactly one outgoing arc" can usually be shortened to "each customer/location must be visited exactly once" if route start/end behavior is already clear elsewhere. (2) "no self-loops, no arcs into depot 0, no arcs out of sink 13" can be deleted if these are only solver-hy...
[29]

Remove tutorial tone, repeated reminders, and repeated restatements of the same requirement

Merge duplicated statements into one concise statement. Remove tutorial tone, repeated reminders, and repeated restatements of the same requirement
[30]

Do not output bullets, markdown, JSON, XML, commentary, or surrounding quotation marks

Prefer plain prose. Do not output bullets, markdown, JSON, XML, commentary, or surrounding quotation marks
[31]

Output only the condensed problem statement

If the source is already concise, make only small cuts. Output only the condensed problem statement. Condensation: Rule 5 for Pilot 1
[32]

no arcs into the start depot

You MAY delete solver-side or formulation-side wording when it is not real business content. Typical deletable items include self-loop prohibitions, forbidden-arc wording such as "no arcs into the start depot" or "no arcs out of the sink", exact in-degree / out-degree phrasing, explicit subtour-elimination wording, and explanations of why a virtual depo...
[33]

virtual depot

You MUST delete ALL virtual-node wording. Specifically remove: (1) Any mention of "virtual depot", "virtual sink", "virtual return node", "dummy node" and their node IDs when used only as route endpoints. (2) Sentences like "Upon completing their routes, all trucks must proceed to Virtual Distribution center N". (3) Replace with natural route termination:...
[34]

First identify every node/location identifier that appears in the source text: depots, customers, pickup nodes, delivery nodes, virtual depots, sink nodes, endpoint nodes, start nodes, and end nodes
[35]

Replace all of those node/location identifiers consistently using the requested target style
[36]

Do not rely on any external mapping table

Create the node-ID mapping yourself from the source text. Do not rely on any external mapping table
[37]

Do not change vehicle IDs or any non−ID number
[38]

Preserve all constraints and business meaning
[39]

Do not add nodes or virtual-node details that are absent from the source
[40]

Report the exact mapping you used in the JSON field node_id_map
[41]

rewritten_nl

Do not output markdown, explanations, or commentary. Output JSON only: { "rewritten_nl": "...", "node_id_map": [ {"source_id": "0", "target_id": "A", "role": "start_depot"}, {"source_id": "7", "target_id": "B", "role": "virtual_end_depot"} ] } Index Rewriting: User Prompt Variant: {variant} Target style: {target_style} Source description: {original_nl} R...
[42]

import gurobipy as gp and from gurobipy import GRB
[43]

It must call m.optimize() and return the model object

Define def build_model(): containing all model logic. It must call m.optimize() and return the model object
[44]

Set m.setParam('OutputFlag', 0) before optimize
[45]

Model ALL constraints described in the problem -- omit none, fabricate none
[46]

Use exact numerical values from the problem description. Index convention: - Use internal integer node indices 0..N-1 for routing variables, where N is the total number of routing nodes: depots, customers, pickup/delivery nodes, and explicit virtual/end nodes if present. - If the problem already uses natural zero-based node IDs, use them directly as model...
