pith. sign in

arxiv: 2602.02188 · v2 · submitted 2026-02-02 · 💻 cs.AI

Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization

Pith reviewed 2026-05-16 08:16 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language modelscombinatorial optimizationNLCO benchmarknatural language reasoningfeasibilitysolution optimalitygraph structuresbottleneck objectives
0
0 comments X

The pith

Large language models solve small natural-language combinatorial optimization problems but lose feasibility and quality as instance size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the NLCO benchmark to test whether LLMs can perform end-to-end combinatorial optimization reasoning from plain-language descriptions alone, without code or external solvers. It evaluates models across 43 problems grouped by a four-layer taxonomy of variables, constraints, patterns, and objectives. Results indicate that leading models produce feasible and near-optimal solutions on small instances, yet both feasibility and solution quality fall as sizes increase even with extra reasoning tokens. Set-based tasks succeed more often than graph-structured problems, and bottleneck objectives prove especially difficult. These patterns reveal concrete limits in current LLM reasoning for constrained, high-dimensional search tasks common in planning and decision-making.

Core claim

High-performing models achieve strong feasibility and solution quality on small instances of the NLCO benchmark, but both degrade as instance size grows even if more tokens are used for reasoning. Systematic effects appear across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.

What carries the argument

The NLCO benchmark, which organizes 43 combinatorial optimization problems into a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes and supplies solver-annotated reference solutions for direct evaluation of feasibility, optimality, and reasoning efficiency.

If this is right

  • LLMs need mechanisms beyond added tokens to sustain performance on larger constrained problems.
  • Success rates vary reliably by problem structure, with set-based tasks outperforming graph-structured ones.
  • Bottleneck objectives produce more failures than other objective classes in the taxonomy.
  • The benchmark supports fine-grained diagnosis of where LLM reasoning breaks in combinatorial settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid systems that combine LLMs with traditional solvers may be required for practical use on realistically sized problems.
  • The taxonomy could guide targeted data collection to strengthen model handling of graph and bottleneck cases.
  • Pure language-based search appears bounded in high-dimensional spaces, suggesting limits even for future scaled models.

Load-bearing premise

The natural-language problem statements accurately encode the intended combinatorial instances without ambiguities or simplifications, and the solver-annotated solutions are correct references.

What would settle it

A controlled test in which one or more LLMs maintain high feasibility and near-optimal solution rates on instances substantially larger than those in the benchmark, or in which multiple distinct valid solutions are found for the same language description.

Figures

Figures reproduced from arXiv: 2602.02188 by Chengpeng Hu, Chenhao Zhang, Cong Zhang, Jie Gao, Jing Chen, Xia Jiang, Yaoxin Wu, Yingqian Zhang.

Figure 1
Figure 1. Figure 1: Overview of the NLCO benchmark. constraints (e.g., allDifferent ), which reflect solver-level modeling choices but are overly de￾tailed for analyzing LLM behavior, we group them into a small set of patterns used to summarize feasible solutions. Concretely, we use eight pat￾terns, such as Permutation/Tour, Budgeted Subset, etc, which are presented in the middle part of Fig￾ure 1(b); each CO task is associat… view at source ↗
Figure 2
Figure 2. Figure 2: Aggregated performance across (a) VarSort and (b) ObjClass dimensions in NLCO taxonomy. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: AFR–ALOG joint distribution by family (left) and infeasibility mode distribution (right). [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLM data profiles across all difficulty tiers. (a) AFR vs. Token budget; (b) Acc. vs. Token budget. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LLM performance profiles on different NLCO difficulty tiers. (a) Set-S; (b) Set-M; (c) Set-L. [PITH_FULL_IMAGE:figures/full_fig_p043_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-model radar charts showing infeasibility mode distribution across LLMs; larger filled area indicates [PITH_FULL_IMAGE:figures/full_fig_p043_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Best-of-N performance scaling across four models. We report (a) AFR (%), (b) Acc. (%), (c) ALOG, and (d) tok. (of the selected candidate solution) as functions of N. Set-S, the model computes many explicit Euclidean distances, identifies geographic clusters, and uses Clarke–Wright–style “savings” intuition to justify linking nearby customers far from the depot; it then iteratively improves by reordering st… view at source ↗
read the original abstract

While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge the gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NLCO, a benchmark of 43 natural-language combinatorial optimization problems organized by a four-layer taxonomy (variable types, constraint families, global patterns, objective classes). LLMs are evaluated end-to-end on producing discrete solutions from text descriptions alone, using solver-annotated ground truth to measure feasibility, solution quality, and reasoning efficiency. The central empirical claims are that high-performing models succeed on small instances but degrade in feasibility and optimality as size grows (even with more reasoning tokens), with systematic taxonomy effects such as easier set-based tasks versus harder graph-structured problems and bottleneck objectives.

Significance. If the methodological gaps are closed, the benchmark would provide a valuable structured resource for diagnosing LLM limitations in constrained combinatorial search, with the taxonomy enabling fine-grained identification of failure modes. The solver-annotated references and direct NL-to-solution protocol are strengths that support reproducibility and isolate reasoning from tool use.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The manuscript supplies no description of the procedure used to generate natural-language statements from formal CO instances, no validation steps confirming that the NL encodings preserve the intended constraints and objectives without added ambiguities, and no cross-checks on solver annotations. This is load-bearing for the headline claim that performance degradation with instance size reflects reasoning limits rather than parsing or fidelity issues.
  2. [Section 4] Evaluation protocol (Section 4): There is no account of statistical testing for the reported scaling trends, no controls for prompt sensitivity or phrasing variations, and no analysis of whether longer NL statements for larger instances introduce unintended complexity. These omissions weaken support for the systematic taxonomy effects and the attribution of failures to combinatorial reasoning.
minor comments (2)
  1. [Abstract] The abstract states that 43 problems are covered but does not indicate the distribution across taxonomy layers or the range of instance sizes tested; adding this would improve clarity.
  2. [§4.2] Notation for feasibility and optimality metrics could be defined earlier (e.g., before the results tables) to aid readers unfamiliar with CO evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving clarity and rigor in the benchmark construction and evaluation protocol. We have revised the manuscript to address these points directly, adding the requested details and analyses while preserving the core empirical claims.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The manuscript supplies no description of the procedure used to generate natural-language statements from formal CO instances, no validation steps confirming that the NL encodings preserve the intended constraints and objectives without added ambiguities, and no cross-checks on solver annotations. This is load-bearing for the headline claim that performance degradation with instance size reflects reasoning limits rather than parsing or fidelity issues.

    Authors: We agree that a more explicit account of the NL generation process is needed to rule out parsing or fidelity confounds. In the revised manuscript we have added a dedicated subsection in §3 that describes the templated translation procedure from formal CO instances to natural-language statements, the expert review protocol used to verify constraint and objective preservation (including inter-annotator agreement metrics), and the independent solver re-execution checks performed on a random subset of instances. These additions directly support the attribution of scaling failures to reasoning limits. revision: yes

  2. Referee: [Section 4] Evaluation protocol (Section 4): There is no account of statistical testing for the reported scaling trends, no controls for prompt sensitivity or phrasing variations, and no analysis of whether longer NL statements for larger instances introduce unintended complexity. These omissions weaken support for the systematic taxonomy effects and the attribution of failures to combinatorial reasoning.

    Authors: We accept that the original evaluation section lacked formal statistical support and sensitivity controls. The revised version now includes (i) paired statistical tests (Wilcoxon signed-rank) on feasibility and optimality gaps across instance-size bins, (ii) an ablation using three distinct prompt phrasings per problem to quantify prompt sensitivity, and (iii) a regression analysis relating statement token length to performance while controlling for instance size. These additions strengthen the evidence that the observed taxonomy effects and scaling trends arise from combinatorial reasoning demands rather than extraneous linguistic factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark study

full rationale

This paper introduces an empirical benchmark (NLCO) for evaluating LLMs on natural-language combinatorial optimization tasks. It relies on solver-annotated reference solutions as external ground truth and reports performance metrics across a taxonomy of problem types. No mathematical derivations, fitted parameters, self-referential predictions, or ansatzes appear in the abstract or described structure. Central claims about feasibility degradation with instance size are supported by direct experimental comparison to solver outputs rather than by construction or self-citation chains. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical entities; the work relies on standard definitions of combinatorial optimization problems and external solver annotations.

pith-pipeline@v0.9.0 · 5533 in / 1064 out tokens · 32106 ms · 2026-05-16T08:16:49.140224+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    InIn- ternational Conference on Machine Learning, pages 577–596

    OptiMUS: Scalable optimization modeling with (mi) lp solvers and large language models. InIn- ternational Conference on Machine Learning, pages 577–596. Majd Alkayyal, Simon Malberg, and Georg Groh. 2026. An llm-based decision support system for strategic decision-making. InMachine Learning and Knowl- edge Discovery in Databases. Applied Data Science Trac...

  2. [2]

    Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Han- nah Rashkin, Doug Downey, Wen tau Yih, and Yejin Choi

    Machine learning for combinatorial optimiza- tion: a methodological tour d’horizon.European Journal of Operational Research, 290(2):405–421. Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Han- nah Rashkin, Doug Downey, Wen tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In International Conference ...

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Qaplib–a quadratic assignment problem library. Journal of Global optimization, 10(4):391–403. Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, and Jiajun Zhang. 2025. LR²Bench: Evaluating long-chain reflective reasoning capabilities of large language models via constraint satisfaction problems. InFindings of the Association for Computational Linguist...

  4. [4]

    arXiv preprint arXiv:2505.16952 (2025)

    Bpplib: a library for bin packing and cutting stock problems.Optimization Letters, 12(2):235– 250. Elizabeth D Dolan and Jorge J Moré. 2002. Benchmark- ing optimization software with performance profiles. Mathematical programming, 91(2):201–213. Emanuel Falkenauer. 1996. A hybrid grouping genetic algorithm for bin packing.Journal of heuristics, 2(1):5–30....

  5. [5]

    arXiv preprint arXiv:2502.11607 (2025)

    Nested-refinement metamorphosis: Reflective evolution for efficient optimization of networking problems. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 17398–17429. Martin Hoefer. 2006. Ufllib: A collection of benchmark instances for the uncapacitated facility location prob- lem. https://resources.mpiinf.mpg.de/depa rtments...

  6. [6]

    InProceedings of the AAAI Con- ference on Artificial Intelligence, volume 26, pages 493–498

    Non-model-based search guidance for set par- titioning problems. InProceedings of the AAAI Con- ference on Artificial Intelligence, volume 26, pages 493–498. Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. 2017. Learning combinatorial opti- mization algorithms over graphs.Advances in neural information processing systems, 30. Gorka Kobe...

  7. [7]

    Wouter Kool, Herke van Hoof, and Max Welling

    An efficient evolutionary algorithm for the orienteering problem.Computers & Operations Re- search, 90:42–59. Wouter Kool, Herke van Hoof, and Max Welling. 2018. Attention, learn to solve routing problems! InInter- national Conference on Learning Representations. Chao Lei, Yanchuan Chang, Nir Lipovetzky, and Krista A. Ehinger. 2025. Planning-driven progra...

  8. [8]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Let’s verify step by step. InThe Twelfth Inter- national Conference on Learning Representations. Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, and Chen Dong, et al. 2025. Deepseek- v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556. Fei Liu, Xialiang ...

  9. [9]

    arXiv preprint arXiv:2007.08124 (2020)

    Evolution of heuristics: towards efficient auto- matic algorithm design using large language model. InInternational Conference on Machine Learning. Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. Logiqa: A challenge dataset for machine reading compre- hension with logical reasoning.arXiv preprint arXiv:2007.08124. Rafael M...

  10. [10]

    A benchmark library and a comparison of heuristic methods for the linear ordering prob- lem.Computational optimization and applications, 51(3):1297–1317. Meta AI. 2025. The llama 4 model card. Technical report, Meta. Model variant: Llama-4-Maverick. Kostis Michailidis, Dimos Tsouros, and Tias Guns

  11. [11]

    A survey on large language model benchmarks,

    CP-Bench: Evaluating large language models for constraint modelling. In28th European Confer- ence on Artificial Intelligence. Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth Intern...

  12. [12]

    In The Twelfth International Conference on Learning Representations

    Large language models as optimizers. In The Twelfth International Conference on Learning Representations. Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, and Deyi Xiong. 2025b. Probench: Bench- marking large language models in competitive pro- gramming.arXiv preprint arXiv:2502.20868. Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhi- jiang Guo, Wei S...

  13. [13]

    InThe Tenth Interna- tional Conference on Learning Representations

    minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth Interna- tional Conference on Learning Representations. Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. 2025. Monte carlo tree search for compre- hensive exploration in llm-based automatic heuristic design. InInternational Conference on Machine Learning. Ruiwen Zh...

  14. [14]

    Autologi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models

    RuleArena: A benchmark for rule-guided rea- soning with LLMs in real-world scenarios. InPro- ceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 550–572. Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, and Junyang Lin. 2025. Autologi: Automate...

  15. [15]

    the hub where routes start and end

    by applying random modifications. We spec- ify the number of tasks and resources, resource capacities, task durations, resource demands, and precedence relations. Task durations, resource de- mands, and resource capacities are scaled by in- dependent random factors drawn from predefined ranges, while preserving feasibility by ensuring that each resource c...

  16. [16]

    keep the overall effort as small as possible

    What the decision-maker chooses(e.g., selecting, ordering, grouping, assigning, matching, scheduling, placing) 2.The single optimization objective(its direction, how it is aggregated, and its meaning) 3.All feasibility constraints(no more, no fewer than described in the description) Once inferred,these elements must remain completely fixed: • Donotadd, re...

  17. [17]

    problem_description

    No additional text. Prompt 1: Prompt used to generate diverse natural-language decision-making scenarios. Problem-Consistency Verification Prompt You are a task type verifier for optimization benchmark generation. Target Problem The expected problem type is:expected_problem Problem description: “problem_description” Verification Task Determine whether the...

  18. [18]

    A", ... ,

    Understanding the problem We have: - Shop: node A (0, 71), demand 0. - Houses: B, C, D, E, F, G, H, I, J, each with given coordinates and demand. - Van capacity: 100 units. - Goal: Partition houses into routes (each starting and ending at A) so that total demand per route <= 100, and total Euclidean travel distance is minimized. - Output: JSON list of rou...

  19. [19]

    Since capacity = 100, we need at least 2 vans

    Data table From the table: | Node | x | y | demand | |------|---|---|--------| | A | 0 | 71 | 0 | | B | 62 | 6 | 12 | | C | 71 | 8 | 24 | | D | 62 | 90 | 3 | | E | 76 | 9 | 14 | | F | 87 | 59 | 19 | | G | 70 | 0 | 4 | | H | 100 | 31 | 24 | | I | 75 | 100 | 19 | | J | 68 | 98 | 2 | Total demand (excluding A): 12 + 24 + 3 + 14 + 19 + 4 + 24 + 19 + 2 = 121 u...

  20. [20]

    Euclidean distances Let's compute distances between all points (rounded to 2 decimals for reasoning, but final route choice will be based on minimizing total). From A (0,71): - A->B: sqrt((62-0)^2 + (6-71)^2) = sqrt(3844 + 4225) = sqrt(8069) ~= 89.83 - A->C: sqrt(71^2 + (8-71)^2) = sqrt(5041 + 3969) = sqrt(9010) ~= 94.92 - A->D: sqrt(62^2 + (90-71)^2) = s...

  21. [21]

    Cluster 1 (Lower-left): B(12), C(24), E(14), G(4) -> total = 54 units

    Demand grouping for capacity We want to minimize total travel, so likely group geographically. Cluster 1 (Lower-left): B(12), C(24), E(14), G(4) -> total = 54 units. Could add more if nearby. Cluster 2 (Upper): D(3), I(19), J(2) -> total = 24 units. Could add F or H if they fit geographically. Cluster 3 (Middle-right): F(19), H(24) -> total = 43 units. Bu...

  22. [22]

    We want to merge routes if savings high and capacity allows

    Clarke-Wright savings intuition Savings s(i,j) = d(A,i) + d(A,j) - d(i,j). We want to merge routes if savings high and capacity allows. Let's compute a few key savings (approx): B and C: d(A,B)~=89.83, d(A,C)~=94.92, d(B,C)=sqrt((71-62)^2+(8-6)^2)=sqrt(81+4)=sqrt(85)~=9.22 Savings ~= 89.83+94.92-9.22 ~= 175.53 -> huge because B and C are very close. B and...

  23. [23]

    Option 1: Route1: B(12), C(24), E(14), G(4) = 54, add F(19) = 73, add H(24) = 97 (within 100)

    Try feasible 2-route partition We need each route <= 100 demand. Option 1: Route1: B(12), C(24), E(14), G(4) = 54, add F(19) = 73, add H(24) = 97 (within 100). That leaves D,I,J for other route: D(3)+I(19)+J(2)=24, well under 100. Check distances: Route1: A -> B -> C -> E -> G -> F -> H -> A? Or better order to minimize distance. Route2: A -> D -> I -> J ...

  24. [24]

    -> A with B,C,E,G,F,H

    Optimize route order within each Route1: A -> ... -> A with B,C,E,G,F,H. Start from A: A to nearest among them? A to B=89.83, A to G=99.70, A to F=87.82, A to H=107.70, A to C=94.92, A to E=98.08. Actually F is closest to A in this group (87.82), but F is in middle-right, not close to B,C,E,G. Better to start with cluster B,C,E,G then go to F then H then ...

  25. [25]

    Try to improve Route1 order Maybe A -> G -> B -> C -> E -> F -> H -> A. A->G: 99.70 G->B: 10.00 B->C: 9.22 C->E: 5.10 E->F: sqrt((87-76)^2+(59-9)^2)=sqrt(121+2500)=sqrt(2621)~=51.20 F->H: 30.87 H->A: 107.70 Sum ~= 313.79 (slightly better than 314.94) Total with Route2 (161.69) ~= 475.48 Try A -> G -> B -> C -> E -> H -> F -> A: A->G: 99.70 G->B: 10.00 B->...

  26. [26]

    Check if 3 routes could be shorter If we split into 3 routes, each shorter distances but more A-trips. Example: R1: B,C,E,G = 54 A->B->C->E->G->A: A->B: 89.83 B->C: 9.22 C->E: 5.10 E->G: 10.82 G->A: 99.70 Total ~= 214.67 R2: F,H = 43 A->F->H->A: A->F: 87.82 F->H: 30.87 H->A: 107.70 Total ~= 226.39 R3: D,I,J = 24 A->D->I->J->A: Total ~= 161.69 Total ~= 214...

  27. [27]

    solution

    Final best found Route1: A -> G -> B -> C -> E -> H -> F -> A (demand 97) Route2: A -> D -> I -> J -> A (demand 24) Total distance ~= 436.96. Final JSON: { "solution": [ ["A","G","B","C","E","H","F","A"], ["A","D","I","J","A"] ] } Table 15: DeepSeek CoT trace on a Set-S CVRP instance. CoT (Set-M / CVRP example)

  28. [28]

    - Van capacity: 100 units

    Understanding the problem We have: - Shop: node A (demand 0) - Houses: B, C, D, E, F, G, H, I, J, K (10 houses total) - Each has a demand in units (given in table). - Van capacity: 100 units. - Euclidean distances between coordinates (since coordinates given). - Need to partition houses into routes starting and ending at A, total demand per route <= 100, ...

  29. [29]

    Since van capacity = 100, minimum number of vans = ceil(167/100) = 2 vans

    Data extraction Coordinates (x, y): A: (87, 24) B: (8, 13) C: (84, 43) D: (0, 52) E: (55, 100) F: (88, 43) G: (61, 39) H: (0, 98) I: (97, 31) J: (100, 6) K: (84, 0) Demands: B: 23 C: 12 D: 19 E: 19 F: 17 G: 8 H: 12 I: 18 J: 18 K: 21 Total demand = 23+12+19+19+17+8+12+18+18+21 = 167 units. Since van capacity = 100, minimum number of vans = ceil(167/100) = ...

  30. [30]

    Total 167 means if one van carries X, the other carries 167-X <= 100 -> X >= 67

    Check if 2 vans possible We need to partition 10 houses into 2 groups, each <= 100 units. Total 167 means if one van carries X, the other carries 167-X <= 100 -> X >= 67. So each van must carry between 67 and 100 units. Largest demands: B(23), D(19), E(19), F(17), I(18), J(18), K(21), C(12), G(8), H(12). Try to make two groups ~83.5 each. Group 1: B(23) +...

  31. [31]

    --- ### Van 1: B, D, E, H, C, G (93 units) Coordinates: A(87,24), B(8,13), D(0,52), E(55,100), H(0,98), C(84,43), G(61,39)

    Routing for minimal distance We need to sequence each route starting and ending at A to minimize total distance. --- ### Van 1: B, D, E, H, C, G (93 units) Coordinates: A(87,24), B(8,13), D(0,52), E(55,100), H(0,98), C(84,43), G(61,39). Try a reasonable path: A -> B -> D -> H -> E -> G -> C -> A. Compute distances roughly (Euclidean): d(A,B) = sqrt((87-8)...

  32. [32]

    Check if swapping houses between vans reduces total distance Van 1 length ~= 324.5, van 2 ~= 100.5, total ~= 425. Move C from van 1 to van 2: Van 1: B, D, E, H, G (23+19+19+12+8 = 81) Van 2: F, I, J, K, C (17+18+18+21+12 = 86) Van 2 route: A -> C -> F -> I -> J -> K -> A: d(A,C)=19.24 d(C,F)= sqrt((88-84)^2+(43-43)^2)=4 d(F,I)=15 d(I,J)=25.18 d(J,K)=17.09...

  33. [33]

    Final routes after improvement Van 1: B, D, E, H, G (81 units) Route: A -> B -> D -> H -> E -> G -> A Length ~= 311.93 Van 2: F, I, J, K, C (86 units) Route: A -> C -> F -> I -> J -> K -> A Length ~= 104.7 Total ~= 416.63 ---

  34. [34]

    solution

    Final answer in required JSON { "solution": [ ["A","B","D","H","E","G","A"], ["A","C","F","I","J","K","A"] ] } Table 16: DeepSeek CoT trace on a Set-M CVRP instance. CoT (Set-L / CVRP example) Alright, let's go step by step. Step 1 – Understanding the problem We have: - Shop node ID:`A`(coordinates 43, 54), demand 0. - Total locations: 19 (shop + 18 house...

  35. [35]

    North-East (higher x, higher y): B(76,68), O(55,100), Q(100,100), M(38,98), N(43,67), maybe F (21,90) is more north-west

  36. [36]

    East (higher x, near y=54): G(81,43), P(97,43), H(79,24), D(60,38), I(53,16)

  37. [37]

    Actually, let's just compute distances from A to see natural clusters

    West (low x): C(26,32), J(14,11), R(0,0), L(3,51), K(5,73), S(0,92), F(21,90), E(28,65), M (38,98) -- but M is north of A. Actually, let's just compute distances from A to see natural clusters. --- Step 4 – Distances from A (approx Euclidean) Let's compute roughly: - A to B: dx=33, dy=14 -> dist ~= 35.8 - A to C: dx=-17, dy=-22 -> dist ~= 27.8 - A to D: d...

  38. [38]

    A","N","E

    ["A","N","E","M","O","Q","B","A"]

  39. [39]

    A","D","G

    ["A","D","G","P","H","I","A"]

  40. [40]

    A","C","L

    ["A","C","L","K","S","F","A"]

  41. [41]

    A","J","R

    ["A","J","R","A"] --- { "solution": [ ["A", "N", "E", "M", "O", "Q", "B", "A"], ["A", "D", "G", "P", "H", "I", "A"], ["A", "C", "L", "K", "S", "F", "A"], ["A", "J", "R", "A"] ] } Table 17: DeepSeek CoT trace on a Set-L CVRP instance