
arxiv: 2606.25832 · v2 · pith:SC4RIWNLnew · submitted 2026-06-24 · 💻 cs.LG · cs.AI

MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

Pith reviewed 2026-06-26 05:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords optimization problems · reinforcement learning · language models · policy optimization · reward function · generalization · compact models · solving accuracy

The pith

MiniOpt trains 3B language models to solve diverse optimization problems accurately via reinforcement learning without expert data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiniOpt, a reinforcement learning framework that teaches compact language models to handle general optimization tasks by decomposing reasoning into problem modeling and executable solver generation. It introduces OptReward, a hierarchical scoring function that evaluates both formulation quality and solution correctness to guide learning without large supervised datasets or expert demonstrations. An optimization-oriented policy optimization strategy is added to improve exploration and stabilize training for smaller models. Experiments show the resulting 3B model generalizes across optimization types, scenarios, and domains while achieving the highest average solving accuracy among models under 10B parameters.

Core claim

MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, the MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance.
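
This page does not define how SA is computed. As a minimal sketch, assume SA is the fraction of problems whose reported objective value matches a reference value within a tolerance, and that the average is taken with each benchmark weighted equally. The function names, the tolerances, and the use of `None` for a failed run are illustrative assumptions, not the paper's evaluation code.

```python
from math import isclose
from statistics import mean


def solving_accuracy(predicted, reference, rel_tol=1e-6, abs_tol=1e-6):
    """Fraction of problems solved, under the assumed matching rule.

    `predicted` holds the objective value printed by each generated program,
    or None when the program failed or printed nothing (an assumption).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one problem is required")
    hits = sum(
        p is not None and isclose(p, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


def average_sa(per_benchmark):
    """Macro average: every benchmark counts equally, whatever its size."""
    return mean(solving_accuracy(p, r) for p, r in per_benchmark.values())


if __name__ == "__main__":
    toy = {
        "bench_a": ([1.0], [1.0]),
        "bench_b": ([None, 2.0], [5.0, 2.0]),
    }
    print(average_sa(toy))
```

A macro average keeps a large benchmark from dominating the headline number; a micro average over all problems would be an equally plausible reading of "average SA".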

What carries the argument

The reasoning-to-model-and-solve paradigm together with OptReward, a hierarchical reward function that jointly scores formulation and solution, plus an optimization-oriented policy optimization strategy for efficient exploration.
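
The page does not spell out OptReward's levels or weights. The sketch below only illustrates what a hierarchical reward that scores both formulation and solution could look like: each level is reached only if the previous one passed. The `Rollout` fields, the gating order, and every weight are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Checks for one sampled response (all fields are hypothetical)."""
    well_formatted: bool       # response follows the required output format
    code_ran: bool             # generated solver code executed without error
    formulation_score: float   # agreement of the model with a reference, in [0, 1]
    objective_correct: bool    # printed objective matches the reference value


def hierarchical_reward(r, w_format=0.1, w_run=0.2, w_formulation=0.3, w_solution=0.4):
    """Gated reward: later levels only count when earlier levels pass."""
    if not r.well_formatted:
        return 0.0
    reward = w_format
    if not r.code_ran:
        return reward
    reward += w_run
    # Clamp so a malformed score cannot push the reward out of range.
    reward += w_formulation * min(max(r.formulation_score, 0.0), 1.0)
    if r.objective_correct:
        reward += w_solution
    return reward


if __name__ == "__main__":
    print(hierarchical_reward(Rollout(True, True, 0.5, False)))
```

Gating gives a partially correct response a non-zero signal, which is one way a reward can stay informative for a small model that rarely reaches a fully correct solution early in training.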

If this is right

  • Compact models under 10B parameters reach the highest average solving accuracy on optimization tasks compared with other approaches.
  • Optimization problems can be addressed without large-scale supervised datasets, costly annotations, or intermediate step verification.
  • The hierarchical reward enables stable reinforcement learning and efficient exploration specifically for smaller models.
  • Generalization holds across optimization types, problem scenarios, and task domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward design could be adapted to train small models on other structured reasoning tasks such as planning or scheduling.
  • Deployment on edge devices becomes feasible once optimization capability fits inside a 3B model.
  • Hybrid systems might combine the generated solvers with traditional optimization libraries for further gains.

Load-bearing premise

The OptReward hierarchical scoring function can jointly evaluate formulation and solution quality to drive effective policy learning without any expert demonstrations.

What would settle it

A test showing that a 3B model trained with MiniOpt achieves no higher solving accuracy than baselines on a held-out class of optimization problems never encountered during training.

Figures

Figures reproduced from arXiv: 2606.25832 by Bingdong Li, Hong Qian, Jun Zhou, Ke Tang, Ke Zhao, Qitao Shi, Xiangfeng Wang, Xiang Shu, Xingyu Lu, Yang Yu, Yaolin Wen, Zixiang Di.

Figure 1: An overview of the proposed MiniOpt training paradigm. Sub-figure (a) demonstrates the reasoning-to-model-and-solve ...
Figure 1: An informative and easily verifiable reward function ...
Figure 3: Comparison of training dynamics of reward values and ...
Figure 2: Comparison of average SA against model parameter ...
Figure 4: Comparison of training dynamics of reward function ...
Figure 5: Histogram showing the distribution of optimization types across 8 benchmarks. We categorize the problems in the ...
Figure 6: Histogram showing the distribution of optimization problem scenarios across 8 benchmarks. We categorize the ...
Figure 7: Performance variations of the MiniOpt on benchmarks ...
Figure 8: Proportion of every scenario of instances in OptMATH-Train (201K).
Figure 9: Proportion of every problem type of instances in OptMATH-Train (201K).
Original abstract

Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns to solve optimization problems through a "reasoning-to-model-and-solve" paradigm. MiniOpt decomposes optimization reasoning into structured optimization modeling and executable solver generation. Building upon this paradigm, we introduce OptReward, a reward function with a hierarchical score structure that jointly evaluates formulation and solution, enabling effective policy learning without expert demonstrations. We further develop an optimization-oriented policy optimization strategy that improves exploration efficiency and stabilizes reinforcement learning for compact models. Extensive experiments show that MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, the MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance. These results suggest that optimization-oriented reward design and reinforcement learning provide an effective pathway for developing compact optimization-specialized language models with strong optimization generalization capabilities. The code is available at https://github.com/Hsiang-1/MiniOpt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes MiniOpt, a reinforcement learning framework for optimization problems that decomposes reasoning into structured modeling and executable solver generation. It introduces OptReward, a hierarchical scoring function that jointly evaluates formulation and solution to enable policy learning without expert demonstrations, along with an optimization-oriented policy optimization strategy for improved exploration in compact models. The central claim is that MiniOpt-3B exhibits strong optimization generalization across problem types, scenarios, and domains, achieving the highest average solving accuracy (SA) among models with fewer than 10B parameters while remaining competitive for larger models.

Significance. If the experimental results hold, the work would demonstrate an effective pathway for resource-efficient, optimization-specialized LLMs via reward design and RL rather than large supervised datasets. The public release of code at https://github.com/Hsiang-1/MiniOpt is a positive contribution to reproducibility.

major comments (2)
  1. [Abstract] Abstract: the claim that MiniOpt-3B achieves the highest average SA among sub-10B models across diverse optimization types is presented without any description of datasets, baselines, statistical tests, number of runs, or controls for confounding factors, so the data-to-claim link cannot be evaluated.
  2. [OptReward] OptReward section: the hierarchical reward is load-bearing for the no-expert-demonstrations claim, yet the manuscript provides no analysis of whether its components were tuned post-hoc to the reported outcomes or whether any 'predictions' reduce to fitted quantities, raising a circularity risk for the generalization results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that MiniOpt-3B achieves the highest average SA among sub-10B models across diverse optimization types is presented without any description of datasets, baselines, statistical tests, number of runs, or controls for confounding factors, so the data-to-claim link cannot be evaluated.

    Authors: The abstract is intentionally concise. Full details on datasets (diverse optimization benchmarks spanning types, scenarios, and domains), baselines (comparable LLMs under 10B parameters), evaluation protocol, and controls appear in Sections 4 and 5. To strengthen the data-to-claim link within the abstract's length constraints, we will add a brief clause such as 'evaluated on standard optimization benchmarks against peer sub-10B models.'

    Revision: yes

  2. Referee: [OptReward] OptReward section: the hierarchical reward is load-bearing for the no-expert-demonstrations claim, yet the manuscript provides no analysis of whether its components were tuned post-hoc to the reported outcomes or whether any 'predictions' reduce to fitted quantities, raising a circularity risk for the generalization results.

    Authors: OptReward components were fixed a priori using standard optimization metrics for formulation correctness and solution feasibility; no post-hoc tuning to experimental outcomes occurred. To directly address the circularity concern, we will expand the OptReward section with a short paragraph documenting the design rationale and confirming independence from reported results.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and description present an empirical RL framework (reasoning-to-model-and-solve paradigm, OptReward hierarchical scoring, optimization-oriented policy optimization) whose performance claims rest on experimental results across optimization problems. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are exhibited in the provided text that would reduce the central claims to inputs by construction. OptReward is described as an enabling design choice for policy learning without expert data, not a quantity whose components are shown to be post-hoc tuned or mathematically forced. This qualifies as a normal self-contained empirical paper with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or newly postulated entities; the OptReward and policy optimization components are introduced but not decomposed into their constituent assumptions or fitted values.

pith-pipeline@v0.9.1-grok · 5802 in / 1086 out tokens · 24164 ms · 2026-06-26T05:16:40.700060+00:00 · methodology


  55. [55]

    w” denotes “with

    under a limited training budget. The objective is to construct a training set that simultaneously covers diverse JOURNAL OF LATEX CLASS FILES 15 optimization types and scenarios, while preserving the scenario proportions observed in the real distribution. Starting from the OptMATH-Train pool containing 201K problems, we label each instance with types and ...

  56. [56]

    Don’t give any explanation, just provide the converted pyomo code in the following format: ‘‘‘python [pyomo code here] ‘‘‘

  57. [57]

    Other solvers should not be utilized

    Please note that the following solvers are available for use: ’glpk’, ’cbc’, ’ipopt’, ’scip’. Other solvers should not be utilized

  58. [58]

    Please add ‘from pyomo.environ import *‘ at the beginning of your code

  59. [59]

    Please print the optimal objective value at the end of the code. **Gurobipy code:** {gurobipy} This section provides the system prompt to guide MiniOpt models in autonomously selecting solvers after modeling optimization problems. PROMPT FOR SOLVER SELECTION **Solver Selection Guide:** - ‘‘glpk‘‘: Best for small-to-medium...

  60. [63]

    SYSTEM PROMPT FOR RL TRAINING You are a helpful assistant

    Nonlinearity presence (use ipopt/scip) This section provides the system prompt used by MiniOpt models during reinforcement learning (RL) training. SYSTEM PROMPT FOR RL TRAINING You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclo...

  61. [64]

    Detailed reasoning about the problem within <think> </think> tags

  62. [65]

    Write the corresponding five-element model (derived from your analysis)

  63. [66]

    Determine the mathematical properties of problem and select an appropriate solver from ’glpk’, ’cbc’, ’ipopt’, ’scip’

  64. [67]

    - Verify the five-element model fully captures the problem’s requirements

    Recheck and correct if necessary at the end of the <think> </think> section. - Verify the five-element model fully captures the problem’s requirements. - Confirm no constraints/variables are missing or over-simplified. - Ensure the solver choice aligns with the problem’s mathematical properties

  65. [68]

    Provide the corresponding Pyomo code based on checked five-element model within <answer> </answer> tags. In mathematics, optimization problem can be modeled as the following expression $\\min_{{\\boldsymbol{{x}} \\in \\mathcal{{X}}}} f(\\boldsymbol{{x}}), {{\\rm s.t.}} G(\\boldsymbol{{x}}) \\leq \\boldsymbol{{c}}$, where $\\boldsymbol{{x}} = (x_1, x_2...

  66. [69]

    Variable types (continuous vs integer/binary)

  67. [70]

    Linearity of objective/constraints

  68. [71]

    Problem scale (small: glpk/cbc, large: scip/ipopt)

  69. [72]

    ‘‘‘python\n(.*?)‘‘‘

    Nonlinearity presence (use ipopt/scip) Please select an appropriate solver based on the type and quantity of variables, objectives, and constraints. After thinking, when you finally reach the five-element model, you should give the corresponding Pyomo code within the <answer> </answer> tags, i.e., <answer> ‘‘‘python\n code here‘‘‘ </answer>. The user wi...

  70. [73]

    Linear Programming (LP): Problems with linear objective function and linear constraints, all continuous variables

  71. [74]

    Integer Programming (IP): Problems with linear or nonlinear components where ALL variables are discrete/integer

  72. [75]

    Mixed Integer Linear Programming (MILP): Problems with linear components containing BOTH continuous and discrete variables

  73. [76]

    Nonlinear Programming (NLP): Problems with nonlinear objective function and/or nonlinear constraints (variables may be continuous/discrete)

  74. [77]

    Combinatorial Optimization (CO): Problems focused on selecting/discrete structures (graphs, permutations, sets) with typically binary variables

  75. [78]

    Multi-objective Programming (MOP): Problems explicitly optimizing multiple conflicting objectives simultaneously

  76. [79]

    Second-Order Cone Programming (SOCP): Problems with a linear objective function, linear constraints, and second-order cone constraints (e.g., \(\|Ax + b\| \leq c^T x + d\)) # Problem: {{Question}} # Output Analyze the mathematical structure step by step and classify its type. Finally, output the type abbreviation in the following format: Type: Abbreviati...

  77. [80]

    Supply Chain: Decisions about inventory management, distribution network, warehousing operations

  78. [81]

    Finance: Decisions about portfolio management, investments, risk management, financial planning

  79. [82]

    Manufacturing: Decisions about production processes, quality control, factory operations

  80. [83]

    Transportation: Decisions about routing, vehicle scheduling, fleet management, traffic flow, carrier selection
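
The prompt fragments quoted in the entries above mention a code-extraction pattern (a fenced python block inside <answer> </answer> tags) and solver-selection criteria (variable types, linearity, problem scale, nonlinearity). A minimal Python sketch of how such a step might look follows. The function names, the boolean inputs, and the exact mapping from criteria to a single solver are assumptions made for illustration; the quoted text only lists the criteria and the four permitted solvers.

```python
import re
from typing import Optional

# Build the three-backtick fence programmatically so this sketch
# does not contain a literal fence of its own.
FENCE = "`" * 3

# Non-greedy match of a fenced python block, mirroring the pattern quoted above.
CODE_PATTERN = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_pyomo_code(response: str) -> Optional[str]:
    """Return the code found inside <answer>...</answer>, or None if absent."""
    answer = ANSWER_PATTERN.search(response)
    if answer is None:
        return None
    code = CODE_PATTERN.search(answer.group(1))
    return code.group(1).strip() if code else None


def choose_solver(has_integer_vars: bool, is_nonlinear: bool, is_large: bool) -> str:
    """Hypothetical mapping from the listed criteria to one permitted solver."""
    if is_nonlinear:
        # Nonlinearity present: the quoted guide points to ipopt/scip.
        return "scip" if has_integer_vars else "ipopt"
    if is_large:
        # Large linear problems: the guide lists scip/ipopt; scip is assumed here.
        return "scip"
    # Small linear problems: the guide lists glpk/cbc.
    return "cbc" if has_integer_vars else "glpk"
```

A caller would typically pass the extracted string to a sandboxed Python process and hand the chosen name to Pyomo's `SolverFactory`; both steps are left out here because the source does not describe them.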
