MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

Bingdong Li; Hong Qian; Jun Zhou; Ke Tang; Ke Zhao; Qitao Shi; Xiangfeng Wang; Xiang Shu; Xingyu Lu; Yang Yu

arxiv: 2606.25832 · v2 · pith:SC4RIWNLnew · submitted 2026-06-24 · 💻 cs.LG · cs.AI

MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

Ke Zhao , Zixiang Di , Hong Qian , Xiang Shu , Yaolin Wen , Qitao Shi , Bingdong Li , Xingyu Lu

show 4 more authors

Xiangfeng Wang Jun Zhou Ke Tang Yang Yu

This is my paper

Pith reviewed 2026-06-26 05:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords optimization problemsreinforcement learninglanguage modelspolicy optimizationreward functiongeneralizationcompact modelssolving accuracy

0 comments

The pith

MiniOpt trains 3B language models to solve diverse optimization problems accurately via reinforcement learning without expert data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiniOpt, a reinforcement learning framework that teaches compact language models to handle general optimization tasks by decomposing reasoning into problem modeling and executable solver generation. It introduces OptReward, a hierarchical scoring function that evaluates both formulation quality and solution correctness to guide learning without large supervised datasets or expert demonstrations. An optimization-oriented policy optimization strategy is added to improve exploration and stabilize training for smaller models. Experiments show the resulting 3B model generalizes across optimization types, scenarios, and domains while achieving the highest average solving accuracy among models under 10B parameters.

Core claim

MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance.

What carries the argument

The reasoning-to-model-and-solve paradigm together with OptReward, a hierarchical reward function that jointly scores formulation and solution, plus an optimization-oriented policy optimization strategy for efficient exploration.

If this is right

Compact models under 10B parameters reach the highest average solving accuracy on optimization tasks compared with other approaches.
Optimization problems can be addressed without large-scale supervised datasets, costly annotations, or intermediate step verification.
The hierarchical reward enables stable reinforcement learning and efficient exploration specifically for smaller models.
Generalization holds across optimization types, problem scenarios, and task domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward design could be adapted to train small models on other structured reasoning tasks such as planning or scheduling.
Deployment on edge devices becomes feasible once optimization capability fits inside a 3B model.
Hybrid systems might combine the generated solvers with traditional optimization libraries for further gains.

Load-bearing premise

The OptReward hierarchical scoring function can jointly evaluate formulation and solution quality to drive effective policy learning without any expert demonstrations.

What would settle it

A test showing that a 3B model trained with MiniOpt achieves no higher solving accuracy than baselines on a held-out class of optimization problems never encountered during training.

Figures

Figures reproduced from arXiv: 2606.25832 by Bingdong Li, Hong Qian, Jun Zhou, Ke Tang, Ke Zhao, Qitao Shi, Xiangfeng Wang, Xiang Shu, Xingyu Lu, Yang Yu, Yaolin Wen, Zixiang Di.

**Figure 1.** Figure 1: An overview of the proposed MiniOpt training paradigm. Sub-figure (a) demonstrates the reasoning-to-model-and-solve [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 1.** Figure 1: An informative and easily verifiable reward function [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 3.** Figure 3: Comparison of training dynamics of reward values and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 2.** Figure 2: Comparison of average SA against model parameter [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 4.** Figure 4: Comparison of training dynamics of reward function [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Histogram showing the distribution of optimization types across 8 benchmarks. We categorize the problems in the [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Histogram showing the distribution of optimization problem scenarios across 8 benchmarks. We categorize the [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Performance variations of the MiniOpt on benchmarks [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Proportion of every scenario of instances in OptMATH-Train (201K). [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Proportion of every problem type of instances in OptMATH-Train (201K). [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

read the original abstract

Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns to solve optimization problems through an "reasoning-to-model-and-solve" paradigm. MiniOpt decomposes optimization reasoning into structured optimization modeling and executable solver generation. Building upon this paradigm, we introduce OptReward, a reward function with hierarchical score structure that jointly evaluates formulation and solution, enabling effective policy learning without expert demonstrations. We further develop an optimization-oriented policy optimization strategy that improves exploration efficiency and stabilizes reinforcement learning for compact models. Extensive experiments show that MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance. These results suggest that optimization-oriented reward design and reinforcement learning provide an effective pathway for developing compact optimization-specialized language models with strong optimization generalization capabilities. The code is available at https://github.com/Hsiang-1/MiniOpt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MiniOpt uses RL with a hierarchical OptReward on a modeling-plus-solver decomposition to train a 3B model that claims top average solving accuracy under 10B parameters.

read the letter

MiniOpt's main result is that a 3B model trained this way beats other sub-10B models on average solving accuracy across optimization types.

The decomposition into structured modeling followed by executable solver generation is a clean split. OptReward scores both stages hierarchically so the policy can improve without expert traces, and the optimization-oriented policy update is meant to stabilize training on small models. These choices directly target the data and compute problems that usually hit LLM optimization work. Releasing the code is a concrete plus.

The experimental claims are the soft spot. The abstract states strong generalization and the highest SA for models under 10B but gives no dataset sizes, baseline list, variance numbers, or controls for problem generation. That leaves open whether the reward components were adjusted after seeing results or whether the test distribution favors the method. If the full paper has ablations and clear comparisons, the concern is minor; otherwise it sits at the center of the evaluation.

The paper is for researchers building compact LLMs for operations research or automated planning. Anyone already looking at RL methods that avoid large supervised sets will find the framework and reward design worth examining.

Send it to peer review. The core setup is coherent, the code is public, and the resource claims are specific enough that referees can check the numbers directly.

Referee Report

2 major / 0 minor

Summary. The paper proposes MiniOpt, a reinforcement learning framework for optimization problems that decomposes reasoning into structured modeling and executable solver generation. It introduces OptReward, a hierarchical scoring function that jointly evaluates formulation and solution to enable policy learning without expert demonstrations, along with an optimization-oriented policy optimization strategy for improved exploration in compact models. The central claim is that MiniOpt-3B exhibits strong optimization generalization across problem types, scenarios, and domains, achieving the highest average solving accuracy (SA) among models with fewer than 10B parameters while remaining competitive for larger models.

Significance. If the experimental results hold, the work would demonstrate an effective pathway for resource-efficient, optimization-specialized LLMs via reward design and RL rather than large supervised datasets. The public release of code at https://github.com/Hsiang-1/MiniOpt is a positive contribution to reproducibility.

major comments (2)

[Abstract] Abstract: the claim that MiniOpt-3B achieves the highest average SA among sub-10B models across diverse optimization types is presented without any description of datasets, baselines, statistical tests, number of runs, or controls for confounding factors, so the data-to-claim link cannot be evaluated.
[OptReward] OptReward section: the hierarchical reward is load-bearing for the no-expert-demonstrations claim, yet the manuscript provides no analysis of whether its components were tuned post-hoc to the reported outcomes or whether any 'predictions' reduce to fitted quantities, raising a circularity risk for the generalization results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that MiniOpt-3B achieves the highest average SA among sub-10B models across diverse optimization types is presented without any description of datasets, baselines, statistical tests, number of runs, or controls for confounding factors, so the data-to-claim link cannot be evaluated.

Authors: The abstract is intentionally concise. Full details on datasets (diverse optimization benchmarks spanning types, scenarios, and domains), baselines (comparable LLMs under 10B parameters), evaluation protocol, and controls appear in Sections 4 and 5. To strengthen the data-to-claim link within the abstract's length constraints, we will add a brief clause such as 'evaluated on standard optimization benchmarks against peer sub-10B models.' revision: yes
Referee: [OptReward] OptReward section: the hierarchical reward is load-bearing for the no-expert-demonstrations claim, yet the manuscript provides no analysis of whether its components were tuned post-hoc to the reported outcomes or whether any 'predictions' reduce to fitted quantities, raising a circularity risk for the generalization results.

Authors: OptReward components were fixed a priori using standard optimization metrics for formulation correctness and solution feasibility; no post-hoc tuning to experimental outcomes occurred. To directly address the circularity concern, we will expand the OptReward section with a short paragraph documenting the design rationale and confirming independence from reported results. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and description present an empirical RL framework (reasoning-to-model-and-solve paradigm, OptReward hierarchical scoring, optimization-oriented policy optimization) whose performance claims rest on experimental results across optimization problems. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are exhibited in the provided text that would reduce the central claims to inputs by construction. OptReward is described as an enabling design choice for policy learning without expert data, not a quantity whose components are shown to be post-hoc tuned or mathematically forced. This qualifies as a normal self-contained empirical paper with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or newly postulated entities; the OptReward and policy optimization components are introduced but not decomposed into their constituent assumptions or fitted values.

pith-pipeline@v0.9.1-grok · 5802 in / 1086 out tokens · 24164 ms · 2026-06-26T05:16:40.700060+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

98 extracted references · 2 canonical work pages

[1]

Flexible job-shop scheduling via graph neural network and deep reinforcement learning,

W. Song, X. Chen, Q. Li, and Z. Cao, “Flexible job-shop scheduling via graph neural network and deep reinforcement learning,” IEEE Transactions on Industrial Informatics, vol. 19, no. 2, pp. 1600–1610, 2023

2023
[2]

Ars: Automatic routing solver with large language models,

K. Li, F. Liu, Z. Wang, X. Tong, X. Han, M. Yuan, and Q. Zhang, “Ars: Automatic routing solver with large language models,” CoRR, vol. abs/2502.15359, 2025

arXiv 2025
[3]

CAFA: Coding as auto-formulation can boost large language models in solving linear programming problem,

H. Deng, B. Zheng, Y . Jiang, and T. H. Tran, “CAFA: Coding as auto-formulation can boost large language models in solving linear programming problem,” in The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024. [Online]. Available: https: //openreview.net/forum?id=xC2xtBLmri

2024
[4]

AutoSAT: Automatically optimize SAT solvers via large language models,

Y . Sun, F. Ye, X. Zhang, S. Huang, B. Zhang, K. Wei, and S. Cai, “AutoSAT: Automatically optimize SAT solvers via large language models,” CoRR, vol. abs/2402.10705, 2024

arXiv 2024
[5]

AutoPBO: LLM-powered optimization for local search PBO solvers,

J. Li, Y . Chu, Y . Sun, M. Zou, and S. Cai, “AutoPBO: LLM-powered optimization for local search PBO solvers,” CoRR, vol. abs/2509.04007, 2025

arXiv 2025
[6]

LLMOPT: learning to define and solve general optimization problems from scratch,

C. Jiang, X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, and Y . Yu, “LLMOPT: learning to define and solve general optimization problems from scratch,” in Advances in The Thirteenth International Conference on Learning Representations, Singapore, 2025

2025
[7]

BPP-search: Enhancing tree of thought reasoning for mathematical modeling problem solving,

T. Wang, W. Y . Yu, Z. He, Z. Liu, H. HaileiGong, H. Wu, X. Han, W. Shi, R. She, F. Zhu, and T. Zhong, “BPP-search: Enhancing tree of thought reasoning for mathematical modeling problem solving,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pi...

2025
[8]

Text2Zinc: A cross- domain dataset for modeling optimization and satisfaction problems in MiniZinc,

A. Singirikonda, S. Kadioglu, and K. Uppuluri, “Text2Zinc: A cross- domain dataset for modeling optimization and satisfaction problems in MiniZinc,” CoRR, vol. abs/2503.10642, 2025

arXiv 2025
[9]

Training LLMs for optimization modeling via iterative data synthesis and structured validation,

Y . Wu, Y . Zhang, Y . Wu, Y . Wang, J. Zhang, and J. Cheng, “Training LLMs for optimization modeling via iterative data synthesis and structured validation,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics,...

2025
[10]

OptMATH: A scalable bidirectional data synthesis framework for optimization mod- eling,

H. Lu, Z. Xie, Y . Wu, C. Ren, Y . Chen, and Z. Wen, “OptMATH: A scalable bidirectional data synthesis framework for optimization mod- eling,” in Forty-second International Conference on Machine Learning, 2025

2025
[11]

A survey of optimization modeling meets LLMs: Progress and future directions,

Z. Xiao, J. Xie, L. Xu, S. Guan, J. Zhu, X. Han, X. Fu, W. Yu, H. Wu, W. Shi, Q. Kang, J. Duan, T. Zhong, M. Yuan, J. Zeng, Y . Wang, G. Chen, and D. Zhang, “A survey of optimization modeling meets LLMs: Progress and future directions,” in Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, J. Kwok, Ed. In...

2025
[12]

An agent-based framework for the automatic validation of mathematical optimization models,

A. Zadorojniy, S. Wasserkrug, and E. Farchi, “An agent-based framework for the automatic validation of mathematical optimization models,” 2025. [Online]. Available: https://arxiv.org/abs/2511.16383

Pith/arXiv arXiv 2025
[13]

Equivamap: Leveraging LLMs for automatic equivalence checking of optimization formulations,

H. Zhai, C. Lawless, E. Vitercik, and L. Leqi, “Equivamap: Leveraging LLMs for automatic equivalence checking of optimization formulations,” in 2nd AI for Math Workshop @ ICML 2025, 2025. [Online]. Available: https://openreview.net/forum?id=RvdjzNlksm

2025
[14]

OptiMUS: Scalable opti- mization modeling with (MI)LP solvers and large language models,

A. AhmadiTeshnizi, W. Gao, and M. Udell, “OptiMUS: Scalable opti- mization modeling with (MI)LP solvers and large language models,” in Advances in Forty-first International Conference on Machine Learning, Vienna, Austria, 2024

2024
[15]

OPT-BENCH: evaluating LLM agent on large-scale search spaces optimization problems,

X. Li, J. Chen, X. Fang, S. Ding, H. Duan, Q. Liu, and K. Chen, “OPT-BENCH: evaluating LLM agent on large-scale search spaces optimization problems,” CoRR, vol. abs/2506.10764, 2025

arXiv 2025
[16]

LLMs for mathe- matical modeling: Towards bridging the gap between natural and math- ematical languages,

X. Huang, Q. Shen, Y . Hu, A. Gao, and B. Wang, “LLMs for mathe- matical modeling: Towards bridging the gap between natural and math- ematical languages,” in Findings of the Association for Computational Linguistics 2025, Albuquerque, New Mexico, 2025, pp. 2678–2710

2025
[17]

Benchmarking LLMs for optimization modeling and enhancing reasoning via reverse socratic synthesis,

Z. Yang, Y . Huang, W. Shi, L. Feng, L. Song, Y . Wang, X. Liang, and J. Tang, “Benchmarking LLMs for optimization modeling and enhancing reasoning via reverse socratic synthesis,” CoRR, vol. abs/2407.09887, 2024

arXiv 2024
[18]

OptiBench meets ReSocratic: Measure and improve LLMs for optimization modeling,

Z. Yang, Y . Wang, Y . Huang, Z. Guo, W. Shi, X. Han, L. Feng, L. Song, X. Liang, and J. Tang, “OptiBench meets ReSocratic: Measure and improve LLMs for optimization modeling,” in The Thirteenth International Conference on Learning Representations, Singapore, 2025

2025
[19]

Chain-of-Experts: When LLMs meet complex operations research problems,

Z. Xiao, D. Zhang, Y . Wu, L. Xu, Y . J. Wang, X. Han, X. Fu, T. Zhong, J. Zeng, M. Song, and G. Chen, “Chain-of-Experts: When LLMs meet complex operations research problems,” in The Twelfth International Conference on Learning Representations, Vienna, Austria, 2024

2024
[20]

Optitree: Hierarchical thoughts generation with tree search for LLM optimization modeling,

H. Liu, J. Wang, Y . Cai, X. Han, Y . Kuang, and J. HAO, “Optitree: Hierarchical thoughts generation with tree search for LLM optimization modeling,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=Ej20yjWMCj

2025
[21]

Llm for large-scale optimization model auto-formulation: A lightweight few-shot learning approach,

K. Liang, Y . Lu, J. Mao, S. Sun, C. Yang, C. Zeng, X. Jin, H. Qin, R. Zhu, and C.-P. Teo, “Llm for large-scale optimization model auto-formulation: A lightweight few-shot learning approach,” 2025. [Online]. Available: https://dx.doi.org/10.2139/ssrn.5329027

work page doi:10.2139/ssrn.5329027 2025
[22]

LLaMoCo: Instruction tuning of large language models for optimization code generation,

Z. Ma, H. Guo, J. Chen, G. Peng, Z. Cao, Y . Ma, and Y . Gong, “LLaMoCo: Instruction tuning of large language models for optimization code generation,” CoRR, vol. abs/2403.01131, 2024

arXiv 2024
[23]

Ner4Opt: Named entity recogni- tion for optimization modelling from natural language,

P. P. Dakle, S. Kadioglu, K. Uppuluri, R. Politi, P. Raghavan, S. Ral- labandi, and R. Srinivasamurthy, “Ner4Opt: Named entity recogni- tion for optimization modelling from natural language,” in Integration of Constraint Programming, Artificial Intelligence, and Operations Research - 20th International Conference, vol. 13884, Nice, France, 2023, pp. 299–319

2023
[24]

Steporlm: A self-evolving framework with generative process supervision for operations research language models,

C. Zhou, T. Xu, J. Lin, and D. Ge, “Steporlm: A self-evolving framework with generative process supervision for operations research language models,” 2025. [Online]. Available: https://arxiv.org/abs/2509.22558

arXiv 2025
[25]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems 3...

2022
[26]

T ¨ULU 3: Pushing fron- tiers in open language model post-training,

N. Lambert, J. Morrison, V . Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V . Miranda, A. Liu, N. Dziri, S. Lyu, Y . Gu, S. Malik, V . Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y . Wang, P. Dasigi, and H. Hajishirzi, “T ¨ULU 3: Pushing fron- tiers in open language model post-training,” CoRR, vol. abs/2411...

Pith/arXiv arXiv 2024
[27]

Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui, “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), Bangkok, Thailand, 2024, pp. 9426–9439

2024
[28]

Logic-RL: Unleashing LLM reasoning with rule- based reinforcement learning,

T. Xie, Z. Gao, Q. Ren, H. Luo, Y . Hong, B. Dai, J. Zhou, K. Qiu, Z. Wu, and C. Luo, “Logic-RL: Unleashing LLM reasoning with rule- based reinforcement learning,” CoRR, vol. abs/2502.14768, 2025

Pith/arXiv arXiv 2025
[29]

On designing effective RL reward at training time for LLM reasoning,

J. Gao, S. Xu, W. Ye, W. Liu, C. He, W. Fu, Z. Mei, G. Wang, and Y . Wu, “On designing effective RL reward at training time for LLM reasoning,” CoRR, vol. abs/2410.15115, 2024

arXiv 2024
[30]

CodeDPO: Aligning code models with self generated and verified source code,

K. Zhang, G. Li, Y . Dong, J. Xu, J. Zhang, J. Su, Y . Liu, and Z. Jin, “CodeDPO: Aligning code models with self generated and verified source code,” in Advances in the 63rd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), Vienna, Austria, 2025, pp. 15 854–15 871

2025
[31]

DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search,

H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, H. Zhang, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan, “DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search,” in Advances in the Thirteenth International Conference on Learning Representatio...

2025
[32]

Training software engineering agents and verifiers with SWE-gym,

J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y . Zhang, “Training software engineering agents and verifiers with SWE-gym,” in ICLR 2025 Third Workshop on Deep Learning for Code, 2025. JOURNAL OF LATEX CLASS FILES 11

2025
[33]

Solver-Informed RL: Grounding large language models for authentic optimization modeling,

Y . Chen, J. Xia, S. Shao, D. Ge, and Y . Ye, “Solver-Informed RL: Grounding large language models for authentic optimization modeling,” CoRR, vol. abs/2505.11792, 2025

arXiv 2025
[34]

Or-r1: Automating modeling and solving of operations research optimization problem via test-time reinforcement learning,

Z. Ding, Z. Tan, J. Zhang, and T. Chen, “Or-r1: Automating modeling and solving of operations research optimization problem via test-time reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2511.09092

arXiv 2025
[35]

A survey on llm mid-training,

C. Tu, X. Zhang, R. Weng, R. Li, C. Zhang, Y . Bai, H. Yan, J. Wang, and X. Cai, “A survey on llm mid-training,” 2025. [Online]. Available: https://arxiv.org/abs/2510.23081

arXiv 2025
[36]

Qwen2.5-Coder technical report,

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men, F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin, “Qwen2.5-Coder technical report,” CoRR, vol. abs/2409.12186, 2024

Pith/arXiv arXiv 2024
[37]

NL4Opt competition: Formulating optimization problems based on their natural language descriptions,

R. Ramamonjison, T. Yu, R. Li, H. Li, G. Carenini, B. Ghaddar, S. He, M. Mostajabdaveh, A. Banitalebi-Dehkordi, Z. Zhou, and Y . Zhang, “NL4Opt competition: Formulating optimization problems based on their natural language descriptions,” in Proceedings of the NeurIPS 2022 Competitions Track, vol. 220, 2022, pp. 189–203

2022
[38]

ICML 2024 Challenges on Automated Math Rea- soning - Track 3: Automated Optimization Problem-Solving with Code,

Ai4mathICML2024, “ICML 2024 Challenges on Automated Math Rea- soning - Track 3: Automated Optimization Problem-Solving with Code,” https://www.codabench.org/competitions/2438, 2024

2024
[39]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” CoRR, vol. abs/2402.03300, 2024

Pith/arXiv arXiv 2024
[40]

ORLM: A customizable framework in training large models for automated optimization modeling,

C. Huang, Z. Tang, S. Hu, R. Jiang, X. Zheng, D. Ge, B. Wang, and Z. Wang, “ORLM: A customizable framework in training large models for automated optimization modeling,” Operations Research, May 2025

2025
[41]

Densing law of llms,

C. Xiao, J. Cai, W. Zhao, B. Lin, G. Zeng, J. Zhou, Z. Zheng, X. Han, Z. Liu, and M. Sun, “Densing law of llms,” Nature Machine Intelligence, vol. 7, no. 11, pp. 1823–1833, Nov 2025. [Online]. Available: https://doi.org/10.1038/s42256-025-01137-0

work page doi:10.1038/s42256-025-01137-0 2025
[42]

Qwen2.5 technical report,

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...

Pith/arXiv arXiv 2024
[43]

DeepSeek-V3 technical report,

DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J...

Pith/arXiv arXiv 2024
[44]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Ta...

Pith/arXiv arXiv 2025
[45]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

Pith/arXiv arXiv 2025
[46]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. S. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, L. Marris, S. Petulla, C. Gaffney, A. Aharoni, N. Lintz, T. C. Pais, H. Jacobsson, I. Szpektor, N. Jiang, K. Haridasan, A. Omran, N. Saunshi, D. Bahri, G. Mishra, E. Chu, T. Boyd, B. Hekman, A. Parisi, C. Zhang, K. Kawintiranon, T. Bed...

Pith/arXiv arXiv 2025
[47]

GPT-5 system card,

OpenAI, “GPT-5 system card,” https://cdn.openai.com/ gpt-5-system-card.pdf, 2025

2025
[48]

Reflex- ion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflex- ion: language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, New Orleans, LA, 2023

2023
[49]

Training verifiers to solve math word problems,

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schul- man, “Training verifiers to solve math word problems,” CoRR, vol. abs/2110.14168, 2021

Pith/arXiv arXiv 2021
[50]

Measuring mathematical problem solv- ing with the MATH dataset,

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solv- ing with the MATH dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

2021
[51]

Measuring massive multitask language understanding,

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in 9th International Conference on Learning Representations, 2021

2021
[52]

RACE: large-scale ReAding comprehension dataset from examinations,

G. Lai, Q. Xie, H. Liu, Y . Yang, and E. H. Hovy, “RACE: large-scale ReAding comprehension dataset from examinations,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel, Eds., 2017

2017
[53]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017

2017
[54]

Evaluating large language models trained on code,

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

Pith/arXiv arXiv 2021
[55]

w” denotes “with

under a limited training budget. The objective is to construct a training set that simultaneously covers diverse JOURNAL OF LATEX CLASS FILES 15 optimization types and scenarios, while preserving the scenario proportions observed in the real distribution. Starting from the OptMATH-Train pool containing 201K problems, we label each instance with types and ...

arXiv
[56]

Don’t give any explanation, just provide the converted pyomo code in the following format: ‘‘‘python [pyomo code here] ‘‘‘
[57]

Other solvers should not be utilized

Please note that the following solvers are available for use: ’glpk’, ’cbc’, ’ipopt ’, ’scip’. Other solvers should not be utilized
[58]

Please add ‘from pyomo.environ import *‘ at the beginning of your code
[59]

Please print the optimal objective value at the end of the code. **Gurobipy code: ** {gurobipy} JOURNAL OF LATEX CLASS FILES 17 This section provides the system prompt to guide MiniOpt models in autonomously selecting solvers after modeling op- timization problems. PROMPT FORSOLVERSELECTION **Solver Selection Guide: ** - ‘‘glpk‘‘: Best for small-to-medium...
[63]

SYSTEMPROMPT FORRL TRAINING You are a helpful assistant

Nonlinearity presence (use ipopt/scip) This section provides the system prompt used by MiniOpt models during reinforcement learning (RL) training. SYSTEMPROMPT FORRL TRAINING You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclo...
[64]

Detailed reasoning about the problem within <think> </think> tags
[65]

Write the corresponding five-element model (derived from your analysis)
[66]

Determine the mathematical properties of problem and select an appropriate solver from ’glpk’, ’cbc’, ’ipopt’, ’scip’
[67]

- Verify the five-element model fully captures the problem’s requirements

Recheck and correct if necessary at the end of the <think> </think> section. - Verify the five-element model fully captures the problem’s requirements. - Confirm no constraints/variables are missing or over-simplified. - Ensure the solver choice aligns with the problem’s mathematical properties
[68]

Provide the corresponding Pyomo code based on checked five-element model within < answer> </answer> tags. In mathematics, optimization problem can be modeled as the following expression $\\ min_{{\\boldsymbol{{x}} \\in \\mathcal{{X }}}} f(\\boldsymbol{{x}}), {{\\rm s.t.}} G(\\boldsymbol{{x}}) \\leq \\boldsymbol{{ c}}$, where $\\boldsymbol{{x}} = (x_1, x_2...
[69]

Variable types (continuous vs integer/ binary)
[70]

Linearity of objective/constraints
[71]

Problem scale (small: glpk/cbc, large: scip/ipopt)
[72]

‘‘‘python\n(.*?)‘‘‘

Nonlinearity presence (use ipopt/scip) Please select an appropriate solver based on the type and quantity of variables, objectives, and constraints. After thinking, when you finally reach the five -element model, you should give the corresponding Pyomo code within the < answer> </answer> tags, i.e., <answer> ‘‘‘python\n code here‘‘‘ </answer>. The user wi...
[73]

Linear Programming (LP): Problems with linear objective function and linear constraints, all continuous variables
[74]

Integer Programming (IP): Problems with linear or nonlinear components where ALL variables are discrete/integer
[75]

Mixed Integer Linear Programming (MILP): Problems with linear components containing BOTH continuous and discrete variables
[76]

Nonlinear Programming (NLP): Problems with nonlinear objective function and/or nonlinear constraints (variables may be continuous/discrete)
[77]

Combinatorial Optimization (CO): Problems focused on selecting/discrete structures (graphs, permutations, sets) with typically binary variables
[78]

Multi-objective Programming (MOP): Problems explicitly optimizing multiple conflicting objectives simultaneously
[79]

Second-Order Cone Programming (SOCP): Problems with a linear objective function , linear constraints, and second-order cone constraints (e.g., $\|Ax + b\| \leq cˆT x + d$) # Problem: {{Question}} # Output Analyze the mathematical structure step by step and classify its type. Finally, output the type abbreviation in the following format: Type: Abbreviati...
[80]

Supply Chain: Decisions about inventory management, distribution network, warehousing operations
[81]

Finance: Decisions about portfolio management, investments, risk management, financial planning
[82]

Manufacturing: Decisions about production processes, quality control, factory operations
[83]

Transportation: Decisions about routing, vehicle scheduling, fleet management, traffic flow, carrier selection

Showing first 80 references.

[1] [1]

Flexible job-shop scheduling via graph neural network and deep reinforcement learning,

W. Song, X. Chen, Q. Li, and Z. Cao, “Flexible job-shop scheduling via graph neural network and deep reinforcement learning,” IEEE Transactions on Industrial Informatics, vol. 19, no. 2, pp. 1600–1610, 2023

2023

[2] [2]

Ars: Automatic routing solver with large language models,

K. Li, F. Liu, Z. Wang, X. Tong, X. Han, M. Yuan, and Q. Zhang, “Ars: Automatic routing solver with large language models,” CoRR, vol. abs/2502.15359, 2025

arXiv 2025

[3] [3]

CAFA: Coding as auto-formulation can boost large language models in solving linear programming problem,

H. Deng, B. Zheng, Y . Jiang, and T. H. Tran, “CAFA: Coding as auto-formulation can boost large language models in solving linear programming problem,” in The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024. [Online]. Available: https: //openreview.net/forum?id=xC2xtBLmri

2024

[4] [4]

AutoSAT: Automatically optimize SAT solvers via large language models,

Y . Sun, F. Ye, X. Zhang, S. Huang, B. Zhang, K. Wei, and S. Cai, “AutoSAT: Automatically optimize SAT solvers via large language models,” CoRR, vol. abs/2402.10705, 2024

arXiv 2024

[5] [5]

AutoPBO: LLM-powered optimization for local search PBO solvers,

J. Li, Y . Chu, Y . Sun, M. Zou, and S. Cai, “AutoPBO: LLM-powered optimization for local search PBO solvers,” CoRR, vol. abs/2509.04007, 2025

arXiv 2025

[6] [6]

LLMOPT: learning to define and solve general optimization problems from scratch,

C. Jiang, X. Shu, H. Qian, X. Lu, J. Zhou, A. Zhou, and Y . Yu, “LLMOPT: learning to define and solve general optimization problems from scratch,” in Advances in The Thirteenth International Conference on Learning Representations, Singapore, 2025

2025

[7] [7]

BPP-search: Enhancing tree of thought reasoning for mathematical modeling problem solving,

T. Wang, W. Y . Yu, Z. He, Z. Liu, H. HaileiGong, H. Wu, X. Han, W. Shi, R. She, F. Zhu, and T. Zhong, “BPP-search: Enhancing tree of thought reasoning for mathematical modeling problem solving,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pi...

2025

[8] [8]

Text2Zinc: A cross- domain dataset for modeling optimization and satisfaction problems in MiniZinc,

A. Singirikonda, S. Kadioglu, and K. Uppuluri, “Text2Zinc: A cross- domain dataset for modeling optimization and satisfaction problems in MiniZinc,” CoRR, vol. abs/2503.10642, 2025

arXiv 2025

[9] [9]

Training LLMs for optimization modeling via iterative data synthesis and structured validation,

Y . Wu, Y . Zhang, Y . Wu, Y . Wang, J. Zhang, and J. Cheng, “Training LLMs for optimization modeling via iterative data synthesis and structured validation,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics,...

2025

[10] [10]

OptMATH: A scalable bidirectional data synthesis framework for optimization mod- eling,

H. Lu, Z. Xie, Y . Wu, C. Ren, Y . Chen, and Z. Wen, “OptMATH: A scalable bidirectional data synthesis framework for optimization mod- eling,” in Forty-second International Conference on Machine Learning, 2025

2025

[11] [11]

A survey of optimization modeling meets LLMs: Progress and future directions,

Z. Xiao, J. Xie, L. Xu, S. Guan, J. Zhu, X. Han, X. Fu, W. Yu, H. Wu, W. Shi, Q. Kang, J. Duan, T. Zhong, M. Yuan, J. Zeng, Y . Wang, G. Chen, and D. Zhang, “A survey of optimization modeling meets LLMs: Progress and future directions,” in Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, J. Kwok, Ed. In...

2025

[12] [12]

An agent-based framework for the automatic validation of mathematical optimization models,

A. Zadorojniy, S. Wasserkrug, and E. Farchi, “An agent-based framework for the automatic validation of mathematical optimization models,” 2025. [Online]. Available: https://arxiv.org/abs/2511.16383

Pith/arXiv arXiv 2025

[13] [13]

Equivamap: Leveraging LLMs for automatic equivalence checking of optimization formulations,

H. Zhai, C. Lawless, E. Vitercik, and L. Leqi, “Equivamap: Leveraging LLMs for automatic equivalence checking of optimization formulations,” in 2nd AI for Math Workshop @ ICML 2025, 2025. [Online]. Available: https://openreview.net/forum?id=RvdjzNlksm

2025

[14] [14]

OptiMUS: Scalable opti- mization modeling with (MI)LP solvers and large language models,

A. AhmadiTeshnizi, W. Gao, and M. Udell, “OptiMUS: Scalable opti- mization modeling with (MI)LP solvers and large language models,” in Advances in Forty-first International Conference on Machine Learning, Vienna, Austria, 2024

2024

[15] [15]

OPT-BENCH: evaluating LLM agent on large-scale search spaces optimization problems,

X. Li, J. Chen, X. Fang, S. Ding, H. Duan, Q. Liu, and K. Chen, “OPT-BENCH: evaluating LLM agent on large-scale search spaces optimization problems,” CoRR, vol. abs/2506.10764, 2025

arXiv 2025

[16] [16]

LLMs for mathe- matical modeling: Towards bridging the gap between natural and math- ematical languages,

X. Huang, Q. Shen, Y . Hu, A. Gao, and B. Wang, “LLMs for mathe- matical modeling: Towards bridging the gap between natural and math- ematical languages,” in Findings of the Association for Computational Linguistics 2025, Albuquerque, New Mexico, 2025, pp. 2678–2710

2025

[17] [17]

Benchmarking LLMs for optimization modeling and enhancing reasoning via reverse socratic synthesis,

Z. Yang, Y . Huang, W. Shi, L. Feng, L. Song, Y . Wang, X. Liang, and J. Tang, “Benchmarking LLMs for optimization modeling and enhancing reasoning via reverse socratic synthesis,” CoRR, vol. abs/2407.09887, 2024

arXiv 2024

[18] [18]

OptiBench meets ReSocratic: Measure and improve LLMs for optimization modeling,

Z. Yang, Y . Wang, Y . Huang, Z. Guo, W. Shi, X. Han, L. Feng, L. Song, X. Liang, and J. Tang, “OptiBench meets ReSocratic: Measure and improve LLMs for optimization modeling,” in The Thirteenth International Conference on Learning Representations, Singapore, 2025

2025

[19] [19]

Chain-of-Experts: When LLMs meet complex operations research problems,

Z. Xiao, D. Zhang, Y . Wu, L. Xu, Y . J. Wang, X. Han, X. Fu, T. Zhong, J. Zeng, M. Song, and G. Chen, “Chain-of-Experts: When LLMs meet complex operations research problems,” in The Twelfth International Conference on Learning Representations, Vienna, Austria, 2024

2024

[20] [20]

Optitree: Hierarchical thoughts generation with tree search for LLM optimization modeling,

H. Liu, J. Wang, Y . Cai, X. Han, Y . Kuang, and J. HAO, “Optitree: Hierarchical thoughts generation with tree search for LLM optimization modeling,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=Ej20yjWMCj

2025

[21] [21]

Llm for large-scale optimization model auto-formulation: A lightweight few-shot learning approach,

K. Liang, Y . Lu, J. Mao, S. Sun, C. Yang, C. Zeng, X. Jin, H. Qin, R. Zhu, and C.-P. Teo, “Llm for large-scale optimization model auto-formulation: A lightweight few-shot learning approach,” 2025. [Online]. Available: https://dx.doi.org/10.2139/ssrn.5329027

work page doi:10.2139/ssrn.5329027 2025

[22] [22]

LLaMoCo: Instruction tuning of large language models for optimization code generation,

Z. Ma, H. Guo, J. Chen, G. Peng, Z. Cao, Y . Ma, and Y . Gong, “LLaMoCo: Instruction tuning of large language models for optimization code generation,” CoRR, vol. abs/2403.01131, 2024

arXiv 2024

[23] [23]

Ner4Opt: Named entity recogni- tion for optimization modelling from natural language,

P. P. Dakle, S. Kadioglu, K. Uppuluri, R. Politi, P. Raghavan, S. Ral- labandi, and R. Srinivasamurthy, “Ner4Opt: Named entity recogni- tion for optimization modelling from natural language,” in Integration of Constraint Programming, Artificial Intelligence, and Operations Research - 20th International Conference, vol. 13884, Nice, France, 2023, pp. 299–319

2023

[24] [24]

Steporlm: A self-evolving framework with generative process supervision for operations research language models,

C. Zhou, T. Xu, J. Lin, and D. Ge, “Steporlm: A self-evolving framework with generative process supervision for operations research language models,” 2025. [Online]. Available: https://arxiv.org/abs/2509.22558

arXiv 2025

[25] [25]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems 3...

2022

[26] [26]

T ¨ULU 3: Pushing fron- tiers in open language model post-training,

N. Lambert, J. Morrison, V . Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V . Miranda, A. Liu, N. Dziri, S. Lyu, Y . Gu, S. Malik, V . Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y . Wang, P. Dasigi, and H. Hajishirzi, “T ¨ULU 3: Pushing fron- tiers in open language model post-training,” CoRR, vol. abs/2411...

Pith/arXiv arXiv 2024

[27] [27]

Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui, “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), Bangkok, Thailand, 2024, pp. 9426–9439

2024

[28] [28]

Logic-RL: Unleashing LLM reasoning with rule- based reinforcement learning,

T. Xie, Z. Gao, Q. Ren, H. Luo, Y . Hong, B. Dai, J. Zhou, K. Qiu, Z. Wu, and C. Luo, “Logic-RL: Unleashing LLM reasoning with rule- based reinforcement learning,” CoRR, vol. abs/2502.14768, 2025

Pith/arXiv arXiv 2025

[29] [29]

On designing effective RL reward at training time for LLM reasoning,

J. Gao, S. Xu, W. Ye, W. Liu, C. He, W. Fu, Z. Mei, G. Wang, and Y . Wu, “On designing effective RL reward at training time for LLM reasoning,” CoRR, vol. abs/2410.15115, 2024

arXiv 2024

[30] [30]

CodeDPO: Aligning code models with self generated and verified source code,

K. Zhang, G. Li, Y . Dong, J. Xu, J. Zhang, J. Su, Y . Liu, and Z. Jin, “CodeDPO: Aligning code models with self generated and verified source code,” in Advances in the 63rd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), Vienna, Austria, 2025, pp. 15 854–15 871

2025

[31] [31]

DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search,

H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, H. Zhang, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan, “DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search,” in Advances in the Thirteenth International Conference on Learning Representatio...

2025

[32] [32]

Training software engineering agents and verifiers with SWE-gym,

J. Pan, X. Wang, G. Neubig, N. Jaitly, H. Ji, A. Suhr, and Y . Zhang, “Training software engineering agents and verifiers with SWE-gym,” in ICLR 2025 Third Workshop on Deep Learning for Code, 2025. JOURNAL OF LATEX CLASS FILES 11

2025

[33] [33]

Solver-Informed RL: Grounding large language models for authentic optimization modeling,

Y . Chen, J. Xia, S. Shao, D. Ge, and Y . Ye, “Solver-Informed RL: Grounding large language models for authentic optimization modeling,” CoRR, vol. abs/2505.11792, 2025

arXiv 2025

[34] [34]

Or-r1: Automating modeling and solving of operations research optimization problem via test-time reinforcement learning,

Z. Ding, Z. Tan, J. Zhang, and T. Chen, “Or-r1: Automating modeling and solving of operations research optimization problem via test-time reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2511.09092

arXiv 2025

[35] [35]

A survey on llm mid-training,

C. Tu, X. Zhang, R. Weng, R. Li, C. Zhang, Y . Bai, H. Yan, J. Wang, and X. Cai, “A survey on llm mid-training,” 2025. [Online]. Available: https://arxiv.org/abs/2510.23081

arXiv 2025

[36] [36]

Qwen2.5-Coder technical report,

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men, F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin, “Qwen2.5-Coder technical report,” CoRR, vol. abs/2409.12186, 2024

Pith/arXiv arXiv 2024

[37] [37]

NL4Opt competition: Formulating optimization problems based on their natural language descriptions,

R. Ramamonjison, T. Yu, R. Li, H. Li, G. Carenini, B. Ghaddar, S. He, M. Mostajabdaveh, A. Banitalebi-Dehkordi, Z. Zhou, and Y . Zhang, “NL4Opt competition: Formulating optimization problems based on their natural language descriptions,” in Proceedings of the NeurIPS 2022 Competitions Track, vol. 220, 2022, pp. 189–203

2022

[38] [38]

ICML 2024 Challenges on Automated Math Rea- soning - Track 3: Automated Optimization Problem-Solving with Code,

Ai4mathICML2024, “ICML 2024 Challenges on Automated Math Rea- soning - Track 3: Automated Optimization Problem-Solving with Code,” https://www.codabench.org/competitions/2438, 2024

2024

[39] [39]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” CoRR, vol. abs/2402.03300, 2024

Pith/arXiv arXiv 2024

[40] [40]

ORLM: A customizable framework in training large models for automated optimization modeling,

C. Huang, Z. Tang, S. Hu, R. Jiang, X. Zheng, D. Ge, B. Wang, and Z. Wang, “ORLM: A customizable framework in training large models for automated optimization modeling,” Operations Research, May 2025

2025

[41] [41]

Densing law of llms,

C. Xiao, J. Cai, W. Zhao, B. Lin, G. Zeng, J. Zhou, Z. Zheng, X. Han, Z. Liu, and M. Sun, “Densing law of llms,” Nature Machine Intelligence, vol. 7, no. 11, pp. 1823–1833, Nov 2025. [Online]. Available: https://doi.org/10.1038/s42256-025-01137-0

work page doi:10.1038/s42256-025-01137-0 2025

[42] [42]

Qwen2.5 technical report,

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...

Pith/arXiv arXiv 2024

[43] [43]

DeepSeek-V3 technical report,

DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J...

Pith/arXiv arXiv 2024

[44] [44]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Ta...

Pith/arXiv arXiv 2025

[45] [45]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

Pith/arXiv arXiv 2025

[46] [46]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. S. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, L. Marris, S. Petulla, C. Gaffney, A. Aharoni, N. Lintz, T. C. Pais, H. Jacobsson, I. Szpektor, N. Jiang, K. Haridasan, A. Omran, N. Saunshi, D. Bahri, G. Mishra, E. Chu, T. Boyd, B. Hekman, A. Parisi, C. Zhang, K. Kawintiranon, T. Bed...

Pith/arXiv arXiv 2025

[47] [47]

GPT-5 system card,

OpenAI, “GPT-5 system card,” https://cdn.openai.com/ gpt-5-system-card.pdf, 2025

2025

[48] [48]

Reflex- ion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflex- ion: language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, New Orleans, LA, 2023

2023

[49] [49]

Training verifiers to solve math word problems,

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schul- man, “Training verifiers to solve math word problems,” CoRR, vol. abs/2110.14168, 2021

Pith/arXiv arXiv 2021

[50] [50]

Measuring mathematical problem solv- ing with the MATH dataset,

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solv- ing with the MATH dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

2021

[51] [51]

Measuring massive multitask language understanding,

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in 9th International Conference on Learning Representations, 2021

2021

[52] [52]

RACE: large-scale ReAding comprehension dataset from examinations,

G. Lai, Q. Xie, H. Liu, Y . Yang, and E. H. Hovy, “RACE: large-scale ReAding comprehension dataset from examinations,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel, Eds., 2017

2017

[53] [53]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017

2017

[54] [54]

Evaluating large language models trained on code,

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

Pith/arXiv arXiv 2021

[55] [55]

w” denotes “with

under a limited training budget. The objective is to construct a training set that simultaneously covers diverse JOURNAL OF LATEX CLASS FILES 15 optimization types and scenarios, while preserving the scenario proportions observed in the real distribution. Starting from the OptMATH-Train pool containing 201K problems, we label each instance with types and ...

arXiv

[56] [56]

Don’t give any explanation, just provide the converted pyomo code in the following format: ‘‘‘python [pyomo code here] ‘‘‘

[57] [57]

Other solvers should not be utilized

Please note that the following solvers are available for use: ’glpk’, ’cbc’, ’ipopt ’, ’scip’. Other solvers should not be utilized

[58] [58]

Please add ‘from pyomo.environ import *‘ at the beginning of your code

[59] [59]

Please print the optimal objective value at the end of the code. **Gurobipy code: ** {gurobipy} JOURNAL OF LATEX CLASS FILES 17 This section provides the system prompt to guide MiniOpt models in autonomously selecting solvers after modeling op- timization problems. PROMPT FORSOLVERSELECTION **Solver Selection Guide: ** - ‘‘glpk‘‘: Best for small-to-medium...

[60] [63]

SYSTEMPROMPT FORRL TRAINING You are a helpful assistant

Nonlinearity presence (use ipopt/scip) This section provides the system prompt used by MiniOpt models during reinforcement learning (RL) training. SYSTEMPROMPT FORRL TRAINING You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclo...

[61] [64]

Detailed reasoning about the problem within <think> </think> tags

[62] [65]

Write the corresponding five-element model (derived from your analysis)

[63] [66]

Determine the mathematical properties of problem and select an appropriate solver from ’glpk’, ’cbc’, ’ipopt’, ’scip’

[64] [67]

- Verify the five-element model fully captures the problem’s requirements

Recheck and correct if necessary at the end of the <think> </think> section. - Verify the five-element model fully captures the problem’s requirements. - Confirm no constraints/variables are missing or over-simplified. - Ensure the solver choice aligns with the problem’s mathematical properties

[65] [68]

Provide the corresponding Pyomo code based on checked five-element model within < answer> </answer> tags. In mathematics, optimization problem can be modeled as the following expression $\\ min_{{\\boldsymbol{{x}} \\in \\mathcal{{X }}}} f(\\boldsymbol{{x}}), {{\\rm s.t.}} G(\\boldsymbol{{x}}) \\leq \\boldsymbol{{ c}}$, where $\\boldsymbol{{x}} = (x_1, x_2...

[66] [69]

Variable types (continuous vs integer/ binary)

[67] [70]

Linearity of objective/constraints

[68] [71]

Problem scale (small: glpk/cbc, large: scip/ipopt)

[69] [72]

‘‘‘python\n(.*?)‘‘‘

Nonlinearity presence (use ipopt/scip) Please select an appropriate solver based on the type and quantity of variables, objectives, and constraints. After thinking, when you finally reach the five -element model, you should give the corresponding Pyomo code within the < answer> </answer> tags, i.e., <answer> ‘‘‘python\n code here‘‘‘ </answer>. The user wi...

[70] [73]

Linear Programming (LP): Problems with linear objective function and linear constraints, all continuous variables

[71] [74]

Integer Programming (IP): Problems with linear or nonlinear components where ALL variables are discrete/integer

[72] [75]

Mixed Integer Linear Programming (MILP): Problems with linear components containing BOTH continuous and discrete variables

[73] [76]

Nonlinear Programming (NLP): Problems with nonlinear objective function and/or nonlinear constraints (variables may be continuous/discrete)

[74] [77]

Combinatorial Optimization (CO): Problems focused on selecting/discrete structures (graphs, permutations, sets) with typically binary variables

[75] [78]

Multi-objective Programming (MOP): Problems explicitly optimizing multiple conflicting objectives simultaneously

[76] [79]

Second-Order Cone Programming (SOCP): Problems with a linear objective function , linear constraints, and second-order cone constraints (e.g., \(\|Ax + b\| \leq cˆT x + d\)) # Problem: {{Question}} # Output Analyze the mathematical structure step by step and classify its type. Finally, output the type abbreviation in the following format: Type: Abbreviati...

[77] [80]

Supply Chain: Decisions about inventory management, distribution network, warehousing operations

[78] [81]

Finance: Decisions about portfolio management, investments, risk management, financial planning

[79] [82]

Manufacturing: Decisions about production processes, quality control, factory operations

[80] [83]

Transportation: Decisions about routing, vehicle scheduling, fleet management, traffic flow, carrier selection