OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Cheng cheng; Dongdong Ge; Yinan Sun; Yitian Chen; Zi Ling

arXiv: 2601.19924 · v2 · pith:FAJJCCRInew · submitted 2026-01-09 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-16 16:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords LLM · optimization modeling · benchmark · constraint formulation · solver integration · operations research · mixed-integer programming


The pith

Solver-integrated LLMs for optimization modeling are limited primarily by errors in automated constraint formulation as problem complexity scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OPT-Engine, a benchmark that applies controllable complexity scaling to ten standard operations research problems ranging from linear programs to mixed-integer programs. It tests three paradigms: pure text chain-of-thought reasoning, tool-assisted calculation, and solver-integrated reasoning. Pure text approaches lose robustness quickly with added variables and integrality. External tools fix local arithmetic but leave global constraint violations intact. Solver integration improves results yet still breaks down at the step of correctly writing the constraint set itself.
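The second finding, exact local arithmetic alongside a globally invalid answer, can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than material from the paper: the four-city distance matrix is made up, `tour_length` stands in for an external calculator, and `is_valid_tour` stands in for the global feasibility check that a calculator does not perform.

```python
from typing import List, Sequence

# Hypothetical 4-city symmetric distance matrix (illustrative values only).
DIST: List[List[int]] = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]


def tour_length(tour: Sequence[int], dist: List[List[int]]) -> int:
    """Sum the edge costs along the tour, including the return to the start."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def is_valid_tour(tour: Sequence[int], n: int) -> bool:
    """Global constraint: every city 0..n-1 appears exactly once."""
    return sorted(tour) == list(range(n))


valid = [0, 1, 3, 2]
invalid = [0, 1, 3, 1]  # revisits city 1 and never visits city 2

for name, tour in (("valid", valid), ("invalid", invalid)):
    print(name, tour_length(tour, DIST), is_valid_tour(tour, len(DIST)))
```

In this toy case the invalid tour has the smaller, correctly computed length (12 versus 18), yet it skips city 2, so arithmetic precision alone does not make it an acceptable answer.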

Core claim

For the current solver-integrated reasoning paradigm, the automated formulation of constraints represents the primary bottleneck in LLM performance on optimization modeling tasks.

What carries the argument

OPT-Engine benchmark that scales ten canonical problems by number of variables, number of constraints, and degree of integrality to create measurable difficulty levels.
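One way to picture the three scaling knobs is a generator parameterized by them. The sketch below is hypothetical and is not the authors' generator (their code is in the released repository): `Instance`, `make_instance`, `complexity`, the coefficient ranges, and the rule for setting right-hand sides are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    costs: List[int]        # objective coefficients, one per variable
    rows: List[List[int]]   # constraint matrix, one row per constraint (row . x <= rhs)
    rhs: List[int]          # right-hand sides
    is_integer: List[bool]  # integrality flag per variable


def make_instance(n_vars: int, n_cons: int, int_frac: float, seed: int = 0) -> Instance:
    """Draw a random instance with the requested size and share of integer variables."""
    rng = random.Random(seed)
    costs = [rng.randint(1, 20) for _ in range(n_vars)]
    rows = [[rng.randint(1, 10) for _ in range(n_vars)] for _ in range(n_cons)]
    # Assumed rule: half of each row sum, so x = 0 is always feasible.
    rhs = [sum(row) // 2 for row in rows]
    n_int = round(int_frac * n_vars)
    is_integer = [j < n_int for j in range(n_vars)]
    return Instance(costs, rows, rhs, is_integer)


def complexity(inst: Instance) -> Dict[str, float]:
    """Report the three difficulty measures for an instance."""
    n = len(inst.costs)
    return {
        "variables": n,
        "constraints": len(inst.rows),
        "integrality": sum(inst.is_integer) / n,
    }


# int_frac = 0.0 yields a pure LP; larger values move toward a MIP.
for n_vars, n_cons, int_frac in [(4, 3, 0.0), (4, 3, 0.5), (8, 6, 1.0)]:
    print(complexity(make_instance(n_vars, n_cons, int_frac)))
```

Holding two knobs fixed while sweeping the third is what would let accuracy be plotted against a single source of difficulty.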

Load-bearing premise

The ten chosen canonical problems together with the metrics of variable count, constraint count, and integrality level are representative of the optimization modeling tasks LLMs will face.

What would settle it

An LLM that produces error-free constraint formulations for the highest-complexity mixed-integer instances in the benchmark while using solver integration would disprove the claim that constraint formulation is the dominant limit.
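A common way to test a formulation, assumed here rather than taken from the paper's evaluation code, is to solve it and compare the optimal value with a ground-truth value. The sketch below uses brute-force enumeration in place of a real solver, and the item data and the `conflict` constraint are hypothetical. It shows how omitting a single constraint leaves the model solvable but changes the optimum.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence

Constraint = Callable[[Sequence[int]], bool]

# Hypothetical binary selection problem: maximize total value.
VALUES = [10, 7, 5]
WEIGHTS = [4, 3, 2]
CAPACITY = 5


def capacity(x: Sequence[int]) -> bool:
    """Total weight of selected items must not exceed the capacity."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) <= CAPACITY


def conflict(x: Sequence[int]) -> bool:
    """Extra constraint: items 1 and 2 cannot both be selected."""
    return x[1] + x[2] <= 1


def solve_binary(values: List[int], constraints: List[Constraint]) -> Optional[int]:
    """Enumerate all 0/1 vectors and return the best feasible objective, or None."""
    best: Optional[int] = None
    for x in product((0, 1), repeat=len(values)):
        if all(c(x) for c in constraints):
            obj = sum(v * xi for v, xi in zip(values, x))
            if best is None or obj > best:
                best = obj
    return best


ground_truth = solve_binary(VALUES, [capacity, conflict])  # full constraint set
candidate = solve_binary(VALUES, [capacity])               # formulation missing `conflict`
print(ground_truth, candidate, ground_truth == candidate)
```

With the full constraint set the optimum is 10, while the formulation that drops `conflict` reports 12. The mismatch flags a formulation error even though both models solve without complaint.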

Figures

Figures reproduced from arXiv: 2601.19924 by Cheng cheng, Dongdong Ge, Yinan Sun, Yitian Chen, Zi Ling.

**Figure 2.** Overview of the problem instance generation workflow. The pipeline comprises four stages: (1) Numeric Instance Generation, (2) Original Problem Construction, (3) Problem Augmentation, and (4) Instance Validation. This end-to-end process yields comprehensive problem instances, including their specific type, complexity metrics, natural language statements, and ground-truth verifiable solutions.

**Figure 3.** Performance comparison between Tool-Integrated Reasoning (TIR) and Pure-Text Reasoning (PTR) as problem size scales. The upper panel reports results for the DeepSeek-V3.2 model, and the lower panel reports results for the GPT-5.1 model.

**Figure 4.** Performance scaling of PTR (blue) vs. TIR (red) on the Qwen3-4B series. The upper panel illustrates the reasoning performance of the base Qwen3-4B-Instruct model as problem complexity increases. The lower panel incorporates results from Qwen3-4B-RL, indicating significantly improved accuracy due to RLVR training in TIR modes.

**Figure 5.** TSP results with DeepSeek-V3.2: relationship between token length and accuracy across instance…

**Figure 6.** The first row is the accuracy across different perplexities. The second row is the accuracy across…

**Figure 7.** Augmented constraint descriptions and their corresponding mathematical formulations across…

**Figure 8.** Accuracy with and without Extra Constraint

**Figure 9.** Comparison of Prompt Variation across Three Complexity Tiers. While the underlying TSP…

**Figure 17.** Comparative Analysis of DeepSeek-V3.2 Performance in Pure-Text Reasoning for TSP:…

Abstract

We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPT-Engine gives a reproducible scaling benchmark across ten OR problems that isolates constraint formulation as the main SIR failure mode.


The core contribution is a benchmark that takes ten standard operations research problems and scales them controllably from LP to MIP along three axes: number of variables, number of constraints, and degree of integrality. The experiments compare pure-text, tool-augmented, and solver-integrated reasoning. They show that PTR collapses as complexity grows, that tools fix local arithmetic but not global constraint violations, and that SIR's bottleneck is constraint formulation. The authors release code and data, which makes the trends checkable. The combination of complexity scaling with the three-paradigm split is not in the prior work they cite, so the setup itself is new enough to be useful for tracking progress on language-to-model tasks. The trends are reported clearly in the abstract, and the observational claims stay within the tested regime.

The main limitation is that full prompt templates, error bars, and exact problem-generation details are not visible here, so it is hard to judge how much the robustness gaps depend on prompt engineering choices versus model limits. The ten problems are a reasonable starting set but, as noted, may not capture the messier structure of real industrial instances.

This is aimed at groups building LLM agents for planning and optimization. A reader who wants a controllable testbed for measuring formulation accuracy will get direct value from the scaling curves. It deserves peer review because the framework is reproducible and the central observation is falsifiable with the released code. I would send it out and ask the authors to add the missing experimental controls and statistical tests on the gaps.

Referee Report

1 major / 2 minor

Summary. The paper introduces OPT-ENGINE, an extensible benchmark spanning ten canonical Operations Research problems with controllable complexity scaling from Linear Programming to Mixed-Integer Programming. It evaluates three paradigms—Pure-Text Reasoning (PTR) via Chain-of-Thought, tool-integrated reasoning, and Solver-integrated Reasoning (SIR)—reporting that PTR exhibits a robustness gap with increasing complexity, external tools mitigate only local arithmetic errors, and constraint formulation is the primary bottleneck for SIR.

Significance. If the empirical trends hold, the work supplies a reproducible framework and concrete failure-mode analysis for LLM-based optimization modeling, with the public code release enabling direct verification and extension. The scoped conclusions on paradigm-specific bottlenecks offer a practical roadmap without overclaiming universality.

major comments (1)

[Experimental results] Experimental results section: the reported robustness gaps and bottleneck attributions for SIR lack accompanying error bars, statistical tests, or exact prompt templates, making it difficult to confirm that post-hoc filtering or prompt choices do not influence the primary claim that constraint formulation is the dominant failure mode.

minor comments (2)

[Methods] Methods section: the precise definitions and formulas for the complexity scaling metrics (number of variables, constraints, integrality) should be stated explicitly with an example instance to allow readers to replicate the scaling procedure.
[Figures] Figure captions: several performance plots would benefit from clearer legends distinguishing the three paradigms and from annotation of the exact complexity levels at which the robustness gap becomes statistically noticeable.
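
As an illustration of what such explicit definitions might look like, the block below writes a generic mixed-integer linear program and three scaling quantities. The symbols n, m, I, and ρ are assumed notation for this sketch; the paper's own definitions may differ.

```latex
% Generic mixed-integer linear program (standard form; notation assumed here).
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & c^\top x \\
\text{s.t.} \quad & A x \le b, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m, \\
& x \ge 0, \\
& x_j \in \mathbb{Z} \quad \text{for all } j \in I \subseteq \{1, \dots, n\}.
\end{aligned}

% Three scaling quantities (hypothetical definitions):
%   n               number of decision variables
%   m               number of constraints (rows of A)
%   \rho = |I| / n  fraction of variables required to be integer
% \rho = 0 gives a linear program; \rho > 0 gives a mixed-integer program.
```

A two-variable example under this notation: with n = 2, m = 1, and I = {1}, the instance "minimize -x_1 - x_2 subject to 2x_1 + x_2 ≤ 3, x ≥ 0, x_1 integer" has ρ = 1/2.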

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on experimental rigor. We address the single major comment below.


Referee: Experimental results section: the reported robustness gaps and bottleneck attributions for SIR lack accompanying error bars, statistical tests, or exact prompt templates, making it difficult to confirm that post-hoc filtering or prompt choices do not influence the primary claim that constraint formulation is the dominant failure mode.

Authors: We agree that the current presentation would benefit from greater statistical transparency. In the revised manuscript we will add error bars (standard deviation over five independent runs) to all key performance metrics, include statistical significance tests (paired t-tests and bootstrap confidence intervals) to support the reported robustness gaps, and provide the complete prompt templates together with any post-processing rules in a new appendix. These additions will allow readers to verify that the constraint-formulation bottleneck remains the dominant failure mode independent of prompt variation or post-hoc filtering.

Revision: yes
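
The statistics promised in this response can be sketched with the Python standard library alone. The accuracy lists below are placeholder values invented for illustration, not results from the paper, and `paired_t_statistic` and `bootstrap_ci` are hypothetical helper names. Pairing assumes that run i of both paradigms uses the same instances.

```python
import math
import random
import statistics
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic of the per-run differences a[i] - b[i] (df = len(a) - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


def bootstrap_ci(a: List[float], b: List[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.mean(rng.choice(diffs) for _ in diffs) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Placeholder accuracies over five runs (hypothetical, for illustration only).
acc_a = [0.82, 0.80, 0.85, 0.79, 0.83]
acc_b = [0.61, 0.58, 0.66, 0.60, 0.62]

print("error bars (std):", statistics.stdev(acc_a), statistics.stdev(acc_b))
print("paired t:", paired_t_statistic(acc_a, acc_b))
print("95% bootstrap CI for the gap:", bootstrap_ci(acc_a, acc_b))
```

With only five runs the bootstrap interval is coarse, so it is best read alongside the t statistic rather than as a replacement for it.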

Circularity Check

0 steps flagged

No significant circularity; empirical observations from benchmark experiments


The paper introduces the OPT-ENGINE benchmark spanning ten canonical OR problems with controllable complexity scaling and reports direct empirical results on PTR, tool integration, and SIR paradigms. The key claim that constraint formulation is the primary bottleneck for SIR follows from observed performance gaps and robustness failures in the experiments, without any reduction to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain consists of benchmark construction followed by experimental measurement, which is self-contained and externally verifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard operations-research problem definitions and existing LLM prompting techniques; no new free parameters, axioms, or invented entities are introduced beyond the benchmark construction itself.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling
math.OC · 2026-04 · unverdicted · novelty 6.0

Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Springer, 2007

Andreas Antoniou and Wu-Sheng Lu.Practical optimization: algorithms and engineering applications. Springer, 2007

work page 2007

[7] [7]

Springer, 1984

David G Luenberger, Yinyu Ye, et al.Linear and nonlinear programming, volume 2. Springer, 1984

work page 1984

[8] [8]

Optmath: A scalable bidirectional data synthesis framework for optimization modeling.arXiv preprint arXiv:2502.11102, 2025

Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen. Optmath: A scalable bidirectional data synthesis framework for optimization modeling.arXiv preprint arXiv:2502.11102, 2025

work page arXiv 2025

[9] [9]

Orlm: A customizable framework in training large models for automated optimization modeling.Operations Research, 2025

Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. Orlm: A customizable framework in training large models for automated optimization modeling.Operations Research, 2025

work page 2025

[10] [10]

Benchmarking llms for optimization modeling and enhancing reasoning via reverse socratic synthesis.arXiv e-prints, pages arXiv–2407, 2024

Zhicheng Yang, Yinya Huang, Wei Shi, Liang Feng, Linqi Song, Yiwei Wang, Xiaodan Liang, and Jing Tang. Benchmarking llms for optimization modeling and enhancing reasoning via reverse socratic synthesis.arXiv e-prints, pages arXiv–2407, 2024

work page 2024

[11] [11]

Large language models as end-to-end combinatorial optimization solvers.arXiv preprint arXiv:2509.16865, 2025

Xia Jiang, Yaoxin Wu, Minshuo Li, Zhiguang Cao, and Yingqian Zhang. Large language models as end-to-end combinatorial optimization solvers.arXiv preprint arXiv:2509.16865, 2025

work page arXiv 2025

[12] [12]

Large language models still can’t plan (a benchmark for llms on planning and reasoning about change)

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). InNeurIPS 2022 Foundation Models for Decision Making Workshop, 2022

work page 2022

[13] [13]

Gurobi Optimizer Reference Manual, 2024

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024

work page 2024

[14] [14]

Cardinal optimizer (copt) user guide.arXiv preprint arXiv:2208.14314, 2022

Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal optimizer (copt) user guide. arXiv preprint arXiv:2208.14314, 2022

work page arXiv 2022

[15] Rindra Ramamonjison, Haley Li, Timothy Yu, Shiqi He, Vishnu Rengan, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 29–62, 2022

[16] Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. Optimus: Scalable optimization modeling with (mi) lp solvers and large language models. arXiv preprint arXiv:2402.10172, 2024

[17] Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, and Yinyu Ye. Solver-informed rl: Grounding large language models for authentic optimization modeling. arXiv preprint arXiv:2505.11792, 2025

[18] OpenAI. Learning to reason with LLMs, September 2024. Accessed: 2026-01-07

[19] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[20] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[21] Pei-Fu Guo, Ying-Hsuan Chen, Yun-Da Tsai, and Shou-De Lin. Towards optimizing with large language models. arXiv preprint arXiv:2310.05204, 2023

[22] Rindranirina Ramamonjison, Timothy Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, et al. Nl4opt competition: Formulating optimization problems based on their natural language descriptions. In NeurIPS 2022 competition track, pages 189–203. PMLR, 2023

[23] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. Mamo: a mathematical modeling benchmark with solvers. arXiv e-prints, pages arXiv–2405, 2024

[24] Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, and Jing Tang. Optibench meets resocratic: Measure and improve llms for optimization modeling. arXiv preprint arXiv:2407.09887, 2024

[25] Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering. arXiv preprint arXiv:2506.09050, 2025

[26] Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. Np-engine: Empowering optimization reasoning in large language models with verifiable synthetic np problems. arXiv preprint arXiv:2510.16476, 2025

[27] Duc M Nguyen and Sungahn Ko. Technical report for icml 2024 automated math reasoning challenge: Solving optimization problems with open source large language model. In AI for Math Workshop @ ICML 2024, 2024

[28] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

[29] Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, and Jia Li. Grapharena: Evaluating and exploring large language models on graph computation. arXiv preprint arXiv:2407.00379, 2024

[30] Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi. Acpbench: Reasoning about action, change, and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26559–26568, 2025

[31] Zhao Song, Song Yue, and Jiahao Zhang. Thinking isn’t an illusion: Overcoming the limitations of reasoning models via tool augmentations. arXiv preprint arXiv:2507.17699, 2025

[32] Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. Llms still can’t plan; can lrms? a preliminary evaluation of openai’s o1 on planbench. arXiv preprint arXiv:2409.13373, 2024

[33] Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025

[34] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025

[35] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

[36] Pengfei Hong, Deepanway Ghosal, Navonil Majumder, Somak Aditya, Rada Mihalcea, and Soujanya Poria. Stuck in the quicksand of numeracy, far from agi summit: Evaluating llms’ mathematical competency through ontology-guided perturbations. CoRR, 2024

[37] Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, et al. Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations. arXiv preprint arXiv:2502.06453, 2025

[38] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

[39] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

[40] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

[41] Frederick Jelinek. Interpolated estimation of markov source parameters from sparse data. In Proc. Workshop on Pattern Recognition in Practice, 1980

[42] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

[43] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

A Technical Background

A.1 Auto-formulation of Optimization Problems

In this work, auto-formulation denotes the task of using an LLM-based agent to transform a human-readable problem description into this fo...