
arXiv: 2601.19924 · v2 · pith:FAJJCCRInew · submitted 2026-01-09 · 💻 cs.CL · cs.AI · cs.LG

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Pith reviewed 2026-05-16 16:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM · optimization modeling · benchmark · constraint formulation · solver integration · operations research · mixed-integer programming

The pith

Solver-integrated LLMs for optimization modeling are limited primarily by errors in automated constraint formulation as problem complexity scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OPT-Engine, a benchmark that applies controllable complexity scaling to ten standard operations research problems ranging from linear programs to mixed-integer programs. It tests three paradigms: pure-text chain-of-thought reasoning (PTR), tool-integrated reasoning (TIR), and solver-integrated reasoning (SIR). Pure-text approaches lose robustness quickly as variables and integrality requirements are added. External tools fix local arithmetic but leave global constraint violations intact. Solver integration improves results yet still breaks down at the step of correctly writing the constraint set itself.
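The gap between correct local arithmetic and a violated global constraint can be illustrated with a toy TSP check. This is a minimal sketch, not the paper's evaluation code: the four-city instance and the helper names `tour_length` and `is_valid_tour` are assumptions made for illustration.

```python
from math import dist

def tour_length(cities, tour):
    """Sum of Euclidean leg lengths, closing the loop back to the start."""
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def is_valid_tour(cities, tour):
    """Global constraint: every city index appears exactly once."""
    return sorted(tour) == list(range(len(cities)))

# Hypothetical 4-city instance on the corners of a 3 x 4 rectangle.
cities = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]

good_tour = [0, 1, 2, 3]
bad_tour = [0, 1, 0, 1]   # shuttles between two cities, skips cities 2 and 3

good_length = tour_length(cities, good_tour)   # arithmetic is exact
bad_length = tour_length(cities, bad_tour)     # arithmetic is also exact

print(good_length, is_valid_tour(cities, good_tour))
print(bad_length, is_valid_tour(cities, bad_tour))
```

The second tour reports a shorter length with every leg computed correctly, yet it is not a tour at all. A calculator tool can guarantee the first property but not the second.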

Core claim

For the current solver-integrated reasoning paradigm, the automated formulation of constraints represents the primary bottleneck in LLM performance on optimization modeling tasks.

What carries the argument

OPT-Engine benchmark that scales ten canonical problems by number of variables, number of constraints, and degree of integrality to create measurable difficulty levels.
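To make the three scaling axes concrete, the following sketch generates random instances at increasing difficulty and reports their metrics. It is an assumption-laden illustration: `Instance`, `generate_instance`, and `complexity` are hypothetical names, and the coefficient ranges and right-hand-side rule are arbitrary choices, not the generation pipeline used by OPT-Engine.

```python
import random
from dataclasses import dataclass

@dataclass
class Instance:
    costs: list        # objective coefficients, length n_vars
    rows: list         # constraint matrix, n_cons rows of length n_vars
    rhs: list          # right-hand sides, length n_cons
    is_integer: list   # per-variable integrality flags

def generate_instance(n_vars, n_cons, integer_fraction, seed=0):
    """Random instance of: max c^T x  s.t.  A x <= b, x >= 0 (assumed form)."""
    rng = random.Random(seed)
    costs = [rng.randint(1, 20) for _ in range(n_vars)]
    rows = [[rng.randint(1, 10) for _ in range(n_vars)] for _ in range(n_cons)]
    # Arbitrary choice: right-hand side is half the row sum, rounded down.
    rhs = [sum(row) // 2 for row in rows]
    n_int = round(integer_fraction * n_vars)
    is_integer = [i < n_int for i in range(n_vars)]
    return Instance(costs, rows, rhs, is_integer)

def complexity(inst):
    """The three difficulty axes named in the text."""
    n = len(inst.costs)
    return {"n_vars": n,
            "n_cons": len(inst.rows),
            "integer_fraction": sum(inst.is_integer) / n}

# Three hypothetical difficulty levels: pure LP, mixed, pure integer.
levels = [generate_instance(n, n // 2, frac)
          for n, frac in [(4, 0.0), (8, 0.5), (16, 1.0)]]
metrics = [complexity(inst) for inst in levels]
print(metrics)
```

Keeping the metrics as a function of the instance, rather than of the generator arguments, means they can also be computed for instances produced by other means.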

Load-bearing premise

The ten chosen canonical problems together with the metrics of variable count, constraint count, and integrality level are representative of the optimization modeling tasks LLMs will face.

What would settle it

An LLM that produces error-free constraint formulations for the highest-complexity mixed-integer instances in the benchmark while using solver integration would disprove the claim that constraint formulation is the dominant limit.
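On small instances, one way to detect a formulation error is to compare the optimum of a candidate constraint set against a reference by exhaustive enumeration. The sketch below assumes a hypothetical three-item knapsack with one side constraint; the data and the function names are illustrative and do not come from the benchmark.

```python
from itertools import product

# Hypothetical knapsack data (illustrative only).
values = [10, 13, 7]
weights = [3, 4, 2]
capacity = 5

def capacity_ok(x):
    return sum(w * xi for w, xi in zip(weights, x)) <= capacity

def conflict_ok(x):
    # Side constraint: items 0 and 2 may not both be selected.
    return x[0] + x[2] <= 1

def best_value(constraints):
    """Enumerate all binary vectors and return the best feasible objective."""
    feasible = (x for x in product((0, 1), repeat=len(values))
                if all(c(x) for c in constraints))
    return max(sum(v * xi for v, xi in zip(values, x)) for x in feasible)

reference = best_value([capacity_ok, conflict_ok])
candidate = best_value([capacity_ok])   # formulation that omits the conflict
formulation_matches = (candidate == reference)

print(reference, candidate, formulation_matches)
```

Matching objective values are a necessary signal rather than a proof of equivalence, since two different constraint sets can share an optimum. A mismatch, as here, is enough to flag the candidate formulation as wrong.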

Figures

Figures reproduced from arXiv: 2601.19924 by Cheng Cheng, Dongdong Ge, Yinan Sun, Yitian Chen, Zi Ling.

Figure 1: An overview of the OPT-Engine taxonomy. The framework encompasses five …
Figure 2: Overview of the problem instance generation workflow. The pipeline comprises four stages: (1) Numeric Instance Generation, (2) Original Problem Construction, (3) Problem Augmentation, and (4) Instance Validation. This end-to-end process yields comprehensive problem instances, including their specific type, complexity metrics, natural language statements, and ground-truth verifiable solutions.
Figure 3: Performance comparison between Tool-Integrated Reasoning (TIR) and Pure-Text Reasoning (PTR) as problem size scales. The upper panel reports results for the DeepSeek-V3.2 model, and the lower panel reports results for the GPT-5.1 model.
Figure 4: Performance scaling of PTR (blue) vs. TIR (red) on the Qwen3-4B series. The upper panel illustrates the reasoning performance of the base Qwen3-4B-Instruct model as problem complexity increases. The lower panel incorporates results from Qwen3-4B-RL, indicating significantly improved accuracy due to RLVR training in TIR modes.
Figure 5: TSP results with DeepSeek-V3.2: relationship between token length and accuracy across instance …
Figure 6: The first row is the accuracy across different perplexities. The second row is the accuracy across …
Figure 7: Augmented constraint descriptions and their corresponding mathematical formulations across …
Figure 8: Accuracy with and without Extra Constraint.
Figure 9: Comparison of Prompt Variation across Three Complexity Tiers. While the underlying TSP …
Figure 17: Comparative Analysis of DeepSeek-V3.2 Performance in Pure-Text Reasoning for TSP: …

We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces OPT-ENGINE, an extensible benchmark spanning ten canonical Operations Research problems with controllable complexity scaling from Linear Programming to Mixed-Integer Programming. It evaluates three paradigms: Pure-Text Reasoning (PTR) via Chain-of-Thought, Tool-Integrated Reasoning (TIR), and Solver-integrated Reasoning (SIR). It reports that PTR exhibits a robustness gap as complexity increases, that external tools mitigate only local arithmetic errors, and that constraint formulation is the primary bottleneck for SIR.

Significance. If the empirical trends hold, the work supplies a reproducible framework and concrete failure-mode analysis for LLM-based optimization modeling, with the public code release enabling direct verification and extension. The scoped conclusions on paradigm-specific bottlenecks offer a practical roadmap without overclaiming universality.

major comments (1)
  1. [Experimental results] Experimental results section: the reported robustness gaps and bottleneck attributions for SIR lack accompanying error bars, statistical tests, or exact prompt templates, making it difficult to confirm that post-hoc filtering or prompt choices do not influence the primary claim that constraint formulation is the dominant failure mode.
minor comments (2)
  1. [Methods] Methods section: the precise definitions and formulas for the complexity scaling metrics (number of variables, constraints, integrality) should be stated explicitly with an example instance to allow readers to replicate the scaling procedure.
  2. [Figures] Figure captions: several performance plots would benefit from clearer legends distinguishing the three paradigms and from annotation of the exact complexity levels at which the robustness gap becomes statistically significant.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on experimental rigor. We address the single major comment below.

  1. Referee: Experimental results section: the reported robustness gaps and bottleneck attributions for SIR lack accompanying error bars, statistical tests, or exact prompt templates, making it difficult to confirm that post-hoc filtering or prompt choices do not influence the primary claim that constraint formulation is the dominant failure mode.

    Authors: We agree that the current presentation would benefit from greater statistical transparency. In the revised manuscript we will add error bars (standard deviation over five independent runs) to all key performance metrics, include statistical significance tests (paired t-tests and bootstrap confidence intervals) to support the reported robustness gaps, and provide the complete prompt templates together with any post-processing rules in a new appendix. These additions will allow readers to verify that the constraint-formulation bottleneck remains the dominant failure mode independent of prompt variation or post-hoc filtering. revision: yes
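The statistics promised in the rebuttal (standard deviation over five runs, a paired t-test, and a bootstrap confidence interval) can be sketched with the Python standard library alone. The accuracies below are placeholder values invented for illustration, not results from the paper, and `paired_t_statistic` and `bootstrap_ci` are hypothetical helper names.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Placeholder accuracies for five runs; illustrative values, NOT from the paper.
ptr_runs = [0.42, 0.45, 0.40, 0.44, 0.43]
tir_runs = [0.55, 0.58, 0.51, 0.57, 0.54]

def paired_t_statistic(a, b):
    """t statistic of the per-run differences b - a (n - 1 degrees of freedom)."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    means = sorted(mean(rng.choices(diffs, k=len(diffs)))
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

t_stat = paired_t_statistic(ptr_runs, tir_runs)
ci_low, ci_high = bootstrap_ci(ptr_runs, tir_runs)

print(f"PTR {mean(ptr_runs):.3f} +/- {stdev(ptr_runs):.3f}")
print(f"TIR {mean(tir_runs):.3f} +/- {stdev(tir_runs):.3f}")
print(f"t = {t_stat:.2f}, 95% CI for the gap = [{ci_low:.3f}, {ci_high:.3f}]")
```

With five paired runs the t statistic has four degrees of freedom, so the two-sided 5% critical value is about 2.776. With so few runs the bootstrap interval is coarse and is best read alongside the t-test rather than in place of it.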

Circularity Check

0 steps flagged

No significant circularity; empirical observations from benchmark experiments


The paper introduces the OPT-ENGINE benchmark spanning ten canonical OR problems with controllable complexity scaling and reports direct empirical results on PTR, tool integration, and SIR paradigms. The key claim that constraint formulation is the primary bottleneck for SIR follows from observed performance gaps and robustness failures in the experiments, without any reduction to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain consists of benchmark construction followed by experimental measurement, which is self-contained and externally verifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard operations-research problem definitions and existing LLM prompting techniques; no new free parameters, axioms, or invented entities are introduced beyond the benchmark construction itself.

pith-pipeline@v0.9.0 · 5536 in / 1187 out tokens · 32071 ms · 2026-05-16T16:15:21.254376+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

    math.OC 2026-04 unverdicted novelty 6.0

    Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.
