arxiv: 2604.16804 · v2 · submitted 2026-04-18 · 💻 cs.LG · cs.AI

Recognition: unknown

AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

Sumeet Ramesh Motwani , Chuan Du , Aleksander Petrov , Christopher Davis , Philip Torr , Antonio Papania-Davis , Weishi Yan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords autoformalizationoperations researchlarge language modelsreinforcement learningsynthetic dataoptimization problemspost-trainingsolver feedback

0 comments

The pith

An 8B model post-trained via synthetic data and solver feedback matches larger models at turning natural language optimization problems into solver-ready forms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs can be scalably post-trained to autoformalize operations research problems by generating verified synthetic data from standard optimization templates and applying reinforcement learning whose reward comes directly from whether a solver executes the output correctly. This pipeline produces an 8B model that reaches state-of-the-art or competitive accuracy on six established benchmarks while matching the performance of much larger frontier models. For non-linear problems involving physical dynamics, where most models score near zero, the authors add a curriculum RL stage that starts from limited initial data and progressively improves the model until the class becomes tractable. A sympathetic reader would care because successful autoformalization removes the need for scarce OR specialists when translating real industrial descriptions into usable solver inputs. The central mechanism is therefore the closed loop of template-based data creation plus execution-based reward that lets post-training substitute for hand-crafted expertise.

Core claim

AutoOR shows that verified synthetic data generated from standard linear, mixed-integer, and non-linear optimization forms, paired with reinforcement learning that uses solver execution success as the sole reward signal, enables an 8B model to autoformalize natural-language optimization problems at state-of-the-art or competitive levels across six benchmarks; a curriculum RL variant further renders previously intractable non-linear physical-dynamics problems solvable from limited seed data.

What carries the argument

The AutoOR pipeline, which generates training examples from standard optimization templates and uses solver execution feedback as the reinforcement-learning reward to train the model to produce correct formalizations.

If this is right

An 8B model becomes competitive with significantly larger models on linear and mixed-integer formalization tasks.
Non-linear problems involving physical dynamics move from near-zero to usable accuracy through staged curriculum reinforcement learning.
Industrial decision-making can be accelerated by replacing manual formalization steps with automated model output.
Training data creation scales without requiring large amounts of human-annotated OR examples.
The same post-training recipe applies across linear, mixed-integer, and selected non-linear problem classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on end-to-end pipelines that take raw sensor or business data and emit both a formalization and a solved schedule.
If the synthetic-to-real gap proves small, similar template-plus-execution loops might apply to formalizing problems in other domains such as chemical process design or financial planning.
A practical next measurement would be accuracy on a corpus of actual company problem statements that have never been seen during training.
Integration with existing solver interfaces might allow non-experts to describe a scheduling task in ordinary language and receive an immediately executable model.

Load-bearing premise

That data produced from clean standard optimization templates together with solver execution feedback will be sufficient to train models that still work when given the varied and often ambiguous wording found in actual industrial problem statements.

What would settle it

A test set of real industrial optimization problems described in natural language where the post-trained 8B model produces formalizations that solvers cannot execute correctly or that yield wrong objective values, while larger frontier models also fail on the same set.

read the original abstract

Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoOR post-trains an 8B model on synthetic OR templates with solver-execution RL to match larger models on autoformalization benchmarks and adds a curriculum for non-linear cases.

read the letter

The main thing to know is that this paper takes an 8B LLM, generates training data from standard optimization templates, and runs RL where the reward is whether a solver can actually execute the model's output. The result is competitive or better performance than much larger frontier models on six established OR benchmarks, plus a curriculum strategy that makes non-linear physical dynamics problems workable when other models score near zero percent.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AutoOR, a pipeline combining synthetic data generation from standard optimization forms with reinforcement learning that uses solver execution feedback as the reward signal. The method post-trains LLMs to translate natural language descriptions of linear, mixed-integer, and non-linear optimization problems into solver-ready formulations. It reports that an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks (matching much larger frontier models) and introduces a curriculum RL strategy that bootstraps limited data to make a non-linear physical-dynamics problem class tractable where frontier models score near zero.

Significance. If the empirical claims hold under rigorous scrutiny, the work offers a practical route to scalable autoformalization of OR problems, potentially reducing dependence on scarce OR expertise in industrial settings. The external solver feedback provides a verifiable, non-circular reward signal, and the curriculum approach for previously intractable non-linear classes is a concrete methodological advance. The demonstration that a modest 8B model can compete with frontier systems on established benchmarks underscores the efficiency of the synthetic-data-plus-RL recipe.

major comments (2)

[§5] §5 (Experimental results): The central performance claims—that the 8B model reaches SOTA or competitive scores on six benchmarks and that the curriculum renders the non-linear class tractable—are presented without tables or text specifying benchmark definitions, exact evaluation metrics (e.g., formulation accuracy vs. solver success rate), number of test instances per benchmark, the precise frontier-model baselines and their scores, or any statistical significance tests. These omissions are load-bearing for the headline result.
[§5.3] §5.3 (Curriculum RL subsection): The description of the curriculum strategy that bootstraps from limited initial data lacks ablation studies, intermediate performance curves, or controls that isolate the contribution of the curriculum versus simply scaling the synthetic data or RL steps. Without such evidence the claim that this strategy makes the non-linear class tractable remains under-supported.

minor comments (2)

[Abstract] The abstract would be more informative if it reported at least one quantitative metric (e.g., average accuracy or pass rate) alongside the qualitative “state-of-the-art or competitive” phrasing.
[§3] Notation for the reward function and the synthetic-data generation process could be made more explicit (e.g., by adding a short pseudocode block or a dedicated equation for the solver-feedback term).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will incorporate revisions to improve the clarity and rigor of the experimental sections.

read point-by-point responses

Referee: [§5] §5 (Experimental results): The central performance claims—that the 8B model reaches SOTA or competitive scores on six benchmarks and that the curriculum renders the non-linear class tractable—are presented without tables or text specifying benchmark definitions, exact evaluation metrics (e.g., formulation accuracy vs. solver success rate), number of test instances per benchmark, the precise frontier-model baselines and their scores, or any statistical significance tests. These omissions are load-bearing for the headline result.

Authors: We agree that the experimental results section requires additional explicit details to fully support the performance claims. In the revised manuscript we will add a summary table (and accompanying text) that defines each of the six benchmarks, states the precise evaluation metrics (formulation accuracy and solver success rate), reports the number of test instances per benchmark, lists the exact frontier-model baselines together with their scores, and includes statistical significance tests. These additions will make the headline results transparent and reproducible. revision: yes
Referee: [§5.3] §5.3 (Curriculum RL subsection): The description of the curriculum strategy that bootstraps from limited initial data lacks ablation studies, intermediate performance curves, or controls that isolate the contribution of the curriculum versus simply scaling the synthetic data or RL steps. Without such evidence the claim that this strategy makes the non-linear class tractable remains under-supported.

Authors: We acknowledge that stronger empirical validation of the curriculum RL strategy is needed. The revised version will include ablation studies comparing the curriculum approach against controls that scale synthetic data volume or RL steps without curriculum, as well as intermediate performance curves that illustrate the bootstrapping process on the non-linear physical-dynamics problems. These additions will isolate the curriculum's contribution and better substantiate the claim that it renders the class tractable. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical pipeline that generates synthetic training data from standard optimization problem forms and applies RL using external solver execution feedback as the reward signal. No equations, derivations, or self-referential metrics are presented in the provided text that reduce predictions or results to fitted inputs or self-citations by construction. The central claims rest on benchmark performance comparisons rather than internal consistency loops or ansatz smuggling. This is a standard data-generation-plus-external-verifier setup with no load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the assumption that verified synthetic data from standard forms plus RL with solver feedback can train generalizable autoformalization capabilities. No explicit free parameters, axioms, or invented entities are described.

axioms (1)

domain assumption Solver execution feedback provides a reliable and scalable reward signal for improving LLM formalization accuracy across problem categories.
Implicit in the RL post-training description and the claim that it makes non-linear problems tractable.

pith-pipeline@v0.9.0 · 5490 in / 1292 out tokens · 34208 ms · 2026-05-10T06:57:42.035454+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 46 canonical work pages · 15 internal anchors

[1]

Autoformulation of mathematical optimization models using llms.arXiv preprint arXiv:2411.01679, 2024

Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, and Mihaela van der Schaar. Auto- formulation of mathematical optimization models using llms, 2025. URL https: //arxiv.org/abs/2411.01679

work page arXiv 2025
[2]

Gekko op- timization suite.Processes, 6(8):106, 2018

Logan DR Beal, Daniel C Hill, R Abraham Martin, and John D Hedengren. Gekko op- timization suite.Processes, 6(8):106, 2018

2018
[3]

The SCIP optimization suite 9.0

Suresh Bolusani, Mathieu Besançon, Ksenia Bestuzheva, Antonia Chmiela, João Dioní- sio, Tim Donkiewicz, Jasper van Doorn- malen, Leon Eifler, Mohammed Ghan- nam, Ambros Gleixner, et al. The scip optimization suite 9.0.arXiv preprint arXiv:2402.17702, 2024

work page arXiv 2024
[4]

Chain-of-experts: When LLMs meet complex operations re- search problems

Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Min- gli Song, and Gang Chen. Chain-of-experts: When LLMs meet complex operations re- search problems. InThe Twelfth Interna- tional Conference on Learning Representa- tions, 2024. URLhttps://openreview. net/forum?id=HobyL1B9CZ

2024
[5]

Optimus: Optimization modeling using MIP solvers and large language models

Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. Optimus: Optimiza- tion modeling using mip solvers and large language models, 2023. URL https:// arxiv.org/abs/2310.06116

work page arXiv 2023
[6]

arXiv preprint arXiv:2503.10009 , year=

Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, and Chau Yuen. Or-llm- agent: Automating modeling and solving of operations research optimization problems with reasoning llm, 2025. URLhttps:// arxiv.org/abs/2503.10009

work page arXiv 2025
[7]

Llmopt: Learning to define and solve generaloptimizationproblemsfromscratch,

Caigao Jiang, Xiang Shu, Hong Qian, XingyuLu,JunZhou,AiminZhou,andYang Yu. Llmopt: Learning to define and solve generaloptimizationproblemsfromscratch,
[8]

URL https://arxiv.org/abs/ 2410.13213

work page arXiv
[9]

Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, and Jing Tang

Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, and Jing Tang. Optibench meets resocratic: Mea- sureandimprovellmsforoptimizationmod- eling, 2025. URL https://arxiv.org/ abs/2407.09887

work page arXiv 2025
[10]

Toward a trustworthy optimization modeling agent via verifiable synthetic data generation.arXiv preprint arXiv:2508.03117, 2025

Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, and Nianjun Zhou. Toward a trustworthy optimization modeling agent via verifiable synthetic data generation, 2025. URL https://arxiv. org/abs/2508.03117

work page arXiv 2025
[11]

Orlm: A customizable framework in training large models for automated optimization model- ing.Operations Research, 73(6):2986–3009, November 2025

Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. Orlm: A customizable framework in training large models for automated optimization model- ing.Operations Research, 73(6):2986–3009, November 2025. ISSN 1526-5463. doi: 10. 1287/opre.2024.1233. URL http://dx. doi.org/10.1287/opre.2024.1233

work page doi:10.1287/opre.2024.1233 2025
[12]

Evaluating llm reasoning in the op- erations research domain with orqa

Mahdi Mostajabdaveh, Timothy Tin Long Yu, Samarendra Chandan Bindu Dash, Rindra Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, and Yong Zhang. Evaluating llm reasoning in the op- erations research domain with orqa. InPro- ceedings of the AAAI Conference on Artifi- cial Intelligence, volume 39, pages 24902– 24910, 2025

2025
[13]

Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical rea- soning of llms, 2024

Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical rea- soning of llms, 2024. URL https: //arxiv.org/abs/2402.16352

work page arXiv 2024
[14]

Verification, the key to ai.on-line essay.[Online]

Richard S Sutton. Verification, the key to ai.on-line essay.[Online]. Available: http://www. cs. ualberta. ca/sutton/In- cIdeas/KeytoAI. html, 2001

2001
[15]

arXiv preprint arXiv:2412.02674 , year=

Yuda Song, Hanlin Zhang, Carson Eise- nach, Sham Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large 10 AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems language models, 2025. URL https:// arxiv.org/abs/2412.02674

work page arXiv 2025
[16]

Qwen3 technical report,

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, and Chang Gao. Qwen3 technical report,
[17]

URL https://arxiv.org/abs/ 2505.09388

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Nl4opt competition: Formulat- ing optimization problems based on their natural language descriptions

Rindranirina Ramamonjison, Timothy Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mosta- jabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, et al. Nl4opt competition: Formulat- ing optimization problems based on their natural language descriptions. InNeurIPS 2022 competition track, pages 189–203. PMLR, 2023

2022
[19]

Augmenting op- erations research with auto-formulation of optimization models from problem de- scriptions

Ramamonjison et al. Augmenting op- erations research with auto-formulation of optimization models from problem de- scriptions. InProceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing: Industry Track, pages 29–62, Abu Dhabi, UAE, December

2022
[20]

URL https://aclanthology

Association for Computational Lin- guistics. URL https://aclanthology. org/2022.emnlp-industry.4

2022
[21]

Mamo: a mathematical modeling benchmark with solvers,

Xuhan Huang, Qingning Shen, Yan Hu, An- ningzhe Gao, and Benyou Wang. Mamo: a mathematical modeling benchmark with solvers.arXiv preprint arXiv:2405.13144, 2024

work page arXiv 2024
[22]

A survey of optimization modeling meets llms: Progress and future directions.arXiv preprint arXiv:2508.10047,

Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, and Dongxiang Zhang. A survey of opti- mization modeling meets llms: Progress and future directions, 2025. URLhttps: //arxiv.org/abs/2508.10047

work page arXiv 2025
[23]

New directions for nonlin- ear process optimization.Current Opinion in Chemical Engineering, 21:32–40, 2018

Lorenz T Biegler. New directions for nonlin- ear process optimization.Current Opinion in Chemical Engineering, 21:32–40, 2018

2018
[24]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, and Guangming Sheng. Dapo: An open-source llm reinforcement learn- ing system at scale, 2025. URL https: //arxiv.org/abs/2503.14476

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

h1: Bootstrapping llms to reason over longer horizons via reinforcement learning

Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shi- tal Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping llms to reason over longer horizons via rein- forcement learning, 2025. URL https: //arxiv.org/abs/2510.07312

work page arXiv 2025
[26]

Pope: Learning to reason on hard prob- lems via privileged on-policy exploration,

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard prob- lems via privileged on-policy exploration,
[27]

URL https://arxiv.org/abs/ 2601.18779

work page arXiv
[28]

McGraw-Hill Education, 2014

Frederick Hillier and G Lieberman.Intro- duction to Operations Research with Student Access Card. McGraw-Hill Education, 2014

2014
[29]

Solverllm: Leveraging test-time scal- ing for optimization problem via llm-guided search, 2025

Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, and Haifeng Chen. Solverllm: Leveraging test-time scal- ing for optimization problem via llm-guided search, 2025. URLhttps://arxiv.org/ abs/2510.16916

work page arXiv 2025
[30]

Or-r1: Automating mod- eling and solving of operations research optimization problem via test-time rein- forcement learning, 2025

Zezhen Ding, Zhen Tan, Jiheng Zhang, and Tianlong Chen. Or-r1: Automating mod- eling and solving of operations research optimization problem via test-time rein- forcement learning, 2025. URL https: //arxiv.org/abs/2511.09092

work page arXiv 2025
[31]

The llama 3 herd of models,

Aaron Grattafiori, Abhimanyu Dubey, Ab- hinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, and Alex Vaughan. The llama 3 herd of models,
[32]

URL https://arxiv.org/abs/ 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv
[33]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muen- nighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic 11 AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems optimization, 2024. URLhttps://arxiv. org/abs/2402.01306

work page internal anchor Pith review arXiv 2024
[34]

Step-opt: Boosting 24 optimization modeling in llms through iterative data synthesis and structured validation.arXiv preprint arXiv:2506.17637,

Yang Wu, Yifan Zhang, Yurong Wu, Yu- ran Wang, Junkai Zhang, and Jian Cheng. Step-opt: Boosting optimization modeling in llms through iterative data synthesis and structured validation, 2025. URLhttps: //arxiv.org/abs/2506.17637

work page arXiv 2025
[35]

Opt- math: A scalable bidirectional data synthe- sis framework for optimization modeling,

HongliangLu, ZhonglinXie, YaoyuWu, Can Ren, Yuxuan Chen, and Zaiwen Wen. Opt- math: A scalable bidirectional data synthe- sis framework for optimization modeling,
[36]

URL https://arxiv.org/abs/ 2502.11102

work page arXiv
[37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps: //arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Proximal policy optimization algorithms,

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms,
[39]

URL https://arxiv.org/abs/ 1707.06347

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Available: http://dx.doi.org/10.1038/s41586-025-09422-z

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, and Wu. Deepseek-r1 incentivizes rea- soning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/ s41...

work page doi:10.1038/s41586-025-09422-z 2025
[41]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understand- ing r1-zero-like training: A critical perspec- tive, 2025. URL https://arxiv.org/ abs/2503.20783

work page Pith review arXiv 2025
[42]

Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms,

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhi- jian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms,
[43]

URL https://arxiv.org/abs/ 2506.14245

work page internal anchor Pith review arXiv
[44]

Hung Le, Yue Wang, Akhilesh Deepak Got- mare, Silvio Savarese, and Steven C. H. Hoi. Coderl: Mastering code generation through pretrained models and deep re- inforcement learning, 2022. URLhttps: //arxiv.org/abs/2207.01780

work page arXiv 2022
[45]

Stepcoder: Improve code generation with reinforcement learning from compiler feedback

Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, and Tao Gui. Stepcoder: Improve code generation with reinforcement learning from compiler feed- back, 2024. URL https://arxiv.org/ abs/2402.01391

work page arXiv 2024
[46]

Olympiad-level formal mathematical rea- soning with reinforcement learning.Nature, pages 1–3, 2025

Thomas Hubert, Rishi Mehta, Laurent Sar- tran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical rea- soning with reinforcement learning.Nature, pages 1–3, 2025

2025
[47]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

ZZ Ren, Zhihong Shao, Junxiao Song, Hua- jian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advanc- ing formal mathematical reasoning via re- inforcement learning for subgoal decom- position.arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review arXiv 2025
[48]

Optimization of pump configurations as a minlp problem.Comput- ers & chemical engineering, 18(9):845–858, 1994

Tapio Westerlund, Frank Pettersson, and Ig- nacio E Grossmann. Optimization of pump configurations as a minlp problem.Comput- ers & chemical engineering, 18(9):845–858, 1994

1994
[49]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low- rank adaptation of large language models,
[50]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/ 2106.09685. 12 AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Lora without regret.Think- ing Machines Lab: Connectionism,

John Schulman and Thinking Machines Lab. Lora without regret.Think- ing Machines Lab: Connectionism,
[52]

LoRA without regret

doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

work page doi:10.64434/tml.20250929
[53]

Tang, B.; Khalil, E

Bo Tang, Elias B Khalil, and Ján Drgoňa. Learning to optimize for mixed-integer non- linear programming with feasibility guar- antees.arXiv preprint arXiv:2410.11061, 2024

work page arXiv 2024
[54]

Springer Science & Business Media, 2013

Christodoulos A Floudas, Panos M Parda- los, Claire Adjiman, William R Esposito, Zeynep H Gümüs, Stephen T Harding, John L Klepeis, Clifford A Meyer, and Carl A Schweiger.Handbook of test problems in local and global optimization, volume 33. Springer Science & Business Media, 2013

2013
[55]

React: Synergizing rea- soning and acting in language models,

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing rea- soning and acting in language models,
[56]

URL https://arxiv.org/abs/ 2210.03629

work page internal anchor Pith review Pith/arXiv arXiv
[57]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexan- der Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https: //arxiv.org/abs/2405.15793

work page internal anchor Pith review arXiv 2024
[58]

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Belinda Z. Li, Been Kim, and Zi Wang. Questbench: Can llms ask the right ques- tion to acquire information in reasoning tasks?, 2025. URLhttps://arxiv.org/ abs/2503.22674

work page arXiv 2025
[59]

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation, 2025. URLhttps: //arxiv.org/abs/2505.06120

work page internal anchor Pith review arXiv 2025
[60]

Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025

Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn rea- soning in llm agents via turn-level reward design, 2025. URLhttps://arxiv.org/ abs/2505.11821

work page arXiv 2025
[61]

Agentic reasoning and tool integration for llms via reinforcement learning.arXiv preprint arXiv:2505.01441,

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic rea- soning and tool integration for llms via re- inforcement learning, 2025. URLhttps: //arxiv.org/abs/2505.01441

work page arXiv 2025
[62]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategictooluseinllms,2025. URL https: //arxiv.org/abs/2504.11536

work page internal anchor Pith review arXiv 2025
[63]

Mixed integer linear programmingformulationtechniques.Siam Review, 57(1):3–57, 2015

Juan Pablo Vielma. Mixed integer linear programmingformulationtechniques.Siam Review, 57(1):3–57, 2015

2015
[64]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and AviralKumar. Scalingllmtest-timecompute optimally can be more effective than scaling model parameters, 2024. URLhttps:// arxiv.org/abs/2408.03314

work page Pith review arXiv 2024
[65]

TRL: Transformers Reinforcement Learn- ing, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learn- ing, 2020. URL https://github.com/ huggingface/trl

2020
[66]

Verifiers: Envi- ronments for llm reinforcement learning

William Brown. Verifiers: Envi- ronments for llm reinforcement learning. https://github.com/ PrimeIntellect-ai/verifiers, 2025

2025
[67]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schu- urmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025. URLhttps://arxiv. org/abs/2501.17161

work page internal anchor Pith review arXiv 2025
[68]

Jie Chen, Xintian Han, Yu Ma, Xun Zhou, and Liang Xiang. Unlock the correlation between supervised fine-tuning and rein- forcement learning in training code large 13 AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems language models, 2024. URL https:// arxiv.org/abs/2406.10305

work page arXiv 2024
[69]

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe- rl: Advancing llm reasoning via reinforce- ment learning on open software evolution,
[70]

URL https://arxiv.org/abs/ 2502.18449

work page arXiv
[71]

A deep dive into scaling rl for code generation with synthetic data and curricula, 2026

Cansu Sancaktar, David Zhang, Gabriel Syn- naeve, and Taco Cohen. A deep dive into scaling rl for code generation with synthetic data and curricula, 2026. URL https: //arxiv.org/abs/2603.24202

work page arXiv 2026
[72]

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. Compute as teacher: Turning in- ference compute into reference-free super- vision, 2026. URLhttps://arxiv.org/ abs/2509.14234. 14 AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems A. Appendix A.1. Add...

work page internal anchor Pith review arXiv 2026
[73]

""Solves the farming cooperative linear optimization problem

applied GRPO with proof-checker rewards for theorem proving. A.2. Training Details All experiments use Qwen3-8B (which is an instruction tuned model) [15] as the base model with LoRA adapters [41] and bfloat16 precision on NVIDIA A100 GPUs. Single-turn categories (LP, MILP, NLP) are trained with Dr. GRPO [34] via TRL [54]; adapters are merged between curr...

2000