OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents
Pith reviewed 2026-06-29 12:49 UTC · model grok-4.3
The pith
OR-Space supplies persistent workspaces with build, revise, and explain tasks to test LLM agents on industrial optimization work.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OR-Space consists of executable workspaces containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators. It evaluates agents across three modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.
What carries the argument
Persistent multi-artifact workspaces paired with the three task modes Build, Revise, and Explain.
If this is right
- Agents must handle interdependent files rather than self-contained problem statements.
- Revision tasks reveal whether agents can preserve valid prior logic when requirements change.
- Explanation tasks require agents to locate and combine evidence distributed across multiple artifacts.
- The benchmark enables systematic study of failure modes that appear only in multi-stage industrial workflows.
Where Pith is reading between the lines
- Agent architectures will likely need explicit file-system access and persistent state to reach high performance on these tasks.
- The workspace design could be adapted to create analogous full-lifecycle benchmarks in adjacent fields such as software engineering or supply-chain simulation.
- Adoption would shift evaluation emphasis from end-to-end generation accuracy toward reliability across repeated interactions with evolving artifacts.
Load-bearing premise
The defined task modes and workspace structure sufficiently capture the characteristics of real industrial OR workflows.
What would settle it
If agent performance rankings and error patterns on OR-Space turn out to be nearly identical to those on existing one-shot formulation benchmarks, the added value of lifecycle-oriented workspace evaluation would be undercut.
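One way to make "nearly identical rankings" concrete is a rank-correlation statistic over agents scored on both kinds of benchmark. The sketch below computes Kendall's tau-a in plain Python. The function name, the agent labels, and the scores are all hypothetical placeholders rather than results from the paper, and the sketch covers only the ranking half of the criterion, not error patterns.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau-a over agents present in both score tables.

    Tied pairs count as neither concordant nor discordant.
    """
    agents = sorted(set(scores_a) & set(scores_b))
    if len(agents) < 2:
        raise ValueError("need at least two shared agents")
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder scores, invented purely to exercise the function.
one_shot = {"agent_a": 0.9, "agent_b": 0.7, "agent_c": 0.5}
lifecycle = {"agent_a": 0.4, "agent_b": 0.6, "agent_c": 0.2}
tau = kendall_tau(one_shot, lifecycle)
```

A tau close to 1 across many agents would point toward the two benchmark styles ordering agents the same way; how close counts as "nearly identical" is a judgment the source does not specify.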
Original abstract
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OR-Space, a benchmark for LLM agents performing industrial optimization tasks. It addresses limitations of existing one-shot benchmarks by using persistent multi-artifact workspaces (business documents, structured data, code, solver outputs, evaluators) and three lifecycle task modes: Build (construct solver-ready models from heterogeneous artifacts), Revise (modify models under changing requirements or feedback while preserving prior logic), and Explain (answer grounded questions about solutions and implications using evidence across artifacts). The manuscript describes the benchmark design, evaluation protocol, and quality-control pipeline, positioning it as a tool to study agent reliability in realistic OR workflows.
Significance. If the workspace artifacts and task modes are shown to reflect actual industrial OR characteristics and the quality-control pipeline is demonstrated to produce reliable evaluations, OR-Space could fill an important gap by enabling assessment of multi-stage agent performance beyond single-pass text generation. As presented, the contribution is primarily conceptual, highlighting workflow persistence and lifecycle aspects not captured in prior benchmarks.
major comments (1)
- [Abstract] The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.
minor comments (1)
- [Abstract] The abstract references a 'quality-control pipeline' without any description of its concrete mechanisms, metrics, or how it mitigates evaluator unreliability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on OR-Space. The major comment is addressed point-by-point below, with planned revisions noted.
- Referee: [Abstract] The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.
Authors: We agree that the abstract's phrasing implies a stronger grounding in industrial practice than the manuscript explicitly demonstrates. The artifact types and task modes were selected to reflect recurring elements described in OR literature (e.g., multi-file projects involving data, models, and solver feedback), but the current version provides no dedicated derivation section, practitioner interviews, or direct case-study mapping. We will revise the abstract to state the benchmark's scope more precisely as a tool for studying multi-stage agent behavior rather than claiming comprehensive coverage of all industrial OR characteristics. We will also add a short subsection in the benchmark design section outlining the rationale for the chosen artifacts and modes, drawn from standard workflow descriptions, and explicitly note the absence of external validation as a limitation. These changes will align the claims with the primarily conceptual contribution while preserving the benchmark's intended use for evaluating agent reliability across lifecycle stages.
revision: yes
Circularity Check
Benchmark proposal with no derivation chain or self-referential predictions
The paper is a benchmark proposal that defines workspace artifacts (business documents, data, code, solver outputs, evaluators) and three task modes (Build/Revise/Explain) to evaluate LLM agents on persistent, multi-stage OR workflows. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The description of the benchmark design and protocol does not reduce any claim to its own inputs by construction, nor does it rely on self-citations for load-bearing justification. This is a standard design document whose validity rests on external validation (not present here) rather than internal circular reduction.
Prompt excerpts
- docs/business_requirement.md restates the problem in business voice; every numeric parameter is quoted but NOT duplicated as a raw table.
- data/*.csv hold all numeric parameters. Use general_parameters.csv for scalars and table_{k}.csv for indexed data.
- src/current_heuristic.py reads CSVs from ../data/, builds a PuLP model, solves with commercial solver backends (e.g., via pulp.GUROBI_CMD or pulp.COPT), and prints the single final line OBJECTIVE_VALUE: <value>.
- Helper math (e.g. big-M derivation, index construction) lives in src/utils.py and is imported by current_heuristic.py.
- Running cd src && python current_heuristic.py must reproduce the ground-truth objective within 10^-3 relative tolerance.
Return strict JSON: { "docs": {...}, "data": {...}, "src": {...}, "run": { "run.sh": "cd src && python current_heuristic.py" }, "evaluation": { "ground_truth": <float>, "tolerance": 0.01 } }
The 10^-3 condition in P1 is a generation-time ...
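The reproduction requirement above (a final OBJECTIVE_VALUE: <value> line that must match the ground truth within a relative tolerance) can be sketched as a small checker. This is a minimal illustration, not the benchmark's evaluator: the function names are hypothetical, the tolerance is passed in as a parameter because the excerpt mentions both 10^-3 and a "tolerance" field of 0.01, and the fallback scale used when the ground truth is zero is an assumption.

```python
import re

# Matches a line of the form "OBJECTIVE_VALUE: <value>".
OBJECTIVE_LINE = re.compile(r"^OBJECTIVE_VALUE:\s*(\S+)\s*$")


def parse_objective(stdout: str) -> float:
    """Read the objective from the final non-empty line of captured stdout."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output")
    match = OBJECTIVE_LINE.match(lines[-1])
    if match is None:
        raise ValueError("final line is not an OBJECTIVE_VALUE line")
    return float(match.group(1))


def within_tolerance(value: float, ground_truth: float, tolerance: float) -> bool:
    """Relative-error check against the ground truth.

    Assumption: when the ground truth is exactly 0, the error is compared
    on an absolute scale of 1.0 instead of dividing by zero.
    """
    scale = abs(ground_truth) if ground_truth != 0 else 1.0
    return abs(value - ground_truth) <= tolerance * scale


# Usage with made-up solver output.
captured = "Reading data...\nSolving...\nOBJECTIVE_VALUE: 1234.5\n"
objective = parse_objective(captured)
ok = within_tolerance(objective, ground_truth=1234.0, tolerance=1e-3)
```

In practice the stdout string would come from running the workspace's run.sh; that step is omitted here to keep the sketch self-contained.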
- HIT=1 only if the anchor fact is present AND used in the right context; a coincidental number inside an unrelated phrase is NOT a hit.
- MISS=0 if the fact is absent, negated, or only vaguely gestured at.
- Strict on numeric anchors: the exact number (or an algebraically equivalent expression) must appear.
- Lenient on surface form: synonyms / paraphrases / symbolic notation are accepted if meaning is identical.
B.1.3. Evaluation Prompts
P7. Build/Revise-M code evaluator. For the headline benchmark we ask the model-under-test to produce a solver-agnostic PuLP script that exposes build_problem() → pulp.LpProblem (no solve-call). The runner then attaches the configu...
- Read data ONLY from ./data/<filename>.
- Use pulp, pandas, and the Python standard library only.
- Define build_problem() that returns a populated pulp.LpProblem. Do NOT call prob.solve() inside it.
- At module top level, define PROBLEM = build_problem().
- The runner attaches different solvers; just produce the model.
- Return ONLY Python source code. No markdown fences, no commentary, no language tag, no JSON.
P8. Revise-B workspace agent. The Revise-B setting materialises the workspace on disk; the agent sees ./docs/, ./data/, ./src/ and must write a new user_model.py. Compared to P7, the input is the business-voice description (no meta-language) plus the original current_h...
- numeric: an exact value (with unit) that must appear.
- entity: the correct variable, constraint name, or business term.
- causal: the reasoning link (“because ... therefore ...”).
Return strict JSON: { "question": "...", "gold_answer": "...", "rubric_anchors": [ { "type": "numeric", "text": "...", "regex": "..." }, { "type": "entity", "text": "...", "regex": "..." }, { "type": "causal", "text": "...", "regex": null } ] }
The regex is optional and used for a cheap automatic h...
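The optional regex on each rubric anchor suggests a two-tier check: a pattern match where a regex is supplied, and a deferral to a judge where it is null. The sketch below illustrates that reading. The function name, the anchors, and the answer text are hypothetical placeholders. Since the rubric excerpt says a coincidental number in an unrelated phrase is not a hit, a regex match here is treated only as "pattern found", not as a final HIT.

```python
import re
from typing import Optional


def regex_precheck(anchor: dict, answer: str) -> Optional[bool]:
    """Cheap pattern check for one rubric anchor.

    Returns True or False when the anchor carries a regex, and None when
    the regex is missing or null, meaning a judge still has to decide.
    """
    pattern = anchor.get("regex")
    if pattern is None:
        return None
    return re.search(pattern, answer) is not None


# Placeholder anchors in the shape of the JSON schema above.
anchors = [
    {"type": "numeric", "text": "120 units", "regex": r"\b120\s*units\b"},
    {"type": "entity", "text": "capacity constraint", "regex": r"capacity\s+constraint"},
    {"type": "causal", "text": "demand exceeds supply, so capacity binds", "regex": None},
]
answer = "The capacity constraint binds at 120 units because demand exceeds supply."

results = [regex_precheck(anchor, answer) for anchor in anchors]
needs_judge = [a["type"] for a, r in zip(anchors, results) if r is None]
```

Anchors that return None, and arguably those that return True as well, would still go through the contextual HIT/MISS judgment described in the rubric.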