
arxiv: 2605.28158 · v1 · pith:KHTGZXKXnew · submitted 2026-05-27 · 💻 cs.AI

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Pith reviewed 2026-06-29 12:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · operations research · benchmark · optimization modeling · workspace · model construction · model revision · grounded explanation

The pith

OR-Space supplies persistent workspaces with build, revise, and explain tasks to test LLM agents on industrial optimization work.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OR-Space as a benchmark for LLM agents in operations research that supplies executable workspaces containing business documents, structured data, code artifacts, solver outputs, and evaluators spread across interdependent files. It defines three task modes: Build, in which agents construct solver-ready models from heterogeneous artifacts; Revise, in which agents update models under changing requirements or solver feedback while keeping prior logic valid; and Explain, in which agents answer questions about solutions and business implications using evidence distributed across the workspace. This setup addresses the limitation of existing benchmarks that reduce evaluation to one-shot translation from a self-contained problem statement. A sympathetic reader would care because real industrial OR workflows involve ongoing maintenance and interpretation rather than isolated text generation.

Core claim

OR-Space consists of executable workspaces containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators. It evaluates agents across three modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.

What carries the argument

Persistent multi-artifact workspaces paired with the three task modes Build, Revise, and Explain.

If this is right

  • Agents must handle interdependent files rather than self-contained problem statements.
  • Revision tasks reveal whether agents can preserve valid prior logic when requirements change.
  • Explanation tasks require agents to locate and combine evidence distributed across multiple artifacts.
  • The benchmark enables systematic study of failure modes that appear only in multi-stage industrial workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent architectures will likely need explicit file-system access and persistent state to reach high performance on these tasks.
  • The workspace design could be adapted to create analogous full-lifecycle benchmarks in adjacent fields such as software engineering or supply-chain simulation.
  • Adoption would shift evaluation emphasis from end-to-end generation accuracy toward reliability across repeated interactions with evolving artifacts.

Load-bearing premise

The defined task modes and workspace structure sufficiently capture the characteristics of real industrial OR workflows.

What would settle it

If agent performance rankings and error patterns on OR-Space turn out to be nearly identical to those on existing one-shot formulation benchmarks, the added value of lifecycle-oriented workspace evaluation would be undercut.

read the original abstract

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces OR-Space, a benchmark for LLM agents performing industrial optimization tasks. It addresses limitations of existing one-shot benchmarks by using persistent multi-artifact workspaces (business documents, structured data, code, solver outputs, evaluators) and three lifecycle task modes: Build (construct solver-ready models from heterogeneous artifacts), Revise (modify models under changing requirements or feedback while preserving prior logic), and Explain (answer grounded questions about solutions and implications using evidence across artifacts). The manuscript describes the benchmark design, evaluation protocol, and quality-control pipeline, positioning it as a tool to study agent reliability in realistic OR workflows.

Significance. If the workspace artifacts and task modes are shown to reflect actual industrial OR characteristics and the quality-control pipeline is demonstrated to produce reliable evaluations, OR-Space could fill an important gap by enabling assessment of multi-stage agent performance beyond single-pass text generation. As presented, the contribution is primarily conceptual, highlighting workflow persistence and lifecycle aspects not captured in prior benchmarks.

major comments (1)
  1. [Abstract] Abstract: The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.
minor comments (1)
  1. [Abstract] The abstract references a 'quality-control pipeline' without any description of its concrete mechanisms, metrics, or how it mitigates evaluator unreliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on OR-Space. The major comment is addressed point-by-point below, with planned revisions noted.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.

    Authors: We agree that the abstract's phrasing implies a stronger grounding in industrial practice than the manuscript explicitly demonstrates. The artifact types and task modes were selected to reflect recurring elements described in OR literature (e.g., multi-file projects involving data, models, and solver feedback), but the current version provides no dedicated derivation section, practitioner interviews, or direct case-study mapping. We will revise the abstract to state the benchmark's scope more precisely as a tool for studying multi-stage agent behavior rather than claiming comprehensive coverage of all industrial OR characteristics. We will also add a short subsection in the benchmark design section outlining the rationale for the chosen artifacts and modes, drawn from standard workflow descriptions, and explicitly note the absence of external validation as a limitation. These changes will align the claims with the primarily conceptual contribution while preserving the benchmark's intended use for evaluating agent reliability across lifecycle stages. revision: yes

Circularity Check

0 steps flagged

Benchmark proposal with no derivation chain or self-referential predictions

full rationale

The paper is a benchmark proposal that defines workspace artifacts (business documents, data, code, solver outputs, evaluators) and three task modes (Build/Revise/Explain) to evaluate LLM agents on persistent, multi-stage OR workflows. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The description of the benchmark design and protocol does not reduce any claim to its own inputs by construction, nor does it rely on self-citations for load-bearing justification. This is a standard design document whose validity rests on external validation (not present here) rather than internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper proposes a benchmark rather than deriving a result from first principles or fitting parameters; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5779 in / 1078 out tokens · 29501 ms · 2026-06-29T12:49:43.551410+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 38 canonical work pages · 13 internal anchors

  1. [1]

    OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models, 2024

    Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models, 2024. URL https://arxiv.org/abs/2402.10172

  2. [2]

    Croissant: A metadata format for ML-ready datasets

    Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, and Carole-Jean Wu. Croissant...

  3. [3]

    MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095. OpenAI; accepted at ICLR 2025

  4. [4]

    Chatbot Arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of PMLR, pages 83...

  5. [5]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718

  6. [6]

    JuMP: A modeling language for mathematical optimization

    Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017. doi: 10.1137/15M1020575. URL https://doi.org/10.1137/15M1020575

  7. [7]

    AMPL: A Modeling Language for Mathematical Programming

    Robert Fourer, David M. Gay, and Brian W. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Duxbury Press, 2nd edition, 2002. URL https://ampl.com/resources/the-ampl-book/

  8. [8]

    Cardinal optimizer (COPT) user guide,

    Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal optimizer (COPT) user guide,

  9. [9]

    URL https://arxiv.org/abs/2208.14314

    URL https://arxiv.org/abs/2208.14314.

  10. [10]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, December 2021. doi: 10.1145/3458723. URL https://doi.org/10.1145/3458723

  11. [11]

    Gurobi Optimizer Reference Manual, 2024

    Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com

  12. [12]

    Pyomo: Modeling and solving mathematical programs in Python

    William E. Hart, Jean-Paul Watson, and David L. Woodruff. Pyomo: Modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011. doi: 10.1007/s12532-011-0026-8. URL https://doi.org/10.1007/s12532-011-0026-8

  13. [13]

    EvoOpt-LLM: Evolving industrial optimization models with large language models, 2026

    Yiliu He, Tianle Li, Binghao Ji, Zhiyuan Liu, and Di Huang. EvoOpt-LLM: Evolving industrial optimization models with large language models, 2026. URL https://arxiv.org/abs/2602.01082

  14. [14]

    ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling

    Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling. Operations Research, 73(6):2986–3009, November 2025. ISSN 0030-364X. doi: 10.1287/opre.2024.1233. URL https://pubsonline.informs.org/doi/10.1287/opre.2024.1233

  15. [15]

    InvEvolve: Evolving white-box inventory policies via large language models with performance guarantees,

    Chenyu Huang, Jianghao Lin, Zhengyang Tang, Bo Jiang, Ruoqing Jiang, Benyou Wang, and Lai Wei. InvEvolve: Evolving white-box inventory policies via large language models with performance guarantees,

  16. [16]

    URL https://arxiv.org/abs/2605.00369

  17. [17]

    LLMs for mathematical modeling: Towards bridging the gap between natural and mathematical languages, 2025

    Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. LLMs for mathematical modeling: Towards bridging the gap between natural and mathematical languages, 2025. URL https://arxiv.org/abs/2405.13144. Findings of NAACL 2025

  18. [18]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024. URL http://arxiv.org/abs/2310.06770. arXiv:2310.06770 [cs]

  19. [19]

    Prometheus: Inducing fine-grained evaluation capability in language models, 2023

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models, 2023. URL https://arxiv.org/abs/2310.08491. ICLR 2024

  20. [20]

    Large language models for supply chain optimization, 2023

    Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, and Ishai Menache. Large language models for supply chain optimization, 2023. URL https://arxiv.org/abs/2307.03875

  21. [21]

    Constructing Industrial-Scale Optimization Modeling Benchmark

    Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, and Zaiwen Wen. Constructing industrial-scale optimization modeling benchmark, 2026. URL https://arxiv.org/abs/2602.10450

  22. [22]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

  23. [23]

    From soliloquy to agora: Memory-enhanced LLM agents with decentralized debate for optimization modeling,

    Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, and Dongdong Ge. From soliloquy to agora: Memory-enhanced LLM agents with decentralized debate for optimization modeling,

  24. [24]
  25. [25]

    Position: The real barrier to LLM agent usability is agentic ROI, 2025

    Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang. Position: The real barrier to LLM agent usability is agentic ROI, 2025. URL https://arxiv.org/abs/2505.17767.

  26. [26]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2025. URL http://arxiv.org/abs/2308.03688. ar...

  27. [27]

    G-Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, December 2023. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153/

  28. [28]

    OptMATH: A scalable bidirectional data synthesis framework for optimization modeling, 2025

    Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen. OptMATH: A scalable bidirectional data synthesis framework for optimization modeling, 2025. URL https://arxiv.org/abs/2502.11102

  29. [29]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/abs/2311.12983. arXiv:2311.12983; accepted at ICLR 2024

  30. [30]

    Evaluating LLM reasoning in the operations research domain with ORQA, 2025

    Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, and Yong Zhang. Evaluating LLM reasoning in the operations research domain with ORQA, 2025. URL https://arxiv.org/abs/2412.17874. AAAI 2025

  31. [31]

    Data Cards: Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and transparent dataset documentation for responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), pages 1776–1826, 2022. doi: 10.1145/3531146.3533231. URL https://doi.org/10.1145/3531146.3533231

  32. [32]

    NL4Opt competition: Formulating optimization problems based on their natural language descriptions, 2023

    Rindranirina Ramamonjison, Timothy T. Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. NL4Opt competition: Formulating optimization problems based on their natural language descriptions, 2023. URL https://arxiv.org/abs/2303.08233

  33. [33]

    Large language models are inconsistent and biased evaluators, 2024

    Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators, 2024. URL https://arxiv.org/abs/2405.01724

  34. [34]

    Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges, 2024

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges, 2024. URL https://arxiv.org/abs/2406.12624

  35. [35]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024. URL https://arxiv.org/abs/2407.18901. ACL 2024

  36. [36]

    Large language models are not fair evaluators, 2023

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023. URL https://arxiv.org/abs/2305.17926

  37. [37]

    Chain-of-Experts: When LLMs meet complex operations research problems

    Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, and Gang Chen. Chain-of-Experts: When LLMs meet complex operations research problems. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=HobyL1B9CZ. Introduces ...

  38. [38]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/240...

  39. [39]

    TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024

    Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URL htt...

  40. [40]

    A survey of AI agent protocols, 2025

    Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, and Weinan Zhang. A survey of AI agent protocols, 2025. URL https://arxiv.org/abs/2504.16736

  41. [41]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045. arXiv:2406.12045

  42. [42]

    MemDecoder: Enhancing test-time compute for LLM agents via reinforced memory decoding

    Haoran Yin, Chenyu Zhou, Wei Zhu, and Yuhua Jin. MemDecoder: Enhancing test-time compute for LLM agents via reinforced memory decoding. In The Forty-Third International Conference on Machine Learning, 2026. URL https://icml.cc/virtual/2026/poster/65523

  43. [43]

    OR-LLM-Agent: Automating modeling and solving of operations research optimization problems with reasoning LLM, 2025

    Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, and Chau Yuen. OR-LLM-Agent: Automating modeling and solving of operations research optimization problems with reasoning LLM, 2025. URL https://arxiv.org/abs/2503.10009

  44. [44]

    OptiMind: Teaching LLMs to think like optimization experts,

    Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, and Sirui Li. OptiMind: Teaching LLMs to think like optimization experts,

  45. [45]

    URL https://arxiv.org/abs/2509.22979

  46. [46]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.org...

  47. [47]

    StepORLM: A self-evolving framework with generative process supervision for operations research language models, 2025

    Chenyu Zhou, Tianyi Xu, Jianghao Lin, and Dongdong Ge. StepORLM: A self-evolving framework with generative process supervision for operations research language models, 2025. URL https://arxiv.org/abs/2509.22558

  48. [48]

    Auto-formulating dynamic programming problems with large language models, 2025

    Chenyu Zhou, Jingyuan Yang, Linwei Xin, Yitian Chen, Ziyan He, and Dongdong Ge. Auto-formulating dynamic programming problems with large language models, 2025. URL https://arxiv.org/abs/2507.11737

  49. [49]

    Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering,

    Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, Zeyu Zheng, Zhuosheng Zhang, Xingyu Lou, Changwang Zhang, Zhihui Fu, Jun Wang, Weiwen Liu, Jianghao Lin, and Weinan Zhang. Externalization in LLM agents: A unified review of memory, skills, protocols an...

  50. [50]

    URL https://arxiv.org/abs/2604.08224

  51. [51]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents, April 2024. URL http://arxiv.org/abs/2307.13854. arXiv:2307.13854 [cs]

  52. [52]

    Evolutionary perspectives on the evaluation of LLM-based AI agents: A comprehensive survey, 2025

    Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, and Weinan Zhang. Evolutionary perspectives on the evaluation of LLM-based AI agents: A comprehensive survey, 2025. URL https://arxiv.org/abs/2506.11102.

  53. [53]

    docs/business_requirement.md restates the problem in business voice; every numeric parameter is quoted but NOT duplicated as a raw table

  54. [54]

    Use general_parameters.csv for scalars and table_{k}.csv for indexed data

    data/*.csv hold all numeric parameters. Use general_parameters.csv for scalars and table_{k}.csv for indexed data

  55. [55]

    src/current_heuristic.py reads CSVs from ../data/, builds a PuLP model, solves with commercial solver backends (e.g., via pulp.GUROBI_CMD or pulp.COPT), and prints the single final line OBJECTIVE_VALUE: <value>

  56. [56]

    Helper math (e.g. big-M derivation, index construction) lives in src/utils.py and is imported by current_heuristic.py

    Helper math (e.g. big-M derivation, index construction) lives in src/utils.py and is imported by current_heuristic.py

  57. [57]

    docs": {...},

    Running cd src && python current_heuristic.py must reproduce the ground-truth objective within $10^{-3}$ relative tolerance. Return strict JSON: { "docs": {...}, "data": {...}, "src": {...}, "run": { "run.sh": "cd src && python current_heuristic.py" }, "evaluation": { "ground_truth": <float>, "tolerance": 0.01 } } The $10^{-3}$ condition in P1 is a generation-time ...

  58. [58]

    HIT=1 only if the anchor fact is present AND used in the right context; a coincidental number inside an unrelated phrase is NOT a hit

  59. [59]

    MISS=0 if the fact is absent, negated, or only vaguely gestured at

  60. [60]

    Strict on numeric anchors: the exact number (or an algebraically equivalent expression) must appear

  61. [61]

    Lenient on surface form: synonyms / paraphrases / symbolic notation are accepted if meaning is identical. B.1.3. Evaluation Prompts. P7. Build/Revise-M code evaluator. For the headline benchmark we ask the model-under-test to produce a solver-agnostic PuLP script that exposes build_problem() → pulp.LpProblem (no solve-call). The runner then attaches the configu...

  62. [62]

    Read data ONLY from ./data/<filename>

  63. [63]

    Use pulp, pandas, and the Python standard library only

  64. [64]

    Do NOT call prob.solve() inside it

    Define build_problem() that returns a populated pulp.LpProblem. Do NOT call prob.solve() inside it

  65. [65]

    At module top level, define PROBLEM = build_problem()

  66. [66]

    The runner attaches different solvers; just produce the model

  67. [67]

    question

    Return ONLY Python source code. No markdown fences, no commentary, no language tag, no JSON. P8. Revise-B workspace agent. The Revise-B setting materialises the workspace on disk; the agent sees ./docs/, ./data/, ./src/ and must write a new user_model.py. Compared to P7, the input is the business-voice description (no meta-language) plus the original current_h...

  68. [68]

    numeric: an exact value (with unit) that must appear

  69. [69]

    entity: the correct variable, constraint name, or business term

  70. [70]

    because ... therefore

    causal: the reasoning link (“because ... therefore ...”). Return strict JSON: { "question": "...", "gold_answer": "...", "rubric_anchors": [ { "type": "numeric", "text": "...", "regex": "..." }, { "type": "entity", "text": "...", "regex": "..." }, { "type": "causal", "text": "...", "regex": null } ] } The regex is optional and used for a cheap automatic h...
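
Entries 55, 57, and 68 to 70 above describe two mechanical checks: comparing the final `OBJECTIVE_VALUE: <value>` line against a ground truth within a relative tolerance, and running an optional regex per rubric anchor. The following is a minimal Python sketch of both, under stated assumptions. The function names (`parse_objective`, `within_relative_tolerance`, `regex_hits`), the sample anchors, and the sample answer are all hypothetical and do not come from the paper; the fallback to absolute error when the ground truth is zero is also an assumption. It is not the benchmark's actual evaluator.

```python
import re


def parse_objective(stdout: str) -> float:
    """Read the value from the final 'OBJECTIVE_VALUE: <value>' line of a run."""
    last_line = stdout.strip().splitlines()[-1].strip()
    match = re.fullmatch(r"OBJECTIVE_VALUE:\s*(\S+)", last_line)
    if match is None:
        raise ValueError("final line is not 'OBJECTIVE_VALUE: <value>'")
    return float(match.group(1))


def within_relative_tolerance(value: float, ground_truth: float, tolerance: float) -> bool:
    """Relative-error check. Assumption: use absolute error if ground truth is 0."""
    scale = abs(ground_truth) if ground_truth != 0 else 1.0
    return abs(value - ground_truth) / scale <= tolerance


def regex_hits(answer: str, rubric_anchors: list) -> list:
    """Per anchor: 1 if its regex matches, 0 if it does not, None if regex is null.

    A regex match is only a cheap pre-check; it cannot tell whether the fact
    is used in the right context, so it does not replace the HIT/MISS judgment.
    """
    results = []
    for anchor in rubric_anchors:
        pattern = anchor.get("regex")
        if pattern is None:
            results.append(None)  # e.g. causal anchors with "regex": null
        else:
            results.append(1 if re.search(pattern, answer) else 0)
    return results


# Made-up data for illustration only.
anchors = [
    {"type": "numeric", "text": "120 units", "regex": r"\b120\s*units\b"},
    {"type": "entity", "text": "capacity constraint", "regex": r"capacity constraint"},
    {"type": "causal", "text": "because ... therefore ...", "regex": None},
]
answer = "Plant A ships 120 units because the capacity constraint is binding."

objective = parse_objective("solver log line\nOBJECTIVE_VALUE: 1500.5\n")
print(within_relative_tolerance(objective, 1500.0, tolerance=0.01))
print(regex_hits(answer, anchors))
```

Anchors whose regex is null, such as the causal type in entry 70, fall through as `None` here, so a separate judging step would still be needed to assign them a score.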