OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Chenyu Zhou; Dongdong Ge; Jianghao Lin; Jiangyue Zhao; Xinyun Lu; Yinyu Ye

arxiv: 2605.28158 · v1 · pith:KHTGZXKXnew · submitted 2026-05-27 · 💻 cs.AI


Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

Pith reviewed 2026-06-29 12:49 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · operations research · benchmark · optimization modeling · workspace · model construction · model revision · grounded explanation


The pith

OR-Space supplies persistent workspaces with build, revise, and explain tasks to test LLM agents on industrial optimization work.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OR-Space as a benchmark for LLM agents in operations research that supplies executable workspaces containing business documents, structured data, code artifacts, solver outputs, and evaluators spread across interdependent files. It defines three task modes: Build, in which agents construct solver-ready models from heterogeneous artifacts; Revise, in which agents update models under changing requirements or solver feedback while keeping prior logic valid; and Explain, in which agents answer questions about solutions and business implications using evidence distributed across the workspace. This setup addresses the limitation of existing benchmarks that reduce evaluation to one-shot translation from a self-contained problem statement. A sympathetic reader would care because real industrial OR workflows involve ongoing maintenance and interpretation rather than isolated text generation.

Core claim

OR-Space consists of executable workspaces containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators. It evaluates agents across three modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.

What carries the argument

Persistent multi-artifact workspaces paired with the three task modes Build, Revise, and Explain.

If this is right

Agents must handle interdependent files rather than self-contained problem statements.
Revision tasks reveal whether agents can preserve valid prior logic when requirements change.
Explanation tasks require agents to locate and combine evidence distributed across multiple artifacts.
The benchmark enables systematic study of failure modes that appear only in multi-stage industrial workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent architectures will likely need explicit file-system access and persistent state to reach high performance on these tasks.
The workspace design could be adapted to create analogous full-lifecycle benchmarks in adjacent fields such as software engineering or supply-chain simulation.
Adoption would shift evaluation emphasis from end-to-end generation accuracy toward reliability across repeated interactions with evolving artifacts.

Load-bearing premise

The defined task modes and workspace structure sufficiently capture the characteristics of real industrial OR workflows.

What would settle it

If agent performance rankings and error patterns on OR-Space turn out to be nearly identical to those on existing one-shot formulation benchmarks, the added value of lifecycle-oriented workspace evaluation would be undercut.
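One way to make this test concrete is a rank-correlation check between agent scores on the two kinds of benchmark. The sketch below is only an illustration: the agent names and scores are hypothetical, and Kendall's tau is an assumed choice of statistic that the paper does not prescribe.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) over agents present in both dicts."""
    agents = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        # Positive product: both benchmarks order the pair the same way.
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores, invented for illustration only.
one_shot = {"agent_a": 0.82, "agent_b": 0.75, "agent_c": 0.61, "agent_d": 0.40}
workspace = {"agent_a": 0.35, "agent_b": 0.50, "agent_c": 0.20, "agent_d": 0.30}

print(round(kendall_tau(one_shot, workspace), 3))  # 4 concordant, 2 discordant -> 0.333
```

A tau close to 1 would mean the workspace setting ranks agents much as one-shot benchmarks do, which is the outcome that would undercut the benchmark's added value. Error patterns would still need a separate comparison.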

Original abstract

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OR-Space proposes persistent workspaces and Build/Revise/Explain modes for OR agents, but the match to real industrial workflows rests on an unvalidated design assumption.


The main point is that this paper introduces a benchmark using persistent multi-artifact workspaces and three task modes—Build, Revise, and Explain—to test LLM agents on full-lifecycle optimization work instead of one-shot formulation tasks.

What stands out as new is the explicit setup of executable workspaces containing business documents, data, code, solver outputs, and evaluators, plus the shift to revision under changing requirements and grounded explanation across files. The authors contrast this directly with existing one-shot benchmarks and lay out the evaluation protocol and quality-control pipeline.

This framing is useful for anyone thinking about agent reliability beyond single text outputs. The task definitions give a concrete way to structure tests around ongoing model work and cross-artifact reasoning.

The soft spot is that the central claim depends on the workspace structure and modes reflecting actual industrial OR projects, yet the abstract gives no practitioner validation, case study comparisons, or derivation process to support that match. Without those, the benchmark could end up measuring performance on a stylized version rather than the intended setting. No sample results or checks on evaluator consistency appear either.

This is aimed at researchers building or testing optimization agents for industrial applications. Someone working on multi-stage agent evaluation might extract usable task structures from it.

I would send it to peer review. The idea addresses a real gap, but the design assumptions need external input to determine if they hold.

Referee Report

1 major / 1 minor

Summary. The paper introduces OR-Space, a benchmark for LLM agents performing industrial optimization tasks. It addresses limitations of existing one-shot benchmarks by using persistent multi-artifact workspaces (business documents, structured data, code, solver outputs, evaluators) and three lifecycle task modes: Build (construct solver-ready models from heterogeneous artifacts), Revise (modify models under changing requirements or feedback while preserving prior logic), and Explain (answer grounded questions about solutions and implications using evidence across artifacts). The manuscript describes the benchmark design, evaluation protocol, and quality-control pipeline, positioning it as a tool to study agent reliability in realistic OR workflows.

Significance. If the workspace artifacts and task modes are shown to reflect actual industrial OR characteristics and the quality-control pipeline is demonstrated to produce reliable evaluations, OR-Space could fill an important gap by enabling assessment of multi-stage agent performance beyond single-pass text generation. As presented, the contribution is primarily conceptual, highlighting workflow persistence and lifecycle aspects not captured in prior benchmarks.

major comments (1)

[Abstract] The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.

minor comments (1)

[Abstract] The abstract references a 'quality-control pipeline' without any description of its concrete mechanisms, metrics, or how it mitigates evaluator unreliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on OR-Space. The major comment is addressed point-by-point below, with planned revisions noted.


Referee: [Abstract] The central claim that 'persistent workspaces with lifecycle-oriented tasks' evaluate 'reliable optimization work' in industrial settings rests on the unverified assumption that the listed artifacts (business documents, structured data, code artifacts, solver outputs, evaluators) and the three modes (Build/Revise/Explain) sufficiently capture real OR project characteristics. No practitioner validation, comparison to actual case studies, or derivation process is provided to ground this match.

Authors: We agree that the abstract's phrasing implies a stronger grounding in industrial practice than the manuscript explicitly demonstrates. The artifact types and task modes were selected to reflect recurring elements described in OR literature (e.g., multi-file projects involving data, models, and solver feedback), but the current version provides no dedicated derivation section, practitioner interviews, or direct case-study mapping. We will revise the abstract to state the benchmark's scope more precisely as a tool for studying multi-stage agent behavior rather than claiming comprehensive coverage of all industrial OR characteristics. We will also add a short subsection in the benchmark design section outlining the rationale for the chosen artifacts and modes, drawn from standard workflow descriptions, and explicitly note the absence of external validation as a limitation. These changes will align the claims with the primarily conceptual contribution while preserving the benchmark's intended use for evaluating agent reliability across lifecycle stages.

Revision: yes

Circularity Check

0 steps flagged

Benchmark proposal with no derivation chain or self-referential predictions


The paper is a benchmark proposal that defines workspace artifacts (business documents, data, code, solver outputs, evaluators) and three task modes (Build/Revise/Explain) to evaluate LLM agents on persistent, multi-stage OR workflows. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The description of the benchmark design and protocol does not reduce any claim to its own inputs by construction, nor does it rely on self-citations for load-bearing justification. This is a standard design document whose validity rests on external validation (not present here) rather than internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper proposes a benchmark rather than deriving a result from first principles or fitting parameters; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5779 in / 1078 out tokens · 29501 ms · 2026-06-29T12:49:43.551410+00:00 · methodology


Appendix prompt excerpts

docs/business_requirement.md restates the problem in business voice; every numeric parameter is quoted but NOT duplicated as a raw table.
data/*.csv hold all numeric parameters. Use general_parameters.csv for scalars and table_{k}.csv for indexed data.
src/current_heuristic.py reads CSVs from ../data/, builds a PuLP model, solves with commercial solver backends (e.g., via pulp.GUROBI_CMD or pulp.COPT), and prints the single final line OBJECTIVE_VALUE: <value>.
Helper math (e.g. big-M derivation, index construction) lives in src/utils.py and is imported by current_heuristic.py.
Running cd src && python current_heuristic.py must reproduce the ground-truth objective within 10^-3 relative tolerance.

Return strict JSON: { "docs": {...}, "data": {...}, "src": {...}, "run": { "run.sh": "cd src && python current_heuristic.py" }, "evaluation": { "ground_truth": <float>, "tolerance": 0.01 } }

The 10^-3 condition in P1 is a generation-time ...
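The reproduction rule above (a final OBJECTIVE_VALUE line compared against a ground truth within a relative tolerance) can be sketched as follows. This is a minimal illustration, not the benchmark's evaluator: the function names are hypothetical, and the handling of a zero ground truth is an assumption, since the excerpt does not say how that case is treated.

```python
import re


def parse_objective(stdout):
    """Return the float from the final 'OBJECTIVE_VALUE: <value>' line."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output")
    match = re.fullmatch(r"OBJECTIVE_VALUE:\s*(\S+)", lines[-1])
    if match is None:
        raise ValueError("final line is not an OBJECTIVE_VALUE line")
    return float(match.group(1))


def within_relative_tolerance(value, ground_truth, tolerance):
    # Assumption: fall back to absolute error when the ground truth is zero.
    scale = abs(ground_truth) if ground_truth != 0 else 1.0
    return abs(value - ground_truth) / scale <= tolerance


# Hypothetical solver output, for illustration only.
stdout = "Solving...\nOBJECTIVE_VALUE: 1250.5\n"
value = parse_objective(stdout)
print(within_relative_tolerance(value, 1250.0, 1e-3))  # 0.5 / 1250 = 0.0004 -> True
print(within_relative_tolerance(value, 1250.0, 1e-4))  # False
```

Passing the tolerance as a parameter keeps the same check usable for both the 10^-3 generation-time condition and the 0.01 value in the JSON evaluation field.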
[58] HIT=1 only if the anchor fact is present AND used in the right context; a coincidental number inside an unrelated phrase is NOT a hit.

[59] MISS=0 if the fact is absent, negated, or only vaguely gestured at.

[60] Strict on numeric anchors: the exact number (or an algebraically equivalent expression) must appear.

[61] Lenient on surface form: synonyms / paraphrases / symbolic notation are accepted if meaning is identical.

B.1.3. Evaluation Prompts

P7. Build/Revise-M code evaluator. For the headline benchmark we ask the model-under-test to produce a solver-agnostic PuLP script that exposes build_problem() → pulp.LpProblem (no solve-call). The runner then attaches the configu...
[62] Read data ONLY from ./data/<filename>.

[63] Use pulp, pandas, and the Python standard library only.

[64] Define build_problem() that returns a populated pulp.LpProblem. Do NOT call prob.solve() inside it.

[65] At module top level, define PROBLEM = build_problem().

[66] The runner attaches different solvers; just produce the model.

[67] Return ONLY Python source code. No markdown fences, no commentary, no language tag, no JSON.

P8. Revise-B workspace agent. The Revise-B setting materialises the workspace on disk; the agent sees ./docs/, ./data/, ./src/ and must write a new user_model.py. Compared to P7, the input is the business-voice description (no meta-language) plus the original current_h...
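The P7 output contract (a top-level build_problem(), a top-level PROBLEM assignment, and no solve call) lends itself to a static pre-check before anything is executed. The sketch below is an illustration under stated assumptions, not the paper's runner: `check_submission` is a hypothetical helper, it inspects the source with Python's `ast` module only, and it flags any `.solve()` call anywhere in the file, which is stricter than the rule as worded.

```python
import ast


def check_submission(source: str) -> list[str]:
    """Return contract violations found by static inspection of the source."""
    tree = ast.parse(source)
    problems = []

    # build_problem must be a function defined at module top level.
    top_level_funcs = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    if "build_problem" not in top_level_funcs:
        problems.append("build_problem() is not defined at module top level")

    # PROBLEM must be assigned at module top level.
    has_problem = any(
        isinstance(n, ast.Assign)
        and any(isinstance(t, ast.Name) and t.id == "PROBLEM" for t in n.targets)
        for n in tree.body
    )
    if not has_problem:
        problems.append("PROBLEM is not assigned at module top level")

    # Assumption: treat every attribute call named 'solve' as a violation.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "solve"
        ):
            problems.append(f"line {node.lineno}: call to .solve() found")
    return problems


# Hypothetical submissions; they are parsed, never executed.
GOOD = '''
import pulp

def build_problem():
    prob = pulp.LpProblem("demo", pulp.LpMinimize)
    x = pulp.LpVariable("x", lowBound=0)
    prob += x
    prob += x >= 1
    return prob

PROBLEM = build_problem()
'''
BAD = GOOD + "PROBLEM.solve()\n"

if __name__ == "__main__":
    print(check_submission(GOOD))
    print(check_submission(BAD))
```

Because the check never imports or runs the submission, it does not need PuLP or a solver to be installed, and a script that fails it can be rejected before solver time is spent on it.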
[68] numeric: an exact value (with unit) that must appear.

[69] entity: the correct variable, constraint name, or business term.

[70] causal: the reasoning link (“because ... therefore ...”).

Return strict JSON: { "question": "...", "gold_answer": "...", "rubric_anchors": [ { "type": "numeric", "text": "...", "regex": "..." }, { "type": "entity", "text": "...", "regex": "..." }, { "type": "causal", "text": "...", "regex": null } ] }

The regex is optional and used for a cheap automatic h...
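One way to read the optional regex field is as a cheap screen that runs before any judge model. The sketch below assumes that reading: `precheck_anchors` and its status labels are hypothetical, and the example rubric and answer are invented for illustration. A regex match is reported only as a candidate, since the HIT rule also requires the fact to be used in the right context, and anchors with a null regex are deferred to the judge.

```python
import json
import re


def precheck_anchors(answer: str, rubric_json: str) -> list[dict]:
    """Screen an answer against each rubric anchor using its optional regex."""
    rubric = json.loads(rubric_json)
    results = []
    for anchor in rubric["rubric_anchors"]:
        pattern = anchor.get("regex")
        if pattern is None:
            status = "needs_judge"   # no regex: only a judge can score it
        elif re.search(pattern, answer):
            status = "regex_match"   # candidate hit, context still unverified
        else:
            status = "regex_miss"
        results.append({"type": anchor["type"], "status": status})
    return results


# Invented example data following the JSON shape quoted above.
EXAMPLE_RUBRIC = json.dumps({
    "question": "Why is output capped?",
    "gold_answer": "The capacity constraint binds at 120 units.",
    "rubric_anchors": [
        {"type": "numeric", "text": "120 units", "regex": r"\b120\s*units\b"},
        {"type": "entity", "text": "capacity constraint", "regex": r"capacity constraint"},
        {"type": "causal", "text": "binding constraint limits output", "regex": None},
    ],
})
EXAMPLE_ANSWER = "Output is limited by the capacity constraint, which caps it at 120 units."

if __name__ == "__main__":
    for row in precheck_anchors(EXAMPLE_ANSWER, EXAMPLE_RUBRIC):
        print(row)
```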

[1] Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models, 2024. URL https://arxiv.org/abs/2402.10172

[2] Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, and Carole-Jean Wu. Croissant: A metadata format for ML-ready datasets ... 2024. doi: 10.1145/3650203.3663326

[3] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095. OpenAI; accepted at ICLR 2025

[4] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of PMLR, pages 83...

[5] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718

[6] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017. doi: 10.1137/15M1020575. URL https://doi.org/10.1137/15M1020575

[7] Robert Fourer, David M. Gay, and Brian W. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Duxbury Press, 2nd edition, 2002. URL https://ampl.com/resources/the-ampl-book/

[8] Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal optimizer (COPT) user guide. URL https://arxiv.org/abs/2208.14314.

[10] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, December 2021. doi: 10.1145/3458723. URL https://doi.org/10.1145/3458723

[11] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com

[12] William E. Hart, Jean-Paul Watson, and David L. Woodruff. Pyomo: Modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011. doi: 10.1007/s12532-011-0026-8. URL https://doi.org/10.1007/s12532-011-0026-8

[13] Yiliu He, Tianle Li, Binghao Ji, Zhiyuan Liu, and Di Huang. EvoOpt-LLM: Evolving industrial optimization models with large language models, 2026. URL https://arxiv.org/abs/2602.01082

[14] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling. Operations Research, 73(6):2986–3009, November 2025. ISSN 0030-364X. doi: 10.1287/opre.2024.1233. URL https://pubsonline.informs.org/doi/10.1287/opre.2024.1233

[15] Chenyu Huang, Jianghao Lin, Zhengyang Tang, Bo Jiang, Ruoqing Jiang, Benyou Wang, and Lai Wei. InvEvolve: Evolving white-box inventory policies via large language models with performance guarantees. URL https://arxiv.org/abs/2605.00369

[17] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. LLMs for mathematical modeling: Towards bridging the gap between natural and mathematical languages, 2025. URL https://arxiv.org/abs/2405.13144. Findings of NAACL 2025

[18] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024. URL http://arxiv.org/abs/2310.06770. arXiv:2310.06770 [cs]

[19] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models, 2023. URL https://arxiv.org/abs/2310.08491. ICLR 2024

[20] Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, and Ishai Menache. Large language models for supply chain optimization, 2023. URL https://arxiv.org/abs/2307.03875

[21] Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, and Zaiwen Wen. Constructing industrial-scale optimization modeling benchmark, 2026. URL https://arxiv.org/abs/2602.10450

[22] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu... Holistic Evaluation of Language Models. arXiv, 2023

[23] Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, and Dongdong Ge. From soliloquy to agora: Memory-enhanced LLM agents with decentralized debate for optimization modeling. URL https://arxiv.org/abs/2604.25847. Working paper

[25] Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang. Position: The real barrier to LLM agent usability is agentic ROI, 2025. URL https://arxiv.org/abs/2505.17767.

[26] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2025. URL http://arxiv.org/abs/2308.03688. ar...

[27] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, December 2023. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153/

[28] Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen. OptMATH: A scalable bidirectional data synthesis framework for optimization modeling, 2025. URL https://arxiv.org/abs/2502.11102

[29] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/abs/2311.12983. arXiv:2311.12983; accepted at ICLR 2024

[30] Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, and Yong Zhang. Evaluating LLM reasoning in the operations research domain with ORQA, 2025. URL https://arxiv.org/abs/2412.17874. AAAI 2025

[31] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and transparent dataset documentation for responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), pages 1776–1826, 2022. doi: 10.1145/3531146.3533231. URL https://doi.org/10.1145/3531146.3533231

[32] Rindranirina Ramamonjison, Timothy T. Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. NL4Opt competition: Formulating optimization problems based on their natural language descriptions, 2023. URL https://arxiv.org/abs/2303.08233

[33] Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators, 2024. URL https://arxiv.org/abs/2405.01724

[34] Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges, 2024. URL https://arxiv.org/abs/2406.12624

[35] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024. URL https://arxiv.org/abs/2407.18901. ACL 2024

[36] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023. URL https://arxiv.org/abs/2305.17926

[37] Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, and Gang Chen. Chain-of-Experts: When LLMs meet complex operations research problems. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=HobyL1B9CZ. Introduces ...

[38] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/240...

[39] Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URL htt...

[40] Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, and Weinan Zhang. A survey of AI agent protocols, 2025. URL https://arxiv.org/abs/2504.16736

[41] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045. arXiv:2406.12045

[42] Haoran Yin, Chenyu Zhou, Wei Zhu, and Yuhua Jin. MemDecoder: Enhancing test-time compute for LLM agents via reinforced memory decoding. In The Forty-Third International Conference on Machine Learning, 2026. URL https://icml.cc/virtual/2026/poster/65523

[43] Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, and Chau Yuen. OR-LLM-Agent: Automating modeling and solving of operations research optimization problems with reasoning LLM, 2025. URL https://arxiv.org/abs/2503.10009

[44] Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, and Sirui Li. OptiMind: Teaching LLMs to think like optimization experts. URL https://arxiv.org/abs/2509.22979

[46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.org...

[47] Chenyu Zhou, Tianyi Xu, Jianghao Lin, and Dongdong Ge. StepORLM: A self-evolving framework with generative process supervision for operations research language models, 2025. URL https://arxiv.org/abs/2509.22558

[48] Chenyu Zhou, Jingyuan Yang, Linwei Xin, Yitian Chen, Ziyan He, and Dongdong Ge. Auto-formulating dynamic programming problems with large language models, 2025. URL https://arxiv.org/abs/2507.11737

[49] Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, Zeyu Zheng, Zhuosheng Zhang, Xingyu Lou, Changwang Zhang, Zhihui Fu, Jun Wang, Weiwen Liu, Jianghao Lin, and Weinan Zhang. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. URL https://arxiv.org/abs/2604.08224
